Improve Twinkle service stability by Yunnglin · Pull Request #230 · modelscope/twinkle

Yunnglin · 2026-06-20T14:23:08Z

Summary

Fix completion_mask padding so legal inputs no longer raise KeyError.
Add a FastAPI deployment-level exception boundary that returns 500 with traceback while keeping the server alive.
Preserve full traceback in task failure payloads.
Replace the tail -f liveness mechanism in megatron/run.sh with a singleton watchdog/supervisor that restarts the server and Ray locally.
Simplify online startup by setting default MODELSCOPE_CACHE and TWINKLE_WORK_DIR inside run.sh.
Add explicit restart mode via ./run.sh --restart or TWINKLE_RUN_EXISTING_ACTION=restart.
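
To make the traceback-preservation item above concrete, here is a minimal sketch of a failure payload that keeps the full stack. The helper name `build_failure_payload`, the `run_task` wrapper, and the payload keys are hypothetical and chosen for illustration; they are not the project's actual `task_error_payload` signature or schema.

```python
import traceback


def build_failure_payload(exc: BaseException) -> dict:
    """Hypothetical helper: turn an exception into a JSON-serializable failure record."""
    return {
        "status": "failed",
        "error_type": type(exc).__name__,
        "error": str(exc),
        # format_exception walks exc.__traceback__, so every frame is kept
        # instead of only the final message.
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


def run_task(fn):
    """Hypothetical task wrapper: return a result record or a failure record."""
    try:
        return {"status": "completed", "result": fn()}
    except Exception as exc:
        return build_failure_payload(exc)


def broken_task():
    # Simulates a task that fails on a missing key.
    return {}["completion_mask"]


payload = run_task(broken_task)
```

Storing the formatted traceback as a single string keeps the payload serializable while still showing which frame raised.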

Root Cause

completion_mask was a valid input field but was missing from InputProcessor.padding_map, so ordinary padding could fail with KeyError. Separately, run.sh stayed alive only through tail -f run.log, so a server or Ray failure could leave the script running and apparently alive while the service itself was down.
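
The padding failure can be reproduced with a small stand-in. This is a sketch under stated assumptions: `PADDING_MAP`, `pad_batch`, and the pad values (including 0 for `completion_mask`) are hypothetical and do not reproduce the real `InputProcessor` code.

```python
from typing import Dict, List

# Hypothetical pad values; the real InputProcessor.padding_map may differ.
PADDING_MAP: Dict[str, int] = {
    "input_ids": 0,
    "attention_mask": 0,
    "labels": -100,
}


def pad_batch(
    batch: List[Dict[str, List[int]]], padding_map: Dict[str, int]
) -> Dict[str, List[List[int]]]:
    """Right-pad every field to the longest sequence in the batch."""
    padded = {}
    for key in batch[0]:
        pad_value = padding_map[key]  # KeyError if the field is not registered
        max_len = max(len(row[key]) for row in batch)
        padded[key] = [
            row[key] + [pad_value] * (max_len - len(row[key])) for row in batch
        ]
    return padded


batch = [
    {"input_ids": [1, 2, 3], "completion_mask": [0, 1, 1]},
    {"input_ids": [4], "completion_mask": [1]},
]

# Registering the field (assumed pad value 0) removes the failure.
FIXED_PADDING_MAP = dict(PADDING_MAP, completion_mask=0)
```

With `PADDING_MAP` the lookup for `completion_mask` raises `KeyError`; with `FIXED_PADDING_MAP` the same batch pads normally.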

Startup / Restart

Default startup can now be shortened to:

bash /twinkle/cookbook/client/server/megatron/run.sh

For a code update that should actively restart an already-running service:

bash /twinkle/cookbook/client/server/megatron/run.sh --restart

or:

TWINKLE_RUN_EXISTING_ACTION=restart bash /twinkle/cookbook/client/server/megatron/run.sh

Default duplicate execution remains safe: without restart mode, a second run.sh exits instead of disrupting the live service.
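
The single-instance and restart behavior described above can be pictured with a short Python sketch. The `--restart` flag and `TWINKLE_RUN_EXISTING_ACTION=restart` come from this PR; everything else (the function names, the use of `fcntl.flock`, the returned strings) is an assumption for illustration and may not match how run.sh implements its lock.

```python
import fcntl
import os
import tempfile


def try_acquire_lock(path):
    """Return an open file holding an exclusive lock, or None if it is already held."""
    handle = open(path, "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def existing_action(argv, env):
    """Decide what a second launch does: restart only when explicitly requested."""
    if "--restart" in argv or env.get("TWINKLE_RUN_EXISTING_ACTION") == "restart":
        return "restart"
    return "exit"


def launch(lock_path, argv, env):
    """Hypothetical launcher: start if free, otherwise report the chosen action."""
    lock = try_acquire_lock(lock_path)
    if lock is not None:
        return "started", lock
    return existing_action(argv, env), None


# Demo: the first launch takes the lock; later launches are told what to do.
lock_path = os.path.join(tempfile.mkdtemp(), "run.lock")
first_state, first_lock = launch(lock_path, [], {})
second_state, _ = launch(lock_path, [], {})
restart_state, _ = launch(lock_path, ["--restart"], {})
```

Keeping "exit" as the default means an accidental second invocation cannot disturb the live service; a restart has to be asked for.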

Validation

bash -n cookbook/client/server/megatron/run.sh
bash cookbook/client/server/megatron/run.sh --help
git diff --check
/Users/yunlin/miniconda3/envs/twinkle/bin/python -m pre_commit run --files ...
/Users/yunlin/miniconda3/envs/twinkle/bin/python -m pytest tests/processor/test_processor.py tests/server/utils/test_task_errors.py tests/server/utils/test_task_queue_mixin.py tests/server/test_deployment_exception_boundary.py tests/server/test_app_builders_characterization.py tests/server/model -q
Local Ray Serve destructive smoke test: a completion_mask payload succeeded; repeated exception-raising requests returned 500 with a traceback; the same replica remained healthy and RUNNING.

Notes

The full megatron run.sh startup was not executed locally because it can stop local Ray/vLLM/Redis/LGTM processes and consume GPU resources. The first online rollout should set bounded-restart env vars before enabling unlimited supervisor restarts.
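
As a rough illustration of what a bounded restart policy means, the sketch below restarts a child process only a fixed number of times. The function `supervise` and its `max_restarts` parameter are hypothetical; the actual env var names and the health-check logic in run.sh and entrypoint.sh are not shown in this PR description and are not reproduced here.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts, backoff_seconds=0.0):
    """Run cmd, restarting it after a non-zero exit at most max_restarts times.

    Returns (last_exit_code, restarts_used).
    """
    restarts = 0
    while True:
        code = subprocess.run(cmd).returncode
        if code == 0:
            return code, restarts  # clean exit, nothing to restart
        if restarts >= max_restarts:
            return code, restarts  # budget exhausted, give up loudly
        restarts += 1
        time.sleep(backoff_seconds)


# Demo commands: one child that always fails, one that always succeeds.
FAILING = [sys.executable, "-c", "raise SystemExit(3)"]
HEALTHY = [sys.executable, "-c", "pass"]
```

A bound like this turns a crash loop into a visible failure after a few attempts, which is safer for a first rollout than restarting indefinitely.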

gemini-code-assist

Code Review

This pull request introduces a watchdog supervisor script for Megatron, adds a completion mask to the input processor, and implements unhandled-exception middleware in the FastAPI server alongside task queue error formatting. The review feedback highlights several critical issues:

The exception-handling middleware is registered outermost, which interferes with telemetry and header injection.
Returning raw tracebacks to clients poses a security risk.
The hardcoded lock file path in /tmp can cause permission conflicts in multi-user environments.
Spawning Python for health checks and date for time tracking in the shell script is highly inefficient.
A task failure helper is bypassed in the worker, causing code duplication.


Copilot

Pull request overview

This PR improves Twinkle’s service stability by (1) fixing padding for a valid input field (completion_mask), (2) ensuring unhandled FastAPI route exceptions return a structured 500 response (with traceback) rather than destabilizing the replica, and (3) making the Megatron service startup more resilient via a container entrypoint watchdog and a single-instance/restart mechanism in run.sh.

Changes:

Add completion_mask to InputProcessor.padding_map and cover it with a unit test to prevent padding-time KeyError.
Add a deployment-level exception boundary middleware that converts unhandled route exceptions into 500 JSON responses containing the traceback.
Improve operational robustness of Megatron service startup by introducing entrypoint.sh supervision and enhancing run.sh with single-instance + restart-request handling.
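
The exception boundary in the list above can be sketched as a pure-ASGI middleware using only the standard library. This is an illustration of the technique, not the code in `src/twinkle/server/deployment.py`: the class name `ExceptionBoundary`, the JSON keys, and the `call` test driver are all hypothetical.

```python
import asyncio
import json
import traceback


class ExceptionBoundary:
    """Pure-ASGI middleware: turn unhandled exceptions into a 500 JSON response."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        started = False

        async def tracking_send(message):
            nonlocal started
            if message["type"] == "http.response.start":
                started = True
            await send(message)

        try:
            await self.app(scope, receive, tracking_send)
        except Exception:
            if started:
                raise  # headers already sent; a clean 500 is no longer possible
            body = json.dumps(
                {"detail": "Internal Server Error", "traceback": traceback.format_exc()}
            ).encode()
            headers = [
                (b"content-type", b"application/json"),
                (b"content-length", str(len(body)).encode()),
            ]
            await send({"type": "http.response.start", "status": 500, "headers": headers})
            await send({"type": "http.response.body", "body": body})


async def failing_app(scope, receive, send):
    # Stand-in for a route that raises before sending any response.
    raise KeyError("completion_mask")


async def call(app):
    """Drive an ASGI app once and collect the messages it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": "GET", "path": "/"}, receive, send)
    return sent


messages = asyncio.run(call(ExceptionBoundary(failing_app)))
```

Because the exception is handled inside the middleware, the process keeps serving later requests. Whether the traceback should be returned to clients at all is a separate question that the Gemini review raises as a security concern.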

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

File	Description
`src/twinkle/processor/base.py`	Adds `completion_mask` to padding defaults to prevent padding failures.
`tests/processor/test_processor.py`	Adds regression coverage for `completion_mask` padding; adjusts multimodal assertion.
`src/twinkle/server/deployment.py`	Adds an unhandled-exception middleware boundary returning `500` with traceback.
`tests/server/test_deployment_exception_boundary.py`	Validates the exception boundary returns tracebacks and continues serving; checks replica header behavior.
`tests/server/test_app_builders_characterization.py`	Updates expected middleware ordering to include the new exception boundary.
`src/twinkle/server/utils/task_errors.py`	Introduces a shared helper for standard task failure payloads.
`src/twinkle/server/utils/task_queue/worker.py`	Uses `task_error_payload` for failed task persistence.
`src/twinkle/server/utils/task_queue/mixin.py`	Uses `task_error_payload` for background-task failure persistence.
`tests/server/utils/test_task_errors.py`	Adds tests ensuring task failure payloads preserve traceback strings.
`cookbook/client/server/megatron/run.sh`	Replaces tail-based liveness with a supervised foreground wait loop and adds restart/single-instance mechanics.
`cookbook/client/server/megatron/entrypoint.sh`	Adds a container entrypoint watchdog that restarts `run.sh` based on Ray + HTTP health checks.

Improve Twinkle service stability

b48446b

gemini-code-assist Bot reviewed Jun 20, 2026

Yunnglin added 9 commits June 20, 2026 22:29

Add run.sh restart mode

24c11b4

Harden megatron run lifecycle

a2c2880

Clarify megatron run startup requirements

6999000

Keep supervisor alive during run restart

ff493f9

Add megatron container entrypoint supervisor

3f6a9b4

Simplify megatron container entrypoint

5a81d33

Move megatron health watchdog to entrypoint

4d9fe75

Address PR review feedback

92793e1

Sync Megatron launcher scripts

5cfa97f

Yunnglin marked this pull request as ready for review June 22, 2026 03:20

Copilot AI review requested due to automatic review settings June 22, 2026 03:20

Copilot started reviewing on behalf of Yunnglin June 22, 2026 03:20

Copilot AI reviewed Jun 22, 2026

Comment thread cookbook/client/server/megatron/run.sh

Comment thread cookbook/client/server/megatron/run.sh

Comment thread src/twinkle/server/deployment.py

Address service stability review comments

1bca58f

tastelikefeet approved these changes Jun 22, 2026

Yunnglin merged commit ca5cb73 into main Jun 22, 2026
2 of 4 checks passed

Yunnglin merged 11 commits into main from fix/service-stability

3 participants