fix: remove overly restrictive Ray log rotation config #223
RAY_ROTATION_MAX_BYTES=1024 was causing worker tracebacks to be truncated in production, making it impossible to diagnose the NCCL timeout root cause. Ray defaults (50MB, 5 backups) are appropriate for production workloads.
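To illustrate how a 1 KB cap with a single backup can lose the start of a traceback, here is a minimal sketch that uses Python's standard `logging.handlers.RotatingFileHandler` as a stand-in. It assumes the affected Ray log files rotate in a comparable way, which this PR does not confirm. The file name `worker.err`, the helper `retained_text`, and the synthetic 200-frame traceback are all hypothetical.

```python
import logging
import logging.handlers
import os
import tempfile


def retained_text(max_bytes, backup_count, lines):
    """Write lines through a RotatingFileHandler and return what survives on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "worker.err")  # hypothetical file name
        handler = logging.handlers.RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=backup_count
        )
        # Unique logger name so repeated calls do not share handlers.
        logger = logging.getLogger(f"rotation-demo-{max_bytes}-{backup_count}")
        logger.setLevel(logging.INFO)
        logger.propagate = False
        logger.addHandler(handler)
        for line in lines:
            logger.info(line)
        logger.removeHandler(handler)
        handler.close()

        # Read the backups oldest first, then the live file.
        names = [f"{path}.{i}" for i in range(backup_count, 0, -1)] + [path]
        kept = ""
        for name in names:
            if os.path.exists(name):
                with open(name) as f:
                    kept += f.read()
        return kept


# Synthetic traceback: a header, 200 frames, and the final exception line.
header = "Traceback (most recent call last):"
frames = [f'  File "train.py", line {i}, in forward_backward' for i in range(200)]
lines = [header] + frames + ["RuntimeError: original failure"]

small = retained_text(1024, 1, lines)             # the removed settings
large = retained_text(50 * 1024 * 1024, 5, lines)  # the values quoted in this PR

print("header kept with 1 KB cap:", header in small)
print("header kept with 50 MB cap:", header in large)
print("bytes kept with 1 KB cap:", len(small))
```

In this sketch the 1 KB configuration keeps less than 2 KB in total (the live file plus one backup), so only the tail of the output survives, which is consistent with the `worker-*.err` files containing nothing but the final abort message.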
Code Review
This pull request removes the Ray log rotation environment variables RAY_ROTATION_MAX_BYTES and RAY_ROTATION_BACKUP_COUNT from the Megatron server run script (cookbook/client/server/megatron/run.sh). I have no feedback to provide as there are no review comments and the changes are straightforward.
Pull request overview
This PR removes an overly restrictive Ray log rotation configuration from the Megatron server startup script so Ray’s default rotation can retain complete worker tracebacks (improving diagnosability of production failures like NCCL timeouts).
Changes:
- Removes RAY_ROTATION_MAX_BYTES=1024 and RAY_ROTATION_BACKUP_COUNT=1 exports from the Megatron run.sh startup script.
- Relies on Ray’s default log rotation settings to avoid truncating/losing important error context.
Summary
- Remove RAY_ROTATION_MAX_BYTES=1024 and RAY_ROTATION_BACKUP_COUNT=1 from the Megatron server startup script
Background
During investigation of a production outage (NCCL collective timeout → SIGABRT on all 4 worker ranks), the only evidence of the original Python exception in forward_backward was a fragmented stack trace in the aggregated run.log. The individual worker-*.err files were empty or contained only the final NCCL abort message because 1KB rotation was discarding the full traceback before it could be captured.
Test plan