Skip to content

fix: remove overly restrictive Ray log rotation config#223

Open
Yunnglin wants to merge 1 commit into
mainfrom
fix/server-config-log-rotation
Open

fix: remove overly restrictive Ray log rotation config#223
Yunnglin wants to merge 1 commit into
mainfrom
fix/server-config-log-rotation

Conversation

@Yunnglin

Copy link
Copy Markdown
Collaborator

Summary

  • Remove RAY_ROTATION_MAX_BYTES=1024 and RAY_ROTATION_BACKUP_COUNT=1 from the Megatron server startup script
  • The 1KB log rotation limit was causing worker tracebacks to be truncated in production, making it impossible to diagnose the root cause of NCCL timeout crashes
  • Ray defaults (50MB max, 5 backups) are appropriate for production workloads with long-running processes

Background

During investigation of a production outage (NCCL collective timeout → SIGABRT on all 4 worker ranks), the only evidence of the original Python exception in forward_backward was a fragmented stack trace in the aggregated run.log. The individual worker-*.err files were empty or contained only the final NCCL abort message because 1KB rotation was discarding the full traceback before it could be captured.

Test plan

  • Deploy with the updated script and verify worker log files contain complete tracebacks under error conditions
  • Confirm Ray log directory disk usage remains within acceptable bounds with default rotation settings

RAY_ROTATION_MAX_BYTES=1024 was causing worker tracebacks to be truncated
in production, making it impossible to diagnose the NCCL timeout root cause.
Ray defaults (50MB, 5 backups) are appropriate for production workloads.
Copilot AI review requested due to automatic review settings June 12, 2026 03:50

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the Ray log rotation environment variables RAY_ROTATION_MAX_BYTES and RAY_ROTATION_BACKUP_COUNT from the Megatron server run script (cookbook/client/server/megatron/run.sh). I have no feedback to provide as there are no review comments and the changes are straightforward.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR removes an overly restrictive Ray log rotation configuration from the Megatron server startup script so Ray’s default rotation can retain complete worker tracebacks (improving diagnosability of production failures like NCCL timeouts).

Changes:

  • Removes RAY_ROTATION_MAX_BYTES=1024 and RAY_ROTATION_BACKUP_COUNT=1 exports from the Megatron run.sh startup script.
  • Relies on Ray’s default log rotation settings to avoid truncating/losing important error context.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants