Add a check that the checkpoint directory can be written to.#3317
Add a check that the checkpoint directory can be written to.#3317igorts-git wants to merge 1 commit intomainfrom
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
c6a7ef6 to
1125324
Compare
|
🤖 Hi @igorts-git, I've received your request, and I'm working on it now! You can track my progress in the logs for more details. |
There was a problem hiding this comment.
This Pull Request successfully introduces a validation step to ensure the GCS checkpoint directory exists and is writable, which effectively addresses the issue of training pods hanging. The core logic handles the GCS paths and permission verification correctly and includes solid unit test coverage.
🔍 General Feedback
- The approach to write a temporary file for permission checking is robust.
- Consider addressing the minor edge case with
p.partsfor rootgs://paths to prevent unhandled exceptions. - The unit tests are comprehensive, but replacing the hardcoded system path (
/sys) with a mocked read-only directory will make them more portable and reliable across different test environments.
d8c8418 to
643b632
Compare
richjames0
left a comment
There was a problem hiding this comment.
Thanks! I have a small request - while we're here could we not use p as a variable name? Aside from feeling hacky p is a command in pdb so this adds friction to debugging :)
643b632 to
1ab1578
Compare
Done! |
|
Do we need to update the docs/faq about this error and link to steps to be done to fix this error when it happens ? Can we do quick check and fail fast with maxtext launch, as it will eventually fail and will need to be fixed and tried again? |
Description
Fix training pod hanging when the
base_output_directoryis set to a bucket that does not exist or not-writable.Now it will raise an exception.
We use epath.Path.mkdir from etils to create the output dir. Unfortunately, that library sometimes hangs if we call
epath.Path.mkdir("gs://no_such_bucket/some/dir", exist_ok=True, parents=True). In other instances it simply returns with no error message.In this PR we are adding a helper function
gcs_utils.mkdir_and_check_permissionsthat validates that the bucket exists and is writable before constructing aCheckpointManagerobject to avoid hanging.If the change fixes a bug or a Github issue, please include a link, e.g.,:
FIXES: b/476521717 (Maxtext container doesn't fail or provide enough logs when its not able to write to base_output_directory bucket)
Note: I am not quite sure if checking for writability of the checkpoint directory is always the right thing. Maybe in the inference-only scenario it is not appropriate. I would love to hear reviewer's opinion about this.
Tests
Added a unit test.
Ran train.py with an incorrect GS bucket: http://screen/A8kJj8xgFLxEXWb
Ran train.py with a correct GS bucket: http://screen/73657hoP2e2vLhW
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-reviewlabel.