Megatron-Bridge r0.3.0 enhancement by juntaowww · Pull Request #830 · NVIDIA/cloudai

juntaowww · 2026-03-11T09:08:57Z

Summary

This PR introduces several enhancements to the CloudAI Megatron Bridge workload and Slurm infrastructure:

Node placement propagation to Megatron Bridge: The resolved node list from CloudAI's scenario config is now forwarded to Megatron Bridge's setup_experiment.py via --additional_slurm_params nodelist=.... This ensures that when users specify explicit partition:group:count node specs in the scenario, the exact placement is correctly defined inside the inner Slurm job. For example, in a scenario file:
```
[[Tests]]
id = "test_id"
num_nodes = "8"
exclude_nodes = ["node01"]
nodes = [
  "partition:group1:4",
  "partition:group2:4",
]
```
CloudAI resolves these partition:group specs into concrete node names, excludes node01, and then selects count many nodes out of the pool, and passes the final list as --additional_slurm_params 'nodelist=node02,node03,...' so Megatron-Bridge schedules on exactly those nodes.
GPU resource requesting: gpus_per_node is now forwarded as --additional_slurm_params gpus-per-node=N alongside the nodelist. When both are present, they are semicolon-separated in a single argument per Megatron-Bridge's expected format. For example, setting gpus_per_node = 4 with a nodelist produces:
```
--additional_slurm_params 'gpus-per-node=4;nodelist=node02,node03'
```
This ensures the GPU resources are always requested, as some Slurm systems may not automatically allocate them.
Git submodule initialization: If init_submodules = True in GitRepo, then both SlurmInstaller and KubernetesInstaller now run git submodule update --init --recursive after cloning git repos. This is required for repos like Megatron-Bridge that bundle Megatron-LM as a submodule. Failures clean up the cloned directory.

Example:
```
[[git_repos]]
url = "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git"
commit = "r0.3.0"
init_submodules = true
```
Optional container mount for Megatron Bridge: When mount_as is set on the git repo entry (e.g. mount_as = "/opt/Megatron-Bridge"), the locally cloned repo is bind-mounted into the container via -cm /path/to/clone:/opt/Megatron-Bridge, overriding the image's built-in copy. Omitting mount_as uses the container's built-in copy as-is. Example test config:
```
[[git_repos]]
url = "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git"
commit = "r0.3.0"
init_submodules = true
mount_as = "/opt/Megatron-Bridge"   # remove this line to use the container's copy
```
Qwen recipe update: Updated the GB200 Qwen 30B recipe to use tag r0.3.0, container nvcr.io#nvidia/nemo:26.02, added PYTHONPATH for the bundled Megatron-LM submodule.

Test Plan

Corresponding unit tests + manual runs

Additional Notes

Submodule init is essential for the Qwen recipe as for Megatron-Bridge r0.3.0, which sets PYTHONPATH = "/opt/Megatron-Bridge/3rdparty/Megatron-LM:${PYTHONPATH}" to pick up the bundled Megatron-LM.

coderabbitai · 2026-03-11T09:09:49Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds per-test exclude_nodes support across SLURM node resolution and command generation; introduces optional git submodule initialization in installers with cleanup on failure; adjusts Megatron-Bridge repo mounting and launcher Slurm params; updates configs and adds tests for these behaviors.

Changes

Cohort / File(s)	Summary
Configuration updates `conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml`, `conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml`, `conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml`, `conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml`, `conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml`	Bumped git commit and container image tags; added `init_submodules = true` to git repo stanzas and added `extra_env_vars.PYTHONPATH` entries; small header year update.
Test run model & parser `src/cloudai/_core/test_scenario.py`, `src/cloudai/models/scenario.py`, `src/cloudai/test_scenario_parser.py`	Added `exclude_nodes` field to TestRun/TestRunModel and propagate it from parsed models into created TestRun instances.
SLURM core system `src/cloudai/systems/slurm/slurm_system.py`	Added `exclude_nodes` parameter across node-resolution APIs; filter excluded hostnames during grouping, parsing, and allocation; include RESERVED in available-state handling and raise error if exclusions yield empty resolved node lists.
SLURM command generation `src/cloudai/systems/slurm/slurm_command_gen_strategy.py`, `src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`	Emit `#SBATCH --exclude` when appropriate; extend node-spec cache to include excluded nodes; parse extra srun args into Slurm params; add `--additional_slurm_params` with gpus-per-node, nodelist/exclude, and parsed srun args; conditional container bind-mount of Megatron-Bridge when `mount_as` set.
Installers: git + submodules `src/cloudai/_core/installables.py`, `src/cloudai/systems/slurm/slurm_installer.py`, `src/cloudai/systems/kubernetes/kubernetes_installer.py`	Added `init_submodules: bool` to `GitRepo`; introduced `_init_submodules(path: Path) -> InstallStatusResult` and invoke it after checkout when requested; on failure log, remove cloned repo, and return error.
Megatron-Bridge selection & ref data `src/cloudai/workloads/megatron_bridge/megatron_bridge.py`, `tests/ref_data/megatron-bridge.sbatch`	Stop forcing `mount_as` override when selecting repo (keep repo's mount_as); updated sbatch reference to include conditional Megatron-Bridge bind-mount and additional Slurm params.
Tests added/updated `tests/systems/slurm/test_allocation.py`, `tests/systems/slurm/test_system.py`, `tests/systems/slurm/test_command_gen_strategy.py`, `tests/test_git_repo_installer.py`, `tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`	Added tests covering node exclusion (single/multiple), RESERVED-state allocation, parse/empty-spec error cases, `--exclude` and `--additional_slurm_params` generation, mount_as behavior, Slurm srun-args parsing, and git submodule init success/failure and cleanup.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibbled nodes and left a few behind,
I fetched curled submodules, tidy and kind.
Mounts snug like carrots, mapped just right,
Slurm learns new params and jobs take flight.
Hoppity tests pass under moonlit night.

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Megatron-Bridge r0.3.0 enhancement' accurately summarizes the primary focus of the changeset—updating Megatron-Bridge to r0.3.0 and adding related enhancements.
Description check	✅ Passed	The PR description is comprehensive and directly related to the changeset, detailing node placement propagation, GPU resource requesting, git submodule initialization, container mount functionality, and recipe updates.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📝 Coding Plan

Generate coding plan for human review comments

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

You can enable review details to help with troubleshooting, context usage and more.

Enable the reviews.review_details setting to include review details such as the model used, the time taken for each step and more in the review comments.

greptile-apps · 2026-03-11T09:14:15Z

Greptile Summary

This PR enhances the Megatron-Bridge r0.3.0 integration in CloudAI by propagating node placement (nodelist=) and GPU resources (gpus-per-node=) to Megatron-Bridge's inner Slurm job via --additional_slurm_params, adding optional git submodule update --init --recursive support to both SlurmInstaller and KubernetesInstaller, introducing optional mount_as-driven container bind-mounting for the git repo, and updating the GB200/B200/GB300/H100 Qwen-30B recipes to the r0.3.0 tag and nvcr.io#nvidia/nemo:26.02 image.

Key changes:

GitRepo gains init_submodules and mount_as fields; all existing configs default to False/None so there are no breaking changes.
_parse_srun_args_as_slurm_params now correctly handles standalone boolean flags (e.g. --exclusive), fixing a silent-discard bug.
get_nodes_by_spec error message is improved to avoid falsely implying exclusions caused an empty node list.
Logic gap in both installers: the "repo already exists" early-return path in _install_one_git_repo does not call _init_submodules, even when item.init_submodules = True. Users who previously installed Megatron-Bridge__r0.3.0 without the flag and now re-run cloudai install will receive a successful result while 3rdparty/Megatron-LM remains an unpopulated directory, causing a runtime ModuleNotFoundError rather than a clear installer error.
gpus_per_node is appended to additional_slurm_params unconditionally when truthy, bypassing the fields_set guard that all other launcher flags respect.

Confidence Score: 3/5

Safe to merge for fresh installs; existing installs with the same commit string that now need init_submodules will silently fail at runtime.
The new features work correctly for first-time installs and the test coverage is solid. The submodule-init gap in the "already exists" code path of both installers is a logic bug that can cause silent, hard-to-diagnose runtime failures when users upgrade an existing configuration. This lowers confidence until that path is fixed.
src/cloudai/systems/slurm/slurm_installer.py and src/cloudai/systems/kubernetes/kubernetes_installer.py — specifically the early-return block in _install_one_git_repo that skips _init_submodules for existing repos.

Important Files Changed

Filename	Overview
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py	Core strategy refactored to propagate resolved `nodelist` and `gpus-per-node` to Megatron-Bridge via `--additional_slurm_params`; boolean-flag parsing, `exclude_nodes` fallback, and `mount_as` container-mount logic are all correctly implemented.
src/cloudai/systems/slurm/slurm_installer.py	Adds `_init_submodules` and calls it during fresh clone, but the "repo already exists" early-return path skips submodule init even when `init_submodules = true`, causing silent failures for users upgrading an existing install.
src/cloudai/systems/kubernetes/kubernetes_installer.py	Same `_init_submodules` integration as `SlurmInstaller`; inherits the same gap where an already-existing repo silently skips submodule initialization on subsequent `cloudai install` runs.
src/cloudai/_core/installables.py	Cleanly adds `init_submodules: bool = False` and `mount_as: Optional[str] = None` fields to `GitRepo`; both default to backward-compatible values.
src/cloudai/systems/slurm/slurm_system.py	Error message in `get_nodes_by_spec` improved to avoid misleading "after applying exclude_nodes=None" wording; logic is unchanged and correct.
tests/test_git_repo_installer.py	New parametrized tests cover both Slurm and Kubernetes installers for submodule init, cleanup-on-failure, and skip-when-not-requested scenarios; good coverage.
tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py	Comprehensive tests for new features: `gpus_per_node` via `additional_slurm_params`, boolean-flag parsing, `mount_as`, srun-args forwarding; all tests appear correct.
conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml	Updated commit tag `v0.3.0` → `r0.3.0`, container image `26.02.00` → `26.02`, added `init_submodules = true` and `PYTHONPATH` env var; changes align with PR description.

Sequence Diagram

sequenceDiagram
    participant User as CloudAI User
    participant Installer as SlurmInstaller/K8sInstaller
    participant GitRepo as Git Repository
    participant MBridge as MegatronBridgeCommandGen
    participant SetupExp as setup_experiment.py
    participant Slurm as Slurm Scheduler

    User->>Installer: cloudai install
    Installer->>GitRepo: git clone + checkout r0.3.0
    alt init_submodules=true and fresh clone
        Installer->>GitRepo: git submodule update --init --recursive
    end
    Installer-->>User: install complete

    User->>MBridge: cloudai run
    MBridge->>MBridge: get_cached_nodes_spec()
    note over MBridge: resolves partition:group:count specs<br/>and applies exclude_nodes
    alt explicit nodes resolved
        MBridge->>MBridge: append nodelist=node1,node2 to params
    else only exclude_nodes set
        MBridge->>MBridge: append exclude=node1 to params
    end
    MBridge->>MBridge: append gpus-per-node=N to params
    MBridge->>SetupExp: python setup_experiment.py -gn N -cm /clone:/opt/Megatron-Bridge --additional_slurm_params params
    SetupExp->>Slurm: sbatch inner training job with constraints
    Slurm-->>MBridge: Submitted batch job ID
    MBridge-->>User: job tracked by CloudAI

Comments Outside Diff (2)

src/cloudai/systems/slurm/slurm_installer.py, line 208-215 (link)

Submodule init silently skipped for already-existing repos

When a repo already exists at the expected path and its commit is verified, the method returns early — without ever checking item.init_submodules. This means a user who:
1. Previously cloned Megatron-Bridge__r0.3.0 without init_submodules = true
2. Subsequently adds init_submodules = true to the config and re-runs cloudai install
...will receive a successful install result while 3rdparty/Megatron-LM remains an empty directory, leading to a confusing ModuleNotFoundError at runtime rather than an actionable installer error.

The same applies to kubernetes_installer.py lines 153–162.

A minimal fix would add an _init_submodules call in the "already exists" fast-path when the flag is set:
```
if repo_path.exists():
    verify_res = self._verify_commit(item.commit, repo_path)
    if not verify_res.success:
        return verify_res
    if item.init_submodules:
        res = self._init_submodules(repo_path)
        if not res.success:
            return res
    item.installed_path = repo_path
    msg = f"Git repository already exists at {repo_path}."
    logging.debug(msg)
    return InstallStatusResult(True, msg)
```
src/cloudai/systems/kubernetes/kubernetes_installer.py, line 153-162 (link)

Same submodule init gap as SlurmInstaller

The "repo already exists" early-return path here (lines 153–162) has the same issue as in SlurmInstaller._install_one_git_repo: item.init_submodules is never consulted, so running cloudai install a second time on a repo that was initially cloned without submodules will silently skip submodule initialization even if init_submodules = true is now set.

Please apply the same fix as described in the SlurmInstaller comment — check item.init_submodules before returning in the early-exit path.

_{Last reviewed commit: b6a5dcf}

greptile-apps · 2026-03-11T09:14:19Z

src/cloudai/systems/slurm/slurm_installer.py

+    def _init_submodules(self, path: Path) -> InstallStatusResult:
+        logging.debug(f"Initializing submodules in {path}")
+        submodule_cmd = ["git", "submodule", "update", "--init", "--recursive"]
+        result = subprocess.run(submodule_cmd, cwd=str(path), capture_output=True, text=True)
+        if result.returncode != 0:
+            return InstallStatusResult(False, f"Failed to initialize submodules: {result.stderr}")
+        return InstallStatusResult(True)
+


Identical _init_submodules method duplicated across installers

The _init_submodules implementation in SlurmInstaller (lines 300–307) is byte-for-byte identical to the one added in KubernetesInstaller (lines 247–254). Since both classes already inherit from BaseInstaller, the method should be lifted there to avoid diverging implementations in the future.

# in BaseInstaller (cloudai/_core/base_installer.py) def _init_submodules(self, path: Path) -> InstallStatusResult: logging.debug(f"Initializing submodules in {path}") submodule_cmd = ["git", "submodule", "update", "--init", "--recursive"] result = subprocess.run(submodule_cmd, cwd=str(path), capture_output=True, text=True) if result.returncode != 0: return InstallStatusResult(False, f"Failed to initialize submodules: {result.stderr}") return InstallStatusResult(True)

This also applies to src/cloudai/systems/kubernetes/kubernetes_installer.py at lines 247–254.

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml (1)
28-40: ⚠️ Potential issue | 🟠 Major

Do not leave hf_token empty in this test recipe.

This updated recipe still ships with hf_token = "", which makes the config unusable unless callers patch the TOML out-of-band. Please wire this to a user-supplied secret/input, or explicitly exempt this scenario from the non-empty validation path. Based on learnings: For Megatron Bridge test TOML configurations under conf/experimental/megatron_bridge/test/, hf_token must be non-empty; tests should fail validation if hf_token is left empty to ensure users supply their own Hugging Face token.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml`
around lines 28 - 40, The test TOML leaves the [cmd_args] key hf_token empty
which breaks validation; change the recipe to require a user-supplied token by
wiring hf_token to a runtime secret/input (e.g., use a placeholder token
variable like HF_TOKEN referenced by the deployment system) or update the config
validation logic to reject empty hf_token for Megatron Bridge test configs;
specifically modify the [cmd_args] hf_token entry and/or the validation path
that checks test recipes under conf/experimental/megatron_bridge/test to enforce
non-empty hf_token so tests fail if not provided.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/systems/slurm/slurm_system.py`:
- Around line 722-725: The current logic in
parse_node_list/parse_nodes/get_nodes_by_spec can produce an empty node list
after applying exclude_nodes and then fall back to an unconstrained allocation;
modify the code so that after filtering expanded_nodes = [n for n in
expanded_nodes if n not in exclude_nodes] you detect if expanded_nodes is empty
and immediately raise a clear exception (e.g., ValueError/RuntimeError) that
includes the offending node_spec and exclude_nodes, and ensure get_nodes_by_spec
propagates this error instead of returning (num_nodes, []). This prevents
SlurmCommandGenStrategy from seeing an empty nodelist and emitting -N for
unconstrained placement; alternatively, if a non-exceptional path is required,
return a sentinel that preserves the exclusion constraint and is handled by
SlurmCommandGenStrategy rather than an empty list fallback.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 323-351: The tests are using the defaults from the test factory so
they pass regardless of the override plumbing; update the failing tests to use
non-default values so the overrides are actually exercised: in
test_mount_as_adds_repo_to_container_mounts call make_test_run with a distinct
mount_as (e.g. "/opt/custom-megatron") and assert the wrapper_content contains
f"-cm {repo_path.absolute()}:/opt/custom-megatron" (use the same mount_as
string), in test_gpus_per_node_passed_as_additional_slurm_param pass
cmd_args_overrides={"gpus_per_node": 2} (or another non-default number) and
assert "gpus-per-node=2" appears in wrapper_content; keep using
MegatronBridgeSlurmCommandGenStrategy, make_test_run, and _wrapper_content to
locate where to change the values.

---

Outside diff comments:
In `@conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml`:
- Around line 28-40: The test TOML leaves the [cmd_args] key hf_token empty
which breaks validation; change the recipe to require a user-supplied token by
wiring hf_token to a runtime secret/input (e.g., use a placeholder token
variable like HF_TOKEN referenced by the deployment system) or update the config
validation logic to reject empty hf_token for Megatron Bridge test configs;
specifically modify the [cmd_args] hf_token entry and/or the validation path
that checks test recipes under conf/experimental/megatron_bridge/test to enforce
non-empty hf_token so tests fail if not provided.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b086916f-352d-4c4c-b94f-f9fa9c3846a0

📥 Commits

Reviewing files that changed from the base of the PR and between dd076e8 and e14fa6f.

📒 Files selected for processing (15)

conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
src/cloudai/_core/test_scenario.py
src/cloudai/models/scenario.py
src/cloudai/systems/kubernetes/kubernetes_installer.py
src/cloudai/systems/slurm/slurm_command_gen_strategy.py
src/cloudai/systems/slurm/slurm_installer.py
src/cloudai/systems/slurm/slurm_system.py
src/cloudai/test_scenario_parser.py
src/cloudai/workloads/megatron_bridge/megatron_bridge.py
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
tests/ref_data/megatron-bridge.sbatch
tests/systems/slurm/test_allocation.py
tests/systems/slurm/test_system.py
tests/test_git_repo_installer.py
tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

src/cloudai/systems/slurm/slurm_system.py

tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 323-334: The test function
test_mount_as_adds_repo_to_container_mounts declares an unused tmp_path fixture
parameter; remove tmp_path from the test signature so it becomes def
test_mount_as_adds_repo_to_container_mounts(self, configured_slurm_system:
SlurmSystem, make_test_run: Callable[..., TestRun]) -> None: and update any
references if the test was called elsewhere (none expected) to eliminate the
unused parameter warning; locate this function by name in
tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py and adjust
the signature accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d59eb2c8-9f04-4000-8d8d-493717387607

📥 Commits

Reviewing files that changed from the base of the PR and between e14fa6f and f26fe7b.

📒 Files selected for processing (6)

src/cloudai/systems/slurm/slurm_command_gen_strategy.py
src/cloudai/systems/slurm/slurm_system.py
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
tests/systems/slurm/test_command_gen_strategy.py
tests/systems/slurm/test_system.py
tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

coderabbitai · 2026-03-11T10:18:16Z

tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

+    def test_mount_as_adds_repo_to_container_mounts(
+        self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun], tmp_path: Path
+    ) -> None:
+        tr = make_test_run(mount_as="/opt/custom-megatron", output_subdir="out_mount")
+        tdef = cast(MegatronBridgeTestDefinition, tr.test)
+        repo_path = tdef.megatron_bridge_repo.installed_path
+        assert repo_path is not None
+
+        cmd_gen = MegatronBridgeSlurmCommandGenStrategy(configured_slurm_system, tr)
+        wrapper_content = self._wrapper_content(cmd_gen)
+        assert f"-cm {repo_path.absolute()}:/opt/custom-megatron" in wrapper_content
+


🧹 Nitpick | 🔵 Trivial

Remove unused tmp_path parameter.

The tmp_path fixture is declared but not used in this test method.

🧹 Suggested fix

def test_mount_as_adds_repo_to_container_mounts( - self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun], tmp_path: Path + self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun] ) -> None:

🧰 Tools

🪛 Ruff (0.15.5)

[warning] 324-324: Unused method argument: tmp_path

(ARG002)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py` around lines 323 - 334, The test function test_mount_as_adds_repo_to_container_mounts declares an unused tmp_path fixture parameter; remove tmp_path from the test signature so it becomes def test_mount_as_adds_repo_to_container_mounts(self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]) -> None: and update any references if the test was called elsewhere (none expected) to eliminate the unused parameter warning; locate this function by name in tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py and adjust the signature accordingly.

greptile-apps · 2026-03-11T10:22:10Z

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

+        _, node_list = self.get_cached_nodes_spec()
+        if node_list:
+            nodelist_str = ",".join(node_list)
+            additional_slurm_params.append(f"nodelist={nodelist_str}")
+
+        if additional_slurm_params:
+            parts.extend(["--additional_slurm_params", shlex.quote(" ".join(additional_slurm_params))])


exclude_nodes not forwarded to inner Megatron-Bridge job when nodes is empty

When test_run.nodes = [] (no explicit node list) and exclude_nodes is set, get_cached_nodes_spec() returns (num_nodes, []) — so node_list is empty, and no nodelist= entry is added to additional_slurm_params. The outer launcher sbatch does get #SBATCH --exclude=<excluded> (from _append_nodes_related_directives), but the inner training job submitted by setup_experiment.py receives no exclusion constraint and can freely schedule on the excluded nodes.

The PR description's documented use-case always pairs nodes = ["partition:group:count", ...] with exclude_nodes, so the resolved concrete nodelist is forwarded correctly. But when a user sets only exclude_nodes without explicit nodes (relying on Slurm's automatic allocation), the inner training job silently ignores the exclusion.

If setup_experiment.py supports an exclude= key in --additional_slurm_params, the fix would be:

_, node_list = self.get_cached_nodes_spec() if node_list: additional_slurm_params.append(f"nodelist={','.join(node_list)}") elif self.test_run.exclude_nodes: # Forward exclusion to the inner job when no explicit nodelist is resolved additional_slurm_params.append(f"exclude={self.test_run.exclude_nodes}")

At a minimum, the behaviour should be documented so users are aware of this limitation.

greptile-apps · 2026-03-11T10:48:17Z

src/cloudai/systems/slurm/slurm_system.py

+        elif nodes:
+            raise ValueError(
+                f"Node specifications {nodes} resolved to an empty node list after applying "
+                f"exclude_nodes={exclude_nodes}. Cannot fall back to unconstrained allocation."
+            )


Misleading error message when exclude_nodes is None

When parse_nodes returns an empty list because nodes are unavailable (e.g., all DRAIN/DOWN) and no exclusions were requested (exclude_nodes=None), the error message reads "after applying exclude_nodes=None", which implies the exclusions caused the problem when no exclusions were actually applied. This can mislead users who are debugging node availability issues.

Consider making the exclusion clause conditional:

Suggested change

elif nodes:

raise ValueError(

f"Node specifications {nodes} resolved to an empty node list after applying "

f"exclude_nodes={exclude_nodes}. Cannot fall back to unconstrained allocation."

)

elif nodes:

extra = f" after excluding {exclude_nodes}" if exclude_nodes else ""

raise ValueError(

f"Node specifications {nodes} resolved to an empty node list{extra}. "

f"Cannot fall back to unconstrained allocation."

)

podkidyshev · 2026-03-11T12:03:38Z

src/cloudai/models/scenario.py

+    exclude_nodes: str | None = Field(
+        default=None,
+        description="Comma-separated hostnames to exclude from the resolved node list.",
+    )


I would rather move this field up to the nodes definition above (they are about the same thing) and make it of type list[str] explicitly (just like nodes). wdyt?

not having None (just relying on the empty default list) I believe will also make node-parsing code a bit cleaner

would also be nice to have some example in the description how real values here may look like

no problem, have made it consistent with nodes, also descriptions are added

podkidyshev · 2026-03-11T12:15:04Z

src/cloudai/systems/slurm/slurm_command_gen_strategy.py

            return self._node_spec_cache[cache_key]

-        self._node_spec_cache[cache_key] = self.system.get_nodes_by_spec(self.test_run.nnodes, self.test_run.nodes)
+        excluded = set(parse_node_list(self.test_run.exclude_nodes)) if self.test_run.exclude_nodes else None


why does it suddenly becomes a set? I'd suggest to consistently use list[str] (as suggested in the other comment) and deduplicate in-place if needed inside functions where duplicates can actually arise

have cleanup this part, now all the handling logic is in SlurmSystem.get_nodes_by_spec.

src/cloudai/systems/slurm/slurm_system.py

podkidyshev · 2026-03-11T12:28:33Z

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

+            additional_slurm_params.append(f"exclude={self.test_run.exclude_nodes}")
+
+        if additional_slurm_params:
+            parts.extend(["--additional_slurm_params", shlex.quote(" ".join(additional_slurm_params))])


did you test the space separator? the docs say one should use semicolon as separator for multiple slurm parameters that may contain commas

Good point, this may not be fully tested with edge cases, have changed to follow the guidance

podkidyshev · 2026-03-11T12:31:17Z

src/cloudai/systems/kubernetes/kubernetes_installer.py

                rmtree(repo_path)
            return res

+        res = self._init_submodules(repo_path)


it doesn't feel right to me running this code for every git repo

let's maybe add a flag into GitRepo dataclass like init_submodules: bool = False and use this flag here to conditionally run submodules?

good point, have made this optional

podkidyshev · 2026-03-11T12:36:53Z

conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml

+detach = false
+
+[extra_env_vars]
+PYTHONPATH = "/opt/Megatron-Bridge/3rdparty/Megatron-LM:${PYTHONPATH}"


this is very nice that you discovered how to properly mount the repo into container, good job 💪

would you mind updating other configs in conf/experimental/megatron_bridge/test/? 🙏

no problem, other configs have also been updated

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/cloudai/_core/installables.py (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Fix copyright year to resolve CI failure.

The pipeline reports that the copyright year is invalid. It should be 2025-2026 to match the header pattern.

🔧 Proposed fix

-# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/_core/installables.py` at line 2, Update the file header line
that currently reads "Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All
rights reserved." to use the required year range "2025-2026" so it matches the
project's header pattern; locate the top-of-file copyright string in
src/cloudai/_core/installables.py and replace the single year with "2025-2026".

♻️ Duplicate comments (1)

tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py (1)

323-325: ⚠️ Potential issue | 🟡 Minor

Remove unused tmp_path fixture from test signature.

tmp_path is unused in this method (Line 324), and keeping it triggers ARG002 noise.

🧹 Proposed fix

 def test_mount_as_adds_repo_to_container_mounts(
-    self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun], tmp_path: Path
+    self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]
 ) -> None:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py` around
lines 323 - 325, The test function test_mount_as_adds_repo_to_container_mounts
currently declares an unused tmp_path fixture which triggers ARG002; remove
tmp_path from the function signature so it reads def
test_mount_as_adds_repo_to_container_mounts(self, configured_slurm_system:
SlurmSystem, make_test_run: Callable[..., TestRun]) -> None:, and ensure there
are no remaining references to tmp_path in the body of that test (adjust imports
or fixtures if needed).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_git_repo_installer.py`:
- Around line 281-300: The tests were reformatted by ruff-format; run the
formatter and pre-commit hooks to fix CI failures by running the project's
formatting hooks (e.g., pre-commit run --all-files or ruff format) and commit
the changes; ensure the test functions
(test_submodules_skipped_when_not_requested, test_submodules_run_when_requested)
and the installer methods referenced (installer._install_one_git_repo,
installer._init_submodules, installer._clone_repository,
installer._checkout_commit) remain unchanged in behavior after formatting so the
assertions and mocks still match the corrected file.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 362-370: The test test_test_run_extra_srun_args_forwarded
currently only checks for the existence of "--additional_slurm_params" which can
be present by default; update it to assert that the value derived from
TestRun.extra_srun_args is actually forwarded by
MegatronBridgeSlurmCommandGenStrategy. Specifically, set tr.extra_srun_args =
"--exclusive" (already done), call _wrapper_content(cmd_gen) as before, and
replace the loose assert with one that checks the wrapper_content contains the
forwarded flag and value (e.g., that the string representing the additional
slurm params includes "--exclusive" or an exact parameter encoding produced by
the command generator) so the test fails if extra_srun_args is ignored;
reference the test function name, the TestRun.extra_srun_args field, the
MegatronBridgeSlurmCommandGenStrategy constructor, and the _wrapper_content
helper to locate the code.
- Around line 335-343: Replace the brittle blanket check for "-cm" in
test_no_mount_as_skips_repo_container_mount with a precise assertion that the
Megatron-Bridge repo mount is not present: remove or change the assert "-cm" not
in wrapper_content to assert that the specific mount target or mount flag for
the repo (for example a string containing ':/opt/Megatron-Bridge' or the '-v
...:/opt/Megatron-Bridge' pattern) is not in wrapper_content, keeping the
existing assert "/opt/Megatron-Bridge" absence check or tightening it to match
the exact mount syntax emitted by MegatronBridgeSlurmCommandGenStrategy.

---

Outside diff comments:
In `@src/cloudai/_core/installables.py`:
- Line 2: Update the file header line that currently reads "Copyright (c) 2025
NVIDIA CORPORATION & AFFILIATES. All rights reserved." to use the required year
range "2025-2026" so it matches the project's header pattern; locate the
top-of-file copyright string in src/cloudai/_core/installables.py and replace
the single year with "2025-2026".

---

Duplicate comments:
In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 323-325: The test function
test_mount_as_adds_repo_to_container_mounts currently declares an unused
tmp_path fixture which triggers ARG002; remove tmp_path from the function
signature so it reads def test_mount_as_adds_repo_to_container_mounts(self,
configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]) ->
None:, and ensure there are no remaining references to tmp_path in the body of
that test (adjust imports or fixtures if needed).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 57739f39-82cc-44a7-add5-3cc1080fcf41

📥 Commits

Reviewing files that changed from the base of the PR and between a159736 and 52cef4c.

📒 Files selected for processing (17)

conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml
conf/experimental/megatron_bridge/test/gb200/megatron_bridge_qwen_30b.toml
conf/experimental/megatron_bridge/test/gb300/megatron_bridge_qwen_30b.toml
conf/experimental/megatron_bridge/test/h100/megatron_bridge_qwen_30b.toml
src/cloudai/_core/installables.py
src/cloudai/_core/test_scenario.py
src/cloudai/models/scenario.py
src/cloudai/systems/kubernetes/kubernetes_installer.py
src/cloudai/systems/slurm/slurm_command_gen_strategy.py
src/cloudai/systems/slurm/slurm_installer.py
src/cloudai/systems/slurm/slurm_system.py
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
tests/systems/slurm/test_allocation.py
tests/systems/slurm/test_command_gen_strategy.py
tests/systems/slurm/test_system.py
tests/test_git_repo_installer.py
tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

tests/test_git_repo_installer.py

tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

greptile-apps · 2026-03-16T10:30:21Z

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

+        if args.gpus_per_node:
+            additional_slurm_params.append(f"gpus-per-node={args.gpus_per_node}")


gpus_per_node truthiness check misses fields_set guard

gpus_per_node is always appended to additional_slurm_params whenever args.gpus_per_node is truthy, regardless of whether it was explicitly set by the user (i.e., is in fields_set). Every other flag uses add_field(field, flag, value) to respect fields_set. While gpus_per_node is in force_fields (so it is always emitted via -gn), the inconsistency could produce a redundant gpus-per-node=N in --additional_slurm_params even in scenarios where the user explicitly wants to rely on the Slurm cluster's default GPU allocation rather than hard-coding it.

Consider guarding with "gpus_per_node" in fields_set:

Suggested change

if args.gpus_per_node:

additional_slurm_params.append(f"gpus-per-node={args.gpus_per_node}")

if args.gpus_per_node and "gpus_per_node" in fields_set:

additional_slurm_params.append(f"gpus-per-node={args.gpus_per_node}")

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/_core/installables.py`:
- Line 94: The early return in both _install_one_git_repo() implementations
skips submodule setup for existing clones, so ensure submodules are initialized
when item.init_submodules is true even if repo_path.exists() and commit matches;
move or duplicate the item.init_submodules handling out of the post-clone-only
block (or add an explicit check after the early-return path) so that when
repo_path.exists() and item.init_submodules is true you call the same submodule
init/update sequence used after cloning (reference symbols:
_install_one_git_repo, repo_path.exists(), item.init_submodules).

In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`:
- Around line 111-133: The _parse_srun_args_as_slurm_params function incorrectly
uses str.split() which breaks quoted shell arguments; replace tokens =
srun_args.split() with tokens = shlex.split(srun_args) so quoted values (e.g.
--comment "nightly run") are treated as a single token and the rest of the
parsing logic (the while loop handling --key=val, --key value, and standalone
flags) remains unchanged; shlex is already imported so just swap the split call
to shlex.split.

In `@tests/test_git_repo_installer.py`:
- Around line 281-299: Add a regression test that covers the "existing repo"
fast path with init_submodules enabled: create a test (in
tests/test_git_repo_installer.py) that pre-creates the repository directory
referenced by the GitRepo fixture (repo_path), set git.init_submodules = True,
mock installer._verify_commit to return InstallStatusResult(True) and mock
installer._clone_repository/_checkout_commit as needed, call
installer._install_one_git_repo(git), assert the result is successful and that
installer._init_submodules.assert_called_once() to ensure submodules are
initialized on the pre-existing-repo path.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 375-385: Add a new unit test in the same test file exercising
MegatronBridgeSlurmCommandGenStrategy._parse_srun_args_as_slurm_params with a
quoted value, e.g. call it with '--comment "nightly run" --reservation
my_reserv' and assert the returned list includes 'comment=nightly run' and
'reservation=my_reserv' (ordering-insensitive or match exact expected order used
in existing tests); this ensures the parser correctly strips quotes and
preserves spaces for values like "nightly run".

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 001e2203-7e96-421a-ad8b-79c332c8123d

📥 Commits

Reviewing files that changed from the base of the PR and between 52cef4c and b6a5dcf.

📒 Files selected for processing (5)

conf/experimental/megatron_bridge/test_scenario/megatron_bridge_qwen_30b.toml
src/cloudai/_core/installables.py
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
tests/test_git_repo_installer.py
tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

coderabbitai · 2026-03-16T10:32:12Z

src/cloudai/_core/installables.py


    url: str
    commit: str
+    init_submodules: bool = False


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Inspecting existing-repo fast path vs submodule init path..." for f in $(fd 'kubernetes_installer.py|slurm_installer.py' src/cloudai/systems); do echo "===== $f =====" rg -n -C6 '_install_one_git_repo|repo_path.exists|_verify_commit|init_submodules|_init_submodules' "$f" done

Repository: NVIDIA/cloudai

Length of output: 12550

init_submodules is silently bypassed on pre-existing clones.

Both _install_one_git_repo() implementations return early when repo_path.exists() and the commit is verified (kubernetes_installer.py:162, slurm_installer.py:215). The if item.init_submodules: block only executes after clone, so submodules are never initialized for repositories that already exist. A repo cloned without submodules in one run will remain without submodules in a later run requesting init_submodules=True.

Move the submodule initialization outside the early-return path or add a check to initialize submodules even for existing repos when item.init_submodules is true.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cloudai/_core/installables.py` at line 94, The early return in both _install_one_git_repo() implementations skips submodule setup for existing clones, so ensure submodules are initialized when item.init_submodules is true even if repo_path.exists() and commit matches; move or duplicate the item.init_submodules handling out of the post-clone-only block (or add an explicit check after the early-return path) so that when repo_path.exists() and item.init_submodules is true you call the same submodule init/update sequence used after cloning (reference symbols: _install_one_git_repo, repo_path.exists(), item.init_submodules).

coderabbitai · 2026-03-16T10:32:12Z

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

+    @staticmethod
+    def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]:
+        """
+        Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params.
+
+        Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare
+        key names without a ``=value`` suffix.
+        """
+        params: list[str] = []
+        tokens = srun_args.split()
+        i = 0
+        while i < len(tokens):
+            tok = tokens[i]
+            if tok.startswith("--") and "=" in tok:
+                key, val = tok[2:].split("=", 1)
+                params.append(f"{key}={val}")
+            elif tok.startswith("--") and i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
+                params.append(f"{tok[2:]}={tokens[i + 1]}")
+                i += 1
+            elif tok.startswith("--"):
+                params.append(tok[2:])
+            i += 1
+        return params


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail python - <<'PY' import shlex s='--comment "nightly run" --reservation=my_reserv --exclusive' print("str.split():", s.split()) print("shlex.split():", shlex.split(s)) PY

Repository: NVIDIA/cloudai

Length of output: 1775

🏁 Script executed:

#!/bin/bash set -euo pipefail python3 - <<'PY' import shlex s='--comment "nightly run" --reservation=my_reserv --exclusive' print("str.split():", s.split()) print("shlex.split():", shlex.split(s)) PY

Repository: NVIDIA/cloudai

Length of output: 232

🌐 Web query:

shlex.split() vs str.split() Python shell argument parsing quoted strings

💡 Result:

Key difference

str.split(): splits a string by a delimiter (or by runs of whitespace if sep is omitted/None). It does not understand quotes or backslash escapes—quotes stay in the output and spaces inside quotes still split. [1]

shlex.split(): tokenizes using shell-like rules, so it respects quoted substrings and (in POSIX mode, the default) handles backslash escaping in a shell-ish way. It’s meant for parsing “quoted strings” / mini-languages and shell-style command lines. [2]

Example (quoted arguments)

s = 'cmd --name "Ada Lovelace" --path C:\\Temp\\file.txt' s.split() # ['cmd', '--name', '"Ada', 'Lovelace"', '--path', 'C:\\Temp\\file.txt'] import shlex shlex.split(s) # ['cmd', '--name', 'Ada Lovelace', '--path', 'C:Tempfile.txt'] # note: backslashes treated as escapes in POSIX mode

shlex.split() will also raise ValueError if quotes aren’t closed (because it can’t determine the token boundary). [2]

When to use which

Use str.split() for simple delimiter/whitespace splitting where quoting doesn’t matter. [1]

Use shlex.split() when you need shell-like argument parsing (quoted strings, escapes). [2]

Note: shlex.split() is shell-like, not a full shell: it doesn’t perform expansions like globbing (*.txt) or $VAR expansion—those are shell features, not tokenization.

Sources: Python str.split docs [1], Python shlex.split docs [2].

🏁 Script executed:

#!/bin/bash # Check imports in the file and full function context head -50 "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" | cat -n

Repository: NVIDIA/cloudai

Length of output: 2293

🏁 Script executed:

#!/bin/bash # Search for shlex usage and imports rg "import shlex|from shlex" "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py"

Repository: NVIDIA/cloudai

Length of output: 70

🏁 Script executed:

#!/bin/bash # Get the full function to verify the logic is correct after the fix sed -n '111,133p' "src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py" | cat -n

Repository: NVIDIA/cloudai

Length of output: 1154

🏁 Script executed:

#!/bin/bash # Check if there are any tests for this function rg "_parse_srun_args_as_slurm_params" --type py -A 5

Repository: NVIDIA/cloudai

Length of output: 2775

Use shlex.split() instead of str.split() to properly parse shell-like arguments with quoted values.

Line 120 uses .split(), which cannot handle quoted values in arguments. For example, --comment "nightly run" is incorrectly tokenized as ['--comment', '"nightly', 'run"'] instead of ['--comment', 'nightly run']. This produces invalid additional_slurm_params output.

Use shlex.split() (already imported at line 20) to handle shell-like quoting and escaping correctly:

Proposed fix

`@staticmethod` def _parse_srun_args_as_slurm_params(srun_args: str) -> list[str]: """ Convert ``--key value`` pairs from extra_srun_args into ``key=value`` for --additional_slurm_params. Standalone boolean flags (e.g. ``--exclusive``) are emitted as bare key names without a ``=value`` suffix. """ params: list[str] = [] - tokens = srun_args.split() + tokens = shlex.split(srun_args)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around lines 111 - 133, The _parse_srun_args_as_slurm_params function incorrectly uses str.split() which breaks quoted shell arguments; replace tokens = srun_args.split() with tokens = shlex.split(srun_args) so quoted values (e.g. --comment "nightly run") are treated as a single token and the rest of the parsing logic (the while loop handling --key=val, --key value, and standalone flags) remains unchanged; shlex is already imported so just swap the split call to shlex.split.

coderabbitai · 2026-03-16T10:32:12Z

tests/test_git_repo_installer.py

+    def test_submodules_skipped_when_not_requested(
+        self, installer: Union[KubernetesInstaller, SlurmInstaller], git: GitRepo
+    ):
+        installer._clone_repository = Mock(return_value=InstallStatusResult(True))
+        installer._checkout_commit = Mock(return_value=InstallStatusResult(True))
+        installer._init_submodules = Mock(return_value=InstallStatusResult(True))
+        res = installer._install_one_git_repo(git)
+        assert res.success
+        installer._init_submodules.assert_not_called()
+
+    def test_submodules_run_when_requested(self, installer: Union[KubernetesInstaller, SlurmInstaller], git: GitRepo):
+        git.init_submodules = True
+        installer._clone_repository = Mock(return_value=InstallStatusResult(True))
+        installer._checkout_commit = Mock(return_value=InstallStatusResult(True))
+        installer._init_submodules = Mock(return_value=InstallStatusResult(True))
+        res = installer._install_one_git_repo(git)
+        assert res.success
+        installer._init_submodules.assert_called_once()
+


🛠️ Refactor suggestion | 🟠 Major

Missing regression case: existing repo + init_submodules=True.

The new tests cover clone flow well, but they don’t guard the pre-existing-repo fast path. Add one test that pre-creates repo_path, mocks successful _verify_commit, sets git.init_submodules = True, and asserts _init_submodules is called.

✅ Suggested test addition

+ def test_existing_repo_with_init_submodules_runs_submodule_init( + self, installer: Union[KubernetesInstaller, SlurmInstaller], git: GitRepo + ): + git.init_submodules = True + repo_path = installer.system.install_path / git.repo_name + repo_path.mkdir(parents=True, exist_ok=True) + installer._verify_commit = Mock(return_value=InstallStatusResult(True)) + installer._init_submodules = Mock(return_value=InstallStatusResult(True)) + + res = installer._install_one_git_repo(git) + + assert res.success + installer._init_submodules.assert_called_once_with(repo_path)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def test_submodules_skipped_when_not_requested(

self, installer: Union[KubernetesInstaller, SlurmInstaller], git: GitRepo

):

installer._clone_repository = Mock(return_value=InstallStatusResult(True))

installer._checkout_commit = Mock(return_value=InstallStatusResult(True))

installer._init_submodules = Mock(return_value=InstallStatusResult(True))

res = installer._install_one_git_repo(git)

assert res.success

installer._init_submodules.assert_not_called()

def test_submodules_run_when_requested(self, installer: Union[KubernetesInstaller, SlurmInstaller], git: GitRepo):

git.init_submodules = True

installer._clone_repository = Mock(return_value=InstallStatusResult(True))

installer._checkout_commit = Mock(return_value=InstallStatusResult(True))

installer._init_submodules = Mock(return_value=InstallStatusResult(True))

res = installer._install_one_git_repo(git)

assert res.success

installer._init_submodules.assert_called_once()

def test_submodules_skipped_when_not_requested(

self, installer: Union[KubernetesInstaller, SlurmInstaller], git: GitRepo

):

installer._clone_repository = Mock(return_value=InstallStatusResult(True))

installer._checkout_commit = Mock(return_value=InstallStatusResult(True))

installer._init_submodules = Mock(return_value=InstallStatusResult(True))

res = installer._install_one_git_repo(git)

assert res.success

installer._init_submodules.assert_not_called()

def test_submodules_run_when_requested(self, installer: Union[KubernetesInstaller, SlurmInstaller], git: GitRepo):

git.init_submodules = True

installer._clone_repository = Mock(return_value=InstallStatusResult(True))

installer._checkout_commit = Mock(return_value=InstallStatusResult(True))

installer._init_submodules = Mock(return_value=InstallStatusResult(True))

res = installer._install_one_git_repo(git)

assert res.success

installer._init_submodules.assert_called_once()

def test_existing_repo_with_init_submodules_runs_submodule_init(

self, installer: Union[KubernetesInstaller, SlurmInstaller], git: GitRepo

):

git.init_submodules = True

repo_path = installer.system.install_path / git.repo_name

repo_path.mkdir(parents=True, exist_ok=True)

installer._verify_commit = Mock(return_value=InstallStatusResult(True))

installer._init_submodules = Mock(return_value=InstallStatusResult(True))

res = installer._install_one_git_repo(git)

assert res.success

installer._init_submodules.assert_called_once_with(repo_path)

🧰 Tools

🪛 Ruff (0.15.5)

[warning] 282-282: Use X | Y for type annotations

Convert to X | Y

(UP007)

[warning] 284-284: Boolean positional value in function call

(FBT003)

[warning] 285-285: Boolean positional value in function call

(FBT003)

[warning] 286-286: Boolean positional value in function call

(FBT003)

[warning] 291-291: Use X | Y for type annotations

Convert to X | Y

(UP007)

[warning] 293-293: Boolean positional value in function call

(FBT003)

[warning] 294-294: Boolean positional value in function call

(FBT003)

[warning] 295-295: Boolean positional value in function call

(FBT003)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test_git_repo_installer.py` around lines 281 - 299, Add a regression test that covers the "existing repo" fast path with init_submodules enabled: create a test (in tests/test_git_repo_installer.py) that pre-creates the repository directory referenced by the GitRepo fixture (repo_path), set git.init_submodules = True, mock installer._verify_commit to return InstallStatusResult(True) and mock installer._clone_repository/_checkout_commit as needed, call installer._install_one_git_repo(git), assert the result is successful and that installer._init_submodules.assert_called_once() to ensure submodules are initialized on the pre-existing-repo path.

coderabbitai · 2026-03-16T10:32:12Z

tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

+    def test_parse_srun_args_as_slurm_params(self) -> None:
+        result = MegatronBridgeSlurmCommandGenStrategy._parse_srun_args_as_slurm_params(
+            "--reservation my_reserv --constraint=gpu"
+        )
+        assert result == ["reservation=my_reserv", "constraint=gpu"]
+
+    def test_parse_srun_args_boolean_flags(self) -> None:
+        result = MegatronBridgeSlurmCommandGenStrategy._parse_srun_args_as_slurm_params(
+            "--exclusive --reservation my_reserv --overcommit"
+        )
+        assert result == ["exclusive", "reservation=my_reserv", "overcommit"]


🧹 Nitpick | 🔵 Trivial

Add quoted-value coverage for _parse_srun_args_as_slurm_params.

Current parser tests won’t catch regressions for values containing spaces/quotes. Please add a case like --comment "nightly run".

🧪 Suggested test

+ def test_parse_srun_args_preserves_quoted_values(self) -> None: + result = MegatronBridgeSlurmCommandGenStrategy._parse_srun_args_as_slurm_params( + '--comment "nightly run" --reservation my_reserv' + ) + assert result == ["comment=nightly run", "reservation=my_reserv"]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py` around lines 375 - 385, Add a new unit test in the same test file exercising MegatronBridgeSlurmCommandGenStrategy._parse_srun_args_as_slurm_params with a quoted value, e.g. call it with '--comment "nightly run" --reservation my_reserv' and assert the returned list includes 'comment=nightly run' and 'reservation=my_reserv' (ordering-insensitive or match exact expected order used in existing tests); this ensures the parser correctly strips quotes and preserves spaces for values like "nightly run".

juntaowww added 9 commits March 11, 2026 09:00

Make node placement correctly propagate M-Bridge

a0d54a4

Allow excluding nodes from group of nodes

e9b1218

Also init submodules when install git repos

d720616

Add options to mount the repo in the container

ea4844d

Fix slurm gpu resource requesting

9f786ff

Update gb200 M-Bridge r0.3.0 qwen recipe

e6c9cb2

ruff check & ruff format

2acfb29

Fix tests to reflect updates

45e6153

Update copyright year

e14fa6f

juntaowww requested review from amaslenn, jeffnvidia and srivatsankrishnan as code owners March 11, 2026 09:08

greptile-apps bot reviewed Mar 11, 2026

View reviewed changes

coderabbitai bot reviewed Mar 11, 2026

View reviewed changes

src/cloudai/systems/slurm/slurm_system.py Show resolved Hide resolved

tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py Outdated Show resolved Hide resolved

juntaowww added 3 commits March 11, 2026 11:28

Add back the defensive filter

54f4fbb

Enhance nodes allocation for edge cases

ff2f2fd

Enhance tests by avoiding default names

f26fe7b

coderabbitai bot reviewed Mar 11, 2026

View reviewed changes

greptile-apps bot reviewed Mar 11, 2026

View reviewed changes

Make exclude nodes correctly propagate to M-Bridge

a159736

greptile-apps bot reviewed Mar 11, 2026

View reviewed changes

podkidyshev reviewed Mar 11, 2026

View reviewed changes

coderabbitai bot mentioned this pull request Mar 11, 2026

Propagate exclude_nodes to all get_nodes_by_spec callers (nemo-run, nemo-launcher, single-sbatch-runner, and others) #832

Open

podkidyshev reviewed Mar 11, 2026

View reviewed changes

juntaowww added 4 commits March 16, 2026 08:23

Improve error messages for excluding nodes

5da0a57

Improve exclude nodes APIs

2d7b98b

Change additional_slurm_params separator to semi-colon

174dcef

Make init_submodules optional

3480ba8

juntaowww added 3 commits March 16, 2026 09:21

Update configurations

e12ec45

Make extra_srun_args correctly propagate to M-Bridge

af471d5

Allow submission of jobs to resv nodes

52cef4c

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

tests/test_git_repo_installer.py Show resolved Hide resolved

tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py Show resolved Hide resolved

tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py Show resolved Hide resolved

greptile-apps bot reviewed Mar 16, 2026

View reviewed changes

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py Show resolved Hide resolved

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py Show resolved Hide resolved

juntaowww added 3 commits March 16, 2026 10:52

Make the no-mount assertion less brittle

62f5cf7

Fix and improve the extra_srun_args propagation

f1b027e

ruff format & copyright year

b6a5dcf

greptile-apps bot reviewed Mar 16, 2026

View reviewed changes

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

		if args.gpus_per_node:
		additional_slurm_params.append(f"gpus-per-node={args.gpus_per_node}")

Conversation

juntaowww commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Plan

Additional Notes

Uh oh!

coderabbitai bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

greptile-apps bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (2)

Uh oh!

greptile-apps bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

juntaowww commented Mar 11, 2026 •

edited

Loading

coderabbitai bot commented Mar 11, 2026 •

edited

Loading

greptile-apps bot commented Mar 11, 2026 •

edited

Loading