
fix(pt): recognize AOTInductor-wrapped CUDA OOM in AutoBatchSize #5418

Open

OutisLi wants to merge 2 commits into deepmodeling:master from OutisLi:pr/bs

Conversation

@OutisLi (Collaborator) commented Apr 24, 2026

When running dp --pt-expt test (or any path that goes through deepmd.pt_expt.infer.deep_eval) against a .pt2 AOTInductor package, AutoBatchSize doubles the batch on every success. For models with a large sel the exploration eventually saturates GPU memory, and the CUDA caching allocator raises the usual "CUDA out of memory" from inside the AOTInductor runtime. AOTInductor then rewraps that error as a generic

    RuntimeError: run_func_(...) API call failed at
        .../aoti_runner/model_container_runner.cpp, line 144

The original "CUDA out of memory" text is printed only to stderr, so the old is_oom_error -- which keyed on a short list of substrings in e.args[0] -- never matched. execute() therefore did not shrink the batch; the exception propagated, and the run crashed on a GPU that was otherwise completely idle (as confirmed by monitoring nvidia-smi --query-compute-apps, which showed dp itself as the sole consumer holding tens of GiB just before the failure).

Widen is_oom_error to:

  • walk the exception chain via __cause__ / __context__, so that a future PyTorch preserving the original OOM text is handled for free;
  • keep matching the four plain CUDA OOM markers on every message in the chain;
  • additionally treat the AOTInductor wrapper signature (run_func_( plus model_container_runner) as an OOM candidate. If the AOTInductor wrapper ever hides a non-OOM failure, the batch shrinker will halve down to 1 and then raise OutOfMemoryError, so the fallback is bounded -- non-OOM bugs still surface with a clear terminal error rather than being silently retried forever.
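The widened check described above can be sketched as follows. This is an illustrative, torch-free reconstruction, not the PR's verbatim code: iter_exception_chain, looks_like_oom, and the marker list are hypothetical names, and the real method additionally matches cuSolver errors and calls torch.cuda.empty_cache() before returning True.

```python
def iter_exception_chain(exc):
    """Yield exc followed by every exception reachable via __cause__ / __context__."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against self-referential chains
        yield exc
        exc = exc.__cause__ or exc.__context__


# Illustrative subset of the plain CUDA OOM markers.
PLAIN_OOM_MARKERS = ("CUDA out of memory", "CUDA error: out of memory")


def looks_like_oom(e):
    """Sketch of the widened check: match markers anywhere in the chain."""
    if not isinstance(e, RuntimeError):
        return False
    msgs = [str(ex) for ex in iter_exception_chain(e)]
    if any(m in msg for msg in msgs for m in PLAIN_OOM_MARKERS):
        return True
    # AOTInductor (.pt2) rewraps the original OOM as a generic wrapper error;
    # treat its signature as an OOM candidate.
    return any(
        "run_func_(" in msg and "model_container_runner" in msg for msg in msgs
    )


# A wrapped OOM: the original text only survives in __cause__ (if at all),
# but the wrapper signature alone is already enough to match.
try:
    try:
        raise RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")
    except RuntimeError as inner:
        raise RuntimeError(
            "run_func_(...) API call failed at "
            ".../aoti_runner/model_container_runner.cpp, line 144"
        ) from inner
except RuntimeError as wrapped:
    assert looks_like_oom(wrapped)
```

Because the chain walk and the wrapper check are independent, either a future PyTorch that preserves the OOM text or today's opaque wrapper message triggers the same shrink path.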

Summary by CodeRabbit

  • Bug Fixes
    • Improved out-of-memory detection to identify more CUDA memory exhaustion scenarios, including wrapped/instrumented failures; now reliably triggers cache cleanup when OOM conditions are detected.
  • Tests
    • Added unit tests verifying OOM detection across different error shapes and that CUDA cache clearing is invoked for detected OOMs.

Copilot AI review requested due to automatic review settings April 24, 2026 02:25
@dosubot dosubot Bot added the bug label Apr 24, 2026
@OutisLi OutisLi requested a review from njzjz April 24, 2026 02:28
@OutisLi OutisLi requested a review from njzjz April 24, 2026 02:28
@coderabbitai coderabbitai Bot (Contributor) commented Apr 24, 2026

📝 Walkthrough

The AutoBatchSize.is_oom_error method was refactored to more reliably detect CUDA out-of-memory failures by handling torch.cuda.OutOfMemoryError, validating RuntimeError types, traversing exception cause/context chains to collect messages, matching CUDA/cuSolver and AOTInductor/PT2 wrapper patterns, and clearing CUDA cache when OOM is detected.

Changes

  • OOM detection logic (deepmd/pt/utils/auto_batch_size.py): refactored is_oom_error to handle torch.cuda.OutOfMemoryError, reject non-RuntimeError exceptions, traverse exception __cause__/__context__ to aggregate message strings, detect CUDA/cuSolver OOM markers and AOTInductor/PT2 wrapper signatures, call torch.cuda.empty_cache() on detection, and return a boolean result.
  • Unit tests (source/tests/pt/test_auto_batch_size.py): added tests covering direct CUDA OOM text, OOM in RuntimeError.__cause__, and AOTInductor/PT2 wrapper message detection; mocks torch.cuda.empty_cache to assert it is invoked when OOM is detected.
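The added tests mock torch.cuda.empty_cache to assert it fires only on detected OOMs. The same patching pattern can be shown self-contained, with a stand-in Detector/FakeCuda in place of the real AutoBatchSize (illustrative names, so the sketch runs without torch or a GPU):

```python
import unittest
from unittest import mock


class FakeCuda:
    """Stand-in for torch.cuda so the pattern runs without torch or a GPU."""

    @staticmethod
    def empty_cache():
        pass


class Detector:
    """Illustrative stand-in for AutoBatchSize, not the deepmd class."""

    cuda = FakeCuda

    def is_oom_error(self, e):
        # Simplified detection: the real method also walks the exception chain.
        if isinstance(e, RuntimeError) and "CUDA out of memory" in str(e):
            self.cuda.empty_cache()
            return True
        return False


class TestOOMDetection(unittest.TestCase):
    def test_empty_cache_called_on_oom(self):
        with mock.patch.object(FakeCuda, "empty_cache") as empty_cache:
            self.assertTrue(
                Detector().is_oom_error(RuntimeError("CUDA out of memory"))
            )
            empty_cache.assert_called_once()

    def test_empty_cache_not_called_on_unrelated_error(self):
        with mock.patch.object(FakeCuda, "empty_cache") as empty_cache:
            self.assertFalse(Detector().is_oom_error(RuntimeError("shape mismatch")))
            empty_cache.assert_not_called()
```

In the repository's tests the patch target is the module-level "deepmd.pt.utils.auto_batch_size.torch.cuda.empty_cache" rather than a local class attribute.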

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 20.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'fix(pt): recognize AOTInductor-wrapped CUDA OOM in AutoBatchSize' clearly and specifically describes the main change: improving OOM detection for AOTInductor-wrapped errors in the AutoBatchSize class.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


@coderabbitai coderabbitai Bot (Contributor) left a comment


🧹 Nitpick comments (1)
deepmd/pt/utils/auto_batch_size.py (1)

79-96: Optional: consolidate the three empty_cache() + return True branches.

Minor readability nit — the three OOM-positive branches repeat the same side effect. You can fold them into a single exit point without changing behavior.

♻️ Suggested consolidation
-        if any(m in msg for msg in msgs for m in plain_oom_markers):
-            torch.cuda.empty_cache()
-            return True
-
-        # AOTInductor (.pt2) wraps the underlying CUDA OOM as a generic
-        # ``run_func_(...) API call failed at .../model_container_runner.cpp``.
-        # ...
-        aoti_wrapped = any(
-            "run_func_(" in msg and "model_container_runner" in msg for msg in msgs
-        )
-        if aoti_wrapped:
-            torch.cuda.empty_cache()
-            return True
-
-        return False
+        plain_oom = any(m in msg for msg in msgs for m in plain_oom_markers)
+        # AOTInductor (.pt2) wraps the underlying CUDA OOM as a generic
+        # ``run_func_(...) API call failed at .../model_container_runner.cpp``.
+        # The original "CUDA out of memory" text is printed to stderr only, so
+        # we match on the wrapper signature. If the root cause is not OOM,
+        # ``execute()`` will shrink to batch size 1 and raise ``OutOfMemoryError``.
+        aoti_wrapped = any(
+            "run_func_(" in msg and "model_container_runner" in msg for msg in msgs
+        )
+        if plain_oom or aoti_wrapped:
+            torch.cuda.empty_cache()
+            return True
+        return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/utils/auto_batch_size.py` around lines 79 - 96, The three places
that call torch.cuda.empty_cache() and return True (the plain OOM marker check
using plain_oom_markers, the earlier OOM detection loop over msgs, and the
AOTInductor wrapper check that sets aoti_wrapped) should be consolidated to a
single exit point: compute the boolean condition variables (e.g., plain_match =
any(m in msg for msg in msgs for m in plain_oom_markers), other_match = ... ,
aoti_wrapped = any("run_func_(" in msg and "model_container_runner" in msg for
msg in msgs)), combine them (e.g., if plain_match or other_match or
aoti_wrapped) then call torch.cuda.empty_cache() once and return True; update
the surrounding function (the OOM detection logic in auto_batch_size) to use
these names so the duplicated side effects are removed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d8e9df32-3cac-49f1-ae75-1e812e9778fb

📥 Commits

Reviewing files that changed from the base of the PR and between 54f42d9 and 0dcff08.

📒 Files selected for processing (1)
  • deepmd/pt/utils/auto_batch_size.py

Copilot AI (Contributor) left a comment

Pull request overview

Improves PyTorch AutoBatchSize OOM detection for .pt2 AOTInductor-packaged models by recognizing AOTInductor’s wrapped CUDA OOM failures, allowing batch size to shrink instead of crashing.

Changes:

  • Expand OOM detection to scan the exception chain (__cause__ / __context__) for known CUDA/cusolver OOM markers.
  • Add detection for AOTInductor wrapper error signatures (run_func_( + model_container_runner) and treat them as OOM.
  • Keep GPU cache cleanup (torch.cuda.empty_cache()) when an OOM is detected.


Comment thread deepmd/pt/utils/auto_batch_size.py
Comment thread deepmd/pt/utils/auto_batch_size.py Outdated
@codecov codecov Bot commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.46%. Comparing base (5c22e17) to head (a00b10f).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5418      +/-   ##
==========================================
- Coverage   80.46%   80.46%   -0.01%     
==========================================
  Files         823      823              
  Lines       86625    86645      +20     
  Branches     4139     4139              
==========================================
+ Hits        69701    69717      +16     
  Misses      15651    15651              
- Partials     1273     1277       +4     


@njzjz-bot njzjz-bot (Contributor) left a comment

Nice fix. I walked through the exception-chain handling and the AOTInductor wrapper detection in AutoBatchSize, and the fallback behavior still looks bounded/reasonable. The current CI matrix is green, so I'm happy with this as-is.

— OpenClaw 2026.4.22 (model: gpt-5.4)

Comment thread deepmd/pt/utils/auto_batch_size.py
OutisLi added 2 commits April 25, 2026 10:36
@coderabbitai coderabbitai Bot (Contributor) left a comment

🧹 Nitpick comments (1)
source/tests/pt/test_auto_batch_size.py (1)

14-43: Consider adding a negative case and a direct OutOfMemoryError case.

The three new tests cover the chain-traversal and AOTI-wrapper paths well, but two coverage gaps remain that would harden against future regressions:

  1. The direct torch.cuda.OutOfMemoryError branch (early return at the top of is_oom_error) isn't exercised — instantiating one is a bit awkward (it requires CUDA), but a mock.patch on torch.cuda.OutOfMemoryError or constructing via RuntimeError-like surrogate plus isinstance patching can do it; alternatively a simple skip-if-no-CUDA path also works.
  2. There's no negative test asserting that an unrelated RuntimeError (e.g. RuntimeError("shape mismatch")) returns False and that empty_cache is not called. This guards against the marker list/AOTI heuristic accidentally widening into false positives, which would silently clear caches and shrink batch sizes on non-OOM bugs.
♻️ Suggested addition
+    @mock.patch("deepmd.pt.utils.auto_batch_size.torch.cuda.empty_cache")
+    def test_is_oom_error_non_oom_runtime_error(self, empty_cache) -> None:
+        auto_batch_size = AutoBatchSize(256, 2.0)
+        self.assertFalse(
+            auto_batch_size.is_oom_error(RuntimeError("shape mismatch"))
+        )
+        empty_cache.assert_not_called()
+
+    @mock.patch("deepmd.pt.utils.auto_batch_size.torch.cuda.empty_cache")
+    def test_is_oom_error_non_runtime_error(self, empty_cache) -> None:
+        auto_batch_size = AutoBatchSize(256, 2.0)
+        self.assertFalse(auto_batch_size.is_oom_error(ValueError("nope")))
+        empty_cache.assert_not_called()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/pt/test_auto_batch_size.py` around lines 14 - 43, Add two tests
to TestAutoBatchSize covering (1) direct torch.cuda.OutOfMemoryError and (2) a
negative unrelated RuntimeError: create a test that patches
torch.cuda.OutOfMemoryError (or mocks isinstance checks) and asserts
AutoBatchSize.is_oom_error returns True and torch.cuda.empty_cache is called,
and add another test that calls AutoBatchSize.is_oom_error(RuntimeError("shape
mismatch")) asserting it returns False and that the patched
torch.cuda.empty_cache was NOT called; reference the AutoBatchSize.is_oom_error
method and the existing tests that patch
"deepmd.pt.utils.auto_batch_size.torch.cuda.empty_cache" to mirror setup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 04fc2d82-1a39-480a-957e-57976382f46c

📥 Commits

Reviewing files that changed from the base of the PR and between 0dcff08 and a00b10f.

📒 Files selected for processing (2)
  • deepmd/pt/utils/auto_batch_size.py
  • source/tests/pt/test_auto_batch_size.py


# AOTInductor (.pt2) wraps the underlying CUDA OOM as a generic
# ``run_func_(...) API call failed at .../model_container_runner.cpp``.
# https://github.com/deepmodeling/deepmd-kit/issues/4594
Contributor left a comment

I think #4594 is cited in the wrong place.

That issue looks like a plain CUDA OOM report (RuntimeError: CUDA error: out of memory), but this comment block is specifically describing the AOTInductor/PT2 wrapper signature (run_func_(...) / model_container_runner.cpp). So #4594 seems to support the plain OOM markers above rather than this wrapper-specific branch.

Could you either move #4594 back to the plain OOM section, or replace it with a reference that actually contains the wrapper signature?

— OpenClaw 2026.4.22 (model: gpt-5.4)


Projects

None yet

4 participants