Commit 95002ad
fix(AlgEngine): math-SDPA switch for H-series (Hopper) training crash
On H-series (Hopper, sm_90: H100/H200/H800/H20/GH200) GPUs, the
attention-heavy DiffusionPlanningHead routes its nn.MultiheadAttention
through torch 2.0.1's fused flash / mem-efficient SDPA kernels, which
fault on the first training iteration with "CUDA error: an illegal
instruction was encountered". The fault is reported asynchronously by
the NCCL watchdog, so it masquerades as a distributed-comm/hardware
error even though the real culprit is the attention kernel. Reproduces
across machines (same conda env: torch 2.0.1, bundled NCCL 2.14.3);
1-2 GPU runs and the vadv2 head are unaffected.
Add maybe_disable_efficient_sdp(), gated by env NAVFORMER_DISABLE_EFFICIENT_SDP=1,
to force the math SDPA backend as a workaround. No-op by default, so
other configs (vadv2 / hydramdp) are untouched unless the var is set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f254779 commit 95002ad
1 file changed
Lines changed: 17 additions & 0 deletions
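The added lines themselves are not reproduced on this page. As a rough sketch of what the commit message describes, and not the committed code, a helper of this shape would force the math SDPA backend when the variable is set. The `environ` and `backend` parameters are hypothetical additions so the sketch can be exercised without a GPU. The three `torch.backends.cuda.enable_*_sdp` toggles are the ones available in torch 2.0.x.

```python
import os

ENV_VAR = "NAVFORMER_DISABLE_EFFICIENT_SDP"  # name taken from the commit message


def maybe_disable_efficient_sdp(environ=None, backend=None):
    """Force the math SDPA backend when ENV_VAR is "1"; otherwise do nothing.

    `environ` and `backend` are hypothetical injection points for testing;
    by default they resolve to os.environ and torch.backends.cuda.
    Returns True if the backends were switched, False if this was a no-op.
    """
    environ = os.environ if environ is None else environ
    if environ.get(ENV_VAR) != "1":
        return False  # default path: leave every SDPA backend untouched

    if backend is None:
        import torch  # deferred so the no-op path never needs torch

        backend = torch.backends.cuda

    # Turn off the fused kernels and keep only the math implementation.
    backend.enable_flash_sdp(False)
    backend.enable_mem_efficient_sdp(False)
    backend.enable_math_sdp(True)
    return True
```

Under these assumptions the helper would be called once, early in training setup and before the first forward pass, with the variable exported in the launch environment (`NAVFORMER_DISABLE_EFFICIENT_SDP=1`). The math backend is typically slower and uses more memory than the fused kernels, which is a reason to keep the switch opt-in.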
[Diff body not preserved: 16 lines added at new-file lines 27-42 and 1 line added at new-file line 114.]