fix(AlgEngine): math-SDPA switch for H-series (Hopper) training crash

WCJ-BERT · claude · WCJ-BERT · commit 95002adf499c · 2026-05-28T16:13:45.000Z
On H-series (Hopper, sm_90: H100/H200/H800/H20/GH200) GPUs, the
attention-heavy DiffusionPlanningHead routes its nn.MultiheadAttention
through torch 2.0.1's fused flash / mem-efficient SDPA kernels, which
fault on the first training iteration with "CUDA error: an illegal
instruction was encountered". The fault is reported asynchronously by
the NCCL watchdog, so it masquerades as a distributed-comm/hardware
error even though the real culprit is the attention kernel. The crash
reproduces across machines that share the same conda env (torch 2.0.1,
bundled NCCL 2.14.3); 1-2 GPU runs and the vadv2 head are unaffected.
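
To check whether a given machine matches the conditions described
above, something like the following sketch may help. It reports the
device compute capability and which SDPA backends are currently
enabled. The helper name sdpa_report is hypothetical, and the torch
module is passed in as an argument only so the helper can be exercised
without a GPU; the calls it makes (torch.cuda.get_device_capability and
the *_sdp_enabled queries in torch.backends.cuda) are assumed to be
available as in torch 2.0.x.

```python
def sdpa_report(torch_module):
    """Summarize device capability and enabled SDPA backends.

    Hypothetical diagnostic helper, not part of this commit.
    `torch_module` is expected to be the imported `torch` package.
    """
    cuda = torch_module.cuda
    backends = torch_module.backends.cuda

    # get_device_capability() returns (major, minor); Hopper is (9, 0).
    capability = cuda.get_device_capability() if cuda.is_available() else None

    return {
        "torch": torch_module.__version__,
        "capability": capability,
        "is_sm90": capability == (9, 0),
        "flash": backends.flash_sdp_enabled(),
        "mem_efficient": backends.mem_efficient_sdp_enabled(),
        "math": backends.math_sdp_enabled(),
    }
```

On a real machine this would be called as sdpa_report(torch) after
"import torch". A report with is_sm90 set and the flash or
mem_efficient backends enabled matches the failing setup, although it
does not by itself prove that the attention kernel is the cause.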

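Add maybe_disable_efficient_sdp(), called at the start of main(). When
NAVFORMER_DISABLE_EFFICIENT_SDP=1 is set in the environment, it disables
the flash and mem-efficient SDPA backends and enables the math backend
as a workaround. It is a no-op by default, so other configs (vadv2 /
hydramdp) are untouched unless the variable is set.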
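
The gating behavior can be checked without a GPU using a sketch along
these lines. It is an illustrative variant of the helper, not the code
in this commit: force_math_sdp and make_recorder are hypothetical
names, and the backends object and environment are passed in as
arguments so that a recording stub can stand in for
torch.backends.cuda.

```python
import os
from types import SimpleNamespace


def force_math_sdp(cuda_backends, env=os.environ):
    """Injectable variant of maybe_disable_efficient_sdp() (hypothetical).

    Returns True if the backend toggles were applied, False if skipped.
    """
    # Opt-in only: anything other than "1" leaves the backends untouched.
    if env.get("NAVFORMER_DISABLE_EFFICIENT_SDP", "0") != "1":
        return False
    if cuda_backends is None:
        return False

    # Disable the fused kernels and make sure the math fallback is on.
    # Missing toggles are skipped, mirroring the hasattr() checks.
    toggles = (
        ("enable_flash_sdp", False),
        ("enable_mem_efficient_sdp", False),
        ("enable_math_sdp", True),
    )
    for name, enabled in toggles:
        toggle = getattr(cuda_backends, name, None)
        if toggle is not None:
            toggle(enabled)
    return True


def make_recorder():
    """Build a stub backends object that records each toggle call."""
    calls = {}

    def setter(key):
        def record(enabled):
            calls[key] = enabled
        return record

    fake = SimpleNamespace(
        enable_flash_sdp=setter("flash"),
        enable_mem_efficient_sdp=setter("mem_efficient"),
        enable_math_sdp=setter("math"),
    )
    return fake, calls
```

With the variable unset the stub records no calls; with it set to "1"
the stub records flash and mem_efficient as False and math as True,
which is the behavior described above.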

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/projects/AlgEngine/scripts/train.py b/projects/AlgEngine/scripts/train.py
@@ -24,6 +24,22 @@
 warnings.filterwarnings("ignore")
 
 
+def maybe_disable_efficient_sdp():
+    if os.environ.get("NAVFORMER_DISABLE_EFFICIENT_SDP", "0") != "1":
+        return
+
+    cuda_backends = getattr(torch.backends, "cuda", None)
+    if cuda_backends is None:
+        return
+
+    if hasattr(cuda_backends, "enable_flash_sdp"):
+        cuda_backends.enable_flash_sdp(False)
+    if hasattr(cuda_backends, "enable_mem_efficient_sdp"):
+        cuda_backends.enable_mem_efficient_sdp(False)
+    if hasattr(cuda_backends, "enable_math_sdp"):
+        cuda_backends.enable_math_sdp(True)
+    print("NAVFORMER_DISABLE_EFFICIENT_SDP=1: using math SDPA backend")
+
 
 def parse_args():
     parser = argparse.ArgumentParser(description='Train a detector')
@@ -95,6 +111,7 @@ def parse_args():
 
 
 def main():
+    maybe_disable_efficient_sdp()
     args = parse_args()
 
     cfg = Config.fromfile(args.config)