fix(compute): chunk bulkUploadF32 to stop wedging the GB10 driver (#106) #107

Merged: dndungu merged 4 commits into main from fix/bulk-upload-chunking-106 on Jun 6, 2026

dndungu commented Jun 5, 2026

What

Fixes #106. bulkUploadF32 consolidated all eligible f32 tensors into one
device allocation + one H2D copy. At CrossAsset sample-upload scale (~213k
tensors → multi-GB) that single large cudaMalloc/cudaMemcpy wedges the
GB10 (sm_121) driver in an uninterruptible ioctl: the worker thread sits in
D-state, the container becomes unkillable, and podman exec/logs/rm all
hang (this also drove the recurring orchestrator pod-leak).

Change

Upload in bounded chunks: cap each device allocation + copy at
bulkUploadF32MaxChunkBytes (64 MiB) / bulkUploadF32MaxChunkTensors (4096),
appending each chunk buffer to bulkUploadBuffers. Preserves the
few-round-trips win over the per-tensor path; the resulting GPU storage views
are byte-identical to before.

Chunk-boundary math is extracted into a pure bulkUploadChunkRanges helper and
unit-tested on CPU (no GPU required): tiling (no gaps/overlaps), both caps, a
lone-oversized tensor getting its own range, and the 213k-count production case
splitting into bounded chunks.
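
To make the dual-cap tiling concrete, here is a minimal sketch of how such a pure helper could compute chunk boundaries. The names `chunkRange` and `chunkRanges` and the signature are assumptions for illustration only; the actual bulkUploadChunkRanges in the PR may differ.

```go
package main

// chunkRange is a half-open [Start, End) span of tensor indices that would
// share one device allocation and one host-to-device copy.
// Hypothetical type, not taken from the PR.
type chunkRange struct{ Start, End int }

// chunkRanges tiles the indices 0..len(sizes)-1 into consecutive ranges so
// that each range holds at most maxTensors tensors and at most maxBytes
// bytes. A single tensor larger than maxBytes still gets a range of its own.
// Hypothetical stand-in for bulkUploadChunkRanges.
func chunkRanges(sizes []int64, maxBytes int64, maxTensors int) []chunkRange {
	var out []chunkRange
	start := 0
	var bytes int64
	for i, sz := range sizes {
		count := i - start
		// Close the current range when adding tensor i would break either
		// cap. Requiring count > 0 keeps a lone oversized tensor in its own
		// range instead of emitting an empty one.
		if count > 0 && (count >= maxTensors || bytes+sz > maxBytes) {
			out = append(out, chunkRange{start, i})
			start, bytes = i, 0
		}
		bytes += sz
	}
	// Flush the final, possibly partial, range.
	if start < len(sizes) {
		out = append(out, chunkRange{start, len(sizes)})
	}
	return out
}
```

Because the ranges are emitted consecutively from a single running `start`, they tile the input with no gaps or overlaps by construction, which is the property the CPU-only unit test checks.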

Test

  • go test ./compute/ -run TestBulkUploadChunkRanges — green (CPU).
  • go build ./..., go vet ./compute/ — clean.
  • Existing TestGPUEngine_UploadWeights_BulkPath unchanged (small input still
    collapses to one chunk → one buffer).
  • GB10 end-to-end verification (Wolf train-crossasset, full 213k-tensor upload)
    is being run downstream; will post the result.

Notes

bulkUploadF32MaxChunkBytes is a var so a test (or caller) can force the
multi-chunk path with small inputs. The caps are conservative: a prior per-tensor
smoke test at ~16k tensors worked, and 64 MiB / 4096 leaves wide margin under the
observed wedge threshold.
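
As an illustration of why a var is convenient here, the sketch below shows one way a test could shrink the cap temporarily and restore it afterwards. `maxChunkBytes`, `chunksNeeded`, and `withMaxChunkBytes` are hypothetical names, and `chunksNeeded` is a deliberate simplification that ignores tensor boundaries and the tensor-count cap.

```go
package main

// maxChunkBytes is a hypothetical stand-in for bulkUploadF32MaxChunkBytes.
// Declaring it as a var rather than a const is what lets a test shrink it.
var maxChunkBytes int64 = 64 << 20

// chunksNeeded reports how many byte-capped chunks totalBytes would occupy.
// Simplified on purpose: it ignores tensor boundaries and the count cap.
func chunksNeeded(totalBytes int64) int64 {
	return (totalBytes + maxChunkBytes - 1) / maxChunkBytes
}

// withMaxChunkBytes runs fn with the cap temporarily set to n and restores
// the previous value afterwards, even if fn panics.
func withMaxChunkBytes(n int64, fn func()) {
	old := maxChunkBytes
	maxChunkBytes = n
	defer func() { maxChunkBytes = old }()
	fn()
}
```

With the default cap a 1 KiB input fits in one chunk; inside `withMaxChunkBytes(256, ...)` the same input needs four, so the multi-chunk path can be exercised without allocating hundreds of MiB.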


dndungu commented Jun 6, 2026

⚠️ Verification FAILED — do not merge as a fix yet.

Built Wolf train-crossasset against this branch (4eaae4b, no vendoring — the chunked code is in the binary) and ran the matched repro (full COIN bars, 213,304-tensor pre-upload on GB10). It wedged identically: after reaching the upload, podman exec + log streaming + ssh/logind all hang while the orchestrator control plane stays responsive — the same uninterruptible D-state CUDA-driver wedge from #106.

So capping each alloc/copy at 64 MiB / 4096 tensors did not prevent the wedge. The failure point is therefore not (only) the single large cudaMalloc/cudaMemcpy that this PR chunks; it has not yet been pinned down. Candidates are the first CUDA op of the upload regardless of size, the per-tensor loop after bulkUploadF32, the ~213k SetStorage view creations, or the first kernel touching the data.

Holding this PR until the exact wedging cgo frame is captured (persisted /proc kernel-stack dump that doesn't depend on the wedged data-plane). The chunking is still a reasonable defensive bound, but it is not the fix on its own. Will update #106 with the pinned frame.
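
A persisted kernel-stack dump of this kind could be captured by a small watcher that runs outside the wedged container. The following is only a sketch under stated assumptions: it relies on the standard Linux procfs files `/proc/<pid>/task/<tid>/stat` and `/proc/<pid>/task/<tid>/stack` (the latter normally readable only by root and only on kernels built with stack-trace support), and the function names and output format are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// statState extracts the one-letter scheduler state from a /proc stat line.
// The comm field is wrapped in parentheses and may itself contain spaces or
// ')', so the state is read just after the last ')'.
func statState(line string) (byte, bool) {
	i := strings.LastIndexByte(line, ')')
	if i < 0 || i+2 >= len(line) || line[i+1] != ' ' {
		return 0, false
	}
	return line[i+2], true
}

// dumpDStateStacks appends the kernel stack of every thread of pid that is
// in uninterruptible sleep (state D) to outPath, then fsyncs the file so the
// dump survives even if the host has to be power-cycled.
func dumpDStateStacks(pid int, outPath string) error {
	tasks, err := filepath.Glob(fmt.Sprintf("/proc/%d/task/*", pid))
	if err != nil {
		return err
	}
	out, err := os.OpenFile(outPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
	if err != nil {
		return err
	}
	defer out.Close()
	for _, dir := range tasks {
		stat, err := os.ReadFile(filepath.Join(dir, "stat"))
		if err != nil {
			continue // thread exited between the glob and the read
		}
		if st, ok := statState(string(stat)); !ok || st != 'D' {
			continue
		}
		stack, err := os.ReadFile(filepath.Join(dir, "stack"))
		if err != nil {
			stack = []byte("stack unreadable: " + err.Error() + "\n")
		}
		fmt.Fprintf(out, "tid %s\n%s\n", filepath.Base(dir), stack)
	}
	return out.Sync()
}
```

Writing to a host path and calling `Sync` is the point of the design: the dump does not depend on podman exec or log streaming, which are exactly the paths that hang in the repro.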

dndungu added 3 commits June 5, 2026 20:51
Record the chunking decision (dual cap: 64 MiB bytes + 4096 tensors) in
ADR 003, retire the shipped capture-hang plan into devlog, and rewrite
docs/plan.md around the sole open issue #106. Marks E0/E1 done against
commit 4eaae4b.

Refs #106.

Add TestGPUEngine_UploadWeights_MultiChunk: uploads 256 MiB (256x1MiB
tensors) so the bounded-chunk path issues 4 real 64 MiB device allocs +
copies, proving a 64 MiB chunk does not wedge the GB10 driver and that
cross-chunk GPUStorage views round-trip. Skips without CUDA.

Refs #106.

TestGPUEngine_UploadWeights_MultiChunk PASSED on DGX GB10 (Spark pod
ztensor-issue106-multichunk-guard-3c04539, exit-0 guard). 256 MiB
uploaded as 4 bounded 64 MiB chunks, no driver wedge, cross-chunk views
round-trip. Marks E2 done; adds the validation manifest.

Refs #106.
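
For readers wondering what "cross-chunk views" means in practice, the sketch below maps each tensor to a chunk buffer index and a byte offset within that buffer. It assumes tensors are packed back to back with no alignment padding, which the PR does not state; `span`, `view`, and `viewsFor` are hypothetical names, not identifiers from the codebase.

```go
package main

// span is a half-open [Start, End) range of tensor indices backed by one
// device buffer. Hypothetical type for illustration.
type span struct{ Start, End int }

// view locates one tensor inside a chunked upload: which chunk buffer holds
// it and at what byte offset within that buffer.
type view struct {
	Buffer int
	Offset int64
}

// viewsFor assigns every tensor a (buffer, offset) pair by packing tensors
// back to back inside their chunk. spans must tile 0..len(sizes) in order.
// Assumption: no alignment padding between tensors.
func viewsFor(sizes []int64, spans []span) []view {
	views := make([]view, len(sizes))
	for b, s := range spans {
		var off int64 // offsets restart at zero in every chunk buffer
		for i := s.Start; i < s.End; i++ {
			views[i] = view{Buffer: b, Offset: off}
			off += sizes[i]
		}
	}
	return views
}
```

For the 256 x 1 MiB case split into four 64-tensor chunks, tensor 63 is the last tensor in buffer 0 and tensor 64 starts buffer 1 at offset 0, which is the boundary a round-trip test needs to cover.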
@dndungu dndungu merged commit 0cecc28 into main Jun 6, 2026
1 check passed
@dndungu dndungu deleted the fix/bulk-upload-chunking-106 branch June 6, 2026 05:16

Development

Successfully merging this pull request may close these issues.

bulkUploadF32 wedges GB10 driver (uninterruptible D-state) on large one-shot uploads (~213k tensors); needs chunking
