fix(compute): chunk bulkUploadF32 to stop wedging the GB10 driver (#106) by dndungu · Pull Request #107 · zerfoo/ztensor

dndungu · 2026-06-05T23:43:57Z

What

Fixes #106. bulkUploadF32 consolidated all eligible f32 tensors into one
device allocation + one H2D copy. At CrossAsset sample-upload scale (~213k
tensors → multi-GB) that single large cudaMalloc/cudaMemcpy wedges the
GB10 (sm_121) driver in an uninterruptible ioctl: the worker thread sits in
D-state, the container becomes unkillable, and podman exec/logs/rm all
hang (this also drove the recurring orchestrator pod-leak).

Change

Upload in bounded chunks: cap each device allocation + copy at
bulkUploadF32MaxChunkBytes (64 MiB) / bulkUploadF32MaxChunkTensors (4096),
appending each chunk buffer to bulkUploadBuffers. Preserves the
few-round-trips win over the per-tensor path; the resulting GPU storage views
are byte-identical to before.

Chunk-boundary math is extracted into a pure bulkUploadChunkRanges helper and
unit-tested on CPU (no GPU required): tiling (no gaps/overlaps), both caps, a
lone-oversized tensor getting its own range, and the 213k-count production case
splitting into bounded chunks.
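As a rough illustration of the chunk-boundary math described above, here is a minimal sketch of a dual-cap range splitter. The real signature of bulkUploadChunkRanges is not shown in this PR, so `chunkRange`, `chunkRanges`, and their parameters are hypothetical names, and treating both caps as inclusive is an assumption.

```go
package main

import "fmt"

// chunkRange is a hypothetical half-open index range [Start, End) into the
// list of tensors eligible for bulk upload.
type chunkRange struct{ Start, End int }

// chunkRanges is an illustrative stand-in for bulkUploadChunkRanges (the real
// helper's signature is an assumption). sizes[i] is the byte size of tensor i.
// The returned ranges tile [0, len(sizes)) with no gaps or overlaps.
func chunkRanges(sizes []int, maxBytes, maxTensors int) []chunkRange {
	var out []chunkRange
	start, bytes := 0, 0
	for i, sz := range sizes {
		n := i - start // tensors already in the current chunk
		// Close the current chunk if adding tensor i would exceed either cap.
		// An empty chunk is never closed, so a tensor larger than maxBytes
		// still gets a range of its own.
		if n > 0 && (bytes+sz > maxBytes || n >= maxTensors) {
			out = append(out, chunkRange{start, i})
			start, bytes = i, 0
		}
		bytes += sz
	}
	if start < len(sizes) {
		out = append(out, chunkRange{start, len(sizes)})
	}
	return out
}

func main() {
	const mib = 1 << 20
	sizes := make([]int, 256)
	for i := range sizes {
		sizes[i] = mib
	}
	// 256 tensors of 1 MiB under a 64 MiB / 4096-tensor cap.
	fmt.Println(len(chunkRanges(sizes, 64*mib, 4096))) // 4
	// A lone oversized tensor (100 bytes under a 16-byte cap) gets its own range.
	fmt.Println(chunkRanges([]int{4, 100, 4}, 16, 4096))
}
```

Keeping the splitter free of any CUDA calls is what lets the tiling, both caps, and the oversized-tensor case be checked on a machine without a GPU.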

Test

go test ./compute/ -run TestBulkUploadChunkRanges — green (CPU).
go build ./..., go vet ./compute/ — clean.
Existing TestGPUEngine_UploadWeights_BulkPath unchanged (small input still
collapses to one chunk → one buffer).
GB10 end-to-end verification (Wolf train-crossasset, full 213k-tensor upload)
is being run downstream; will post the result.

Notes

bulkUploadF32MaxChunkBytes is a var so a test (or caller) can force the
multi-chunk path with small inputs. The caps are conservative: a prior per-tensor
smoke test at ~16k tensors worked, and 64 MiB / 4096 leaves a wide margin under the
observed wedge threshold.
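To illustrate why a package-level var (rather than a const) helps here, the following sketch shows a test-style override that lowers the cap, exercises the multi-chunk path on a tiny input, and restores the old value. `maxChunkBytes`, `chunkCount`, and `withMaxChunkBytes` are hypothetical stand-ins, and `chunkCount` deliberately ignores tensor boundaries and the tensor-count cap.

```go
package main

import "fmt"

// maxChunkBytes stands in for bulkUploadF32MaxChunkBytes (assumed to be a
// package-level var holding a byte count).
var maxChunkBytes = 64 << 20

// chunkCount returns how many chunks totalBytes needs under the byte cap.
// Simplification: plain ceiling division, not the real helper's logic.
func chunkCount(totalBytes int) int {
	if totalBytes <= 0 {
		return 0
	}
	return (totalBytes + maxChunkBytes - 1) / maxChunkBytes
}

// withMaxChunkBytes temporarily overrides the cap, runs fn, then restores the
// previous value so later tests see the default again.
func withMaxChunkBytes(n int, fn func()) {
	old := maxChunkBytes
	maxChunkBytes = n
	defer func() { maxChunkBytes = old }()
	fn()
}

func main() {
	fmt.Println(chunkCount(1024)) // 1: fits in one chunk under the 64 MiB default
	withMaxChunkBytes(256, func() {
		fmt.Println(chunkCount(1024)) // 4: the same input now spans four chunks
	})
	fmt.Println(chunkCount(1024)) // 1: default restored
}
```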

bulkUploadF32 consolidated ALL eligible f32 tensors into ONE device allocation + ONE H2D copy. At CrossAsset sample-upload scale (~213k tensors -> multi-GB) that single large cudaMalloc/cudaMemcpy wedges the GB10 (sm_121) driver in an uninterruptible ioctl: the worker thread stays in D-state, the container becomes unkillable, and podman rm / exec / logs all hang (this also drove the recurring orchestrator pod-leak). Upload in bounded chunks instead: cap each device allocation + copy at bulkUploadF32MaxChunkBytes (64 MiB) / bulkUploadF32MaxChunkTensors (4096), appending each chunk buffer to bulkUploadBuffers. Preserves the few-round-trips win over the per-tensor path; GPU storage views are identical. Chunk-boundary math is extracted to bulkUploadChunkRanges and unit-tested on CPU (tiling, both caps, lone-oversized tensor, and the 213k-count bound). Refs #106.

dndungu · 2026-06-06T00:14:49Z

⚠️ Verification FAILED — do not merge as a fix yet.

Built Wolf train-crossasset against this branch (4eaae4b, no vendoring — the chunked code is in the binary) and ran the matched repro (full COIN bars, 213,304-tensor pre-upload on GB10). It wedged identically: after reaching the upload, podman exec + log streaming + ssh/logind all hang while the orchestrator control plane stays responsive — the same uninterruptible D-state CUDA-driver wedge from #106.

So capping each alloc/copy at 64 MiB / 4096 tensors did not prevent the wedge. The failure point is therefore not (only) the single large cudaMalloc/cudaMemcpy this PR chunks — it's somewhere not yet pinned (possibly the first CUDA op of the upload regardless of size, the per-tensor loop after bulkUploadF32, the ~213k SetStorage view creations, or the first kernel touching the data).

Holding this PR until the exact wedging cgo frame is captured (persisted /proc kernel-stack dump that doesn't depend on the wedged data-plane). The chunking is still a reasonable defensive bound, but it is not the fix on its own. Will update #106 with the pinned frame.

Record the chunking decision (dual cap: 64 MiB bytes + 4096 tensors) in ADR 003, retire the shipped capture-hang plan into devlog, and rewrite docs/plan.md around the sole open issue #106. Marks E0/E1 done against commit 4eaae4b. Refs #106.

Add TestGPUEngine_UploadWeights_MultiChunk: uploads 256 MiB (256x1MiB tensors) so the bounded-chunk path issues 4 real 64 MiB device allocs + copies, proving a 64 MiB chunk does not wedge the GB10 driver and that cross-chunk GPUStorage views round-trip. Skips without CUDA. Refs #106.

TestGPUEngine_UploadWeights_MultiChunk PASSED on DGX GB10 (Spark pod ztensor-issue106-multichunk-guard-3c04539, exit-0 guard). 256 MiB uploaded as 4 bounded 64 MiB chunks, no driver wedge, cross-chunk views round-trip. Marks E2 done; adds the validation manifest. Refs #106.

dndungu added 3 commits June 5, 2026 20:51

dndungu merged commit 0cecc28 into main Jun 6, 2026
1 check passed

dndungu deleted the fix/bulk-upload-chunking-106 branch June 6, 2026 05:16

dndungu mentioned this pull request Jun 6, 2026

bulkUploadF32 wedges GB10 driver (uninterruptible D-state) on large one-shot uploads (~213k tensors); needs chunking #106 (Open)

dndungu merged 4 commits into main from fix/bulk-upload-chunking-106
