
Add CUDA process checkpointing helpers#1983

Open
kkraus14 wants to merge 2 commits into NVIDIA:main from kkraus14:kk/issue-1343-cuda-checkpointing

Conversation

@kkraus14
Collaborator

@kkraus14 kkraus14 commented Apr 28, 2026

Summary

  • add a dedicated cuda.core.checkpoint module for CUDA process checkpointing APIs
  • expose a narrow runtime API via checkpoint.Process; Process.state now returns typed string states ("running", "locked", "checkpointed", or "failed") instead of a public enum
  • model checkpoint operations with checkpoint.Process(pid): state, restore_thread_id, lock, checkpoint, restore, and unlock
  • support restore-time GPU UUID remapping by accepting a mapping in Process.restore(gpu_mapping=...) and converting it to the driver CUcheckpointGpuPair / CUcheckpointRestoreArgs structures internally
  • keep checkpointing separate from cuda.core.system, which remains focused on CUDA system and NVML capabilities
  • validate checkpoint API availability lazily and cache the successful check, covering the cuda-bindings version, required binding symbols, and CUDA driver version
  • document the checkpoint lifecycle, Linux support scope, restore-thread requirement, restore/unlock state transition, and 1.0.0 release-note coverage
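
The lifecycle above can be modeled as a small state machine. The following is a pure-Python toy mirroring the documented transitions ("running" -> "locked" -> "checkpointed" -> "locked" -> "running"), not the real driver-backed Process class:

```python
# Toy model of the documented state transitions; ToyProcess is a
# hypothetical stand-in, not the real checkpoint.Process.
class ToyProcess:
    _TRANSITIONS = {
        "lock": ("running", "locked"),
        "checkpoint": ("locked", "checkpointed"),
        "restore": ("checkpointed", "locked"),
        "unlock": ("locked", "running"),
    }

    def __init__(self, pid):
        self.pid = pid
        self.state = "running"

    def _apply(self, op):
        # Reject any operation attempted from the wrong state.
        src, dst = self._TRANSITIONS[op]
        if self.state != src:
            raise RuntimeError(f"{op} requires state {src!r}, not {self.state!r}")
        self.state = dst

    def lock(self):
        self._apply("lock")

    def checkpoint(self):
        self._apply("checkpoint")

    def restore(self):
        self._apply("restore")

    def unlock(self):
        self._apply("unlock")


p = ToyProcess(pid=12345)
p.lock()
p.checkpoint()
print(p.state)  # checkpointed
p.restore()
p.unlock()
print(p.state)  # running
```

Note in particular that restore returns the process to "locked", not "running"; a separate unlock is required, matching the restore/unlock transition documented above.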

Closes #1343

Testing

  • pixi run ruff check cuda_core/cuda/core/checkpoint.py cuda_core/tests/test_checkpoint.py
  • pixi run ruff format cuda_core/cuda/core/checkpoint.py cuda_core/tests/test_checkpoint.py
  • pixi run --manifest-path cuda_core pytest cuda_core/tests/test_checkpoint.py (25 passed)
  • SPHINX_CUDA_CORE_VER=0.7.1.dev63 BUILD_LATEST=1 pixi run --manifest-path cuda_core -e docs sphinx-build -b html -W --keep-going -j 4 cuda_core/docs/source /tmp/cuda_core_docs_checkpoint_verify_2
  • pixi run --manifest-path cuda_core test (2817 passed, 346 skipped, 2 failed in local NVML/system tests; checkpoint tests passed)
  • git diff --check

The current checkpoint tests are implemented as focused unit tests in cuda_core/tests/test_checkpoint.py. They use a small mock CUDA driver surface and monkeypatch checkpoint._get_driver() so the behavioral tests do not require a live checkpoint-capable driver or process. The mock driver records each driver call and provides minimal stand-ins for CUcheckpointLockArgs, CUcheckpointRestoreArgs, CUcheckpointGpuPair, CUresult, and process states.
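
The recording-mock pattern might look roughly like this (FakeDriver is an illustrative stand-in; the real tests monkeypatch checkpoint._get_driver() and mimic the cuda-bindings surface):

```python
# Sketch of a recording fake driver: every call is appended to a log so a
# test can assert on the exact sequence of driver interactions. The method
# names follow the CUDA checkpoint driver APIs, but this object is a
# hypothetical stand-in, not the PR's actual mock.
from types import SimpleNamespace

class FakeDriver:
    def __init__(self):
        self.calls = []
        self.CUresult = SimpleNamespace(CUDA_SUCCESS=0)

    def cuCheckpointProcessLock(self, pid, args):
        self.calls.append(("lock", pid))
        return (self.CUresult.CUDA_SUCCESS,)

    def cuCheckpointProcessCheckpoint(self, pid, args):
        self.calls.append(("checkpoint", pid))
        return (self.CUresult.CUDA_SUCCESS,)


drv = FakeDriver()
drv.cuCheckpointProcessLock(4242, None)
drv.cuCheckpointProcessCheckpoint(4242, None)
print(drv.calls)  # [('lock', 4242), ('checkpoint', 4242)]
```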

The tests cover:

  • public symbol exposure and string process state mapping
  • restore thread queries
  • lock timeout argument construction
  • checkpoint/unlock null-argument behavior
  • restore GPU UUID mapping conversion and empty restore mappings
  • input validation for pid, timeout_ms, and gpu_mapping
  • unsupported-driver error translation and missing runtime checkpoint symbol translation
  • cached availability checks, unsupported cuda-bindings versions, missing binding symbols, and unsupported driver versions

The two local full-suite failures are the existing NVML/system environment-sensitive failures we are ignoring for this PR:

  • tests/system/test_system_device.py::test_get_inforom_version returns an empty InfoROM board part number locally.
  • tests/system/test_system_system.py::test_get_process_name hits an NVML UTF-8 decode error locally.

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.core Everything related to the cuda.core module label Apr 28, 2026
@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 396a2ca to 7c66b2f Compare April 28, 2026 16:28
@kkraus14
Collaborator Author

/ok to test

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

/ok to test

@kkraus14, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@kkraus14
Collaborator Author

/ok to test 7c66b2f

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch 2 times, most recently from 779c697 to 82f816c Compare April 28, 2026 16:44
@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@kkraus14
Collaborator Author

/ok to test


Comment thread cuda_core/cuda/core/system/__init__.py Outdated
Comment thread cuda_core/cuda/core/checkpoint.py Outdated
Comment thread cuda_core/cuda/core/checkpoint.py Outdated
Comment thread cuda_core/cuda/core/checkpoint.py Outdated
Comment thread cuda_core/cuda/core/checkpoint.py Outdated
@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 82f816c to 25455d8 Compare April 28, 2026 18:22
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 25455d8 to aaf1418 Compare April 28, 2026 19:14
@kkraus14
Collaborator Author

/ok to test

Comment thread cuda_core/cuda/core/checkpoint.py Outdated
@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from aaf1418 to d8a2031 Compare April 28, 2026 20:24
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 marked this pull request as ready for review April 29, 2026 13:59
@kkraus14 kkraus14 added the feature New feature or request label Apr 29, 2026
@kkraus14 kkraus14 added this to the cuda.core v1.0.0 milestone Apr 29, 2026
@kkraus14 kkraus14 self-assigned this Apr 29, 2026
@rparolin rparolin requested review from leofang and rparolin April 29, 2026 17:44
Comment thread cuda_core/tests/test_checkpoint.py Outdated
Comment thread cuda_core/cuda/core/checkpoint.py
Comment thread cuda_core/cuda/core/checkpoint.py
Comment thread cuda_core/cuda/core/checkpoint.py
Comment thread cuda_core/cuda/core/checkpoint.py Outdated
Comment thread cuda_core/cuda/core/checkpoint.py Outdated
Comment thread cuda_core/cuda/core/checkpoint.py
@leofang leofang requested a review from Andy-Jost April 29, 2026 18:05
@kkraus14
Collaborator Author

/ok to test

Comment thread cuda_core/cuda/core/checkpoint.py
Comment thread cuda_core/cuda/core/checkpoint.py
@leofang leofang self-requested a review April 30, 2026 18:58
@leofang leofang added the P1 Medium priority - Should do label May 1, 2026
from cuda import cuda as _driver


ProcessStateT = _Literal["running", "locked", "checkpointed", "failed"]
Member

Expose this in cuda.core.typing and then add it to api_private.rst so it gets rendered by Sphinx.

Comment on lines +22 to +25
0: "running",
1: "locked",
2: "checkpointed",
3: "failed",
Member

nit: use the actual enumerators as keys, instead of plain Python ints


from cuda.core import checkpoint

process = checkpoint.Process(pid)
Member

Q: Should we teach users how to get pid in Python?

Collaborator Author

This allows checkpointing a process other than the current one, so I'm not sure using os.getpid() is appropriate here?

Contributor

@Andy-Jost Andy-Jost May 1, 2026

It might be worth pointing out in api.rst that this is typically used to checkpoint a different process.

Member

This would allow check pointing the non current process

this is typically used to checkpoint a different process.

I think os.getpid() allows for checkpointing self, which is useful as demo'd in the linked code.

I do not believe all PIDs are allowed. I assume only processes owned by the current user can be checkpointed (either limited by the Linux kernel or the CUDA driver).

In any case, the example snippet in api.rst isn't very clear with the current 4 lines of code (lock -> checkpoint -> restore -> unlock). It is not the full story; a lot needs to happen behind the scenes. That was the main reason I started digging into all of this myself without relying on AI.

Contributor

FWIW, my understanding is that checkpointing is something you'd typically do to a process, analogous to sending a signal or attaching a debugger. Linux handles the permissions, and checkpointing requires CAP_SYS_PTRACE, the same permissions needed to attach a debugger or run strace against another user's process. One might expect a system admin to run it with sudo privileges.

The main purpose of CUDA checkpoint is to ensure everything managed by CUDA resides in CPU user space so that a tool such as CRIU can capture a complete process image. Without this, CRIU would miss the GPU state.

Use cases:

  1. Migrate a GPU workload to a different system.
  2. Periodically checkpoint a long-running job so it can be quickly resumed after a potential system failure.
  3. Preempt GPU resources to favor a job with higher priority.

These fit naturally into a system-admin role. It looks like CUDA allows a process to checkpoint itself, but it seems to me the use cases would be niche.
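
Following the discussion above, one way to obtain a pid for a separate target process in Python is to spawn it. The checkpoint calls below are commented out because they need a checkpoint-capable Linux driver (and typically CAP_SYS_PTRACE), and are a hypothetical sketch:

```python
# Spawn a separate process and use its pid as the checkpoint target.
import subprocess
import sys

child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
print(child.pid > 0)  # True

# Hypothetical sketch, assuming a checkpoint-capable driver:
# from cuda.core import checkpoint
# proc = checkpoint.Process(child.pid)
# proc.lock()
# proc.checkpoint()

child.kill()
child.wait()
```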

Comment on lines +207 to +219
pairs = []
for old_uuid, new_uuid in gpu_mapping.items():
    pair = driver.CUcheckpointGpuPair()
    pair.oldUuid = old_uuid
    pair.newUuid = new_uuid
    pairs.append(pair)

if not pairs:
    return None

args = driver.CUcheckpointRestoreArgs()
args.gpuPairs = pairs
args.gpuPairsCount = len(pairs)
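
A runnable sketch of this conversion, with SimpleNamespace stand-ins for the driver's CUcheckpointGpuPair / CUcheckpointRestoreArgs structs (the real code uses the cuda-bindings classes):

```python
# Mapping-to-pairs conversion, isolated as a function and exercised with
# fake struct factories; UUID strings are placeholders.
from types import SimpleNamespace

def build_restore_args(gpu_mapping, driver):
    pairs = []
    for old_uuid, new_uuid in gpu_mapping.items():
        pair = driver.CUcheckpointGpuPair()
        pair.oldUuid = old_uuid
        pair.newUuid = new_uuid
        pairs.append(pair)
    if not pairs:
        # An empty mapping means no restore args are needed at all.
        return None
    args = driver.CUcheckpointRestoreArgs()
    args.gpuPairs = pairs
    args.gpuPairsCount = len(pairs)
    return args

fake_driver = SimpleNamespace(
    CUcheckpointGpuPair=lambda: SimpleNamespace(),
    CUcheckpointRestoreArgs=lambda: SimpleNamespace(),
)
args = build_restore_args({"GPU-aaa": "GPU-bbb"}, fake_driver)
print(args.gpuPairsCount)  # 1
print(build_restore_args({}, fake_driver))  # None
```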
Member

Sorry, but I find this test suite very problematic. Why are we mocking the entire tests? This only works while we implement the checkpoint module in pure Python; once we lower to Cython/C++, it won't work. Plus, across cuda-core we never, ever mock the tests. We always require GPU machines to test cuda-core functionalities. This seems like agentic laziness to avoid writing/running GPU tests from within a sandbox!

Collaborator Author

I had my agent mock the test suite because CUDA checkpointing requires more than just interacting with the CUDA driver and other libraries: it requires CRIU, which needs a whole set of kernel capabilities, alongside a harness for process management. It felt like real end-to-end tests would be fragile for that reason, so I opted for mocking the driver given the API surface is quite small.

Happy to take a shot at actual tests that checkpoint and restore a process instead of mocking things, if you think that would be fruitful.

Member

If we write the tests using Andy’s min-2-GPU decorator fixture, we can test the GPU migration capability without CRIU. The idea is that we shuffle each GPU’s state to the next one (and wrap around).
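
The shuffle idea could build the restore mapping like this (UUID strings are placeholders; a real test would query them from the devices provided by the fixture):

```python
# Rotate each GPU's state to the next device UUID, wrapping around, so a
# min-2-GPU test exercises cross-device restore without CRIU.
def rotate_mapping(uuids):
    """Map each checkpointed GPU UUID to the next one in the list."""
    return {u: uuids[(i + 1) % len(uuids)] for i, u in enumerate(uuids)}

uuids = ["GPU-aaa", "GPU-bbb", "GPU-ccc"]
print(rotate_mapping(uuids))
# {'GPU-aaa': 'GPU-bbb', 'GPU-bbb': 'GPU-ccc', 'GPU-ccc': 'GPU-aaa'}
```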

Comment on lines +106 to +109
gpu_mapping : mapping, optional
    GPU UUID remapping from each checkpointed GPU UUID to the GPU UUID
    to restore onto. If provided, the mapping must contain every
    checkpointed GPU UUID.
Member

I think this is why a real test, rather than a mocked one, is a MUST. Apparently, the API doc and the example code diverge here: the latter requires that "all devices visible to CUDA" appear in the mapping, not just those participating in checkpointing (as the former indicates). We should get clarification on this (even better, find a way to test it).


Labels

  • cuda.core Everything related to the cuda.core module
  • feature New feature or request
  • P1 Medium priority - Should do

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Support CUDA Checkpointing

4 participants