
Add persistent program cache for Program.compile #1912

Open
cpcloud wants to merge 2 commits into NVIDIA:main from cpcloud:persistent-program-cache-178

Conversation

Contributor

@cpcloud cpcloud commented Apr 14, 2026

Summary

Adds a persistent on-disk cache for cuda.core.Program.compile outputs. The high-level integration is a single keyword argument on Program.compile:

from cuda.core import Program, ProgramOptions
from cuda.core.utils import FileStreamProgramCache

source = 'extern "C" __global__ void k(int *a){ *a = 1; }'
options = ProgramOptions(arch="sm_80")

with FileStreamProgramCache() as cache:  # default: $XDG_CACHE_HOME/cuda-python/program-cache
    obj = Program(source, "c++", options=options).compile("cubin", cache=cache)
    obj.get_kernel("k")

A second invocation with the same inputs short-circuits the entire NVRTC compile: a cache.get(key) (one stat + one read) plus an ObjectCode._init from the bytes. The underlying compile call is never reached. This is the fast path the cache exists to provide:

# Fresh process / second run -- same source, same options.
with FileStreamProgramCache() as cache:
    obj = Program(source, "c++", options=options).compile("cubin", cache=cache)
    # ~10us round-trip on a warm page cache, vs hundreds of ms to seconds
    # for an actual NVRTC invocation.

Public API

  • Program.compile(target_type, *, cache=...) — convenience wrapper. Derives the key, returns a fresh ObjectCode on hit, stores the compile output on miss.
  • cuda.core.utils.ProgramCacheResource — abstract bytes-in / bytes-out interface for custom backends. Provides get, update (Mapping or pairs), clear, and the mapping mutators (__getitem__/__setitem__/__delitem__/__len__). __contains__ is intentionally omitted: cache.get(key) is the recommended idiom because the two-call if key in cache: cache[key] pattern is racy across processes.
  • cuda.core.utils.InMemoryProgramCache — single-process LRU on OrderedDict, threading.RLock, size-only cap. For "compile once, look up many" workflows that don't need persistence.
  • cuda.core.utils.FileStreamProgramCache — directory of atomic per-entry files. Safe across processes via os.replace + Windows sharing-violation retries on os.replace / read / unlink.
  • cuda.core.utils.make_program_cache_key — escape hatch when the compile inputs require an extra_digest (include_path, pre_include, pch, use_pch, pch_dir, NVVM use_libdevice=True, NVRTC options.name with a directory component). Program.compile(cache=...) rejects those compiles with a ValueError pointing here; a usage sketch follows this list.
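
As a usage sketch of that escape hatch: the caller fingerprints the external content itself and passes it as extra_digest, then talks to the cache directly. The exact parameter names and order of make_program_cache_key are assumptions here, and the header directory is illustrative:

import hashlib
from pathlib import Path

from cuda.core import Program, ProgramOptions
from cuda.core.utils import FileStreamProgramCache, make_program_cache_key

source = '#include "common.h"\nextern "C" __global__ void k(int *a){ *a = 1; }'
include_dir = Path("/opt/kernels/include")  # illustrative
options = ProgramOptions(arch="sm_80", include_path=str(include_dir))

# Fingerprint what the key builder cannot see: the headers reachable via include_path.
digest = hashlib.blake2b()
for header in sorted(include_dir.rglob("*.h")):
    digest.update(header.read_bytes())

# Assumed call shape; the real signature may name or order these differently.
key = make_program_cache_key(source, "c++", "cubin", options=options, extra_digest=digest.digest())

with FileStreamProgramCache() as cache:
    data = cache.get(key)  # bytes on hit, None on miss
    if data is None:
        obj = Program(source, "c++", options=options).compile("cubin")
        cache[key] = obj  # the cache stores the raw compiled bytes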

On-disk format

Each entry is the raw compiled binary verbatim — cubin / PTX / LTO-IR — with no pickle, JSON, length prefix, or framing of any kind. Cache files are directly consumable by external NVIDIA tools (cuobjdump, nvdisasm, cuda-gdb).

ObjectCode.symbol_mapping from name_expressions is not preserved across a cache round-trip; the wrapper rejects Program.compile(name_expressions=..., cache=...) outright so the first-call-works/second-call-breaks footgun can't surface. Callers that need get_kernel(name_expression) should compile without cache=.

FileStreamProgramCache

  • Atomic writes: stage to tmp/, fsync, os.replace into entries/<2char>/<hash>. Concurrent readers never observe partial writes. Windows os.replace retries on ERROR_ACCESS_DENIED / ERROR_SHARING_VIOLATION / ERROR_LOCK_VIOLATION (winerrors 5/32/33) within a bounded backoff (~185 ms); after the budget, the write is dropped and the next call recompiles. The same retry covers reads and path.unlink so eviction doesn't crash the writer that triggered it on win-64. The stage-and-replace idiom is sketched after this list.
  • Sharing-violation predicate: _is_windows_sharing_violation(exc) filters EACCES only when winerror is absent — non-sharing winerrors are real config errors and propagate. Off-Windows PermissionError always propagates.
  • Transparent input forms: cache[key] = value (and cache.update({key: value, ...})) accept raw bytes, bytearray, memoryview, or any ObjectCode (path-backed too — the file is read at write time so the cached entry is the binary content, not a path that could move). Reads return the same bytes that went in.
  • Size-only bound: max_size_bytes is the only knob — no element-count cap. None means unbounded.
  • True LRU via atime: every successful read calls os.utime (fd-based on Linux/macOS via os.supports_fd, path-based on Windows) to bump st_atime regardless of mount options or NtfsDisableLastAccessUpdate. Eviction sorts by oldest st_atime first. The atime touch is stat-guarded so a racing rewriter's freshly-replaced file never has its mtime rolled back.
  • Stat-guarded prunes: clear(), _enforce_size_cap(), and the atime touch all snapshot (ino, size, mtime_ns) per entry and refuse to unlink / overwrite stamps if a writer replaced the file mid-operation.
  • Cache key derivation (make_program_cache_key): a backend-strategy pattern with one class per code_type (_NvrtcBackend / _LinkerBackend / _NvvmBackend). Each owns its own validate / encode_code / option_fingerprint / encode_name_expressions / hash_version_probe / hash_extra_payload. The orchestrator validates code_type/target_type, dispatches to the right backend, and assembles the digest in fixed order. Adding a new backend is one new class, not a five-place edit.
  • NVRTC options.name with a directory component: rejected without extra_digest because NVRTC resolves quoted #include directives relative to that directory — neighbour-header changes wouldn't invalidate the cache otherwise.
  • PTX-loadability warning on cache hit: when the active driver can't load freshly-generated PTX, the wrapper emits the same RuntimeWarning the uncached path emits — loadability depends on the driver, not on whether the bytes were freshly compiled.
  • Default cache directory: when path is omitted, resolves via platformdirs.user_cache_path("cuda-python", appauthor=False, opinion=False) / "program-cache":
    • Linux/BSD: $XDG_CACHE_HOME/cuda-python/program-cache (default ~/.cache/cuda-python/program-cache)
    • macOS: ~/Library/Caches/cuda-python/program-cache
    • Windows: %LOCALAPPDATA%\cuda-python\program-cache
  • tmp/ self-heal: if something deletes tmp/ after the cache is opened, the next write recreates it rather than crashing with FileNotFoundError.
  • Crashed-writer cleanup: stale temp files older than 1 hour are swept on open and on size-cap enforcement.
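
For reference, a minimal sketch of the stage-and-replace idiom the atomic-writes bullet refers to, using only the standard library; the layout, helper name, and error handling are simplified, and the Windows retry logic is omitted:

import os
import tempfile
from pathlib import Path

def atomic_write(entry_path: Path, tmp_dir: Path, data: bytes) -> None:
    tmp_dir.mkdir(parents=True, exist_ok=True)  # self-heal if tmp/ was wiped externally
    fd, tmp_name = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # data is durable before it is published
        entry_path.parent.mkdir(parents=True, exist_ok=True)
        os.replace(tmp_name, entry_path)  # atomic publish: readers see old or new, never a torn file
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)  # leave no stale temp behind on failure
        raise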

Test plan

  • tests/test_program_cache.py —
    • abstract-class contract; update accepts mapping or pairs
    • transparent input-form equivalence (bytes / bytearray / memoryview / bytes-backed ObjectCode / path-backed ObjectCode all round-trip to the same on-disk bytes)
    • make_program_cache_key semantics (deterministic, supported-target matrix mirrors Program.compile, backend probe failures fail closed but stable, env-version changes don't perturb the key on the wrong backends, options-fingerprint canonicalization for the linker path, side-effect / external-content / NVRTC options.name-dir-component guards, schema version mixing)
    • filestream CRUD, atomic-write race coverage, stat-guarded prune / atime-touch / clear / size-cap, atime LRU promotes recently-read, default-dir uses platformdirs
    • _is_windows_sharing_violation predicate's truth table including the regression case (non-sharing winerror plus EACCES propagates); tmp/ recreation after external wipe
  • tests/test_program_cache_multiprocess.py — concurrent writers same key, distinct keys, reader-vs-writer torn-file safety, size-cap eviction race (rewriter vs. churner) under stat-guarded eviction.
  • tests/test_program_compile_cache.py — Program.compile(cache=...) miss/hit/error paths against a recording stub, name_expressions rejection, extra_digest-required / side-effect / NVRTC options.name-dir-component rejection, PTX loadability warning on cache hit (positive + negative), real-NVRTC end-to-end roundtrip across reopen.

@cpcloud cpcloud added this to the cuda.core v1.0.0 milestone Apr 14, 2026
@cpcloud cpcloud added the P0 (High priority - Must do!), feature (New feature or request), and cuda.core (Everything related to the cuda.core module) labels Apr 14, 2026
@cpcloud cpcloud self-assigned this Apr 14, 2026
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from de57bd8 to ac38a68 on April 14, 2026 22:15

@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 23 times, most recently from f1ae40e to b27ed2c on April 19, 2026 13:28
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 3 times, most recently from 2dc5c8f to 5da111b on April 20, 2026 12:18
@cpcloud cpcloud requested review from leofang and rwgk April 20, 2026 13:21
Contributor

rwgk commented Apr 20, 2026

Generated with the help of Cursor GPT-5.4 Extra High Fast


High: make_program_cache_key() misses implicit source-directory header dependencies

make_program_cache_key() only forces extra_digest for explicit include/PCH options in cuda_core/cuda/core/utils/_program_cache.py:393 and cuda_core/cuda/core/utils/_program_cache.py:592, but NVRTC also implicitly searches the source file's directory unless no_source_include is set in cuda_core/cuda/core/_program.pyx:1001.

Program passes options.name straight to nvrtcCreateProgram() in cuda_core/cuda/core/_program.pyx:635, while the key builder only hashes that path string in cuda_core/cuda/core/utils/_program_cache.py:778. That means a workflow like options.name="/path/to/kernel.cu" plus #include "local.h" can reuse a stale cached ObjectCode after local.h changes.

The new tests cover explicit include/PCH knobs, but not this default source-directory include path (cuda_core/tests/test_program_cache.py:765).
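
A condensed illustration of that scenario (paths taken from the comment above; this is the behaviour of the revision under review, not of any fixed version):

from cuda.core import Program, ProgramOptions
from cuda.core.utils import FileStreamProgramCache

# local.h lives next to the named source file and defines local_value().
source = '#include "local.h"\nextern "C" __global__ void k(int *a){ *a = local_value(); }'
options = ProgramOptions(arch="sm_80", name="/path/to/kernel.cu")

with FileStreamProgramCache() as cache:
    # First call compiles and caches; NVRTC resolved "local.h" from /path/to/.
    Program(source, "c++", options=options).compile("cubin", cache=cache)

    # ... /path/to/local.h is edited on disk ...

    # Second call: the source string and options are byte-identical, so the key
    # matches and the stale cubin is returned without recompiling.
    Program(source, "c++", options=options).compile("cubin", cache=cache)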

Medium: FileStreamProgramCache._enforce_size_cap() can over-evict under concurrent capped writers

After the re-stat at cuda_core/cuda/core/utils/_program_cache.py:1515, a concurrent deleter can remove the candidate before path.unlink(). That FileNotFoundError is suppressed at cuda_core/cuda/core/utils/_program_cache.py:1530, but total is not adjusted, so eviction continues and can delete newer entries unnecessarily.

For a backend explicitly documented for multi-process use, that turns ordinary contention into avoidable cache data loss. The current multiprocess coverage exercises concurrent writes and prune races, but not max_size_bytes under concurrency (cuda_core/tests/test_program_cache_multiprocess.py).

Reduced simulation I used locally:

import time
from pathlib import Path

from cuda.core._module import ObjectCode
from cuda.core.utils import FileStreamProgramCache

cache = FileStreamProgramCache("/tmp/cuda_cache_review_race", max_size_bytes=1000)
cache[b"old"] = ObjectCode._init(b"a" * 600, "cubin", name="old")
time.sleep(0.01)

old_path = cache._path_for_key(b"old")
orig_unlink = Path.unlink
state = {"done": False}


def flaky_unlink(self, *args, **kwargs):
    if self == old_path and not state["done"]:
        state["done"] = True
        # Simulate another process deleting the file after stat() but before
        # _enforce_size_cap() updates its bookkeeping.
        orig_unlink(self, *args, **kwargs)
        raise FileNotFoundError(self)
    return orig_unlink(self, *args, **kwargs)


Path.unlink = flaky_unlink
try:
    cache[b"new"] = ObjectCode._init(b"b" * 600, "cubin", name="new")
finally:
    Path.unlink = orig_unlink

remaining = [key for key in (b"old", b"new") if cache.get(key) is not None]
print(remaining)  # []

That produced [] for me: once the first deletion race is swallowed without decrementing total, the loop keeps evicting and drops the fresh entry too.

Low: from cuda.core.utils import * now eagerly imports the cache stack

The package conversion keeps explicit imports like from cuda.core.utils import StridedMemoryView lightweight, but from cuda.core.utils import * walks __all__, resolves the lazy cache symbols, and imports _program_cache (cuda_core/cuda/core/utils/__init__.py:10, cuda_core/cuda/core/utils/__init__.py:32).

I verified that star-import now loads cuda.core.utils._program_cache. That said, this only affects import *, which is already discouraged. I think a short comment explaining that the laziness guarantee is intended for explicit imports, not star-import, seems sufficient here.
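
For readers who haven't seen the mechanism, a condensed sketch of the PEP 562 pattern being described and why star-import defeats it; this mirrors the description above rather than the literal file contents:

# pkg/__init__.py (illustrative stand-in for cuda/core/utils/__init__.py)
import importlib

__all__ = ["StridedMemoryView", "FileStreamProgramCache", "InMemoryProgramCache",
           "ProgramCacheResource", "make_program_cache_key"]

_LAZY = {name: "._program_cache" for name in __all__ if name != "StridedMemoryView"}

def __getattr__(name):  # PEP 562: only runs on an actual attribute miss
    if name in _LAZY:
        module = importlib.import_module(_LAZY[name], __name__)
        return getattr(module, name)
    raise AttributeError(name)

# `from pkg import FileStreamProgramCache` stays lazy: __getattr__ runs once,
# importing _program_cache only when that name is requested.
# `from pkg import *` iterates __all__ and getattr()s every entry, so it
# resolves every lazy name and imports _program_cache eagerly.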

Comment thread cuda_core/cuda/core/utils/__init__.py
Member

leofang commented Apr 22, 2026

Thanks, Phillip! I have this PR in my review backlog 🙏

The most important question: Are these cache implementations multithreading/multiprocessing safe? This is the key challenge that real-world apps will stress test. In CuPy, our on-disk cache has been stress-tested on DOE supercomputers.

@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from 3a32786 to cad93d0 on April 22, 2026 12:04
Contributor Author

cpcloud commented Apr 22, 2026

Addressed in ff886d3585 (fixes) and cad93d0 (refactor + star-import note).

High -- source-directory include. make_program_cache_key() now refuses to build an NVRTC key when options.name contains a directory separator and neither extra_digest nor no_source_include=True is set. Scoping the guard to names that actually introduce a new search directory (/abs/kernel.cu, rel/kernel.cu, C:\src\kernel.cu) keeps bare labels like "default_program" or "kernel-a" -- which fall back to CWD, the same search root every NVRTC compile sees -- accepted unchanged. Tests cover POSIX and Windows separators, the extra_digest and no_source_include=True accept paths, and confirm the guard is NVRTC-only (PTX and NVVM unaffected).

Medium -- over-eviction race. FileStreamProgramCache._enforce_size_cap() now decrements total whether it unlinks the candidate itself or a concurrent pruner already removed the file. The FileNotFoundError is still suppressed, but the accounting now matches reality, so the loop stops as soon as the cap is met. Added a test that monkeypatches Path.unlink to simulate a concurrent deleter winning exactly once, then verifies the freshly-committed entry survives.
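
A condensed sketch of the corrected accounting (the stat-guard, retries, and directory scan are omitted; names are illustrative):

from pathlib import Path

def enforce_size_cap(entries: list[tuple[Path, int]], total: int, cap: int) -> int:
    """Evict oldest-first until total <= cap; entries are (path, size), oldest atime first."""
    for path, size in entries:
        if total <= cap:
            break
        try:
            path.unlink()
        except FileNotFoundError:
            pass  # a concurrent pruner already removed it; either way the bytes are gone
        total -= size  # decrement unconditionally so the loop stops at the cap
    return total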

Low -- star-import. Added a note in cuda_core/cuda/core/utils/__init__.py that the laziness guarantee is for explicit imports only -- from cuda.core.utils import * walks __all__ and therefore resolves every lazy attribute. Star-imports are discouraged anyway, so treat that as expected.

Contributor Author

cpcloud commented Apr 22, 2026

@leofang -- yes, all three backends are designed and tested for concurrent access, with different scopes:

InMemoryProgramCache -- thread-safe, not process-safe. Dict-backed (OrderedDict) cache that lives only in the owning process. threading.RLock serialises every method (__getitem__, __setitem__, __contains__, __delitem__, __len__, clear, and the internal eviction pass) so threads can share one cache object without external locking. It is not process-safe by design: each process has its own independent cache, there is no shared state or IPC; for multi-process sharing use SQLiteProgramCache or FileStreamProgramCache. Note that entries are stored by reference, not copied -- a thread that mutates a returned ObjectCode affects the cached entry, so callers must treat reads as read-only. Stressed by a threaded test with 4 writers + 4 readers x 200 ops against a cache with max_entries set, verifying the cache stays consistent and never exceeds the cap.
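
A condensed sketch of the behaviour described above (OrderedDict + RLock, size cap, reads promote, entries held by reference); this is not the shipped class:

import threading
from collections import OrderedDict

class TinyLRUCache:
    def __init__(self, max_size_bytes=None):
        self._entries = OrderedDict()
        self._lock = threading.RLock()
        self._max = max_size_bytes
        self._total = 0

    def __getitem__(self, key: bytes) -> bytes:
        with self._lock:
            data = self._entries[key]
            self._entries.move_to_end(key)  # a real read promotes to most-recent
            return data  # returned by reference: callers must treat it as read-only

    def __setitem__(self, key: bytes, data: bytes) -> None:
        with self._lock:
            self._total -= len(self._entries.pop(key, b""))
            self._entries[key] = data
            self._total += len(data)
            while self._max is not None and self._total > self._max and len(self._entries) > 1:
                _, evicted = self._entries.popitem(last=False)  # drop the least-recently-used entry
                self._total -= len(evicted)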

SQLiteProgramCache -- thread-safe; multi-process best-effort. check_same_thread=False on the connection plus a threading.RLock serialises every connection-touching method, so threads cannot interleave a read/update or a write/VACUUM pair. WAL + autocommit on open. Stressed by a 4 writers + 4 readers x 200 ops test in test_program_cache.py. Sharing the sqlite file across processes does work (sqlite3 WAL serialises writes at the file level and our size-cap/VACUUM passes run under the same WAL discipline), but the threading.RLock does not cross process boundaries and the VACUUM pass skips under active readers, so the on-disk file can temporarily grow above max_size_bytes until readers release. For workloads with many concurrent processes, FileStreamProgramCache is the better fit.
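
A minimal sketch of that connection discipline (WAL, autocommit, a process-local RLock, one connection shared across threads); the schema and statements are illustrative, not the backend's actual ones:

import sqlite3
import threading
import time

_lock = threading.RLock()
conn = sqlite3.connect(
    "program-cache.sqlite3",
    check_same_thread=False,  # one connection shared by every thread in this process
    isolation_level=None,     # autocommit: each statement is its own transaction
)
conn.execute("PRAGMA journal_mode=WAL")  # readers never block the single writer
conn.execute("CREATE TABLE IF NOT EXISTS entries (key BLOB PRIMARY KEY, data BLOB, atime REAL)")

def put(key: bytes, data: bytes) -> None:
    with _lock:  # serialises threads here; sqlite's WAL serialises writers across processes
        conn.execute("INSERT OR REPLACE INTO entries VALUES (?, ?, ?)", (key, data, time.time()))

def get(key: bytes):
    with _lock:
        row = conn.execute("SELECT data FROM entries WHERE key = ?", (key,)).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE entries SET atime = ? WHERE key = ?", (time.time(), key))
        return row[0]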

FileStreamProgramCache -- thread-safe and process-safe. Every write lands on a per-write temp file and is promoted via os.replace, so a reader/writer race either sees the old entry or the new one -- never a half-written file. Reader pruning, clear(), and _enforce_size_cap are all stat-guarded: before unlinking, the code re-stats the candidate and refuses if (ino, size, mtime_ns) differs from the snapshot, so a concurrent writer's os.replace is preserved. Stale temp files are swept on open. On Windows, os.replace can surface ERROR_SHARING_VIOLATION (32) / ERROR_LOCK_VIOLATION (33) against a reader briefly holding the handle; the code retries with bounded backoff (~185ms total) before treating it as a non-fatal cache miss -- all other PermissionErrors and POSIX failures propagate.
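
The stat-guard itself is small; a condensed sketch (names and the surrounding scan are simplified):

from pathlib import Path

def guarded_unlink(path: Path, snapshot: tuple[int, int, int]) -> bool:
    """Unlink path only if it is still the exact file that was selected for eviction."""
    try:
        st = path.stat()
    except FileNotFoundError:
        return False  # already gone; nothing to do
    if (st.st_ino, st.st_size, st.st_mtime_ns) != snapshot:
        return False  # a writer's os.replace landed in between; keep the new entry
    path.unlink(missing_ok=True)  # a small TOCTOU window remains, acceptable for a cache
    return True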

Cross-process coverage in test_program_cache_multiprocess.py:

  • concurrent writers producing overlapping keys
  • a writer/reader race exercising the stat-guarded prune path
  • clear/eviction race injection via generator cleanup (the cleanup code after the last yield runs at StopIteration, which is exactly between _enforce_size_cap's scan and its eviction loop)
  • Windows PermissionError narrowing (winerror 32/33 swallow + retry, all others propagate)

One concurrency bug this review shook out (over-eviction after a suppressed FileNotFoundError in _enforce_size_cap) is fixed with its own test. If you see a DOE-style pattern from CuPy's cache that we don't cover yet, happy to add a test that reproduces it -- mapping that stress-testing onto this backend would be useful.

@cpcloud cpcloud changed the title from "Add PersistentProgramCache (sqlite + filestream backends)" to "Add program caches (in-memory, sqlite, filestream)" on Apr 22, 2026
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from 457cab7 to cfddd08 on April 23, 2026 12:35
@cpcloud cpcloud requested a review from rwgk April 23, 2026 15:00
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from cfddd08 to fce123f on April 24, 2026 12:41
Contributor

rwgk commented Apr 27, 2026

FWIW, I briefly explored "safe pickle" and "signed pickle blobs" in this chat:

The conclusion there is:

  • Do not pickle compiled-kernel cache entries.
  • Use JSON / simple binary files for metadata and artifacts.

@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 4 times, most recently from 7d1cb23 to 86dab90 on April 29, 2026 13:41
@cpcloud cpcloud changed the title from "Add program caches (in-memory, sqlite, filestream)" to "Add persistent program cache for Program.compile" on Apr 29, 2026
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from 86dab90 to a60f1c6 on April 29, 2026 14:08
Member

@leofang leofang left a comment


Thanks, Phillip! Sorry for the long wait. Sending out the first wave of my review. Will continue asap.

Note: It would be nice if we can break up the two largest files (cuda_core/cuda/core/utils/_program_cache.py and cuda_core/tests/test_program_cache.py, each 1-2k lines) into smaller logical units.

Comment thread cuda_core/pyproject.toml Outdated
dependencies = [
"cuda-pathfinder >=1.4.2",
"numpy",
"platformdirs >=3.0",
Member

Please let us not introduce new dependencies.

Comment thread cuda_core/pixi.toml Outdated
numpy = "*"
cuda-bindings = "*"
cuda-pathfinder = "*"
platformdirs = ">=3.0"
Member

ditto

platformdirs would otherwise insert on Windows, keeping the layout
identical across platforms (``<root>/cuda-python/program-cache``).
"""
return platformdirs.user_cache_path("cuda-python", appauthor=False, opinion=False) / "program-cache"
Member

  1. Doesn't make sense to introduce a new dependency just for this niche use case. We can totally vibe this out ourselves.
  2. CUDA has not supported macOS for many years.

Comment thread cuda_core/pixi.lock
Member

If we drop platformdirs, we won't need to update the lock here

Comment on lines +135 to +153
# ``name_expressions`` is incompatible with the cache: NVRTC
# populates ``ObjectCode.symbol_mapping`` from name-expression
# mangling at compile time, and that mapping isn't carried in
# the binary bytes the cache stores. Without this guard the
# first call (cache miss) would return an ObjectCode with
# symbol_mapping populated, while every subsequent call (hit)
# would return one without -- silently breaking later
# ``get_kernel(name_expression)`` lookups that work on the
# uncached path. Fail loud here instead.
if name_expressions:
    raise ValueError(
        "Program.compile(cache=...) does not support name_expressions: "
        "ObjectCode.symbol_mapping is populated by NVRTC at compile "
        "time and is not preserved across a cache round-trip, so cache "
        "hits would silently break get_kernel(name_expression) lookups "
        "that the uncached path supports. Compile without cache= when "
        "name_expressions are needed, or look up mangled symbols by "
        "hand from the cached ObjectCode."
    )
Member

Note to self: I need to address this after 1.0 is out, xref: cupy/cupy#9801

Comment on lines +1056 to +1066
def __getitem__(self, key: object) -> bytes:
    k = _as_key_bytes(key)
    with self._lock:
        try:
            data, _size = self._entries[k]
        except KeyError:
            raise KeyError(key) from None
        # Touch LRU: a real read promotes the entry to "most recent"
        # so eviction prefers genuinely cold entries.
        self._entries.move_to_end(k)
        return data
Member

Q: What would be our recommended way of using InMemoryProgramCache in a multi-GPU env? Wondering about this because we usually have each GPU driven by a thread, and if the intended use case is a global cache object (which makes sense on a homogeneous system like DGX) this would cause serialization.

In CuPy internally there is a per-device cache so this issue is avoided.

Contributor Author

cpcloud commented Apr 30, 2026

@leofang -- on the multi-GPU question. Two options worth weighing, both viable:

Background. make_program_cache_key already mixes arch into the digest, so on a homogeneous system (all sm_80, all sm_90) every device produces the same key for the same source -- sharing one cache amortises compiles across every GPU. The single-lock cost is real but bounded: the dict update + LRU bump runs in microseconds, and the compile itself runs outside the lock. So the contention only matters during compile-heavy startup with many threads, on heterogeneous-arch systems, or when profiling shows it.

Option A -- document the dict-of-caches pattern, no API change.

caches = {d.device_id: InMemoryProgramCache() for d in devices}
# per thread:
program.compile(\"cubin\", cache=caches[Device().device_id])
  • Pros: zero API surface added; `Program.compile(cache=...)` is duck-typed today, so callers can also wrap the dict in a thin router themselves.
  • Cons: callers thread the dict through every compile site; easy to forget the device-id key.

Option B -- ship a `PerDeviceProgramCache` routing wrapper (sketched below).

  • Pros: one-liner for callers (`cache = PerDeviceProgramCache()`); routing on `Device().device_id` lives inside the class.
  • Cons: couples the cache layer to `Device.current()` (every `getitem` reaches into the device API); more API surface to lock down before 1.0; harder to mock in tests.
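
For concreteness, a hedged sketch of what Option B's router could look like; the class does not exist in this PR, and it leans on the duck-typed cache protocol plus Device().device_id as described above:

from cuda.core import Device
from cuda.core.utils import InMemoryProgramCache

class PerDeviceProgramCache:
    """Route every cache operation to an InMemoryProgramCache owned by the current device."""

    def __init__(self):
        self._per_device = {}

    def _current(self):
        dev_id = Device().device_id  # couples the cache layer to the device API
        cache = self._per_device.get(dev_id)
        if cache is None:
            cache = self._per_device.setdefault(dev_id, InMemoryProgramCache())
        return cache

    def get(self, key):
        return self._current().get(key)

    def __getitem__(self, key):
        return self._current()[key]

    def __setitem__(self, key, value):
        self._current()[key] = value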

I lean A for the first cut: the common case (homogeneous DGX, single SKU) is correctly served by one shared cache and benefits from cross-device amortisation, the heterogeneous-arch case is a 3-line dict away, and B can ship post-1.0 if real workloads make the pattern common enough to deserve a class. Happy to pivot to B if you'd rather have it on the public surface from day one.

cpcloud added 2 commits April 30, 2026 04:45
Add a bytes-in / bytes-out cache abstraction and two backends for
caching compiled CUDA programs across process boundaries.

* ``ProgramCacheResource`` -- abstract base. Concrete backends store
  raw binary bytes keyed by ``bytes`` or ``str``; reads return the
  same payload. ``__setitem__`` accepts ``bytes``, ``bytearray``,
  ``memoryview``, or any :class:`~cuda.core.ObjectCode` (path-backed
  too -- the file is read at write time so the cached entry holds
  the binary content, not a path that could move). Provides default
  ``get``, ``update`` (mapping or pairs), ``close``, and context
  manager. ``__contains__`` is intentionally NOT abstract: the racy
  ``if key in cache: data = cache[key]`` idiom is steered toward
  ``cache.get(key)`` instead.

* ``InMemoryProgramCache`` -- single-process LRU on
  ``collections.OrderedDict`` with ``threading.RLock`` and a
  size-only cap. Reads promote via ``move_to_end``.

* ``FileStreamProgramCache`` -- directory of atomic per-entry files.
  Writes stage to ``tmp/`` then ``os.replace`` into
  ``entries/<2-char>/<blake2b-hex>``; concurrent readers never see a
  torn file. Each entry is the raw compiled binary (no pickle, no
  framing) so files are directly consumable by external NVIDIA
  tools (``cuobjdump``, ``nvdisasm``, ``cuda-gdb``). Eviction is
  true LRU via ``st_atime`` (the read path calls ``os.utime`` to
  bypass ``relatime`` / ``NtfsDisableLastAccessUpdate`` /
  ``noatime``). Stat-guarded prunes refuse to unlink entries
  another process replaced mid-eviction. ``tmp/`` is recreated on
  every write so an external wipe doesn't crash later writes.
  Default cache directory comes from
  ``platformdirs.user_cache_path("cuda-python", appauthor=False,
  opinion=False) / "program-cache"``.

* Windows sharing-violation handling -- ``os.replace``,
  ``path.stat() + read_bytes()``, and ``path.unlink`` all retry on
  winerror 5/32/33 with a bounded backoff (~185 ms). The
  ``_is_windows_sharing_violation`` predicate filters EACCES only
  when ``winerror`` is absent so non-sharing winerrors propagate as
  the real config errors they are. Off-Windows ``PermissionError``
  always propagates.

* ``make_program_cache_key`` -- escape hatch for callers whose
  compile inputs require an ``extra_digest`` (header / PCH content
  fingerprints, NVVM libdevice). Builds a 32-byte blake2b digest
  via a backend-strategy pattern: a ``_KeyBackend`` ABC with
  per-code-type subclasses (``_NvrtcBackend``, ``_LinkerBackend``,
  ``_NvvmBackend``) owns each backend's validation, code coercion,
  option fingerprinting, name-expression handling, version probe,
  and extra-payload hashing. The orchestrator dispatches via
  ``_BACKENDS_BY_CODE_TYPE[code_type]`` and assembles the digest in
  fixed order. Backend gates match ``Program.compile``: rejects
  inputs the real compile would reject (side-effect options,
  external-content options without an ``extra_digest``,
  driver-linker-unsupported options, NVRTC ``options.name`` with a
  directory component). NVVM ``extra_sources`` is hashed in
  caller-provided order because NVVM module linking is
  order-dependent in the general case (overlapping symbols, weak
  definitions); canonicalising would silently change behavior for
  order-dependent inputs.

Adds ``platformdirs >=3.0`` to ``cuda_core/pyproject.toml`` and the
matching pixi manifests.

Tests cover the abstract contract, key-construction matrix
(deterministic, supported-target gates, backend-probe taint, gate
canonicalization, side-effect / external-content / dir-component
guards, schema version mixing), single-process CRUD and LRU,
atomic-write race coverage, atime LRU promotion, stat-guarded
prune / atime touch / clear / size-cap, default-dir resolution via
platformdirs, the ``_is_windows_sharing_violation`` predicate's
truth table including the regression case (non-sharing winerror
plus EACCES propagates), tmp-dir recreation after external wipe,
multiprocess concurrent writers / reader-vs-writer torn-file
safety / size-cap eviction race.
Adds a ``cache=`` keyword to :meth:`cuda.core.Program.compile` that
threads the persistent cache machinery into the high-level compile
path. With ``cache=None`` (the default) the call is byte-identical
to the un-cached path -- no key derivation, no extra import, no
behavior change.

When a cache is provided, the wrapper derives a key via
:func:`~cuda.core.utils.make_program_cache_key` from the program's
source, options, and target type; checks the cache; on hit,
returns a fresh
``ObjectCode._init(hit_bytes, target_type, name=self._options.name)``;
on miss, runs the underlying compile and stores
``cache[key] = compiled`` (the cache extracts ``bytes(obj.code)``).

Two compile-time guards close obvious footguns:

* ``name_expressions`` plus ``cache=`` raises ``ValueError``.
  NVRTC populates ``ObjectCode.symbol_mapping`` from name-expression
  mangling at compile time, and that mapping isn't carried in the
  binary the cache stores. Without this guard the first call (miss)
  would return an ObjectCode with mappings populated, while every
  subsequent call (hit) would return one without -- silently
  breaking later ``get_kernel(name_expression)`` lookups that work
  on the uncached path. Compiles that need name_expressions should
  run without ``cache=``, or look up mangled symbols by hand from
  the cached ``ObjectCode``.

* Inputs whose compilation effect isn't captured by the key
  (``include_path``, ``pre_include``, ``pch``, ``use_pch``,
  ``pch_dir``, NVVM ``use_libdevice=True``, NVRTC ``options.name``
  with a directory component, side-effect options like
  ``create_pch`` / ``time`` / ``fdevice_time_trace``) propagate the
  ``ValueError`` from ``make_program_cache_key`` -- those callers
  should use ``make_program_cache_key`` directly with an
  ``extra_digest`` covering the external content.

Cache hits also mirror the uncached path's NVRTC-PTX loadability
warning: when ``self._backend == "NVRTC"``, ``target_type ==
"ptx"``, and ``_can_load_generated_ptx()`` returns False, a
``RuntimeWarning`` is emitted before returning the cached bytes.
Loadability is a property of the active driver, not of how the
bytes were produced, so the warning applies equally to cached PTX.

Supporting refactors:

* Unify ``Program``'s source retention into a single ``_code``
  field (was split between ``_code`` for NVVM and a separate
  ``_source`` for c++/ptx). ``_code`` is now always bytes; the
  cache wrapper decodes back to ``str`` for c++/ptx before passing
  to ``make_program_cache_key`` (which only accepts bytes for NVVM).

* Move the actual compile call into a module-level
  ``_program_compile_uncached`` so tests can monkeypatch the seam
  without going through NVRTC. ``Program`` is a ``cdef class``, so
  its methods cannot be reassigned from Python -- the seam has to
  live outside the class.

* The unified ``_code`` field also exposed a pre-existing bug on
  the NVVM path: the C pointer was being recomputed from the
  caller's original ``code`` argument rather than from
  ``self._code``, which crashed for ``bytearray`` inputs that the
  field's bytes coercion handled cleanly. Fixed; regression test
  added in ``test_program.py``.

Tests in ``test_program_compile_cache.py`` cover both halves of the
contract: the wrapper-level miss/hit/error paths against a recording
stub (verifying it's duck-typed and doesn't require subclassing
``ProgramCacheResource``), the rejection paths (name_expressions,
extra_digest-required options, side-effect options, NVRTC
``options.name`` with a directory component), the PTX loadability
warning on cache hit (positive: warns when the driver can't load
the cached PTX; negative: stays quiet otherwise), and a real NVRTC
end-to-end roundtrip using ``FileStreamProgramCache`` across reopen
so the bytes match across processes.
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from a60f1c6 to d177450 on April 30, 2026 08:45
Contributor Author

cpcloud commented Apr 30, 2026

Pushed d177450 addressing the first review wave:

platformdirs dropped (cuda_core/pyproject.toml, cuda_core/pixi.toml, cuda_core/cuda/core/utils/_program_cache/_file_stream.py). _default_cache_dir is now ~10 lines: `$XDG_CACHE_HOME or ~/.cache` on Linux, `%LOCALAPPDATA% or ~/AppData/Local` on Windows -- no macOS branch since CUDA doesn't support it. pixi.lock is reverted to upstream (the lock churn was solely from the now-removed dep). Tests in test_default_cache_dir_lives_under_user_cache_root parametrise both branches.
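
A hedged sketch of that ~10-line helper, following the description above (the real one lives in cuda_core/cuda/core/utils/_program_cache/_file_stream.py and may differ in detail):

import os
import sys
from pathlib import Path

def _default_cache_dir() -> Path:
    if sys.platform == "win32":
        root = os.environ.get("LOCALAPPDATA")
        base = Path(root) if root else Path.home() / "AppData" / "Local"
    else:
        root = os.environ.get("XDG_CACHE_HOME")
        base = Path(root) if root else Path.home() / ".cache"
    return base / "cuda-python" / "program-cache"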

_program_cache.py split into a package (cuda_core/cuda/core/utils/_program_cache/{__init__.py,_abc.py,_keys.py,_in_memory.py,_file_stream.py}) -- 1700 lines → 5 focused submodules, each ≤ 700 lines. Public surface (`FileStreamProgramCache`, `InMemoryProgramCache`, `ProgramCacheResource`, `make_program_cache_key`) re-exported from the package, so external imports are unchanged. Tests that monkeypatch internals were updated to address the owning submodule directly (_program_cache._keys._linker_backend_and_version, _program_cache._file_stream._IS_WINDOWS, etc.) -- the package-level symbols are convenience aliases and don't intercept calls within submodules.

test_program_cache.py file split deferred. The 2179-line file is internally well-organised by section (key construction / InMemory / FileStream / multi-process), and after the source split the test reorganisation is mechanical churn rather than a clarity win. Happy to do it as a follow-up if you'd prefer.

Multi-GPU question replied to in #issuecomment-4350987016.
