Add persistent program cache for Program.compile #1912
cpcloud wants to merge 2 commits into NVIDIA:main
Conversation
Force-pushed from de57bd8 to ac38a68.
Force-pushed from f1ae40e to b27ed2c.
Force-pushed from 2dc5c8f to 5da111b.
Generated with the help of Cursor.
Thanks, Phillip! I have this PR in my review backlog 🙏 The most important question: are these cache implementations multithreading/multiprocessing safe? This is the key challenge that real-world apps will stress-test. In CuPy, our on-disk cache has been stress-tested on DOE supercomputers.
Force-pushed from 3a32786 to cad93d0.
Addressed in ff886d3585 (fixes) and cad93d0 (refactor + star-import note). High -- source-directory include. Medium -- over-eviction race. Low -- star-import. Added a note in
@leofang -- yes, all three backends are designed and tested for concurrent access, with different scopes:
Cross-process coverage in
One concurrency bug this review shook out (over-eviction after a suppressed
Force-pushed from 457cab7 to cfddd08.
Force-pushed from cfddd08 to fce123f.
FWIW, I briefly explored "safe pickle" and "signed pickle blobs" in this chat: The conclusion there is:
Force-pushed from 7d1cb23 to 86dab90.
Force-pushed from 86dab90 to a60f1c6.
leofang left a comment
Thanks, Phillip! Sorry for the long wait. Sending out the first wave of my review; will continue ASAP.
Note: It would be nice if we can break up the two largest files (cuda_core/cuda/core/utils/_program_cache.py and cuda_core/tests/test_program_cache.py, each is 1-2k lines) into smaller logical units.
```toml
dependencies = [
    "cuda-pathfinder >=1.4.2",
    "numpy",
    "platformdirs >=3.0",
```
Please let us not introduce new dependencies.
```toml
numpy = "*"
cuda-bindings = "*"
cuda-pathfinder = "*"
platformdirs = ">=3.0"
```
```python
    platformdirs would otherwise insert on Windows, keeping the layout
    identical across platforms (``<root>/cuda-python/program-cache``).
    """
    return platformdirs.user_cache_path("cuda-python", appauthor=False, opinion=False) / "program-cache"
```
- It doesn't make sense to introduce a new dependency just for this niche use case. We can totally vibe this out ourselves.
- CUDA has not supported macOS for many years.
If we drop platformdirs we won't need to update the lock here.
```python
# ``name_expressions`` is incompatible with the cache: NVRTC
# populates ``ObjectCode.symbol_mapping`` from name-expression
# mangling at compile time, and that mapping isn't carried in
# the binary bytes the cache stores. Without this guard the
# first call (cache miss) would return an ObjectCode with
# symbol_mapping populated, while every subsequent call (hit)
# would return one without -- silently breaking later
# ``get_kernel(name_expression)`` lookups that work on the
# uncached path. Fail loud here instead.
if name_expressions:
    raise ValueError(
        "Program.compile(cache=...) does not support name_expressions: "
        "ObjectCode.symbol_mapping is populated by NVRTC at compile "
        "time and is not preserved across a cache round-trip, so cache "
        "hits would silently break get_kernel(name_expression) lookups "
        "that the uncached path supports. Compile without cache= when "
        "name_expressions are needed, or look up mangled symbols by "
        "hand from the cached ObjectCode."
    )
```
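For reference, a minimal sketch of the call this guard rejects and the suggested workaround; `source`, `opts`, and `cache` here are hypothetical stand-ins, not names from the PR:

```python
prog = Program(source, "c++", options=opts)

try:
    # Rejected: a cache hit could not reproduce symbol_mapping.
    prog.compile("cubin", name_expressions=("kern<float>",), cache=cache)
except ValueError as exc:
    print(exc)  # Program.compile(cache=...) does not support name_expressions: ...

# Workaround: compile uncached when mangled-name lookup is needed.
mod = prog.compile("cubin", name_expressions=("kern<float>",))
kern = mod.get_kernel("kern<float>")
```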
Note to self: I need to address this after 1.0 is out, xref: cupy/cupy#9801
```python
def __getitem__(self, key: object) -> bytes:
    k = _as_key_bytes(key)
    with self._lock:
        try:
            data, _size = self._entries[k]
        except KeyError:
            raise KeyError(key) from None
        # Touch LRU: a real read promotes the entry to "most recent"
        # so eviction prefers genuinely cold entries.
        self._entries.move_to_end(k)
        return data
```
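For context, a minimal sketch of the write side this read path pairs with -- size-only cap, LRU eviction from the cold end of the `OrderedDict`. The helper `_as_value_bytes` and the `_size` / `_max_size_bytes` attributes are assumptions for illustration, not the PR's actual code:

```python
def __setitem__(self, key: object, value: object) -> None:
    k = _as_key_bytes(key)
    data = _as_value_bytes(value)  # bytes / bytearray / memoryview / ObjectCode
    with self._lock:
        old = self._entries.pop(k, None)  # drop any old entry, re-insert at MRU end
        if old is not None:
            self._size -= old[1]
        self._entries[k] = (data, len(data))
        self._size += len(data)
        # Evict from the cold (least recently used) end until under the cap.
        while self._max_size_bytes is not None and self._size > self._max_size_bytes:
            _, (_evicted, sz) = self._entries.popitem(last=False)
            self._size -= sz
```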
Q: What would be our recommended way of using InMemoryProgramCache in a multi-GPU env? Wondering about this because we usually have each GPU driven by a thread, and if the intended use case is a global cache object (which makes sense on a homogeneous system like DGX), this would cause serialization.
In CuPy there is internally a per-device cache, so this issue is avoided.
@leofang -- on the multi-GPU question. Two options worth weighing, both viable.

Option A -- document the dict-of-caches pattern, no API change:

```python
caches = {d.device_id: InMemoryProgramCache() for d in devices}
# per thread:
program.compile("cubin", cache=caches[Device().device_id])
```

Option B -- ship a `PerDeviceProgramCache` routing wrapper (a sketch follows below).

I lean A for the first cut: the common case (homogeneous DGX, single SKU) is correctly served by one shared cache and benefits from cross-device amortisation, the heterogeneous-arch case is a 3-line dict away, and B can ship post-1.0 if real workloads make the pattern common enough to deserve a class. Happy to pivot to B if you'd rather have it on the public surface from day one.
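For concreteness, a rough sketch of what option B's `PerDeviceProgramCache` could look like -- the constructor, locking, import paths, and `Device().device_id` routing are assumptions, not a committed design:

```python
import threading

from cuda.core.experimental import Device            # import path assumed
from cuda.core.utils import InMemoryProgramCache     # per this PR


class PerDeviceProgramCache:
    """Route cache traffic to a per-device backend, keyed by the
    calling thread's current device (option B sketch)."""

    def __init__(self, factory=InMemoryProgramCache):
        self._factory = factory
        self._caches: dict[int, object] = {}
        self._lock = threading.Lock()

    def _cache(self):
        dev_id = Device().device_id  # current device of this thread
        with self._lock:
            if dev_id not in self._caches:
                self._caches[dev_id] = self._factory()
            return self._caches[dev_id]

    def get(self, key, default=None):
        return self._cache().get(key, default)

    def __getitem__(self, key):
        return self._cache()[key]

    def __setitem__(self, key, value):
        self._cache()[key] = value
```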
Add a bytes-in / bytes-out cache abstraction and two backends for
caching compiled CUDA programs across process boundaries.
* ``ProgramCacheResource`` -- abstract base. Concrete backends store
raw binary bytes keyed by ``bytes`` or ``str``; reads return the
same payload. ``__setitem__`` accepts ``bytes``, ``bytearray``,
``memoryview``, or any :class:`~cuda.core.ObjectCode` (path-backed
too -- the file is read at write time so the cached entry holds
the binary content, not a path that could move). Provides default
``get``, ``update`` (mapping or pairs), ``close``, and context
manager. ``__contains__`` is intentionally NOT abstract: the racy
``if key in cache; data = cache[key]`` idiom is steered toward
``cache.get(key)`` instead.
* ``InMemoryProgramCache`` -- single-process LRU on
``collections.OrderedDict`` with ``threading.RLock`` and a
size-only cap. Reads promote via ``move_to_end``.
* ``FileStreamProgramCache`` -- directory of atomic per-entry files.
Writes stage to ``tmp/`` then ``os.replace`` into
``entries/<2-char>/<blake2b-hex>``; concurrent readers never see a
torn file. Each entry is the raw compiled binary (no pickle, no
framing) so files are directly consumable by external NVIDIA
tools (``cuobjdump``, ``nvdisasm``, ``cuda-gdb``). Eviction is
true LRU via ``st_atime`` (the read path calls ``os.utime`` to
bypass ``relatime`` / ``NtfsDisableLastAccessUpdate`` /
``noatime``). Stat-guarded prunes refuse to unlink entries
another process replaced mid-eviction. ``tmp/`` is recreated on
every write so an external wipe doesn't crash later writes.
Default cache directory comes from
``platformdirs.user_cache_path("cuda-python", appauthor=False,
opinion=False) / "program-cache"``.
* Windows sharing-violation handling -- ``os.replace``,
``path.stat() + read_bytes()``, and ``path.unlink`` all retry on
winerror 5/32/33 with a bounded backoff (~185 ms). The
``_is_windows_sharing_violation`` predicate filters EACCES only
when ``winerror`` is absent so non-sharing winerrors propagate as
the real config errors they are. Off-Windows ``PermissionError``
always propagates.
* ``make_program_cache_key`` -- escape hatch for callers whose
compile inputs require an ``extra_digest`` (header / PCH content
fingerprints, NVVM libdevice). Builds a 32-byte blake2b digest
via a backend-strategy pattern: a ``_KeyBackend`` ABC with
per-code-type subclasses (``_NvrtcBackend``, ``_LinkerBackend``,
``_NvvmBackend``) owns each backend's validation, code coercion,
option fingerprinting, name-expression handling, version probe,
and extra-payload hashing. The orchestrator dispatches via
``_BACKENDS_BY_CODE_TYPE[code_type]`` and assembles the digest in
fixed order. Backend gates match ``Program.compile``: rejects
inputs the real compile would reject (side-effect options,
external-content options without an ``extra_digest``,
driver-linker-unsupported options, NVRTC ``options.name`` with a
directory component). NVVM ``extra_sources`` is hashed in
caller-provided order because NVVM module linking is
order-dependent in the general case (overlapping symbols, weak
definitions); canonicalising would silently change behavior for
order-dependent inputs.
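To make the escape hatch concrete, a hedged sketch of the intended pattern. The keyword names passed to `make_program_cache_key`, the import paths, and the `ProgramOptions` arguments are guesses from this description, not the final signature:

```python
import hashlib
from pathlib import Path

from cuda.core.experimental import Program, ProgramOptions  # import paths assumed
from cuda.core.utils import FileStreamProgramCache, make_program_cache_key

source = Path("kernels.cu").read_text()
opts = ProgramOptions(arch="sm_80", include_path="include")

# include_path content is invisible to the key builder, so fingerprint
# the headers by hand and fold them into the key via extra_digest.
hdr_digest = hashlib.blake2b(
    Path("include/params.h").read_bytes(), digest_size=32
).digest()

cache = FileStreamProgramCache()
key = make_program_cache_key(
    code=source, code_type="c++", target_type="cubin",
    options=opts, extra_digest=hdr_digest,  # keyword names are guesses
)

data = cache.get(key)
if data is None:
    obj = Program(source, "c++", options=opts).compile("cubin")
    cache[key] = obj  # the cache stores bytes(obj.code)
```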
Adds ``platformdirs >=3.0`` to ``cuda_core/pyproject.toml`` and the
matching pixi manifests.
Tests cover the abstract contract, key-construction matrix
(deterministic, supported-target gates, backend-probe taint, gate
canonicalization, side-effect / external-content / dir-component
guards, schema version mixing), single-process CRUD and LRU,
atomic-write race coverage, atime LRU promotion, stat-guarded
prune / atime touch / clear / size-cap, default-dir resolution via
platformdirs, the ``_is_windows_sharing_violation`` predicate's
truth table including the regression case (non-sharing winerror
plus EACCES propagates), tmp-dir recreation after external wipe,
multiprocess concurrent writers / reader-vs-writer torn-file
safety / size-cap eviction race.
Adds a ``cache=`` keyword to :meth:`cuda.core.Program.compile` that threads the persistent cache machinery into the high-level compile path. With ``cache=None`` (the default) the call is byte-identical to the un-cached path -- no key derivation, no extra import, no behavior change. When a cache is provided, the wrapper derives a key via :func:`~cuda.core.utils.make_program_cache_key` from the program's source, options, and target type; checks the cache; on hit, returns a fresh ``ObjectCode._init(hit_bytes, target_type, name=self._options.name)``; on miss, runs the underlying compile and stores ``cache[key] = compiled`` (the cache extracts ``bytes(obj.code)``).

Two compile-time guards close obvious footguns:

* ``name_expressions`` plus ``cache=`` raises ``ValueError``. NVRTC populates ``ObjectCode.symbol_mapping`` from name-expression mangling at compile time, and that mapping isn't carried in the binary the cache stores. Without this guard the first call (miss) would return an ObjectCode with mappings populated, while every subsequent call (hit) would return one without -- silently breaking later ``get_kernel(name_expression)`` lookups that work on the uncached path. Compiles that need name_expressions should run without ``cache=``, or look up mangled symbols by hand from the cached ``ObjectCode``.

* Inputs whose compilation effect isn't captured by the key (``include_path``, ``pre_include``, ``pch``, ``use_pch``, ``pch_dir``, NVVM ``use_libdevice=True``, NVRTC ``options.name`` with a directory component, side-effect options like ``create_pch`` / ``time`` / ``fdevice_time_trace``) propagate the ``ValueError`` from ``make_program_cache_key`` -- those callers should use ``make_program_cache_key`` directly with an ``extra_digest`` covering the external content.

Cache hits also mirror the uncached path's NVRTC-PTX loadability warning: when ``self._backend == "NVRTC"``, ``target_type == "ptx"``, and ``_can_load_generated_ptx()`` returns False, a ``RuntimeWarning`` is emitted before returning the cached bytes. Loadability is a property of the active driver, not of how the bytes were produced, so the warning applies equally to cached PTX.

Supporting refactors:

* Unify ``Program``'s source retention into a single ``_code`` field (was split between ``_code`` for NVVM and a separate ``_source`` for c++/ptx). ``_code`` is now always bytes; the cache wrapper decodes back to ``str`` for c++/ptx before passing to ``make_program_cache_key`` (which only accepts bytes for NVVM).

* Move the actual compile call into a module-level ``_program_compile_uncached`` so tests can monkeypatch the seam without going through NVRTC. ``Program`` is a ``cdef class``, so its methods cannot be reassigned from Python -- the seam has to live outside the class.

* The unified ``_code`` field also exposed a pre-existing bug on the NVVM path: the C pointer was being recomputed from the caller's original ``code`` argument rather than from ``self._code``, which crashed for ``bytearray`` inputs that the field's bytes coercion handled cleanly. Fixed; regression test added in ``test_program.py``.
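A compressed sketch of that hit/miss flow as a free function -- parameter plumbing is simplified and the keyword names of ``make_program_cache_key`` are guesses, not the merged code:

```python
def _compile_with_cache(program, target_type, cache):
    # Hypothetical rendering of the wrapper described above.
    key = make_program_cache_key(
        code=program._code, code_type=program.backend,
        target_type=target_type, options=program._options,
    )
    data = cache.get(key)
    if data is not None:  # hit: no compile, fresh ObjectCode from cached bytes
        return ObjectCode._init(data, target_type, name=program._options.name)
    obj = _program_compile_uncached(program, target_type)  # miss: real compile
    cache[key] = obj  # the cache extracts bytes(obj.code)
    return obj
```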
Tests in ``test_program_compile_cache.py`` cover both halves of the contract: the wrapper-level miss/hit/error paths against a recording stub (verifying it's duck-typed and doesn't require subclassing ``ProgramCacheResource``), the rejection paths (name_expressions, extra_digest-required options, side-effect options, NVRTC ``options.name`` with a directory component), the PTX loadability warning on cache hit (positive: warns when the driver can't load the cached PTX; negative: stays quiet otherwise), and a real NVRTC end-to-end roundtrip using ``FileStreamProgramCache`` across reopen so the bytes match across processes.
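In the spirit of those wrapper tests, a minimal duck-typed stub -- a sketch, not the PR's actual fixture -- showing that any object with ``get`` and ``__setitem__`` satisfies the cache contract without subclassing ``ProgramCacheResource``:

```python
class RecordingStubCache:
    """Records every cache interaction; no ProgramCacheResource base
    needed -- the wrapper only calls get() and __setitem__()."""

    def __init__(self):
        self.store = {}
        self.events = []

    def get(self, key, default=None):
        self.events.append(("get", key))
        return self.store.get(key, default)

    def __setitem__(self, key, value):
        self.events.append(("set", key))
        # The wrapper hands over an ObjectCode; keep its raw bytes.
        self.store[key] = bytes(getattr(value, "code", value))
```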
Force-pushed from a60f1c6 to d177450.
Pushed d177450 addressing the first review wave: platformdirs dropped (
test_program_cache.py file split deferred. The 2179-line file is internally well-organised by section (key construction / InMemory / FileStream / multi-process), and after the source split the test reorganisation is mechanical churn rather than a clarity win. Happy to do it as a follow-up if you'd prefer. Multi-GPU question replied to in #issuecomment-4350987016.
Summary
Adds a persistent on-disk cache for `cuda.core.Program.compile` outputs. The high-level integration is one keyword on `Program.compile`. A second invocation with the same inputs short-circuits the entire NVRTC compile — `cache.get(key)` (one stat + one read) and an `ObjectCode._init` from the bytes; the underlying compile is never invoked. This is the fast path the cache exists to provide.
Public API
- `Program.compile(target_type, *, cache=...)` — convenience wrapper. Derives the key, returns a fresh `ObjectCode` on hit, stores the compile output on miss.
- `cuda.core.utils.ProgramCacheResource` — abstract bytes-in / bytes-out interface for custom backends. Provides `get`, `update` (Mapping or pairs), `clear`, and the mapping mutators (`__getitem__` / `__setitem__` / `__delitem__` / `__len__`). `__contains__` is intentionally omitted: `cache.get(key)` is the recommended idiom because the two-call `if key in cache: cache[key]` pattern is racy across processes.
- `cuda.core.utils.InMemoryProgramCache` — single-process LRU on `OrderedDict`, `threading.RLock`, size-only cap. For "compile once, look up many" workflows that don't need persistence.
- `cuda.core.utils.FileStreamProgramCache` — directory of atomic per-entry files. Safe across processes via `os.replace` + Windows sharing-violation retries on `os.replace` / read / `unlink`.
- `cuda.core.utils.make_program_cache_key` — escape hatch when the compile inputs require an `extra_digest` (`include_path`, `pre_include`, `pch`, `use_pch`, `pch_dir`, NVVM `use_libdevice=True`, NVRTC `options.name` with a directory component). `Program.compile(cache=...)` rejects those compiles with a `ValueError` pointing here.
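A hedged end-to-end sketch of the intended call pattern; the import paths and `ProgramOptions(arch=...)` argument are assumptions from this PR's docs, not verified against the merged API:

```python
from cuda.core.experimental import Program, ProgramOptions  # import paths assumed
from cuda.core.utils import FileStreamProgramCache

cache = FileStreamProgramCache()  # default dir under the user cache path
prog = Program(
    r'extern "C" __global__ void k(float* x) { x[0] = 1.0f; }',
    "c++",
    options=ProgramOptions(arch="sm_80"),
)
mod1 = prog.compile("cubin", cache=cache)  # miss: compiles, stores the bytes
mod2 = prog.compile("cubin", cache=cache)  # hit: one stat + one read, no NVRTC
assert bytes(mod1.code) == bytes(mod2.code)
```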
On-disk format
Each entry is the raw compiled binary verbatim — cubin / PTX / LTO-IR — with no pickle, JSON, length prefix, or framing of any kind. Cache files are directly consumable by external NVIDIA tools (`cuobjdump`, `nvdisasm`, `cuda-gdb`).
`ObjectCode.symbol_mapping` from `name_expressions` is not preserved across a cache round-trip; the wrapper rejects `Program.compile(name_expressions=..., cache=...)` outright so the first-call-works/second-call-breaks footgun can't surface. Callers that need `get_kernel(name_expression)` should compile without `cache=`.
FileStreamProgramCache
- Writes stage to `tmp/`, fsync, then `os.replace` into `entries/<2char>/<hash>`. Concurrent readers never observe partial writes. Windows `os.replace` retries on `ERROR_ACCESS_DENIED` / `ERROR_SHARING_VIOLATION` / `ERROR_LOCK_VIOLATION` (winerrors 5/32/33) within a bounded backoff (~185 ms); after the budget, the write is dropped and the next call recompiles. The same retry covers reads and `path.unlink` so eviction doesn't crash the writer that triggered it on win-64. `_is_windows_sharing_violation(exc)` filters `EACCES` only when `winerror` is absent — non-sharing winerrors are real config errors and propagate. Off-Windows `PermissionError` always propagates.
- `cache[key] = value` (and `cache.update({key: value, ...})`) accept raw bytes, bytearray, memoryview, or any `ObjectCode` (path-backed too — the file is read at write time so the cached entry is the binary content, not a path that could move). Reads return the same bytes that went in.
- `max_size_bytes` is the only knob — no element-count cap. `None` means unbounded.
- Reads call `os.utime` (fd-based on Linux/macOS via `os.supports_fd`, path-based on Windows) to bump `st_atime` regardless of mount options or `NtfsDisableLastAccessUpdate`. Eviction sorts by oldest `st_atime` first. The atime touch is stat-guarded so a racing rewriter's freshly-replaced file never has its mtime rolled back.
- `clear()`, `_enforce_size_cap()`, and the atime touch all snapshot `(ino, size, mtime_ns)` per entry and refuse to unlink / overwrite stamps if a writer replaced the file mid-operation.
- Key construction (`make_program_cache_key`): a backend-strategy pattern with one class per `code_type` (`_NvrtcBackend` / `_LinkerBackend` / `_NvvmBackend`). Each owns its own validate / encode_code / option_fingerprint / encode_name_expressions / hash_version_probe / hash_extra_payload. The orchestrator validates `code_type` / `target_type`, dispatches to the right backend, and assembles the digest in fixed order. Adding a new backend is one new class, not a five-place edit.
- `options.name` with a directory component: rejected without `extra_digest` because NVRTC resolves quoted `#include` directives relative to that directory — neighbour-header changes wouldn't invalidate the cache otherwise.
- PTX cache hits mirror the `RuntimeWarning` the uncached path emits — loadability depends on the driver, not on whether the bytes were freshly compiled.
- When `path` is omitted, the cache dir resolves via `platformdirs.user_cache_path("cuda-python", appauthor=False, opinion=False) / "program-cache"`: `$XDG_CACHE_HOME/cuda-python/program-cache` on Linux (default `~/.cache/cuda-python/program-cache`), `~/Library/Caches/cuda-python/program-cache` on macOS, `%LOCALAPPDATA%\cuda-python\program-cache` on Windows.
- `tmp/` self-heal: if something deletes `tmp/` after the cache is opened, the next write recreates it rather than crashing with `FileNotFoundError`.
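A condensed sketch of that stage-then-replace write path; the directory layout follows the description above, but the helper itself and its error handling are simplified assumptions:

```python
import os
import tempfile
from pathlib import Path


def _atomic_write(entries_dir: Path, tmp_dir: Path, hex_key: str, payload: bytes) -> None:
    # Stage into tmp/ on the same filesystem, fsync, then os.replace:
    # readers only ever see the old entry or the complete new one.
    tmp_dir.mkdir(parents=True, exist_ok=True)  # self-heal after an external wipe
    fd, staged = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        dest = entries_dir / hex_key[:2] / hex_key
        dest.parent.mkdir(parents=True, exist_ok=True)
        os.replace(staged, dest)  # atomic on POSIX and NTFS
    except BaseException:
        try:
            os.unlink(staged)  # drop the orphaned staging file on failure
        except OSError:
            pass
        raise
```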
Test plan
- `tests/test_program_cache.py` — abstract-class contract, `update` accepts mapping or pairs, transparent input-form equivalence (bytes / bytearray / memoryview / bytes-backed `ObjectCode` / path-backed `ObjectCode` all round-trip to the same on-disk bytes), `make_program_cache_key` semantics (deterministic, supported-target matrix mirrors `Program.compile`, backend probe failures fail closed but stable, env-version changes don't perturb the key on the wrong backends, options-fingerprint canonicalization for the linker path, side-effect / external-content / NVRTC `options.name`-dir-component guards, schema version mixing), filestream CRUD, atomic-write race coverage, stat-guarded prune / atime-touch / clear / size-cap, atime LRU promotes recently-read, default-dir uses `platformdirs`, `_is_windows_sharing_violation` predicate's truth table including the regression case (non-sharing winerror plus EACCES propagates), `tmp/` recreation after external wipe.
- `tests/test_program_cache_multiprocess.py` — concurrent writers on the same key, distinct keys, reader-vs-writer torn-file safety, size-cap eviction race (rewriter vs. churner) under stat-guarded eviction.
- `tests/test_program_compile_cache.py` — `Program.compile(cache=...)` miss/hit/error paths against a recording stub, `name_expressions` rejection, `extra_digest`-required / side-effect / NVRTC `options.name`-dir-component rejection, PTX loadability warning on cache hit (positive + negative), real-NVRTC end-to-end roundtrip across reopen.