The shared OCI cache at data_dir/system/oci-cache grew without bound
because neither the pull path nor the registry push path had a cleanup
hook. The image retention controller only touches data_dir/images, so
manifests and layer blobs that were no longer referenced lived forever.
This change adds a new lib/ocicachegc package that walks index.json and
every referenced manifest to build the live set of blob digests, then
deletes any file under blobs/sha256/ that is not in that set. Blobs
whose mtime is within the configured min_blob_age are kept; this grace
period is what lets the sweep run safely alongside concurrent pulls
(which write layer blobs before updating index.json) and registry
pushes.
Disabled by default. Enable via:
images:
  oci_cache_gc:
    enabled: true
    interval: 1h
    min_blob_age: 1h
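The PR mentions new duration validators for `interval` and `min_blob_age`. A minimal sketch of what such a check could look like follows; the `GCConfig` struct, its field names, and the choice to reject zero or negative durations are assumptions for illustration, not the PR's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// GCConfig mirrors the images.oci_cache_gc block. The struct and its field
// names are hypothetical; only the YAML keys come from the PR.
type GCConfig struct {
	Enabled    bool
	Interval   string
	MinBlobAge string
}

// Validate parses both durations and rejects values that are not positive.
// Whether the real validators also reject zero is an assumption.
func (c GCConfig) Validate() error {
	for name, raw := range map[string]string{
		"interval":     c.Interval,
		"min_blob_age": c.MinBlobAge,
	} {
		d, err := time.ParseDuration(raw)
		if err != nil {
			return fmt.Errorf("images.oci_cache_gc.%s: %w", name, err)
		}
		if d <= 0 {
			return fmt.Errorf("images.oci_cache_gc.%s must be positive, got %s", name, d)
		}
	}
	return nil
}

func main() {
	ok := GCConfig{Enabled: true, Interval: "1h", MinBlobAge: "1h"}
	fmt.Println(ok.Validate()) // <nil>

	bad := GCConfig{Enabled: true, Interval: "1h", MinBlobAge: "soon"}
	fmt.Println(bad.Validate() != nil) // true
}
```

Failing at startup on a malformed duration is cheaper than discovering it when the first sweep fires, which is presumably why the validation lives in the config package.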
Firetiger deploy monitoring skipped
This PR didn't match the auto-monitor filter configured on your GitHub connection.
Reason: PR adds a new garbage collection package for OCI cache management, but does not modify API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal), which are the specific areas the filter requires for monitoring. To monitor this PR anyway, reply with
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8c46b4d.
	if h, ok := digestHex(doc.Subject.Digest); ok {
		live[h] = struct{}{}
	}
}
Subject descriptor not recursed, risking data loss
High Severity
The subject descriptor is only added to the live set as a leaf, but unlike manifests entries (which are recursed via walkDescriptor), it is never walked to discover its own transitive references. Since subject points to another manifest in the OCI referrers model, that manifest's config and layers blobs won't be marked live and could be incorrectly garbage-collected. The PR description explicitly states the intent to "recurse into its config, layers, manifests, and subject references," but the implementation only records the subject blob's digest without descending into it.
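To make the report concrete, here is a small in-memory sketch contrasting the two behaviors. It is not the PR's code: `walk`, `manifestDoc`, `put`, and `liveCount` are hypothetical names, the blob store is a plain map, and the scenario (a referrer manifest that is the only root) is constructed for illustration. `digestHex` is reimplemented here under the assumption that it returns the hex part of a `sha256:` digest.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"strings"
)

type descriptor struct {
	Digest string `json:"digest"`
}

// manifestDoc holds the reference-bearing fields of manifests and indexes.
type manifestDoc struct {
	Config    *descriptor  `json:"config,omitempty"`
	Layers    []descriptor `json:"layers,omitempty"`
	Manifests []descriptor `json:"manifests,omitempty"`
	Subject   *descriptor  `json:"subject,omitempty"`
}

// digestHex extracts the hex part of a "sha256:<64 hex chars>" digest.
func digestHex(d string) (string, bool) {
	h := strings.TrimPrefix(d, "sha256:")
	return h, h != d && len(h) == 64
}

// walk marks d live and descends into it if it parses as a manifest.
// With recurseSubject=false it reproduces the reported leaf-only behavior.
func walk(blobs map[string][]byte, live map[string]struct{}, d string, recurseSubject bool) {
	h, ok := digestHex(d)
	if !ok {
		return
	}
	if _, seen := live[h]; seen {
		return // already visited; also guards against reference cycles
	}
	live[h] = struct{}{}
	var doc manifestDoc
	if raw, found := blobs[h]; !found || json.Unmarshal(raw, &doc) != nil {
		return // missing or unparseable blob: opaque leaf
	}
	if doc.Config != nil {
		walk(blobs, live, doc.Config.Digest, recurseSubject)
	}
	for _, l := range doc.Layers {
		walk(blobs, live, l.Digest, recurseSubject)
	}
	for _, m := range doc.Manifests {
		walk(blobs, live, m.Digest, recurseSubject)
	}
	if doc.Subject != nil {
		if recurseSubject {
			walk(blobs, live, doc.Subject.Digest, recurseSubject)
		} else if sh, ok := digestHex(doc.Subject.Digest); ok {
			live[sh] = struct{}{} // leaf only: its config and layers stay unmarked
		}
	}
}

// put stores content under its sha256 digest and returns "sha256:<hex>".
func put(blobs map[string][]byte, content []byte) string {
	sum := sha256.Sum256(content)
	h := hex.EncodeToString(sum[:])
	blobs[h] = content
	return "sha256:" + h
}

// liveCount builds a referrer -> subject -> {config, layer} graph and counts
// the blobs reachable from the referrer alone.
func liveCount(recurseSubject bool) int {
	blobs := map[string][]byte{}
	layer := put(blobs, []byte("layer bytes"))
	config := put(blobs, []byte(`{"architecture":"amd64"}`))
	// Marshal errors are ignored: these structs always encode.
	base, _ := json.Marshal(manifestDoc{Config: &descriptor{Digest: config}, Layers: []descriptor{{Digest: layer}}})
	ref, _ := json.Marshal(manifestDoc{Subject: &descriptor{Digest: put(blobs, base)}})
	live := map[string]struct{}{}
	walk(blobs, live, put(blobs, ref), recurseSubject)
	return len(live)
}

func main() {
	fmt.Println(liveCount(false), liveCount(true)) // 2 4
}
```

In the leaf-only case the subject manifest's config and layer are missing from the live set, so a sweep would be free to delete them once they age past the grace period.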


Summary
The shared OCI cache at `data_dir/system/oci-cache` currently grows without bound: neither the pull path (`layout.AppendImage`) nor the registry push path (`BlobStore.Put`) ever removes blobs, and the image retention controller only touches `data_dir/images`. Over time this accumulates dead manifest, config, and layer blobs that are no longer reachable from `index.json`.

This change adds a new `lib/ocicachegc` package that walks `index.json` and every referenced manifest to build the set of live blob digests, then deletes any file under `blobs/sha256/` that isn't in that set.

Blobs whose mtime is within the configured `min_blob_age` are always kept; that grace period is what lets the sweep run safely alongside concurrent pulls (which write layer blobs before updating `index.json`) and registry pushes (which rename `<hex>.tmp` → `<hex>` before the manifest trigger).
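The sweep half of this can be sketched as follows. This is an illustration under assumptions rather than the code in `lib/ocicachegc`: the `sweep` signature, the `blobName` pattern, and the `demoSweep` helper are hypothetical, and the live set is passed in already computed.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"strings"
	"time"
)

// blobName matches the 64-hex-char file names used under blobs/sha256.
var blobName = regexp.MustCompile(`^[0-9a-f]{64}$`)

// sweep deletes files under <cacheDir>/blobs/sha256 that are not in live and
// whose mtime is at least minAge old. It returns the names it deleted.
func sweep(cacheDir string, live map[string]struct{}, minAge time.Duration, now time.Time) ([]string, error) {
	dir := filepath.Join(cacheDir, "blobs", "sha256")
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var deleted []string
	for _, e := range entries {
		name := e.Name()
		if e.IsDir() || !blobName.MatchString(name) {
			continue // .tmp files and other names are never touched
		}
		if _, ok := live[name]; ok {
			continue // reachable from index.json
		}
		info, err := e.Info()
		if err != nil {
			continue // vanished between ReadDir and Info; nothing to do
		}
		if now.Sub(info.ModTime()) < minAge {
			continue // inside the grace period: may belong to an in-flight pull or push
		}
		if err := os.Remove(filepath.Join(dir, name)); err != nil {
			return deleted, err
		}
		deleted = append(deleted, name)
	}
	return deleted, nil
}

// demoSweep seeds a temp cache with a live blob, an old orphan, a recent
// orphan, and an old .tmp file, then sweeps with a 1h grace period.
func demoSweep() ([]string, error) {
	root, err := os.MkdirTemp("", "oci-cache-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	dir := filepath.Join(root, "blobs", "sha256")
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return nil, err
	}
	now := time.Now()
	old := now.Add(-2 * time.Hour)
	liveHex := strings.Repeat("a", 64)
	files := map[string]time.Time{
		liveHex:                          old,
		strings.Repeat("b", 64):          old, // old orphan
		strings.Repeat("c", 64):          now, // recent orphan
		strings.Repeat("d", 64) + ".tmp": old, // in-progress push
	}
	for name, mtime := range files {
		p := filepath.Join(dir, name)
		if err := os.WriteFile(p, []byte("x"), 0o644); err != nil {
			return nil, err
		}
		if err := os.Chtimes(p, mtime, mtime); err != nil {
			return nil, err
		}
	}
	return sweep(root, map[string]struct{}{liveHex: {}}, time.Hour, now)
}

func main() {
	deleted, err := demoSweep()
	fmt.Println(len(deleted), err) // 1 <nil>
}
```

Only the old orphan is removed. The recent orphan survives because a concurrent pull could still be about to reference it from `index.json`.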
Config
Disabled by default. Opt in via:
images:
  oci_cache_gc:
    enabled: true
    interval: 1h
    min_blob_age: 1h
How it decides what's live
1. Start from the descriptors listed in `index.json`.
2. If the blob is a manifest or manifest index, recurse into its `config`, `layers`, `manifests`, and `subject` references.
3. Unparseable or missing referenced blobs are treated as opaque leaves: they remain "live" but we don't descend into them. The collector never deletes a blob it cannot prove is dead.
4. `.tmp` files and anything whose name is not a 64-hex-char blob digest are ignored by the sweep entirely.
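The reachability rules above can be sketched as a worklist traversal over an on-disk layout. This is an assumption-laden illustration, not the package's implementation: `liveSet`, `document`, and `demoLiveSet` are hypothetical names, the demo uses placeholder digests that do not match the blob contents, and each blob is read whole for brevity where a real collector would likely bound the read.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

type descriptor struct {
	Digest string `json:"digest"`
}

// document covers index.json, manifest indexes, and image manifests.
type document struct {
	Config    *descriptor  `json:"config,omitempty"`
	Layers    []descriptor `json:"layers,omitempty"`
	Manifests []descriptor `json:"manifests,omitempty"`
	Subject   *descriptor  `json:"subject,omitempty"`
}

// refs lists every digest the document points at.
func (d document) refs() []string {
	var out []string
	for _, x := range append(append([]descriptor{}, d.Layers...), d.Manifests...) {
		out = append(out, x.Digest)
	}
	for _, p := range []*descriptor{d.Config, d.Subject} {
		if p != nil {
			out = append(out, p.Digest)
		}
	}
	return out
}

// liveSet returns the hex digests reachable from <cacheDir>/index.json.
func liveSet(cacheDir string) (map[string]struct{}, error) {
	raw, err := os.ReadFile(filepath.Join(cacheDir, "index.json"))
	if err != nil {
		return nil, err
	}
	var index document
	if err := json.Unmarshal(raw, &index); err != nil {
		return nil, err // assumption: without a live set the caller skips the sweep
	}
	live := map[string]struct{}{}
	for queue := index.refs(); len(queue) > 0; {
		d := queue[0]
		queue = queue[1:]
		h := strings.TrimPrefix(d, "sha256:")
		if _, seen := live[h]; seen || h == d || len(h) != 64 {
			continue // already visited, or not a sha256 digest
		}
		live[h] = struct{}{}
		var doc document
		blob, err := os.ReadFile(filepath.Join(cacheDir, "blobs", "sha256", h))
		if err != nil || json.Unmarshal(blob, &doc) != nil {
			continue // missing or unparseable: opaque leaf, still live
		}
		queue = append(queue, doc.refs()...)
	}
	return live, nil
}

// check panics on error; acceptable only because this is demo scaffolding.
func check(err error) {
	if err != nil {
		panic(err)
	}
}

// demoLiveSet seeds one manifest/config/layer set plus an orphan blob.
func demoLiveSet() int {
	root, err := os.MkdirTemp("", "oci-cache-")
	check(err)
	defer os.RemoveAll(root)
	blobDir := filepath.Join(root, "blobs", "sha256")
	check(os.MkdirAll(blobDir, 0o755))
	put := func(c, content string) string {
		h := strings.Repeat(c, 64) // placeholder digest, not a real hash
		check(os.WriteFile(filepath.Join(blobDir, h), []byte(content), 0o644))
		return "sha256:" + h
	}
	layer, config := put("a", "layer bytes"), put("b", `{"architecture":"amd64"}`)
	put("c", "orphan bytes") // never referenced
	manifest, _ := json.Marshal(document{Config: &descriptor{Digest: config}, Layers: []descriptor{{Digest: layer}}})
	index, _ := json.Marshal(document{Manifests: []descriptor{{Digest: put("d", string(manifest))}}})
	check(os.WriteFile(filepath.Join(root, "index.json"), index, 0o644))
	live, err := liveSet(root)
	check(err)
	return len(live)
}

func main() {
	fmt.Println(demoLiveSet()) // 3
}
```

Marking a digest live before trying to read it is what makes missing or unparseable blobs conservative: they count as live, and only the descent is skipped.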
Metrics
- `hypeman_oci_cache_gc_sweeps_total` (counter, status)
- `hypeman_oci_cache_gc_sweep_duration_seconds` (histogram)
- `hypeman_oci_cache_gc_deleted_blobs_total` (counter)
- `hypeman_oci_cache_gc_deleted_bytes_total` (counter)

Test plan
- `go test ./lib/ocicachegc/...` passes (live set kept, orphans deleted, grace period honored, tmp/non-blob filenames ignored, manifest index traversal)
- `go test ./cmd/api/config/...` passes (new duration validators)
- `go test ./lib/imageretention/...` passes (unchanged)
- `go build ./cmd/api/...` clean
- `go vet ./...` clean

Manual validation
- On `deft-kernel-dev`, ran the real `hypeman` binary from a fresh scratch clone with `images.oci_cache_gc.enabled: true` and an isolated temp `data_dir`.
- Seeded `data_dir/system/oci-cache` with one live manifest/config/layer set, one old orphan blob, and one recent orphan blob.
- Logs showed `oci cache gc enabled`, `oci cache gc started`, and `deleted unreferenced oci blob` for the old orphan digest.
- Ran `go mod download`, `make oapi-generate`, `make build`, `go run ./cmd/test-prewarm`, `go test -count=1 -tags containers_image_openpgp -timeout=20m ./...` (pass, 300s).

Note
Medium Risk
Adds a new background process that deletes files under `data_dir/system/oci-cache`, so misconfiguration or edge cases in live-set computation/grace-period timing could remove blobs still needed by pulls/pushes. Mitigated by being disabled by default and validating `interval`/`min_blob_age` durations.

Overview
Adds an opt-in mark-and-sweep garbage collector for the shared OCI cache (`data_dir/system/oci-cache`) to reclaim unreferenced blobs based on reachability from `index.json`, with a `min_blob_age` grace period to avoid racing concurrent writes.

Introduces new config block `images.oci_cache_gc` (defaults `enabled: false`, `interval: 1h`, `min_blob_age: 1h`), wires the collector into `cmd/api` startup as a supervised goroutine, and records new OTel metrics for sweep outcomes and reclaimed space. Tests and example configs/README are updated to cover defaults, validation, and collector behavior.