repository: make check and delete work at N=1 with sha256 pack ids#9783
repository: make check and delete work at N=1 with sha256 pack ids#9783mr-raj12 wants to merge 1 commit into
Conversation
check rebuilds the chunk index from object headers (real chunk_id) instead of the pack file name; delete resolves the pack via the chunk index, not pack_id==chunk_id.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #9783 +/- ##
=======================================
Coverage 84.88% 84.88%
=======================================
Files 92 92
Lines 15172 15180 +8
Branches 2275 2277 +2
=======================================
+ Hits 12879 12886 +7
- Misses 1589 1591 +2
+ Partials 704 703 -1 ☔ View full report in Codecov by Harness. |
| pytest.skip("only works locally") | ||
| check_cmd_setup(archiver) | ||
| with Repository(archiver.repository_location, exclusive=True) as repository: | ||
| # list() yields pack_ids (== the packs/* file names), so drop the packs directly via the |
There was a problem hiding this comment.
Repository.list yielded chunk ids. That it yields pack ids is just because it has not been adapted to packs yet, so that is a bug, not a feature.
| # embed the chunk_id in the object header so each object's bytes -- and thus its | ||
| # sha256 pack id -- are unique; identical bytes would otherwise collapse to a single | ||
| # pack under BORG_TESTONLY_SHA256_PACK_ID. | ||
| repository.put(H(x), fchunk(b"SOMEDATA", chunk_id=H(x))) # put() updates _chunks via PackWriter |
There was a problem hiding this comment.
better move that comment to fchunk definition.
| Each object's identity and extent are recovered from its on-disk header, so callers do | ||
| not have to assume the pack file name equals the chunk_id. That assumption only holds at | ||
| N=1 without sha256 pack ids; once a pack is named sha256(pack_bytes) (N>1, or with the | ||
| BORG_TESTONLY_SHA256_PACK_ID switch) the file name is the pack_id, not the chunk_id. | ||
| Works for a single-object pack (N=1) and for multi-object packs (N>1). | ||
| """ |
There was a problem hiding this comment.
you will have to fix/remove all that text talking about the N=1 packid==chunkid hack later.
| key = "packs/" + bin_to_hex(entry.pack_id) | ||
| try: | ||
| self.store.delete(key) | ||
| except StoreObjectNotFound: | ||
| raise self.ObjectNotFound(id, str(self._location)) from None | ||
| del self.chunks[id] |
There was a problem hiding this comment.
I repeat myself: do not introduce code that is very harmful for the general case.
The API user wants to delete one chunk and you delete the whole pack here.
Did you read my last review? Did you write this code yourself? Did you use AI and not read its output?
Maybe just do the index lookup, log a warning like "ignoring deletion of CHUNKID in PACKID" and just do nothing else for now until we have a better idea.
Also: TODO fix this later
Description
borg checkand per-object delete both assumed that a pack file's name is the chunk_id it holds. That is true only at N=1 while pack ids equal chunk ids. WithBORG_TESTONLY_SHA256_PACK_ID=1(and later at N>1) a pack is namedsha256(pack_bytes), so the file name is the pack_id, not the chunk_id.Two places broke under that:
Repository.check()rebuilt the chunk index from the barepacks/listing and used the file name as the chunk_id. With sha256 pack ids it indexed every object under the wrong id and wrote that wrong index back to the repo cache, so the archive check that runs next reported all chunks as missing.Repository.delete(id)deletedpacks/{id}, which does not exist once the pack is named by content.The object header already carries the chunk_id, meta size and data size (
repoobj.py), so check can recover the real(chunk_id, offset, size)of each object from the header instead of trusting the file name, and delete can find the pack through the chunk index.Changes:
repoobj.py: addRepoObj.iter_object_headers(), which walks a pack's object headers and yields(chunk_id, obj_offset, obj_size). Handles one object per pack (N=1) and several (N>1).repository.py:check()rebuilds the index fromiter_object_headers();delete()resolves the pack via the chunk index, drops it, and removes the id from the index.repository_test.py: give test objects a real chunk_id so identical payloads do not collapse into one pack under sha256, and make the twopack_id == chunk_idassertions aware of the switch.check_cmd_test.py:test_empty_repositorydrops packs through the store, sincelist()yields pack ids, not chunk ids.Verified with the local test subset, with and without
BORG_TESTONLY_SHA256_PACK_ID=1: the repository-level checks pass under the switch and nothing regresses without it. Archive-levelcheck --repairunder sha256 pack ids is left for a separate change.refs #8572
Checklist
master