feat: implement reproducibility validation pipeline in CI#5
feat: implement reproducibility validation pipeline in CI#5Muneerali199 wants to merge 9 commits intoAOSSIE-Org:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a cross-platform reproducibility CI workflow, a new manifest pipeline module for deterministic per-file hashing and manifest comparison, a reproducibility test script and checked-in hash, and unit tests plus public API exports for the pipeline and hashing utilities. Changes
Sequence Diagram(s)sequenceDiagram
participant GHA as GitHub Actions
participant Runner as OS Runner
participant Script as Repro Script / CI Job
participant Pipeline as Manifest Pipeline
participant FS as Filesystem
participant Hash as SHA-256 Service
participant Artifact as Artifact Storage
participant Compare as Hash Comparator
GHA->>Runner: Start matrix job (ubuntu/windows/macos)
Runner->>Script: Run reproducibility script / pipeline
Script->>Pipeline: Invoke run_pipeline / generate_manifest
Pipeline->>FS: Walk directory (posix-relative paths)
FS->>Pipeline: Return file list
Pipeline->>Hash: Normalize contents and compute per-file SHA-256
Hash->>Pipeline: Return per-file hashes
Pipeline->>Hash: Serialize deterministic manifest JSON and compute manifest hash
Pipeline->>Artifact: Upload `reproducible-hash-<os>` artifact
Compare->>Artifact: Download all `reproducible-hash-*`
Compare->>Compare: Compare hashes for bitwise equality
alt All match
Compare->>GHA: Success
else Any mismatch
Compare->>GHA: Fail CI
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@Muneerali199 most of the utilities from the Can you build upon the current setup and add some concrete tests to use in github actions. Current one doesnt seem that important |
…flow - Add comprehensive test suite for dataset hashing (8 tests) - Add pipeline integration tests (13 tests) - Add GitHub Actions workflow with cross-platform validation - Add CLI script for reproducibility testing - Add pipeline utilities with manifest generation
a4453c1 to
fe027c0
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/pipeline.py`:
- Around line 53-63: The validate_manifest_integrity function currently only
checks listed manifest entries and lets file-missing errors propagate; update it
so it (1) safely handles missing/unreadable files by catching
FileNotFoundError/OSError from compute_normalized_sha256/compute_sha256 and
returning False instead of raising, and (2) ensures there are no extra files in
the target directory by comparing the set of manifest paths (entry["path"]) to
the set of actual file paths under directory_path (e.g., using Path.rglob or
Path.iterdir and relative paths), returning False if any extra or missing files
are found, while still performing the existing per-entry hash comparison
(recomputed_hash) in the loop.
In `@scripts/test_reproducibility.py`:
- Around line 15-33: The manifest hashing in generate_manifest uses raw
compute_sha256 on files, causing divergence from production because it skips
openverifiablellm.pipeline's normalization; update generate_manifest to read
each file, pass its contents through the pipeline normalization routine (use the
normalization helper in openverifiablellm.pipeline — e.g.,
pipeline.normalize_content or the equivalent exported function) and compute the
SHA256 over the normalized bytes/string instead of the raw file bytes, keeping
the rest of the manifest fields the same; ensure get_manifest_hash still
canonicalizes JSON with sort_keys and separators as before so overall hashing
matches production semantics.
In `@tests/test_reproducibility_pipeline.py`:
- Around line 9-33: The test defines local duplicates of normalize_line_endings,
generate_manifest, and get_manifest_hash which shadow the production API; remove
these local implementations and instead import the canonical functions from
openverifiablellm.pipeline (e.g., from openverifiablellm.pipeline import
normalize_line_endings, generate_manifest, get_manifest_hash) and update any
test references to call those imported symbols so the tests exercise the real
production implementations.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (7)
.github/workflows/reproducibility-tests.ymlopenverifiablellm/__init__.pyopenverifiablellm/pipeline.pyreproducible_hash.txtscripts/test_reproducibility.pytests/test_dataset_hash.pytests/test_reproducibility_pipeline.py
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (3)
tests/test_reproducibility_pipeline.py (1)
9-33:⚠️ Potential issue | 🟠 MajorTests should call
openverifiablellm.pipelinedirectly, not local helper clones.These local implementations shadow production behavior, so the suite can miss regressions in shipped code paths (including normalization/hash behavior).
Proposed fix
-import json -import hashlib +import json +import hashlib @@ -from openverifiablellm import compute_sha256 +from openverifiablellm import compute_sha256 +from openverifiablellm.pipeline import ( + normalize_line_endings, + generate_manifest, + get_manifest_hash, +) @@ -def normalize_line_endings(content: str) -> str: - ... - -def generate_manifest(directory_path: Path) -> dict: - ... - -def get_manifest_hash(manifest: dict) -> str: - ...🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_reproducibility_pipeline.py` around lines 9 - 33, Replace the local test helpers with calls to the production implementations instead of shadowing them: remove or stop using the local normalize_line_endings, generate_manifest, and get_manifest_hash functions in tests and import them from the package (e.g. from openverifiablellm.pipeline import normalize_line_endings, generate_manifest, get_manifest_hash or import openverifiablellm.pipeline and reference pipeline.generate_manifest/etc.); ensure the tests call the production functions so normalization, hashing, and manifest generation behavior exercised in tests matches shipped code.openverifiablellm/pipeline.py (1)
53-63:⚠️ Potential issue | 🟠 Major
validate_manifest_integritycan miss tampering and break its boolean contract.It currently ignores extra files and may raise on missing/unreadable paths instead of returning
False.Proposed fix
def validate_manifest_integrity(manifest: Dict, directory_path: Union[str, Path]) -> bool: dir_path = Path(directory_path) - for entry in manifest["files"]: + manifest_files = manifest.get("files", []) + manifest_paths = {entry.get("path") for entry in manifest_files if "path" in entry} + actual_paths = { + str(p.relative_to(dir_path)).replace("\\", "/") + for p in dir_path.glob("**/*") + if p.is_file() + } + if manifest_paths != actual_paths: + return False + + for entry in manifest_files: file_path = dir_path / entry["path"] try: - recomputed_hash = compute_normalized_sha256(file_path) - except (UnicodeDecodeError, ValueError): - recomputed_hash = compute_sha256(file_path) + try: + recomputed_hash = compute_normalized_sha256(file_path) + except (UnicodeDecodeError, ValueError): + recomputed_hash = compute_sha256(file_path) + except (FileNotFoundError, OSError): + return False if entry["sha256"] != recomputed_hash: return False return True🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/pipeline.py` around lines 53 - 63, The validate_manifest_integrity function can raise on missing/unreadable files and currently ignores extra files; update it so it always returns a bool by (1) verifying the set of actual files under directory_path (relative paths) matches exactly the set of manifest["files"] paths (treat extra or missing files as integrity failure), and (2) when iterating entries call compute_normalized_sha256/compute_sha256 but catch any file-access errors (e.g., FileNotFoundError, PermissionError, UnicodeDecodeError, ValueError and any other IO-related exceptions) and return False instead of letting exceptions propagate; reference validate_manifest_integrity, manifest["files"], entry["path"], compute_normalized_sha256 and compute_sha256 when making these changes.scripts/test_reproducibility.py (1)
12-33:⚠️ Potential issue | 🔴 CriticalUse the production pipeline API for manifest/hash generation.
At Line 22, raw
compute_sha256hashing diverges fromopenverifiablellm/pipeline.pysemantics (normalized text hashing + canonical manifest flow). This undermines the CI reproducibility signal.Proposed fix
-from openverifiablellm import compute_sha256 +from openverifiablellm.pipeline import ( + generate_manifest, + get_manifest_hash, +) -def generate_manifest(directory_path: Path) -> dict: - ... - -def get_manifest_hash(manifest: dict) -> str: - ...🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/test_reproducibility.py` around lines 12 - 33, The current generate_manifest and get_manifest_hash use compute_sha256 directly which bypasses the production pipeline semantics; instead import and call the canonical pipeline functions (e.g., use openverifiablellm.pipeline.generate_manifest and openverifiablellm.pipeline.get_manifest_hash or the equivalent exported API) rather than compute_sha256 and the ad-hoc JSON hashing; update generate_manifest to delegate to the pipeline API for normalized text hashing and canonical manifest creation and replace get_manifest_hash to call the pipeline's manifest-hash function so CI uses the exact production manifest/hash logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/__init__.py`:
- Around line 12-21: The __all__ export list is unsorted and triggers RUF022;
sort the entries in the __all__ list into a deterministic order (e.g.,
alphabetical) so the export declarations are stable under lint checks—update the
__all__ array containing "compute_sha256", "compute_normalized_sha256",
"compare_manifests", "generate_manifest", "get_manifest_hash",
"normalize_line_endings", "run_pipeline", and "validate_manifest_integrity" to
be in sorted order.
In `@scripts/test_reproducibility.py`:
- Around line 64-72: run_all_tests currently ignores CLI options by using
defaults; change its signature to accept runs and seed parameters (e.g., def
run_all_tests(runs: int, seed: int) -> Dict[str, bool]) and pass those values
into test_reproducibility_multiple_runs (and any other test helper it calls).
Then update the caller in main() to forward the parsed CLI --runs and --seed
into run_all_tests. Also update the other test invocations inside run_all_tests
(the calls around the 91-95 area) so they receive and propagate runs/seed as
needed to ensure CLI values are used throughout.
---
Duplicate comments:
In `@openverifiablellm/pipeline.py`:
- Around line 53-63: The validate_manifest_integrity function can raise on
missing/unreadable files and currently ignores extra files; update it so it
always returns a bool by (1) verifying the set of actual files under
directory_path (relative paths) matches exactly the set of manifest["files"]
paths (treat extra or missing files as integrity failure), and (2) when
iterating entries call compute_normalized_sha256/compute_sha256 but catch any
file-access errors (e.g., FileNotFoundError, PermissionError,
UnicodeDecodeError, ValueError and any other IO-related exceptions) and return
False instead of letting exceptions propagate; reference
validate_manifest_integrity, manifest["files"], entry["path"],
compute_normalized_sha256 and compute_sha256 when making these changes.
In `@scripts/test_reproducibility.py`:
- Around line 12-33: The current generate_manifest and get_manifest_hash use
compute_sha256 directly which bypasses the production pipeline semantics;
instead import and call the canonical pipeline functions (e.g., use
openverifiablellm.pipeline.generate_manifest and
openverifiablellm.pipeline.get_manifest_hash or the equivalent exported API)
rather than compute_sha256 and the ad-hoc JSON hashing; update generate_manifest
to delegate to the pipeline API for normalized text hashing and canonical
manifest creation and replace get_manifest_hash to call the pipeline's
manifest-hash function so CI uses the exact production manifest/hash logic.
In `@tests/test_reproducibility_pipeline.py`:
- Around line 9-33: Replace the local test helpers with calls to the production
implementations instead of shadowing them: remove or stop using the local
normalize_line_endings, generate_manifest, and get_manifest_hash functions in
tests and import them from the package (e.g. from openverifiablellm.pipeline
import normalize_line_endings, generate_manifest, get_manifest_hash or import
openverifiablellm.pipeline and reference pipeline.generate_manifest/etc.);
ensure the tests call the production functions so normalization, hashing, and
manifest generation behavior exercised in tests matches shipped code.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
openverifiablellm/__init__.pyopenverifiablellm/pipeline.pyscripts/test_reproducibility.pytests/test_dataset_hash.pytests/test_reproducibility_pipeline.py
| __all__ = [ | ||
| "compute_sha256", | ||
| "generate_manifest", | ||
| "get_manifest_hash", | ||
| "run_pipeline", | ||
| "compute_normalized_sha256", | ||
| "normalize_line_endings", | ||
| "validate_manifest_integrity", | ||
| "compare_manifests", | ||
| ] |
There was a problem hiding this comment.
Sort __all__ to keep export declarations stable under lint checks.
The current list order triggers Ruff RUF022; sort entries (or keep a deliberate deterministic order consistent with your lint config) to avoid CI noise.
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 12-21: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/__init__.py` around lines 12 - 21, The __all__ export list
is unsorted and triggers RUF022; sort the entries in the __all__ list into a
deterministic order (e.g., alphabetical) so the export declarations are stable
under lint checks—update the __all__ array containing "compute_sha256",
"compute_normalized_sha256", "compare_manifests", "generate_manifest",
"get_manifest_hash", "normalize_line_endings", "run_pipeline", and
"validate_manifest_integrity" to be in sorted order.
- Sort __all__ exports alphabetically (RUF022) - Update run_all_tests to accept runs/seed parameters - Use production pipeline functions in test_reproducibility.py - Fix validate_manifest_integrity to check extra/missing files and handle exceptions - Replace local test helpers with production implementations in tests
|
sir could you review it now @Archit381 |
|
@Muneerali199 pls use already existing functions defined in utilts and dont push any scripts folder yet just for tests. Use the tests folder Also, before pushing do pull from main once for updates |
- Add 7 test cases covering identical, added, removed, modified, mixed scenarios, and empty manifests - Uses existing compare_manifests function from pipeline module - Follows existing test patterns in the codebase
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/pipeline.py`:
- Around line 29-33: The fallback calls to compute_sha256 are using positional
args but the function is defined with a leading * (all keyword-only), causing a
TypeError at runtime; update the two fallback calls in pipeline.py (the
try/except around compute_normalized_sha256 and the later fallback) to call
compute_sha256 with the correct keyword (e.g., compute_sha256(filename=file) or
compute_sha256(data=file) matching the function signature in utils.py) so the
fallback hashing works for binary/non-UTF8 files.
In `@scripts/test_reproducibility.py`:
- Around line 30-39: The test_reproducibility routine currently allows runs <= 0
which makes the final check (all_match) true with an empty hashes list and
returns a false-success; add validation at the start of the function or before
the loop to reject non-positive runs (e.g., raise ValueError or exit with an
error) when runs <= 0, ensuring you check the runs parameter used by the for i
in range(runs) loop and prevent returning hashes[0] when hashes is empty; update
any callers or CLI parsing that populate runs to enforce this same validation.
- Around line 69-76: The code is calling test_reproducibility_multiple_runs
again after run_all_tests, duplicating work and risking inconsistent results;
modify the logic so run_all_tests returns the reproducibility outcome (e.g.,
success, hashes, final_hash) or add an output flag to run_all_tests to obtain
the hash, and then use that returned final_hash to print REPRODUCIBLE_HASH and
write reproducible_hash.txt instead of calling
test_reproducibility_multiple_runs again; update references to run_all_tests in
the main branch where args.test == "all" to consume those returned values and
remove the extra call to test_reproducibility_multiple_runs.
In `@tests/test_reproducibility_pipeline.py`:
- Around line 224-226: The test calls compute_sha256 positionally which now
requires the keyword argument file_path; update the call in
tests/test_reproducibility_pipeline.py so compute_sha256 is invoked as
compute_sha256(file_path=binary_path) (and update any related uses of
binary_hash if necessary), ensuring the function name compute_sha256 and the
variable binary_path are the referenced identifiers to locate and fix the call.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
openverifiablellm/__init__.pyopenverifiablellm/pipeline.pyscripts/test_reproducibility.pytests/test_reproducibility_pipeline.py
- pipeline.py: Use keyword arg for compute_sha256 fallback calls - test_reproducibility.py: Add validation for runs <= 0 - test_reproducibility.py: Remove duplicate test_reproducibility_multiple_runs call - test_reproducibility_pipeline.py: Use keyword arg for compute_sha256
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/pipeline.py`:
- Around line 43-45: get_manifest_hash currently hashes the entire manifest
including per-entry raw "size" metadata which can vary across OSes; modify
get_manifest_hash to compute the hash from a deterministic, normalized version
of the manifest by deep-copying the input manifest, removing/omitting any raw
"size" fields from entries (e.g., each file entry) and any other
non-content-deterministic metadata, then JSON-dump that cleaned manifest with
sort_keys=True and separators=(",", ":") before hashing; update references to
the function to continue accepting the same manifest argument but ensure the
hashing uses this cleaned_manifest inside get_manifest_hash.
In `@scripts/test_reproducibility.py`:
- Line 28: The function name test_reproducibility_multiple_runs() conflicts with
pytest collection; rename it to a non-test-prefixed identifier (e.g.,
reproducibility_multiple_runs or _reproducibility_multiple_runs) and update
every internal reference/call in the same file (including the other occurrences
noted around lines corresponding to 51-52 and 81) to use the new name so the
helper is not discovered as a test by pytest.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
openverifiablellm/pipeline.pyscripts/test_reproducibility.pytests/test_reproducibility_pipeline.py
- pipeline.py: get_manifest_hash now removes 'size' field for deterministic hashing - test_reproducibility.py: renamed test_reproducibility_multiple_runs to _reproducibility_multiple_runs
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (2)
scripts/test_reproducibility.py (1)
81-81: 🧹 Nitpick | 🔵 TrivialPrefix unused variable with underscore.
Same as Line 51 -
hashesis unpacked but unused.♻️ Proposed fix
- success, hashes, final_hash = _reproducibility_multiple_runs(args.runs, args.seed) + success, _hashes, final_hash = _reproducibility_multiple_runs(args.runs, args.seed)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/test_reproducibility.py` at line 81, The variable "hashes" returned from _reproducibility_multiple_runs is unpacked but unused; update the unpacking at the call site (the line assigning success, hashes, final_hash) to prefix the unused variable with an underscore (e.g., success, _hashes, final_hash) so the unused value is clearly marked and linter warnings are avoided while keeping the function call to _reproducibility_multiple_runs intact.openverifiablellm/pipeline.py (1)
70-70:⚠️ Potential issue | 🟡 MinorInconsistent access pattern may raise KeyError on malformed manifest.
Line 60 uses
manifest.get("files", [])defensively, but Line 70 accessesmanifest["files"]directly. If a malformed manifest without"files"key is passed and the directory is empty, this will raiseKeyError.🔧 Proposed fix
- for entry in manifest["files"]: + for entry in manifest.get("files", []):🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/pipeline.py` at line 70, The loop currently uses manifest["files"] which can raise KeyError for malformed manifests; change the iteration to use the same defensive access as earlier by iterating over manifest.get("files", []) (or coalescing to an empty list) so the loop in pipeline.py (the for entry in ... loop) will safely handle missing or empty "files" keys without throwing; update any related usages in the surrounding function to consistently use manifest.get("files", []) as the source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/pipeline.py`:
- Line 77: Update the exception handling on the except clause in
openverifiablellm/pipeline.py: remove the deprecated IOError from the exception
tuple so it reads either except (FileNotFoundError, PermissionError, OSError):
or simply except OSError:; modify the existing except block (the try/except
around file access in pipeline.py at the shown except clause) to drop IOError
and keep explicit subclasses if you prefer readability, ensuring no other
references to IOError remain in that handler.
In `@scripts/test_reproducibility.py`:
- Around line 76-77: The file writes for the reproducible hash omit an explicit
encoding; update the two open(...) calls that write "reproducible_hash.txt" (the
block that writes final_hash and the other occurrence around lines 84-85) to
pass encoding="utf-8" (e.g., with open("reproducible_hash.txt", "w",
encoding="utf-8") as f) so the hash is written with a consistent UTF-8 encoding
across platforms.
- Line 51: The unpacked but unused variable `hashes` returned from
_reproducibility_multiple_runs should be prefixed with an underscore to satisfy
static analysis; change the assignment `success, hashes, final_hash =
_reproducibility_multiple_runs(runs, seed)` to use `success, _hashes, final_hash
= _reproducibility_multiple_runs(runs, seed)` (or `_, hashes, final_hash` as
preferred) so the unused value is clearly marked.
---
Duplicate comments:
In `@openverifiablellm/pipeline.py`:
- Line 70: The loop currently uses manifest["files"] which can raise KeyError
for malformed manifests; change the iteration to use the same defensive access
as earlier by iterating over manifest.get("files", []) (or coalescing to an
empty list) so the loop in pipeline.py (the for entry in ... loop) will safely
handle missing or empty "files" keys without throwing; update any related usages
in the surrounding function to consistently use manifest.get("files", []) as the
source of truth.
In `@scripts/test_reproducibility.py`:
- Line 81: The variable "hashes" returned from _reproducibility_multiple_runs is
unpacked but unused; update the unpacking at the call site (the line assigning
success, hashes, final_hash) to prefix the unused variable with an underscore
(e.g., success, _hashes, final_hash) so the unused value is clearly marked and
linter warnings are avoided while keeping the function call to
_reproducibility_multiple_runs intact.
openverifiablellm/pipeline.py
Outdated
| recomputed_hash = compute_normalized_sha256(file_path) | ||
| except (UnicodeDecodeError, ValueError): | ||
| recomputed_hash = compute_sha256(file_path=file_path) | ||
| except (FileNotFoundError, PermissionError, OSError, IOError): |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Replace deprecated IOError with OSError.
IOError is a deprecated alias for OSError since Python 3.3. The exception tuple already includes OSError, making IOError redundant.
♻️ Proposed fix
- except (FileNotFoundError, PermissionError, OSError, IOError):
+ except (FileNotFoundError, PermissionError, OSError):Note: FileNotFoundError and PermissionError are also subclasses of OSError, so the entire tuple could be simplified to just except OSError:, though explicit subclasses improve readability.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except (FileNotFoundError, PermissionError, OSError, IOError): | |
| except (FileNotFoundError, PermissionError, OSError): |
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 77-77: Replace aliased errors with OSError
Replace with builtin OSError
(UP024)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/pipeline.py` at line 77, Update the exception handling on
the except clause in openverifiablellm/pipeline.py: remove the deprecated
IOError from the exception tuple so it reads either except (FileNotFoundError,
PermissionError, OSError): or simply except OSError:; modify the existing except
block (the try/except around file access in pipeline.py at the shown except
clause) to drop IOError and keep explicit subclasses if you prefer readability,
ensuring no other references to IOError remain in that handler.
scripts/test_reproducibility.py
Outdated
| results = {} | ||
|
|
||
| print("\n1. Testing reproducibility across multiple runs...") | ||
| success, hashes, final_hash = _reproducibility_multiple_runs(runs, seed) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Prefix unused variable with underscore.
Static analysis correctly identifies hashes is unpacked but unused.
♻️ Proposed fix
- success, hashes, final_hash = _reproducibility_multiple_runs(runs, seed)
+ success, _hashes, final_hash = _reproducibility_multiple_runs(runs, seed)🧰 Tools
🪛 Ruff (0.15.2)
[warning] 51-51: Unpacked variable hashes is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/test_reproducibility.py` at line 51, The unpacked but unused variable
`hashes` returned from _reproducibility_multiple_runs should be prefixed with an
underscore to satisfy static analysis; change the assignment `success, hashes,
final_hash = _reproducibility_multiple_runs(runs, seed)` to use `success,
_hashes, final_hash = _reproducibility_multiple_runs(runs, seed)` (or `_,
hashes, final_hash` as preferred) so the unused value is clearly marked.
scripts/test_reproducibility.py
Outdated
| with open("reproducible_hash.txt", "w") as f: | ||
| f.write(final_hash) |
There was a problem hiding this comment.
Specify UTF-8 encoding when writing hash file for cross-platform consistency.
The open() calls don't specify encoding, which defaults to platform-dependent encoding on some systems. For reproducibility across OSes, explicitly use UTF-8.
🔧 Proposed fix
- with open("reproducible_hash.txt", "w") as f:
+ with open("reproducible_hash.txt", "w", encoding="utf-8") as f:
f.write(final_hash)- with open("reproducible_hash.txt", "w") as f:
+ with open("reproducible_hash.txt", "w", encoding="utf-8") as f:
f.write(final_hash)Also applies to: 84-85
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/test_reproducibility.py` around lines 76 - 77, The file writes for
the reproducible hash omit an explicit encoding; update the two open(...) calls
that write "reproducible_hash.txt" (the block that writes final_hash and the
other occurrence around lines 84-85) to pass encoding="utf-8" (e.g., with
open("reproducible_hash.txt", "w", encoding="utf-8") as f) so the hash is
written with a consistent UTF-8 encoding across platforms.
- pipeline.py: use manifest.get() for defensive access, remove deprecated IOError - test_reproducibility.py: add encoding utf-8 to file writes, prefix unused hashes
|
sir @Archit381 ,look at it now and let me know if further changes are required |
Overview
OpenVerifiableLLM is designed to guarantee deterministic preprocessing and reproducible dataset fingerprints.
This PR introduces a comprehensive reproducibility validation framework that enforces bitwise-identical dataset fingerprints across:
This converts reproducibility from a design assumption into a continuously enforced property via automated CI.
Motivation
Determinism can silently degrade due to subtle implementation changes such as:
LFvsCRLF)Without automated enforcement, these regressions are difficult to detect during code review.
Given that reproducibility is central to OpenVerifiableLLM’s mission, it must be continuously validated.
Changes Introduced
1. Deterministic Manifest Generation Improvements
File:
openverifiablellm/pipeline.pyCRLF → LF) before hashing.2. Synthetic Dataset Generator
File:
openverifiablellm/synthetic_data.py3. Cross-Platform Reproducibility CI Workflow
File:
.github/workflows/reproducibility.ymlIntroduced a multi-layer validation pipeline:
Multi-Run Validation (Same OS)
Cross-OS Matrix Validation
Runs on:
ubuntu-latestwindows-latestmacos-latestArtifact Hash Comparison
compare-hashesjob asserts strict equality.4. Dedicated Testing Suite
File:
tests/test_reproducibility.pyAdded unit tests covering:
5. Local Reproducibility Validation Script
File:
scripts/reproducibility_check.pyAdded standalone verification utility:
Run locally using:
python scripts/reproducibility_check.pyThis allows contributors to confirm that modifications do not break deterministic hashing before opening a PR.
Verification
Local Testing
Executed:
python -m pytestAll tests passed, including new reproducibility enforcement tests.
Manual Cross-Run Validation (Windows)
Output:
SUCCESS: Reproducibility confirmed across 3 runs.
FINAL ROOT HASH: 4f7350f980174ea96a8b2e469777d945dc5f3ae2eb8b0fb954a8e447b9b9b401
CI Enforcement
Workflow triggers on:
maindevelopCI will now fail automatically if reproducibility is broken.
Impact
This PR significantly strengthens OpenVerifiableLLM by:
Reproducibility is now mathematically validated, not assumed.
Design Principles Followed
Future Improvements
Conclusion
This PR establishes a foundational enforcement mechanism for deterministic dataset fingerprinting.
OpenVerifiableLLM now moves from theoretical reproducibility guarantees to continuously verified reproducibility guarantees.
Feedback and review are welcome.
closed #6
Summary by CodeRabbit
New Features
Tests
Chores