
feat: implement reproducibility validation pipeline in CI#5

Closed
Muneerali199 wants to merge 9 commits into AOSSIE-Org:main from Muneerali199:feat/reproducibility-pipeline

Conversation


Muneerali199 commented Feb 22, 2026

Overview

OpenVerifiableLLM is designed to guarantee deterministic preprocessing and reproducible dataset fingerprints.

This PR introduces a comprehensive reproducibility validation framework that enforces bitwise-identical dataset fingerprints across:

  • Multiple executions
  • Different operating systems (Ubuntu, Windows, macOS)
  • Fresh synthetic dataset generations

This converts reproducibility from a design assumption into a continuously enforced property via automated CI.


Motivation

Determinism can silently degrade due to subtle implementation changes such as:

  • Unordered filesystem traversal
  • Non-canonical JSON serialization
  • OS-specific newline differences (LF vs CRLF)
  • Encoding inconsistencies
  • Parallel execution ordering differences
  • Dependency version changes
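To make the serialization pitfall concrete, here is a minimal, generic illustration (not code from this PR) of how non-canonical JSON alone shifts a fingerprint, and how pinning `sort_keys` and `separators` removes the ambiguity:

```python
import hashlib
import json

record = {"b": 1, "a": 2}
reordered = {"a": 2, "b": 1}

# Default json.dumps preserves insertion order, so two logically
# equal dicts can serialize (and therefore hash) differently.
h1 = hashlib.sha256(json.dumps(record).encode("utf-8")).hexdigest()
h2 = hashlib.sha256(json.dumps(reordered).encode("utf-8")).hexdigest()
assert h1 != h2

def canonical(obj) -> str:
    # Canonical form: sorted keys, no extraneous whitespace.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

# Canonical serialization is insertion-order independent.
assert canonical(record) == canonical(reordered)
```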

Without automated enforcement, these regressions are difficult to detect during code review.

Given that reproducibility is central to OpenVerifiableLLM’s mission, it must be continuously validated.


Changes Introduced

1. Deterministic Manifest Generation Improvements

File: openverifiablellm/pipeline.py

  • Implemented recursive directory traversal with strict relative path sorting.
  • Normalized path separators to forward slashes for cross-platform consistency.
  • Added newline normalization (CRLF → LF) before hashing.
  • Enforced canonical JSON serialization:
    • Sorted keys
    • No extraneous whitespace
    • Stable formatting
  • Ensured SHA256 root fingerprint is stable across environments.

2. Synthetic Dataset Generator

File: openverifiablellm/synthetic_data.py

  • Added deterministic synthetic dataset generator.
  • Generates predictable file structure and content.
  • Used as a CI baseline for reproducibility validation.
  • Ensures environment-independent benchmarking.

3. Cross-Platform Reproducibility CI Workflow

File: .github/workflows/reproducibility.yml

Introduced a multi-layer validation pipeline:

Multi-Run Validation (Same OS)

  • Executes manifest generation 3 times.
  • Verifies identical fingerprint across runs.

Cross-OS Matrix Validation

Runs on:

  • ubuntu-latest
  • windows-latest
  • macos-latest

Artifact Hash Comparison

  • Uploads root hash artifacts from each OS.
  • Final compare-hashes job asserts strict equality.
  • Fails CI immediately if any mismatch is detected.

4. Dedicated Testing Suite

File: tests/test_reproducibility.py

Added unit tests covering:

  • Multi-run determinism
  • Fresh dataset regeneration consistency
  • Newline normalization validation
  • Deterministic file ordering enforcement
  • Manifest canonical serialization integrity

5. Local Reproducibility Validation Script

File: scripts/reproducibility_check.py

Added standalone verification utility:

Run locally using:

python scripts/reproducibility_check.py

This allows contributors to confirm that modifications do not break deterministic hashing before opening a PR.


Verification

Local Testing

Executed:

python -m pytest

All tests passed, including new reproducibility enforcement tests.

Manual Cross-Run Validation (Windows)

Output:

SUCCESS: Reproducibility confirmed across 3 runs.
FINAL ROOT HASH: 4f7350f980174ea96a8b2e469777d945dc5f3ae2eb8b0fb954a8e447b9b9b401

CI Enforcement

Workflow triggers on:

  • Push to main
  • Push to develop
  • Pull Requests targeting those branches

CI will now fail automatically if reproducibility is broken.


Impact

This PR significantly strengthens OpenVerifiableLLM by:

  • Preventing silent determinism regressions
  • Enforcing cross-platform reproducibility
  • Increasing contributor confidence
  • Enhancing long-term reliability
  • Aligning infrastructure with project philosophy

Reproducibility is now continuously validated in CI, not merely assumed.


Design Principles Followed

  • Deterministic file traversal
  • Canonical serialization
  • Byte-level hashing consistency
  • Cross-platform normalization
  • CI-first enforcement strategy
  • Minimal performance overhead

Future Improvements

  • Larger-scale synthetic dataset stress tests
  • Manifest diff inspection tool
  • Performance benchmarking suite
  • Parallel processing determinism validation
  • Cryptographic manifest signing layer

Conclusion

This PR establishes a foundational enforcement mechanism for deterministic dataset fingerprinting.

OpenVerifiableLLM now moves from theoretical reproducibility guarantees to continuously verified reproducibility guarantees.

Feedback and review are welcome.

Closes #6

Summary by CodeRabbit

  • New Features

    • Public API now exposes manifest generation, validation, comparison, and reproducible hashing utilities.
    • Added a CLI to run reproducibility checks and optionally emit a canonical reproducible hash.
  • Tests

    • Added comprehensive tests covering hashing, manifest reproducibility, integrity, large and binary files, and cross-platform behavior.
  • Chores

    • Added automated cross-platform reproducibility workflow and a canonical reproducible hash file.


coderabbitai bot commented Feb 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a cross-platform reproducibility CI workflow, a new manifest pipeline module for deterministic per-file hashing and manifest comparison, a reproducibility test script and checked-in hash, and unit tests plus public API exports for the pipeline and hashing utilities.

Changes

  • CI Workflow (.github/workflows/reproducibility-tests.yml): New GitHub Actions workflow with three jobs: dataset hashing, a cross-platform matrix (ubuntu/windows/macos) producing per-OS reproducible-hash-<os> artifacts, and a comparator job that downloads and compares the hashes, failing on mismatches.
  • Manifest Pipeline (openverifiablellm/pipeline.py): New module implementing line-ending normalization, normalized per-file SHA-256 (with a decode fallback), deterministic manifest generation (posix paths, sorted), manifest hash computation, integrity validation, and manifest comparison utilities.
  • Public API Exports (openverifiablellm/__init__.py): Re-exports compute_sha256 (from utils) and the pipeline functions generate_manifest, get_manifest_hash, run_pipeline, compute_normalized_sha256, normalize_line_endings, validate_manifest_integrity, and compare_manifests; updates __all__.
  • Reproducibility Script & Artifact (scripts/test_reproducibility.py, reproducible_hash.txt): Adds a CLI script to build deterministic test datasets, run multi-run reproducibility checks, and optionally emit the final reproducible hash; reproducible_hash.txt contains the checked-in expected hash.
  • Unit Tests, Hashing (tests/test_dataset_hash.py): New tests for compute_sha256: correctness, chunked streaming, empty/binary/large files, FileNotFound handling, and Path vs string parity.
  • Unit Tests, Reproducibility Pipeline (tests/test_reproducibility_pipeline.py): Extensive tests validating deterministic manifest generation, stable manifest hashing, change detection (add/modify/remove), empty/large datasets, path normalization (posix paths), integrity verification, JSON serialization consistency, binary file handling, and end-to-end pipeline consistency.

Sequence Diagram(s)

sequenceDiagram
    participant GHA as GitHub Actions
    participant Runner as OS Runner
    participant Script as Repro Script / CI Job
    participant Pipeline as Manifest Pipeline
    participant FS as Filesystem
    participant Hash as SHA-256 Service
    participant Artifact as Artifact Storage
    participant Compare as Hash Comparator

    GHA->>Runner: Start matrix job (ubuntu/windows/macos)
    Runner->>Script: Run reproducibility script / pipeline
    Script->>Pipeline: Invoke run_pipeline / generate_manifest
    Pipeline->>FS: Walk directory (posix-relative paths)
    FS->>Pipeline: Return file list
    Pipeline->>Hash: Normalize contents and compute per-file SHA-256
    Hash->>Pipeline: Return per-file hashes
    Pipeline->>Hash: Serialize deterministic manifest JSON and compute manifest hash
    Pipeline->>Artifact: Upload `reproducible-hash-<os>` artifact
    Compare->>Artifact: Download all `reproducible-hash-*`
    Compare->>Compare: Compare hashes for bitwise equality
    alt All match
        Compare->>GHA: Success
    else Any mismatch
        Compare->>GHA: Fail CI
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 I hopped through files and fixed each line ending,
seeds and hashes set, no more pretending.
Across Linux, Mac, and Windows I prance,
CI checks the fingerprints—one true dance.
Carrots for determinism, hop-hop-hurray! 🥕

🚥 Pre-merge checks: ✅ 4 passed

  • Description Check: Passed (check skipped because CodeRabbit's high-level summary is enabled).
  • Title Check: Passed. The title clearly and accurately describes the main change: implementation of a reproducibility validation pipeline for CI.
  • Linked Issues Check: Passed. The PR implements all core coding requirements from issue #6: a pipeline module with deterministic manifest generation, a cross-platform validation workflow, multi-run checks, and test coverage.
  • Out of Scope Changes Check: Passed. All changes directly support the reproducibility pipeline objective; the public API exports, pipeline functions, tests, and CI workflow are all within scope.


@Archit381

@Muneerali199 most of the utilities in openverifiablellm/pipeline.py have already been added in previous PRs.

Can you build upon the current setup and add some concrete tests to use in GitHub Actions? The current one doesn't seem that important.

…flow

- Add comprehensive test suite for dataset hashing (8 tests)
- Add pipeline integration tests (13 tests)
- Add GitHub Actions workflow with cross-platform validation
- Add CLI script for reproducibility testing
- Add pipeline utilities with manifest generation
Muneerali199 force-pushed the feat/reproducibility-pipeline branch from a4453c1 to fe027c0 on February 27, 2026 at 09:57
@github-actions github-actions bot added documentation and size/XL labels and removed size/L and size/XL labels Feb 27, 2026

coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/pipeline.py`:
- Around line 53-63: The validate_manifest_integrity function currently only
checks listed manifest entries and lets file-missing errors propagate; update it
so it (1) safely handles missing/unreadable files by catching
FileNotFoundError/OSError from compute_normalized_sha256/compute_sha256 and
returning False instead of raising, and (2) ensures there are no extra files in
the target directory by comparing the set of manifest paths (entry["path"]) to
the set of actual file paths under directory_path (e.g., using Path.rglob or
Path.iterdir and relative paths), returning False if any extra or missing files
are found, while still performing the existing per-entry hash comparison
(recomputed_hash) in the loop.

In `@scripts/test_reproducibility.py`:
- Around line 15-33: The manifest hashing in generate_manifest uses raw
compute_sha256 on files, causing divergence from production because it skips
openverifiablellm.pipeline's normalization; update generate_manifest to read
each file, pass its contents through the pipeline normalization routine (use the
normalization helper in openverifiablellm.pipeline — e.g.,
pipeline.normalize_content or the equivalent exported function) and compute the
SHA256 over the normalized bytes/string instead of the raw file bytes, keeping
the rest of the manifest fields the same; ensure get_manifest_hash still
canonicalizes JSON with sort_keys and separators as before so overall hashing
matches production semantics.

In `@tests/test_reproducibility_pipeline.py`:
- Around line 9-33: The test defines local duplicates of normalize_line_endings,
generate_manifest, and get_manifest_hash which shadow the production API; remove
these local implementations and instead import the canonical functions from
openverifiablellm.pipeline (e.g., from openverifiablellm.pipeline import
normalize_line_endings, generate_manifest, get_manifest_hash) and update any
test references to call those imported symbols so the tests exercise the real
production implementations.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cb91e9 and fe027c0.

📒 Files selected for processing (7)
  • .github/workflows/reproducibility-tests.yml
  • openverifiablellm/__init__.py
  • openverifiablellm/pipeline.py
  • reproducible_hash.txt
  • scripts/test_reproducibility.py
  • tests/test_dataset_hash.py
  • tests/test_reproducibility_pipeline.py

@github-actions github-actions bot added size/XL and removed size/XL labels Feb 27, 2026

coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
tests/test_reproducibility_pipeline.py (1)

9-33: ⚠️ Potential issue | 🟠 Major

Tests should call openverifiablellm.pipeline directly, not local helper clones.

These local implementations shadow production behavior, so the suite can miss regressions in shipped code paths (including normalization/hash behavior).

Proposed fix
-import json
-import hashlib
+import json
+import hashlib
@@
-from openverifiablellm import compute_sha256
+from openverifiablellm import compute_sha256
+from openverifiablellm.pipeline import (
+    normalize_line_endings,
+    generate_manifest,
+    get_manifest_hash,
+)
@@
-def normalize_line_endings(content: str) -> str:
-    ...
-
-def generate_manifest(directory_path: Path) -> dict:
-    ...
-
-def get_manifest_hash(manifest: dict) -> str:
-    ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_reproducibility_pipeline.py` around lines 9 - 33, Replace the
local test helpers with calls to the production implementations instead of
shadowing them: remove or stop using the local normalize_line_endings,
generate_manifest, and get_manifest_hash functions in tests and import them from
the package (e.g. from openverifiablellm.pipeline import normalize_line_endings,
generate_manifest, get_manifest_hash or import openverifiablellm.pipeline and
reference pipeline.generate_manifest/etc.); ensure the tests call the production
functions so normalization, hashing, and manifest generation behavior exercised
in tests matches shipped code.
openverifiablellm/pipeline.py (1)

53-63: ⚠️ Potential issue | 🟠 Major

validate_manifest_integrity can miss tampering and break its boolean contract.

It currently ignores extra files and may raise on missing/unreadable paths instead of returning False.

Proposed fix
 def validate_manifest_integrity(manifest: Dict, directory_path: Union[str, Path]) -> bool:
     dir_path = Path(directory_path)
-    for entry in manifest["files"]:
+    manifest_files = manifest.get("files", [])
+    manifest_paths = {entry.get("path") for entry in manifest_files if "path" in entry}
+    actual_paths = {
+        str(p.relative_to(dir_path)).replace("\\", "/")
+        for p in dir_path.glob("**/*")
+        if p.is_file()
+    }
+    if manifest_paths != actual_paths:
+        return False
+
+    for entry in manifest_files:
         file_path = dir_path / entry["path"]
         try:
-            recomputed_hash = compute_normalized_sha256(file_path)
-        except (UnicodeDecodeError, ValueError):
-            recomputed_hash = compute_sha256(file_path)
+            try:
+                recomputed_hash = compute_normalized_sha256(file_path)
+            except (UnicodeDecodeError, ValueError):
+                recomputed_hash = compute_sha256(file_path)
+        except (FileNotFoundError, OSError):
+            return False
         if entry["sha256"] != recomputed_hash:
             return False
     return True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/pipeline.py` around lines 53 - 63, The
validate_manifest_integrity function can raise on missing/unreadable files and
currently ignores extra files; update it so it always returns a bool by (1)
verifying the set of actual files under directory_path (relative paths) matches
exactly the set of manifest["files"] paths (treat extra or missing files as
integrity failure), and (2) when iterating entries call
compute_normalized_sha256/compute_sha256 but catch any file-access errors (e.g.,
FileNotFoundError, PermissionError, UnicodeDecodeError, ValueError and any other
IO-related exceptions) and return False instead of letting exceptions propagate;
reference validate_manifest_integrity, manifest["files"], entry["path"],
compute_normalized_sha256 and compute_sha256 when making these changes.
scripts/test_reproducibility.py (1)

12-33: ⚠️ Potential issue | 🔴 Critical

Use the production pipeline API for manifest/hash generation.

At Line 22, raw compute_sha256 hashing diverges from openverifiablellm/pipeline.py semantics (normalized text hashing + canonical manifest flow). This undermines the CI reproducibility signal.

Proposed fix
-from openverifiablellm import compute_sha256
+from openverifiablellm.pipeline import (
+    generate_manifest,
+    get_manifest_hash,
+)

-def generate_manifest(directory_path: Path) -> dict:
-    ...
-
-def get_manifest_hash(manifest: dict) -> str:
-    ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test_reproducibility.py` around lines 12 - 33, The current
generate_manifest and get_manifest_hash use compute_sha256 directly which
bypasses the production pipeline semantics; instead import and call the
canonical pipeline functions (e.g., use
openverifiablellm.pipeline.generate_manifest and
openverifiablellm.pipeline.get_manifest_hash or the equivalent exported API)
rather than compute_sha256 and the ad-hoc JSON hashing; update generate_manifest
to delegate to the pipeline API for normalized text hashing and canonical
manifest creation and replace get_manifest_hash to call the pipeline's
manifest-hash function so CI uses the exact production manifest/hash logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/__init__.py`:
- Around line 12-21: The __all__ export list is unsorted and triggers RUF022;
sort the entries in the __all__ list into a deterministic order (e.g.,
alphabetical) so the export declarations are stable under lint checks—update the
__all__ array containing "compute_sha256", "compute_normalized_sha256",
"compare_manifests", "generate_manifest", "get_manifest_hash",
"normalize_line_endings", "run_pipeline", and "validate_manifest_integrity" to
be in sorted order.

In `@scripts/test_reproducibility.py`:
- Around line 64-72: run_all_tests currently ignores CLI options by using
defaults; change its signature to accept runs and seed parameters (e.g., def
run_all_tests(runs: int, seed: int) -> Dict[str, bool]) and pass those values
into test_reproducibility_multiple_runs (and any other test helper it calls).
Then update the caller in main() to forward the parsed CLI --runs and --seed
into run_all_tests. Also update the other test invocations inside run_all_tests
(the calls around the 91-95 area) so they receive and propagate runs/seed as
needed to ensure CLI values are used throughout.

---

Duplicate comments:
In `@openverifiablellm/pipeline.py`:
- Around line 53-63: The validate_manifest_integrity function can raise on
missing/unreadable files and currently ignores extra files; update it so it
always returns a bool by (1) verifying the set of actual files under
directory_path (relative paths) matches exactly the set of manifest["files"]
paths (treat extra or missing files as integrity failure), and (2) when
iterating entries call compute_normalized_sha256/compute_sha256 but catch any
file-access errors (e.g., FileNotFoundError, PermissionError,
UnicodeDecodeError, ValueError and any other IO-related exceptions) and return
False instead of letting exceptions propagate; reference
validate_manifest_integrity, manifest["files"], entry["path"],
compute_normalized_sha256 and compute_sha256 when making these changes.

In `@scripts/test_reproducibility.py`:
- Around line 12-33: The current generate_manifest and get_manifest_hash use
compute_sha256 directly which bypasses the production pipeline semantics;
instead import and call the canonical pipeline functions (e.g., use
openverifiablellm.pipeline.generate_manifest and
openverifiablellm.pipeline.get_manifest_hash or the equivalent exported API)
rather than compute_sha256 and the ad-hoc JSON hashing; update generate_manifest
to delegate to the pipeline API for normalized text hashing and canonical
manifest creation and replace get_manifest_hash to call the pipeline's
manifest-hash function so CI uses the exact production manifest/hash logic.

In `@tests/test_reproducibility_pipeline.py`:
- Around line 9-33: Replace the local test helpers with calls to the production
implementations instead of shadowing them: remove or stop using the local
normalize_line_endings, generate_manifest, and get_manifest_hash functions in
tests and import them from the package (e.g. from openverifiablellm.pipeline
import normalize_line_endings, generate_manifest, get_manifest_hash or import
openverifiablellm.pipeline and reference pipeline.generate_manifest/etc.);
ensure the tests call the production functions so normalization, hashing, and
manifest generation behavior exercised in tests matches shipped code.


📥 Commits

Reviewing files that changed from the base of the PR and between fe027c0 and b9a3aaf.

📒 Files selected for processing (5)
  • openverifiablellm/__init__.py
  • openverifiablellm/pipeline.py
  • scripts/test_reproducibility.py
  • tests/test_dataset_hash.py
  • tests/test_reproducibility_pipeline.py

Comment on lines +12 to +21
__all__ = [
"compute_sha256",
"generate_manifest",
"get_manifest_hash",
"run_pipeline",
"compute_normalized_sha256",
"normalize_line_endings",
"validate_manifest_integrity",
"compare_manifests",
]


⚠️ Potential issue | 🟡 Minor

Sort __all__ to keep export declarations stable under lint checks.

The current list order triggers Ruff RUF022; sort entries (or keep a deliberate deterministic order consistent with your lint config) to avoid CI noise.

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 12-21: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/__init__.py` around lines 12 - 21, The __all__ export list
is unsorted and triggers RUF022; sort the entries in the __all__ list into a
deterministic order (e.g., alphabetical) so the export declarations are stable
under lint checks—update the __all__ array containing "compute_sha256",
"compute_normalized_sha256", "compare_manifests", "generate_manifest",
"get_manifest_hash", "normalize_line_endings", "run_pipeline", and
"validate_manifest_integrity" to be in sorted order.

- Sort __all__ exports alphabetically (RUF022)
- Update run_all_tests to accept runs/seed parameters
- Use production pipeline functions in test_reproducibility.py
- Fix validate_manifest_integrity to check extra/missing files and handle exceptions
- Replace local test helpers with production implementations in tests
@github-actions github-actions bot added size/XL and removed size/XL labels Feb 27, 2026
@Muneerali199

Sir, could you review it now, @Archit381?


Archit381 commented Mar 1, 2026

@Muneerali199 please use the already existing functions defined in utils, and don't push a scripts folder yet just for tests; use the tests folder.

Also, before pushing, pull from main once to get the latest updates.

- Add 7 test cases covering identical, added, removed, modified,
  mixed scenarios, and empty manifests
- Uses existing compare_manifests function from pipeline module
- Follows existing test patterns in the codebase
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 2, 2026

coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/pipeline.py`:
- Around line 29-33: The fallback calls to compute_sha256 are using positional
args but the function is defined with a leading * (all keyword-only), causing a
TypeError at runtime; update the two fallback calls in pipeline.py (the
try/except around compute_normalized_sha256 and the later fallback) to call
compute_sha256 with the correct keyword (e.g., compute_sha256(filename=file) or
compute_sha256(data=file) matching the function signature in utils.py) so the
fallback hashing works for binary/non-UTF8 files.

In `@scripts/test_reproducibility.py`:
- Around line 30-39: The test_reproducibility routine currently allows runs <= 0
which makes the final check (all_match) true with an empty hashes list and
returns a false-success; add validation at the start of the function or before
the loop to reject non-positive runs (e.g., raise ValueError or exit with an
error) when runs <= 0, ensuring you check the runs parameter used by the for i
in range(runs) loop and prevent returning hashes[0] when hashes is empty; update
any callers or CLI parsing that populate runs to enforce this same validation.
- Around line 69-76: The code is calling test_reproducibility_multiple_runs
again after run_all_tests, duplicating work and risking inconsistent results;
modify the logic so run_all_tests returns the reproducibility outcome (e.g.,
success, hashes, final_hash) or add an output flag to run_all_tests to obtain
the hash, and then use that returned final_hash to print REPRODUCIBLE_HASH and
write reproducible_hash.txt instead of calling
test_reproducibility_multiple_runs again; update references to run_all_tests in
the main branch where args.test == "all" to consume those returned values and
remove the extra call to test_reproducibility_multiple_runs.

In `@tests/test_reproducibility_pipeline.py`:
- Around line 224-226: The test calls compute_sha256 positionally which now
requires the keyword argument file_path; update the call in
tests/test_reproducibility_pipeline.py so compute_sha256 is invoked as
compute_sha256(file_path=binary_path) (and update any related uses of
binary_hash if necessary), ensuring the function name compute_sha256 and the
variable binary_path are the referenced identifiers to locate and fix the call.


📥 Commits

Reviewing files that changed from the base of the PR and between b9a3aaf and 9ff7782.

📒 Files selected for processing (4)
  • openverifiablellm/__init__.py
  • openverifiablellm/pipeline.py
  • scripts/test_reproducibility.py
  • tests/test_reproducibility_pipeline.py

- pipeline.py: Use keyword arg for compute_sha256 fallback calls
- test_reproducibility.py: Add validation for runs <= 0
- test_reproducibility.py: Remove duplicate test_reproducibility_multiple_runs call
- test_reproducibility_pipeline.py: Use keyword arg for compute_sha256
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 2, 2026

coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/pipeline.py`:
- Around line 43-45: get_manifest_hash currently hashes the entire manifest
including per-entry raw "size" metadata which can vary across OSes; modify
get_manifest_hash to compute the hash from a deterministic, normalized version
of the manifest by deep-copying the input manifest, removing/omitting any raw
"size" fields from entries (e.g., each file entry) and any other
non-content-deterministic metadata, then JSON-dump that cleaned manifest with
sort_keys=True and separators=(",", ":") before hashing; update references to
the function to continue accepting the same manifest argument but ensure the
hashing uses this cleaned_manifest inside get_manifest_hash.

In `@scripts/test_reproducibility.py`:
- Line 28: The function name test_reproducibility_multiple_runs() conflicts with
pytest collection; rename it to a non-test-prefixed identifier (e.g.,
reproducibility_multiple_runs or _reproducibility_multiple_runs) and update
every internal reference/call in the same file (including the other occurrences
noted around lines corresponding to 51-52 and 81) to use the new name so the
helper is not discovered as a test by pytest.


📥 Commits

Reviewing files that changed from the base of the PR and between 9ff7782 and 6add0af.

📒 Files selected for processing (3)
  • openverifiablellm/pipeline.py
  • scripts/test_reproducibility.py
  • tests/test_reproducibility_pipeline.py

- pipeline.py: get_manifest_hash now removes 'size' field for deterministic hashing
- test_reproducibility.py: renamed test_reproducibility_multiple_runs to _reproducibility_multiple_runs
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 2, 2026

coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
scripts/test_reproducibility.py (1)

81-81: 🧹 Nitpick | 🔵 Trivial

Prefix unused variable with underscore.

Same as line 51: hashes is unpacked but unused.

♻️ Proposed fix
-        success, hashes, final_hash = _reproducibility_multiple_runs(args.runs, args.seed)
+        success, _hashes, final_hash = _reproducibility_multiple_runs(args.runs, args.seed)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test_reproducibility.py` at line 81, The variable "hashes" returned
from _reproducibility_multiple_runs is unpacked but unused; update the unpacking
at the call site (the line assigning success, hashes, final_hash) to prefix the
unused variable with an underscore (e.g., success, _hashes, final_hash) so the
unused value is clearly marked and linter warnings are avoided while keeping the
function call to _reproducibility_multiple_runs intact.
openverifiablellm/pipeline.py (1)

70-70: ⚠️ Potential issue | 🟡 Minor

Inconsistent access pattern may raise KeyError on malformed manifest.

Line 60 uses manifest.get("files", []) defensively, but line 70 accesses manifest["files"] directly. If a malformed manifest without a "files" key is passed, this access raises KeyError.

🔧 Proposed fix
-    for entry in manifest["files"]:
+    for entry in manifest.get("files", []):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/pipeline.py` at line 70, The loop currently uses
manifest["files"] which can raise KeyError for malformed manifests; change the
iteration to use the same defensive access as earlier by iterating over
manifest.get("files", []) (or coalescing to an empty list) so the loop in
pipeline.py (the for entry in ... loop) will safely handle missing or empty
"files" keys without throwing; update any related usages in the surrounding
function to consistently use manifest.get("files", []) as the source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/pipeline.py`:
- Line 77: Update the exception handling on the except clause in
openverifiablellm/pipeline.py: remove the deprecated IOError from the exception
tuple so it reads either except (FileNotFoundError, PermissionError, OSError):
or simply except OSError:; modify the existing except block (the try/except
around file access in pipeline.py at the shown except clause) to drop IOError
and keep explicit subclasses if you prefer readability, ensuring no other
references to IOError remain in that handler.

In `@scripts/test_reproducibility.py`:
- Around line 76-77: The file writes for the reproducible hash omit an explicit
encoding; update the two open(...) calls that write "reproducible_hash.txt" (the
block that writes final_hash and the other occurrence around lines 84-85) to
pass encoding="utf-8" (e.g., with open("reproducible_hash.txt", "w",
encoding="utf-8") as f) so the hash is written with a consistent UTF-8 encoding
across platforms.
- Line 51: The unpacked but unused variable `hashes` returned from
_reproducibility_multiple_runs should be prefixed with an underscore to satisfy
static analysis; change the assignment `success, hashes, final_hash =
_reproducibility_multiple_runs(runs, seed)` to use `success, _hashes, final_hash
= _reproducibility_multiple_runs(runs, seed)` (or `_, hashes, final_hash` as
preferred) so the unused value is clearly marked.

---

Duplicate comments:
In `@openverifiablellm/pipeline.py`:
- Line 70: The loop currently uses manifest["files"] which can raise KeyError
for malformed manifests; change the iteration to use the same defensive access
as earlier by iterating over manifest.get("files", []) (or coalescing to an
empty list) so the loop in pipeline.py (the for entry in ... loop) will safely
handle missing or empty "files" keys without throwing; update any related usages
in the surrounding function to consistently use manifest.get("files", []) as the
source of truth.

In `@scripts/test_reproducibility.py`:
- Line 81: The variable "hashes" returned from _reproducibility_multiple_runs is
unpacked but unused; update the unpacking at the call site (the line assigning
success, hashes, final_hash) to prefix the unused variable with an underscore
(e.g., success, _hashes, final_hash) so the unused value is clearly marked and
linter warnings are avoided while keeping the function call to
_reproducibility_multiple_runs intact.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6add0af and 747e946.

📒 Files selected for processing (2)
  • openverifiablellm/pipeline.py
  • scripts/test_reproducibility.py

    recomputed_hash = compute_normalized_sha256(file_path)
except (UnicodeDecodeError, ValueError):
    recomputed_hash = compute_sha256(file_path=file_path)
except (FileNotFoundError, PermissionError, OSError, IOError):
🧹 Nitpick | 🔵 Trivial

Replace deprecated IOError with OSError.

IOError has been a deprecated alias for OSError since Python 3.3. The exception tuple already includes OSError, making IOError redundant.

♻️ Proposed fix
-        except (FileNotFoundError, PermissionError, OSError, IOError):
+        except (FileNotFoundError, PermissionError, OSError):

Note: FileNotFoundError and PermissionError are also subclasses of OSError, so the entire tuple could be simplified to just except OSError:, though explicit subclasses improve readability.
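The relationships this note relies on can be checked directly in any Python 3 interpreter:

```python
# IOError is literally the same class object as OSError since Python 3.3,
# and the other two names in the tuple are OSError subclasses.
assert IOError is OSError
assert issubclass(FileNotFoundError, OSError)
assert issubclass(PermissionError, OSError)

# So a bare OSError handler catches everything the original tuple did.
try:
    raise FileNotFoundError("missing file")
except OSError as exc:
    caught = type(exc).__name__
```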

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 77-77: Replace aliased errors with OSError

Replace with builtin OSError

(UP024)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/pipeline.py` at line 77, Update the exception handling on
the except clause in openverifiablellm/pipeline.py: remove the deprecated
IOError from the exception tuple so it reads either except (FileNotFoundError,
PermissionError, OSError): or simply except OSError:; modify the existing except
block (the try/except around file access in pipeline.py at the shown except
clause) to drop IOError and keep explicit subclasses if you prefer readability,
ensuring no other references to IOError remain in that handler.

results = {}

print("\n1. Testing reproducibility across multiple runs...")
success, hashes, final_hash = _reproducibility_multiple_runs(runs, seed)
🧹 Nitpick | 🔵 Trivial

Prefix unused variable with underscore.

Static analysis correctly identifies that hashes is unpacked but unused.

♻️ Proposed fix
-    success, hashes, final_hash = _reproducibility_multiple_runs(runs, seed)
+    success, _hashes, final_hash = _reproducibility_multiple_runs(runs, seed)
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 51-51: Unpacked variable hashes is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test_reproducibility.py` at line 51, The unpacked but unused variable
`hashes` returned from _reproducibility_multiple_runs should be prefixed with an
underscore to satisfy static analysis; change the assignment `success, hashes,
final_hash = _reproducibility_multiple_runs(runs, seed)` to use `success,
_hashes, final_hash = _reproducibility_multiple_runs(runs, seed)` (or `_,
hashes, final_hash` as preferred) so the unused value is clearly marked.

Comment on lines +76 to +77
with open("reproducible_hash.txt", "w") as f:
    f.write(final_hash)
⚠️ Potential issue | 🟡 Minor

Specify UTF-8 encoding when writing hash file for cross-platform consistency.

The open() calls don't specify encoding, which defaults to platform-dependent encoding on some systems. For reproducibility across OSes, explicitly use UTF-8.

🔧 Proposed fix
-                with open("reproducible_hash.txt", "w") as f:
+                with open("reproducible_hash.txt", "w", encoding="utf-8") as f:
                     f.write(final_hash)
-            with open("reproducible_hash.txt", "w") as f:
+            with open("reproducible_hash.txt", "w", encoding="utf-8") as f:
                 f.write(final_hash)

Also applies to: 84-85

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test_reproducibility.py` around lines 76 - 77, The file writes for
the reproducible hash omit an explicit encoding; update the two open(...) calls
that write "reproducible_hash.txt" (the block that writes final_hash and the
other occurrence around lines 84-85) to pass encoding="utf-8" (e.g., with
open("reproducible_hash.txt", "w", encoding="utf-8") as f) so the hash is
written with a consistent UTF-8 encoding across platforms.
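The platform dependence this comment describes, and the fix, can be sketched like this (the file name comes from the diff; the digest value is a placeholder, and the temporary directory is only for the sake of a self-contained example):

```python
import locale
import tempfile
from pathlib import Path

# Without encoding=..., open() in text mode falls back to the platform's
# preferred encoding, which is not guaranteed to be UTF-8 everywhere.
default_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reproducible_hash.txt"
    # Explicit UTF-8 keeps the bytes on disk identical across platforms.
    with open(path, "w", encoding="utf-8") as f:
        f.write("a" * 64)  # placeholder for a SHA-256 hex digest
    on_disk = path.read_bytes()
```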

- pipeline.py: use manifest.get() for defensive access, remove deprecated IOError
- test_reproducibility.py: add encoding utf-8 to file writes, prefix unused hashes
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 2, 2026
@Muneerali199
Author

Sir @Archit381, please take a look now and let me know if further changes are required.

@Archit381 Archit381 closed this Mar 5, 2026

Successfully merging this pull request may close these issues.

[FEATURE]: Add Cross-Platform Reproducibility CI Matrix (Multi-Run Determinism Validation)

2 participants