[Harbor 1/4] vero core: split visibility, budget ledger, evaluation engine refactor#3
Open
varunursekar wants to merge 1 commit into
Open
[Harbor 1/4] vero core: split visibility, budget ledger, evaluation engine refactor#3varunursekar wants to merge 1 commit into
varunursekar wants to merge 1 commit into
Conversation
…ne refactor Foundational changes that the Harbor integration builds on, useful on their own: - 3-tier split visibility: `SplitAccessLevel` gains `no_access` (hidden) alongside `visible` and `non_viewable` (aggregate-only). - `BudgetLedger` (core/budget.py): a metered, optionally durable per-split eval budget extracted from the experiment runner. - `TaskOutput` is now a model; the task pipeline runs inference and scoring as separate, resumable, disk-backed stages, with label-field scrubbing before inference. - Evaluation refactor: consolidate the eval path under `vero/evaluation/` as a shared `EvaluationEngine` + `Evaluator`, port `ExperimentRunnerTool` onto it, and add an injectable `EvalStrategy` seam (the extension point Harbor Mode B uses). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines
+745
to
+748
| tasks, _ = self._load_and_prepare_data(params) | ||
| sample_ids = params.run.dataset_subset.sample_ids | ||
| if sample_ids is None: | ||
| sample_ids = list(range(len(tasks))) |
There was a problem hiding this comment.
Dataset loaded twice per evaluation
run_scoring_stage calls _load_and_prepare_data(params) even though run_inference_stage already called it earlier in run(). On large datasets loaded from disk, this doubles the I/O and deserialization cost. Consider passing the already-loaded (tasks, task_data) tuple into both stage methods, or loading once at the run() level and threading it through.
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/core/task/task.py
Line: 745-748
Comment:
**Dataset loaded twice per evaluation**
`run_scoring_stage` calls `_load_and_prepare_data(params)` even though `run_inference_stage` already called it earlier in `run()`. On large datasets loaded from disk, this doubles the I/O and deserialization cost. Consider passing the already-loaded `(tasks, task_data)` tuple into both stage methods, or loading once at the `run()` level and threading it through.
How can I resolve this? If you propose a fix, please make it concise.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Draft · Stack 1 of 4 — review/merge in order (1→4). No Harbor concepts here; foundations the rest builds on (and useful on their own).
SplitAccessLevelgainsno_access(hidden) alongsidevisible/non_viewable(aggregate-only).BudgetLedger(core/budget.py): metered, optionally-durable per-split eval budget.TaskOutputis a model; inference and scoring run as separate resumable disk-backed stages, with label-field scrubbing before inference.vero/evaluation/as a sharedEvaluationEngine+Evaluator, portExperimentRunnerToolonto it, and add the injectableEvalStrategyseam (Mode B's extension point).Review focus: the visibility enum,
BudgetLedger, theEvaluationEngineAPI, and the strategy seam. Much ofevaluation/is relocated code (skim).Part of the Harbor integration (full squashed view: #2). Stack: this → [2/4] sidecar → [3/4] compiler → [4/4] docs.
🤖 Generated with Claude Code
Greptile Summary
This PR lays the foundational plumbing for the Harbor integration: a three-tier
SplitAccessLevelenum (viewable/non_viewable/no_access), a meteredBudgetLedgerwith optional durable JSON persistence, and a refactored evaluation stack that splits inference and scoring into two resumable disk-backed stages with label-field scrubbing.BudgetLedger(core/budget.py): in-memory by default;persist_pathenables a crash-safe durable variant for the Harbor sidecar, withreserve()providing an atomic check+decrement under anasyncio.Lock. One issue:_flush()does synchronous file I/O while the lock is held, which blocks the event loop in the durable path.core/task/task.py):run_inference_stageandrun_scoring_stageeach persist per-sample results incrementally and skip already-completed samples on resume;label_fieldsare scrubbed from inference inputs but passed intact to scoring.EvaluationEngine+EvalStrategyseam (evaluation/): shared engine consolidates budget metering and sample resolution; aProtocol-basedEvalStrategyseam allows Harbor Mode B to inject aHarborRunnerwithout couplingvero.evaluationto harbor-specific code.Confidence Score: 3/5
Safe to merge for the in-process ExperimentRunnerTool path; the durable BudgetLedger path (Harbor sidecar) has a blocking I/O call under an asyncio lock that will degrade the event loop under concurrency.
The evaluation refactor, visibility enum, and EvalStrategy seam are clean and well-tested. The concern is in BudgetLedger.reserve(): it holds an asyncio.Lock and then calls _flush(), which does synchronous Path.write_text() on the main thread. For the in-memory path (persist_path=None) _flush() is a no-op and this is harmless. But the durable path — the variant the Harbor sidecar is documented to use — will block the event loop on every metered eval request, causing latency spikes and throttling throughput under concurrent load.
vero/src/vero/core/budget.py — the _flush() / reserve() interaction; vero/src/vero/core/task/task.py — double _load_and_prepare_data() call across the two stages.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[VeroTask.run] --> B[run_inference_stage] B --> B1{sample persisted?} B1 -- yes --> B2[skip] B1 -- no --> B3[_scrub_inputs] B3 --> B4[inference fn] B4 --> B5[_save_inference to disk] A --> C[run_scoring_stage] C --> C1{error or already scored?} C1 -- yes --> C2[skip] C1 -- no --> C3[load TaskOutput from disk] C3 --> C4[scoring fn with full row] C4 --> C5[_save_score to disk] A --> D[compute_metrics from disk] E[EvaluationEngine.evaluate] --> F{admin?} F -- no --> G[BudgetLedger.reserve] G --> G1[asyncio.Lock] G1 --> G2[check] G2 --> G3[record + _flush] G3 -. blocking I/O .-> G4[blocks event loop] F -- yes --> H[bypass budget] G --> I[Evaluator.evaluate] H --> I I --> J{eval_strategy set?} J -- Mode B --> K[EvalStrategy.produce_sample_results] J -- Mode A --> L[_run_task_in_subprocess]%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD A[VeroTask.run] --> B[run_inference_stage] B --> B1{sample persisted?} B1 -- yes --> B2[skip] B1 -- no --> B3[_scrub_inputs] B3 --> B4[inference fn] B4 --> B5[_save_inference to disk] A --> C[run_scoring_stage] C --> C1{error or already scored?} C1 -- yes --> C2[skip] C1 -- no --> C3[load TaskOutput from disk] C3 --> C4[scoring fn with full row] C4 --> C5[_save_score to disk] A --> D[compute_metrics from disk] E[EvaluationEngine.evaluate] --> F{admin?} F -- no --> G[BudgetLedger.reserve] G --> G1[asyncio.Lock] G1 --> G2[check] G2 --> G3[record + _flush] G3 -. blocking I/O .-> G4[blocks event loop] F -- yes --> H[bypass budget] G --> I[Evaluator.evaluate] H --> I I --> J{eval_strategy set?} J -- Mode B --> K[EvalStrategy.produce_sample_results] J -- Mode A --> L[_run_task_in_subprocess]Comments Outside Diff (1)
vero/src/vero/core/budget.py, line 175-193 (link)asyncio.Lockinreserve()_flush()callsPath.write_text()(synchronous) whilereserve()holdsself._lock— anasyncio.Lock. Any concurrent coroutine that awaitsreserve()will be blocked for the full duration of the file write; on a loaded Harbor sidecar with rapid eval requests, this stalls the entire event loop. The fix is to wrap the blocking writes inasyncio.to_thread()(or useaiofiles) andawaitthem insidereserve(), keeping the lock protecting only the in-memory check+decrement step.Prompt To Fix With AI
Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "vero: 3-tier split visibility, budget le..." | Re-trigger Greptile