Commit e5b4fb1

docs: nightly research report 2026-03-21 (report #15)
Key new findings:
- Confirmed 2 broken abc_audit quality-gate checks via pytest (T5 solution leak + R2 contamination both return PASS for bad tasks; duplicate defs)
- Cost pipeline never passes model ID to calculate_cost_from_tokens; all costs silently calculated at Opus-4.5 rates (Sonnet runs: 5x overstatement)
- claude-sonnet-4-6 and claude-haiku-4-6 absent from MODEL_PRICING table
- Zero CI workflows run pytest — 2 confirmed test failures invisible to CI
- Hardcoded Feb-2026 staging run IDs in verify_retrieval_eval_smoke.py
- Non-atomic credential write in daytona_runner.py:234
1 parent 991fde8 commit e5b4fb1

4 files changed: +321 -48 lines changed

AGENTS.md

Lines changed: 14 additions & 16 deletions
@@ -96,12 +96,10 @@ full operations manual.
 - **Verifier lib duplication**: 401 copies of `answer_json_verifier_lib.sh` (13 suites; task copies diverged with extra funcs). 275 copies of `dual_score_lib.sh` (csb/ only). `benchmarks/_shared/` missing; every fix requires touching 401+ files.

 ### Scripts / Code Quality
-- `abc_audit.py`: 4 functions defined twice; Python silently uses last definition.
-- `rerun_failed.py`: `shell=True` injection; wrong `sourcegraph_full→deepsearch` mapping; deprecated model.
-- `ir_metrics.py:749`: `tt_all_r` set comparison bug (first-relevant, not all-relevant).
-- `--skip-completed` requires result.json + task_metrics.json; fix: check only result.json.
-- Task registry header: claims 436, actual 274 (`sync_task_metadata.py --fix` doesn't update it).
-- `verification_modes`/`use_case_category` missing from all 274 tasks; `--use-case-category` silently returns 0.
+- `abc_audit.py`: 4 functions defined twice; Python uses last. `check_t5_no_solution_leak` + `check_r2_no_contamination` broken — `pytest tests/test_abc_audit.py` confirms 2 FAIL / 40 pass. Tasks with solution leaks / sourcegraph refs pass audit silently.
+- `rerun_failed.py`: `shell=True` injection; wrong `sourcegraph_full→deepsearch`; deprecated model.
+- `ir_metrics.py:749`: `tt_all_r` set comparison bug. `--skip-completed`: check only result.json.
+- Task registry header: claims 436, actual 274. `verification_modes`/`use_case_category` missing from all tasks.

 ### Validation / Scoring
 - `validators.py` duplicated in `ccb_build`; update all copies (`sha256sum`).
@@ -150,17 +148,17 @@ full operations manual.
 - Secret-detection false-positives: use `--no-verify` when flagged code is detection logic.
 - Ralph: `prd.json` single-active; archive as `prd-archive/prd-<feature>-<date>.json`; validate: `python3 -c "import json; json.load(open('prd.json'))"`. Not gitignored.

-### Scripts / Code Quality (Mar 17-20 additions)
-- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench`; crash on other machines.
-- `context_retrieval_agent.py:432+` `shell=True` without allowlist; injection risk.
-- Non-atomic writes: `aggregate_status.py:669`, `apply_verifier_fixes.py:103+`; use temp+rename.
+### Scripts / Code Quality (Mar 17-21 additions)
+- Hardcoded `~/CodeScaleBench`: `apply_verifier_fixes.py:9`, `fix_memory_mb.py:8`, `extract_build_diary.py:121`, `plot_build_diary_supplementary.py:121+`.
+- `context_retrieval_agent.py:432+`, `oracle_checks.py:498`: `shell=True` injection risk.
+- Non-atomic writes: `aggregate_status.py:669`, `daytona_runner.py:234`; use temp+rename pattern.
 - Bare `except:`: `audit_v2_report_data.py:104`, `ds_audit.py:244+`, `extract_v2_report_data.py:144+`.
-- FD leaks: 17+ sites; use `with open()`. `export_official_results.py:45` `DEFAULT_REPO_BLOB_BASE`stale org; links 404.
-- Ruff S603/S604, SIM115 (skips `Popen(stdout=f)`), BLE001; add `pyproject.toml`; `sanitize_secrets.py` needs S105/S106.
-- Hardcoded `/home/stephanie_jarmak/CodeScaleBench`: 5 scripts (`fix_memory_mb.py:8`, `extract_build_diary.py:121`, `plot_build_diary_supplementary.py:121+`, etc.).
-- `rerun_fixed_tasks.sh:34`, `rerun_zero_mcp_tasks.sh:29`: deprecated model; Ruff misses `.sh` — add `grep -rn "claude-opus-4-5" scripts/` to CI.
-- `run_selected_tasks.sh:648,699,711`: mktemp+mv race — `mv` failure swallowed by subshell, `cp` targets missing dir.
-- `csb_metrics/extractors.py:669`: FD leak via `tp.open()` (pathlib form; missed by SIM115 grep sweep).
+- FD leaks: 17+ sites; `export_official_results.py:45` stale org URL; `extractors.py:669` pathlib form.
+- Deprecated model in shell: `rerun_fixed_tasks.sh:34`, `rerun_zero_mcp_tasks.sh:29`; add grep CI check.
+- `run_selected_tasks.sh:648,699,711`: mktemp+mv race — `mv` failure swallowed by subshell.
+- **Cost pipeline**: `extract_task_metrics.py:266`, `discovery.py:310` never pass model → all costs at Opus-4.5 rates. `claude-sonnet-4-6` + `claude-haiku-4-6` absent from `MODEL_PRICING` (`extractors.py:1071`). Sonnet runs: 5× overstatement.
+- **CI test gap**: 212 tests / 2 confirmed failing / none of the 4 CI workflows run `pytest`.
+- `verify_retrieval_eval_smoke.py:26-30`: 5 hardcoded Feb-2026 run IDs; smoke test breaks if staging rotated.

 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
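
The "use temp+rename pattern" advice in the non-atomic writes entry above can be illustrated with a minimal sketch. The helper name `atomic_write_text` and its `mode` parameter are hypothetical; nothing here is taken from `daytona_runner.py` or `aggregate_status.py`. It assumes the temp file can be created in the destination directory, so the final rename stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, data: str, mode: int = 0o600) -> None:
    """Hypothetical helper: write data so readers never see a partial file."""
    path = Path(path)
    # Create the temp file next to the destination so the rename below
    # does not cross a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.chmod(tmp_name, mode)
        # os.replace swaps the new file in as one step; an existing
        # destination is overwritten.
        os.replace(tmp_name, path)
    except BaseException:
        # On any failure, remove the temp file and leave the old
        # destination (if any) untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A caller would use `atomic_write_text(Path("creds.json"), payload)` in place of a direct `open(..., "w")` write. The default `0o600` mode is an assumption suited to credential files and can be relaxed for ordinary status files.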

CLAUDE.md

Lines changed: 14 additions & 16 deletions
@@ -96,12 +96,10 @@ full operations manual.
 - **Verifier lib duplication**: 401 copies of `answer_json_verifier_lib.sh` (13 suites; task copies diverged with extra funcs). 275 copies of `dual_score_lib.sh` (csb/ only). `benchmarks/_shared/` missing; every fix requires touching 401+ files.

 ### Scripts / Code Quality
-- `abc_audit.py`: 4 functions defined twice; Python silently uses last definition.
-- `rerun_failed.py`: `shell=True` injection; wrong `sourcegraph_full→deepsearch` mapping; deprecated model.
-- `ir_metrics.py:749`: `tt_all_r` set comparison bug (first-relevant, not all-relevant).
-- `--skip-completed` requires result.json + task_metrics.json; fix: check only result.json.
-- Task registry header: claims 436, actual 274 (`sync_task_metadata.py --fix` doesn't update it).
-- `verification_modes`/`use_case_category` missing from all 274 tasks; `--use-case-category` silently returns 0.
+- `abc_audit.py`: 4 functions defined twice; Python uses last. `check_t5_no_solution_leak` + `check_r2_no_contamination` broken — `pytest tests/test_abc_audit.py` confirms 2 FAIL / 40 pass. Tasks with solution leaks / sourcegraph refs pass audit silently.
+- `rerun_failed.py`: `shell=True` injection; wrong `sourcegraph_full→deepsearch`; deprecated model.
+- `ir_metrics.py:749`: `tt_all_r` set comparison bug. `--skip-completed`: check only result.json.
+- Task registry header: claims 436, actual 274. `verification_modes`/`use_case_category` missing from all tasks.

 ### Validation / Scoring
 - `validators.py` duplicated in `ccb_build`; update all copies (`sha256sum`).
@@ -150,17 +148,17 @@ full operations manual.
 - Secret-detection false-positives: use `--no-verify` when flagged code is detection logic.
 - Ralph: `prd.json` single-active; archive as `prd-archive/prd-<feature>-<date>.json`; validate: `python3 -c "import json; json.load(open('prd.json'))"`. Not gitignored.

-### Scripts / Code Quality (Mar 17-20 additions)
-- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench`; crash on other machines.
-- `context_retrieval_agent.py:432+` `shell=True` without allowlist; injection risk.
-- Non-atomic writes: `aggregate_status.py:669`, `apply_verifier_fixes.py:103+`; use temp+rename.
+### Scripts / Code Quality (Mar 17-21 additions)
+- Hardcoded `~/CodeScaleBench`: `apply_verifier_fixes.py:9`, `fix_memory_mb.py:8`, `extract_build_diary.py:121`, `plot_build_diary_supplementary.py:121+`.
+- `context_retrieval_agent.py:432+`, `oracle_checks.py:498`: `shell=True` injection risk.
+- Non-atomic writes: `aggregate_status.py:669`, `daytona_runner.py:234`; use temp+rename pattern.
 - Bare `except:`: `audit_v2_report_data.py:104`, `ds_audit.py:244+`, `extract_v2_report_data.py:144+`.
-- FD leaks: 17+ sites; use `with open()`. `export_official_results.py:45` `DEFAULT_REPO_BLOB_BASE`stale org; links 404.
-- Ruff S603/S604, SIM115 (skips `Popen(stdout=f)`), BLE001; add `pyproject.toml`; `sanitize_secrets.py` needs S105/S106.
-- Hardcoded `/home/stephanie_jarmak/CodeScaleBench`: 5 scripts (`fix_memory_mb.py:8`, `extract_build_diary.py:121`, `plot_build_diary_supplementary.py:121+`, etc.).
-- `rerun_fixed_tasks.sh:34`, `rerun_zero_mcp_tasks.sh:29`: deprecated model; Ruff misses `.sh` — add `grep -rn "claude-opus-4-5" scripts/` to CI.
-- `run_selected_tasks.sh:648,699,711`: mktemp+mv race — `mv` failure swallowed by subshell, `cp` targets missing dir.
-- `csb_metrics/extractors.py:669`: FD leak via `tp.open()` (pathlib form; missed by SIM115 grep sweep).
+- FD leaks: 17+ sites; `export_official_results.py:45` stale org URL; `extractors.py:669` pathlib form.
+- Deprecated model in shell: `rerun_fixed_tasks.sh:34`, `rerun_zero_mcp_tasks.sh:29`; add grep CI check.
+- `run_selected_tasks.sh:648,699,711`: mktemp+mv race — `mv` failure swallowed by subshell.
+- **Cost pipeline**: `extract_task_metrics.py:266`, `discovery.py:310` never pass model → all costs at Opus-4.5 rates. `claude-sonnet-4-6` + `claude-haiku-4-6` absent from `MODEL_PRICING` (`extractors.py:1071`). Sonnet runs: 5× overstatement.
+- **CI test gap**: 212 tests / 2 confirmed failing / none of the 4 CI workflows run `pytest`.
+- `verify_retrieval_eval_smoke.py:26-30`: 5 hardcoded Feb-2026 run IDs; smoke test breaks if staging rotated.

 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
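
The cost pipeline entry above describes a cost function that silently falls back to one model's rates when the caller omits the model. One way to rule that out is to make the model a required argument and fail loudly on unknown IDs. The sketch below is hypothetical throughout: `PRICING`, `Rate`, `cost_from_tokens`, the model names, and the per-million-token rates are placeholders chosen only to reproduce a 5x gap. They are not the repository's `MODEL_PRICING` table, not the real signature of `calculate_cost_from_tokens`, and not real prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    input_per_mtok: float   # placeholder dollars per million input tokens
    output_per_mtok: float  # placeholder dollars per million output tokens


# Placeholder table: names and numbers are illustrative, not real pricing.
PRICING = {
    "model-large": Rate(5.0, 25.0),
    "model-small": Rate(1.0, 5.0),
}


def cost_from_tokens(model: str, input_tokens: int, output_tokens: int) -> float:
    """Price a run for an explicitly named model; no default model exists."""
    try:
        rate = PRICING[model]
    except KeyError:
        # A missing entry is an error, never a silent fallback to another rate.
        raise ValueError(f"no pricing entry for model {model!r}") from None
    total = input_tokens * rate.input_per_mtok + output_tokens * rate.output_per_mtok
    return total / 1_000_000


small = cost_from_tokens("model-small", 1_000_000, 1_000_000)
large = cost_from_tokens("model-large", 1_000_000, 1_000_000)
print(small, large, large / small)  # 6.0 30.0 5.0
```

With these placeholder rates, pricing a small-model run at the large-model rate overstates it by a factor of five. The design choice is that a model missing from the table raises `ValueError`, so the gap is visible at the call site.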

docs/ops/ROOT_AGENT_GUIDE.md

Lines changed: 14 additions & 16 deletions
@@ -96,12 +96,10 @@ full operations manual.
 - **Verifier lib duplication**: 401 copies of `answer_json_verifier_lib.sh` (13 suites; task copies diverged with extra funcs). 275 copies of `dual_score_lib.sh` (csb/ only). `benchmarks/_shared/` missing; every fix requires touching 401+ files.

 ### Scripts / Code Quality
-- `abc_audit.py`: 4 functions defined twice; Python silently uses last definition.
-- `rerun_failed.py`: `shell=True` injection; wrong `sourcegraph_full→deepsearch` mapping; deprecated model.
-- `ir_metrics.py:749`: `tt_all_r` set comparison bug (first-relevant, not all-relevant).
-- `--skip-completed` requires result.json + task_metrics.json; fix: check only result.json.
-- Task registry header: claims 436, actual 274 (`sync_task_metadata.py --fix` doesn't update it).
-- `verification_modes`/`use_case_category` missing from all 274 tasks; `--use-case-category` silently returns 0.
+- `abc_audit.py`: 4 functions defined twice; Python uses last. `check_t5_no_solution_leak` + `check_r2_no_contamination` broken — `pytest tests/test_abc_audit.py` confirms 2 FAIL / 40 pass. Tasks with solution leaks / sourcegraph refs pass audit silently.
+- `rerun_failed.py`: `shell=True` injection; wrong `sourcegraph_full→deepsearch`; deprecated model.
+- `ir_metrics.py:749`: `tt_all_r` set comparison bug. `--skip-completed`: check only result.json.
+- Task registry header: claims 436, actual 274. `verification_modes`/`use_case_category` missing from all tasks.

 ### Validation / Scoring
 - `validators.py` duplicated in `ccb_build`; update all copies (`sha256sum`).
@@ -150,17 +148,17 @@ full operations manual.
 - Secret-detection false-positives: use `--no-verify` when flagged code is detection logic.
 - Ralph: `prd.json` single-active; archive as `prd-archive/prd-<feature>-<date>.json`; validate: `python3 -c "import json; json.load(open('prd.json'))"`. Not gitignored.

-### Scripts / Code Quality (Mar 17-20 additions)
-- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench`; crash on other machines.
-- `context_retrieval_agent.py:432+` `shell=True` without allowlist; injection risk.
-- Non-atomic writes: `aggregate_status.py:669`, `apply_verifier_fixes.py:103+`; use temp+rename.
+### Scripts / Code Quality (Mar 17-21 additions)
+- Hardcoded `~/CodeScaleBench`: `apply_verifier_fixes.py:9`, `fix_memory_mb.py:8`, `extract_build_diary.py:121`, `plot_build_diary_supplementary.py:121+`.
+- `context_retrieval_agent.py:432+`, `oracle_checks.py:498`: `shell=True` injection risk.
+- Non-atomic writes: `aggregate_status.py:669`, `daytona_runner.py:234`; use temp+rename pattern.
 - Bare `except:`: `audit_v2_report_data.py:104`, `ds_audit.py:244+`, `extract_v2_report_data.py:144+`.
-- FD leaks: 17+ sites; use `with open()`. `export_official_results.py:45` `DEFAULT_REPO_BLOB_BASE`stale org; links 404.
-- Ruff S603/S604, SIM115 (skips `Popen(stdout=f)`), BLE001; add `pyproject.toml`; `sanitize_secrets.py` needs S105/S106.
-- Hardcoded `/home/stephanie_jarmak/CodeScaleBench`: 5 scripts (`fix_memory_mb.py:8`, `extract_build_diary.py:121`, `plot_build_diary_supplementary.py:121+`, etc.).
-- `rerun_fixed_tasks.sh:34`, `rerun_zero_mcp_tasks.sh:29`: deprecated model; Ruff misses `.sh` — add `grep -rn "claude-opus-4-5" scripts/` to CI.
-- `run_selected_tasks.sh:648,699,711`: mktemp+mv race — `mv` failure swallowed by subshell, `cp` targets missing dir.
-- `csb_metrics/extractors.py:669`: FD leak via `tp.open()` (pathlib form; missed by SIM115 grep sweep).
+- FD leaks: 17+ sites; `export_official_results.py:45` stale org URL; `extractors.py:669` pathlib form.
+- Deprecated model in shell: `rerun_fixed_tasks.sh:34`, `rerun_zero_mcp_tasks.sh:29`; add grep CI check.
+- `run_selected_tasks.sh:648,699,711`: mktemp+mv race — `mv` failure swallowed by subshell.
+- **Cost pipeline**: `extract_task_metrics.py:266`, `discovery.py:310` never pass model → all costs at Opus-4.5 rates. `claude-sonnet-4-6` + `claude-haiku-4-6` absent from `MODEL_PRICING` (`extractors.py:1071`). Sonnet runs: 5× overstatement.
+- **CI test gap**: 212 tests / 2 confirmed failing / none of the 4 CI workflows run `pytest`.
+- `verify_retrieval_eval_smoke.py:26-30`: 5 hardcoded Feb-2026 run IDs; smoke test breaks if staging rotated.

 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
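
The `abc_audit.py` entry above relies on a Python behavior worth seeing in isolation: when a module defines the same function twice, the later definition replaces the earlier one without any warning. The sketch below demonstrates that on a made-up sample and shows one possible way to detect such duplicates with the standard `ast` module. The helper `duplicate_top_level_defs` and the `SAMPLE` source are hypothetical and unrelated to the real audit code.

```python
import ast
from collections import Counter


def duplicate_top_level_defs(source: str) -> list[str]:
    """Return names of top-level functions defined more than once."""
    tree = ast.parse(source)
    counts = Counter(
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )
    return sorted(name for name, count in counts.items() if count > 1)


# Made-up module: the second `check` silently shadows the first.
SAMPLE = '''
def check(task):
    return "FAIL"

def helper():
    return None

def check(task):
    return "PASS"
'''

namespace = {}
exec(SAMPLE, namespace)
print(namespace["check"](None))          # PASS (the later definition wins)
print(duplicate_top_level_defs(SAMPLE))  # ['check']
```

A check like this could run in CI alongside the test suite. It covers only top-level `def` statements, so methods redefined inside a class body would need a recursive walk.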
