Skip to content

Add Ollama provider for local code review + retune default temperature#23

Closed
Airwhale wants to merge 19 commits into
gemini-cli-extensions:mainfrom
Airwhale:ollama-provider
Closed

Add Ollama provider for local code review + retune default temperature#23
Airwhale wants to merge 19 commits into
gemini-cli-extensions:mainfrom
Airwhale:ollama-provider

Conversation

@Airwhale
Copy link
Copy Markdown

Summary

  • Adds a third --provider ollama path that POSTs to a local Ollama server's OpenAI-compatible /v1/chat/completions endpoint. No API key, no token cost, code never leaves the machine.
  • Refactors model aliases into per-provider tables (MODEL_ALIASES_BY_PROVIDER); adds local / local-pro aliases for Ollama; preserves OpenRouter aliases unchanged; surfaces cross-provider mismatches as typed CONFIG errors that name the right --provider.
  • Lowers default temperature from 0.5 to 0.3 after cross-model integration testing observed google/gemini-2.5-pro hallucinating a HIGH-severity finding referencing a CLI flag that does not exist in the code (the proposed fix would have crashed the runner with AttributeError).
  • Updates README.md, .env.example, and docs/llm-code-review-runbook.md for the new three-provider model and the retuned temperature default. Adds a "Local vs cloud" section documenting the opposing failure modes (cloud hallucinates, local under-reports).

What the integration adds

New surface Default Purpose
--provider ollama Third provider option
--ollama-host URL $OLLAMA_HOST or http://localhost:11434 Override for non-default ports / remote Ollama / WSL without localhost mirroring
OLLAMA_MODEL env qwen3-coder:30b Default model when --provider ollama is selected
OLLAMA_TIMEOUT env 1800 (seconds) HTTP timeout. Default 30 min absorbs CPU cold-starts + thorough reviews
--model local alias qwen3-coder:30b 30B MoE coder, ~3.3B active params (quality/speed sweet spot on CPU)
--model local-pro alias qwen3-coder-next 80B/3B MoE — higher quality, larger download

Why qwen3-coder:30b is the default

It is a Mixture-of-Experts model with 3.3B active parameters per token despite being 30B total. On CPU, active-param count drives inference speed, not total params — so this model runs roughly as fast as a 3B dense model while having the quality of a 30B model. For users with a GPU, the speed advantage is less pronounced but the quality benefit holds.

Why the temperature changed

Cross-model integration testing on the same 31 K-char diff (the diff of this PR, prior to commit) produced opposite failure modes:

  • qwen3-coder:30b via local Ollama: 3 minutes, clean "no issues found." Likely missed legitimate nits but did not fabricate any.
  • google/gemini-2.5-pro via OpenRouter at the previous temperature=0.5: 80 seconds, one HIGH-severity finding referencing args.timeout and --timeout flag. Neither the flag nor the attribute exist in the codebase. Verified by grep returning zero matches. The proposed fix would have crashed the runner with AttributeError: 'Namespace' object has no attribute 'timeout'.

The hallucination at 0.5 motivated reducing temperature to 0.3 — tighter than 0.5 to cut hallucination, looser than the original 0.2 to keep "more findings than too few." The DEFAULT_TEMPERATURE constant docstring carries the full history.

Typed error model for Ollama

The new call_ollama function distinguishes:

  • Connection refusedConfigError (exit 2) suggesting ollama serve and --ollama-host. No retry — config problem.
  • HTTP 404ConfigError (exit 2) with the exact ollama pull <model> command. No retry — model not pulled.
  • HTTP 429 / 413 / 5xx → routed through the shared _classify_http_error so rate-limit / overflow / transport are typed consistently with the cloud providers.
  • Null content with finish_reason=lengthContextOverflow (exit 12). No SafetyRefusal branch — Ollama has no content-filter mode.

Test plan

  • import review succeeds with all three providers and the new constants
  • --help shows --provider {openrouter,gemini,ollama} and --ollama-host
  • Smoke test: --provider ollama --base HEAD against a small diff with default model (qwen3-coder:30b) produces a clean review in under 1 minute
  • Real review: --provider ollama --base HEAD against the entire 31 K-char diff of this PR returns a coherent summary + clean verdict in ~3 min
  • Cross-provider review: same diff via --provider openrouter returns structured output (and surfaced the hallucination that prompted the temperature retune)
  • Pull qwen3-coder-next and verify the local-pro alias dispatch (deferred — large download)
  • CI integration with --provider ollama (deferred — needs Ollama in CI image)

Notes for reviewers

  • The MODEL_ALIASES top-level dict is preserved as a backwards-compat view over MODEL_ALIASES_BY_PROVIDER["openrouter"] in case any external importer relied on it. If you're confident nothing does, the alias can be deleted in a follow-up.
  • The 1800-second default Ollama timeout is intentionally generous. If you want faster failure on stuck inference, override with OLLAMA_TIMEOUT per env.
  • The "Local vs cloud" section in both README and runbook documents the opposing failure modes empirically rather than asserting them — the evidence is the cross-model test in the PR's own diff.

🤖 Generated with Claude Code

Airwhale and others added 19 commits May 14, 2026 20:43
Loads `skills/code-review-commons/SKILL.md` and `commands/code-review.toml`
unchanged and POSTs them to OpenRouter's chat-completions endpoint with
`google/gemini-2.5-pro` (or a configurable model). Matches the GitHub
`/gemini review` bot behavior at ~30-60s per round instead of 5-15
minutes, by bypassing the GitHub webhook -> Google job-queue path.

Files added:
- review.py        single-file uv-runnable runner; --base/--pr/--staged modes
- pyproject.toml   httpx + python-dotenv deps, uv-managed
- .env.example     OPENROUTER_API_KEY and optional model/header overrides
- .gitignore       protects .env and uv cache from leaking
- README.md        documents the fork's mode alongside the upstream extension

Upstream skill/command prompts and gemini-extension.json untouched so the
fork rebases cleanly against `gemini-cli-extensions/code-review` upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a second transport path for the runner: `--provider gemini` calls
Google AI Studio's `generateContent` endpoint directly using
`GEMINI_API_KEY`, alongside the existing `--provider openrouter` path
(default, uses `OPENROUTER_API_KEY`). Both paths send the same upstream
skill + command prompts and the same `gemini-2.5-pro` model by default,
so review behavior is materially equivalent -- callers pick whichever
key / billing relationship they already have.

Model defaults differ by provider because OpenRouter prefixes vendor
names (`google/gemini-2.5-pro`) while the Gemini API takes the bare
name (`gemini-2.5-pro`). Per-provider env-var overrides
(`OPENROUTER_MODEL`, `GEMINI_MODEL`) and an explicit `--model` flag
remain available.

README rewritten to make the fork's modifications obvious up front:
the table at the top calls out each new/modified file, the provider
table documents the trade-offs (including the free-tier
`gemini-2.5-pro` quota gotcha), and the upstream CLI install command
now correctly points at `gemini-cli-extensions/code-review` rather
than this fork.

Smoke-tested both paths end-to-end against this repo's own diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the iteration-partner workflow used in real review cycles
(PRs gemini-cli-extensions#5, gemini-cli-extensions#6, gemini-cli-extensions#7 in the parent project): the four diff modes
(--base, --pr, --staged, default merge-base), the two providers
(openrouter, gemini) and when to pick which, the accept/decline
heuristics that converge to a clean pass without churn, and the
known gotchas (transient None responses, free-tier 429 on
gemini-2.5-pro direct, the two-dot diff semantics behind --base
that lets working-tree edits show up between rounds).

The decline contract is the central operational rule: every declined
finding gets a code comment explaining the rationale, otherwise the
next iteration's model surfaces the same finding again. The comment
is how you teach the reviewer about design decisions it cannot infer
from the diff alone.

Links from the root README so the doc is discoverable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes project-specific references (parent-project PR numbers, real
findings cited verbatim, references to "the parent project") so the
runbook reads as a standalone operational guide that works for anyone
cloning this fork, not just contributors who happen to also work on
the project where the workflow was originally developed.

Changes:
- Strip "real review cycles (PRs gemini-cli-extensions#5, gemini-cli-extensions#6, gemini-cli-extensions#7 in the parent project)"
  framing from the opening; the runbook stands on its own merit.
- Convert the accept/decline examples from specific past findings to
  *shapes* (e.g. "a pair of API endpoints that look redundant but
  are deliberately split for a planned future migration") so the
  pattern transfers to any codebase.
- Replace the per-round-tracking example table's content with
  illustrative-but-generic finding descriptions rather than verbatim
  rows from one specific PR's review log.
- Promote the decline contract to its own section to surface the
  central operational rule earlier and more prominently.
- Drop the gotcha that re-stated the decline contract; one mention
  in its own section is enough.

No change to the operational mechanics, the diff-mode table, the
provider table, the iteration-loop steps, or the gotchas list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promotes the runbook link from a buried section near the end of the
README to a prominent callout right after the intro blockquote so an
LLM agent landing in this repo finds the operational manual without
having to scroll past the comparison tables and setup blocks.

Changes:
- New "For LLM coding agents" section right at the top: declares
  the tool LLM-friendly by design, names the audience explicitly
  (Claude, Codex, Cursor, etc.), and tells agents in a blockquote
  to go straight to docs/llm-code-review-runbook.md as their
  canonical entry point. Frames the rest of the README as the
  general-audience documentation and the runbook as the
  agent-targeted documentation.
- Tighten the existing "Operational runbook" section so it doesn't
  duplicate the top callout; it now just notes the file location
  and that the content serves both humans and agents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub repo was renamed from Airwhale/code-review to
Airwhale/local-gemini-code-review for clarity (the old name collided
visually with the upstream gemini-cli-extensions/code-review). GitHub
serves redirects from the old URL, but the canonical URL should
appear in the four places this repo references itself:

- review.py: default OPENROUTER_HTTP_REFERER header (surfaces in
  OpenRouter's dashboard for attribution)
- .env.example: documented OPENROUTER_HTTP_REFERER override
- README.md: ``git clone`` command in the Quick start; matching
  ``cd`` target updated to the new directory name
- docs/llm-code-review-runbook.md: Repository line in the Setup
  block

Local git remote also updated to the new URL via
``git remote set-url origin``; the parent project (where the runner
is invoked from) had no stale references.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A reader following the README's Quick Start or the runbook's
invocation block would create or expect a directory named
``code-review`` from the ``/path/to/code-review`` placeholder. Now
that the canonical repo slug is ``local-gemini-code-review``, the
placeholder examples match the actual name so a copy-paste produces
something consistent with the rest of the documentation. Five
occurrences updated across README.md and the runbook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two additive features; no behavior change for existing diff modes.

Model aliases (--provider openrouter only)
- Curated table of short names for the most useful "reviewer" models on
  OpenRouter so `--model claude` expands to `anthropic/claude-sonnet-4.5`
  instead of requiring the full vendor-prefixed slug each invocation.
  Aliases shipped: pro / gemini-pro, flash / gemini-flash, claude /
  claude-sonnet, claude-opus, gpt, gpt-mini, deepseek. Raw slugs still
  pass through unchanged so newer-than-the-alias-table models work too.
- Aliases with `--provider gemini` are a category error (the direct
  Gemini API path takes bare Gemini model names only) and exit fast
  with a clear message rather than silently sending an invalid slug.

Whole-codebase mode (--codebase)
- New mode in the existing mutually-exclusive source group (alongside
  --base / --pr / --staged). Bundles tracked files into a single
  delimited payload and reviews them as a whole, for "audit this repo
  I just inherited" or "find bugs in code none of us touched in this
  PR" situations the diff modes cannot serve.
- File selection pipeline: git ls-files (so .gitignore already filters
  node_modules, .venv, build artifacts) -> user --include globs ->
  user --exclude globs -> built-in defensive excludes (lock files,
  minified bundles, common binary asset extensions, */dist/*,
  */build/*) -> drop individual files >100 KB (logged on stderr).
- Pre-flight bundle cap at 700 K chars (~175 K tokens, conservative
  against both Gemini 2.5 Pro and Claude Sonnet 4.5 context limits).
  Exits with the 10 largest files in the current selection so the
  user can target --exclude effectively rather than paying for a
  request that would fail mid-flight on the smaller-context model.
- Cannot reuse the upstream code-review-commons skill because its
  Critical Constraints section says "comment only on lines beginning
  with + or -" -- that forbids commenting on anything in a whole-file
  bundle (no diff markers). Added fork-specific skill at
  skills/code-review-codebase/SKILL.md with the same Principal
  Software Engineer persona and severity rubric but with the
  diff-line constraint replaced with file-path + line-number anchoring
  via the `======== FILE: <path> ========` delimiter. New command at
  commands/codebase-review.toml activates the new skill and defines
  the per-file findings output template.

Architectural-summary output shape: explicit TODO
- Inline TODO comment in review.py's build_codebase_prompts() pointing
  to the runbook's "Future modes" section.
- Dedicated "Future modes" section in docs/llm-code-review-runbook.md
  documenting the proposed --summary flag and three reasons it's
  deferred (hallucination risk, less actionable, token-budget
  contention).

Internal refactor
- Extracted prompt building (build_diff_prompts /
  build_codebase_prompts) out of the call_openrouter / call_gemini
  provider functions so the wire path is mode-agnostic. Both providers
  now take pre-built (system_prompt, user_prompt) pairs directly.
- _resolve_model() helper for alias expansion logic.
- _glob_match() helper that tests patterns against both full POSIX
  paths and basenames so `--exclude 'test_*.py'` catches nested test
  files without requiring the user to also write `--exclude
  '**/test_*.py'`.

Smoke tests run end-to-end against this repo:
- --help renders new flags
- --model flash resolves to google/gemini-2.5-flash and produces output
- --provider gemini --model claude exits cleanly with code 2 and the
  designed error message
- --codebase reviews 15 files (76 KB) and produces structured per-file
  findings
- --codebase --include '*.toml' narrows to 3 files (4.5 KB)

Documentation
- README: fork-additions table updated with the two new files and
  docs/runbook; new "Model aliases" subsection under Provider
  selection; new "Whole-codebase mode (--codebase)" subsection under
  Modes with selection pipeline and bundle-cap behavior.
- docs/llm-code-review-runbook.md: rename "Diff modes" -> "Source
  modes" with --codebase added; new "Model selection" subsection with
  the alias table; new "Whole-codebase mode" subsection; new
  "Future modes (TODO)" section documenting both the
  architectural-summary deferral and the per-file-iteration
  alternative for codebases too big to bundle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The skip-message previously read ``skip (>97 KB): plan.md (203,107
bytes)`` because the formatting used ``MAX_INDIVIDUAL_FILE_BYTES //
1024`` (binary KiB) while the constant itself is named in decimal
``100_000``. User-facing docs and the constant both speak in decimal
KB, so the message was a unit-mismatch the user would have to mentally
reconcile every time they hit the cap.

Adds a single ``_format_size(n_bytes)`` helper that formats with
decimal-prefix units the way OS file managers do (1000-based KB up to
1 MB, then ``N.x MB``). One source of truth for size display means the
messages stay sensible if the constants ever change scale.

Applied to:
- Per-file skip message: ``skip (>100 KB): plan.md (203 KB)`` instead
  of ``skip (>97 KB): plan.md (203,107 bytes)``.
- Largest-files list in the oversize-bundle error: ``86 KB  path/...``
  instead of ``86,157 bytes  path/...``. Easier to scan and budget
  against the cap.

Left unchanged: the bundle-size error line itself, which still reads
``codebase bundle is 1,000,710 chars (limit 700,000)``. That's
literally a char count (not a byte count), and exact char counts let
the user reason precisely about how much they need to drop with
``--exclude``.

Verified end-to-end against the parent project: skip message now
reads ``skip (>100 KB): plan.md (203 KB)``; largest-files list reads
``86 KB  docs/...`` etc. with right-justified alignment. Exit code 1
preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…start

Three accuracy / discoverability fixes:

1. Bundle cap unit was wrong. Said "700 KB ≈ 175 K tokens" but the
   constant is ``MAX_BUNDLE_CHARS = 700_000`` -- 700,000 characters,
   not bytes. For pure-ASCII source these are roughly equivalent but
   the README should match what the code measures. Now reads
   "700,000 chars, ~175 K tokens at the standard 4-chars-per-token
   estimate" with a note explaining the cap is conservative against
   both Gemini 2.5 Pro (1 M-token context) and Claude Sonnet 4.5
   (200 K-token context) so the same selection works regardless of
   which ``--model`` the user targets.

2. Output format section only showed the diff-mode template, even
   though the fork now ships two distinct templates (diff and
   codebase). Added the codebase output shape side-by-side, with a
   note clarifying that codebase-mode line numbers are 1-indexed
   within each file rather than against any synthetic bundle-level
   line counter.

3. Quick start showed only ``--pr 6``, which under-sells the tool's
   breadth. Now demonstrates the four most useful invocation shapes
   on first contact: ``--base origin/main`` for iterative review,
   ``--pr <N>`` for an existing PR, ``--codebase`` for whole-repo
   audit, and ``--model claude`` for the alias path. The
   external-CWD usage is kept but moved below the runner-CWD
   examples since the runner-CWD shape is what users will copy on
   their first try.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The model's output regularly contains Unicode characters that cp1252
(the default stdout encoding on Windows) cannot encode -- arrows,
em-dashes, smart quotes, mermaid syntax. Symptom: after a successful
~60s model call, the runner crashes on the very last line
(``print(output)``) with ``UnicodeEncodeError: 'charmap' codec can't
encode character '\u2192'``. The user has already paid for the
tokens at that point.

Adds ``_ensure_utf8_stdout()`` called at the start of ``main()``. It
reconfigures stdout and stderr to UTF-8 with ``errors="replace"`` if
they aren't already UTF-8. No-op on macOS/Linux (already UTF-8).
``errors="replace"`` keeps the runner robust even against bytes
that aren't valid UTF-8 -- worst case the user sees a replacement
char rather than a crash.

Caught while iterating PR gemini-cli-extensions#8 of the parent project: the diff
included a mermaid arrow in AGENT_NOTES.md and the model's response
echoed it verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to the previous ``_ensure_utf8_stdout`` fix. ``text=True``
in ``subprocess.run`` defaults to ``locale.getpreferredencoding(False)``,
which on Windows is cp1252. Source files containing non-ASCII
characters (em-dashes, arrows, section signs) then arrive in Python
as mojibake -- ``â€"`` for ``--``, ``â†'`` for ``->``, ``§`` for
``§``. The model sees the mojibake in the diff and the next review
iteration flags "character encoding artifacts in documentation
files," a finding that does not exist in the source -- only in the
runner's decoding step.

Caught while iterating PR gemini-cli-extensions#8 of the parent project: a review round
flagged AGENT_NOTES.md, plan.md, and the new ADRs as having
encoding artifacts even though all four files were valid UTF-8 on
disk. The artifacts were introduced by ``_run_git`` and ``pr_diff``
decoding the git/gh subprocess output as cp1252.

Adds explicit ``encoding="utf-8", errors="replace"`` to both
``_run_git`` and ``pr_diff``. ``errors="replace"`` is defensive in
case git ever emits bytes that aren't valid UTF-8 (rare; usually a
corrupted file). No-op on macOS / Linux (already UTF-8).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…5/16000

The previous hardcoded temperature=0.2 (and the model-default ~8K
output cap) produced 1-2 findings per review round on diffs that
empirically contained more issues, requiring 5-7 iterations to
converge on real PRs (e.g. PR gemini-cli-extensions#8 in the parent project: 7 rounds,
12 distinct findings split across 4 cleanup commits).

Tuning analysis ran in chat: higher temperature broadens model
exploration so more findings surface per call, at the cost of a
moderately higher hallucination rate (the decline-comment contract
already handles those). The dollar cost increase is trivial
(~$0.02 per round) since output tokens are cheap and the
max_tokens ceiling only bills for what the model actually emits,
not the unused headroom.

Changes:
- ``DEFAULT_TEMPERATURE = 0.5`` and ``DEFAULT_MAX_TOKENS = 16000``
  as module-level constants with rationale-in-docstring.
- ``--temperature <float>`` and ``--max-tokens <int>`` CLI flags
  (precedence: CLI > env > default). Env vars
  ``$CODE_REVIEW_TEMPERATURE`` / ``$CODE_REVIEW_MAX_TOKENS`` let a
  user pin the defaults at their environment level without
  recompiling. Invalid env values fail fast with a clear message
  rather than silently falling back.
- ``call_openrouter`` and ``call_gemini`` both take ``temperature``
  and ``max_tokens`` as kwargs and inject them into the payload.
  OpenRouter uses ``temperature`` + ``max_tokens`` (snake_case per
  the OpenAI-compatible chat-completions spec); Gemini-direct uses
  ``temperature`` + ``maxOutputTokens`` (camelCase per the v1beta
  generateContent spec). Wire formats stay correct for each
  provider.
- Stderr ``Reviewing ...`` line now shows the active T and
  max_tokens so a user diagnosing a noisy review round can see at
  a glance what settings produced it.

Docs:
- ``.env.example`` documents the two new env vars next to the
  existing provider config.
- README adds a paragraph after the Modes table explaining the two
  knobs and the empirical rationale for the new defaults.
- Runbook gets a "Tuning sampling" subsection under Model
  selection with a tradeoffs table (when to raise / lower each).

Smoke-tested against the parent project's PR gemini-cli-extensions#8 diff (89K chars,
``--model flash --temperature 0.6 --max-tokens 8000``) -- stderr
correctly surfaced ``T=0.6, max_tokens=8000``, output produced as
expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…allers

Reshape the runner around being called by LLM agents in a loop, not just
by humans at a shell. Two failure modes were ambiguous before:

  1. Refusals returned content=None instead of a typed error, so callers
     couldn't distinguish "model refused" from "rate limit" from "diff
     too big" from "network down" without parsing prose.
  2. Real-world diffs with security/AML/policy vocabulary tripped
     content filters even when the code was plainly benign.

This commit:

- Introduces a typed ReviewError hierarchy with stable exit codes
  (0/1/2/10-14) covering CONFIG, SAFETY_REFUSAL, RATE_LIMIT,
  CONTEXT_OVERFLOW, PROVIDER_HICCUP, and TRANSPORT.
- Classifies null content from both providers via finish_reason
  (OpenRouter: safety/length/content_filter; Gemini: SAFETY/MAX_TOKENS/
  promptFeedback.blockReason) so SAFETY_REFUSAL and CONTEXT_OVERFLOW
  surface as distinct codes instead of UNKNOWN.
- Adds a DEFAULT_CONTEXT prefix wrapped in <CONTEXT_FOR_REVIEWER>
  framing the request as authorized code review with benign-but-
  adversarial-looking vocabulary. Overridable via --context /
  $CODE_REVIEW_CONTEXT, disable with --no-context.
- Auto-retries once on PROVIDER_HICCUP and TRANSPORT (recoverable);
  surfaces SAFETY_REFUSAL / RATE_LIMIT / CONTEXT_OVERFLOW / CONFIG
  immediately because retry-without-changes doesn't help.
- Emits a stable stderr prefix "ERROR: <CATEGORY> [exit <N>]" for
  cheap grep-based category extraction by agent callers.
- Documents the full contract in README (Safety context + Error model
  for LLM callers sections, with exit-code table, stderr format,
  retry policy, and decision tree) and cross-references it from the
  runbook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The whole-codebase bundle had no per-line anchors, so the model had to
count lines to produce a line-number reference. LLMs are bad at that:
findings on long files (>500 lines) drifted by 50-150 lines, with
direction varying (sometimes too high, sometimes too low). Drift
correlated with file size, consistent with "the model estimates from
visual position rather than counting."

Diff mode never had this problem because ``git diff -U5`` already
embeds ``@@ -L,N +L,N @@`` anchors plus context lines -- the model
transcribes from those instead of counting.

Fix: prefix every content line in the bundle with its 1-indexed line
number, ``cat -n`` style, e.g. ``   808: def _markdown_to_html(...):``.
Turns the line-number reference task from arithmetic (which the model
cannot do reliably) into transcription (which it does well). Skill +
command prompts updated to tell the model to transcribe the prefix
verbatim instead of counting.

Token cost: ~5-7 chars per line. A 50K-char bundle with ~1500 lines
adds ~10K chars (~2.5K tokens, ~6% input overhead). Cheap.

Verified on a 1087-line file: every finding's line number was exactly
correct. Previous behavior on the same scope would have been off by
50-150 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README's "For LLM coding agents" section previously just pointed
at the runbook. After running real review cycles, the flow is stable
enough that an LLM agent landing on the README should get the
procedure inline -- not a redirect that requires a second doc read
before the first invocation.

Adds to README:
- 9-step canonical iteration flow with explicit stop conditions
  (clean output / hallucination-only round / 4-round ceiling).
- ``git add -N`` trick for reviewing untracked files (codebase mode
  uses ``git ls-files`` and skips untracked files otherwise).
- Hedged findings ("if X is true, this is HIGH") as a decline signal
  -- empirically they resolve as false positives more often than as
  real bugs.
- Self-refuting findings (suggested fix matches existing code) as a
  no-action hallucination pattern.
- Reference to the line-number prefix fix (b124501) so a caller knows
  codebase-mode line numbers are now reliable.
- Per-round ledger template the cycle output settles into.

Runbook expanded in parallel:
- Iteration-loop pseudocode now includes the ``git add -N`` step and
  the multi-condition stop check.
- Accept/decline heuristics gain the "hedged" and "self-refuting"
  shapes with concrete examples drawn from real review cycles.
- Known gotchas note that codebase-mode line drift was a real
  pre-b124501 issue; if it returns, that's a regression worth
  investigating not a workaround to write.

No code changes. Documentation only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ran ``review.py --codebase`` against itself for two rounds. Six total
findings, one applied, five declined with decline-comments so future
self-review rounds don't churn on the same nits.

Applied (round 1):
- ``--include`` / ``--exclude`` argparse args now use ``default=[]``
  instead of ``default=None``. The argparse append-mutates-default
  quirk only matters across multiple ``parse_args()`` calls in one
  process, which a one-shot CLI never does. Drops two ``or []``
  call-site normalizations.

Declined-with-comments (preserves runbook's "decline comment as
contract with the next review round"):
- Round 1 L545: ``Sequence[str]`` would silently accept a ``str``
  (which iterates as single-char strings) and hide a real caller
  bug; explicit ``tuple | list`` blocks it at type-check time.
- Round 1 L551: Reviewer's "fnmatch ``*`` doesn't match ``/``"
  correction is empirically false (``fnmatch.fnmatch("a/b.py",
  "*.py") == True``). Existing docstring is correct; comment
  expanded to cite Python docs and the verified behavior.
- Round 2 L280: GPT-5 / GPT-5-mini aliases flagged as nonexistent.
  Reviewer's training cutoff predates their release; OpenRouter
  catalog has them today. Comment now calls out that an older
  reviewer may re-flag.
- Round 2 L1203: "Extract a helper for config resolution." Two
  call sites is too few to justify a 6-param abstraction that
  hides the env-var names from grep.
- Round 2 L1277: "Avoid redundant ``stat()`` calls." Cold error
  path; changing ``gather_codebase_files`` return type from
  ``list[Path]`` to ``list[tuple[Path, int]]`` for one syscall
  per file isn't a net win.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
validation for sampling envs, safer overflow-message fallback

Round 3 (gemini-2.5-pro) and round 4 (claude-sonnet-4.5) of self-review
on review.py. The two reviewers found different things: gemini caught
one HIGH typed-error contract violation, claude caught five MEDIUM-LOW
defensive-coding improvements plus one hallucination.

Accepted:

1. Round 3 L490: ``_run_git`` and ``pr_diff`` were calling ``sys.exit()``
   on subprocess failure, bypassing the typed error model documented in
   the README. An LLM caller saw raw subprocess exit codes (often 128
   for git, anything for gh) instead of the contract's ``ERROR: CONFIG
   [exit 2]``. Both now raise ``ConfigError`` with the subprocess
   stderr in ``detail``. Verified: ``--base nonexistent-ref`` now
   surfaces the documented CONFIG/2 contract.

2. Round 4 L722: ``_classify_http_error`` matched only three
   context-overflow phrases (``context_length``, ``too long``,
   ``exceeds the maximum``). Added ``token limit``, ``input too
   large``, ``payload size`` -- each observed in real provider error
   bodies. Moved the phrase set to a named tuple so adding a variant
   is a one-line edit.

3. Round 4 L824: OpenRouter response classification used
   ``finish_reason or native_finish_reason`` (truthy fallback). A
   safety-blocked completion that surfaces as normalized
   ``finish_reason="stop"`` + raw ``native_finish_reason="safety"``
   would miss the SafetyRefusal classification entirely. Now classifies
   by the UNION of both fields so a safety signal from either source
   wins over a generic "stop".

4. Round 4 L1227 + L1240: ``CODE_REVIEW_TEMPERATURE`` and
   ``CODE_REVIEW_MAX_TOKENS`` were converted to numbers but not range-
   validated. A user setting ``temperature=999`` or ``max_tokens=-1``
   would get an opaque provider 4xx (UNKNOWN/1) instead of CONFIG/2.
   Both now validate range and raise ConfigError before the call.
   Verified: ``CODE_REVIEW_TEMPERATURE=999`` and ``=0`` both produce
   typed CONFIG errors with actionable Reason text.

5. Round 4 L1302: Re-stat in the ContextOverflow error path could
   raise OSError if a file disappeared between ``bundle_codebase``
   and the size-rendering branch -- the error path would then crash
   with an OSError that buries the original overflow message. Wrap
   the stat in a helper that returns None on OSError and skip
   missing files.

Declined-with-comments:

- Round 3 L27 (SKILL.md typo): reviewer's "fix" added an unmatched
  ``)`` to a function signature example, making it itself broken.
  Hallucination; no action.
- Round 4 L938 (Gemini uppercase / OpenRouter lowercase): casing
  matches each provider's wire format intentionally; normalizing
  would create divergence from the actual API response.
- Round 4 L6 (placeholder XML tag): reviewer agrees it's stylistic
  not a bug; current substitution shape works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New --provider ollama path posts to a local Ollama server via its
  OpenAI-compat /v1/chat/completions endpoint. No API key required;
  configurable via OLLAMA_HOST / OLLAMA_MODEL / OLLAMA_TIMEOUT in .env.
  Default model qwen3-coder:30b (30B MoE coder with ~3.3B active params --
  quality/speed sweet spot on CPU because active params drive inference
  speed, not total params).
- Refactor MODEL_ALIASES into per-provider MODEL_ALIASES_BY_PROVIDER.
  Ollama gets `local` (qwen3-coder:30b) and `local-pro` (qwen3-coder-next).
  OpenRouter aliases unchanged; gemini-direct remains aliasless. Cross-
  provider mismatches now surface as typed CONFIG errors that name the
  right --provider instead of sending an invalid slug upstream. A flat
  backwards-compat MODEL_ALIASES view over the OpenRouter slice is
  preserved for any external importer.
- New call_ollama() function with surgical error mapping: connection
  refused -> ConfigError pointing at `ollama serve`; HTTP 404 -> ConfigError
  with the exact `ollama pull <model>` command to run; everything else
  routes through the shared _classify_http_error mapper. No SafetyRefusal
  branch (Ollama doesn't have a content-filter mode).
- Longer default timeout for the Ollama provider (DEFAULT_OLLAMA_TIMEOUT
  = 1800 s) to absorb CPU cold-starts (10-60 s model load on first call)
  and thorough reviews of large diffs.
- Default temperature lowered from 0.5 to 0.3 after cross-model integration
  testing observed `google/gemini-2.5-pro` hallucinating a HIGH-severity
  finding referencing a CLI flag and "help text" that do not exist in the
  codebase. The suggested fix would have crashed the runner with
  AttributeError. The constant's docstring carries the full history
  (0.2 too conservative -> 0.5 hallucinated -> 0.3 compromise).
- New --ollama-host CLI flag for non-default ports / hosts / WSL setups
  without localhost mirroring.
- README + .env.example + docs/llm-code-review-runbook.md updated for the
  three-provider model, the per-provider alias tables, the retuned
  temperature default, and a new "Local vs cloud" section documenting
  the opposing failure modes (cloud hallucinates, local under-reports)
  with the specific cross-model test result as evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@google-cla
Copy link
Copy Markdown

google-cla Bot commented May 20, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@Airwhale Airwhale closed this May 20, 2026
@Airwhale
Copy link
Copy Markdown
Author

Airwhale commented May 20, 2026

Apologies, this was opened against upstream by mistake : the changes are fork-specific to my local fork and accidentally opened the PR upstream. Thank you to the authors and maintainers of this useful tool! Been amazing to adjust it to my own workflow, local LLM's, etc...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant