Add Ollama provider for local code review + retune default temperature#23
Closed
Airwhale wants to merge 19 commits into
Closed
Add Ollama provider for local code review + retune default temperature#23Airwhale wants to merge 19 commits into
Airwhale wants to merge 19 commits into
Conversation
Loads `skills/code-review-commons/SKILL.md` and `commands/code-review.toml` unchanged and POSTs them to OpenRouter's chat-completions endpoint with `google/gemini-2.5-pro` (or a configurable model). Matches the GitHub `/gemini review` bot behavior at ~30-60s per round instead of 5-15 minutes, by bypassing the GitHub webhook -> Google job-queue path. Files added: - review.py single-file uv-runnable runner; --base/--pr/--staged modes - pyproject.toml httpx + python-dotenv deps, uv-managed - .env.example OPENROUTER_API_KEY and optional model/header overrides - .gitignore protects .env and uv cache from leaking - README.md documents the fork's mode alongside the upstream extension Upstream skill/command prompts and gemini-extension.json untouched so the fork rebases cleanly against `gemini-cli-extensions/code-review` upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a second transport path for the runner: `--provider gemini` calls Google AI Studio's `generateContent` endpoint directly using `GEMINI_API_KEY`, alongside the existing `--provider openrouter` path (default, uses `OPENROUTER_API_KEY`). Both paths send the same upstream skill + command prompts and the same `gemini-2.5-pro` model by default, so review behavior is materially equivalent -- callers pick whichever key / billing relationship they already have. Model defaults differ by provider because OpenRouter prefixes vendor names (`google/gemini-2.5-pro`) while the Gemini API takes the bare name (`gemini-2.5-pro`). Per-provider env-var overrides (`OPENROUTER_MODEL`, `GEMINI_MODEL`) and an explicit `--model` flag remain available. README rewritten to make the fork's modifications obvious up front: the table at the top calls out each new/modified file, the provider table documents the trade-offs (including the free-tier `gemini-2.5-pro` quota gotcha), and the upstream CLI install command now correctly points at `gemini-cli-extensions/code-review` rather than this fork. Smoke-tested both paths end-to-end against this repo's own diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the iteration-partner workflow used in real review cycles (PRs gemini-cli-extensions#5, gemini-cli-extensions#6, gemini-cli-extensions#7 in the parent project): the four diff modes (--base, --pr, --staged, default merge-base), the two providers (openrouter, gemini) and when to pick which, the accept/decline heuristics that converge to a clean pass without churn, and the known gotchas (transient None responses, free-tier 429 on gemini-2.5-pro direct, the two-dot diff semantics behind --base that lets working-tree edits show up between rounds). The decline contract is the central operational rule: every declined finding gets a code comment explaining the rationale, otherwise the next iteration's model surfaces the same finding again. The comment is how you teach the reviewer about design decisions it cannot infer from the diff alone. Links from the root README so the doc is discoverable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes project-specific references (parent-project PR numbers, real findings cited verbatim, references to "the parent project") so the runbook reads as a standalone operational guide that works for anyone cloning this fork, not just contributors who happen to also work on the project where the workflow was originally developed. Changes: - Strip "real review cycles (PRs gemini-cli-extensions#5, gemini-cli-extensions#6, gemini-cli-extensions#7 in the parent project)" framing from the opening; the runbook stands on its own merit. - Convert the accept/decline examples from specific past findings to *shapes* (e.g. "a pair of API endpoints that look redundant but are deliberately split for a planned future migration") so the pattern transfers to any codebase. - Replace the per-round-tracking example table's content with illustrative-but-generic finding descriptions rather than verbatim rows from one specific PR's review log. - Promote the decline contract to its own section to surface the central operational rule earlier and more prominently. - Drop the gotcha that re-stated the decline contract; one mention in its own section is enough. No change to the operational mechanics, the diff-mode table, the provider table, the iteration-loop steps, or the gotchas list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promotes the runbook link from a buried section near the end of the README to a prominent callout right after the intro blockquote so an LLM agent landing in this repo finds the operational manual without having to scroll past the comparison tables and setup blocks. Changes: - New "For LLM coding agents" section right at the top: declares the tool LLM-friendly by design, names the audience explicitly (Claude, Codex, Cursor, etc.), and tells agents in a blockquote to go straight to docs/llm-code-review-runbook.md as their canonical entry point. Frames the rest of the README as the general-audience documentation and the runbook as the agent-targeted documentation. - Tighten the existing "Operational runbook" section so it doesn't duplicate the top callout; it now just notes the file location and that the content serves both humans and agents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub repo was renamed from Airwhale/code-review to Airwhale/local-gemini-code-review for clarity (the old name collided visually with the upstream gemini-cli-extensions/code-review). GitHub serves redirects from the old URL, but the canonical URL should appear in the four places this repo references itself: - review.py: default OPENROUTER_HTTP_REFERER header (surfaces in OpenRouter's dashboard for attribution) - .env.example: documented OPENROUTER_HTTP_REFERER override - README.md: ``git clone`` command in the Quick start; matching ``cd`` target updated to the new directory name - docs/llm-code-review-runbook.md: Repository line in the Setup block Local git remote also updated to the new URL via ``git remote set-url origin``; the parent project (where the runner is invoked from) had no stale references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A reader following the README's Quick Start or the runbook's invocation block would create or expect a directory named ``code-review`` from the ``/path/to/code-review`` placeholder. Now that the canonical repo slug is ``local-gemini-code-review``, the placeholder examples match the actual name so a copy-paste produces something consistent with the rest of the documentation. Five occurrences updated across README.md and the runbook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two additive features; no behavior change for existing diff modes. Model aliases (--provider openrouter only) - Curated table of short names for the most useful "reviewer" models on OpenRouter so `--model claude` expands to `anthropic/claude-sonnet-4.5` instead of requiring the full vendor-prefixed slug each invocation. Aliases shipped: pro / gemini-pro, flash / gemini-flash, claude / claude-sonnet, claude-opus, gpt, gpt-mini, deepseek. Raw slugs still pass through unchanged so newer-than-the-alias-table models work too. - Aliases with `--provider gemini` are a category error (the direct Gemini API path takes bare Gemini model names only) and exit fast with a clear message rather than silently sending an invalid slug. Whole-codebase mode (--codebase) - New mode in the existing mutually-exclusive source group (alongside --base / --pr / --staged). Bundles tracked files into a single delimited payload and reviews them as a whole, for "audit this repo I just inherited" or "find bugs in code none of us touched in this PR" situations the diff modes cannot serve. - File selection pipeline: git ls-files (so .gitignore already filters node_modules, .venv, build artifacts) -> user --include globs -> user --exclude globs -> built-in defensive excludes (lock files, minified bundles, common binary asset extensions, */dist/*, */build/*) -> drop individual files >100 KB (logged on stderr). - Pre-flight bundle cap at 700 K chars (~175 K tokens, conservative against both Gemini 2.5 Pro and Claude Sonnet 4.5 context limits). Exits with the 10 largest files in the current selection so the user can target --exclude effectively rather than paying for a request that would fail mid-flight on the smaller-context model. - Cannot reuse the upstream code-review-commons skill because its Critical Constraints section says "comment only on lines beginning with + or -" -- that forbids commenting on anything in a whole-file bundle (no diff markers). Added fork-specific skill at skills/code-review-codebase/SKILL.md with the same Principal Software Engineer persona and severity rubric but with the diff-line constraint replaced with file-path + line-number anchoring via the `======== FILE: <path> ========` delimiter. New command at commands/codebase-review.toml activates the new skill and defines the per-file findings output template. Architectural-summary output shape: explicit TODO - Inline TODO comment in review.py's build_codebase_prompts() pointing to the runbook's "Future modes" section. - Dedicated "Future modes" section in docs/llm-code-review-runbook.md documenting the proposed --summary flag and three reasons it's deferred (hallucination risk, less actionable, token-budget contention). Internal refactor - Extracted prompt building (build_diff_prompts / build_codebase_prompts) out of the call_openrouter / call_gemini provider functions so the wire path is mode-agnostic. Both providers now take pre-built (system_prompt, user_prompt) pairs directly. - _resolve_model() helper for alias expansion logic. - _glob_match() helper that tests patterns against both full POSIX paths and basenames so `--exclude 'test_*.py'` catches nested test files without requiring the user to also write `--exclude '**/test_*.py'`. Smoke tests run end-to-end against this repo: - --help renders new flags - --model flash resolves to google/gemini-2.5-flash and produces output - --provider gemini --model claude exits cleanly with code 2 and the designed error message - --codebase reviews 15 files (76 KB) and produces structured per-file findings - --codebase --include '*.toml' narrows to 3 files (4.5 KB) Documentation - README: fork-additions table updated with the two new files and docs/runbook; new "Model aliases" subsection under Provider selection; new "Whole-codebase mode (--codebase)" subsection under Modes with selection pipeline and bundle-cap behavior. - docs/llm-code-review-runbook.md: rename "Diff modes" -> "Source modes" with --codebase added; new "Model selection" subsection with the alias table; new "Whole-codebase mode" subsection; new "Future modes (TODO)" section documenting both the architectural-summary deferral and the per-file-iteration alternative for codebases too big to bundle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The skip-message previously read ``skip (>97 KB): plan.md (203,107 bytes)`` because the formatting used ``MAX_INDIVIDUAL_FILE_BYTES // 1024`` (binary KiB) while the constant itself is named in decimal ``100_000``. User-facing docs and the constant both speak in decimal KB, so the message was a unit-mismatch the user would have to mentally reconcile every time they hit the cap. Adds a single ``_format_size(n_bytes)`` helper that formats with decimal-prefix units the way OS file managers do (1000-based KB up to 1 MB, then ``N.x MB``). One source of truth for size display means the messages stay sensible if the constants ever change scale. Applied to: - Per-file skip message: ``skip (>100 KB): plan.md (203 KB)`` instead of ``skip (>97 KB): plan.md (203,107 bytes)``. - Largest-files list in the oversize-bundle error: ``86 KB path/...`` instead of ``86,157 bytes path/...``. Easier to scan and budget against the cap. Left unchanged: the bundle-size error line itself, which still reads ``codebase bundle is 1,000,710 chars (limit 700,000)``. That's literally a char count (not a byte count), and exact char counts let the user reason precisely about how much they need to drop with ``--exclude``. Verified end-to-end against the parent project: skip message now reads ``skip (>100 KB): plan.md (203 KB)``; largest-files list reads ``86 KB docs/...`` etc. with right-justified alignment. Exit code 1 preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…start Three accuracy / discoverability fixes: 1. Bundle cap unit was wrong. Said "700 KB ≈ 175 K tokens" but the constant is ``MAX_BUNDLE_CHARS = 700_000`` -- 700,000 characters, not bytes. For pure-ASCII source these are roughly equivalent but the README should match what the code measures. Now reads "700,000 chars, ~175 K tokens at the standard 4-chars-per-token estimate" with a note explaining the cap is conservative against both Gemini 2.5 Pro (1 M-token context) and Claude Sonnet 4.5 (200 K-token context) so the same selection works regardless of which ``--model`` the user targets. 2. Output format section only showed the diff-mode template, even though the fork now ships two distinct templates (diff and codebase). Added the codebase output shape side-by-side, with a note clarifying that codebase-mode line numbers are 1-indexed within each file rather than against any synthetic bundle-level line counter. 3. Quick start showed only ``--pr 6``, which under-sells the tool's breadth. Now demonstrates the four most useful invocation shapes on first contact: ``--base origin/main`` for iterative review, ``--pr <N>`` for an existing PR, ``--codebase`` for whole-repo audit, and ``--model claude`` for the alias path. The external-CWD usage is kept but moved below the runner-CWD examples since the runner-CWD shape is what users will copy on their first try. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The model's output regularly contains Unicode characters that cp1252 (the default stdout encoding on Windows) cannot encode -- arrows, em-dashes, smart quotes, mermaid syntax. Symptom: after a successful ~60s model call, the runner crashes on the very last line (``print(output)``) with ``UnicodeEncodeError: 'charmap' codec can't encode character '\u2192'``. The user has already paid for the tokens at that point. Adds ``_ensure_utf8_stdout()`` called at the start of ``main()``. It reconfigures stdout and stderr to UTF-8 with ``errors="replace"`` if they aren't already UTF-8. No-op on macOS/Linux (already UTF-8). ``errors="replace"`` keeps the runner robust even against bytes that aren't valid UTF-8 -- worst case the user sees a replacement char rather than a crash. Caught while iterating PR gemini-cli-extensions#8 of the parent project: the diff included a mermaid arrow in AGENT_NOTES.md and the model's response echoed it verbatim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to the previous ``_ensure_utf8_stdout`` fix. ``text=True`` in ``subprocess.run`` defaults to ``locale.getpreferredencoding(False)``, which on Windows is cp1252. Source files containing non-ASCII characters (em-dashes, arrows, section signs) then arrive in Python as mojibake -- ``â€"`` for ``--``, ``â†'`` for ``->``, ``§`` for ``§``. The model sees the mojibake in the diff and the next review iteration flags "character encoding artifacts in documentation files," a finding that does not exist in the source -- only in the runner's decoding step. Caught while iterating PR gemini-cli-extensions#8 of the parent project: a review round flagged AGENT_NOTES.md, plan.md, and the new ADRs as having encoding artifacts even though all four files were valid UTF-8 on disk. The artifacts were introduced by ``_run_git`` and ``pr_diff`` decoding the git/gh subprocess output as cp1252. Adds explicit ``encoding="utf-8", errors="replace"`` to both ``_run_git`` and ``pr_diff``. ``errors="replace"`` is defensive in case git ever emits bytes that aren't valid UTF-8 (rare; usually a corrupted file). No-op on macOS / Linux (already UTF-8). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…5/16000 The previous hardcoded temperature=0.2 (and the model-default ~8K output cap) produced 1-2 findings per review round on diffs that empirically contained more issues, requiring 5-7 iterations to converge on real PRs (e.g. PR gemini-cli-extensions#8 in the parent project: 7 rounds, 12 distinct findings split across 4 cleanup commits). Tuning analysis ran in chat: higher temperature broadens model exploration so more findings surface per call, at the cost of a moderately higher hallucination rate (the decline-comment contract already handles those). The dollar cost increase is trivial (~$0.02 per round) since output tokens are cheap and the max_tokens ceiling only bills for what the model actually emits, not the unused headroom. Changes: - ``DEFAULT_TEMPERATURE = 0.5`` and ``DEFAULT_MAX_TOKENS = 16000`` as module-level constants with rationale-in-docstring. - ``--temperature <float>`` and ``--max-tokens <int>`` CLI flags (precedence: CLI > env > default). Env vars ``$CODE_REVIEW_TEMPERATURE`` / ``$CODE_REVIEW_MAX_TOKENS`` let a user pin the defaults at their environment level without recompiling. Invalid env values fail fast with a clear message rather than silently falling back. - ``call_openrouter`` and ``call_gemini`` both take ``temperature`` and ``max_tokens`` as kwargs and inject them into the payload. OpenRouter uses ``temperature`` + ``max_tokens`` (snake_case per the OpenAI-compatible chat-completions spec); Gemini-direct uses ``temperature`` + ``maxOutputTokens`` (camelCase per the v1beta generateContent spec). Wire formats stay correct for each provider. - Stderr ``Reviewing ...`` line now shows the active T and max_tokens so a user diagnosing a noisy review round can see at a glance what settings produced it. Docs: - ``.env.example`` documents the two new env vars next to the existing provider config. - README adds a paragraph after the Modes table explaining the two knobs and the empirical rationale for the new defaults. - Runbook gets a "Tuning sampling" subsection under Model selection with a tradeoffs table (when to raise / lower each). Smoke-tested against the parent project's PR gemini-cli-extensions#8 diff (89K chars, ``--model flash --temperature 0.6 --max-tokens 8000``) -- stderr correctly surfaced ``T=0.6, max_tokens=8000``, output produced as expected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…allers
Reshape the runner around being called by LLM agents in a loop, not just
by humans at a shell. Two failure modes were ambiguous before:
1. Refusals returned content=None instead of a typed error, so callers
couldn't distinguish "model refused" from "rate limit" from "diff
too big" from "network down" without parsing prose.
2. Real-world diffs with security/AML/policy vocabulary tripped
content filters even when the code was plainly benign.
This commit:
- Introduces a typed ReviewError hierarchy with stable exit codes
(0/1/2/10-14) covering CONFIG, SAFETY_REFUSAL, RATE_LIMIT,
CONTEXT_OVERFLOW, PROVIDER_HICCUP, and TRANSPORT.
- Classifies null content from both providers via finish_reason
(OpenRouter: safety/length/content_filter; Gemini: SAFETY/MAX_TOKENS/
promptFeedback.blockReason) so SAFETY_REFUSAL and CONTEXT_OVERFLOW
surface as distinct codes instead of UNKNOWN.
- Adds a DEFAULT_CONTEXT prefix wrapped in <CONTEXT_FOR_REVIEWER>
framing the request as authorized code review with benign-but-
adversarial-looking vocabulary. Overridable via --context /
$CODE_REVIEW_CONTEXT, disable with --no-context.
- Auto-retries once on PROVIDER_HICCUP and TRANSPORT (recoverable);
surfaces SAFETY_REFUSAL / RATE_LIMIT / CONTEXT_OVERFLOW / CONFIG
immediately because retry-without-changes doesn't help.
- Emits a stable stderr prefix "ERROR: <CATEGORY> [exit <N>]" for
cheap grep-based category extraction by agent callers.
- Documents the full contract in README (Safety context + Error model
for LLM callers sections, with exit-code table, stderr format,
retry policy, and decision tree) and cross-references it from the
runbook.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The whole-codebase bundle had no per-line anchors, so the model had to count lines to produce a line-number reference. LLMs are bad at that: findings on long files (>500 lines) drifted by 50-150 lines, with direction varying (sometimes too high, sometimes too low). Drift correlated with file size, consistent with "the model estimates from visual position rather than counting." Diff mode never had this problem because ``git diff -U5`` already embeds ``@@ -L,N +L,N @@`` anchors plus context lines -- the model transcribes from those instead of counting. Fix: prefix every content line in the bundle with its 1-indexed line number, ``cat -n`` style, e.g. `` 808: def _markdown_to_html(...):``. Turns the line-number reference task from arithmetic (which the model cannot do reliably) into transcription (which it does well). Skill + command prompts updated to tell the model to transcribe the prefix verbatim instead of counting. Token cost: ~5-7 chars per line. A 50K-char bundle with ~1500 lines adds ~10K chars (~2.5K tokens, ~6% input overhead). Cheap. Verified on a 1087-line file: every finding's line number was exactly correct. Previous behavior on the same scope would have been off by 50-150 lines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README's "For LLM coding agents" section previously just pointed
at the runbook. After running real review cycles, the flow is stable
enough that an LLM agent landing on the README should get the
procedure inline -- not a redirect that requires a second doc read
before the first invocation.
Adds to README:
- 9-step canonical iteration flow with explicit stop conditions
(clean output / hallucination-only round / 4-round ceiling).
- ``git add -N`` trick for reviewing untracked files (codebase mode
uses ``git ls-files`` and skips untracked files otherwise).
- Hedged findings ("if X is true, this is HIGH") as a decline signal
-- empirically they resolve as false positives more often than as
real bugs.
- Self-refuting findings (suggested fix matches existing code) as a
no-action hallucination pattern.
- Reference to the line-number prefix fix (b124501) so a caller knows
codebase-mode line numbers are now reliable.
- Per-round ledger template the cycle output settles into.
Runbook expanded in parallel:
- Iteration-loop pseudocode now includes the ``git add -N`` step and
the multi-condition stop check.
- Accept/decline heuristics gain the "hedged" and "self-refuting"
shapes with concrete examples drawn from real review cycles.
- Known gotchas note that codebase-mode line drift was a real
pre-b124501 issue; if it returns, that's a regression worth
investigating not a workaround to write.
No code changes. Documentation only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ran ``review.py --codebase`` against itself for two rounds. Six total
findings, one applied, five declined with decline-comments so future
self-review rounds don't churn on the same nits.
Applied (round 1):
- ``--include`` / ``--exclude`` argparse args now use ``default=[]``
instead of ``default=None``. The argparse append-mutates-default
quirk only matters across multiple ``parse_args()`` calls in one
process, which a one-shot CLI never does. Drops two ``or []``
call-site normalizations.
Declined-with-comments (preserves runbook's "decline comment as
contract with the next review round"):
- Round 1 L545: ``Sequence[str]`` would silently accept a ``str``
(which iterates as single-char strings) and hide a real caller
bug; explicit ``tuple | list`` blocks it at type-check time.
- Round 1 L551: Reviewer's "fnmatch ``*`` doesn't match ``/``"
correction is empirically false (``fnmatch.fnmatch("a/b.py",
"*.py") == True``). Existing docstring is correct; comment
expanded to cite Python docs and the verified behavior.
- Round 2 L280: GPT-5 / GPT-5-mini aliases flagged as nonexistent.
Reviewer's training cutoff predates their release; OpenRouter
catalog has them today. Comment now calls out that an older
reviewer may re-flag.
- Round 2 L1203: "Extract a helper for config resolution." Two
call sites is too few to justify a 6-param abstraction that
hides the env-var names from grep.
- Round 2 L1277: "Avoid redundant ``stat()`` calls." Cold error
path; changing ``gather_codebase_files`` return type from
``list[Path]`` to ``list[tuple[Path, int]]`` for one syscall
per file isn't a net win.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
validation for sampling envs, safer overflow-message fallback Round 3 (gemini-2.5-pro) and round 4 (claude-sonnet-4.5) of self-review on review.py. The two reviewers found different things: gemini caught one HIGH typed-error contract violation, claude caught five MEDIUM-LOW defensive-coding improvements plus one hallucination. Accepted: 1. Round 3 L490: ``_run_git`` and ``pr_diff`` were calling ``sys.exit()`` on subprocess failure, bypassing the typed error model documented in the README. An LLM caller saw raw subprocess exit codes (often 128 for git, anything for gh) instead of the contract's ``ERROR: CONFIG [exit 2]``. Both now raise ``ConfigError`` with the subprocess stderr in ``detail``. Verified: ``--base nonexistent-ref`` now surfaces the documented CONFIG/2 contract. 2. Round 4 L722: ``_classify_http_error`` matched only three context-overflow phrases (``context_length``, ``too long``, ``exceeds the maximum``). Added ``token limit``, ``input too large``, ``payload size`` -- each observed in real provider error bodies. Moved the phrase set to a named tuple so adding a variant is a one-line edit. 3. Round 4 L824: OpenRouter response classification used ``finish_reason or native_finish_reason`` (truthy fallback). A safety-blocked completion that surfaces as normalized ``finish_reason="stop"`` + raw ``native_finish_reason="safety"`` would miss the SafetyRefusal classification entirely. Now classifies by the UNION of both fields so a safety signal from either source wins over a generic "stop". 4. Round 4 L1227 + L1240: ``CODE_REVIEW_TEMPERATURE`` and ``CODE_REVIEW_MAX_TOKENS`` were converted to numbers but not range- validated. A user setting ``temperature=999`` or ``max_tokens=-1`` would get an opaque provider 4xx (UNKNOWN/1) instead of CONFIG/2. Both now validate range and raise ConfigError before the call. Verified: ``CODE_REVIEW_TEMPERATURE=999`` and ``=0`` both produce typed CONFIG errors with actionable Reason text. 5. Round 4 L1302: Re-stat in the ContextOverflow error path could raise OSError if a file disappeared between ``bundle_codebase`` and the size-rendering branch -- the error path would then crash with an OSError that buries the original overflow message. Wrap the stat in a helper that returns None on OSError and skip missing files. Declined-with-comments: - Round 3 L27 (SKILL.md typo): reviewer's "fix" added an unmatched ``)`` to a function signature example, making it itself broken. Hallucination; no action. - Round 4 L938 (Gemini uppercase / OpenRouter lowercase): casing matches each provider's wire format intentionally; normalizing would create divergence from the actual API response. - Round 4 L6 (placeholder XML tag): reviewer agrees it's stylistic not a bug; current substitution shape works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New --provider ollama path posts to a local Ollama server via its OpenAI-compat /v1/chat/completions endpoint. No API key required; configurable via OLLAMA_HOST / OLLAMA_MODEL / OLLAMA_TIMEOUT in .env. Default model qwen3-coder:30b (30B MoE coder with ~3.3B active params -- quality/speed sweet spot on CPU because active params drive inference speed, not total params). - Refactor MODEL_ALIASES into per-provider MODEL_ALIASES_BY_PROVIDER. Ollama gets `local` (qwen3-coder:30b) and `local-pro` (qwen3-coder-next). OpenRouter aliases unchanged; gemini-direct remains aliasless. Cross- provider mismatches now surface as typed CONFIG errors that name the right --provider instead of sending an invalid slug upstream. A flat backwards-compat MODEL_ALIASES view over the OpenRouter slice is preserved for any external importer. - New call_ollama() function with surgical error mapping: connection refused -> ConfigError pointing at `ollama serve`; HTTP 404 -> ConfigError with the exact `ollama pull <model>` command to run; everything else routes through the shared _classify_http_error mapper. No SafetyRefusal branch (Ollama doesn't have a content-filter mode). - Longer default timeout for the Ollama provider (DEFAULT_OLLAMA_TIMEOUT = 1800 s) to absorb CPU cold-starts (10-60 s model load on first call) and thorough reviews of large diffs. - Default temperature lowered from 0.5 to 0.3 after cross-model integration testing observed `google/gemini-2.5-pro` hallucinating a HIGH-severity finding referencing a CLI flag and "help text" that do not exist in the codebase. The suggested fix would have crashed the runner with AttributeError. The constant's docstring carries the full history (0.2 too conservative -> 0.5 hallucinated -> 0.3 compromise). - New --ollama-host CLI flag for non-default ports / hosts / WSL setups without localhost mirroring. - README + .env.example + docs/llm-code-review-runbook.md updated for the three-provider model, the per-provider alias tables, the retuned temperature default, and a new "Local vs cloud" section documenting the opposing failure modes (cloud hallucinates, local under-reports) with the specific cross-model test result as evidence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
Author
|
Apologies, this was opened against upstream by mistake : the changes are fork-specific to my local fork and accidentally opened the PR upstream. Thank you to the authors and maintainers of this useful tool! Been amazing to adjust it to my own workflow, local LLM's, etc... |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
--provider ollamapath that POSTs to a local Ollama server's OpenAI-compatible/v1/chat/completionsendpoint. No API key, no token cost, code never leaves the machine.MODEL_ALIASES_BY_PROVIDER); addslocal/local-proaliases for Ollama; preserves OpenRouter aliases unchanged; surfaces cross-provider mismatches as typedCONFIGerrors that name the right--provider.temperaturefrom0.5to0.3after cross-model integration testing observedgoogle/gemini-2.5-prohallucinating a HIGH-severity finding referencing a CLI flag that does not exist in the code (the proposed fix would have crashed the runner withAttributeError).README.md,.env.example, anddocs/llm-code-review-runbook.mdfor the new three-provider model and the retuned temperature default. Adds a "Local vs cloud" section documenting the opposing failure modes (cloud hallucinates, local under-reports).What the integration adds
--provider ollama--ollama-host URL$OLLAMA_HOSTorhttp://localhost:11434OLLAMA_MODELenvqwen3-coder:30b--provider ollamais selectedOLLAMA_TIMEOUTenv1800(seconds)--model localaliasqwen3-coder:30b--model local-proaliasqwen3-coder-nextWhy
qwen3-coder:30bis the defaultIt is a Mixture-of-Experts model with 3.3B active parameters per token despite being 30B total. On CPU, active-param count drives inference speed, not total params — so this model runs roughly as fast as a 3B dense model while having the quality of a 30B model. For users with a GPU, the speed advantage is less pronounced but the quality benefit holds.
Why the temperature changed
Cross-model integration testing on the same 31 K-char diff (the diff of this PR, prior to commit) produced opposite failure modes:
qwen3-coder:30bvia local Ollama: 3 minutes, clean "no issues found." Likely missed legitimate nits but did not fabricate any.google/gemini-2.5-provia OpenRouter at the previoustemperature=0.5: 80 seconds, one HIGH-severity finding referencingargs.timeoutand--timeoutflag. Neither the flag nor the attribute exist in the codebase. Verified bygrepreturning zero matches. The proposed fix would have crashed the runner withAttributeError: 'Namespace' object has no attribute 'timeout'.The hallucination at
0.5motivated reducing temperature to0.3— tighter than0.5to cut hallucination, looser than the original0.2to keep "more findings than too few." TheDEFAULT_TEMPERATUREconstant docstring carries the full history.Typed error model for Ollama
The new
call_ollamafunction distinguishes:ConfigError(exit 2) suggestingollama serveand--ollama-host. No retry — config problem.ConfigError(exit 2) with the exactollama pull <model>command. No retry — model not pulled._classify_http_errorso rate-limit / overflow / transport are typed consistently with the cloud providers.finish_reason=length→ContextOverflow(exit 12). NoSafetyRefusalbranch — Ollama has no content-filter mode.Test plan
import reviewsucceeds with all three providers and the new constants--helpshows--provider {openrouter,gemini,ollama}and--ollama-host--provider ollama --base HEADagainst a small diff with default model (qwen3-coder:30b) produces a clean review in under 1 minute--provider ollama --base HEADagainst the entire 31 K-char diff of this PR returns a coherent summary + clean verdict in ~3 min--provider openrouterreturns structured output (and surfaced the hallucination that prompted the temperature retune)qwen3-coder-nextand verify thelocal-proalias dispatch (deferred — large download)--provider ollama(deferred — needs Ollama in CI image)Notes for reviewers
MODEL_ALIASEStop-level dict is preserved as a backwards-compat view overMODEL_ALIASES_BY_PROVIDER["openrouter"]in case any external importer relied on it. If you're confident nothing does, the alias can be deleted in a follow-up.OLLAMA_TIMEOUTper env.🤖 Generated with Claude Code