|
# Evalbuff — Brainstorm

> Generate evals for *your* codebase. Not generic benchmarks — codebase-specific e2e testing, review, and context for AI coding agents.

## What is Evalbuff?

A CLI tool that helps teams build, run, and improve end-to-end evaluations for their codebase. It's intended to be used by:

- **The coding agent** — to check its own changes in a review step
- **CI** — to run core flows and grade output quality
- **The human developer** — to define flows, dump knowledge, and tune evals

Evalbuff is **not a coding agent**. It evaluates, reviews, and provides context. This means it complements any coding agent (Codebuff, Claude Code, Cursor, Copilot, etc.) without competing with them.
## Commands

| Command | Audience | Description |
|---------|----------|-------------|
| `evalbuff` | Human | Fancy TUI for browsing/editing knowledge, evals, and results |
| `evalbuff init` | Human | Initialize evalbuff in a project |
| `evalbuff context <prompt>` | Agent / Human | Return relevant files, knowledge, and gotchas for a prompt |
| `evalbuff review [prompt]` | Agent / CI / Human | Review a change e2e, give rich structured feedback. Optional prompt describes what was requested so the reviewer can verify intent. |
| `evalbuff run [task]` | CI / Human | Run eval tasks and output graded results |
| `evalbuff learn` | CI / Human | Self-improvement: iterate on evals, knowledge, and context quality |
| `evalbuff refresh` | CI (nightly) | Scan recent commits, update knowledge and eval subagents |
## Phase 1 — Context + Review (Immediate Value, Zero Setup)

The `context` and `review` commands are useful on day one with minimal configuration and could be a product in themselves.

### `evalbuff context`

Takes a prompt and returns everything a coding agent needs to work on it:

- **Relevant files** with summaries (leveraging an excellent file picker)
- **Background knowledge** of the systems involved
- **Lessons and gotchas** learned from past work

This is like a dynamic, project-specific skill that's better than any static AGENTS.md. Any coding agent can call this to get oriented before making changes.

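The output format is an open question; as one hedged sketch (every field name and value here is invented for illustration, not a spec), `evalbuff context` could emit something like:

```json
{
  "files": [
    { "path": "src/auth/session.ts", "summary": "Session creation and refresh logic" }
  ],
  "knowledge": [".agents/knowledge/auth.md"],
  "gotchas": ["Session tokens are rotated on every refresh; never cache them"]
}
```

A structured format like this would let an agent read the listed files and knowledge docs before touching code, rather than parsing free text.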
| 41 | +### `evalbuff review [prompt]` |
| 42 | + |
| 43 | +Given file diffs, uncommitted changes, or a branch: |
| 44 | + |
| 45 | +- Outputs rich, structured feedback on what went wrong and why |
| 46 | +- Feedback is designed to be easy to feed back into a coding agent for a fix |
| 47 | +- Can check against project conventions, known patterns, and past mistakes |
| 48 | + |
| 49 | +Both commands naturally build up the `.agents/knowledge/` directory, which makes everything better over time. |
| 50 | + |
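As a sketch of the "easy to feed back" idea — the types and shapes below are assumptions, not a committed schema — review output could be a list of structured findings that translates mechanically into a fix prompt:

```typescript
// Hypothetical shape of an `evalbuff review` finding — not a committed schema.
interface ReviewFinding {
  file: string;
  severity: "error" | "warning";
  issue: string;      // what went wrong
  why: string;        // why it matters (convention violated, past mistake, etc.)
  suggestion: string; // concrete fix direction
}

// Turn error-level findings into a prompt a coding agent can act on directly.
function toFixPrompt(findings: ReviewFinding[]): string {
  return findings
    .filter((f) => f.severity === "error")
    .map((f) => `In ${f.file}: ${f.issue}. ${f.why} Suggested fix: ${f.suggestion}`)
    .join("\n");
}
```

The point of the structure is the round trip: the agent runs `review`, pipes the findings back into itself as a new prompt, and fixes before committing.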
### Skill Installation — Teaching the Coding Agent About Evalbuff

For `context` and `review` to be useful to coding agents, the agent needs to *know* they exist and how to call them. Evalbuff solves this by installing a skill into the user's project.

`evalbuff init` (or a dedicated `evalbuff install-skill`) writes a `SKILL.md` file into both:

- `.agents/skills/evalbuff/SKILL.md` — for Codebuff and SDK-based agents
- `.claude/skills/evalbuff/SKILL.md` — for Claude Code compatibility

The skill teaches the coding agent:

- **When to call `evalbuff context <prompt>`** — at the start of a task, to get relevant files, background knowledge, and gotchas before making changes
- **When to call `evalbuff review`** — after making changes, to get structured feedback before committing
- **Expected output format** — so the agent knows how to parse and act on the results
- **How to feed review feedback back** — close the loop by using review output to fix issues

This is the critical glue that makes evalbuff work with *any* coding agent that supports skills (Codebuff, Claude Code, and anything built on the Codebuff SDK). The skill acts as a lightweight integration layer — no plugin system, no API integration, just a markdown file that the agent reads.
Example skill content (draft):

```markdown
---
name: evalbuff
description: Use evalbuff to get project context before coding and review changes before committing
---

# Evalbuff

This project uses evalbuff for context gathering and change review.

## Before starting a task

Run `evalbuff context "<description of what you're about to do>"` to get:
- Relevant files you should read
- Background knowledge about the systems involved
- Known gotchas and lessons from past work

## After making changes

Run `evalbuff review "<what the user asked>"` to get structured feedback on your uncommitted changes. The prompt helps the reviewer verify the changes match the original intent.
If the review surfaces issues, fix them before considering the task complete.
```
## Phase 2 — E2E Eval Creation + Running

### The Incremental Approach

E2E setups are bespoke. Some projects need a full production-like environment (multiple backend servers, databases, third-party services). Setting up everything at once is wasteful and overwhelming.

**Instead, evalbuff builds e2e infrastructure incrementally:**

1. The user describes ONE concrete e2e flow to check (e.g. "user signs up and creates a project")
2. An agent (defined via the Codebuff SDK) analyzes the codebase and figures out what's needed to test that one flow
3. It outputs a plan — walking the developer through manual steps and automating what it can
4. It creates the task definition in `.agents/evals/tasks/signup-flow/PROMPT.md`
5. When the user adds another flow, the agent diffs what's already set up and only adds what's missing

This way we never set up unnecessary infrastructure. Each new flow is additive.

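A task definition could be as small as a prompt plus success criteria. A hypothetical `PROMPT.md` for the signup flow above (contents invented for illustration):

```markdown
# Task: signup-flow

A new user signs up and creates a project.

## Steps
1. Start the app against the local test database
2. Sign up with a fresh email address
3. Create a project named "demo"

## Success criteria
- Signup completes without errors
- The new project appears on the dashboard
- A welcome email is queued
```

Keeping the definition in plain markdown means both the eval runner and an LLM judge can read the success criteria directly.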
### `evalbuff run`

- Define core flows for the app that should be tested
- Grade output quality with LLM judges
- Run in CI or locally
- Optimize over time for speed and cost

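In CI this might look like a single workflow step. A sketch, assuming npm distribution and that `evalbuff run` exits non-zero on failed grades (both are assumptions, not decisions):

```yaml
# Hypothetical GitHub Actions workflow — command names and flags are not final.
name: evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g evalbuff   # assumed distribution channel
      - run: evalbuff run              # assumed: runs all tasks, fails the job on bad grades
```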
## Phase 3 — Self-Improvement Flywheel

### `evalbuff learn`

Runs a coding agent + evals, then iterates on its own evals and knowledge to make them:

- **More discerning** — better at catching real issues
- **More efficient** — faster and cheaper to run
- **More useful as context** — improves `evalbuff context` by saving more knowledge and configuring subagents

The key insight: improving evals and knowledge matters more than updating skills/AGENTS.md. `evalbuff context` is a dynamic skill that's better than a fixed one, and `evalbuff review` handles the rest.
### `evalbuff refresh`

Intended to run nightly from CI (e.g. GitHub Actions):

- Looks through commits since the last refresh point
- Updates eval subagent knowledge
- Updates skills and known patterns
- Keeps evals fresh as the codebase evolves

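The nightly trigger could be a plain cron schedule (again hypothetical — the workflow name and install step are assumptions):

```yaml
# Hypothetical nightly refresh workflow.
name: evalbuff-refresh
on:
  schedule:
    - cron: "0 3 * * *"   # every night at 03:00 UTC
jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0            # full history, so refresh can scan recent commits
      - run: npm install -g evalbuff
      - run: evalbuff refresh
```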
## Directory Structure

### Evalbuff Package Structure

```
evalbuff/
├── cli/          # TUI + commands (inspired by codebuff/cli)
├── core/         # Shared logic: context gathering, review, eval running
├── agents/       # Built-in agent definitions (uses codebuff SDK)
├── skills/       # Skill templates to install into user projects
│   └── evalbuff/
│       └── SKILL.md   # The skill that teaches agents how to use evalbuff
├── BRAINSTORM.md
└── README.md
```
### What Evalbuff Manages in the User's Project

```
.agents/
├── skills/
│   └── evalbuff/
│       └── SKILL.md       # Installed by `evalbuff init` — teaches agents to use evalbuff
├── evals/
│   ├── evalbuff.json      # Config (LLM provider, settings)
│   ├── tasks/             # E2E flow definitions
│   │   └── <task-short-name>/
│   │       ├── PROMPT.md  # What to check + success criteria (or SPEC.md)
│   │       └── traces/    # Historical run traces
│   └── review-tasks/      # Review-specific eval tasks
├── agent-definitions/     # Custom subagents
└── knowledge/
    └── *.md               # Project knowledge, lessons, gotchas

.claude/
└── skills/
    └── evalbuff/
        └── SKILL.md       # Same skill, for Claude Code compatibility
```
## Key Ideas

### Evals Are Never Done

> "Everything could be an eval and then the rest of the system optimizes for it." — Alex

> "Even human vibes can be encoded."

There are always ways to improve evals. The `learn` command creates a flywheel that manual tests never have.
### Decoupled from the Coding Agent

Evalbuff runs separately from the coding agent. This:

- Gets around the subsidized coding agent pricing problem
- Works with ANY coding agent, not just Codebuff
- Makes `evalbuff context` a viral hook — it makes every coding agent better
### The Context Command as a Trojan Horse

`evalbuff context` is the easiest entry point. No eval setup required. Just install it and immediately get better results from whatever coding tool you already use. Once teams see the value, they naturally want `review`, then `run`, then the full flywheel.
## Open Questions

- How should LLM provider configuration work? API keys from the user vs. evalbuff-hosted?
- Should `evalbuff run` spin up infrastructure itself, or just validate that the user has set it up?
- What's the pricing model? Per eval run? Subscription? Free tier for `context` + `review`?
- How much of the Codebuff SDK can we reuse vs. what needs to be evalbuff-specific?
- Should traces be stored locally, in the cloud, or both?
- How do we handle projects with existing test infrastructure (Playwright, Cypress, etc.) — integrate or replace?