Commit 619cdd7 ("evalbuff brainstorm", parent 0d81b93): 3 files changed, +1105 −0

evalbuff/BRAINSTORM.md (207 lines added)
# Evalbuff — Brainstorm

> Generate evals for *your* codebase. Not generic benchmarks — codebase-specific e2e testing, review, and context for AI coding agents.

## What is Evalbuff?

A CLI tool that helps teams build, run, and improve end-to-end evaluations for their codebase. It's intended to be used by:

- **The coding agent** — to check its own changes in a review step
- **CI** — to run core flows and grade output quality
- **The human developer** — to define flows, dump knowledge, and tune evals

Evalbuff is **not a coding agent**. It evaluates, reviews, and provides context. This means it complements any coding agent (Codebuff, Claude Code, Cursor, Copilot, etc.) without competing with them.
## Commands

| Command | Audience | Description |
|---------|----------|-------------|
| `evalbuff` | Human | Fancy TUI for browsing/editing knowledge, evals, and results |
| `evalbuff init` | Human | Initialize evalbuff in a project |
| `evalbuff context <prompt>` | Agent / Human | Return relevant files, knowledge, and gotchas for a prompt |
| `evalbuff review [prompt]` | Agent / CI / Human | Review a change e2e, give rich structured feedback. Optional prompt describes what was requested so the reviewer can verify intent. |
| `evalbuff run [task]` | CI / Human | Run eval tasks and output graded results |
| `evalbuff learn` | CI / Human | Self-improvement: iterate on evals, knowledge, and context quality |
| `evalbuff refresh` | CI (nightly) | Scan recent commits, update knowledge and eval subagents |
## Phase 1 — Context + Review (Immediate Value, Zero Setup)

The `context` and `review` commands are useful on day one with minimal configuration and can be a product in themselves.

### `evalbuff context`

Takes a prompt, returns everything a coding agent needs to work on it:

- **Relevant files** with summaries (leveraging an excellent file picker)
- **Background knowledge** of the systems involved
- **Lessons and gotchas** learned from past work

This is like a dynamic, project-specific skill that's better than any static AGENTS.md. Any coding agent can call this to get oriented before making changes.
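The output shape is not specified yet. As a sketch, a structured `evalbuff context` result mirroring the three bullets above might look like this (all field names are hypothetical):

```typescript
// Hypothetical shape of `evalbuff context` output. Field names are
// illustrative, not a spec.
interface ContextResult {
  relevantFiles: { path: string; summary: string }[];
  knowledge: string[]; // background notes on the systems involved
  gotchas: string[];   // lessons learned from past work
}

// Example of what a run might return for a billing-related prompt:
const example: ContextResult = {
  relevantFiles: [
    { path: "src/billing/invoice.ts", summary: "Invoice creation and proration logic" },
  ],
  knowledge: ["Billing runs in UTC; all invoice dates are day-granular."],
  gotchas: ["Proration rounds half-up; tests assert exact cents."],
};

console.log(JSON.stringify(example, null, 2));
```

A structured shape like this (rather than free text) is what would let an agent read the result mechanically, per the "Expected output format" point in the skill section below.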
### `evalbuff review [prompt]`

Given file diffs, uncommitted changes, or a branch:

- Outputs rich, structured feedback on what went wrong and why
- Feedback is designed to be easy to feed back into a coding agent for a fix
- Can check against project conventions, known patterns, and past mistakes

Both commands naturally build up the `.agents/knowledge/` directory, which makes everything better over time.
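The review output format is also still open. One hedged sketch of feedback that is "easy to feed back into a coding agent": issues carry a severity and an optional suggestion, and a helper turns them into a fix prompt, blockers first (the type and function names here are invented for illustration):

```typescript
// Hypothetical structured review feedback: a sketch, not a committed format.
type Severity = "blocker" | "warning" | "nit";

interface ReviewIssue {
  file: string;
  severity: Severity;
  message: string;
  suggestion?: string; // how to fix, phrased so an agent can apply it
}

// Build a fix prompt an agent could feed back into itself, worst issues first.
function toFixPrompt(issues: ReviewIssue[]): string {
  const order: Severity[] = ["blocker", "warning", "nit"];
  const sorted = [...issues].sort(
    (a, b) => order.indexOf(a.severity) - order.indexOf(b.severity)
  );
  return sorted
    .map((i) =>
      `- [${i.severity}] ${i.file}: ${i.message}` +
      (i.suggestion ? ` (fix: ${i.suggestion})` : "")
    )
    .join("\n");
}

const prompt = toFixPrompt([
  { file: "api.ts", severity: "nit", message: "prefer const" },
  { file: "auth.ts", severity: "blocker", message: "token never expires", suggestion: "add TTL check" },
]);
console.log(prompt);
```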
### Skill Installation — Teaching the Coding Agent About Evalbuff

For `context` and `review` to be useful to coding agents, the agent needs to *know* they exist and how to call them. Evalbuff solves this by installing a skill into the user's project.

`evalbuff init` (or a dedicated `evalbuff install-skill`) writes a `SKILL.md` file into both:

- `.agents/skills/evalbuff/SKILL.md` — for Codebuff and SDK-based agents
- `.claude/skills/evalbuff/SKILL.md` — for Claude Code compatibility

The skill teaches the coding agent:

- **When to call `evalbuff context <prompt>`** — at the start of a task, to get relevant files, background knowledge, and gotchas before making changes
- **When to call `evalbuff review`** — after making changes, to get structured feedback before committing
- **Expected output format** — so the agent knows how to parse and act on the results
- **How to feed review feedback back** — close the loop by using review output to fix issues

This is the critical glue that makes evalbuff work with *any* coding agent that supports skills (Codebuff, Claude Code, and anything built on the Codebuff SDK). The skill acts as a lightweight integration layer — no plugin system, no API integration, just a markdown file that the agent reads.

Example skill content (draft):
```markdown
---
name: evalbuff
description: Use evalbuff to get project context before coding and review changes before committing
---

# Evalbuff

This project uses evalbuff for context gathering and change review.

## Before starting a task

Run `evalbuff context "<description of what you're about to do>"` to get:
- Relevant files you should read
- Background knowledge about the systems involved
- Known gotchas and lessons from past work

## After making changes

Run `evalbuff review "<what the user asked>"` to get structured feedback on your uncommitted changes. The prompt helps the reviewer verify the changes match the original intent.
If the review surfaces issues, fix them before considering the task complete.
```
## Phase 2 — E2E Eval Creation + Running

### The Incremental Approach

E2E setups are bespoke. Some projects need a full production-like environment (multiple backend servers, databases, third-party services). Setting up everything at once is wasteful and overwhelming.

**Instead, evalbuff builds e2e infrastructure incrementally:**

1. User describes ONE concrete e2e flow to check (e.g. "user signs up and creates a project")
2. An agent (defined via the codebuff SDK) analyzes the codebase and figures out what's needed to test that one flow
3. Outputs a plan — walks the developer through manual steps, automates what it can
4. Creates the task definition in `.agents/evals/tasks/signup-flow/PROMPT.md`
5. When the user adds another flow, the agent diffs what's already set up and only adds what's missing

This way we never set up unnecessary infrastructure. Each new flow is additive.
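For the sign-up example above, the generated `.agents/evals/tasks/signup-flow/PROMPT.md` might look something like this (the contents and section names are illustrative; the actual format is still an open design question):

```markdown
# Task: signup-flow

User signs up and creates a project.

## Setup
- Requires: backend API and a database (manual steps documented by the setup agent)

## Steps
1. Sign up with a fresh email
2. Create a project from the new account

## Success criteria
- Signup succeeds and returns a session
- The new project appears in the user's project list
```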
### `evalbuff run`

- Define core flows for the app that should be tested
- Grade output quality with LLM judges
- Run in CI or locally
- Optimize over time for speed and cost
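LLM-judge grading still has to reduce to a CI pass/fail somehow. A minimal sketch of that reduction (the 0 to 1 score scale and per-task thresholds are assumptions, not a spec):

```typescript
// Hypothetical graded result for one eval task. Judges score in [0, 1]
// here; the real scale and thresholds are open questions.
interface GradedRun {
  task: string;
  score: number;   // LLM-judge score
  passing: number; // minimum score for CI to pass this task
}

// CI gate: nonzero exit if any task scores below its threshold.
function ciExitCode(runs: GradedRun[]): number {
  const failures = runs.filter((r) => r.score < r.passing);
  for (const f of failures) {
    console.error(`FAIL ${f.task}: ${f.score} < ${f.passing}`);
  }
  return failures.length === 0 ? 0 : 1;
}

const code = ciExitCode([
  { task: "signup-flow", score: 0.92, passing: 0.8 },
  { task: "checkout", score: 0.65, passing: 0.8 },
]);
```

Per-task thresholds (rather than one global cutoff) would let the `learn` phase tighten individual evals as they become more discerning.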
## Phase 3 — Self-Improvement Flywheel

### `evalbuff learn`

Runs a coding agent + evals, then iterates on its own evals and knowledge to make them:

- **More discerning** — better at catching real issues
- **More efficient** — faster, cheaper to run
- **Richer in context** — improves `evalbuff context` by saving more knowledge and configuring subagents

The key insight: improving evals and knowledge is more important than updating skills/AGENTS.md. `evalbuff context` is a dynamic skill that's better than a fixed one, and `evalbuff review` handles the rest.
### `evalbuff refresh`

Intended to run nightly from CI (e.g. GitHub Actions):

- Looks through commits since the last refresh point
- Updates eval subagent knowledge
- Updates skills and known patterns
- Keeps evals fresh as the codebase evolves
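Wired into GitHub Actions, the nightly run might look like this (the `evalbuff refresh` invocation is hypothetical since the CLI doesn't exist yet; the schedule and checkout steps are standard Actions syntax):

```yaml
# Sketch of a nightly refresh workflow. Only the cron/checkout pieces are
# real Actions syntax; the evalbuff step is assumed.
name: evalbuff-refresh
on:
  schedule:
    - cron: "0 3 * * *"   # nightly at 03:00 UTC
jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # refresh needs history back to the last refresh point
      - run: npx evalbuff refresh
```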
## Directory Structure

### Evalbuff Package Structure

```
evalbuff/
├── cli/                # TUI + commands (inspired by codebuff/cli)
├── core/               # Shared logic: context gathering, review, eval running
├── agents/             # Built-in agent definitions (uses codebuff SDK)
├── skills/             # Skill templates to install into user projects
│   └── evalbuff/
│       └── SKILL.md    # The skill that teaches agents how to use evalbuff
├── BRAINSTORM.md
└── README.md
```
### What Evalbuff Manages in the User's Project

```
.agents/
├── skills/
│   └── evalbuff/
│       └── SKILL.md          # Installed by `evalbuff init` — teaches agents to use evalbuff
├── evals/
│   ├── evalbuff.json         # Config (LLM provider, settings)
│   ├── tasks/                # E2E flow definitions
│   │   └── <task-short-name>/
│   │       ├── PROMPT.md     # What to check + success criteria (or SPEC.md)
│   │       └── traces/       # Historical run traces
│   └── review-tasks/         # Review-specific eval tasks
├── agent-definitions/        # Custom subagents
└── knowledge/
    └── *.md                  # Project knowledge, lessons, gotchas

.claude/
└── skills/
    └── evalbuff/
        └── SKILL.md          # Same skill, for Claude Code compatibility
```
## Key Ideas

### Evals Are Never Done

> "Everything could be an eval and then the rest of the system optimizes for it." — Alex

> "Even human vibes can be encoded."

There are always ways to improve evals. The `learn` command creates a flywheel that manual tests never have.

### Decoupled from the Coding Agent

Evalbuff runs separately from the coding agent. This:

- Gets around the subsidized coding-agent pricing problem
- Works with ANY coding agent, not just Codebuff
- Makes `evalbuff context` a viral hook — it makes every coding agent better

### The Context Command as a Trojan Horse

`evalbuff context` is the easiest entry point. No eval setup required. Just install it and immediately get better results from whatever coding tool you already use. Once teams see the value, they naturally want `review`, then `run`, then the full flywheel.
## Open Questions

- How should LLM provider configuration work? API keys from the user vs. evalbuff-hosted?
- Should `evalbuff run` spin up infrastructure itself, or just validate that the user has set it up?
- What's the pricing model? Per-eval-run? Subscription? Free tier for `context` + `review`?
- How much of the codebuff SDK can we reuse vs. what needs to be evalbuff-specific?
- Should traces be stored locally, in the cloud, or both?
- How do we handle projects with existing test infrastructure (Playwright, Cypress, etc.) — integrate or replace?

0 commit comments

Comments
 (0)