Goal
Measure whether agents make better decisions with graph shards than without. Timeline: 2 days.
First result
undb #2236 — Cross-Space IDOR in Webhook and Invitation Delete/Update
- Repo: undb-io/undb (1,391 TS files, 15,039 nodes, 25,134 relationships)
- Task: Add space_id authorization checks to webhook and invitation delete/update endpoints
Both agents produced the same correct fix: 5 methods across 2 files, adding a `space_id` WHERE clause.
| Metric | Baseline | With shards | Delta |
| --- | --- | --- | --- |
| Duration | 160s | 145s | 10% faster |
| Messages | 53 | 37 | 30% fewer |
| Solution | correct | correct | same |
| Token cost | unavailable (subscription billing) | unavailable | n/a |
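The Delta column is consistent with the arithmetic below, shown as a minimal Python sketch. It assumes "faster" means the ratio of baseline to shard duration and "fewer" means the relative drop in message count; the function names are illustrative and not part of any tool.

```python
def percent_faster(baseline_s: float, shards_s: float) -> float:
    """Speedup as a percentage, from the ratio of the two durations."""
    return (baseline_s / shards_s - 1.0) * 100.0


def percent_fewer(baseline_count: int, shards_count: int) -> float:
    """Relative reduction in a count, as a percentage of the baseline."""
    return (baseline_count - shards_count) / baseline_count * 100.0


if __name__ == "__main__":
    # Figures from the undb #2236 run.
    print(f"{percent_faster(160, 145):.0f}% faster")        # prints "10% faster"
    print(f"{percent_fewer(53, 37):.0f}% fewer messages")   # prints "30% fewer messages"
```

Measured as time saved relative to the baseline instead, the same run comes out at roughly 9%, so it is worth stating which definition a results table uses.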
Install
```bash
# macOS/Linux
curl -fsSL https://supermodeltools.com/install.sh | sh

# Or via Go
go install github.com/supermodeltools/cli@v0.4.2

# Verify
supermodel version
```
Setup
```bash
cd /path/to/your/repo
# Log in (gets API key)
supermodel login

# First run: generates graph + shards
supermodel analyze

# Start the watcher (keeps shards fresh as you code)
supermodel
```
Install the Claude Code hook, which triggers an incremental update on every edit. Add to `.claude/settings.json`:
```json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{"type": "command", "command": "supermodel hook"}]
    }]
  }
}
```
The agent discovers the `.graph.*` shards through its normal grep/rg workflow.
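If `.claude/settings.json` already exists, pasting the hook snippet over it would drop any other settings. The following is a minimal Python sketch of a merge, assuming the file is plain JSON shaped like the snippet above. `add_supermodel_hook` and `update_settings_file` are hypothetical helpers, not part of the supermodel CLI.

```python
import json
from pathlib import Path

# The hook entry from the settings snippet.
HOOK_ENTRY = {
    "matcher": "Write|Edit",
    "hooks": [{"type": "command", "command": "supermodel hook"}],
}


def add_supermodel_hook(settings: dict) -> dict:
    """Return a copy of settings with the PostToolUse hook present exactly once."""
    merged = json.loads(json.dumps(settings))  # deep copy via a JSON round trip
    post_tool_use = merged.setdefault("hooks", {}).setdefault("PostToolUse", [])
    if HOOK_ENTRY not in post_tool_use:
        post_tool_use.append(json.loads(json.dumps(HOOK_ENTRY)))
    return merged


def update_settings_file(path: Path) -> None:
    """Read existing settings (or start empty), merge the hook, write back."""
    current = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_supermodel_hook(current), indent=2) + "\n")
```

Calling `update_settings_file(Path(".claude/settings.json"))` from the repo root applies the merge. Running it twice leaves the file unchanged, because the entry is only appended when it is missing.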
Benchmark methodology
Pick 5-10 real issues from open-source repos where the fix requires understanding cross-file relationships: who calls what, what depends on what, and what the blast radius is. Skip typos and single-file bugs.
For each issue, run two sessions:
- Without shards: Fresh clone, no `.graph.*` files, agent works with grep only
- With shards: Same clone, run `supermodel analyze` first, agent has graph data in the filesystem
Measure:
- Tool calls: How many grep/read operations did the agent use to navigate?
- Files touched: Did the agent find the right files?
- Fix quality: Was the fix scoped correctly? Did it account for callers/dependents?
- Time to first correct edit: How long before the agent started writing the right code?
- Tokens consumed: Total input/output tokens
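One way to keep these measurements comparable across sessions is to record each run in a fixed structure. The sketch below is a Python illustration under assumptions: the field names are invented for this example, and "did the agent find the right files" is scored as precision and recall against the files changed by a reference fix, which is only one possible definition.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RunMetrics:
    """One agent session on one issue (field names are illustrative)."""
    issue: str
    with_shards: bool
    tool_calls: int
    files_touched: set[str] = field(default_factory=set)
    fix_correct: bool = False
    seconds_to_first_correct_edit: float | None = None
    tokens_in: int | None = None   # None when billing hides token counts
    tokens_out: int | None = None


def file_overlap(touched: set[str], reference: set[str]) -> tuple[float, float]:
    """Precision and recall of edited files against the reference fix's files."""
    hits = len(touched & reference)
    precision = hits / len(touched) if touched else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    run = RunMetrics(issue="example", with_shards=True, tool_calls=12,
                     files_touched={"a.ts", "b.ts", "c.ts"})
    precision, recall = file_overlap(run.files_touched, {"a.ts", "b.ts"})
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision with full recall suggests the agent found the right files but also edited unrelated ones, which bears on the "scoped correctly" question under fix quality.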
Control matters
Keep the model, prompt, issue, and repo state identical across both runs. The only variable is whether `.graph.*` files exist in the repo. For the baseline run, use the `--no-shards` flag or `supermodel clean` to remove shards.
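A quick pre-flight check can confirm that the baseline clone really has no shard files before the agent starts. This is a minimal Python sketch; it assumes shard file names contain `.graph.` (matched here with the glob `*.graph.*`), so adjust the pattern if the shards on disk are named differently. The helper names are hypothetical.

```python
import fnmatch
import os

SHARD_PATTERN = "*.graph.*"  # assumption: shard file names contain ".graph."


def find_shards(repo_root: str) -> list:
    """Return sorted paths under repo_root whose file name matches SHARD_PATTERN."""
    found = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Prune .git in place so os.walk does not descend into git internals.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if fnmatch.fnmatch(name, SHARD_PATTERN):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def assert_baseline_clean(repo_root: str) -> None:
    """Raise if a baseline clone still contains shard files."""
    leftovers = find_shards(repo_root)
    if leftovers:
        raise RuntimeError(
            f"{len(leftovers)} shard file(s) still present, e.g. {leftovers[0]}"
        )
```

Running `assert_baseline_clean` on the baseline clone and checking that `find_shards` is non-empty on the shard clone catches the two most likely setup mistakes before any agent time is spent.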
Candidate repos and issues
Candidate issues should:
- Span multiple files (the fix touches 2+ files)
- Require understanding call chains or dependency relationships
- Have a clear "correct" fix that can be evaluated
- Come from repos with 500+ files (small repos don't benefit much from graph context)
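These criteria can be applied as a simple filter when triaging a list of candidate issues. The sketch below is a Python illustration; the `Candidate` fields are invented for this example, and the two boolean fields still depend on a human judgment made while reading the issue.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate issue (field names are illustrative)."""
    repo: str
    issue: int
    repo_file_count: int
    fix_file_count: int
    needs_call_chain: bool   # judged by hand from the issue and fix
    has_reference_fix: bool  # a clear "correct" fix exists to evaluate against


def qualifies(c: Candidate, min_repo_files: int = 500, min_fix_files: int = 2) -> bool:
    """True when the candidate meets all four selection criteria."""
    return (
        c.repo_file_count >= min_repo_files
        and c.fix_file_count >= min_fix_files
        and c.needs_call_chain
        and c.has_reference_fix
    )


if __name__ == "__main__":
    # Counts taken from the first result: 1,391 TS files, fix across 2 files.
    undb = Candidate("undb-io/undb", 2236, 1391, 2, True, True)
    print(qualifies(undb))  # prints True
```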
Previous A/B tests showed the strongest signal on:
- Large repos (4000+ files): 2.5x faster, 42% cheaper, correct vs broken fix (Directus)
- Security fixes: Cross-file authorization checks (undb)
- Bug fixes requiring call chain understanding: Flow trigger cache key (Directus #26929)