Benchmark: A/B test with/without shards on real open-source issues #113

@jonathanpopham

Description

Goal

Measure whether agents make better decisions with graph shards than without. Timeline: 2 days.

First result

undb #2236 — Cross-Space IDOR in Webhook and Invitation Delete/Update

  • Repo: undb-io/undb (1,391 TS files, 15,039 nodes, 25,134 relationships)
  • Task: Add space_id authorization checks to webhook and invitation delete/update endpoints

Both agents produced the same correct fix: 5 methods across 2 files, adding a `space_id` WHERE clause.

|            | Baseline | With shards | Delta |
|------------|----------|-------------|-------|
| Duration   | 160s | 145s | 10% faster |
| Messages   | 53 | 37 | 30% fewer |
| Solution   | correct | correct | same |
| Token cost | unavailable (subscription billing) | unavailable | n/a |
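As a sanity check, the delta column can be recomputed from the raw numbers (the duration delta works out to about 9.4%, which the table rounds to 10%):

```bash
# Recompute the deltas from the measured values above.
awk 'BEGIN {
  printf "duration: %.1f%% faster\n", (160 - 145) / 160 * 100
  printf "messages: %.1f%% fewer\n",  (53 - 37) / 53 * 100
}'
```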

Install

```bash
# macOS/Linux
curl -fsSL https://supermodeltools.com/install.sh | sh

# Or via Go
go install github.com/supermodeltools/cli@v0.4.2

# Verify
supermodel version
```

Setup

```bash
cd /path/to/your/repo

# Login (gets API key)
supermodel login

# First run: generates graph + shards
supermodel analyze

# Start the watcher (keeps shards fresh as you code)
supermodel
```

Install the Claude Code hook (auto-triggers incremental on every edit). Add to `.claude/settings.json`:
```json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{"type": "command", "command": "supermodel hook"}]
    }]
  }
}
```
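Hand-editing `settings.json` is easy to get wrong, so a quick parse check is worth it. A minimal sketch using Python's stdlib JSON tool (assumes `python3` is on PATH; the demo writes the config to a temp dir rather than a real repo):

```bash
# Write the hook config and confirm it parses as valid JSON.
mkdir -p /tmp/hook-demo/.claude && cd /tmp/hook-demo
cat > .claude/settings.json <<'EOF'
{"hooks": {"PostToolUse": [{"matcher": "Write|Edit",
  "hooks": [{"type": "command", "command": "supermodel hook"}]}]}}
EOF
python3 -m json.tool .claude/settings.json > /dev/null && echo "settings OK"
```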

The agent discovers the `.graph.*` shards through its normal grep/rg workflow.
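A minimal illustration of that discovery path (the shard filename and its contents below are invented for the demo, not supermodel's documented output format):

```bash
# Simulate a repo with one source file and one graph shard next to it.
mkdir -p /tmp/shard-demo && cd /tmp/shard-demo
printf 'deleteWebhook -> assertSpaceId\n' > .graph.calls        # hypothetical shard
printf 'export function deleteWebhook() {}\n' > api.ts

# A plain recursive grep surfaces the shard alongside the source file,
# so no special tooling is needed on the agent side.
grep -rl "deleteWebhook" . | LC_ALL=C sort
```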

Benchmark methodology

Pick 5-10 real issues from open-source repos: issues where the fix requires understanding cross-file relationships (who calls what, what depends on what, what the blast radius is), not typos or single-file bugs.

For each issue, run two sessions:

  1. Without shards: Fresh clone, no `.graph.*` files, agent works with grep only
  2. With shards: Same clone, run `supermodel analyze` first, agent has graph data in the filesystem
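The two sessions can be scripted. In the sketch below, `run-agent` is a hypothetical placeholder for whatever drives the agent (it is not a supermodel command), and `touch` stands in for `supermodel analyze` output so the sketch runs anywhere:

```bash
#!/bin/sh
# Hypothetical A/B driver: only the presence of .graph.* differs between runs.
set -eu
REPO=/tmp/ab-clone                   # stand-in for a fresh clone
mkdir -p "$REPO" && cd "$REPO"

# Session 1: baseline. Guarantee a shard-free tree.
rm -f .graph.*
# run-agent --task issue.md --log baseline.jsonl   # placeholder command

# Session 2: with shards. `touch` stands in for `supermodel analyze` here.
touch .graph.nodes .graph.edges
# run-agent --task issue.md --log shards.jsonl     # placeholder command

ls .graph.*
```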

Measure:

  • Tool calls: How many grep/read operations did the agent use to navigate?
  • Files touched: Did the agent find the right files?
  • Fix quality: Was the fix scoped correctly? Did it account for callers/dependents?
  • Time to first correct edit: How long before the agent started writing the right code?
  • Tokens consumed: Total input/output tokens
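If the agent harness emits a per-tool-call log, the first and last metrics fall out of it directly. The JSONL field names below are assumptions about such a log, not a real format:

```bash
# Fake session log; real field names depend on your agent harness.
cat > /tmp/session.jsonl <<'EOF'
{"tool":"grep","tokens":120}
{"tool":"read","tokens":300}
{"tool":"edit","tokens":450}
EOF

# Navigation tool calls (grep/read) and total tokens consumed.
awk '/"tool":"(grep|read)"/ {n++} END {print "nav calls:", n}' /tmp/session.jsonl
awk -F'"tokens":' '{sub(/}.*/, "", $2); t += $2} END {print "tokens:", t}' /tmp/session.jsonl
```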

Control matters

Same model, same prompt, same issue, same repo state. The only variable is whether `.graph.*` files exist in the repo. Use the `--no-shards` flag or `supermodel clean` to remove shards for the baseline run.

Candidate repos and issues

Need issues that are:

  • Multi-file (fix spans 2+ files)
  • Require understanding call chains or dependency relationships
  • Have a clear "correct" fix that can be evaluated
  • On repos with 500+ files (small repos don't benefit much from graph context)

Previous A/B tests showed the strongest signal on:

  • Large repos (4000+ files): 2.5x faster, 42% cheaper, correct vs broken fix (Directus)
  • Security fixes: Cross-file authorization checks (undb)
  • Bug fixes requiring call chain understanding: Flow trigger cache key (Directus #26929)
