Benchmark: A/B test with/without shards on real open-source issues #113

@jonathanpopham

Description

Goal

Measure whether agents make better decisions with graph shards than without. Timeline: 2 days.

First result

undb #2236 — Cross-Space IDOR in Webhook and Invitation Delete/Update

  • Repo: undb-io/undb (1,391 TS files, 15,039 nodes, 25,134 relationships)
  • Task: Add space_id authorization checks to webhook and invitation delete/update endpoints

Both agents produced the same correct fix: 5 methods across 2 files, adding a `space_id` WHERE clause.

|            | Baseline | With shards | Delta |
|------------|----------|-------------|-------|
| Duration   | 160s | 145s | 10% faster |
| Messages   | 53 | 37 | 30% fewer |
| Solution   | correct | correct | same |
| Token cost | unavailable (subscription billing) | unavailable | n/a |
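As a sanity check, the delta column can be recomputed from the raw numbers (the duration delta works out to about 9.4%, which the table rounds to 10%):

```bash
# Recompute the deltas from the measured values above.
awk 'BEGIN {
  printf "duration: %.1f%% faster\n", (160 - 145) / 160 * 100
  printf "messages: %.1f%% fewer\n",  (53 - 37) / 53 * 100
}'
```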

Install

```bash
# macOS/Linux
curl -fsSL https://supermodeltools.com/install.sh | sh

# Or via Go
go install github.com/supermodeltools/cli@v0.4.2

# Verify
supermodel version
```

Setup

```bash
cd /path/to/your/repo

# Login (gets API key)
supermodel login

# First run: generates graph + shards
supermodel analyze

# Start the watcher (keeps shards fresh as you code)
supermodel
```

Install the Claude Code hook (auto-triggers incremental on every edit). Add to `.claude/settings.json`:
```json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{"type": "command", "command": "supermodel hook"}]
    }]
  }
}
```
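Hand-editing `settings.json` is easy to get wrong, so a quick parse check is worth it. A minimal sketch using Python's stdlib JSON tool (assumes `python3` is on PATH; the demo writes the config to a temp dir rather than a real repo):

```bash
# Write the hook config and confirm it parses as valid JSON.
mkdir -p /tmp/hook-demo/.claude && cd /tmp/hook-demo
cat > .claude/settings.json <<'EOF'
{"hooks": {"PostToolUse": [{"matcher": "Write|Edit",
  "hooks": [{"type": "command", "command": "supermodel hook"}]}]}}
EOF
python3 -m json.tool .claude/settings.json > /dev/null && echo "settings OK"
```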

The agent discovers the `.graph.*` shards through its normal grep/rg workflow.
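A minimal illustration of that discovery path (the shard filename and its contents below are invented for the demo, not supermodel's documented output format):

```bash
# Simulate a repo with one source file and one graph shard next to it.
mkdir -p /tmp/shard-demo && cd /tmp/shard-demo
printf 'deleteWebhook -> assertSpaceId\n' > .graph.calls        # hypothetical shard
printf 'export function deleteWebhook() {}\n' > api.ts

# A plain recursive grep surfaces the shard alongside the source file,
# so no special tooling is needed on the agent side.
grep -rl "deleteWebhook" . | LC_ALL=C sort
```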

Benchmark methodology

Pick 5-10 real issues from open-source repos: issues where the fix requires understanding cross-file relationships (who calls what, what depends on what, what the blast radius is), not typos or single-file bugs.

For each issue, run two sessions:

  1. Without shards: Fresh clone, no `.graph.*` files, agent works with grep only
  2. With shards: Same clone, run `supermodel analyze` first, agent has graph data in the filesystem
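The two sessions can be scripted. In the sketch below, `run-agent` is a hypothetical placeholder for whatever drives the agent (it is not a supermodel command), and `touch` stands in for `supermodel analyze` output so the sketch runs anywhere:

```bash
#!/bin/sh
# Hypothetical A/B driver: only the presence of .graph.* differs between runs.
set -eu
REPO=/tmp/ab-clone                   # stand-in for a fresh clone
mkdir -p "$REPO" && cd "$REPO"

# Session 1: baseline. Guarantee a shard-free tree.
rm -f .graph.*
# run-agent --task issue.md --log baseline.jsonl   # placeholder command

# Session 2: with shards. `touch` stands in for `supermodel analyze` here.
touch .graph.nodes .graph.edges
# run-agent --task issue.md --log shards.jsonl     # placeholder command

ls .graph.*
```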

Measure:

  • Tool calls: How many grep/read operations did the agent use to navigate?
  • Files touched: Did the agent find the right files?
  • Fix quality: Was the fix scoped correctly? Did it account for callers/dependents?
  • Time to first correct edit: How long before the agent started writing the right code?
  • Tokens consumed: Total input/output tokens
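If the agent harness emits a per-tool-call log, the first and last metrics fall out of it directly. The JSONL field names below are assumptions about such a log, not a real format:

```bash
# Fake session log; real field names depend on your agent harness.
cat > /tmp/session.jsonl <<'EOF'
{"tool":"grep","tokens":120}
{"tool":"read","tokens":300}
{"tool":"edit","tokens":450}
EOF

# Navigation tool calls (grep/read) and total tokens consumed.
awk '/"tool":"(grep|read)"/ {n++} END {print "nav calls:", n}' /tmp/session.jsonl
awk -F'"tokens":' '{sub(/}.*/, "", $2); t += $2} END {print "tokens:", t}' /tmp/session.jsonl
```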

Control matters

Same model, same prompt, same issue, same repo state. The only variable is whether `.graph.*` files exist in the repo. Use the `--no-shards` flag or `supermodel clean` to remove shards for the baseline run.

Candidate repos and issues

Need issues that are:

  • Multi-file (fix spans 2+ files)
  • Require understanding call chains or dependency relationships
  • Have a clear "correct" fix that can be evaluated
  • On repos with 500+ files (small repos don't benefit much from graph context)

Previous A/B tests showed the strongest signal on:

  • Large repos (4000+ files): 2.5x faster, 42% cheaper, correct vs broken fix (Directus)
  • Security fixes: Cross-file authorization checks (undb)
  • Bug fixes requiring call chain understanding: Flow trigger cache key (Directus #26929)
