11 changes: 10 additions & 1 deletion README.md
@@ -1,4 +1,13 @@
-# codebase-memory-mcp
+# codebase-memory-mcp-pro

> **🔱 Fork notice** — `codebase-memory-mcp-pro` is a community fork of [**DeusData/codebase-memory-mcp**](https://github.com/DeusData/codebase-memory-mcp) (MIT License, © 2025 DeusData), maintained by [@win4r](https://github.com/win4r). It tracks upstream and integrates the following fixes ahead of their upstream merge:
>
> - **Incremental-reindex correctness** ([#528](https://github.com/DeusData/codebase-memory-mcp/pull/528)) — preserve inbound cross-file `CALLS` edges on incremental re-index; editing a file no longer orphans calls into its symbols.
> - **Cypher / `query_graph`** — populate node properties carried through `WITH` aggregation ([#465](https://github.com/DeusData/codebase-memory-mcp/pull/465)); fix label-filtered traversal silently truncating at 10 rows ([#412](https://github.com/DeusData/codebase-memory-mcp/pull/412)).
> - **MCP tools** — `detect_changes` honors `since` ([#464](https://github.com/DeusData/codebase-memory-mcp/pull/464)); definition-preferred name resolution with ambiguity reporting ([#466](https://github.com/DeusData/codebase-memory-mcp/pull/466)); valid UTF-8 in `get_code_snippet` ([#526](https://github.com/DeusData/codebase-memory-mcp/pull/526)).
> - **Robustness / build** — stack-buffer-overflow fix in `append_args_json` ([#475](https://github.com/DeusData/codebase-memory-mcp/pull/475)); JSON control-character escaping ([#527](https://github.com/DeusData/codebase-memory-mcp/pull/527)); preserve ADRs across a full re-index ([#539](https://github.com/DeusData/codebase-memory-mcp/pull/539)); libgit2 ≥ 1.8 build fix ([#512](https://github.com/DeusData/codebase-memory-mcp/pull/512)).
>
> All credit for the original engine belongs to DeusData. License unchanged — see [LICENSE](LICENSE). The upstream README follows verbatim.

[![GitHub Release](https://img.shields.io/github/v/release/DeusData/codebase-memory-mcp?style=flat&color=blue)](https://github.com/DeusData/codebase-memory-mcp/releases/latest)
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)
46 changes: 46 additions & 0 deletions bench/BASELINE.md
@@ -0,0 +1,46 @@
# Head-to-head baseline — cbm-pro (24e6784c) vs codegraph (0.9.9)

Repo: LingoLearn-iOS-main (29 Swift files). Harness: `bench/headtohead.sh`. Date: 2026-06-21.
Following the "confirm the failure before fixing it" rule, this is the *before* state. Re-run the harness after each workstream (WS) to prove movement.

## Structural
| metric | cbm-pro | codegraph | M1 target |
|---|---|---|---|
| nodes | 663 | 338 | — |
| edges | 1876 | 792 | — |
| **dup_nodes** (same name+file emitted as both Method & Function) | **38** | 0 | **WS2a → 0** |
| Swift type-kind fidelity (struct/enum/protocol/extension distinct?) | **1** (all → `Class`) | 5 | WS2b (M2) → ≥5 |
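
A minimal sketch of how the `dup_nodes` metric above can be computed: group graph rows by (name, file) and count the pairs that carry both a `Method` and a `Function` label. It mirrors the Python embedded in `bench/headtohead.sh`; the helper name `count_dup_nodes` and the sample rows are illustrative assumptions, not harness output.

```python
from collections import defaultdict


def count_dup_nodes(rows):
    """Count (name, file) pairs labelled as both Method and Function.

    `rows` is an iterable of (name, label, file_path) tuples, the same
    shape the harness query returns. Rows without a name are ignored.
    """
    labels_by_symbol = defaultdict(set)
    for name, label, file_path in rows:
        if name:
            labels_by_symbol[(name, file_path)].add(label)
    return sum(
        1
        for labels in labels_by_symbol.values()
        if {"Method", "Function"} <= labels
    )


# Hypothetical rows, for illustration only.
sample_rows = [
    ("alpha", "Method", "A.swift"),
    ("alpha", "Function", "A.swift"),  # same symbol emitted twice: a dup
    ("beta", "Method", "A.swift"),
    ("beta", "Function", "B.swift"),   # different file: not a dup
    (None, "Function", "A.swift"),     # unnamed node: ignored
]
print(count_dup_nodes(sample_rows))
```

Keying on (name, file) rather than on the qualified name matters here, because the bug emits the two copies under different qualified names.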

## Call-graph parity (callers; grep is a noisy upper bound)
| symbol | cbm | codegraph |
|---|---|---|
| makeInMemoryContext | 16 | 16 |
| makeWord | 12 | 12 |
| Date / Color / tap | diverge (stdlib-constructor counting) | — |
→ Call-graph results are roughly at parity, so M1 does not target this metric.
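
To illustrate why grep is only a noisy upper bound, here is a small sketch that counts textual `symbol(` occurrences the way a grep-style baseline would. The regex is a close variant of the one in `bench/headtohead.sh` (it uses a lookbehind instead of consuming the preceding character); the helper name and the Swift-like sample text are hypothetical.

```python
import re


def grep_call_sites(symbol, source):
    """Count occurrences of `symbol(` not preceded by an identifier character."""
    pattern = re.compile(r"(?<![A-Za-z0-9_])" + re.escape(symbol) + r"\s*\(")
    return len(pattern.findall(source))


# Hypothetical source text, for illustration only.
sample_source = """
func makeWord() -> Word { Word() }   // declaration, not a call
let a = makeWord()
let b = makeWord ()
// makeWord() appears in a comment
let c = remakeWord()
"""
print(grep_call_sites("makeWord", sample_source))
```

Only two of the matches in the sample are real calls; the declaration and the comment are counted as well, which is why a graph-derived caller count at or below the grep figure is the expected outcome.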

## Ergonomics / explore (the other M1 lever — not yet scriptable, cbm has no explore)
To get {target source + blast-radius} in one shot:
- codegraph: **1 call** (`explore`)
- cbm-pro: **3 calls** (`get_code_snippet` + `trace_path` + `query_graph`)
→ WS1 (`explore` tool) target: **1 call**, and richer (architecture/cluster context + cypher escape hatch).

## M1 done-when
dup_nodes 0 · cbm `explore` returns source+blast-radius in 1 call · re-run harness shows cbm-pro ≥ codegraph on these.

---

## M1 results (2026-06-21) — after WS2a + WS1

| metric | baseline cbm | **after M1** | codegraph | status |
|---|---|---|---|---|
| dup_nodes | 38 | **0** | 0 | ✅ tied (WS2a) |
| `explore` tool (1-call source+blast-radius) | ✗ (3 calls) | **✅ 1 call** | ✅ | ✅ matched (WS1) |
| explore caller attribution | — | **precise + ⚠hotspot fan-in** | imprecise, no hotspot | ✅ exceeds |
| explore cypher escape-hatch | — | ✅ | ✗ | ✅ exceeds |
| explore auto-expand to neighbors | — | ✗ (focused) | ✅ | codegraph edge |

Head-to-head on `grade`: cbm matches codegraph's one-call source+blast-radius, beats it on precision/hotspots/cypher, trails on neighbor auto-expansion.
Agent-use composite (subjective, fairness-checked): cbm-pro moves from ~75 to **~85** against codegraph's 79. WS1 and WS2a are enough to surpass codegraph because cbm keeps its lead on query (9) and architecture (9) once `explore` reaches parity.

Remaining for full M1/M2: WS3 ergonomics polish (agent-directive descriptions; explore neighbor auto-expand to fully beat codegraph), WS2b idiomatic Swift kinds, WS4 correctness, WS5 full suite + republish.
65 changes: 65 additions & 0 deletions bench/headtohead.sh
@@ -0,0 +1,65 @@
#!/usr/bin/env bash
# headtohead.sh — deterministic head-to-head: codebase-memory-mcp (cbm) vs codegraph.
# Re-run after each workstream to MEASURE movement (no self-grading).
#
# Usage: bench/headtohead.sh <repo_path> <nickname> [cbm_binary]
# Metrics (deterministic):
# - nodes / edges
# - dup-node count: (name, file) pairs emitted as BOTH a Method and a Function (cbm modeling bug; codegraph structurally 0)
# - kind richness: # distinct symbol kinds
# - call-graph parity: caller counts for top-N callees, cbm vs codegraph vs grep ground-truth
set -uo pipefail
REPO="${1:?repo path}"; NICK="${2:?nickname}"; CBM="${3:-$HOME/.local/bin/codebase-memory-mcp}"
WORK="$(mktemp -d)/$NICK"; CACHE="$(mktemp -d)"
cp -R "$REPO" "$WORK"
echo "== head-to-head: $NICK ($(find "$WORK" -name '*.swift' -o -name '*.go' -o -name '*.ts' -o -name '*.py' 2>/dev/null | wc -l | tr -d ' ') src files) =="

# ---- cbm index ----
CBM_OUT=$(CBM_CACHE_DIR="$CACHE" "$CBM" cli index_repository "{\"repo_path\":\"$WORK\"}" 2>/dev/null | grep -v '^level=')
PROJ=$(echo "$CBM_OUT" | sed -n 's/.*"project":"\([^"]*\)".*/\1/p')
CBM_N=$(echo "$CBM_OUT" | sed -n 's/.*"nodes":\([0-9]*\).*/\1/p')
CBM_E=$(echo "$CBM_OUT" | sed -n 's/.*"edges":\([0-9]*\).*/\1/p')
qcbm(){ CBM_CACHE_DIR="$CACHE" "$CBM" cli query_graph "{\"project\":\"$PROJ\",\"query\":\"$1\"}" 2>/dev/null | grep -v '^level='; }

# cbm dup-node + kind richness: dup keyed on (name,file) since the bug emits the
# same source symbol as Method+Function with DIFFERENT qualified_names.
qcbm "MATCH (n) RETURN n.name AS nm, n.label AS l, n.file_path AS f" | python3 -c "
import sys,json
from collections import defaultdict,Counter
rows=json.load(sys.stdin).get('rows',[])
by=defaultdict(set); kinds=Counter()
for nm,l,f in rows:
kinds[l]+=1
if nm: by[(nm,f)].add(l)
dups=[k for k,s in by.items() if 'Method' in s and 'Function' in s]
# Swift type-kind fidelity: are struct/enum/protocol/extension distinct, or lumped into Class?
swiftkinds=sum(1 for k in kinds if k in ('Struct','Enum','Protocol','Extension','EnumCase','Actor','Component','Class'))
print(f'CBM_DUP={len(dups)}'); print(f'CBM_KINDS={len(kinds)}'); print(f'CBM_SWIFTKINDS={swiftkinds}')
print('CBM_KINDDIST='+','.join(f'{k}:{v}' for k,v in kinds.most_common(8)))
" > "$CACHE/cbm_metrics.env"
source "$CACHE/cbm_metrics.env"

# ---- codegraph index ----
CG_WORK="$(mktemp -d)/$NICK"; cp -R "$REPO" "$CG_WORK"
codegraph init "$CG_WORK" >/dev/null 2>&1
CG_STAT=$(codegraph status "$CG_WORK" 2>/dev/null)
CG_N=$(echo "$CG_STAT" | sed -n 's/.*Nodes:[[:space:]]*\([0-9]*\).*/\1/p' | head -1)
CG_E=$(echo "$CG_STAT" | sed -n 's/.*Edges:[[:space:]]*\([0-9]*\).*/\1/p' | head -1)
CG_KINDS=$(echo "$CG_STAT" | awk '/Nodes by Kind/{f=1;next} f&&/^ [a-z]/{c++} f&&/^$/{f=0} END{print c+0}')

# ---- report: structural metrics, then call-graph parity (up to 5 callees by fan-in) ----
echo "-- structural --"
printf " %-10s nodes=%-5s edges=%-5s dup_nodes=%-3s kinds=%-3s\n" "cbm" "$CBM_N" "$CBM_E" "$CBM_DUP" "$CBM_KINDS"
printf " %-10s nodes=%-5s edges=%-5s dup_nodes=%-3s kinds=%-3s\n" "codegraph" "$CG_N" "$CG_E" "0" "$CG_KINDS"
echo " cbm kinds: $CBM_KINDDIST"
echo "-- call-graph parity (callers: cbm | codegraph | grep-truth) --"
CALLEES=$(qcbm "MATCH (a)-[:CALLS]->(b) RETURN b.name AS c, count(a) AS n ORDER BY n DESC LIMIT 5" | python3 -c "import sys,json;print(' '.join(r[0].split('.')[-1] for r in json.load(sys.stdin).get('rows',[]) if r[0].isidentifier() or '.' in r[0]))" 2>/dev/null)
for sym in $CALLEES; do
cb=$(qcbm "MATCH (a)-[:CALLS]->(b) WHERE b.name='$sym' RETURN count(a) AS n" | python3 -c "import sys,json;d=json.load(sys.stdin);print(d['rows'][0][0] if d.get('rows') else 0)" 2>/dev/null)
cg=$(codegraph callers "$sym" -p "$CG_WORK" -j 2>/dev/null | python3 -c "import sys,json
try: d=json.load(sys.stdin); print(len(d) if isinstance(d,list) else len(d.get('callers',d.get('results',[]))))
except: print('?')" 2>/dev/null)
gt=$(grep -rEo "[^a-zA-Z_]$sym\s*\(" "$WORK" --include='*.swift' 2>/dev/null | wc -l | tr -d ' ')
printf " %-28s cbm=%-3s codegraph=%-3s grep~%-3s\n" "$sym" "${cb:-?}" "${cg:-?}" "$gt"
done
rm -rf "$WORK" "$CG_WORK" "$CACHE"
4 changes: 3 additions & 1 deletion internal/cbm/cbm.c
@@ -20,7 +20,9 @@
#if defined(CBM_BIND_TS_ALLOCATOR) && CBM_BIND_TS_ALLOCATOR
#include "sqlite3.h" // sqlite3_mem_methods, sqlite3_config, SQLITE_CONFIG_MALLOC — bind sqlite to mimalloc
#if defined(HAVE_LIBGIT2)
-#include <git2.h> // git_allocator, git_libgit2_opts, GIT_OPT_SET_ALLOCATOR — bind libgit2 to mimalloc
+#include <git2.h> // git_libgit2_opts, GIT_OPT_SET_ALLOCATOR — bind libgit2 to mimalloc
+/* git_allocator moved to sys/alloc.h in libgit2 1.8+; no longer in git2.h */
+#include <git2/sys/alloc.h>
#endif
#endif
#include <stdint.h> // uint32_t, uint64_t, int64_t
6 changes: 6 additions & 0 deletions internal/cbm/extract_defs.c
@@ -4946,6 +4946,12 @@ static void push_class_body_children(TSNode node, const CBMLangSpec *spec, walk_
TSNode child = ts_node_child(node, ci);
const char *ck = ts_node_type(child);
if (strcmp(ck, "field_declaration_list") == 0 || strcmp(ck, "class_body") == 0 ||
// Swift enum/protocol bodies (`enum_class_body` / `protocol_body`) are type-body
// containers extract_class_def already extracts members from (it finds them via the
// "body" field, which this child-type scan doesn't). Route them through the
// nested-class path here too, so enum statics / protocol members aren't ALSO
// re-walked and emitted as top-level Functions (the Method/Function dup-node bug, WS2a).
strcmp(ck, "enum_class_body") == 0 || strcmp(ck, "protocol_body") == 0 ||
strcmp(ck, "declaration_list") == 0 || strcmp(ck, "body") == 0 ||
strcmp(ck, "block") == 0 || strcmp(ck, "suite") == 0 ||
// Groovy class bodies are a `closure` node; routing through the
69 changes: 65 additions & 4 deletions src/cypher/cypher.c
@@ -2061,15 +2061,18 @@ static const char *node_string_field(const cbm_node_t *n, const char *prop) {
/* Get node property by name.
* store may be NULL; only needed for virtual degree properties. */
static const char *json_extract_prop(const char *json, const char *key, char *buf, size_t buf_sz);
static void node_fields_free(cbm_node_t *n); /* defined below; used by the stub re-fetch */

static const char *node_prop(const cbm_node_t *n, const char *prop, cbm_store_t *store) {
if (!n || !prop) {
return "";
}
const char *str = node_string_field(n, prop);
-if (str) {
+if (str && str[0]) {
return str;
}
/* Note: a string field that exists but is empty ("") falls through here so a
* WITH-aggregation node stub (below) can re-fetch it. */
/* Computed and JSON-derived values live in rotating thread-local buffers:
* a single row (or an ORDER-BY comparison) reads several of these before any
* of them is copied out, so returning one shared static buffer would alias
@@ -2107,6 +2110,40 @@ static const char *node_prop(const cbm_node_t *n, const char *prop, cbm_store_t
return v;
}
}
/* WITH aggregation carries a node group var by id + name only (the group key
* is the node name), so every other property is absent on the stub. Detect
* the stub (id set, but the full string fields were never populated) and
* re-fetch the node so RETURN g.file_path / g.label / g.<metric> project
* correctly instead of returning blank. The gate is heuristic, not an exact
* stub discriminator: a real bound node with NULL label AND file_path would
* also match, but in that case the worst case is one redundant indexed fetch
* that returns the same value — never a wrong result. */
if (store && n->id > 0 && !n->file_path && !n->label) {
cbm_node_t full = {0};
if (cbm_store_find_node_by_id(store, n->id, &full) == CBM_STORE_OK) {
const char *res = NULL;
const char *rv = node_string_field(&full, prop);
if (rv && rv[0]) {
snprintf(out, CBM_SZ_512, "%s", rv);
res = out;
} else if (strcmp(prop, "start_line") == 0) {
snprintf(out, CBM_SZ_512, "%d", full.start_line);
res = out;
} else if (strcmp(prop, "end_line") == 0) {
snprintf(out, CBM_SZ_512, "%d", full.end_line);
res = out;
} else if (full.properties_json && full.properties_json[0] == '{') {
const char *jv = json_extract_prop(full.properties_json, prop, out, CBM_SZ_512);
if (jv && jv[0]) {
res = out;
}
}
node_fields_free(&full);
if (res) {
return res;
}
}
}
return "";
}
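
The stub re-fetch added to `node_prop` above can be modelled in a few lines. This is a simplified Python sketch, not the C implementation: the `Node` dataclass, the dict standing in for `cbm_store_find_node_by_id`, and the sample values are all assumptions made for illustration, and it covers only string fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    id: int = 0
    name: Optional[str] = None
    label: Optional[str] = None
    file_path: Optional[str] = None


def node_prop(node, prop, store=None):
    """Return a string property, re-fetching by id when `node` looks like a stub."""
    value = getattr(node, prop, None)
    if value:  # present and non-empty: use it directly
        return value
    # Heuristic stub check, as in the C change: the id is set, but label
    # and file_path were never populated (WITH aggregation keeps id + name).
    looks_like_stub = node.id > 0 and node.label is None and node.file_path is None
    if store is not None and looks_like_stub:
        full = store.get(node.id)  # stand-in for the indexed lookup by id
        if full is not None:
            return getattr(full, prop, None) or ""
    return ""


# Hypothetical data, for illustration only.
store = {7: Node(id=7, name="grade", label="Method", file_path="Card.swift")}
stub = Node(id=7, name="grade")
print(node_prop(stub, "file_path", store))
```

As in the C code, a fully bound node whose label and file path are both missing also passes the check; the cost is one redundant lookup that returns the same value.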

@@ -2550,6 +2587,9 @@ static void rb_add_row(result_builder_t *rb, const char **values) {
/* ── Binding virtual variables (for WITH clause) ──────────────── */

static const char *binding_get_virtual(binding_t *b, const char *var, const char *prop) {
if (!var) {
return "";
}
/* Check virtual vars first (from WITH projection) */
char full[CBM_SZ_256];
if (prop) {
@@ -3406,8 +3446,9 @@ typedef struct {
double *sums;
int *counts;
double *mins, *maxs;
-char ***distinct_lists; /* per-item set of seen values for COUNT(DISTINCT) */
-int *distinct_n; /* per-item distinct count (#239) */
+char ***distinct_lists; /* per-item set of seen values for COUNT(DISTINCT) */
+int *distinct_n; /* per-item distinct count (#239) */
+int64_t *group_node_ids; /* per-item node id when the group var is a node (0 = not) */
} with_agg_t;

/* Build a group key from non-aggregate WITH items */
@@ -3447,6 +3488,7 @@ static int with_agg_find_or_create(with_agg_t **aggs, int *agg_cnt, int *agg_cap
(*aggs)[found].maxs = calloc(wc->count, sizeof(double));
(*aggs)[found].distinct_lists = calloc(wc->count, sizeof(char **));
(*aggs)[found].distinct_n = calloc(wc->count, sizeof(int));
(*aggs)[found].group_node_ids = calloc(wc->count, sizeof(int64_t));
for (int ci = 0; ci < wc->count; ci++) {
(*aggs)[found].mins[ci] = CYP_DBL_MAX;
(*aggs)[found].maxs[ci] = -CYP_DBL_MAX;
@@ -3458,6 +3500,15 @@ static int with_agg_find_or_create(with_agg_t **aggs, int *agg_cnt, int *agg_cap
}
const char *v = binding_get_virtual(b, wc->items[ci].variable, wc->items[ci].property);
(*aggs)[found].group_vals[ci] = heap_strdup(v);
/* If this group item is a bare node variable, remember its id so the
* carried virtual var can re-fetch any property (group_vals holds only
* the name). */
if (!wc->items[ci].property && wc->items[ci].variable) {
cbm_node_t *gn = binding_get(b, wc->items[ci].variable);
if (gn) {
(*aggs)[found].group_node_ids[ci] = gn->id;
}
}
}
return found;
}
@@ -3528,6 +3579,7 @@ static void with_agg_free(with_agg_t *aggs, int agg_cnt, int item_count) {
free(aggs[a].maxs);
free(aggs[a].distinct_lists);
free(aggs[a].distinct_n);
free(aggs[a].group_node_ids);
}
free(aggs);
}
@@ -3553,6 +3605,9 @@ static void execute_with_aggregate(cbm_return_clause_t *wc, binding_t *bindings,
}
for (int a = 0; a < agg_cnt; a++) {
binding_t vb = {0};
/* Carry the store so node_prop can re-fetch a carried node's properties
* (and compute in_degree/out_degree) on the projected virtual binding. */
vb.store = (bind_count > 0) ? bindings[0].store : NULL;
for (int ci = 0; ci < wc->count; ci++) {
char name_buf[CBM_SZ_256];
const char *alias = resolve_item_alias(&wc->items[ci], name_buf, sizeof(name_buf));
@@ -3566,6 +3621,11 @@ static void execute_with_aggregate(cbm_return_clause_t *wc, binding_t *bindings,
with_add_vbinding_var(&vb, alias, vbuf);
} else {
with_add_vbinding_var(&vb, alias, aggs[a].group_vals[ci]);
/* Tag the carried virtual var with the node id (when the group
* var is a node) so node_prop can re-fetch its full properties. */
if (aggs[a].group_node_ids[ci] > 0 && vb.var_count > 0) {
vb.var_nodes[vb.var_count - 1].id = aggs[a].group_node_ids[ci];
}
}
}
(*vbindings)[(*vcount)++] = vb;
@@ -3578,6 +3638,7 @@ static void execute_with_simple(cbm_return_clause_t *wc, binding_t *bindings, in
binding_t *vbindings, int *vcount) {
for (int bi = 0; bi < bind_count; bi++) {
binding_t vb = {0};
vb.store = bindings[bi].store; /* so node_prop can re-fetch / compute on the projection */
for (int ci = 0; ci < wc->count; ci++) {
char name_buf[CBM_SZ_256];
const char *alias = resolve_item_alias(&wc->items[ci], name_buf, sizeof(name_buf));
@@ -4201,7 +4262,7 @@ static int execute_single(cbm_store_t *store, cbm_query_t *q, const char *projec
scan_pattern_nodes(store, project, max_rows, &pat0->nodes[0], &scanned, &scan_count);

/* Build initial bindings with early WHERE */
-int bind_cap = scan_count > 0 ? scan_count : SKIP_ONE;
+int bind_cap = scan_count > max_rows ? scan_count : (max_rows > 0 ? max_rows : SKIP_ONE);
binding_t *bindings = malloc((bind_cap + SKIP_ONE) * sizeof(binding_t));
int bind_count = 0;
const char *var_name = pat0->nodes[0].variable ? pat0->nodes[0].variable : "_n0";
8 changes: 6 additions & 2 deletions src/foundation/str_util.c
@@ -6,6 +6,7 @@
#include "foundation/constants.h"
#include <string.h>
#include <ctype.h>
#include <stdio.h>

enum {
JSON_ESC_LEN = 2, /* escaped char takes 2 bytes (backslash + char) */
@@ -328,8 +329,11 @@ int cbm_json_escape(char *buf, int bufsize, const char *src) {
buf[pos++] = '\\';
buf[pos++] = 't';
} else if (c < JSON_CTRL_LIMIT) {
-/* Other control chars: skip */
-continue;
+/* Other control chars: escape as \u00XX */
+if (pos + 6 > bufsize - JSON_NUL_RESERVE) {
+break;
+}
+pos += snprintf(buf + pos, 7, "\\u%04x", c);
} else {
buf[pos++] = (char)c;
}
2 changes: 1 addition & 1 deletion src/git/git_context.c
@@ -316,7 +316,7 @@ static int json_escaped_len(const char *src) {
if (c == '"' || c == '\\' || c == '\n' || c == '\r' || c == '\t') {
len += 2;
} else if (c < 0x20) {
-continue;
+len += 6; /* \u00XX */
} else {
len++;
}
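
The `str_util.c` and `git_context.c` hunks have to agree: the escaper now emits six bytes (`\u00XX`) for each remaining control character, so the length pre-computation must count six as well. The sketch below models that invariant in Python. It is an illustration rather than a port: it works on `str` characters where the C code works on bytes, and both function bodies are assumptions reconstructed from the lines visible in the diff.

```python
import json

SHORT_ESCAPES = {'"': '\\"', "\\": "\\\\", "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def json_escape(text):
    """Escape text for embedding inside a JSON string literal."""
    out = []
    for ch in text:
        if ch in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[ch])        # two-character escape
        elif ord(ch) < 0x20:
            out.append("\\u%04x" % ord(ch))      # other control chars: \u00XX
        else:
            out.append(ch)
    return "".join(out)


def json_escaped_len(text):
    """Predict the escaped length without building the string."""
    total = 0
    for ch in text:
        if ch in SHORT_ESCAPES:
            total += 2
        elif ord(ch) < 0x20:
            total += 6
        else:
            total += 1
    return total


sample = 'a\x01"b\n'
escaped = json_escape(sample)
print(escaped, len(escaped), json_escaped_len(sample))
```

Round-tripping the escaped text through `json.loads('"' + escaped + '"')` returns the original string, which the old behaviour of dropping control characters could not guarantee.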