Dump-phase crash on large TS monorepo (v0.6.1, darwin-arm64) — pipeline completes through semantic_edges then terminates abnormally before gbuf.dump #317

@hjdesai2

Description

Summary

codebase-memory-mcp v0.6.1 reliably crashes during the dump phase when indexing a large TypeScript monorepo on macOS arm64. The pipeline completes 12 passes — including parallel_extract, parallel_resolve, similarity, semantic_edges — and then terminates abnormally before gbuf.dump ever fires. No DB is persisted. Three runs reproduce deterministically across two different input sizes.

This appears to be the same dump-phase crash reported in #189 (large Java project, v0.5.7) but on a different platform/language combination, on the current release. Distinct from #141, which crashes earlier in parallel_extract.

Environment

Version: codebase-memory-mcp 0.6.1 (latest release, prebuilt darwin-arm64 binary, SHA-256 verified against checksums.txt)
OS: macOS arm64, 32 GB system RAM (the indexer peaks at ~5.5 GB RSS, so plenty of headroom)
Repo: a large internal TypeScript monorepo

Internal mem budget on startup: level=info msg=mem.init budget_mb=16384 total_ram_mb=32768 — peak RSS observed during the run is well under that 16 GB budget.

Reproduction

Two inputs, two scale points, same crash:

Input                                 Files                       Result
Full monorepo                         ~138 K TS/TSX + 16 K .py    crash
harbor/ workspace subset only         ~107 K TS/TSX + 16 K .py    crash
Single package within the monorepo    ~2,540 TS files             succeeds — full pipeline + DB persisted

Command:

codebase-memory-mcp cli index_repository '{"repo_path":"/path/to/large/ts-monorepo","name":"my-monorepo"}'
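
A minimal sketch of how the timing and peak-RSS numbers below can be captured on macOS (assumption: /usr/bin/time -l as the wrapper; the repo path is a placeholder):

# BSD time on macOS; -l adds rusage, including "maximum resident set size" (the peak RSS quoted below)
/usr/bin/time -l \
  codebase-memory-mcp cli index_repository \
  '{"repo_path":"/path/to/large/ts-monorepo","name":"my-monorepo"}'
echo "exit status: $?"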

Pass-by-pass timeline (harbor run)

pass=structure          75 ms
pass=parallel_extract   16,577 ms   (742,694 nodes extracted, 206 errors)
pass=registry_build     1,008 ms
pass=parallel_resolve   11,088 ms
pass=k8s                71 ms
pass=tests              290 ms
pass=githistory_compute 2,546 ms
pass=decorator_tags     38 ms
pass=configlink         962 ms
pass=route_match        134 ms
pass=similarity         190 ms      (28,112 fingerprints → 21,253 SIMILAR_TO edges)
pass=semantic_edges     6,960 ms    (81,503 functions, vectors stored, LSH built, 14 edges)
                        <<<  CRASH HERE  >>>
                        no `pass.start pass=dump` line
                        no `gbuf.dump nodes=X edges=Y` line
                        no `pipeline.done` line
time: command terminated abnormally
65 s real, 5.5 GB peak RSS

Same crash signature on the full-monorepo run: dies after pass=semantic_edges completes (elapsed_ms=4894), before dump. 39 s real, 5.2 GB peak RSS.

Diagnostic notes

  • Reproducible across two distinct repo subsets, so it's not corrupt input on one specific file.
  • Single mid-size package indexes cleanly through gbuf.dump and pipeline.done, so the issue isn't in the dump code path's correctness — it's a scale threshold somewhere between ~30 K and ~80 K functions, or between ~57 K and ~? edges (the larger run never reaches the edge-count totals because gbuf.dump is where those get computed).
  • Peak RSS ~5-5.5 GB, well under the 16 GB internal budget and the 32 GB system RAM. So it's not classic OS-level OOM.
  • time: command terminated abnormally, with 0 signals reported in the wrapper output — the process exits via something the shell doesn't classify as a signal, which suggests an internal abort/assert/explicit exit rather than a SIGKILL from the OS (see the capture sketch after this list).
  • 200+ extraction errors during parallel_extract are logged but explicitly handled (level=info not error) and the pipeline continues normally, so they're not load-bearing on this crash.
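
A sketch of how the exact exit status and any crash report could be captured on the next run (assumption: plain bash; the repo path is a placeholder). An exit status of 128+N would point at termination by signal N (e.g. 134 for SIGABRT), while a small non-zero status would point at an explicit exit or assert:

codebase-memory-mcp cli index_repository \
  '{"repo_path":"/path/to/large/ts-monorepo","name":"my-monorepo"}'
status=$?
echo "exit status: $status"
# recent macOS writes user-level crash reports as .ips files here
ls -lt ~/Library/Logs/DiagnosticReports | head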

Related issues

  • #189: dump-phase crash during gbuf.dump (pointer-auth trap) on a large Java project, v0.5.7; likely the same underlying bug on a different platform/language.
  • #141: crashes earlier, in parallel_extract; a distinct failure point.

Hypothesis

The dump-phase write of ~80 K function semantic vectors plus the SIMILAR_TO/CALLS/IMPORTS edges into the in-memory SQLite (level=info msg=gbuf.dump) is the most likely failure surface. Given #189's pointer-auth trap signature on the same code path and the consistent ~5 GB RSS at crash, my guess is a memory-corruption bug in the buffer management between pass=semantic_edges and the SQLite dump (perhaps a buffer reuse / lifetime issue that only manifests once a particular buffer grows large enough).

Useful for triage

Happy to share any additional logs (a verbose flag if available, the partial cmm.db file from a crashed run, or a run with specific instrumentation). I have a clean macOS arm64 + 32 GB RAM environment that reproduces this in 30-60 s, so it's quick to test fixes against.
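
If a native backtrace would help, I can run the crashing command under lldb in batch mode; a rough sketch (assumptions: the prebuilt binary yields a usable stack, and the repo path is a placeholder):

# -k commands run only if the target crashes, so this prints all thread backtraces at the crash point
lldb --batch \
  -o run \
  -k 'bt all' \
  -k quit \
  -- codebase-memory-mcp cli index_repository \
     '{"repo_path":"/path/to/large/ts-monorepo","name":"my-monorepo"}'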

In the meantime we're working around it by indexing per-package — every individual package indexes cleanly. That covers our deployment scenario but loses cross-package call/import edges, so a fix would unlock real value.
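
A rough sketch of that per-package workaround (assumption: packages live under packages/*/; adjust for the real workspace layout):

# index each package as its own repository; cross-package edges are lost
for pkg in /path/to/large/ts-monorepo/packages/*/; do
  name="my-monorepo-$(basename "$pkg")"
  codebase-memory-mcp cli index_repository \
    "{\"repo_path\":\"${pkg%/}\",\"name\":\"$name\"}"
done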
