Dump-phase crash on large TS monorepo (v0.6.1, darwin-arm64) — pipeline completes through semantic_edges then terminates abnormally before gbuf.dump #317

@hjdesai2

Description

Summary

codebase-memory-mcp v0.6.1 reliably crashes during the dump phase when indexing a large TypeScript monorepo on macOS arm64. The pipeline completes 12 passes — including parallel_extract, parallel_resolve, similarity, semantic_edges — and then terminates abnormally before gbuf.dump ever fires. No DB is persisted. Three runs reproduce deterministically across two different input sizes.

This appears to be the same dump-phase crash reported in #189 (large Java project, v0.5.7) but on a different platform/language combination, on the current release. Distinct from #141, which crashes earlier in parallel_extract.

Environment

Version: codebase-memory-mcp 0.6.1 (latest release, prebuilt darwin-arm64 binary, SHA-256 verified against checksums.txt)
OS: macOS arm64, 32 GB system RAM (the indexer peaks at ~5.5 GB RSS, so plenty of headroom)
Repo: a large internal TypeScript monorepo

Internal mem budget on startup: level=info msg=mem.init budget_mb=16384 total_ram_mb=32768 — peak RSS observed during the run is well under that 16 GB budget.

Reproduction

Two inputs, two scale points, same crash:

Input                                 Files                       Result
Full monorepo                         ~138 K TS/TSX + 16 K .py    crash
harbor/ workspace subset only         ~107 K TS/TSX + 16 K .py    crash
Single package within the monorepo    ~2,540 TS files             succeeds — full pipeline + DB persisted

Command:

codebase-memory-mcp cli index_repository '{"repo_path":"/path/to/large/ts-monorepo","name":"my-monorepo"}'
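
A minimal sketch of how the timing and peak-RSS numbers below can be captured on macOS (assumption: /usr/bin/time -l as the wrapper; the repo path is a placeholder):

# BSD time on macOS; -l adds rusage, including "maximum resident set size" (the peak RSS quoted below)
/usr/bin/time -l \
  codebase-memory-mcp cli index_repository \
  '{"repo_path":"/path/to/large/ts-monorepo","name":"my-monorepo"}'
echo "exit status: $?"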

Pass-by-pass timeline (harbor run)

pass=structure          75 ms
pass=parallel_extract   16,577 ms   (742,694 nodes extracted, 206 errors)
pass=registry_build     1,008 ms
pass=parallel_resolve   11,088 ms
pass=k8s                71 ms
pass=tests              290 ms
pass=githistory_compute 2,546 ms
pass=decorator_tags     38 ms
pass=configlink         962 ms
pass=route_match        134 ms
pass=similarity         190 ms      (28,112 fingerprints → 21,253 SIMILAR_TO edges)
pass=semantic_edges     6,960 ms    (81,503 functions, vectors stored, LSH built, 14 edges)
                        <<<  CRASH HERE  >>>
                        no `pass.start pass=dump` line
                        no `gbuf.dump nodes=X edges=Y` line
                        no `pipeline.done` line
time: command terminated abnormally
65 s real, 5.5 GB peak RSS

Same crash signature on the full-monorepo run: dies after pass=semantic_edges completes (elapsed_ms=4894), before dump. 39 s real, 5.2 GB peak RSS.

Diagnostic notes

  • Reproducible across two distinct repo subsets, so it's not corrupt input on one specific file.
  • Single mid-size package indexes cleanly through gbuf.dump and pipeline.done, so the issue isn't in the dump code path's correctness — it's a scale threshold somewhere between ~30 K and ~80 K functions, or between ~57 K and ~? edges (the larger run never reaches the edge-count totals because gbuf.dump is where those get computed).
  • Peak RSS ~5-5.5 GB, well under the 16 GB internal budget and the 32 GB system RAM. So it's not classic OS-level OOM.
  • time: command terminated abnormally, with 0 signals reported in the wrapper output — the process exits via something the shell doesn't classify as a signal, which suggests an internal abort/assert/explicit exit rather than a SIGKILL from the OS (see the capture sketch after this list).
  • 200+ extraction errors during parallel_extract are logged but explicitly handled (level=info not error) and the pipeline continues normally, so they're not load-bearing on this crash.
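
A sketch of how the exact exit status and any crash report could be captured on the next run (assumption: plain bash; the repo path is a placeholder). An exit status of 128+N would point at termination by signal N (e.g. 134 for SIGABRT), while a small non-zero status would point at an explicit exit or assert:

codebase-memory-mcp cli index_repository \
  '{"repo_path":"/path/to/large/ts-monorepo","name":"my-monorepo"}'
status=$?
echo "exit status: $status"
# recent macOS writes user-level crash reports as .ips files here
ls -lt ~/Library/Logs/DiagnosticReports | head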

Related issues

  • #189: dump-phase crash during gbuf.dump (pointer-auth trap) on a large Java project, v0.5.7; likely the same underlying bug on a different platform/language.
  • #141: crashes earlier, in parallel_extract; a distinct failure point.

Hypothesis

The dump-phase write of ~80 K function semantic vectors plus the SIMILAR_TO/CALLS/IMPORTS edges into the in-memory SQLite (level=info msg=gbuf.dump) is the most likely failure surface. Given #189's pointer-auth trap signature on the same code path and the consistent ~5 GB RSS at crash, my guess is a memory-corruption bug in the buffer management between pass=semantic_edges and the SQLite dump (perhaps a buffer reuse / lifetime issue that only manifests once a particular buffer grows large enough).

Useful for triage

Happy to share any additional logs (a verbose flag if available, the partial cmm.db file from a crashed run, or a run with specific instrumentation). I have a clean macOS arm64 + 32 GB RAM environment that reproduces this in 30-60 s, so it's quick to test fixes against.
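
If a native backtrace would help, I can run the crashing command under lldb in batch mode; a rough sketch (assumptions: the prebuilt binary yields a usable stack, and the repo path is a placeholder):

# -k commands run only if the target crashes, so this prints all thread backtraces at the crash point
lldb --batch \
  -o run \
  -k 'bt all' \
  -k quit \
  -- codebase-memory-mcp cli index_repository \
     '{"repo_path":"/path/to/large/ts-monorepo","name":"my-monorepo"}'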

In the meantime we're working around it by indexing per-package — every individual package indexes cleanly. That covers our deployment scenario but loses cross-package call/import edges, so a fix would unlock real value.
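
A rough sketch of that per-package workaround (assumption: packages live under packages/*/; adjust for the real workspace layout):

# index each package as its own repository; cross-package edges are lost
for pkg in /path/to/large/ts-monorepo/packages/*/; do
  name="my-monorepo-$(basename "$pkg")"
  codebase-memory-mcp cli index_repository \
    "{\"repo_path\":\"${pkg%/}\",\"name\":\"$name\"}"
done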
