diff --git a/.gitignore b/.gitignore index 5860fb1b6..d9b8639ad 100644 --- a/.gitignore +++ b/.gitignore @@ -49,6 +49,9 @@ build/ node_modules/ graph-ui/dist/ +# Generated by scripts/embed-frontend.sh during the --with-ui build +src/ui/embedded_assets.c + # Generated reports BENCHMARK_REPORT.md TEST_PLAN.md diff --git a/Makefile.cbm b/Makefile.cbm index 3ff50b81f..0b37ac5e5 100644 --- a/Makefile.cbm +++ b/Makefile.cbm @@ -89,9 +89,16 @@ ifeq ($(STATIC),1) STATIC_FLAGS := -static endif -LDFLAGS = -lm -lstdc++ -lpthread -lz $(LIBGIT2_LIBS) $(WIN32_LIBS) $(STATIC_FLAGS) -LDFLAGS_TEST = -lm -lstdc++ -lpthread -lz $(SANITIZE) $(LIBGIT2_LIBS) $(WIN32_LIBS) -LDFLAGS_TSAN = -lm -lstdc++ -lpthread -lz -fsanitize=thread $(LIBGIT2_LIBS) $(WIN32_LIBS) +# libdl: SQLite's dlopen/dlsym live in a separate libdl on glibc < 2.34. +# Linked on non-Windows (harmless stub on glibc >= 2.34 and macOS); MinGW has no libdl. +DL_LIBS := +ifneq ($(IS_MINGW),yes) +DL_LIBS := -ldl +endif + +LDFLAGS = -lm -lstdc++ -lpthread -lz $(DL_LIBS) $(LIBGIT2_LIBS) $(WIN32_LIBS) $(STATIC_FLAGS) +LDFLAGS_TEST = -lm -lstdc++ -lpthread -lz $(DL_LIBS) $(SANITIZE) $(LIBGIT2_LIBS) $(WIN32_LIBS) +LDFLAGS_TSAN = -lm -lstdc++ -lpthread -lz $(DL_LIBS) -fsanitize=thread $(LIBGIT2_LIBS) $(WIN32_LIBS) # ── Source files ───────────────────────────────────────────────── diff --git a/docs/SECURITY_ASSESSMENT.html b/docs/SECURITY_ASSESSMENT.html new file mode 100644 index 000000000..38b00b6fa --- /dev/null +++ b/docs/SECURITY_ASSESSMENT.html @@ -0,0 +1,290 @@ + + +
+ + +
+ Scope: data exfiltration risk to proprietary source, remote code execution potential, and
+ project legitimacy. Reviewed at git e599df1 (branch main).
+ Assessment date: 2026-06-18.
+
+ codebase-memory-mcp is a code-intelligence engine written in pure C (zero runtime
+ dependencies). It parses a repository with tree-sitter (plus a hybrid LSP layer), builds a
+ persistent knowledge graph in a local SQLite database, and exposes ~14 MCP tools so AI coding
+ agents can query code structure instead of reading files one by one. By design it reads your
+ entire codebase, writes to agent config files, and spawns local helper processes
+ (git, grep). All indexing happens locally; the graph is stored in
+ local *.db files.
+
| Signal | Finding |
|---|---|
| License | MIT, Copyright (c) 2025 DeusData. Permissive, auditable. |
| Repository activity | 862 commits, ~4 months of active history (2026-02-25 to 2026-06-12). |
| Authorship | Primary author Martin Vogel (696 commits), plus DeusData, Shane McCarron, Dependabot, and several outside contributors. Not a single-commit drop. |
| Provenance | README advertises SLSA 3 build provenance, Sigstore signatures, SHA-256 checksums, VirusTotal scan per release, and an OpenSSF Scorecard badge. |
| Research backing | Linked arXiv preprint describing the design and benchmarks. |
| Transparency | SECURITY.md explicitly discloses the filesystem access patterns, invites researchers to find RCE, and states "your code never leaves your machine." |
+ Conclusion: this is a real, actively maintained open-source project, not malware or a typosquat. + The README's "code never leaves your machine" claim is consistent with what the source actually does + (see section 3). +
+ +The entire C codebase makes exactly four outbound network calls. All target GitHub, and all are + fetch/download only - none send a request body, and none transmit any indexed code, file paths, + symbols, or graph data.
+| # | Location | What it does | Leaks code? |
|---|---|---|---|
| 1 | src/mcp/mcp.c:4369 | curl -sf GET to api.github.com/.../releases/latest to read the latest version tag (background update check on startup). | No - GET, no body |
| 2 | src/cli/cli.c:2804 | Downloads checksums.txt from GitHub releases (only during update). | No - download |
| 3 | src/cli/cli.c:3801 | Downloads the release binary archive from GitHub releases (only during update). | No - download |
| 4 | src/cli/cli.c:3874 | curl -sfI HEAD request to the releases page to detect a newer version. | No - HEAD |
src/semantic/semantic.c computes embeddings locally. No OpenAI / Anthropic / Cohere / HuggingFace calls, no API keys, no bearer tokens.127.0.0.1 (htonl(0x7F000001), src/ui/httpd.c:141). It is not reachable from your network, so the indexed graph cannot be read by other machines.git diff, git ls-files, grep, pgrep, macOS codesign/xattr. No data leaves the machine.The highest-risk path is the search_code tool, which feeds user input into a
+ grep invocation. The design neutralizes injection:
write_pattern_file) and passed via grep -f <file>. It is never interpolated into the shell command, so metacharacters like ; or $(...) are treated as literal grep pattern bytes.root_path and file_pattern - are gated by validate_search_args before use.cbm_validate_shell_arg (src/foundation/str_util.c:245) rejects any argument containing ' " ; | & $ ` < >, CR, LF, and (on non-Windows) backslash.git diff, git ls-files) and the watcher apply the same validator to base_branch, root_path, and session_root before building commands.cbm_exec_no_shell, which uses fork+execvp (POSIX) or _spawnvp (Windows) directly - no shell parses the URL or arguments.Net: an attacker who can supply MCP tool arguments or CLI arguments cannot achieve + arbitrary command execution through the reviewed paths.
+ +The update subcommand downloads a release archive and installs it as the running
+ binary. This is user-initiated only - the automatic startup check merely sets a one-shot
+ notice string ("run: codebase-memory-mcp update"); it never downloads or
+ swaps anything on its own.
The downloaded archive's SHA-256 is compared against checksums.txt from the same
+ release. FIXED Previously, the caller logic
+ (download_verify_install) aborted only on an explicit checksum mismatch:
+ if verification could not be performed at all - the checksums.txt download failed, the
+ archive name was missing from it, or no sha256sum/shasum tool was present -
+ verify_download_checksum returned "could not verify" and the install proceeded anyway
+ with only a stderr warning.
The installer now fails closed: any non-zero verification result aborts the update and deletes
+ the downloaded archive. An explicit mismatch always aborts. The "could not verify" case also aborts
+ unless an operator explicitly opts in for an offline/airgapped host by setting
+ CBM_ALLOW_UNVERIFIED_UPDATE=1 (which prints a loud warning). Current logic in
+ src/cli/cli.c:
int crc = verify_download_checksum(tmp_archive, archive_name);
+if (crc == CLI_TRUE) { /* mismatch -> always abort */
+ cbm_unlink(tmp_archive);
+ return CLI_TRUE;
+}
+if (crc != 0) { /* "could not verify" -> fail closed */
+ const char *allow = cbm_safe_getenv("CBM_ALLOW_UNVERIFIED_UPDATE", ...);
+ if (!allow || strcmp(allow, "1") != 0) {
+ cbm_unlink(tmp_archive);
+ return CLI_TRUE; /* abort: refuse to install unverified binary */
+ }
+ /* else: install with a loud "UNVERIFIED" warning (explicit opt-in) */
+}
+ Residual (recommended, defense in depth): the update path verifies only the
+ SHA-256 from the same origin; it does not verify the Sigstore signature or SLSA provenance the
+ project publishes for releases. So integrity now rests on enforced SHA-256 + HTTPS + GitHub release
+ integrity, but not yet on a cryptographic signature check. Verifying the signature/provenance would
+ close the remaining same-origin-trust gap.
With the fix, the only way to install a tampered binary is a checksum collision against
+ a same-origin checksums.txt (i.e. a compromised GitHub release or HTTPS break that
+ rewrites both files consistently), and it still triggers only when the user manually runs
+ update. The CBM_ALLOW_UNVERIFIED_UPDATE=1 opt-out is the one path that
+ re-enables unverified installs, by explicit operator choice.
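
The fail-closed policy described above can be restated as a small decision function. This is a minimal sketch, not the project's code: `verify_result` and `may_install` are hypothetical names, and the enum values are illustrative rather than the real return codes of `verify_download_checksum`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical three-way outcome of checksum verification. The real code
 * uses int return codes; these enum values are assumptions for illustration. */
typedef enum {
    VERIFY_OK = 0,          /* checksum matched */
    VERIFY_MISMATCH = 1,    /* checksum present but different */
    VERIFY_UNAVAILABLE = 2  /* could not verify at all */
} verify_result;

/* Returns true only when installing is allowed. allow_env is the value of
 * CBM_ALLOW_UNVERIFIED_UPDATE, or NULL when the variable is unset. */
static bool may_install(verify_result result, const char *allow_env)
{
    switch (result) {
    case VERIFY_OK:
        return true;
    case VERIFY_MISMATCH:
        return false; /* an explicit mismatch is never overridable */
    case VERIFY_UNAVAILABLE:
        /* fail closed unless the operator set the variable to exactly "1" */
        return allow_env != NULL && strcmp(allow_env, "1") == 0;
    }
    return false; /* unknown result: fail closed */
}
```

Keeping the decision in one pure function makes the policy easy to unit-test: the mismatch case returns false no matter what the environment says, and every unrecognized result is treated as a refusal.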
No exfiltration path exists. All processing is local; outbound traffic is version checks and + (on demand) binary downloads from GitHub. Matches the project's own claim.
+Search patterns are passed out-of-band via grep -f; interpolated paths are validated;
+ process spawns avoid the shell. No command injection found in the reviewed paths.
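
The character-rejection rule attributed to `cbm_validate_shell_arg` in section 3 can be sketched as follows. This is an assumption-laden illustration, not the project's implementation: `shell_arg_is_safe` is a hypothetical name, and the real function's signature and return convention are not shown in this assessment. Only the rejected character set is taken from the text above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the validator described in the assessment.
 * Returns true when arg contains none of the rejected bytes. */
static bool shell_arg_is_safe(const char *arg)
{
    if (arg == NULL) {
        return false;
    }
#ifdef _WIN32
    /* Backslash is a path separator on Windows, so it is not rejected there. */
    static const char rejected[] = "'\";|&$`<>\r\n";
#else
    static const char rejected[] = "'\";|&$`<>\r\n\\";
#endif
    /* strpbrk returns a pointer to the first byte of arg that appears in
     * rejected, or NULL when no such byte exists. */
    return strpbrk(arg, rejected) == NULL;
}
```

A reject-list like this only matters for values that are interpolated into a command string; values passed out-of-band (such as a pattern file given to `grep -f`) or through an argv array never reach a shell parser in the first place.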
Where: download_verify_install in src/cli/cli.c.
Previous issue: "could not verify" (missing checksums.txt, missing sha256 tool, name not + listed) was treated as a non-fatal warning, so an unverified binary was still installed and executed.
+Fix applied: the installer now fails closed - any non-zero result from
+ verify_download_checksum (mismatch or "could not verify") aborts the update and
+ deletes the downloaded archive. An explicit checksum mismatch always aborts. For genuinely
+ offline/airgapped hosts, an operator can opt in to skipping verification by setting
+ CBM_ALLOW_UNVERIFIED_UPDATE=1, which also prints a loud warning.
int crc = verify_download_checksum(tmp_archive, archive_name);
+if (crc == CLI_TRUE) { /* mismatch -> always abort */
+ cbm_unlink(tmp_archive);
+ return CLI_TRUE;
+}
+if (crc != 0) { /* "could not verify" -> fail closed */
+ /* abort unless CBM_ALLOW_UNVERIFIED_UPDATE=1 is set */
+}
+ Still recommended (defense in depth): verify the Sigstore signature / SLSA provenance rather + than only a same-origin SHA-256, and prefer installing updates from independently verified release + artifacts (section 6).
+Where: start_update_check (src/mcp/mcp.c:4417), launched at startup
+ (src/mcp/mcp.c:4496).
Issue: The server contacts api.github.com on its own. It discloses only your IP
+ and that you run the tool - no code - but there is no built-in opt-out, unlike the downloads, which
+ honor the CBM_DOWNLOAD_URL env var.
Recommendation: in restricted-egress / air-gapped environments, block egress at the firewall
+ (section 6). Optionally patch out start_update_check if building from source.
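
For source builds, one way to approximate an opt-out is to gate the startup check on an environment variable. The sketch below is an assumption, not an existing feature: `CBM_NO_UPDATE_CHECK` and `update_check_enabled` are hypothetical names, and the reviewed code does not read such a variable.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical gate for a patched build. value is whatever
 * getenv("CBM_NO_UPDATE_CHECK") returned (NULL when unset).
 * Only the exact string "1" disables the check. */
static bool update_check_enabled(const char *value)
{
    return value == NULL || strcmp(value, "1") != 0;
}
```

A patched build would evaluate `update_check_enabled(getenv("CBM_NO_UPDATE_CHECK"))` at startup and skip launching `start_update_check` when it returns false. Blocking egress at the firewall remains the stronger control, since it does not depend on the binary honoring a setting.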
# Linux: drop outbound for the process user (example) +sudo -u cbmuser firejail --net=none codebase-memory-mcp serve +# or run inside a container with no network: +docker run --network none -v "$PWD":/repo:ro your-cbm-image+
-v "$PWD":/repo:ro) so the indexer can
+ read but not modify source. Note it still needs a writable location for its *.db index.api.github.com and github.com egress for this process.update command for the integrity guarantee (see Medium finding).127.0.0.1; keep it that way and do not put it
+ behind a reverse proxy that exposes it.Bottom line: This is a legitimate tool that does not exfiltrate your code and is not + remotely exploitable through its MCP/CLI inputs. The only meaningful hardening item is the + best-effort self-update; pair "build-from-source or manually-verified binary" with + "run with egress blocked" and you can point it at proprietary code with confidence.
+{error}
-- {filteredData.nodes.length.toLocaleString()} nodes /{" "} - {filteredData.edges.length.toLocaleString()} edges + {filteredData.nodes.length.toLocaleString()} groups /{" "} + {filteredData.edges.length.toLocaleString()} links +
++ click a node to drill in - size = code volume
- {data.nodes.length > filteredData.nodes.length && ( -- filtered from {data.nodes.length.toLocaleString()} -
- )} {highlightedIds && highlightedIds.size > 0 && ( -- {highlightedIds.size} selected -
+{highlightedIds.size} selected
)}No matches
+ ) : ( + rows.map((n) => ( +