fix(embed): resolve source files from DB root, not cwd#992
Conversation
When `codegraph embed --db <abs-path>` is invoked without a positional directory, the positional `rootDir` defaults to the current working directory. The embed loop then joined every relative node file path against cwd, causing every file read to fail and the command to exit 0 with a misleading "Stored 0 embeddings". Fix: - Persist the repo root (`root_dir`) in the `build_meta` table during build (both WASM/JS finalize and the Rust native orchestrator). - In `buildEmbeddings`, resolve relative file paths against `build_meta.root_dir` first, falling back to the DB's parent directory (`<root>/.codegraph/graph.db` → `<root>`) for DBs built before this change, then finally the caller-provided rootDir. - When every file read fails, throw `DbError` with a clear message so the CLI exits non-zero instead of printing "Stored 0 embeddings" and returning success. Closes #983 Impact: 3 functions changed, 5 affected
|
Claude finished @carlos-alm's task —— View job Code Review CompleteRoot Cause AnalysisProblem Being Solved: Issue #983 where Why This Approach: The fix persists Risk Assessment: Low risk. Changes are purely additive with graceful fallbacks for older DBs. Error handling is improved (throws Backlog Compliance
Critical ConcernsNone identified. This is a well-implemented bug fix with:
Code Quality NotesRust Implementation (build_pipeline.rs:712): let _ = stmt.execute(["root_dir", &root_canon]);Uses canonical path resolution which is appropriate. TypeScript Implementation (finalize.ts:117): root_dir: rootDirToWrite,Consistent with Rust approach using Generator Logic (generator.ts:81-83): const metaRoot = getBuildMeta(db, 'root_dir');
const dbParent = path.dirname(path.dirname(path.resolve(dbPath)));
const resolvedRoot = metaRoot || dbParent || rootDir;Smart fallback chain handles all scenarios properly. Error Handling (generator.ts:155-163): Final Recommendation
|
Greptile SummaryThis PR fixes issue #983 by persisting Confidence Score: 5/5Safe to merge; all previously flagged P0/P1 issues are resolved and the one remaining finding is a pre-existing P2. All three prior review concerns were properly addressed across the fixup commits. The only new finding is a pre-existing destructive-delete ordering issue (P2) that was present before this PR — the new throw makes it more visible but doesn't worsen the data-loss semantics materially. src/domain/search/generator.ts (DELETE-before-validate ordering, P2) Important Files Changed
Sequence DiagramsequenceDiagram
participant CLI
participant buildEmbeddings
participant DB as SQLite DB
participant FS as File System
CLI->>buildEmbeddings: rootDir (cwd), dbPath (abs)
buildEmbeddings->>DB: openDb(dbPath)
buildEmbeddings->>DB: getBuildMeta('root_dir') → metaRoot
note over buildEmbeddings: resolvedRoot = metaRoot || dbParent(.codegraph layout) || rootDir
buildEmbeddings->>DB: DELETE embeddings/meta/fts (pre-validation!)
buildEmbeddings->>DB: SELECT nodes (functions/methods/classes)
loop for each file
buildEmbeddings->>FS: readFileSync(resolvedRoot + file)
alt file readable
FS-->>buildEmbeddings: lines[]
else file missing
FS-->>buildEmbeddings: ENOENT → warn, filesSkipped++
end
end
alt byFile.size > 0 && filesRead === 0
buildEmbeddings->>DB: closeDb
buildEmbeddings-->>CLI: throw DbError (exit non-zero)
else some files read
buildEmbeddings->>buildEmbeddings: embed(texts)
buildEmbeddings->>DB: INSERT embeddings (transaction)
buildEmbeddings->>DB: closeDb
buildEmbeddings-->>CLI: success
end
Reviews (2): Last reviewed commit: "test(embed): wrap temp-dir tests in try/..." | Re-trigger Greptile |
| const dbParent = path.dirname(path.dirname(path.resolve(dbPath))); | ||
| const resolvedRoot = metaRoot || dbParent || rootDir; |
There was a problem hiding this comment.
|| rootDir fallback is dead code
path.dirname(path.dirname(path.resolve(dbPath))) always returns a non-empty string ('.' at minimum, never falsy), so dbParent is always truthy. This makes the || rootDir branch unreachable, silently ignoring any explicit positional <dir> argument the caller passes for legacy DBs that lack root_dir metadata and live at a non-standard path.
Concretely: a user with a DB at /custom/my.db (no root_dir, no conventional .codegraph/ nesting) who runs codegraph embed --db /custom/my.db /actual/source would get dbParent = '/custom' instead of the correct /actual/source.
| const dbParent = path.dirname(path.dirname(path.resolve(dbPath))); | |
| const resolvedRoot = metaRoot || dbParent || rootDir; | |
| const dbParent = path.dirname(path.dirname(path.resolve(dbPath))); | |
| // dbParent is always a non-empty string, so use it only when it looks like | |
| // it corresponds to the conventional <root>/.codegraph layout. | |
| const resolvedRoot = metaRoot || rootDir || dbParent; |
Note: reversing the priority to metaRoot || rootDir || dbParent preserves the "use metadata first" intent, restores the explicit positional-arg path, and keeps the heuristic as a last resort.
There was a problem hiding this comment.
Fixed in 9f20bda. You're right that the || rootDir branch was unreachable — path.dirname always yields a non-empty string so the explicit caller-provided <dir> was silently ignored for legacy DBs.
Rather than reverse the ordering (which would reintroduce #983 for legacy DBs at the conventional <root>/.codegraph/graph.db layout when invoked from a foreign cwd), I gated dbParent on the conventional layout check — only use it when the DB's parent directory is literally named .codegraph. Outside that layout we fall through to the caller-provided rootDir, which matches what your example (/custom/my.db /actual/source) needs.
| test('exits non-zero (throws) when no source files can be read', async () => { | ||
| // Build a DB pointing at files that no longer exist | ||
| const ghostRepo = fs.mkdtempSync(path.join(os.tmpdir(), 'codegraph-embed983-ghost-')); | ||
| const ghostDbDir = path.join(ghostRepo, '.codegraph'); | ||
| fs.mkdirSync(ghostDbDir, { recursive: true }); | ||
| const ghostDb = path.join(ghostDbDir, 'graph.db'); | ||
|
|
||
| const db = new Database(ghostDb); | ||
| db.pragma('journal_mode = WAL'); | ||
| initSchema(db); | ||
| insertNode(db, 'missing', 'function', 'does-not-exist.js', 1, 1); | ||
| db.prepare('INSERT OR REPLACE INTO build_meta (key, value) VALUES (?, ?)').run( | ||
| 'root_dir', | ||
| path.resolve(ghostRepo), | ||
| ); | ||
| db.close(); | ||
|
|
||
| EMBEDDED_TEXTS.length = 0; | ||
| await expect(buildEmbeddings(ghostRepo, 'minilm', ghostDb)).rejects.toThrow( | ||
| /could not read any of the .* source files/, | ||
| ); | ||
|
|
||
| // No embeddings were persisted (they would have been overwritten via DELETE) | ||
| const readDb = new Database(ghostDb, { readonly: true }); | ||
| const count = readDb.prepare('SELECT COUNT(*) as c FROM embeddings').get().c; | ||
| readDb.close(); | ||
| expect(count).toBe(0); | ||
|
|
||
| fs.rmSync(ghostRepo, { recursive: true, force: true }); | ||
| }); |
There was a problem hiding this comment.
ghostRepo (and similarly legacyRepo in the test above) is cleaned up inside the test body. If buildEmbeddings throws unexpectedly before reaching fs.rmSync, the temp directory is leaked. Both should be created in beforeAll and cleaned in afterAll, or at minimum use a try/finally block.
There was a problem hiding this comment.
Fixed in f99f10e. Wrapped both ghostRepo and legacyRepo tests in try/finally blocks so the temp dirs are always cleaned up even if buildEmbeddings throws unexpectedly mid-test.
| let root_canon = std::fs::canonicalize(root_dir) | ||
| .ok() | ||
| .and_then(|p| p.to_str().map(|s| s.to_string())) | ||
| .unwrap_or_else(|| root_dir.to_string()); | ||
| let _ = stmt.execute(["root_dir", &root_canon]); |
There was a problem hiding this comment.
Symlink inconsistency between Rust and JS
root_dir writes
The Rust finalize_build persists root_dir using std::fs::canonicalize (resolves symlinks), but the JS persistBuildMetadata in finalize.ts then overwrites the same key with path.resolve(ctx.rootDir) (which does not resolve symlinks) for any full build using the native engine. On systems where the project root is behind a symlink, these two values can differ — and the JS write wins for full builds, potentially re-introducing the wrong path just after Rust correctly wrote the canonical one.
There was a problem hiding this comment.
Fixed in e48690c. Good catch — for native full builds the JS persistBuildMetadata runs after finalize_build and would overwrite the canonical Rust value. Switched the JS write to fs.realpathSync(path.resolve(ctx.rootDir)) so both engines agree on a symlink-resolved canonical path; kept a safe fallback to path.resolve if realpath throws (e.g. path removed mid-build).
Codegraph Impact Analysis3 functions changed → 5 callers affected across 4 files
|
…yout (#992) The prior 'metaRoot || dbParent || rootDir' chain was unreachable past dbParent because path.dirname always returns a non-empty string. This silently ignored an explicit positional <dir> argument for legacy DBs at non-standard locations. Only use dbParent when the DB actually lives at the conventional <root>/.codegraph/graph.db layout; otherwise fall through to the caller-provided rootDir. Impact: 1 functions changed, 1 affected
…992) For native full builds, Rust's finalize_build writes root_dir via std::fs::canonicalize (symlink-resolving) but the JS persistBuildMetadata then overwrites it with path.resolve(), which does not resolve symlinks. On systems where the project root is behind a symlink the two values diverge and the JS write wins, reintroducing a non-canonical path. Use fs.realpathSync to match the Rust behaviour, with a safe fallback to path.resolve() if realpath throws. Impact: 1 functions changed, 3 affected
The ghostRepo and legacyRepo temp dirs were cleaned up inline in the test body, so an unexpected throw before fs.rmSync would leak them. Wrap each test in try/finally so cleanup always runs.
Summary
codegraph embed --db <abs-path>(without a positional dir) silently stored 0 embeddings because every file read resolved against cwd instead of the DB's root directory.root_dirin thebuild_metatable during build (WASM/JSfinalizeand Rust native orchestrator), and threads it throughbuildEmbeddingswith a fallback to the DB's parent directory for pre-existing DBs.DbErrorso the CLI exits non-zero instead of printing "Stored 0 embeddings" with exit 0.Test plan
tests/search/embedding-strategy.test.tscovers the resolution and error-propagation pathsnpm testcodegraph embed --db <abs-path>from an unrelated cwd — embeddings should be written (not zero)Closes #983