Skip to content

fix(embed): resolve source files from DB root, not cwd#992

Merged
carlos-alm merged 5 commits intomainfrom
fix/embed-resolve-source-from-db-root
Apr 21, 2026
Merged

fix(embed): resolve source files from DB root, not cwd#992
carlos-alm merged 5 commits intomainfrom
fix/embed-resolve-source-from-db-root

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • Fixes bug: codegraph embed --db <path> resolves source files from cwd instead of DB root #983: codegraph embed --db <abs-path> (without a positional dir) silently stored 0 embeddings because every file read resolved against cwd instead of the DB's root directory.
  • Persists root_dir in the build_meta table during build (WASM/JS finalize and Rust native orchestrator), and threads it through buildEmbeddings with a fallback to the DB's parent directory for pre-existing DBs.
  • When every file read fails, throws DbError so the CLI exits non-zero instead of printing "Stored 0 embeddings" with exit 0.

Test plan

  • New tests/search/embedding-strategy.test.ts covers the resolution and error-propagation paths
  • npm test
  • Manual: build a repo, move the DB elsewhere, run codegraph embed --db <abs-path> from an unrelated cwd — embeddings should be written (not zero)

Closes #983

When `codegraph embed --db <abs-path>` is invoked without a positional
directory, the positional `rootDir` defaults to the current working
directory. The embed loop then joined every relative node file path
against cwd, causing every file read to fail and the command to exit
0 with a misleading "Stored 0 embeddings".

Fix:
- Persist the repo root (`root_dir`) in the `build_meta` table during
  build (both WASM/JS finalize and the Rust native orchestrator).
- In `buildEmbeddings`, resolve relative file paths against
  `build_meta.root_dir` first, falling back to the DB's parent
  directory (`<root>/.codegraph/graph.db` → `<root>`) for DBs built
  before this change, then finally the caller-provided rootDir.
- When every file read fails, throw `DbError` with a clear message so
  the CLI exits non-zero instead of printing "Stored 0 embeddings"
  and returning success.

Closes #983

Impact: 3 functions changed, 5 affected
@claude
Copy link
Copy Markdown

claude Bot commented Apr 21, 2026

Claude finished @carlos-alm's task —— View job


Code Review Complete

Root Cause Analysis

Problem Being Solved: Issue #983 where codegraph embed --db <abs-path> from a different working directory silently stored 0 embeddings because file reads resolved against cwd instead of the DB's root directory. This is a real infrastructure issue, not a symptom being masked.

Why This Approach: The fix persists root_dir in build_meta during build (both Rust native and WASM/JS engines), then uses it in the embedding generator with a fallback chain: metadata → DB parent dir → caller-provided rootDir. This ensures file resolution works regardless of the embedding command's working directory.

Risk Assessment: Low risk. Changes are purely additive with graceful fallbacks for older DBs. Error handling is improved (throws DbError when no files can be read instead of silently succeeding).

Backlog Compliance

  • Zero-dep: ✓ — no new runtime dependencies
  • Foundation-aligned: ✓ — improves tool reliability without violating any principles
  • Problem-fit: 4/5 — directly addresses a real usability issue where the tool fails silently, improving reliability for CI/agent workflows
  • Breaking: No — purely additive with backward compatibility
  • Tier: 1

Critical Concerns

None identified. This is a well-implemented bug fix with:

  • ✅ Proper dual-engine implementation (both Rust and TypeScript)
  • ✅ Graceful fallback for pre-existing DBs
  • ✅ Improved error handling with specific guidance
  • ✅ Comprehensive test coverage including edge cases
  • ✅ Clear error propagation instead of silent failures

Code Quality Notes

Rust Implementation (build_pipeline.rs:712):

let _ = stmt.execute(["root_dir", &root_canon]);

Uses canonical path resolution which is appropriate.

TypeScript Implementation (finalize.ts:117):

root_dir: rootDirToWrite,

Consistent with Rust approach using path.resolve().

Generator Logic (generator.ts:81-83):

const metaRoot = getBuildMeta(db, 'root_dir');
const dbParent = path.dirname(path.dirname(path.resolve(dbPath)));
const resolvedRoot = metaRoot || dbParent || rootDir;

Smart fallback chain handles all scenarios properly.

Error Handling (generator.ts:155-163):
Excellent improvement - throws descriptive DbError instead of silent success when all files fail to read, with actionable guidance for users.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Solid bug fix that addresses a real usability issue with proper dual-engine implementation, comprehensive tests, and improved error handling. The fallback strategy is well-designed and maintains backward compatibility.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR fixes issue #983 by persisting root_dir in build_meta at build time (both Rust finalize_build and JS persistBuildMetadata) and reading it back in buildEmbeddings to resolve relative file paths regardless of the invoking cwd. Three prior review concerns — the unreachable || rootDir fallback, temp-dir leaks in tests, and symlink inconsistency between the Rust and JS writes — have all been addressed in this series of fixup commits.

Confidence Score: 5/5

Safe to merge; all previously flagged P0/P1 issues are resolved and the one remaining finding is a pre-existing P2.

All three prior review concerns were properly addressed across the fixup commits. The only new finding is a pre-existing destructive-delete ordering issue (P2) that was present before this PR — the new throw makes it more visible but doesn't worsen the data-loss semantics materially.

src/domain/search/generator.ts (DELETE-before-validate ordering, P2)

Important Files Changed

Filename Overview
src/domain/search/generator.ts Core fix: reads root_dir from build_meta, gates .codegraph-layout fallback, and throws DbError when all files are unreadable. One P2: DELETEs run before file-read validation, leaving no embeddings if the error path is triggered.
src/domain/graph/builder/stages/finalize.ts Persists root_dir in build_meta using fs.realpathSync (with path.resolve fallback) to produce a symlink-resolved canonical path that matches Rust's canonicalize output.
crates/codegraph-core/src/build_pipeline.rs Rust finalize_build now persists root_dir via std::fs::canonicalize, with a graceful fallback to the raw string if canonicalize fails.
tests/search/embedding-strategy.test.ts New describe block covers the #983 fix (root_dir metadata, .codegraph-layout fallback, total-failure throw); both ghost/legacy temp dirs are wrapped in try/finally blocks as requested.

Sequence Diagram

sequenceDiagram
    participant CLI
    participant buildEmbeddings
    participant DB as SQLite DB
    participant FS as File System

    CLI->>buildEmbeddings: rootDir (cwd), dbPath (abs)
    buildEmbeddings->>DB: openDb(dbPath)
    buildEmbeddings->>DB: getBuildMeta('root_dir') → metaRoot
    note over buildEmbeddings: resolvedRoot = metaRoot || dbParent(.codegraph layout) || rootDir
    buildEmbeddings->>DB: DELETE embeddings/meta/fts (pre-validation!)
    buildEmbeddings->>DB: SELECT nodes (functions/methods/classes)
    loop for each file
        buildEmbeddings->>FS: readFileSync(resolvedRoot + file)
        alt file readable
            FS-->>buildEmbeddings: lines[]
        else file missing
            FS-->>buildEmbeddings: ENOENT → warn, filesSkipped++
        end
    end
    alt byFile.size > 0 && filesRead === 0
        buildEmbeddings->>DB: closeDb
        buildEmbeddings-->>CLI: throw DbError (exit non-zero)
    else some files read
        buildEmbeddings->>buildEmbeddings: embed(texts)
        buildEmbeddings->>DB: INSERT embeddings (transaction)
        buildEmbeddings->>DB: closeDb
        buildEmbeddings-->>CLI: success
    end
Loading

Fix All in Claude Code

Reviews (2): Last reviewed commit: "test(embed): wrap temp-dir tests in try/..." | Re-trigger Greptile

Comment thread src/domain/search/generator.ts Outdated
Comment on lines +82 to +83
const dbParent = path.dirname(path.dirname(path.resolve(dbPath)));
const resolvedRoot = metaRoot || dbParent || rootDir;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 || rootDir fallback is dead code

path.dirname(path.dirname(path.resolve(dbPath))) always returns a non-empty string ('.' at minimum, never falsy), so dbParent is always truthy. This makes the || rootDir branch unreachable, silently ignoring any explicit positional <dir> argument the caller passes for legacy DBs that lack root_dir metadata and live at a non-standard path.

Concretely: a user with a DB at /custom/my.db (no root_dir, no conventional .codegraph/ nesting) who runs codegraph embed --db /custom/my.db /actual/source would get dbParent = '/custom' instead of the correct /actual/source.

Suggested change
const dbParent = path.dirname(path.dirname(path.resolve(dbPath)));
const resolvedRoot = metaRoot || dbParent || rootDir;
const dbParent = path.dirname(path.dirname(path.resolve(dbPath)));
// dbParent is always a non-empty string, so use it only when it looks like
// it corresponds to the conventional <root>/.codegraph layout.
const resolvedRoot = metaRoot || rootDir || dbParent;

Note: reversing the priority to metaRoot || rootDir || dbParent preserves the "use metadata first" intent, restores the explicit positional-arg path, and keeps the heuristic as a last resort.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9f20bda. You're right that the || rootDir branch was unreachable — path.dirname always yields a non-empty string so the explicit caller-provided <dir> was silently ignored for legacy DBs.

Rather than reverse the ordering (which would reintroduce #983 for legacy DBs at the conventional <root>/.codegraph/graph.db layout when invoked from a foreign cwd), I gated dbParent on the conventional layout check — only use it when the DB's parent directory is literally named .codegraph. Outside that layout we fall through to the caller-provided rootDir, which matches what your example (/custom/my.db /actual/source) needs.

Comment on lines +422 to +451
test('exits non-zero (throws) when no source files can be read', async () => {
// Build a DB pointing at files that no longer exist
const ghostRepo = fs.mkdtempSync(path.join(os.tmpdir(), 'codegraph-embed983-ghost-'));
const ghostDbDir = path.join(ghostRepo, '.codegraph');
fs.mkdirSync(ghostDbDir, { recursive: true });
const ghostDb = path.join(ghostDbDir, 'graph.db');

const db = new Database(ghostDb);
db.pragma('journal_mode = WAL');
initSchema(db);
insertNode(db, 'missing', 'function', 'does-not-exist.js', 1, 1);
db.prepare('INSERT OR REPLACE INTO build_meta (key, value) VALUES (?, ?)').run(
'root_dir',
path.resolve(ghostRepo),
);
db.close();

EMBEDDED_TEXTS.length = 0;
await expect(buildEmbeddings(ghostRepo, 'minilm', ghostDb)).rejects.toThrow(
/could not read any of the .* source files/,
);

// No embeddings were persisted (they would have been overwritten via DELETE)
const readDb = new Database(ghostDb, { readonly: true });
const count = readDb.prepare('SELECT COUNT(*) as c FROM embeddings').get().c;
readDb.close();
expect(count).toBe(0);

fs.rmSync(ghostRepo, { recursive: true, force: true });
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Temp dir leak on test failure

ghostRepo (and similarly legacyRepo in the test above) is cleaned up inside the test body. If buildEmbeddings throws unexpectedly before reaching fs.rmSync, the temp directory is leaked. Both should be created in beforeAll and cleaned in afterAll, or at minimum use a try/finally block.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f99f10e. Wrapped both ghostRepo and legacyRepo tests in try/finally blocks so the temp dirs are always cleaned up even if buildEmbeddings throws unexpectedly mid-test.

Comment on lines +708 to +712
let root_canon = std::fs::canonicalize(root_dir)
.ok()
.and_then(|p| p.to_str().map(|s| s.to_string()))
.unwrap_or_else(|| root_dir.to_string());
let _ = stmt.execute(["root_dir", &root_canon]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Symlink inconsistency between Rust and JS root_dir writes

The Rust finalize_build persists root_dir using std::fs::canonicalize (resolves symlinks), but the JS persistBuildMetadata in finalize.ts then overwrites the same key with path.resolve(ctx.rootDir) (which does not resolve symlinks) for any full build using the native engine. On systems where the project root is behind a symlink, these two values can differ — and the JS write wins for full builds, potentially re-introducing the wrong path just after Rust correctly wrote the canonical one.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e48690c. Good catch — for native full builds the JS persistBuildMetadata runs after finalize_build and would overwrite the canonical Rust value. Switched the JS write to fs.realpathSync(path.resolve(ctx.rootDir)) so both engines agree on a symlink-resolved canonical path; kept a safe fallback to path.resolve if realpath throws (e.g. path removed mid-build).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

Codegraph Impact Analysis

3 functions changed5 callers affected across 4 files

  • finalize_build in crates/codegraph-core/src/build_pipeline.rs:688 (1 transitive callers)
  • persistBuildMetadata in src/domain/graph/builder/stages/finalize.ts:77 (3 transitive callers)
  • buildEmbeddings in src/domain/search/generator.ts:57 (1 transitive callers)

…yout (#992)

The prior 'metaRoot || dbParent || rootDir' chain was unreachable past
dbParent because path.dirname always returns a non-empty string. This
silently ignored an explicit positional <dir> argument for legacy DBs
at non-standard locations.

Only use dbParent when the DB actually lives at the conventional
<root>/.codegraph/graph.db layout; otherwise fall through to the
caller-provided rootDir.

Impact: 1 functions changed, 1 affected
…992)

For native full builds, Rust's finalize_build writes root_dir via
std::fs::canonicalize (symlink-resolving) but the JS persistBuildMetadata
then overwrites it with path.resolve(), which does not resolve symlinks.
On systems where the project root is behind a symlink the two values
diverge and the JS write wins, reintroducing a non-canonical path.

Use fs.realpathSync to match the Rust behaviour, with a safe fallback
to path.resolve() if realpath throws.

Impact: 1 functions changed, 3 affected
The ghostRepo and legacyRepo temp dirs were cleaned up inline in the
test body, so an unexpected throw before fs.rmSync would leak them.
Wrap each test in try/finally so cleanup always runs.
@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 2014151 into main Apr 21, 2026
21 checks passed
@carlos-alm carlos-alm deleted the fix/embed-resolve-source-from-db-root branch April 21, 2026 07:39
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 21, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: codegraph embed --db <path> resolves source files from cwd instead of DB root

1 participant