Skip to content

fix(bench): update elixir/julia/objc expected-edges to module-qualified names#1496

Open
carlos-alm wants to merge 4 commits into
fix/field-typemap-class-scope-1458from
fix/v3120-regressions-1446-1447
Open

fix(bench): update elixir/julia/objc expected-edges to module-qualified names#1496
carlos-alm wants to merge 4 commits into
fix/field-typemap-class-scope-1458from
fix/v3120-regressions-1446-1447

Conversation

@carlos-alm

Copy link
Copy Markdown
Contributor

Summary

Changes:

  • Updated all three expected-edges.json to use actual extractor output names
  • Ratcheted THRESHOLDS: elixir 0%→100% precision / 80% recall; julia 0%→100% precision / 70% recall; objc 0%→100% precision / 40% recall
  • Resolution benchmark: 176/176 tests pass, full suite 3044/3074 pass

Closes #1447

…ed names

The resolution benchmark uses WASM-built graphs where the Elixir, Julia,
and Objective-C extractors emit module-qualified symbol names (Main.run,
App.main, UserService.create_user, etc.). The expected-edges manifests
were written with bare unqualified names (run, main, create_user), so
every correctly-resolved edge appeared as a false positive and every
expected edge appeared as a false negative — causing all three languages
to show 0% precision even though resolution was working correctly.

Root cause: starting in v3.12.0, cross-module call resolution began working
for these languages (via the improved receiver-dispatch and same-class
fallback in resolveByMethodOrGlobal / build-edges.ts). With 0 edges
previously resolved, the name mismatch was invisible; once edges started
resolving, the manifests showed 17 FP (elixir), 11 FP (julia), 6 FP
(objc) — all correctly resolved edges misidentified as false positives.

Fix:
- Update all three expected-edges.json manifests to use the
  module-qualified names matching actual extractor output:
  elixir: Main.run, UserService.create_user, Validators.validate_user, etc.
  julia:  App.main, Service.create_user, Repository.new_repo, etc.
  objc:   full ObjC selectors (createUserWithId:name:email:, isValidEmail:, etc.)
          plus add main -> run (plain C call correctly resolved)
- Ratchet THRESHOLDS for all three:
  elixir: precision 0.0 -> 1.0, recall 0.0 -> 0.8  (17/21 resolved)
  julia:  precision 0.0 -> 1.0, recall 0.0 -> 0.7  (11/15 resolved)
  objc:   precision 0.0 -> 1.0, recall 0.0 -> 0.4   (6/13 resolved)

Remaining FNs are genuine unresolved edges (same-file bare calls in
elixir/julia, receiver-typed message sends in objc) — not regressions.

Closes #1447
@claude

claude Bot commented Jun 13, 2026

Copy link
Copy Markdown

Claude finished @carlos-alm's task —— View job


PR Review — fix(bench): update elixir/julia/objc expected-edges to module-qualified names

  • Read changed files
  • Phase 1: Root cause analysis
  • Phase 2: Configuration change detection
  • Phase 3: Test integrity check
  • Phase 4: Code quality
  • Phase 5: Backlog compliance
  • Write final assessment

Root Cause Analysis

Problem Being Solved: The expected-edges manifests for elixir, julia, and objc contained bare unqualified function names (run, main, create_user), while the tree-sitter extractors have always emitted module-qualified names (Main.run, App.main, UserService.create_user, full ObjC selectors). Prior to v3.12.0, zero edges resolved for these languages, so the name mismatch was invisible (0 FP, 0 TP — trivially passing at 0.0/0.0). Once v3.12.0's improved cross-module resolution started successfully resolving edges, every correct edge appeared as a false positive and every expected edge as a false negative. This drove apparent precision to 0% while the extractor was working correctly.

Why This Approach: The fix is surgical and correct: update the manifests to match what the extractors actually emit, then ratchet the thresholds to reflect real performance. This is the right direction — the manifests are authoritative ground truth, so they must match actual extractor behavior. The alternative (changing the extractor to emit bare names) would make Elixir/Julia/ObjC inconsistent with every other module-aware language the tool supports.

Risk Assessment: Low. This is a test-data correction, not a production code change. The extractors are unchanged. The thresholds are ratcheted up (from 0.0 to 1.0 precision), which is strictly more restrictive. The PR cannot introduce runtime regressions.


Phase 2: Configuration Change Detection

No Biome rules, TypeScript flags, or build configs touched. The only resolution-benchmark.test.ts change is threshold values for three languages — all moving from { precision: 0.0, recall: 0.0 } to { precision: 1.0, recall: 0.8/0.7/0.4 }. These are ratchets upward, not relaxations. ✓


Phase 3: Test Integrity

No assertions weakened. The change makes tests more strict:

  • elixir: 0% precision floor → 100% precision floor (any future FP will fail CI immediately)
  • julia: same — 0% → 100%
  • objc: same — 0% → 100%

One subtlety worth verifying: the PR description says "81% recall" for elixir but sets the threshold to 0.8 (80%). That's a small but intentional buffer — defensible and consistent with how other languages are handled. No concern.

The notes field added to all manifest entries is genuinely useful: it documents why module-qualified names are used rather than leaving future maintainers confused. ✓


Phase 4: Code Quality

objc/expected-edges.json — one edge added (main → run, plain C call): this new edge is correctly grounded. main.m shows:

void run(void) { ... }
int main(int argc, const char *argv[]) {
    @autoreleasepool { run(); }
}

Both main and run are plain C functions (not ObjC methods), so they correctly use bare names — consistent with how the extractor would see them. The new main → run edge was missing from the original manifest; adding it is correct. ✓

ObjC selector names verified against Service.m and Validators.m:

  • UserService.initWithRepository: ✓ (trailing colon matches ObjC convention)
  • UserService.createUserWithId:name:email:
  • UserService.getUserWithId:
  • UserService.removeUserWithId:
  • Validators.isValidEmail:
  • Validators.isValidName:
  • UserRepository.saveWithId:name:, findById:, deleteWithId: ✓ (match [_repo ...] calls in Service.m)
  • Validators.validateUser:email: + its two same-class callees ✓

Elixir manifest verified against main.ex: defmodule Main do / def run do → extractor emits Main.run. All cross-module calls in the file match the updated manifest. ✓

Julia manifest follows the same module-qualification pattern consistently. ✓

Minor nit (resolution-benchmark.test.ts:174): The comment says "81% recall" but the threshold is set to 0.8:

// elixir: cross-module qualified calls resolve at 100% precision, 81% recall.
elixir: { precision: 1.0, recall: 0.8 },

The comment slightly overstates actual precision ("81%") while the threshold is set to 80%. This is intentional headroom, but the discrepancy in the comment could confuse future readers into thinking the manifest was designed to hit exactly 81% when CI only requires 80%. Not blocking, but worth a follow-up fix to say "≥80% recall" or "11/14 recall" to match the actual fraction. Same applies to julia ("73%" comment, 0.7 threshold) and objc ("46%" comment, 0.4 threshold).


Phase 5: Backlog Compliance

Axis Score Reason
Zero-dep No new dependencies
Foundation-aligned Pure test-data correction; no production code, no cloud calls, no schema changes
Problem-fit 3/5 Fixes CI accuracy metrics. Doesn't directly improve agent navigation or blast-radius detection, but ensures the benchmark correctly reflects real extraction quality — which matters for honest capability tracking
Breaking No Test-only change; no API, DB schema, CLI output, or MCP contract touched
Tier 1 Zero-dep + foundation-aligned

Critical Concerns

None. The changes are well-scoped, the root cause is correctly identified, and the fix is the right approach. The thresholds are ratcheted upward (more strict), not downward.

The only genuine question would be: Was the extractor actually correct before v3.12.0? The PR says v3.12.0 improved cross-module resolution, which revealed the manifest mismatch. If zero edges resolved before, the manifests at 0.0/0.0 were vacuously correct — this PR converts them to genuinely informative thresholds. That's a net improvement to test quality.


Final Recommendation

  • Rating: ⭐⭐⭐⭐ (4/5)
  • Action: APPROVE
  • Reasoning: Correct diagnosis, minimal scope, moves tests strictly more restrictive, well-documented. The only reason it's not 5 stars is the minor comment imprecision (81%/73%/46% stated vs 80%/70%/40% enforced) — not worth blocking for, but worth a follow-up tightening.

@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes stale benchmark manifests for Elixir, Julia, and Objective-C that used bare unqualified function names (run, main, create_user) while the extractors emit module-qualified names (Main.run, App.main, UserService.create_user). The mismatch caused every resolved edge to appear as a false positive and every expected edge as a false negative, collapsing precision to 0% for all three languages and artificially dragging global precision from 89.9% to 84.4%.

  • Elixir / Julia manifests: all 21 / 15 edge names rewritten to module-qualified form; thresholds raised from 0% to 100% precision / 80% and 70% recall respectively, matching measured extractor output with a conservative floor.
  • ObjC manifest: selectors updated to include full trailing colons and multi-part labels (createUserWithId:name:email:); one previously-missing plain-C edge (main → run) added; thresholds raised to 100% precision / 40% recall, consistent with the 6/13 TPs produced by static and same-file resolution modes.

Confidence Score: 5/5

Pure benchmark manifest and threshold update — no production code paths are touched, and all 176 resolution tests pass.

The changes are limited to fixture JSON files and threshold constants. The name-qualification math is internally consistent across all three languages (Elixir 17/21 = 80.95%, Julia 11/15 = 73.3%, ObjC 6/13 = 46.2%), and every threshold is set as a conservative floor below the measured value. The new ObjC plain-C edge (main → run) is a basic static resolution that the extractor handles reliably. No harness logic was modified.

No files require special attention.

Important Files Changed

Filename Overview
tests/benchmarks/resolution/fixtures/elixir/expected-edges.json All 21 edge names updated to module-qualified form (e.g. validate_user → Validators.validate_user, run → Main.run); notes updated to explain extractor naming convention. Math checks out: 17 cross-module TPs / 21 total = 80.95% ≈ 81% recall, threshold conservatively set to 0.8.
tests/benchmarks/resolution/fixtures/julia/expected-edges.json All 15 edge names updated to module-qualified form (e.g. main → App.main, create_user → Service.create_user); notes updated. Math: 11 cross-module TPs / 15 total = 73.3%, threshold set conservatively to 0.7.
tests/benchmarks/resolution/fixtures/objc/expected-edges.json Selector names updated with full trailing colons (e.g. isValidEmail → isValidEmail:, createUserWithId → createUserWithId:name:email:); one new plain-C edge (main → run, static mode) added; multi-part saveWithId: selector corrected to saveWithId:name:. Math: 4 static + 2 same-file = 6 TPs / 13 total = 46.2%, threshold set conservatively to 0.4.
tests/benchmarks/resolution/resolution-benchmark.test.ts Thresholds for elixir (0→1.0/0.8), julia (0→1.0/0.7), and objc (0→1.0/0.4) raised to reflect actual extractor output; detailed comments explain resolved vs unresolved modes and precision ratchet. No logic changes to the test harness itself.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Extractor emits\nmodule-qualified names\ne.g. Main.run] -->|before fix| B{Manifest had\nbare names\ne.g. run}
    B --> C[Every resolved edge\n→ False Positive\nprecision = 0%]
    B --> D[Every expected edge\n→ False Negative\nrecall = 0%]
    A -->|after fix| E{Manifest now uses\nmodule-qualified names\ne.g. Main.run}
    E --> F[Cross-module calls\n→ True Positives]
    E --> G[Same-file bare calls\n→ False Negatives\nnot yet resolved]
    F --> H[Elixir: 17/21 TP\n81% recall, 100% precision]
    F --> I[Julia: 11/15 TP\n73% recall, 100% precision]
    F --> J[ObjC: 6/13 TP\n46% recall, 100% precision]
Loading

Reviews (5): Last reviewed commit: "Merge remote-tracking branch 'origin/fix..." | Re-trigger Greptile

Comment on lines +189 to +192
// objc: class-method static calls and same-class calls resolve at 100% precision, 46% recall.
// Expected-edges now use full ObjC selectors (createUserWithId:name:email:, isValidEmail:, etc.)
// matching what the ObjC extractor emits. Receiver-typed instance message sends
// are not yet resolved — tracked as FNs. Precision 1.0 acts as ratchet against future FPs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The ObjC threshold comment says only "Receiver-typed instance message sends are not yet resolved", but the constructor mode edge (run → UserService.initWithRepository:) is also unresolved. With 13 total edges, 6 are resolved (4 static + 2 same-file), giving 6/13 = 46.2% recall. If the constructor were a true positive, recall would be 7/13 = 53.8% — inconsistent with the "46% recall" stated in the comment. The missing mention of the constructor FN could mislead future contributors auditing which modes are covered.

Suggested change
// objc: class-method static calls and same-class calls resolve at 100% precision, 46% recall.
// Expected-edges now use full ObjC selectors (createUserWithId:name:email:, isValidEmail:, etc.)
// matching what the ObjC extractor emits. Receiver-typed instance message sends
// are not yet resolved — tracked as FNs. Precision 1.0 acts as ratchet against future FPs.
// objc: class-method static calls and same-class calls resolve at 100% precision, 46% recall.
// Expected-edges now use full ObjC selectors (createUserWithId:name:email:, isValidEmail:, etc.)
// matching what the ObjC extractor emits. Constructor message sends (initWithRepository:) and
// receiver-typed instance message sends are not yet resolved — tracked as FNs.
// Precision 1.0 acts as ratchet against future FPs.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the comment now lists both constructor message sends (initWithRepository:) and receiver-typed instance message sends as unresolved FNs, so future contributors see the complete picture of what is and isn't covered.

…bjc threshold comment

The previous comment only listed receiver-typed instance message sends as
unresolved false negatives, omitting the constructor edge
(run → UserService.initWithRepository:). Both modes are FNs; the comment now
lists them together so future contributors see the complete picture.
@carlos-alm

Copy link
Copy Markdown
Contributor Author

@greptileai

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant