fix: handle adjacent @@ variable tokens in split_words() #15

Merged
mahaloz merged 2 commits into binsync:main from hwu71:main
Feb 23, 2026

hwu71 commented Feb 22, 2026

Fixes #14

Problem

When decompiled code contains variables that are adjacent with no whitespace
between them (e.g., func(a,b,c), which is common in Ghidra output), VarBERT
silently returns zero predictions for the entire function.

Root Cause

_process_code_with_text() replaces variable names with @@varname@@id@@
placeholders. When variables are adjacent without whitespace, multiple
placeholders end up in a single whitespace-delimited word:

FUN(@@local_18@@varid_abc@@,@@param_3@@varid_def@@,@@pcVar2@@varid_ghi@@);

split_words() uses re.search(), which returns only the first @@ match
in each word, so any further placeholders in the same word are silently
lost. As a result, generate_popular_names() sees a holder/mask count
mismatch and discards all predictions for the function.
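
The difference between the two calls can be reproduced on the example word from this description. This is only a minimal sketch: the PLACEHOLDER regex is an assumption inferred from the @@varname@@id@@ format and may not match the pattern VarBERT actually uses.

```python
import re

# Assumed pattern for the @@varname@@id@@ placeholder format; the regex
# used inside VarBERT may differ.
PLACEHOLDER = re.compile(r"@@(\w+)@@(\w+)@@")

# The example word from the description: no whitespace, three placeholders.
word = "FUN(@@local_18@@varid_abc@@,@@param_3@@varid_def@@,@@pcVar2@@varid_ghi@@);"

# re.search() stops at the first match in the word.
first = PLACEHOLDER.search(word)
names_search = [first.group(1)] if first else []

# re.finditer() yields every non-overlapping match, left to right.
names_finditer = [m.group(1) for m in PLACEHOLDER.finditer(word)]

print(names_search)    # ['local_18']
print(names_finditer)  # ['local_18', 'param_3', 'pcVar2']
```

With re.search(), two of the three variables never reach the later stages, which is consistent with the count mismatch described in this section.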

Fix

Replace re.search() with re.finditer() in split_words() to extract
all @@ patterns from each word.
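
The idea behind the fix might look roughly like the following. This is an illustrative sketch, not the merged code: split_words_sketch and the PLACEHOLDER regex are hypothetical names and assumptions, and the real split_words() may return a different structure.

```python
import re

# Assumed pattern for one @@varname@@id@@ placeholder (hypothetical).
PLACEHOLDER = re.compile(r"@@\w+@@\w+@@")


def split_words_sketch(code):
    """Split code on whitespace, then split each word around every placeholder."""
    tokens = []
    for word in code.split():
        pos = 0
        # finditer() visits all placeholders in the word, not just the first.
        for match in PLACEHOLDER.finditer(word):
            if match.start() > pos:
                # Keep the text between placeholders (e.g. "FUN(" or ",").
                tokens.append(word[pos:match.start()])
            tokens.append(match.group(0))
            pos = match.end()
        if pos < len(word):
            # Keep whatever follows the last placeholder (e.g. ");").
            tokens.append(word[pos:])
    return tokens


tokens = split_words_sketch("FUN(@@a@@id1@@,@@b@@id2@@); return @@a@@id1@@;")
print(tokens)
# ['FUN(', '@@a@@id1@@', ',', '@@b@@id2@@', ');', 'return', '@@a@@id1@@', ';']
```

Each placeholder becomes its own token, so a later stage that counts placeholders would see one entry per variable occurrence.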

Testing

Tested with a minimal Ghidra-decompiled function containing adjacent variables:

  • Before fix: 0 predictions (bug triggered)
  • After fix: 7 predictions (all variables renamed correctly)

Existing tests continue to pass.

When variables appear adjacent without spaces in decompiled code
(e.g., func(a,b,c)), the @@ placeholder tokens merge into one word.
re.search() only matched the first pattern, silently losing the rest
and causing a holder/mask count mismatch that discards all predictions.

Replace re.search() with re.finditer() to extract all @@ patterns.

mahaloz merged commit 0434324 into binsync:main Feb 23, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

[BUG] split_words() silently drops adjacent @@ variable tokens → zero predictions