fix(compiler): handle Rust identifier escaping and name collisions#3658
fix(compiler): handle Rust identifier escaping and name collisions#3658BaldDemian wants to merge 1 commit intoapache:mainfrom
Conversation
|
|
| scope = match.nested_messages | ||
| return lineage | ||
|
|
||
| def _allocate_type_scope_names(self, type_defs: List[object], scope: str) -> None: |
There was a problem hiding this comment.
This allocator only checks IDL declarations against each other, but the generated Rust module also uses helper and prelude names unqualified. For example, message String { string value = 1; } generates pub struct String { pub value: String, }, so the field type resolves to the generated struct instead of std::string::String and the Rust code is recursive/invalid. Vec and Fory have similar failures with Vec<T> and use fory::Fory. Please reserve the generated/helper/prelude names in the relevant scopes or fully qualify those references, and add a generated Rust compile check for String, Vec, and Fory.
There was a problem hiding this comment.
I'm considering changing the generated Rust code to use crate-root absolute paths for generator-owned dependencies, for example emitting ::std::string::String instead of std::string::String, and removing generated use statements.
Prost follows this absolute path style.
Using absolute paths keeps Fory/runtime/std/core references out of the generated code's local namespace, so IDL-defined names such as String, Vec cannot collide with the generator-owned types.
I'd like to hear your opinion @chaokunyang
There was a problem hiding this comment.
good idea, absolute paths is better
| if grpc: | ||
| service_files = generator.generate_services() | ||
| files.extend(service_files) | ||
| if lang == "rust": |
There was a problem hiding this comment.
This guard only rejects overwriting existing non-generated files. It does not catch two inputs in the same invocation that normalize to the same Rust output path. For example, packages foo.bar and foo_bar both produce foo_bar.rs; the second file sees the generated marker, overwrites the first file, and the command exits successfully with only the second schema's types. Please track generated target paths during the compile invocation and fail when distinct source files map to the same Rust output file, with a two-file CLI regression test.
Why?
Identifiers colliding with Rust keywords should be escaped to prevent compiling errors.
What does this PR do?
Workflow
The current implementation broadly follows the approach used by prost-build. It first normalizes identifiers from the IDL according to Rust naming conventions, e.g. converting struct names to UpperCamelCase and field names to snake_case. It then checks whether the normalized identifiers are Rust keywords and escapes them when necessary.
This process is not trivial as I first expected. For example, we need to cache the escaped name of each identifier to make sure the definition site and all use sites refer to the same name consistently.
Known issues
Although I have tested the implementation against many corner cases, this handling is still incomplete, much like prost itself. It may still generate Rust code that fails to compile for valid (but rare, I suppose 🤔) IDL. In most cases, this happens because an identifier from the IDL conflicts with an identifier already used by the Rust code generated by Fory.
Example 1:
IDL:
Generated Rust code:
Compiling errors:
Example 2:
IDL:
Generated Rust code:
Compiling errors:
However, I think IDL like the above are rare in real practice. The resulting compiling errors are clear and straightforward and the user can handle the conflicting names easily in the IDL.
Changes made to the FBS compiler
In the current implementation, I use the tuple
<file_path, line, column>as the stable key for an identifier. This lets the generator connect an identifier's definition site with all of its use sites, so they consistently resolve to the same name after normalization and Rust keyword escaping.However, when translating a FlatBuffers union into the Fory IR, the FBS translator currently assigns every generated union case the same source location: the line and column of the union declaration itself.
fory/compiler/fory_compiler/frontend/fbs/translator.py
Lines 243 to 244 in 389b66c
As a result, different cases in the same union can end up sharing the same identifier key. This causes the later case to overwrite the earlier one in the Rust generator's name cache, and can incorrectly make distinct union variants resolve to the same Rust name.
Since both the FDL parser and the Protobuf translator preserve finer-grained source locations for union/oneof cases:
fory/compiler/fory_compiler/frontend/proto/translator.py
Lines 132 to 133 in 389b66c
fory/compiler/fory_compiler/frontend/fdl/parser.py
Lines 598 to 599 in 389b66c
the FBS translator should do the same by assigning each union case the location of its own token.
Related issues
Close #3544.
AI Contribution Checklist
yes, I included a completed AI Contribution Checklist in this PR description and the requiredAI Usage Disclosure.yes, my PR description includes the requiredai_reviewsummary and screenshot evidence of the final clean AI review results from both fresh reviewers on the current PR diff or current HEAD after the latest code changes.Does this PR introduce any user-facing change?
Benchmark