Feat(optimizer): canonicalize internal query names #7580

Open
georgesittas wants to merge 4 commits into main from jo/query_fingerprinting

@georgesittas georgesittas commented Apr 29, 2026

New optimizer pass that rewrites a query into a canonical structural form: data-contract names (base-table names/columns, top-level output aliases) are preserved, while internal names (table aliases, CTE/subquery names, internal column aliases) are renamed to sequential _tN / _cN. Useful for query equivalence and deduplication.

Handles set operations (UNION / UNION BY NAME), pivots, UDTFs, recursive CTEs, LATERAL/UNNEST, BigQuery whole-row struct refs, and case-folding semantics across dialects.
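
To make the renaming concrete, here is a toy sketch in Python. It is not the code from this PR: it works on raw identifier tokens rather than on the sqlglot AST, `canonicalize_aliases` and its `internal_names` argument are hypothetical, and `name_sequence` below is a local stand-in, not the helper used in the diff. It only shows that two queries differing in alias names collapse to the same text once aliases are numbered by first appearance.

```python
import itertools
import re


def name_sequence(prefix):
    """Local stand-in: returns a callable producing prefix0, prefix1, ..."""
    counter = itertools.count()
    return lambda: f"{prefix}{next(counter)}"


def canonicalize_aliases(sql, internal_names):
    """Rename each internal name to _tN in order of first appearance.

    Hypothetical helper: the caller must say which names are internal,
    and every other identifier (keywords, base tables, columns) is kept.
    """
    next_table = name_sequence("_t")
    mapping = {}

    def rename(match):
        word = match.group(0)
        if word not in internal_names:
            return word
        if word not in mapping:
            mapping[word] = next_table()
        return mapping[word]

    # Match whole identifiers so "u" does not match inside "users".
    return re.sub(r"[A-Za-z_][A-Za-z_0-9]*", rename, sql)


q1 = "SELECT a.id FROM db.users AS a JOIN db.orders AS b ON a.id = b.user_id"
q2 = "SELECT u.id FROM db.users AS u JOIN db.orders AS o ON u.id = o.user_id"

c1 = canonicalize_aliases(q1, {"a", "b"})
c2 = canonicalize_aliases(q2, {"u", "o"})

print(c1)
print(c1 == c2)
```

Both inputs print the same canonical string, with `db.users` and `db.orders` left untouched. A real pass has to work on scopes in the parsed tree, since a plain token rewrite like this one cannot tell an alias from a column that happens to share its name.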

@georgesittas force-pushed the jo/query_fingerprinting branch from bdfc9ca to 778d33d, April 29, 2026 12:45
github-actions Bot commented Apr 29, 2026

SQLGlot Integration Test Results

Comparing:

  • this branch (sqlglot:jo/query_fingerprinting, sqlglot version: jo/query_fingerprinting)
  • baseline (main, sqlglot version: 0.0.1.dev1)

By Dialect

| dialect | main | sqlglot:jo/query_fingerprinting | transitions | links |
| --- | --- | --- | --- | --- |
| bigquery -> bigquery | 24647/24652 passed (100.0%) | 23497/23497 passed (100.0%) | No change | full result / delta |
| bigquery -> duckdb | 867/1154 passed (75.1%) | 0/0 passed (0.0%) | Results not found | full result / delta |
| duckdb -> duckdb | 5823/5823 passed (100.0%) | 0/0 passed (0.0%) | Results not found | full result / delta |
| snowflake -> duckdb | 1063/1961 passed (54.2%) | 0/0 passed (0.0%) | Results not found | full result / delta |
| snowflake -> snowflake | 65133/65133 passed (100.0%) | 63027/63027 passed (100.0%) | No change | full result / delta |
| databricks -> databricks | 1370/1370 passed (100.0%) | 1370/1370 passed (100.0%) | No change | full result / delta |
| postgres -> postgres | 6042/6042 passed (100.0%) | 6042/6042 passed (100.0%) | No change | full result / delta |
| redshift -> redshift | 7101/7101 passed (100.0%) | 7101/7101 passed (100.0%) | No change | full result / delta |

Overall

main: 113236 total, 112046 passed (pass rate: 98.9%), sqlglot version: 0.0.1.dev1

sqlglot:jo/query_fingerprinting: 101037 total, 101037 passed (pass rate: 100.0%), sqlglot version: jo/query_fingerprinting

Transitions:
No change

Dialect pair changes: 0 previous results not found, 3 current results not found

✅ 38 test(s) passed

@georgesittas force-pushed the jo/query_fingerprinting branch from 778d33d to d9e7533, April 29, 2026 16:43
Comment thread sqlglot/optimizer/canonicalize_internal_names.py Outdated
@georgesittas force-pushed the jo/query_fingerprinting branch from bb2f924 to 667c02e, April 30, 2026 11:39
@georgesittas georgesittas requested a review from tobymao April 30, 2026 11:40
@georgesittas force-pushed the jo/query_fingerprinting branch from 667c02e to fd51669, April 30, 2026 13:21
Comment thread sqlglot/optimizer/canonicalize_internal_names.py
from sqlglot._typing import E


def canonicalize_internal_names(expression: E) -> E:
Collaborator

Is this rule safe with DDLs like CREATE TABLE AS? I think we'd rename the selections, which would break the data contract.

Collaborator Author

No, the intention was to make it work for Query instances. DDLs and other statements aren't taken into account. I can mention this in the docstring.

Comment on lines +49 to +50
next_table = name_sequence("_t")
next_column = name_sequence("_c")
Collaborator

What is the contract of this rule? Should different queries get canonicalized differently? As it stands, both SELECT * FROM foo and SELECT * FROM bar would be canonicalized to the same query selecting from _t0.

Collaborator Author

The rule assumes that source tables are qualified with a schema and possibly a catalog. In your example, foo and bar are therefore treated as CTEs and canonicalized. Within a larger query that contains them, this is the expected behavior: foo and bar would get different canonical names, and hence result in a different canonical query at the end.
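
A rough illustration of that assumption, as a sketch only. `canonical_table_names` is a hypothetical helper, not part of the PR, and it uses "contains a dot" as a crude proxy for "qualified with a schema", where the real pass inspects the parsed table expression.

```python
import itertools


def canonical_table_names(table_refs):
    """Map table references to canonical names in order of first appearance.

    Assumption for this sketch: a dotted reference (schema.table) is a base
    table and keeps its name, while an unqualified one is internal.
    """
    counter = itertools.count()
    mapping = {}
    for ref in table_refs:
        if ref in mapping:
            continue
        mapping[ref] = ref if "." in ref else f"_t{next(counter)}"
    return mapping


# In isolation, foo and bar both become _t0 ...
print(canonical_table_names(["foo"]))
print(canonical_table_names(["bar"]))

# ... but within one larger query they get distinct names.
print(canonical_table_names(["foo", "bar"]))

# Qualified source tables are part of the data contract and are kept.
print(canonical_table_names(["db.foo", "db.bar"]))
```

Under this reading, the collision in the review comment only arises for unqualified names, which the rule treats as internal by design.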
