Skip to content

Fix ORDER BY with named aggregates that expand to subqueries#34

Open
nicosuave wants to merge 11 commits intomainfrom
nicosuave/fix-orderby-subquery
Open

Fix ORDER BY with named aggregates that expand to subqueries#34
nicosuave wants to merge 11 commits intomainfrom
nicosuave/fix-orderby-subquery

Conversation

@nicosuave
Copy link
Copy Markdown
Member

Summary

Fixes #28. When AGGREGATE(...) AT (...) expressions expand to correlated subqueries and are referenced by alias in ORDER BY, DuckDB errors with "Alias has subquery - not yet supported." This wraps the inner query in a subquery (SELECT * FROM (...) _q ORDER BY ...) when ORDER BY references a subquery-aliased SELECT item, so aliases resolve as plain column references.

The wrapping is targeted: it only activates when ORDER BY actually references a subquery alias (via word-boundary matching), avoiding breakage for queries that use qualified column names like o.prodName in ORDER BY.

Adds 3 integration tests and 3 unit tests covering expression-based ORDER BY (revenue/year_total), simple alias ORDER BY, and DESC ordering.

When AGGREGATE(...) AT (...) expressions expand to correlated subqueries
and are referenced by alias in ORDER BY, DuckDB errors with "Alias has
subquery - not yet supported." Fix by wrapping the inner query in a
subquery when ORDER BY references a subquery-aliased SELECT item.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1b14ae6ec

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +5174 to +5175
let select_start = upper.find("SELECT").map(|p| p + 6).unwrap_or(0);
let select_text = &sql[select_start..from_pos];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Find outer SELECT before extracting subquery aliases

extract_subquery_aliases_from_select uses upper.find("SELECT"), which picks the first SELECT anywhere in the SQL text rather than the outer query's SELECT. For statements with CTEs (for example WITH cte AS (SELECT ...) SELECT ..., (SELECT ...) AS year_total ... ORDER BY year_total), this slices the wrong SELECT list, so year_total is missed (or aliases from the CTE are picked instead). In that case needs_wrap stays false and the original DuckDB Alias has subquery error still occurs for ORDER BY on the subquery alias.

Useful? React with 👍 / 👎.

Fixes a bug where extract_subquery_aliases_from_select would match a
SELECT inside a CTE rather than the outer query's SELECT, causing the
ORDER BY wrapping to be skipped for CTE-based queries.
@nicosuave
Copy link
Copy Markdown
Member Author

@codex review

1 similar comment
@nicosuave
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e25309ea64

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +5262 to +5266
while let Some(pos) = text_upper[search_from..].find(&ident_upper) {
let abs = search_from + pos;
let before_ok = abs == 0 || {
let c = text.as_bytes()[abs - 1];
!c.is_ascii_alphanumeric() && c != b'_' && c != b'.'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ignore literals/comments when detecting ORDER BY alias refs

identifier_appears_in does a raw substring scan and boundary check, so alias names inside string literals/comments are treated as real identifier references. In that case needs_wrap becomes true and the query is wrapped, which can invalidate otherwise-valid ORDER BY expressions that rely on original table qualifiers (for example o.col) because those qualifiers no longer exist in the outer SELECT * FROM (...) _q. This creates a regression for queries where the alias text appears only as literal/comment text.

Useful? React with 👍 / 👎.

Comment on lines +5185 to +5189
b'\'' => {
// Skip single-quoted strings
let mut j = i + 1;
while j < select_text.len() {
if bytes[j] == b'\'' {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 Badge Advance past quoted strings when scanning SELECT items

In the single-quote branch, the scanner computes j to skip a string literal but never updates the loop index, so commas inside '...' are still processed as top-level separators. That can split one SELECT item into fragments and hide a subquery alias (e.g. (SELECT 1) || 'a,b' AS x), preventing the wrap logic from triggering and leaving the original DuckDB alias/subquery error unfixed.

Useful? React with 👍 / 👎.

Fixes two issues from PR review:
1. extract_subquery_aliases_from_select did not advance past quoted
   strings, so commas inside literals could split SELECT items
2. identifier_appears_in matched alias names inside string literals
   and comments, potentially triggering unnecessary query wrapping
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 50e133e095

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

b'(' => depth += 1,
b')' => depth -= 1,
_ if depth == 0 => {
if i + 4 <= item.len() && upper[i..].starts_with(" AS ") {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid UTF-8 boundary panics in alias AS scan

subquery_alias_from_item walks item by byte index and slices upper[i..] at each top-level byte. If a multibyte character appears outside parentheses (for example AGGREGATE(...) AT (...) || 'é' AS year_total), i can point into the middle of a UTF-8 code point and panic with byte index ... is not a char boundary. This path is executed for any query with top-level ORDER BY, so queries that previously executed can now crash before SQL reaches DuckDB.

Useful? React with 👍 / 👎.

Comment on lines +5269 to +5271
let c = stripped.as_bytes()[abs - 1];
!c.is_ascii_alphanumeric() && c != b'_' && c != b'.'
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude quoted qualified names from alias detection

The before_ok boundary check only rejects matches when the immediate previous byte is .. For ORDER BY o."year_total", the previous byte at year_total is ", so this function incorrectly treats it as an unqualified alias reference. That sets needs_wrap = true, and wrapping to SELECT * FROM (...) _q removes table alias o, causing valid qualified ORDER BY clauses to fail with missing-table binder errors.

Useful? React with 👍 / 👎.

- Use char_indices() in subquery_alias_from_item to avoid slicing
  at invalid UTF-8 boundaries when multibyte characters appear
- Reject matches preceded by " or ` in identifier_appears_in to
  avoid treating o."year_total" as an unqualified alias reference
Keep both ORDER BY subquery wrap tests (#28) and expression dimension
tests (#29).
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c4bb0a6d89

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

'(' => depth += 1,
')' => depth -= 1,
_ if depth == 0 && ch == ' ' => {
if upper[i..].starts_with(" AS ")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid indexing upper with byte offsets from item

Fresh evidence: although this now uses char_indices(), i comes from item while the slice is taken from upper = item.to_uppercase(), whose byte length can differ for Unicode case mappings. For inputs like (SELECT 1) || 'ıı…' AS year_total, upper becomes shorter and upper[i..] can panic with an out-of-bounds/char-boundary error before SQL reaches DuckDB, so a valid query can crash the process.

Useful? React with 👍 / 👎.

let abs = search_from + pos;
let before_ok = abs == 0 || {
let c = stripped.as_bytes()[abs - 1];
!c.is_ascii_alphanumeric() && c != b'_' && c != b'.' && c != b'"' && c != b'`'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match unqualified quoted aliases in ORDER BY detection

The boundary check rejects matches when the preceding byte is " or `, which avoids o."alias" false positives but also drops legitimate unqualified references like ORDER BY "year_total" (or backtick-quoted equivalents). In that case needs_wrap stays false, so queries that alias subquery expressions with quoted identifiers still hit DuckDB's Alias has subquery failure instead of being wrapped.

Useful? React with 👍 / 👎.

- subquery_alias_from_item: compare " AS " directly on item bytes
  instead of slicing the uppercased string, which can have different
  byte offsets for certain Unicode case mappings
- identifier_appears_in: only reject " and ` before a match when
  preceded by . (qualified reference like o."alias"), allowing bare
  quoted identifiers like "year_total" to match correctly
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e630dbb883

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if before_ok && after_ok {
return true;
}
search_from = abs + 1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Advance identifier scan by char boundary

identifier_appears_in advances with search_from = abs + 1, but abs is a byte offset in UTF-8. With non-ASCII aliases (for example AS "é"), a non-boundary match attempt (like scanning ORDER BY té) can set search_from into the middle of a multibyte character, and the next stripped_upper[search_from..] slice panics. This makes otherwise valid queries crash during SQL expansion instead of returning a planner error/result.

Useful? React with 👍 / 👎.

')' => depth -= 1,
' ' if depth == 0 => {
if item[i..].len() >= 4
&& item[i..].as_bytes()[1..4].eq_ignore_ascii_case(b"AS ")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Parse AS aliases with general whitespace

subquery_alias_from_item only recognizes aliases when the token pattern is exactly " AS ", so valid SQL formatting like (SELECT ...) AS\nyear_total or (SELECT ...) AS\tyear_total is missed. In those cases the subquery alias is not collected, needs_wrap stays false, and ORDER BY on that alias still hits DuckDB's Alias has subquery failure that this patch is meant to avoid.

Useful? React with 👍 / 👎.

- identifier_appears_in: advance search_from to the next valid char
  boundary after a non-matching hit, preventing panics on non-ASCII text
- subquery_alias_from_item: match AS with any surrounding whitespace
  (tabs, newlines, multiple spaces), not just single spaces
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7d6f736c00

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +5527 to +5528
if !item.to_uppercase().contains("(SELECT ") {
return None;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Relax subquery detection beyond literal "(SELECT "

subquery_alias_from_item only proceeds when the SELECT item text contains the exact substring "(SELECT ", so valid scalar subqueries formatted as ( SELECT ...), (SELECT\n...), or (WITH ... SELECT ...) are skipped. In those cases the alias is never collected, needs_wrap stays false, and ORDER BY on that alias still hits DuckDB’s "Alias has subquery" error this patch is meant to avoid.

Useful? React with 👍 / 👎.

Comment on lines +5588 to +5592
while let Some(pos) = stripped_upper[search_from..].find(&ident_upper) {
let abs = search_from + pos;
let before_ok = abs == 0 || {
let c = stripped.as_bytes()[abs - 1];
if c == b'"' || c == b'`' {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid matching aliases in ORDER BY keyword contexts

The alias check is a raw whole-word search, so aliases that are also SQL tokens (for example year) are treated as references when they appear in syntax like EXTRACT(YEAR FROM o.ts). That false positive sets needs_wrap = true and rewrites to SELECT * FROM (...) _q ORDER BY ...; after wrapping, qualified references such as o.ts are no longer valid and previously-working queries can fail with binder errors.

Useful? React with 👍 / 👎.

- item_has_subquery: detect subqueries with whitespace after opening
  paren and CTE-style (WITH ... SELECT ...) patterns
- identifier_appears_in: track parenthesis depth and only match at
  top level, preventing false positives like EXTRACT(YEAR FROM ...)
  triggering unnecessary query wrapping
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ec199d84b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +5627 to +5629
// Only match at top level (outside parentheses)
let at_top_level = depths[abs] == 0;
let before_ok = abs == 0 || {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Match aliases inside parenthesized ORDER BY expressions

identifier_appears_in only considers matches when depths[abs] == 0, so alias references nested in parentheses are ignored. That means common ORDER BY forms like COALESCE(year_total, 0) or (year_total) will not trigger wrapping even when year_total is a scalar-subquery alias, and DuckDB will still raise the same Alias has subquery error this patch is intended to avoid.

Useful? React with 👍 / 👎.

Comment on lines +5570 to +5573
if i + 2 <= bytes.len()
&& bytes[i..i + 2].eq_ignore_ascii_case(b"AS")
&& (i + 2 >= bytes.len() || bytes[i + 2].is_ascii_whitespace())
{
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Support subquery aliases declared without AS

subquery_alias_from_item only records aliases when it finds the AS token pattern, but SQL allows select-item aliases without AS (e.g. (SELECT ...) year_total). In that valid form, the subquery alias is never collected, needs_wrap stays false, and ORDER BY year_total still hits DuckDB's subquery-alias planner error.

Useful? React with 👍 / 👎.

- identifier_appears_in: remove depth=0 restriction so aliases inside
  COALESCE(year_total, 0) or (year_total) are correctly detected
- subquery_alias_from_item: detect implicit aliases without AS keyword,
  e.g. (SELECT ...) year_total
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f83ac55ebb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +5536 to +5540
if j + 6 <= bytes.len() && bytes[j..j + 6].eq_ignore_ascii_case(b"SELECT") {
return true;
}
if j + 4 <= bytes.len() && bytes[j..j + 4].eq_ignore_ascii_case(b"WITH") {
return true;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require token boundaries for SELECT/WITH subquery checks

item_has_subquery treats any parenthesized text that starts with SELECT or WITH as a subquery, but it does not verify a keyword boundary. Expressions like (selected_value) AS x or (with_tax) AS x are therefore misclassified as subqueries, so expand_aggregate_with_at can wrap the whole query when ORDER BY x is present. That unnecessary wrap changes name resolution context and can break otherwise-valid qualified sort keys (for example ORDER BY x, o.col) because o is no longer visible outside SELECT * FROM (...) _q.

Useful? React with 👍 / 👎.

item_has_subquery now verifies the character after SELECT/WITH is not
alphanumeric or underscore, preventing identifiers like selected_value
or with_tax from being misclassified as subquery keywords.
@nicosuave
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d430c8a954

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +5531 to +5538
if bytes[i] == b'(' {
let mut j = i + 1;
while j < bytes.len() && bytes[j].is_ascii_whitespace() {
j += 1;
}
if j + 6 <= bytes.len()
&& bytes[j..j + 6].eq_ignore_ascii_case(b"SELECT")
&& (j + 6 >= bytes.len() || !bytes[j + 6].is_ascii_alphanumeric() && bytes[j + 6] != b'_')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip literals/comments when detecting subquery aliases

item_has_subquery scans raw bytes and treats any ("SELECT"|"WITH") sequence as a real subquery, even when it appears inside a string literal or comment. That makes extract_subquery_aliases_from_select misclassify non-subquery SELECT items as subquery aliases, which can trigger the new outer wrapping path unnecessarily. In aggregate queries this can turn previously valid ORDER BY clauses (especially ones that also reference qualified columns like t.col) into binder errors after wrapping because the original table alias is no longer in scope.

Useful? React with 👍 / 👎.

item_has_subquery now skips single-quoted strings, line comments, and
block comments so that text like '(SELECT ...)' inside a literal is
not misclassified as a real subquery.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7740e275cb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +5602 to +5604
b'(' => depth += 1,
b')' => {
depth -= 1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip string/comment parens in subquery alias scan

subquery_alias_from_item updates depth for every ( / ) byte but does not skip quoted strings or comments, so parentheses inside literals/comments after a scalar subquery can drive depth negative before the alias is parsed. For inputs like (SELECT ...) || ')' AS year_total, last_alias_start stays unset, the alias is missed, and the ORDER BY wrapper is not applied, so the original DuckDB "Alias has subquery" failure still occurs.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

order by named aggregates results in an error

1 participant