Skip to content

Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering#21315

Draft
kosiew wants to merge 14 commits intoapache:mainfrom
kosiew:dictionary-coercion-21150
Draft

Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering#21315
kosiew wants to merge 14 commits intoapache:mainfrom
kosiew:dictionary-coercion-21150

Conversation

@kosiew
Copy link
Copy Markdown
Contributor

@kosiew kosiew commented Apr 2, 2026

Which issue does this PR close?


Rationale for this change

The current implementation of MIN and MAX aggregate functions does not properly support dictionary-encoded arrays. This limitation prevents efficient execution on dictionary-backed columns, which are commonly used for memory and performance optimization in Arrow-based systems.

Flattening dictionary arrays into their underlying values was considered, but rejected due to its negative impact on performance and loss of dictionary benefits.

This PR introduces native support for dictionary scalars and arrays in min/max computations, ensuring correctness while preserving performance characteristics.

Additionally, this PR improves determinism of logical plan JSON output (display_pg_json) by enforcing a stable and meaningful key ordering. This resolves inconsistencies observed in snapshot tests.


What changes are included in this PR?

1. Dictionary Support for MIN/MAX

  • Added support for dictionary-encoded scalars in min_max! macro

  • Introduced helper utilities:

    • unpack_dictionary_scalar
    • wrap_dictionary_scalar
    • scalar_extreme
    • update_extreme
  • Refactored batch update logic in MinAccumulator and MaxAccumulator

  • Unified handling of complex types (including Dictionary) via min_max_batch_generic

  • Ensured dictionary keys are preserved when both operands are dictionary scalars

2. Improved Type Coercion

  • Updated get_min_max_result_type to coerce dictionary inputs to their value types
  • Expanded test coverage for coercion behavior

3. Deterministic JSON Plan Output

  • Introduced ordered JSON serialization for display_pg_json

  • Ensured consistent key ordering:

    • "Node Type"
    • other fields (sorted)
    • "Plans"
    • "Output"
  • Replaced serde_json::to_string_pretty with custom ordered writer

4. New and Enhanced Tests

  • Added extensive tests for dictionary min/max behavior:

    • with nulls
    • across multiple batches
    • ignoring unreferenced dictionary values
    • handling null dictionary values
  • Added physical plan validation for dictionary aggregation path

  • Added ordering validation test for JSON output

  • Refactored existing tests for clarity and completeness


Are these changes tested?

Yes.

This PR includes:

  • Unit tests for dictionary min/max behavior across multiple scenarios
  • Type coercion validation tests
  • Logical plan JSON ordering tests
  • Integration-style test validating execution plan structure

These tests ensure correctness, prevent regressions, and document expected behavior.


Are there any user-facing changes?

Yes.

  • MIN and MAX now support dictionary-encoded columns
  • Query results involving dictionary inputs are now correctly computed
  • Logical plan JSON output is now deterministic, which may affect snapshot-based tooling

No breaking API changes are introduced.


LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 7 commits April 2, 2026 16:32
Return ScalarValue::Dictionary(...) in dictionary batches instead of
unwrapping to inner scalars. Enhance min_max! logic to safely handle
dictionary-vs-dictionary and dictionary-vs-non-dictionary comparisons.
Add regression tests for raw-dictionary covering no-coercion,
null-containing, and multi-batch scenarios.
Centralize dictionary batch handling for min/max operations.
Streamline min_max_batch_generic to initialize from the first
non-null element. Implement shared setup/assert helpers in
dictionary tests to reduce repetition while preserving test
coverage.
Refactor dictionary min/max flow by removing the wrap macro arm,
making re-wrapping explicit through a private helper. This
separates the "choose inner winner" from the "wrap as
dictionary" step for easier auditing.

In `datafusion/functions-aggregate/src/min_max.rs`, update
`string_dictionary_batch` to accept slices instead of owned
Vecs, and introduce a small `evaluate_dictionary_accumulator`
helper to streamline min/max assertions with a shared
accumulator execution path, reducing repeated setup.
Update min_max.rs to ensure dictionary batches iterate
actual array rows, comparing referenced scalar values.
Unreferenced dictionary entries no longer affect MIN/MAX,
and referenced null values are correctly skipped.
Expanded tests to cover these changes and updated
expectations
Added regression tests for unreferenced and referenced
null dictionary values.
Update get_min_max_result_type to maintain Dictionary<K, V>
instead of unwrapping to V, allowing planned MIN/MAX execution to
utilize the dictionary-aware accumulator. Add end-to-end SQL
regression test to ensure MIN/MAX properly ignores unreferenced
dictionary values and preserves dictionary-typed output schema.
Adjust unit expectations for dictionary coercion tests to reflect
new planned-path behavior.
Update the MIN/MAX integration test in basic.rs to use two MemTable
partitions, ensuring the physical plan includes both partial and final
aggregate stages. Retain checks for dictionary-typed output schema
and results. In min_max.rs, remove the no-op min_max_dictionary!
macro and inline the existing generic comparison helper for
improved clarity and efficiency.
- Refactored the formatting in `basic.rs` to enhance readability by breaking long lines into shorter segments.
- Updated `min_max.rs` for consistent formatting in the `test_min_max_dictionary_ignores_unreferenced_values` function.
@github-actions github-actions bot added core Core DataFusion crate functions Changes to functions implementation labels Apr 2, 2026
Ensure MIN/MAX(Dictionary(..., T)) returns T at SQL boundary
while retaining the new dictionary comparison logic.
Update regression tests to verify that dictionary inputs
are accepted and that the result is the underlying scalar type.
Adjust planner-level regression test in basic.rs to expect
final output schema to be Utf8 instead of dictionary-typed.
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Apr 2, 2026
@kosiew kosiew force-pushed the dictionary-coercion-21150 branch from 08a20c4 to 5002677 Compare April 2, 2026 13:46
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) and removed logical-expr Logical plan and expressions labels Apr 2, 2026
@kosiew kosiew changed the title Support dictionary-encoded arrays in MIN/MAX aggregates and preserve dictionary types Support Dictionary Arrays in MIN/MAX Aggregates with Direct Scalar Coercion Apr 2, 2026
kosiew added 2 commits April 2, 2026 22:37
Add deterministic JSON serialization that maintains consistent key
ordering regardless of map ordering. Introduce ordered key logic in
display.rs, along with a new pretty-writer. Update root pgjson
emission to use the ordered writer and add a regression test to
verify key sequence for Values plan output.
@kosiew kosiew force-pushed the dictionary-coercion-21150 branch from f2330b9 to b4938c1 Compare April 2, 2026 14:37
@github-actions github-actions bot added logical-expr Logical plan and expressions and removed sqllogictest SQL Logic Tests (.slt) labels Apr 2, 2026
kosiew added 3 commits April 2, 2026 22:54
Share array-extrema scan in min_max.rs.
Collapse dictionary scalar handling into a single
normalized path. Simplify key ordering,
indentation, and stringification helpers in
display.rs. Reduce duplicated dictionary test
setup in min_max.rs and basic.rs.
Refactor the code to eliminate the dictionary_batch_extreme.
Directly pass dictionary arrays through min_max_batch_generic
for improved clarity and efficiency.
Collapse repeated complex-type match arms in min_batch and
max_batch. Introduce a shared update_extreme path for the
min/max accumulators. Simplify the ordered_keys function
and factor the repeated pretty-print container logic
into write_delimited_entries. Use a local macro for JSON
stringification deduplication in display.rs to maintain
behavior without adding dependencies.
@kosiew kosiew changed the title Support Dictionary Arrays in MIN/MAX Aggregates with Direct Scalar Coercion Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering Apr 2, 2026
@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Apr 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dictionary coercion on min/max

2 participants