Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering#21315
Draft
kosiew wants to merge 14 commits intoapache:mainfrom
Draft
Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering#21315kosiew wants to merge 14 commits intoapache:mainfrom
kosiew wants to merge 14 commits intoapache:mainfrom
Conversation
Return ScalarValue::Dictionary(...) in dictionary batches instead of unwrapping to inner scalars. Enhance min_max! logic to safely handle dictionary-vs-dictionary and dictionary-vs-non-dictionary comparisons. Add regression tests for raw-dictionary covering no-coercion, null-containing, and multi-batch scenarios.
Centralize dictionary batch handling for min/max operations. Streamline min_max_batch_generic to initialize from the first non-null element. Implement shared setup/assert helpers in dictionary tests to reduce repetition while preserving test coverage.
Refactor dictionary min/max flow by removing the wrap macro arm, making re-wrapping explicit through a private helper. This separates the "choose inner winner" from the "wrap as dictionary" step for easier auditing. In `datafusion/functions-aggregate/src/min_max.rs`, update `string_dictionary_batch` to accept slices instead of owned Vecs, and introduce a small `evaluate_dictionary_accumulator` helper to streamline min/max assertions with a shared accumulator execution path, reducing repeated setup.
Update min_max.rs to ensure dictionary batches iterate actual array rows, comparing referenced scalar values. Unreferenced dictionary entries no longer affect MIN/MAX, and referenced null values are correctly skipped. Expanded tests to cover these changes and updated expectations Added regression tests for unreferenced and referenced null dictionary values.
Update get_min_max_result_type to maintain Dictionary<K, V> instead of unwrapping to V, allowing planned MIN/MAX execution to utilize the dictionary-aware accumulator. Add end-to-end SQL regression test to ensure MIN/MAX properly ignores unreferenced dictionary values and preserves dictionary-typed output schema. Adjust unit expectations for dictionary coercion tests to reflect new planned-path behavior.
Update the MIN/MAX integration test in basic.rs to use two MemTable partitions, ensuring the physical plan includes both partial and final aggregate stages. Retain checks for dictionary-typed output schema and results. In min_max.rs, remove the no-op min_max_dictionary! macro and inline the existing generic comparison helper for improved clarity and efficiency.
- Refactored the formatting in `basic.rs` to enhance readability by breaking long lines into shorter segments. - Updated `min_max.rs` for consistent formatting in the `test_min_max_dictionary_ignores_unreferenced_values` function.
Ensure MIN/MAX(Dictionary(..., T)) returns T at SQL boundary while retaining the new dictionary comparison logic. Update regression tests to verify that dictionary inputs are accepted and that the result is the underlying scalar type. Adjust planner-level regression test in basic.rs to expect final output schema to be Utf8 instead of dictionary-typed.
08a20c4 to
5002677
Compare
Add deterministic JSON serialization that maintains consistent key ordering regardless of map ordering. Introduce ordered key logic in display.rs, along with a new pretty-writer. Update root pgjson emission to use the ordered writer and add a regression test to verify key sequence for Values plan output.
f2330b9 to
b4938c1
Compare
Share array-extrema scan in min_max.rs. Collapse dictionary scalar handling into a single normalized path. Simplify key ordering, indentation, and stringification helpers in display.rs. Reduce duplicated dictionary test setup in min_max.rs and basic.rs.
Refactor the code to eliminate the dictionary_batch_extreme. Directly pass dictionary arrays through min_max_batch_generic for improved clarity and efficiency.
Collapse repeated complex-type match arms in min_batch and max_batch. Introduce a shared update_extreme path for the min/max accumulators. Simplify the ordered_keys function and factor the repeated pretty-print container logic into write_delimited_entries. Use a local macro for JSON stringification deduplication in display.rs to maintain behavior without adding dependencies.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
The current implementation of
MINandMAXaggregate functions does not properly support dictionary-encoded arrays. This limitation prevents efficient execution on dictionary-backed columns, which are commonly used for memory and performance optimization in Arrow-based systems.Flattening dictionary arrays into their underlying values was considered, but rejected due to its negative impact on performance and loss of dictionary benefits.
This PR introduces native support for dictionary scalars and arrays in min/max computations, ensuring correctness while preserving performance characteristics.
Additionally, this PR improves determinism of logical plan JSON output (
display_pg_json) by enforcing a stable and meaningful key ordering. This resolves inconsistencies observed in snapshot tests.What changes are included in this PR?
1. Dictionary Support for MIN/MAX
Added support for dictionary-encoded scalars in
min_max!macroIntroduced helper utilities:
unpack_dictionary_scalarwrap_dictionary_scalarscalar_extremeupdate_extremeRefactored batch update logic in
MinAccumulatorandMaxAccumulatorUnified handling of complex types (including Dictionary) via
min_max_batch_genericEnsured dictionary keys are preserved when both operands are dictionary scalars
2. Improved Type Coercion
get_min_max_result_typeto coerce dictionary inputs to their value types3. Deterministic JSON Plan Output
Introduced ordered JSON serialization for
display_pg_jsonEnsured consistent key ordering:
Replaced
serde_json::to_string_prettywith custom ordered writer4. New and Enhanced Tests
Added extensive tests for dictionary min/max behavior:
Added physical plan validation for dictionary aggregation path
Added ordering validation test for JSON output
Refactored existing tests for clarity and completeness
Are these changes tested?
Yes.
This PR includes:
These tests ensure correctness, prevent regressions, and document expected behavior.
Are there any user-facing changes?
Yes.
MINandMAXnow support dictionary-encoded columnsNo breaking API changes are introduced.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.