Fix: map_from_arrays() with NULL inputs causes native crash #3356
kazantsev-maksim wants to merge 33 commits into apache:main
Conversation
This reverts commit 768b3e9.
-- Comet bug: map_from_arrays(NULL, NULL) causes native crash "map key cannot be null"
-- https://github.com/apache/datafusion-comet/issues/3327
query ignore(https://github.com/apache/datafusion-comet/issues/3327)
query spark_answer_only
This should be fully native?
Suggested change:
- query spark_answer_only
+ query
-- https://github.com/apache/datafusion-comet/issues/3327
query ignore(https://github.com/apache/datafusion-comet/issues/3327)
query spark_answer_only
SELECT map_from_arrays(k, v) FROM test_map_from_arrays WHERE k IS NULL
Could you also add tests where v is NULL?
I think the behavior will be the same: in the original data, value is null whenever key is null.
statement
INSERT INTO test_map_from_arrays VALUES (array('a', 'b', 'c'), array(1, 2, 3)), (array(), array()), (NULL, NULL)
I ran some additional tests locally and found that the null check needs to cover both inputs, not just the keys. The Spark docs say map_from_arrays "returns null if either the keys array or values array is null", but the current keyIsNotNullExpr only checks expr.left.
I added the row (array('x'), NULL) to the test data and a query SELECT map_from_arrays(k, v) FROM test_map_from_arrays WHERE k IS NOT NULL AND v IS NULL, which crashes with CometNativeException: Compute error: concat requires input of at least one array.
Here's a suggested updated test file that covers the missing cases. Note that I did use AI to generate these tests, so it is possible that there are mistakes.
-- ConfigMatrix: parquet.enable.dictionary=false,true
statement
CREATE TABLE test_map_from_arrays(k array<string>, v array<int>) USING parquet
statement
INSERT INTO test_map_from_arrays VALUES
(array('a', 'b', 'c'), array(1, 2, 3)),
(array(), array()),
(NULL, NULL),
(array('x'), NULL),
(NULL, array(99))
-- basic functionality
query spark_answer_only
SELECT map_from_arrays(k, v) FROM test_map_from_arrays WHERE k IS NOT NULL AND v IS NOT NULL
-- both inputs NULL should return NULL
query
SELECT map_from_arrays(k, v) FROM test_map_from_arrays WHERE k IS NULL AND v IS NULL
-- keys not null but values null should return NULL (Spark behavior)
query
SELECT map_from_arrays(k, v) FROM test_map_from_arrays WHERE k IS NOT NULL AND v IS NULL
-- keys null but values not null should return NULL (Spark behavior)
query
SELECT map_from_arrays(k, v) FROM test_map_from_arrays WHERE k IS NULL AND v IS NOT NULL
-- all rows including nulls
query spark_answer_only
SELECT map_from_arrays(k, v) FROM test_map_from_arrays
-- literal arguments
query spark_answer_only
SELECT map_from_arrays(array('a', 'b'), array(1, 2))
-- literal null arguments
query
SELECT map_from_arrays(NULL, array(1, 2))
query
SELECT map_from_arrays(array('a'), NULL)
query
SELECT map_from_arrays(NULL, NULL)

To fix the serde, the condition should check both inputs for null: something like IsNotNull(expr.left) AND IsNotNull(expr.right) instead of just IsNotNull(expr.left).
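The null semantics that the serde guard needs to reproduce can be modeled with a short sketch. This is Python purely for illustration; the real fix belongs in the Scala serde (and the native Rust kernel), and this model is not Comet code:

```python
def map_from_arrays(keys, values):
    """Model of Spark's map_from_arrays NULL handling: the result is
    NULL (None here) if EITHER the keys array or the values array is
    NULL. Illustrative sketch only, not the Comet implementation."""
    if keys is None or values is None:
        # Guard BOTH inputs, matching the Spark docs: "returns null if
        # either the keys array or values array is null".
        return None
    return dict(zip(keys, values))
```

Checking only the keys array (the current keyIsNotNullExpr behavior) would let a row like (array('x'), NULL) reach the native side, which is exactly the crash reported above.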
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##               main    #3356       +/-   ##
============================================
+ Coverage     56.12%   60.03%    +3.91%
- Complexity      976     1477      +501
============================================
  Files           119      175       +56
  Lines         11743    16178     +4435
  Branches       2251     2681      +430
============================================
+ Hits           6591     9713     +3122
- Misses         4012     5114     +1102
- Partials       1140     1351      +211

View full report in Codecov by Sentry.
Which issue does this PR close?
Closes #3327
Rationale for this change
What changes are included in this PR?
How are these changes tested?
Tested with existing unit tests.