fix: regression of dict_id in physical plan proto#20063
Conversation
af1a108 to
e0a0061
Compare
brancz
left a comment
There was a problem hiding this comment.
lgtm, thank you for fixing this!
Thanks for the review! |
alamb
left a comment
There was a problem hiding this comment.
Thank you @kumarUjjawal
FYI @dispanser
|
thanks @kumarUjjawal , appreciated! |
| root_as_message(encoded_schema.ipc_message.as_slice()).map_err( | ||
| |e| { | ||
| Error::General(format!( | ||
| "Error IPC schema message while deserializing ScalarValue::List: {e}" |
There was a problem hiding this comment.
Why the error messages say ScalarValue::List ? (List)
Isn't this used for any nested type ? List, Map, Struct, ...
There was a problem hiding this comment.
Good catch! I will update.
…f-45 Undoing [apache#14227](apache#14227) which introduces a bug in protobuf deserialization by passing a wrong schema to record batch construction of the IPC message containing the dictionary. The associated test and a fix will probably be merged into datafusion as apache#20063 but the test fails in main for completely different reason so the patch cannot be backported and doesn't make sense in the context of df46.
alamb
left a comment
There was a problem hiding this comment.
Thank you @kumarUjjawal and @brancz
I feel intuitively something could be made much simpler here, but given this fixes the bug and i don't have any specific ideas of how to improve the PR, let's go with this one
| )); | ||
| }; | ||
|
|
||
| // IPC dictionary batch IDs are assigned when encoding the schema, but our protobuf |
There was a problem hiding this comment.
I feel like somehow I missing something key -- this seems like a pretty massive overhead to create / decode a single value here.
That being said, it seems like the existing code is also doing the overhead, so maybe it is fine for now 🤔
I wonder if we could pull this logic for creating the schema into its own function to try and reduce the size of the overall method / make it easier to understand
| Ok(()) | ||
| } | ||
|
|
||
| #[test] |
There was a problem hiding this comment.
With the code change reverted, this test fails like
---- cases::roundtrip_physical_plan::roundtrip_call_null_scalar_struct_dict stdout ----
Error: Plan("General error: Error encoding ScalarValue::List as IPC: Ipc error: no dict id for field item")
|
Thank you @alamb. I will look into your suggestions. |
Thanks -- let's do it as a follow on PR |

Which issue does this PR close?
Struct(Dict)data type #20011.Rationale for this change
dict_idis intentionally not preserved protobuf (it’s deprecated in Arrow schema metadata), but Arrow IPC still requires dict IDs for dictionary encoding/decoding.What changes are included in this PR?
Are these changes tested?
Yes added a test for this
Are there any user-facing changes?
No