Refactor BitwiseXOR operator using ScalarUDF framework#20086
Refactor BitwiseXOR operator using ScalarUDF framework#20086Acfboy wants to merge 1 commit intoapache:mainfrom
Conversation
|
Hi @2010YOUY01, I've submitted the pilot pr for bitwise_xor as we discussed. Looking forward to your feedback! The implementation logic for bitwise_or and bitwise_and is very similar to bitwise_xor. If this approach is acceptable, I plan to refactor those operators using macros in a follow-up step. |
|
At what phase of planning do we do this replacement? I have a concern that we have optimizer rules based around binary exprs that are |
|
Hi @Jefffrey, thanks for the feedback! My apologies, I didn't consider this thoroughly enough. I realize now that making this change directly (during the initial conversion from statement to LogicalPlan) would break the optimizer. For instance, I haven't even migrated the logic from simplify_expr to the UDF's simplify method yet. I will convert this pr to a draft for now and carefully study the impact of this change on existing optimization rules. |
|
Thank you for this POC; it helps clarify the issue. I see two benefits to unifying operators and UDFs:
This PR’s approach is to replace operator inside As a middle ground, I have another idea (this gives up #[derive(Debug, Clone, Eq)]
pub struct BinaryExpr {
left: Arc<dyn PhysicalExpr>,
op: Operator,
right: Arc<dyn PhysicalExpr>,
/// Specifies whether an error is returned on overflow or not
fail_on_overflow: bool,
inner: Arc<dyn ScalarUDF>,
}Here, This way, operators and UDFs could share a consistent interface and reuse utilities like signature checking and user-friendly error messages. I think this approach is feasible, though I may be underestimating the difficulty since I haven’t worked extensively in these modules; I believe @Jefffrey has a deeper understanding here. |
|
Thank you for the suggestion @2010YOUY01 ! I was actually stuck on how to handle the optimizer rules to satisfy that first benefit without a massive refactoring. I think that the ideal long-term solution is to equip UDFs with "properties" (like associativity or commutativity). This would allow the optimizer to handle them generically, which would also benefit user-defined functions. But as you noted, that would require a significant overhaul. I agree your proposed middle ground is a great practical trade-off. I will try to create a new POC based on this wrapping approach. Hope this can serve as a starting point, and that we can eventually find a better way to unify the two frameworks step-by-step. |
After digging deeper, I think since we focus solely on improving signature checking and error message handling, it doesn't seem to necessitate introducing the UDF framework; what we really need are function signatures. The existing method for obtaining operator signatures relies on complex conditional logic or even directly calling Arrow computations to see if they error out. This is quite different from how functions are handled, and in this situation, manually specifying signatures requires a lot of work, and would likely lead to inconsistencies as the code evolves. I’m wondering if a 'probing' approach might be more suitable for error hints. For example, if a user provides Int32 + Utf8, when an error occurs, we could first test Int32 + x across various types of x to see if it succeeds, then try x + Utf8, and finally offer suggestions for candidate function signatures. Of course, the actual implementation would need to account for trying similar types first, the possibility that fixing one operand might still not yield a valid suggestion, and cases like subtraction where the order of operands matters. @Jefffrey, I’d love to hear your thoughts on this approach. I will try to provide a POC. |
I'm not as familiar with the current operator related code, though I wonder if it's possible to be more proactive, as this sounds more like a reactive approach? Proactive in the sense of how UDFs have an explicit signature to define which types they coerce to, making them easier to reason about. Though I do think trying to introduce the UDF framework to operators (even as an inner field) might be a bit clunky 🤔 |
Which issue does this PR close?
Rationale for this change
This PR is a pilot implementation of #20018. By refactoring operator into the ScalarUDF framework, we aim to:
What changes are included in this PR?
Refactor BitwiseXOR operator handling using ScalarUDF framework
Are these changes tested?
Yes.
Are there any user-facing changes?
No.