Skip to content

Refactor BitwiseXOR operator using ScalarUDF framework#20086

Draft
Acfboy wants to merge 1 commit intoapache:mainfrom
Acfboy:rewrite-xor-to-function
Draft

Refactor BitwiseXOR operator using ScalarUDF framework#20086
Acfboy wants to merge 1 commit intoapache:mainfrom
Acfboy:rewrite-xor-to-function

Conversation

@Acfboy
Copy link
Contributor

@Acfboy Acfboy commented Jan 31, 2026

Which issue does this PR close?

Rationale for this change

This PR is a pilot implementation of #20018. By refactoring operator into the ScalarUDF framework, we aim to:

  1. Move operator type handling from the physical execution phase to the logical plan phase, consistent with scalar functions.
  2. Simplify the operator handling logic which is currently scattered across the codebase.

What changes are included in this PR?

Refactor BitwiseXOR operator handling using ScalarUDF framework

Are these changes tested?

Yes.

  • All existing integration tests pass.
  • Move some existing binary operator tests to the new UDF-based implementation.

Are there any user-facing changes?

No.

@github-actions github-actions bot added sql SQL Planner physical-expr Changes to the physical-expr crates functions Changes to functions implementation labels Jan 31, 2026
@Acfboy
Copy link
Contributor Author

Acfboy commented Jan 31, 2026

Hi @2010YOUY01, I've submitted the pilot pr for bitwise_xor as we discussed. Looking forward to your feedback!

The implementation logic for bitwise_or and bitwise_and is very similar to bitwise_xor. If this approach is acceptable, I plan to refactor those operators using macros in a follow-up step.

@Jefffrey
Copy link
Contributor

At what phase of planning do we do this replacement? I have a concern that we have optimizer rules based around binary exprs that are BitwiseXor ops that might be made invalid if we substitute it with a scalar UDF 🤔

@Acfboy
Copy link
Contributor Author

Acfboy commented Jan 31, 2026

Hi @Jefffrey, thanks for the feedback!

My apologies, I didn't consider this thoroughly enough. I realize now that making this change directly (during the initial conversion from statement to LogicalPlan) would break the optimizer. For instance, I haven't even migrated the logic from simplify_expr to the UDF's simplify method yet.

I will convert this pr to a draft for now and carefully study the impact of this change on existing optimization rules.

@Acfboy Acfboy marked this pull request as draft January 31, 2026 03:28
@2010YOUY01
Copy link
Contributor

2010YOUY01 commented Feb 6, 2026

Thank you for this POC; it helps clarify the issue.

I see two benefits to unifying operators and UDFs:

  1. Optimizer rules would no longer need to differentiate between expressions and UDFs, which could simplify many optimizer implementations.
  2. A unified Operator/UDF API: as described in Unifying Operator Handling with the Scalar Function Framework #20018, we already have good signature checking and user-friendly error message utilities for UDFs, but not for operators.

This PR’s approach is to replace operator inside Expr with UDFs during the SQL -> Logical Plan stage. Conceptually, I think this is the ideal way operator expr should have been implemented from the start. However, doing this now would require migrating all related optimizer implementations for that operator. Based on some code searching, I believe this is effectively infeasible: changes that involve deep optimizer refactoring are particularly hard to review, and at the moment there likely isn’t enough review capacity for such a large and complex change.

As a middle ground, I have another idea (this gives up 1 for the reason above, but still tries to achieve 2): operator structs that implement PhysicalExpr could be thin wrappers around ScalarUDF, for example:

#[derive(Debug, Clone, Eq)]
pub struct BinaryExpr {
    left: Arc<dyn PhysicalExpr>,
    op: Operator,
    right: Arc<dyn PhysicalExpr>,
    /// Specifies whether an error is returned on overflow or not
    fail_on_overflow: bool,
    inner: Arc<dyn ScalarUDF>,
}

Here, inner would handle the heavy lifting such as signature checking and execution, while the outer BinaryExpr would mainly delegate to inner to implement PhysicalExpr.

This way, operators and UDFs could share a consistent interface and reuse utilities like signature checking and user-friendly error messages. I think this approach is feasible, though I may be underestimating the difficulty since I haven’t worked extensively in these modules; I believe @Jefffrey has a deeper understanding here.

@Acfboy
Copy link
Contributor Author

Acfboy commented Feb 7, 2026

Thank you for the suggestion @2010YOUY01 ! I was actually stuck on how to handle the optimizer rules to satisfy that first benefit without a massive refactoring.

I think that the ideal long-term solution is to equip UDFs with "properties" (like associativity or commutativity). This would allow the optimizer to handle them generically, which would also benefit user-defined functions. But as you noted, that would require a significant overhaul.

I agree your proposed middle ground is a great practical trade-off. I will try to create a new POC based on this wrapping approach. Hope this can serve as a starting point, and that we can eventually find a better way to unify the two frameworks step-by-step.

@Acfboy
Copy link
Contributor Author

Acfboy commented Feb 10, 2026

This way, operators and UDFs could share a consistent interface and reuse utilities like signature checking and user-friendly error messages. I think this approach is feasible, though I may be underestimating the difficulty since I haven’t worked extensively in these modules; I believe @Jefffrey has a deeper understanding here.

After digging deeper, I think since we focus solely on improving signature checking and error message handling, it doesn't seem to necessitate introducing the UDF framework; what we really need are function signatures.

The existing method for obtaining operator signatures relies on complex conditional logic or even directly calling Arrow computations to see if they error out. This is quite different from how functions are handled, and in this situation, manually specifying signatures requires a lot of work, and would likely lead to inconsistencies as the code evolves.

I’m wondering if a 'probing' approach might be more suitable for error hints. For example, if a user provides Int32 + Utf8, when an error occurs, we could first test Int32 + x across various types of x to see if it succeeds, then try x + Utf8, and finally offer suggestions for candidate function signatures. Of course, the actual implementation would need to account for trying similar types first, the possibility that fixing one operand might still not yield a valid suggestion, and cases like subtraction where the order of operands matters. @Jefffrey, I’d love to hear your thoughts on this approach. I will try to provide a POC.

@Jefffrey
Copy link
Contributor

I’m wondering if a 'probing' approach might be more suitable for error hints. For example, if a user provides Int32 + Utf8, when an error occurs, we could first test Int32 + x across various types of x to see if it succeeds, then try x + Utf8, and finally offer suggestions for candidate function signatures. Of course, the actual implementation would need to account for trying similar types first, the possibility that fixing one operand might still not yield a valid suggestion, and cases like subtraction where the order of operands matters.

I'm not as familiar with the current operator related code, though I wonder if it's possible to be more proactive, as this sounds more like a reactive approach? Proactive in the sense of how UDFs have an explicit signature to define which types they coerce to, making them easier to reason about. Though I do think trying to introduce the UDF framework to operators (even as an inner field) might be a bit clunky 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation physical-expr Changes to the physical-expr crates sql SQL Planner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants