feat: Add Spark-compatible encode function to datafusion-spark #21331

JeelRajodiya wants to merge 1 commit into apache:main

Conversation
Implements `encode(string_or_binary, charset)` that converts a string or binary value into binary using the specified character encoding, matching Apache Spark's behavior.
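To make the intended semantics concrete, here is a minimal standalone sketch in plain Rust (not the PR's actual DataFusion code; `encode_str` is a hypothetical helper) of the byte-level conversions `encode` performs for a few of the supported charsets:

```rust
// Sketch of the charset conversions behind encode(string, charset).
// Charset matching is case-insensitive, mirroring the PR's behavior.
fn encode_str(s: &str, charset: &str) -> Option<Vec<u8>> {
    match charset.to_uppercase().as_str() {
        // Rust strings are already UTF-8, so this is a plain byte copy.
        "UTF-8" | "UTF8" => Some(s.as_bytes().to_vec()),
        // Each UTF-16 code unit is written big-endian; characters outside
        // the BMP come out of encode_utf16() as two surrogate code units.
        "UTF-16BE" | "UTF16BE" => {
            Some(s.encode_utf16().flat_map(|u| u.to_be_bytes()).collect())
        }
        "UTF-16LE" | "UTF16LE" => {
            Some(s.encode_utf16().flat_map(|u| u.to_le_bytes()).collect())
        }
        // Unsupported charset: the real function returns an error instead.
        _ => None,
    }
}

fn main() {
    assert_eq!(encode_str("A", "utf-8"), Some(vec![0x41]));
    assert_eq!(encode_str("A", "UTF-16BE"), Some(vec![0x00, 0x41]));
    // U+00E9 ('é') in little-endian UTF-16.
    assert_eq!(encode_str("é", "UTF-16LE"), Some(vec![0xE9, 0x00]));
    assert_eq!(encode_str("A", "KOI8-R"), None);
    println!("ok");
}
```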
run benchmarks

Hi @Zeel-e6x, thanks for the request (#21331 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas. File an issue against this benchmark runner.
```rust
/// Extracts a charset string from a ColumnarValue, normalizing to uppercase.
pub(crate) fn extract_charset(charset_arg: &ColumnarValue) -> Result<String> {
```
Nit: does this need to be crate public?
```rust
#[cfg(test)]
mod tests {
```
LargeBinary and BinaryView tests might be nice
```rust
        .collect()),
    "UTF-16BE" | "UTF16BE" => {
        let mut bytes = Vec::new();
        for code_unit in s.encode_utf16() {
```
Pretty sure this handles surrogate pairs correctly, but it's worth adding some emoji tests to confirm.
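A quick standalone check (plain Rust, independent of the PR's code) confirms the reviewer's expectation: `encode_utf16()` emits proper surrogate pairs for characters outside the BMP, such as U+1F600:

```rust
// UTF-16BE encoding of a &str, same loop shape as the diff above.
fn utf16be_bytes(s: &str) -> Vec<u8> {
    let mut bytes = Vec::new();
    for code_unit in s.encode_utf16() {
        bytes.extend_from_slice(&code_unit.to_be_bytes());
    }
    bytes
}

fn main() {
    // U+1F600 (😀) is the surrogate pair 0xD83D 0xDE00 in UTF-16.
    assert_eq!(utf16be_bytes("😀"), vec![0xD8, 0x3D, 0xDE, 0x00]);
    println!("ok");
}
```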
Rationale

The `datafusion-spark` crate is missing the `encode` function. Spark's `encode(expr, charset)` converts a string or binary value into binary using a specified character encoding — this is commonly used in Spark SQL workloads and is needed by engines built on DataFusion that target Spark compatibility.

What changes are included in this PR?
Adds `SparkEncode` to `datafusion-spark`'s string functions. It supports the US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE, and UTF-16LE charsets. Binary input is handled via lossy UTF-8 conversion (invalid bytes → U+FFFD), matching Spark/Databricks behavior.

Are these changes tested?
Yes — 15 unit tests covering all charsets, case-insensitive charset matching, null handling, binary input with lossy UTF-8, Utf8View columns, unsupported charset errors, and return field nullability.
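The lossy binary-input path mentioned above can be sketched in plain Rust (this is an illustration of the described behavior, not the PR's actual code):

```rust
// Invalid UTF-8 bytes are replaced with U+FFFD before re-encoding,
// as the PR summary describes for binary input.
fn main() {
    let bytes: &[u8] = &[0x61, 0xFF, 0x62]; // 'a', invalid byte, 'b'
    let s = String::from_utf8_lossy(bytes);
    assert_eq!(s, "a\u{FFFD}b");
    // Re-encoding as UTF-8: U+FFFD becomes the 3-byte sequence EF BF BD.
    assert_eq!(s.as_bytes(), &[0x61, 0xEF, 0xBF, 0xBD, 0x62]);
    println!("ok");
}
```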
Are there any user-facing changes?
New `encode` scalar function available when using `datafusion-spark`.