feat: Add Spark-compatible encode function to datafusion-spark#21331

Open
JeelRajodiya wants to merge 1 commit into apache:main from JeelRajodiya:feat/spark-encode-function

@JeelRajodiya

Rationale

The datafusion-spark crate is missing the encode function. Spark's encode(expr, charset) converts a string or binary value into binary using a specified character encoding — this is commonly used in Spark SQL workloads and needed by engines built on DataFusion that target Spark compatibility.

What changes are included in this PR?

Adds SparkEncode to datafusion-spark's string functions. It supports US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE, and UTF-16LE charsets. Binary input is handled via lossy UTF-8 conversion (invalid bytes → U+FFFD), matching Spark/Databricks behavior.
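For readers who want a concrete picture of the charset handling described above, here is a minimal std-only sketch. It is not the code in this PR: `encode_sketch` is a hypothetical name, only the canonical charset names are matched (the diff also accepts aliases such as `UTF16BE`), and two behaviors are assumptions modeled on Java's `String.getBytes`: unmappable characters become `?` for US-ASCII and ISO-8859-1, and plain `UTF-16` emits a big-endian BOM.

```rust
/// Hypothetical helper for illustration only; not the PR's SparkEncode.
/// Takes raw bytes so that binary input goes through the lossy UTF-8 path.
fn encode_sketch(input: &[u8], charset: &str) -> Result<Vec<u8>, String> {
    // Invalid UTF-8 sequences are replaced with U+FFFD.
    let s = String::from_utf8_lossy(input);

    // Case-insensitive charset matching.
    match charset.to_uppercase().as_str() {
        "UTF-8" => Ok(s.as_bytes().to_vec()),
        // Assumption: unmappable characters are replaced with b'?'.
        "US-ASCII" => Ok(s
            .chars()
            .map(|c| if c.is_ascii() { c as u8 } else { b'?' })
            .collect()),
        // ISO-8859-1 covers exactly U+0000..=U+00FF.
        "ISO-8859-1" => Ok(s
            .chars()
            .map(|c| if (c as u32) <= 0xFF { c as u8 } else { b'?' })
            .collect()),
        "UTF-16BE" => Ok(s.encode_utf16().flat_map(|u| u.to_be_bytes()).collect()),
        "UTF-16LE" => Ok(s.encode_utf16().flat_map(|u| u.to_le_bytes()).collect()),
        // Assumption: big-endian output prefixed with the BOM FE FF.
        "UTF-16" => {
            let mut bytes = vec![0xFE, 0xFF];
            bytes.extend(s.encode_utf16().flat_map(|u| u.to_be_bytes()));
            Ok(bytes)
        }
        other => Err(format!("unsupported charset: {other}")),
    }
}
```

Under these assumptions, `encode_sketch(b"A", "utf-16")` yields `FE FF 00 41`, and the single invalid byte `0xFF` encoded as UTF-8 yields `EF BF BD` (the UTF-8 form of U+FFFD).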

Are these changes tested?

Yes — 15 unit tests covering all charsets, case-insensitive charset matching, null handling, binary input with lossy UTF-8, Utf8View columns, unsupported charset errors, and return field nullability.

Are there any user-facing changes?

New encode scalar function available when using datafusion-spark.

Implements `encode(string_or_binary, charset)` that converts a string
or binary value into binary using the specified character encoding,
matching Apache Spark's behavior.
@github-actions github-actions bot added the spark label Apr 3, 2026
@Zeel-e6x

Zeel-e6x commented Apr 3, 2026

run benchmarks

@adriangbot

}

/// Extracts a charset string from a ColumnarValue, normalizing to uppercase.
pub(crate) fn extract_charset(charset_arg: &ColumnarValue) -> Result<String> {

Nit: does this need to be crate public?

}

#[cfg(test)]
mod tests {

LargeBinary and BinaryView tests might be nice

.collect()),
"UTF-16BE" | "UTF16BE" => {
let mut bytes = Vec::new();
for code_unit in s.encode_utf16() {

Pretty sure this will handle surrogate pairs but it's worth adding some emoji tests to check?
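A rough sketch of what such an emoji check could look like, using only the standard library. `utf16be_bytes` is a hypothetical stand-in for the UTF-16BE branch in the diff, not the PR's actual function; it relies on `str::encode_utf16`, which emits a high/low surrogate pair for any code point above U+FFFF.

```rust
/// Hypothetical stand-in for the UTF-16BE branch shown in the diff.
fn utf16be_bytes(s: &str) -> Vec<u8> {
    // Two bytes per UTF-8 byte is an upper bound on the UTF-16 output size.
    let mut bytes = Vec::with_capacity(s.len() * 2);
    for code_unit in s.encode_utf16() {
        // Each 16-bit code unit is written most significant byte first.
        bytes.extend_from_slice(&code_unit.to_be_bytes());
    }
    bytes
}

/// Example check: U+1F600 lies outside the BMP, so it must come out as
/// the surrogate pair D83D DE00 (four bytes), not a single code unit.
fn check_emoji_surrogates() -> bool {
    utf16be_bytes("😀") == [0xD8, 0x3D, 0xDE, 0x00]
        && utf16be_bytes("a😀") == [0x00, 0x61, 0xD8, 0x3D, 0xDE, 0x00]
}
```

The same inputs with `to_le_bytes` would give `3D D8 00 DE` for the emoji, which makes a matching UTF-16LE test straightforward.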

4 participants