[SPARK-56804][SQL] Add bulk read+convert path for DATE to TimestampNTZ Parquet vector updater by LuciferYang · Pull Request #55855 · apache/spark

LuciferYang · 2026-05-13T14:37:06Z

What changes were proposed in this pull request?

Extend the bulk read+convert pattern introduced in SPARK-56791 to DateToTimestampNTZUpdater, which reads a Parquet INT32 DATE column into a Spark TimestampNTZType at UTC under the CORRECTED rebase mode.

A new readIntegersAsTimestampMicros default method on VectorizedValuesReader does the per-row fallback. VectorizedPlainValuesReader overrides it to fetch source bytes once via getBuffer(total * 4) and run a tight in-method conversion loop. DateToTimestampNTZUpdater.readValues becomes a one-line delegation. The per-element conversion is DateTimeUtils.daysToMicros(days, ZoneOffset.UTC), matching the per-row Updater's exact semantics including the Math.multiplyExact overflow check. The LEGACY / EXCEPTION rebase variants (handled by DateToTimestampNTZWithRebaseUpdater) are out of scope.
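The default-method-plus-override shape described above can be sketched roughly as follows. This is a simplified illustration and not the PR's code: `BulkReadSketch`, `ValuesReader`, `PlainReader`, `encode`, and `daysToMicrosUtc` are hypothetical names, the output is a plain `long[]` rather than a Spark column vector, and `daysToMicrosUtc` assumes that `DateTimeUtils.daysToMicros(days, ZoneOffset.UTC)` reduces to an overflow-checked multiply by the number of microseconds per day.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class BulkReadSketch {
    static final long MICROS_PER_DAY = 86_400_000_000L;

    // Assumed stand-in for DateTimeUtils.daysToMicros(days, ZoneOffset.UTC).
    // Math.multiplyExact throws ArithmeticException on long overflow.
    static long daysToMicrosUtc(int days) {
        return Math.multiplyExact((long) days, MICROS_PER_DAY);
    }

    // Hypothetical, trimmed-down reader interface.
    interface ValuesReader {
        int readInteger();

        // Default: per-row fallback that works for any reader implementation.
        default void readIntegersAsTimestampMicros(int total, long[] out, int rowId) {
            for (int i = 0; i < total; i++) {
                out[rowId + i] = daysToMicrosUtc(readInteger());
            }
        }
    }

    // Hypothetical plain-encoding reader over little-endian INT32 values.
    static final class PlainReader implements ValuesReader {
        private final ByteBuffer in;

        PlainReader(ByteBuffer in) {
            this.in = in;
        }

        // Returns a little-endian view of the next `length` bytes and advances the input.
        private ByteBuffer getBuffer(int length) {
            ByteBuffer slice = in.slice();
            slice.limit(length);
            in.position(in.position() + length);
            return slice.order(ByteOrder.LITTLE_ENDIAN);
        }

        @Override
        public int readInteger() {
            return getBuffer(4).getInt();
        }

        // Override: fetch all source bytes once, then convert in a tight loop.
        @Override
        public void readIntegersAsTimestampMicros(int total, long[] out, int rowId) {
            ByteBuffer buf = getBuffer(total * 4);
            for (int i = 0; i < total; i++) {
                out[rowId + i] = daysToMicrosUtc(buf.getInt());
            }
        }
    }

    // Test helper: encodes day counts as little-endian INT32 values.
    static ByteBuffer encode(int... days) {
        ByteBuffer b = ByteBuffer.allocate(days.length * 4).order(ByteOrder.LITTLE_ENDIAN);
        for (int d : days) {
            b.putInt(d);
        }
        b.flip();
        return b;
    }

    public static void main(String[] args) {
        long[] out = new long[3];
        new PlainReader(encode(0, 1, -1)).readIntegersAsTimestampMicros(3, out, 0);
        System.out.println(Arrays.toString(out));
    }
}
```

In this sketch both paths call the same conversion helper, so a day count too large to fit in a `long` microsecond value raises `ArithmeticException` on either path instead of silently wrapping.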

Unlike the pure-primitive-cast siblings (SPARK-56791/56801/56802), the per-element conversion here is a function call rather than a JVM-native widen, so the expected speedup is smaller; the win still comes from collapsing N getBuffer(4) allocations into one.

Why are the changes needed?

DateToTimestampNTZUpdater.readValues allocates a fresh ByteBuffer slice inside getBuffer(4) for every element on the legacy path. Collapsing N allocations into one is the same win SPARK-56791 delivered for the INT32 -> Long sibling.
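To make the allocation difference concrete, here is a small counting sketch. It is illustrative only: `SliceCountSketch` and its `getBuffer` are hypothetical stand-ins that model one `ByteBuffer` slice per call, and the values are read as plain ints so that only the slicing pattern differs between the two paths.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SliceCountSketch {
    // Number of slice objects created since the last reset.
    static int slices = 0;

    // Hypothetical stand-in for the reader's getBuffer: one new slice per call.
    static ByteBuffer getBuffer(ByteBuffer in, int length) {
        slices++;
        ByteBuffer slice = in.slice();
        slice.limit(length);
        in.position(in.position() + length);
        return slice.order(ByteOrder.LITTLE_ENDIAN);
    }

    // Per-element path: one 4-byte slice for every value read.
    static int perElement(ByteBuffer in, int total, int[] out) {
        slices = 0;
        for (int i = 0; i < total; i++) {
            out[i] = getBuffer(in, 4).getInt();
        }
        return slices;
    }

    // Bulk path: a single slice covering all values.
    static int bulk(ByteBuffer in, int total, int[] out) {
        slices = 0;
        ByteBuffer buf = getBuffer(in, total * 4);
        for (int i = 0; i < total; i++) {
            out[i] = buf.getInt();
        }
        return slices;
    }

    public static void main(String[] args) {
        int total = 1000;
        int[] out = new int[total];
        System.out.println("per-element slices: " + perElement(ByteBuffer.allocate(total * 4), total, out));
        System.out.println("bulk slices: " + bulk(ByteBuffer.allocate(total * 4), total, out));
    }
}
```

The slice count grows linearly with the batch size on the per-element path and stays at one on the bulk path. Whether that translates into a measurable speedup depends on the JIT and the cost of the per-element conversion, which is what the benchmark runs are meant to show.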

Does this PR introduce any user-facing change?

No.

How was this patch tested?

(To be updated after the GHA benchmark and test runs complete.)

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

LuciferYang marked this pull request as draft May 13, 2026 14:42

[SPARK-56804][SQL] Add bulk read+convert path for DATE to TimestampNT…

eaf396f

…Z Parquet vector updater

LuciferYang force-pushed the SPARK-56804-date-to-tsntz branch from 22660e6 to eaf396f Compare May 14, 2026 15:35

LuciferYang added 3 commits May 14, 2026 17:01

Benchmark results for org.apache.spark.sql.execution.datasources.parq…

1ca01a7

…uet.ParquetVectorUpdaterBenchmark (JDK 17, Scala 2.13, split 1 of 1)

Benchmark results for org.apache.spark.sql.execution.datasources.parq…

188d174

…uet.ParquetVectorUpdaterBenchmark (JDK 21, Scala 2.13, split 1 of 1)

Benchmark results for org.apache.spark.sql.execution.datasources.parq…

5410497

…uet.ParquetVectorUpdaterBenchmark (JDK 25, Scala 2.13, split 1 of 1)

LuciferYang wants to merge 4 commits into apache:master from LuciferYang:SPARK-56804-date-to-tsntz
