
[SPARK-56804][SQL] Add bulk read+convert path for DATE to TimestampNTZ Parquet vector updater #55855

Draft
LuciferYang wants to merge 4 commits into apache:master from LuciferYang:SPARK-56804-date-to-tsntz

@LuciferYang
Contributor

What changes were proposed in this pull request?

Extend the bulk read+convert pattern introduced in SPARK-56791 to DateToTimestampNTZUpdater (Parquet INT32 DATE read into a Spark TimestampNTZType at UTC, CORRECTED rebase mode).

A new default method on VectorizedValuesReader, readIntegersAsTimestampMicros, provides the per-row fallback. VectorizedPlainValuesReader overrides it to fetch the source bytes once via getBuffer(total * 4) and run a tight in-method conversion loop, so DateToTimestampNTZUpdater.readValues becomes a one-line delegation. The per-element conversion is DateTimeUtils.daysToMicros(days, ZoneOffset.UTC), which matches the per-row updater's semantics exactly, including the Math.multiplyExact overflow check. The LEGACY / EXCEPTION rebase variants (handled by DateToTimestampNTZWithRebaseUpdater) are out of scope.
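For illustration only, here is a minimal standalone sketch of the per-row versus bulk shape described above. It is not the code in this PR: the class name DateBulkReadSketch, the helper getBuffer(in, length), and daysToMicrosUtc are hypothetical stand-ins, and the UTC conversion is modeled as an overflow-checked multiply by the number of microseconds per day, on the assumption that this is what daysToMicros reduces to at a fixed UTC offset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical sketch; names here are illustrative and not Spark's actual classes.
final class DateBulkReadSketch {
    static final long MICROS_PER_DAY = 86_400_000_000L;

    // Assumed model of the DATE -> micros conversion at UTC:
    // an overflow-checked multiply that throws ArithmeticException on overflow.
    static long daysToMicrosUtc(int days) {
        return Math.multiplyExact((long) days, MICROS_PER_DAY);
    }

    // Stand-in for a reader's getBuffer(n): slices n bytes off the input
    // (allocating a new ByteBuffer view) and advances the input position.
    static ByteBuffer getBuffer(ByteBuffer in, int length) {
        ByteBuffer slice = in.slice().order(ByteOrder.LITTLE_ENDIAN);
        slice.limit(length);
        in.position(in.position() + length);
        return slice;
    }

    // Per-row shape: one 4-byte slice per element, so N slice allocations.
    static void readPerRow(ByteBuffer in, int total, long[] out, int rowId) {
        for (int i = 0; i < total; i++) {
            out[rowId + i] = daysToMicrosUtc(getBuffer(in, 4).getInt());
        }
    }

    // Bulk shape: one slice covering total * 4 bytes, then a tight loop.
    static void readBulk(ByteBuffer in, int total, long[] out, int rowId) {
        ByteBuffer buffer = getBuffer(in, total * 4);
        for (int i = 0; i < total; i++) {
            out[rowId + i] = daysToMicrosUtc(buffer.getInt());
        }
    }

    public static void main(String[] args) {
        int[] days = {0, 1, -1, 19_000};
        ByteBuffer in = ByteBuffer.allocate(days.length * 4).order(ByteOrder.LITTLE_ENDIAN);
        for (int d : days) {
            in.putInt(d);
        }
        in.flip();

        long[] bulk = new long[days.length];
        readBulk(in, days.length, bulk, 0);

        in.rewind(); // re-read the same bytes through the per-row path
        long[] perRow = new long[days.length];
        readPerRow(in, days.length, perRow, 0);

        System.out.println("paths agree: " + Arrays.equals(bulk, perRow));
    }
}
```

Both paths apply the same per-element function, so they should produce identical values and raise the same overflow error; the only intended difference is how many buffer slices are created.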

Unlike the pure-primitive-cast siblings (SPARK-56791/56801/56802), the per-element conversion here is a function call rather than a JVM-native widen, so the expected speedup is smaller; the win still comes from collapsing N getBuffer(4) allocations into one.

Why are the changes needed?

On the existing per-row path, DateToTimestampNTZUpdater.readValues allocates a fresh ByteBuffer slice inside getBuffer(4) for every element. Collapsing those N allocations into one is the same win SPARK-56791 delivered for the INT32 -> Long sibling.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

(To be updated after the GHA benchmark and test runs complete.)

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

@LuciferYang LuciferYang marked this pull request as draft May 13, 2026 14:42
@LuciferYang LuciferYang force-pushed the SPARK-56804-date-to-tsntz branch from 22660e6 to eaf396f Compare May 14, 2026 15:35
…uet.ParquetVectorUpdaterBenchmark (JDK 17, Scala 2.13, split 1 of 1)
…uet.ParquetVectorUpdaterBenchmark (JDK 21, Scala 2.13, split 1 of 1)
…uet.ParquetVectorUpdaterBenchmark (JDK 25, Scala 2.13, split 1 of 1)