[SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all data in memory by ivoson · Pull Request #55053 · apache/spark

ivoson · 2026-03-27T05:53:44Z

What changes were proposed in this pull request?

This PR adds a default fetchSize of 1000 for the PostgreSQL JDBC dialect to prevent loading entire
tables into memory when no explicit fetchSize is specified by the user.

The changes include:

- JdbcDialect: Added defaultFetchSize (returns 0 by default) and effectiveFetchSize(options) as the single source of truth for resolving the effective fetch size: a user-specified value takes precedence, otherwise it falls back to the dialect's default.
- PostgresDialect: Overrides defaultFetchSize to 1000, and updates beforeFetch to use effectiveFetchSize() for the autoCommit decision.
- AggregatedDialect: Delegates defaultFetchSize, effectiveFetchSize, and beforeFetch to dialects.head, consistent with how other methods (e.g., quoteIdentifier, getTruncateQuery) are delegated.
- JDBCOptions: Removed the unused fetchSize field since the effective fetch size is now fully resolved through JdbcDialect.effectiveFetchSize().
- JDBCRDD: Uses dialect.effectiveFetchSize(options) for stmt.setFetchSize().
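
The resolution order described in the list above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from this PR: `ReadOptions`, `Dialect`, `GenericDialect`, `PostgresLikeDialect`, and `AggregatedLikeDialect` are hypothetical stand-ins for Spark's real classes, and modeling the user setting as an `Option[Int]` is an assumption made here to distinguish "not set" from an explicit 0.

```scala
object FetchSizeSketch {
  // Hypothetical stand-in for JDBCOptions: None means the user did not set `fetchsize`.
  final case class ReadOptions(userFetchSize: Option[Int])

  // Hypothetical stand-in for JdbcDialect.
  trait Dialect {
    // 0 means "no dialect-specific default".
    def defaultFetchSize: Int = 0

    // A user-specified value (including an explicit 0) wins over the dialect default.
    def effectiveFetchSize(options: ReadOptions): Int =
      options.userFetchSize.getOrElse(defaultFetchSize)
  }

  object GenericDialect extends Dialect

  // Mirrors the idea of PostgresDialect overriding the default to 1000.
  object PostgresLikeDialect extends Dialect {
    override def defaultFetchSize: Int = 1000
  }

  // Mirrors the idea of AggregatedDialect delegating to the first dialect.
  final class AggregatedLikeDialect(dialects: List[Dialect]) extends Dialect {
    require(dialects.nonEmpty, "at least one dialect is required")

    override def defaultFetchSize: Int = dialects.head.defaultFetchSize

    override def effectiveFetchSize(options: ReadOptions): Int =
      dialects.head.effectiveFetchSize(options)
  }
}
```

With these stand-ins, an unset option resolves to 1000 for the PostgreSQL-like dialect and to 0 for the generic one, while an explicit user value is always returned unchanged.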

Why are the changes needed?

By default, the PostgreSQL JDBC driver loads all rows into memory when fetchSize is 0 (the Spark
default). Without partitioning information, a single task may load the entire table into memory,
which can easily cause executor OOM.

Unlike most JDBC drivers, PostgreSQL requires both fetchSize > 0 and autoCommit = false to
enable cursor-based fetching (see the PostgreSQL JDBC documentation). Setting a sensible default
fetch size of 1000 enables cursor-based row batching automatically, preventing the driver from
buffering the entire result set.
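
As a rough illustration of the two conditions above, the following sketch shows how a plain JDBC read might be prepared. It is not taken from this PR: `CursorFetchSketch`, `needsAutoCommitOff`, and `prepareForRead` are hypothetical names, and only the standard `java.sql` calls `setAutoCommit`, `prepareStatement`, and `setFetchSize` are used.

```scala
import java.sql.{Connection, PreparedStatement}

object CursorFetchSketch {
  // Cursor-based fetching on PostgreSQL only applies when fetchSize > 0,
  // and in that case autoCommit must also be turned off.
  def needsAutoCommitOff(fetchSize: Int): Boolean = fetchSize > 0

  // Hypothetical helper: prepares a forward-only statement for a batched read.
  // The caller is responsible for closing the statement and the connection.
  def prepareForRead(conn: Connection, sql: String, fetchSize: Int): PreparedStatement = {
    if (needsAutoCommitOff(fetchSize)) {
      conn.setAutoCommit(false)
    }
    val stmt = conn.prepareStatement(sql)
    // A fetch size of 0 leaves the choice to the driver.
    stmt.setFetchSize(fetchSize)
    stmt
  }
}
```

With a fetch size of 0 the helper leaves autoCommit untouched, which corresponds to the old behavior where the driver buffers the full result set.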

Users can still override the default by explicitly setting the fetchsize option (including
fetchsize=0 to restore the old behavior).

Does this PR introduce any user-facing change?

NA

How was this patch tested?

UTs added.
Manually verified the behavior for Postgres.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code v2.1.87

ivoson · 2026-03-30T06:41:36Z

cc @yaooqinn @cloud-fan can you please take a look at this PR? Thanks

yaooqinn · 2026-03-30T08:42:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala

     """.stripMargin
  )

-  val fetchSize = parameters.getOrElse(JDBC_BATCH_FETCH_SIZE, "0").toInt


Breaking change?

Not really, I think. The behavior is still the same except for PostgresDialect.

I just moved the logic from JDBCOptions to JdbcDialects. See: https://github.com/apache/spark/pull/55053/changes#diff-1533255ad629a18e883f8186b303fdf4fae99043551ce20c5b5ea06d085e0b14R352

The default fetchSize is still 0 except for PostgresDialect.
For PostgresDialect, changed the default value from 0 to 1000.

Yeah, you're right.

However, please restore this field according to 77413d4

Sounds good. Updated

pan3793 · 2026-03-31T05:34:30Z

sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala

+   * Dialects can override this to provide a sensible default when the user does not
+   * explicitly set the fetchSize option.
+   */
+  def effectiveFetchSize(options: JDBCOptions): Int = options.fetchSize


Please add @Since("4.2.0").

I think I understand your intention in using "effective" here, but I would follow the style of the existing method names and call it getFetchSize.

thx, updated

pan3793 · 2026-03-31T05:38:47Z

sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala

+        logWarning(s"No fetchSize option set for PostgreSQL JDBC read. " +
+          s"Defaulting to $POSTGRES_DEFAULT_FETCH_SIZE to avoid loading all rows into memory. " +
+          s"Set the 'fetchsize' option explicitly to override this behavior.")


With this change, leaving the fetchSize option unset should be fine? If so, this message should be logged at info or debug level, not warning.

Yes, it's fine if the parameter is not set.
I'll keep it at info level since the default behavior changes.

ivoson added 2 commits March 27, 2026 05:36

Add default fetchSize for postgres to avoid loading all data in memory

2f68f91

refine code

3696c51

ivoson changed the title ~~[WIP] Add default fetchSize for postgres to avoid loading all data in memory~~ [WIP][SPARK-56251] Add default fetchSize for postgres to avoid loading all data in memory Mar 27, 2026

ivoson added 4 commits March 27, 2026 08:40

remove dead code

9f37423

refine tests

75b4448

fix multiple dialect

76e9c45

fix uts

fefef16

ivoson force-pushed the postgres-default-fetchsize branch from 3a4f6b0 to e77f950 Compare March 30, 2026 02:57

refine UTs

ba034ad

ivoson force-pushed the postgres-default-fetchsize branch from e77f950 to ba034ad Compare March 30, 2026 04:58

ivoson changed the title ~~[WIP][SPARK-56251] Add default fetchSize for postgres to avoid loading all data in memory~~ [WIP][SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all data in memory Mar 30, 2026

ivoson changed the title ~~[WIP][SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all data in memory~~ [SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all data in memory Mar 30, 2026

ivoson marked this pull request as ready for review March 30, 2026 05:05

yaooqinn reviewed Mar 30, 2026

address comments

44cd99d

pan3793 reviewed Mar 31, 2026

address comments

81ae847

pan3793 approved these changes Mar 31, 2026

ivoson wants to merge 9 commits into apache:master from ivoson:postgres-default-fetchsize
