Skip to content

[SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all data in memory#55053

Open
ivoson wants to merge 9 commits intoapache:masterfrom
ivoson:postgres-default-fetchsize
Open

[SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all data in memory#55053
ivoson wants to merge 9 commits intoapache:masterfrom
ivoson:postgres-default-fetchsize

Conversation

@ivoson
Copy link
Copy Markdown
Contributor

@ivoson ivoson commented Mar 27, 2026

What changes were proposed in this pull request?

This PR adds a default fetchSize of 1000 for the PostgreSQL JDBC dialect to prevent loading entire
tables into memory when no explicit fetchSize is specified by the user.

The changes include:

  1. JdbcDialect: Added defaultFetchSize (returns 0 by default) and
    effectiveFetchSize(options) as the single source of truth for resolving the effective fetch size —
    user-specified value takes precedence, otherwise falls back to the dialect's default.
  2. PostgresDialect: Overrides defaultFetchSize to 1000, and updates beforeFetch to use
    effectiveFetchSize() for the autoCommit decision.
  3. AggregatedDialect: Delegates defaultFetchSize, effectiveFetchSize, and beforeFetch to
    dialects.head, consistent with how other methods (e.g., quoteIdentifier, getTruncateQuery) are
    delegated.
  4. JDBCOptions: Removed the unused fetchSize field since the effective fetch size is now fully
    resolved through JdbcDialect.effectiveFetchSize().
  5. JDBCRDD: Uses dialect.effectiveFetchSize(options) for stmt.setFetchSize().

Why are the changes needed?

By default, the PostgreSQL JDBC driver loads all rows into memory when fetchSize is 0 (the Spark
default). Without partitioning information, a single task may load the entire table into memory,
which can easily cause executor OOM.

Unlike most JDBC drivers, PostgreSQL requires both fetchSize > 0 and autoCommit = false to
enable cursor-based fetching (see PostgreSQL JDBC
documentation
). Setting
a sensible default fetch size of 1000 enables cursor-based row batching automatically, preventing the
driver from buffering the entire result set.

Users can still override the default by explicitly setting the fetchsize option (including
fetchsize=0 to restore the old behavior).

Does this PR introduce any user-facing change?

NA

How was this patch tested?

UTs added.
Manually verified the behavior for Postgres.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code v2.1.87

@ivoson ivoson changed the title [WIP] Add default fetchSize for postgres to avoid loading all data in memory [WIP][SPARK-56251] Add default fetchSize for postgres to avoid loading all data in memory Mar 27, 2026
@ivoson ivoson force-pushed the postgres-default-fetchsize branch from 3a4f6b0 to e77f950 Compare March 30, 2026 02:57
@ivoson ivoson force-pushed the postgres-default-fetchsize branch from e77f950 to ba034ad Compare March 30, 2026 04:58
@ivoson ivoson changed the title [WIP][SPARK-56251] Add default fetchSize for postgres to avoid loading all data in memory [WIP][SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all data in memory Mar 30, 2026
@ivoson ivoson changed the title [WIP][SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all data in memory [SPARK-56251][SQL] Add default fetchSize for postgres to avoid loading all data in memory Mar 30, 2026
@ivoson ivoson marked this pull request as ready for review March 30, 2026 05:05
@ivoson
Copy link
Copy Markdown
Contributor Author

ivoson commented Mar 30, 2026

cc @yaooqinn @cloud-fan can you please take a look at this PR? Thanks

""".stripMargin
)

val fetchSize = parameters.getOrElse(JDBC_BATCH_FETCH_SIZE, "0").toInt
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

breaking change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, I think...The behavior is still the same except for PostgresDialect.

Just moved the logic from JDBCOptions to JdbcDialects. See: https://github.com/apache/spark/pull/55053/changes#diff-1533255ad629a18e883f8186b303fdf4fae99043551ce20c5b5ea06d085e0b14R352

The default fetchSize is still 0 except for PostgresDialect.
For PostgresDialect, changed the default value from 0 to 1000.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right.

However, please restore this field according to 77413d4

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Updated

* Dialects can override this to provide a sensible default when the user does not
* explicitly set the fetchSize option.
*/
def effectiveFetchSize(options: JDBCOptions): Int = options.fetchSize
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add @Since("4.2.0")

I think I understand your intention of using "effective" here, but I would follow the style of the existing method names to call it getFetchSize

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, updated

Comment on lines +203 to +205
logWarning(s"No fetchSize option set for PostgreSQL JDBC read. " +
s"Defaulting to $POSTGRES_DEFAULT_FETCH_SIZE to avoid loading all rows into memory. " +
s"Set the 'fetchsize' option explicitly to override this behavior.")
Copy link
Copy Markdown
Member

@pan3793 pan3793 Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this change, missing setting fetchSize via the option should be fine? if so, this message should be info or debug, not warning.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's fine if not set the parameter.
Will keep it as info since the default behavior changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants