
[SPARK-56846][DOCS][CONNECT] Document DataFrame column resolution behavior in spark-connect-gotchas #55848

Open

zhengruifeng wants to merge 10 commits into apache:master from zhengruifeng:SPARK-doc-col-diff

Conversation

@zhengruifeng (Contributor) commented May 13, 2026

What changes were proposed in this pull request?

Add a new gotcha section to `docs/spark-connect-gotchas.md` describing how Spark Connect resolves DataFrame column references (`df["c"]`) via plan-id tagging, and how this diverges from Spark Classic once a column has been shadowed by `withColumn` or `select`.

The new section covers:

  • The motivating example: `df.withColumn("c", sf.col("c").cast("string")).select(df["c"])` fails on both Spark Classic (`MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`) and Spark Connect (`CANNOT_RESOLVE_DATAFRAME_COLUMN`).
  • How the SQL config `spark.sql.analyzer.strictDataFrameColumnResolution` (added in Spark 4.2.0 by [SPARK-56614][SQL][CONNECT] Add config for strict DataFrame column resolution #55531, default `true`) controls Spark Connect's behavior: under strict resolution the query fails; with `strictDataFrameColumnResolution=false`, the analyzer still tries plan-id-based resolution first and falls back to name-based resolution only when that fails, so the same query succeeds.
  • A "Recommended way" subsection: when the column has been shadowed, switch to `sf.col("c")` (an untagged name reference) instead of `df["c"]` (a tagged reference to `df`'s original column). Includes both Python and Scala examples.
  • The escape-hatch note: `spark.sql.analyzer.strictDataFrameColumnResolution=false` for users who cannot change the call sites.
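The strict-vs-fallback behavior described in the second bullet can be illustrated with a small toy model. This is not Spark code: `ColumnRef`, `resolve`, and the plan-id bookkeeping below are invented stand-ins that only sketch the two-phase resolution order (plan-id match first, name-based fallback only when `strict=False`).

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class ColumnRef:
    """Toy stand-in: df["c"] carries a plan_id tag, sf.col("c") does not."""
    name: str
    plan_id: Optional[int] = None

def resolve(ref: ColumnRef, columns: Dict[str, int], strict: bool = True) -> str:
    """columns maps each output column name to the plan id that produced it.

    A tagged reference must match the producing plan id; under strict
    resolution a mismatch fails, otherwise it falls back to name lookup.
    """
    if ref.plan_id is not None:
        if columns.get(ref.name) == ref.plan_id:
            return ref.name  # plan-id-based resolution succeeded
        if strict:
            raise ValueError(f"CANNOT_RESOLVE_DATAFRAME_COLUMN: {ref.name}")
        # strict=False: fall through to name-based resolution below
    if ref.name in columns:
        return ref.name
    raise ValueError(f"UNRESOLVED_COLUMN: {ref.name}")

# "c" was shadowed: the current output of "c" comes from plan 2,
# but df["c"] is still tagged with the original plan 1.
shadowed = {"c": 2}
tagged = ColumnRef("c", plan_id=1)    # like df["c"]
untagged = ColumnRef("c")             # like sf.col("c")
```

Under this model, `resolve(untagged, shadowed)` always succeeds, `resolve(tagged, shadowed)` fails under strict resolution, and `resolve(tagged, shadowed, strict=False)` succeeds via the name-based fallback, mirroring the config behavior documented in the new section.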

Also adds a "DataFrame column references" row to the summary table at the end of the document (Eagerly resolved vs Lazily resolved against plan id), consistent with the eager/lazy framing used throughout the file.

Why are the changes needed?

The plan-id-based column resolution path is a Spark Connect-specific contract that is not documented anywhere user-facing. Users migrating workloads to Spark Connect have encountered surprises when patterns that previously "worked" stop resolving, with an error class (CANNOT_RESOLVE_DATAFRAME_COLUMN) and a config (strictDataFrameColumnResolution) whose connection to their code is not obvious. This adds explicit guidance and a code-level mitigation alongside the other Connect-vs-Classic gotchas already documented in this file.

Does this PR introduce any user-facing change?

No. Documentation-only change.

How was this patch tested?

Documentation-only change; no automated tests. Verified the markdown renders correctly and is consistent with the existing four-gotcha layout in docs/spark-connect-gotchas.md.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), claude-opus-4-7

@zhengruifeng zhengruifeng changed the title [DOCS][CONNECT] Document DataFrame column resolution behavior in spark-connect-gotchas [SPARK-56846][DOCS][CONNECT] Document DataFrame column resolution behavior in spark-connect-gotchas May 13, 2026
@zhengruifeng zhengruifeng marked this pull request as ready for review May 13, 2026 11:59
Comment thread docs/spark-connect-gotchas.md Outdated
# Common Gotchas (with Mitigations)

If you are not careful about the difference between lazy vs. eager analysis, there are four key gotchas to be aware of: 1) overwriting temporary view names, 2) capturing external variables in UDFs, 3) delayed error detection, and 4) excessive schema access on new DataFrames.
If you are not careful about the difference between lazy vs. eager analysis, there are five key gotchas to be aware of: 1) overwriting temporary view names, 2) capturing external variables in UDFs, 3) delayed error detection, 4) excessive schema access on new DataFrames, and 5) DataFrame column references after a column is shadowed.
keeping a counter here is error-prone, can we just say "there are several key gotchas ..."

Comment thread docs/spark-connect-gotchas.md Outdated

### Recommended way

If you hit any of the confusing failures mentioned above, it is recommended to switch to `sf.col` first. `sf.col("c")` is an untagged name reference that resolves against the most recent projection or `withColumn`, unlike `df["c"]`, which is a tagged reference to `df`'s original column:
`sf.col` is confusing here as we didn't mention `import pyspark.sql.functions as sf` before
