
[SPARK-56846][DOCS][CONNECT] Document DataFrame column resolution behavior in spark-connect-gotchas #55848

Open

zhengruifeng wants to merge 10 commits into apache:master from zhengruifeng:SPARK-doc-col-diff

Conversation

@zhengruifeng (Contributor) commented May 13, 2026

What changes were proposed in this pull request?

Add a new gotcha section to `docs/spark-connect-gotchas.md` describing how Spark Connect resolves DataFrame column references (`df["c"]`) via plan-id tagging, and how this diverges from Spark Classic once a column has been shadowed by `withColumn` or `select`.

The new section covers:

  • The motivating example: `df.withColumn("c", sf.col("c").cast("string")).select(df["c"])` fails on both Spark Classic (`MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`) and Spark Connect (`CANNOT_RESOLVE_DATAFRAME_COLUMN`).
  • How the SQL config `spark.sql.analyzer.strictDataFrameColumnResolution` (added in Spark 4.2.0 by [SPARK-56614][SQL][CONNECT] Add config for strict DataFrame column resolution #55531, default `true`) controls Spark Connect's behavior: under strict resolution the query fails; with `strictDataFrameColumnResolution=false`, the analyzer still tries plan-id-based resolution first and falls back to name-based resolution only when that fails, so the same query succeeds.
  • A "Recommended way" subsection: when the column has been shadowed, switch to `sf.col("c")` (an untagged name reference) instead of `df["c"]` (a tagged reference to `df`'s original column). Includes both Python and Scala examples.
  • The escape-hatch note: `spark.sql.analyzer.strictDataFrameColumnResolution=false` for users who cannot change the call sites.
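The strict-vs-fallback behavior described in the second bullet can be illustrated with a small toy model. This is not Spark code: `ColumnRef`, `resolve`, and the plan-id bookkeeping below are invented stand-ins that only sketch the two-phase resolution order (plan-id match first, name-based fallback only when `strict=False`).

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class ColumnRef:
    """Toy stand-in: df["c"] carries a plan_id tag, sf.col("c") does not."""
    name: str
    plan_id: Optional[int] = None

def resolve(ref: ColumnRef, columns: Dict[str, int], strict: bool = True) -> str:
    """columns maps each output column name to the plan id that produced it.

    A tagged reference must match the producing plan id; under strict
    resolution a mismatch fails, otherwise it falls back to name lookup.
    """
    if ref.plan_id is not None:
        if columns.get(ref.name) == ref.plan_id:
            return ref.name  # plan-id-based resolution succeeded
        if strict:
            raise ValueError(f"CANNOT_RESOLVE_DATAFRAME_COLUMN: {ref.name}")
        # strict=False: fall through to name-based resolution below
    if ref.name in columns:
        return ref.name
    raise ValueError(f"UNRESOLVED_COLUMN: {ref.name}")

# "c" was shadowed: the current output of "c" comes from plan 2,
# but df["c"] is still tagged with the original plan 1.
shadowed = {"c": 2}
tagged = ColumnRef("c", plan_id=1)    # like df["c"]
untagged = ColumnRef("c")             # like sf.col("c")
```

Under this model, `resolve(untagged, shadowed)` always succeeds, `resolve(tagged, shadowed)` fails under strict resolution, and `resolve(tagged, shadowed, strict=False)` succeeds via the name-based fallback, mirroring the config behavior documented in the new section.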

Also adds a "DataFrame column references" row to the summary table at the end of the document (Eagerly resolved vs Lazily resolved against plan id), consistent with the eager/lazy framing used throughout the file.

Why are the changes needed?

The plan-id-based column resolution path is a Spark Connect-specific contract that is not documented anywhere user-facing. Users migrating workloads to Spark Connect have encountered surprises when patterns that previously "worked" stop resolving, with an error class (CANNOT_RESOLVE_DATAFRAME_COLUMN) and a config (strictDataFrameColumnResolution) whose connection to their code is not obvious. This adds explicit guidance and a code-level mitigation alongside the other Connect-vs-Classic gotchas already documented in this file.

Does this PR introduce any user-facing change?

No. Documentation-only change.

How was this patch tested?

Documentation-only change; no automated tests. Verified the markdown renders correctly and is consistent with the existing four-gotcha layout in docs/spark-connect-gotchas.md.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), claude-opus-4-7

@zhengruifeng zhengruifeng changed the title [DOCS][CONNECT] Document DataFrame column resolution behavior in spark-connect-gotchas [SPARK-56846][DOCS][CONNECT] Document DataFrame column resolution behavior in spark-connect-gotchas May 13, 2026
@zhengruifeng zhengruifeng marked this pull request as ready for review May 13, 2026 11:59
Comment thread docs/spark-connect-gotchas.md Outdated
# Common Gotchas (with Mitigations)

If you are not careful about the difference between lazy vs. eager analysis, there are four key gotchas to be aware of: 1) overwriting temporary view names, 2) capturing external variables in UDFs, 3) delayed error detection, and 4) excessive schema access on new DataFrames.
If you are not careful about the difference between lazy vs. eager analysis, there are five key gotchas to be aware of: 1) overwriting temporary view names, 2) capturing external variables in UDFs, 3) delayed error detection, 4) excessive schema access on new DataFrames, and 5) DataFrame column references after a column is shadowed.
keeping a counter here is error-prone, can we just say "there are several key gotchas ..."

Comment thread docs/spark-connect-gotchas.md Outdated

### Recommended way

If you hit any of the confusing failures mentioned above, it is recommended to switch to `sf.col` first. `sf.col("c")` is an untagged name reference that resolves against the most recent projection or `withColumn`, unlike `df["c"]`, which is a tagged reference to `df`'s original column:
`sf.col` is confusing here as we didn't mention `import pyspark.sql.functions as sf` before
