[SPARK-56846][DOCS][CONNECT] Document DataFrame column resolution behavior in spark-connect-gotchas#55848
Open
zhengruifeng wants to merge 10 commits into
Conversation
…k-connect-gotchas Generated-by: Claude Code (Anthropic), claude-opus-4-7
…gful Generated-by: Claude Code (Anthropic), claude-opus-4-7
cloud-fan reviewed May 14, 2026
# Common Gotchas (with Mitigations)

- If you are not careful about the difference between lazy vs. eager analysis, there are four key gotchas to be aware of: 1) overwriting temporary view names, 2) capturing external variables in UDFs, 3) delayed error detection, and 4) excessive schema access on new DataFrames.
+ If you are not careful about the difference between lazy vs. eager analysis, there are five key gotchas to be aware of: 1) overwriting temporary view names, 2) capturing external variables in UDFs, 3) delayed error detection, 4) excessive schema access on new DataFrames, and 5) DataFrame column references after a column is shadowed.
Contributor
keeping a counter here is error-prone, can we just say "there are several key gotchas ..."
cloud-fan reviewed May 14, 2026
### Recommended way

+ If you hit any of the confusing failures mentioned above, it is recommended to switch to `sf.col` first. `sf.col("c")` is an untagged name reference that resolves against the most recent projection or `withColumn`, rather than `df["c"]`, which is a tagged reference to `df`'s original column:
Contributor
`sf.col` is confusing here as we didn't mention `import pyspark.sql.functions as sf` before
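The tagged-vs-untagged distinction discussed here can be sketched with a toy resolution model. This is illustrative only: `Plan`, `resolve_untagged`, and `resolve_tagged` are hypothetical names invented for the sketch, and this is not Spark's actual implementation.

```python
# Toy model of the two reference kinds (NOT Spark's implementation).
# A tagged reference (df["c"]) remembers which plan it came from; an
# untagged name reference (sf.col("c")) resolves against whichever plan
# it is used in.
import itertools

_ids = itertools.count()

class Plan:
    def __init__(self, columns=None):
        self.plan_id = next(_ids)
        # output columns: name -> plan_id of the plan that produced them
        self.columns = dict(columns or {})

# base DataFrame df with a column "c" that df itself produces
df = Plan()
df.columns["c"] = df.plan_id

# withColumn("c", ...) yields a new plan that re-defines ("shadows") "c"
df2 = Plan(df.columns)
df2.columns["c"] = df2.plan_id

def resolve_untagged(plan, name):
    """Like sf.col(name): plain name lookup against the current plan."""
    return plan.columns[name]

def resolve_tagged(plan, name, tagged_plan_id):
    """Like df[name]: the column must still be owned by the tagged plan."""
    if plan.columns.get(name) == tagged_plan_id:
        return tagged_plan_id
    raise LookupError("CANNOT_RESOLVE_DATAFRAME_COLUMN")  # modeled error

# sf.col("c") resolves to the shadowing column on the newest plan
assert resolve_untagged(df2, "c") == df2.plan_id

# df["c"] (tagged to the original df) no longer resolves after shadowing
try:
    resolve_tagged(df2, "c", df.plan_id)
except LookupError as e:
    print(e)  # CANNOT_RESOLVE_DATAFRAME_COLUMN
```

The point of the model: after the shadowing `withColumn`, the name lookup still succeeds while the plan-id-tagged lookup cannot, which mirrors why the doc recommends switching to `sf.col`.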
cloud-fan approved these changes May 14, 2026
What changes were proposed in this pull request?
Add a new gotcha section to `docs/spark-connect-gotchas.md` describing how Spark Connect resolves DataFrame column references (`df["c"]`) via plan-id tagging, and how this diverges from Spark Classic once a column has been shadowed by `withColumn` or `select`. The new section covers:

- `df.withColumn("c", sf.col("c").cast("string")).select(df["c"])` fails on both Spark Classic (`MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`) and Spark Connect (`CANNOT_RESOLVE_DATAFRAME_COLUMN`).
- `spark.sql.analyzer.strictDataFrameColumnResolution` (added in Spark 4.2.0 by [SPARK-56614][SQL][CONNECT] Add config for strict DataFrame column resolution #55531, default `true`) controls Spark Connect's behavior: under strict resolution the query fails; with `strictDataFrameColumnResolution=false`, the analyzer still tries plan-id-based resolution first and only falls back to name-based resolution when that fails, causing the same query to succeed.
- Recommends `sf.col("c")` (an untagged name reference) instead of `df["c"]` (a tagged reference to `df`'s original column) when the column has been shadowed. Includes both Python and Scala examples.
- Notes `spark.sql.analyzer.strictDataFrameColumnResolution=false` as an option for users who cannot change the call sites.

Also adds a "DataFrame column references" row to the summary table at the end of the document (Eagerly resolved vs. Lazily resolved against plan id), consistent with the eager/lazy framing used throughout the file.
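The strict vs. non-strict fallback described above can be sketched as a toy helper. This is a hypothetical model, not Spark's analyzer; `resolve` and its arguments are invented for illustration.

```python
# Toy sketch of the resolution fallback (NOT Spark's analyzer).
# strict=True: plan-id resolution must succeed, or the query fails.
# strict=False: try plan-id resolution first, then fall back to name lookup.

def resolve(plan_columns, name, tagged_plan_id, strict=True):
    """plan_columns maps column name -> plan_id that owns it in the
    current plan's output; tagged_plan_id is the plan df[name] points at."""
    if plan_columns.get(name) == tagged_plan_id:
        return "resolved-by-plan-id"
    if strict:
        raise LookupError("CANNOT_RESOLVE_DATAFRAME_COLUMN")
    if name in plan_columns:
        return "resolved-by-name"  # fallback only when plan-id lookup fails
    raise LookupError("UNRESOLVED_COLUMN")

# "c" has been shadowed: it is now owned by the withColumn plan (id 2),
# while df["c"] is still tagged to the original plan (id 1)
cols = {"c": 2}

# strict resolution: the tagged reference fails
try:
    resolve(cols, "c", tagged_plan_id=1, strict=True)
except LookupError as e:
    print(e)  # CANNOT_RESOLVE_DATAFRAME_COLUMN

# non-strict: falls back to name-based resolution, so the same query succeeds
print(resolve(cols, "c", tagged_plan_id=1, strict=False))  # resolved-by-name
```

This mirrors the documented behavior: the same shadowed `df["c"]` fails under the default strict setting and succeeds when the config is flipped to `false`.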
Why are the changes needed?
The plan-id-based column resolution path is a Spark Connect-specific contract that is not documented anywhere user-facing. Users migrating workloads to Spark Connect have encountered surprises when patterns that previously "worked" stop resolving, with an error class (`CANNOT_RESOLVE_DATAFRAME_COLUMN`) and a config (`strictDataFrameColumnResolution`) whose connection to their code is not obvious. This adds explicit guidance and a code-level mitigation alongside the other Connect-vs-Classic gotchas already documented in this file.

Does this PR introduce any user-facing change?
No. Documentation-only change.
How was this patch tested?
Documentation-only change; no automated tests. Verified the markdown renders correctly and is consistent with the existing four-gotcha layout in `docs/spark-connect-gotchas.md`.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), claude-opus-4-7