
[SPARK-56853] Improve PATH Tests #55866

Open
srielau wants to merge 3 commits into apache:master from srielau:SPARK-56853-path-qa-gaps

Conversation

@srielau (Contributor) commented May 13, 2026

What changes were proposed in this pull request?

This is a test-only PR that closes coverage gaps identified by a follow-up QA audit of the SQL Standard PATH feature delivered in SPARK-56489, SPARK-56501, SPARK-56518, SPARK-56520, SPARK-56605, SPARK-56639, SPARK-56681, and SPARK-56750. No product code is changed.

New test surfaces:

  • Spark Connect E2E (sql/connect/.../SqlPathE2ETestSuite.scala, new): SET PATH and current_path() round-trip over the gRPC client; a persisted view created under one path resolves its body under the frozen path even when the invoker switches PATH; SET PATH is rejected with UNSUPPORTED_FEATURE.SET_PATH_WHEN_DISABLED over Connect when the feature flag is off. Previously, ProtoToParsedPlanTestSuite pinned PATH_ENABLED=false for analyzer isolation and nothing exercised the feature over the wire.
  • PySpark API (python/pyspark/sql/tests/test_catalog.py): three new methods added to CatalogTestsMixin (test_path_current_path_disabled, test_path_set_path_and_current_path, test_path_set_path_rejected_when_disabled). Because the mixin is shared with test_parity_catalog.py, the same tests run under classic PySpark and under Connect parity. Previously pyspark.sql.functions.current_path had no test calling it, and SET PATH was never exercised from Python.
  • Golden SQL (sql-tests/inputs/sql-path.sql plus matching results/ and analyzer-results/ outputs): the file has a TOC and ten labeled sections: default path observability; SET PATH grammar (literal, DEFAULT_PATH, SYSTEM_PATH, PATH append, current_schema/current_database, backticks/case); CURRENT_PATH() and the ANSI no-parens form; the full set of static error conditions; routine resolution via PATH for scalar AND table functions; relation resolution via PATH; persisted view frozen-path behavior incl. invoker-context current_schema/current_path; SQL function frozen-path for scalar and table functions incl. the invoker-context guard; the spark.sql.defaultPath conf (override, expansion to conf, invalid-value rejection); and PATH-disabled behaviors. Previously zero .sql input files referenced PATH.
  • COUNT(*) → COUNT(1) rewrite gate — three tests in SetPathSuite exercise the path-driven gate: (1) the fixed-point analyzer (Analyzer.matchesFunctionName / FunctionResolution.isSessionBeforeBuiltinInPath); (2) the gate fires ONLY when a temp count exists (an unrelated temp must not affect the rewrite); (3) the single-pass-resolver counterpart (FunctionResolverUtils.isUnqualifiedCountShadowedByTemp via isNonDistinctCount / handleStarInArguments), gated under spark.sql.analyzer.singlePassResolver.enabled = true and asserted via the analyzed-plan output type (BIGINT vs INT) to avoid the DeserializeToObject / CreateSQLFunctionCommand operators the single-pass analyzer does not yet support.
  • ALTER VIEW ... WITH SCHEMA preserves the frozen path — new test in v1.AlterViewSchemaBindingSuite asserts that generateViewProperties(captureNewPath=false) keeps the persisted VIEW_RESOLUTION_PATH intact even when the caller's session PATH differs from the create-time path.
  • Catalyst unit tests — new SqlPathFormatSuite (toDescribeJson valid / malformed payloads, formatForDisplay quoting behavior, multi-level namespaces). CatalogManagerSuite is extended with direct PathElement.validateNoStaticDuplicates cases (case sensitivity, repeated CurrentSchemaEntry, literal-vs-CurrentSchemaEntry toleration, multi-part and dot-containing identifier error formatting) plus serializePathEntries round-tripping (incl. multi-level / quoted / space-containing identifiers) and deserializePathEntriesOrFail error messaging. The path-element tests live in CatalogManagerSuite rather than a separate suite because they are pure-data tests on a single-purpose helper that CatalogManager is the only consumer of; merging keeps the catalyst test-file count down.
  • PATH-disabled compat for persisted views — two new tests in SQLViewSuite pin the documented behavior of CatalogManager.resolutionPathEntriesForAnalysis when spark.sql.path.enabled=false: the pinned frozen path is dropped and analysis falls back to the recorded viewCatalogAndNamespace. The fully-qualified body path keeps working; the unqualified body resolves through the recorded namespace (or raises TABLE_OR_VIEW_NOT_FOUND when nothing in scope matches).
  • Concurrency smoke test — new SetPathSuite case runs SET PATH (alternating between two paths) and unqualified count(*) lookups concurrently for 200 iterations each, validating the in-source claim that SessionCatalog.lookupBuiltinOrTempFunction is intentionally non-synchronized to avoid lock-order inversion with CatalogManager.synchronized.
  • cloneSession() propagation matrix — replaces the in-source TODO in SetPathSuite ("audit and pin down clone semantics in a follow-up") with six tests pinning what does and does not survive spark.cloneSession(). Confirmed: stored SET PATH, USE SCHEMA, temp views, and temp functions (because functionRegistry.clone() deep-copies them in BaseSessionStateBuilder) propagate; temp variables do not; a child's SET PATH does not leak back to the parent.
  • V2 catalogs in SET PATH — new SqlPathV2CatalogSuite registers two InMemoryCatalog instances and verifies first-match resolution for unqualified tables (id=10 vs id=20 distinguishes which catalog supplied the row) and for unqualified V2 functions (a tiny StrLenTimes100 ScalarFunction[Int] returning s.length * 100 is registered on pathcat2, so strlen('abc') returns 3 or 300 depending on path order), plus the negative case where the unqualified name only lives in a catalog that is not on the path. Previously ProcedureSuite had a single V2-catalog test for the procedure side; tables and functions through real V2 catalogs were uncovered.
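The first-match semantics that SqlPathV2CatalogSuite pins down can be sketched in plain Python. This is an illustrative model only, not Spark's implementation: the catalog names (pathcat2) and the 3-vs-300 distinguishing values come from the PR description, while the Catalog class and resolve helper are hypothetical simplifications.

```python
# Minimal model of SET PATH first-match resolution for unqualified
# function names. The first catalog on the path that defines the name
# supplies the implementation, so path order changes the result.

class Catalog:
    def __init__(self, name, functions):
        self.name = name
        self.functions = functions  # function name -> callable

# pathcat1 exposes a plain strlen; pathcat2 exposes the distinguishable
# StrLenTimes100 variant described in the PR.
pathcat1 = Catalog("pathcat1", {"strlen": lambda s: len(s)})
pathcat2 = Catalog("pathcat2", {"strlen": lambda s: len(s) * 100})

def resolve(path, func_name):
    """Return the first implementation of func_name found on the path."""
    for catalog in path:
        if func_name in catalog.functions:
            return catalog.functions[func_name]
    # Negative case: the name lives in no catalog on the path.
    raise LookupError(f"unresolved routine: {func_name}")

# First-match order decides which catalog wins: 3 vs 300.
print(resolve([pathcat1, pathcat2], "strlen")("abc"))  # 3
print(resolve([pathcat2, pathcat1], "strlen")("abc"))  # 300
```

The same shape applies to the table test in the suite, where id=10 vs id=20 distinguishes which catalog supplied the row.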

Explicitly deferred:

  • DESCRIBE FUNCTION rendering of the frozen path (DescribeFunctionCommandUtils.storedResolutionPathString). DESCRIBE FUNCTION needs broader improvements first; will land in a follow-up.

Why are the changes needed?

A QA audit of the PATH feature found several substantive gaps:

  • Spark Connect serialization of AnalysisContext.resolutionPathEntries was entirely untested; a regression dropping pinned entries from the proto plan would not have been caught.
  • The public current_path() PySpark function was documented (with a +SKIP doctest) but never called from any test.
  • The PATH feature had no SQLQueryTestSuite golden file, despite golden coverage being the project convention for SQL-level features of similar surface (CURRENT_SCHEMA, variables, identifier clauses).
  • The path-driven gate for COUNT(*) → COUNT(1) was a behavior change called out in the SPARK-56750 description but had no test that flipped PATH and verified the suppression -- in either the fixed-point or the single-pass resolver.
  • ALTER VIEW ... WITH SCHEMA preserving the frozen path was undocumented in tests.
  • PathElement, SqlPathFormat, and CatalogManager.serializePathEntries were only exercised end-to-end via SQL.
  • The PATH-disabled read path through a view that already persisted a frozen path was untested (a forward-compat scenario likely to occur during rolling upgrades).
  • The deadlock-safety claim in SessionCatalog about non-synchronized lookups was unvalidated.
  • The cloneSession() propagation matrix was flagged as "incidental" with an in-source TODO.
  • Real V2 catalogs in SET PATH were not exercised end-to-end (only stub cat/cat2 for procedures).

These tests pin the documented intent so future changes have to update the assertions deliberately.
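The concurrency smoke test's shape can be illustrated with a small stdlib-only sketch. This is a hypothetical analogue, not the Spark test: a writer thread alternates a shared "path" under a lock (mirroring CatalogManager.synchronized) while a reader performs deliberately lock-free lookups (mirroring the non-synchronized lookupBuiltinOrTempFunction), and the assertion is only that readers always observe one of the two legal states.

```python
import threading

PATH_A = ("a",)
PATH_B = ("b",)
state = {"path": PATH_A}
lock = threading.Lock()
errors = []

def setter(iterations=200):
    # Writers synchronize among themselves.
    for i in range(iterations):
        with lock:
            state["path"] = PATH_A if i % 2 == 0 else PATH_B

def reader(iterations=200):
    # Intentionally lock-free reads: taking the lock here would recreate
    # the lock-order-inversion risk the real code avoids.
    for _ in range(iterations):
        seen = state["path"]
        if seen not in (PATH_A, PATH_B):
            errors.append(seen)

threads = [threading.Thread(target=setter), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert not errors  # readers never saw an invalid path
print("ok")
```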

Does this PR introduce any user-facing change?

No. Test-only changes.

How was this patch tested?

The new tests are themselves the patch. They all pass locally:

  • build/sbt "connect-client-jvm/testOnly *SqlPathE2ETestSuite"
  • build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z sql-path.sql" (and regenerated via SPARK_GENERATE_GOLDEN_FILES=1)
  • build/sbt "sql/testOnly org.apache.spark.sql.SetPathSuite"
  • build/sbt "sql/testOnly org.apache.spark.sql.execution.command.v1.AlterViewSchemaBindingSuite"
  • build/sbt "sql/testOnly *SimpleSQLViewSuite -- -z SPARK-56853"
  • build/sbt "sql/testOnly *SqlPathV2CatalogSuite"
  • build/sbt "catalyst/testOnly *SqlPathFormatSuite *CatalogManagerSuite"
  • ./dev/lint-scala passes (Scalastyle and Scalafmt).

PySpark tests (test_path_* in CatalogTestsMixin and the Connect parity variant) are syntactically validated; they are picked up automatically by the existing pyspark.sql.tests.test_catalog and pyspark.sql.tests.connect.test_parity_catalog modules registered in dev/sparktestsupport/modules.py and will run in CI.

Was this patch authored or co-authored using generative AI tooling?

Yes. Test design and implementation were iterated with coding-assistant tooling; the author reviewed and owns the final patch.

Generated-by: Cursor with Claude Opus 4.7

@cloud-fan (Contributor) left a comment


This is a test-only PR closing coverage gaps in the SQL Standard PATH feature (SPARK-56489 et al.). Coverage is broad and the in-source claims that the tests reference (Analyzer.matchesFunctionName, lookupBuiltinOrTempFunction non-synchronization, BaseSessionStateBuilder.functionRegistry.clone(), generateViewProperties(captureNewPath = false), resolutionPathEntriesForAnalysis semantics) all hold up.

A few small consistency / accuracy items inline. None blocking.

Summary of additions

  • PySpark API (current_path, SET PATH, disabled-feature error) — runs under classic and Connect parity via CatalogTestsMixin
  • Spark Connect E2E — proves AnalysisContext.resolutionPathEntries survives gRPC plan reification; persisted view frozen-path preserved across SET PATH
  • Golden SQL (sql-path.sql + analyzer/results) — 10 sections covering grammar, shortcuts, errors, frozen path, DEFAULT_PATH conf, and disabled state
  • COUNT(*) → COUNT(1) rewrite gate — suppression when session shadows builtin under SET PATH
  • ALTER VIEW ... WITH SCHEMA preserves the frozen path (v1 only; v2 uses viewInfoBuilderFrom and doesn't hit the same code path)
  • Catalyst unit tests — SqlPathFormat, PathElement.validateNoStaticDuplicates, CatalogManager.serializePathEntries round-trip / deserializePathEntriesOrFail error format
  • cloneSession matrix — 6 tests pinning what propagates: SET PATH ✓, USE ✓, temp views ✓, temp functions ✓ (with snapshot semantics check), temp variables ✗, child→parent leak ✗
  • V2 catalogs in SET PATH — table resolution (first-match verified) and function resolution
  • PATH-disabled view fallback — pinned entries dropped, falls back to viewCatalogAndNamespace
  • Concurrency smoke test — validates the non-synchronized lookupBuiltin* claim in SessionCatalog

On the "Explicitly deferred" note

The PR description says testing FunctionResolverUtils.isUnqualifiedCountShadowedByTemp "is not actionable yet" because "the single-pass resolver wiring for PATH was reverted in the SPARK-56750 follow-ups." That method still exists at FunctionResolverUtils.scala:115 and is called by isNonDistinctCount at line 107 — the wiring looks live, not reverted. Either the claim is stale (and a test parallel to the fixed-point SetPathSuite cases is in fact actionable), or there's a runtime gate that keeps this path dormant. Worth saying which one so a future author knows when the deferral resolves.

Comment thread sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala Outdated
@srielau (Contributor, Author) commented May 14, 2026

Thanks for the careful review @cloud-fan! All four items addressed in 542eee8:

  • SqlPathV2CatalogSuite first-match function test — your read was right, the two StrLen impls returned identical lengths so swapping path order proved nothing. Now uses a tiny StrLenTimes100 ScalarFunction[Int] on pathcat2 so strlen('abc') returns 3 or 300 depending on path order. Structurally symmetric with the table test (id=10 vs id=20).
  • SQLViewSuite test names — SPARK-56853: prefix restored on both new tests; matches the file-local convention (your SPARK-56639 frozen-path tests right next to them are feature tests with the prefix too, so the convention isn't bug-regression-only).
  • CatalogManagerSuite / PathElementSuite — taking the "update the PR description" option. PathElement is a small single-purpose helper whose only consumer is CatalogManager, so the seven validate-no-duplicates tests sit naturally alongside the existing serialize/deserialize cases in the sibling suite. PR description corrected to match. Happy to re-extract if you'd prefer the strict source-mirror pattern.
  • "Explicitly deferred" claim about the single-pass resolver — you were right, my claim was stale. FunctionResolverUtils.isUnqualifiedCountShadowedByTemp IS live and wired into isNonDistinctCount / handleStarInArguments; only the spark.sql.analyzer.singlePassResolver.enabled conf default keeps the path dormant in CI. Added a parallel SetPathSuite test that enables that conf for the SELECT only (CREATE TEMPORARY FUNCTION and SET PATH stay on the fixed-point analyzer because the single-pass analyzer doesn't yet support those operators) and asserts the analyzed-plan output dataType (BIGINT when the shortcut fires, INT when the gate suppresses it and the temp count(INT) RETURN x + 100 wins). Bypassing checkAnswer avoids the DeserializeToObject operator the single-pass analyzer also doesn't yet support, but the gate predicate is exercised end-to-end through the analyzer. The "deferred" bullet for this item has been removed from the PR description.
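The gate predicate the last item exercises can be modeled in a few lines of plain Python. This is an illustrative simplification of the behavior the test asserts, not Spark code: the count(*) → count(1) shortcut fires only when no temporary function named count shadows the builtin, and the analyzed output type (BIGINT from the builtin vs INT from the temp definition) reveals which path was taken.

```python
BUILTIN_COUNT_TYPE = "BIGINT"  # builtin count(1) returns a BIGINT

def analyze_count_star(temp_functions):
    """Resolve an unqualified count(*): return the result type.

    temp_functions maps temp-function name -> result type. If a temp
    function named `count` exists, the rewrite to count(1) is suppressed
    and the temp function's type wins; otherwise the builtin wins.
    """
    if "count" in temp_functions:
        return temp_functions["count"]  # gate suppresses the rewrite
    return BUILTIN_COUNT_TYPE           # shortcut fires: builtin count(1)

# No temp count registered: the rewrite fires, builtin wins.
assert analyze_count_star({}) == "BIGINT"
# An unrelated temp function must not affect the rewrite.
assert analyze_count_star({"other": "INT"}) == "BIGINT"
# A temp count shadows the builtin: the gate suppresses, INT wins.
assert analyze_count_star({"count": "INT"}) == "INT"
print("all three gate cases hold")
```

These three cases correspond to the three SetPathSuite tests described in the PR summary, with the type assertion standing in for the analyzed-plan dataType check.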

srielau added 3 commits May 14, 2026 14:05
Follow-up changes after the initial PR landed on the branch:

- Strip the SPARK-56853 JIRA tag from test names, comments, and file
  headers across the new tests. These are feature-coverage tests, not
  bug regressions, so the JIRA label does not belong inline.
- Correct the inaccurate "view's home schema" framing in the two new
  PATH-disabled compat tests in `SQLViewSuite`. The fallback target is
  the view's `viewCatalogAndNamespace` property -- i.e. the creator
  session's USE state captured at CREATE VIEW time -- not the schema
  the view physically lives in. The previous comment overgeneralized
  from a setup choice (USE-ing the destination schema before CREATE)
  to the general behavior.
- Merge `PathElementSuite` into `CatalogManagerSuite`. The seven
  `PathElement.validateNoStaticDuplicates` tests are pure data tests
  living in the same package as the consumer; folding them into the
  sibling suite keeps the test file count down without losing coverage.
  `CatalogManagerSuite` is now 16 tests grouped by section comment.
- Restructure `sql-tests/inputs/sql-path.sql` as the primary SQL-level
  reference for the feature. Adds a self-describing header, a Table of
  Contents, and ten clearly delimited sections covering: default path
  observability; SET PATH grammar (literal, DEFAULT_PATH, SYSTEM_PATH,
  PATH append, current_schema/current_database); CURRENT_PATH() builtin
  and its ANSI no-parens form; the full set of static error conditions;
  routine resolution via PATH for scalar AND table functions; relation
  resolution via PATH; persisted view frozen-path behavior including
  invoker-context current_schema/current_path; SQL function frozen-path
  behavior for scalar and table functions including the invoker-context
  guard; the `spark.sql.defaultPath` conf (explicit override, expansion,
  rejection of invalid values and the PATH keyword); and PATH-disabled
  behaviors. The corresponding `results/` and `analyzer-results/`
  golden files were regenerated and reviewed.
…ud-fan

- SqlPathV2CatalogSuite: the V2-function first-match test was using two
  `StrLen` impls that both return `s.length`, so swapping path order did
  not change the row and the test only proved "neither catalog raised
  NOT_FOUND". Add a tiny distinguishable `StrLenTimes100` ScalarFunction
  next to the suite and register it on `pathcat2` so the result is `3`
  vs `300` depending on which catalog supplies the function -- now
  symmetric with the parallel table test (id=10 vs id=20).
- SQLViewSuite: restore the `SPARK-56853:` prefix on the two new tests
  to match the file-local convention (all surrounding tests, including
  the SPARK-56639 feature tests right next to them, use the JIRA prefix).
  The PR description's own `-z SPARK-56853` command now matches.
- SetPathSuite: the single-pass-resolver counterpart of the COUNT(*)
  rewrite gate IS actionable -- `FunctionResolverUtils.isUnqualifiedCount
  ShadowedByTemp` is wired into `isNonDistinctCount` /
  `handleStarInArguments`; it is only the `singlePassResolver.enabled`
  conf default that keeps the path dormant. Add a parallel test that
  enables `spark.sql.analyzer.singlePassResolver.enabled` for the SELECT
  (CREATE TEMPORARY FUNCTION and SET PATH stay on the fixed-point
  analyzer because the single-pass analyzer does not yet support those
  operators) and asserts the analyzed-plan output type: BIGINT when the
  rewrite fires (builtin count) vs INT when the gate suppresses it and
  the temp count(INT) RETURN x + 100 wins. Bypassing `checkAnswer` avoids
  the `DeserializeToObject` operator the single-pass analyzer also does
  not yet support, while still exercising the analyzer-side gate.
@srielau force-pushed the SPARK-56853-path-qa-gaps branch from 542eee8 to 0d3c0ea on May 14, 2026 21:05