[SPARK-36458][ML] Allow approxSimilarityJoin and approxNearestNeighbors to work without inputCol #55854

Open
Phucvt123 wants to merge 1 commit into apache:master from Phucvt123:SPARK-36458

Phucvt123 commented May 13, 2026

What changes were proposed in this pull request?

Modified LSHModel so that approxSimilarityJoin and approxNearestNeighbors check whether inputCol exists before using it. When inputCol is missing (for example, because the user already transformed the dataset and dropped the column), the methods now fall back to a hash-based approximate distance computed from outputCol instead of throwing an error.
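The description above does not spell out how the hash-based approximate distance is computed. As a minimal sketch, and only under the assumption that the fallback for MinHash estimates Jaccard distance as the fraction of hash tables whose values differ, the idea can be illustrated in plain Python. The function name `hash_based_distance` and the flat integer signatures are hypothetical and are not taken from the patch.

```python
from typing import Sequence


def hash_based_distance(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Estimate Jaccard distance from two MinHash signatures.

    Assumption (not confirmed by the PR): the distance is the fraction of
    hash tables whose values differ between the two signatures.
    """
    if len(sig_a) != len(sig_b) or len(sig_a) == 0:
        raise ValueError("signatures must be non-empty and of equal length")
    # Count the hash tables where the two rows landed in different buckets.
    mismatches = sum(1 for a, b in zip(sig_a, sig_b) if a != b)
    return mismatches / len(sig_a)


# Hypothetical signatures from a model with four hash tables.
row_a = [17, 42, 8, 99]
row_b = [17, 42, 5, 99]
estimated_distance = hash_based_distance(row_a, row_b)
```

With more hash tables the estimate becomes less coarse, which is one reason a result computed this way can differ from the exact distance obtained when inputCol is available.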

Why are the changes needed?

Users who pre-compute hashes and drop the original features column to save memory/storage cannot use approxSimilarityJoin or approxNearestNeighbors. The documentation implies this should work, but the current implementation always requires inputCol.

Does this PR introduce any user-facing change?

Yes. approxSimilarityJoin and approxNearestNeighbors now accept datasets that contain only the outputCol (hashes) without inputCol. When inputCol is present, behavior is unchanged (exact distance). When inputCol is absent, the hash-based approximate distance is used and a warning is logged.
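The user-facing behavior described here amounts to a column check followed by a choice of distance mode. The following is a rough sketch of that decision in plain Python, not the actual LSHModel code: the function `select_distance_mode`, the logger name, and the error raised when neither column is present are all assumptions made for illustration.

```python
import logging
from typing import Iterable

logger = logging.getLogger("lsh_sketch")  # hypothetical logger name


def select_distance_mode(columns: Iterable[str], input_col: str, output_col: str) -> str:
    """Return "exact" when input_col is available, otherwise "hash".

    Raising when neither column exists is an assumption of this sketch;
    the PR description does not cover that case.
    """
    available = set(columns)
    if input_col in available:
        # Unchanged behavior: exact distance on the original features.
        return "exact"
    if output_col in available:
        # Fallback described in the PR: hash-based distance, with a warning.
        logger.warning(
            "Column %r not found; using hash-based approximate distance from %r.",
            input_col,
            output_col,
        )
        return "hash"
    raise ValueError(f"dataset has neither {input_col!r} nor {output_col!r}")


# A dataset that kept only its id and hashes after transform.
mode = select_distance_mode(["id", "hashes"], input_col="features", output_col="hashes")
```

Keeping the exact path first means datasets that still carry inputCol see no change, which matches the compatibility claim in the description.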

How was this patch tested?

Added 3 unit tests in MinHashLSHSuite:

  • approxSimilarityJoin without inputCol
  • approxNearestNeighbors without inputCol
  • approxSimilarityJoin self-join without inputCol

All 17 tests pass (14 existing + 3 new).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic Claude Opus 4.6)

…rs to work without inputCol

## What changes were proposed in this pull request?

Modified LSHModel to conditionally check for the existence of inputCol
in approxSimilarityJoin and approxNearestNeighbors. When inputCol is
missing (user already transformed and dropped it), the methods now fall
back to hash-based approximate distance using outputCol instead of
throwing an error.

## Why are the changes needed?

Users who pre-compute hashes and drop the original features column to
save memory/storage cannot use approxSimilarityJoin or
approxNearestNeighbors. The documentation implies this should work but
the current implementation always requires inputCol.

## Does this PR introduce any user-facing change?

Yes. approxSimilarityJoin and approxNearestNeighbors now accept
datasets that contain only the outputCol (hashes) without inputCol.
When inputCol is present, behavior is unchanged (exact distance).
When inputCol is absent, hash-based approximate distance is used.

## How was this patch tested?

Added 3 unit tests in MinHashLSHSuite:
- approxSimilarityJoin without inputCol
- approxNearestNeighbors without inputCol
- approxSimilarityJoin self-join without inputCol

All existing tests continue to pass.

Closes #SPARK-36458