feat(spark): unify Databricks connector auth with PAT, OAuth M2M, and OIDC federation strategies #3429
Open
rohitrsh wants to merge 1 commit into flyteorg:master from
Conversation
… OIDC federation strategies

Databricks has marked Personal Access Tokens (PATs) as a legacy auth method (https://docs.databricks.com/aws/en/dev-tools/auth/pat) and is steering customers toward OAuth machine-to-machine (M2M) and OIDC workload-identity federation. This change brings both modern auth modes to the Flyte Databricks connector and refactors the existing PAT support into a shared strategy module so all four modes resolve identically.

What is added:
* OAuth M2M (client credentials) using a per-namespace 'databricks-oauth' K8s secret with operator-level fallbacks via env vars.
* OIDC federation, Model 1: the connector pod's own projected JWT (e.g. EKS IRSA) is exchanged for a Databricks bearer token.
* OIDC federation, Model 2: per-workflow-namespace ServiceAccount discovery driven by labels and annotations on the SA. The connector mints a JWT via the Kubernetes TokenRequest API and exchanges it for a Databricks token. This preserves the existing per-namespace tenancy model that PAT customers rely on for Unity Catalog access.
* A unified DatabricksAuth strategy abstraction in 'databricks_auth.py' with auto-detection, per-strategy token caching, and token refresh on 401 responses for long-running jobs.

What changes for existing PAT users:
* This PR refactors the PAT support that was added in flyteorg#3394 from a direct function call into a 'PATAuth' strategy that lives alongside the new modes. The behaviour, env vars, and per-namespace 'databricks-token' lookup are preserved end to end. Reviewers may want to read 'connector.py' and the new tests with this refactor in mind: PAT now flows through the same 'select_auth' resolver as the new modes, so we have one extension point instead of two.
* Workflow code is unchanged. 'DatabricksV2' gains optional override fields for power users, but existing tasks keep working without edits.

Validation:
* 'pytest plugins/flytekit-spark/tests/test_databricks_auth.py plugins/flytekit-spark/tests/test_databricks_token.py plugins/flytekit-spark/tests/test_connector.py' passes (100 tests).
* End-to-end tested on an EKS test cluster against a real Databricks workspace for PAT, OAuth M2M, and OIDC Model 2.
* Pre-commit (ruff, ruff-format, codespell, pydoclint) clean on the changed plugin files.

Tracking: flyteorg/flyte#7319

Related:
* flyteorg#3394 (PAT multi-tenancy, refactored here)
* flyteorg#3392 (Databricks Serverless compute)
* flyteorg/flyte#6911 (original PAT multi-tenancy issue)

Signed-off-by: Rohit Sharma <rohitrsh@gmail.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## master #3429 +/- ##
===========================================
- Coverage 84.08% 73.19% -10.90%
===========================================
Files 388 216 -172
Lines 31182 22855 -8327
Branches 3016 3016
===========================================
- Hits 26220 16729 -9491
- Misses 4082 5287 +1205
+ Partials 880 839 -41
Tracking issue
Related to flyteorg/flyte#7319
Why are the changes needed?
Databricks has marked Personal Access Tokens (PATs) as a legacy authentication method and is steering customers toward OAuth machine-to-machine (M2M) credentials and OIDC workload-identity federation. Today the Flyte Databricks connector only supports PAT, wired directly into connector.py.
This PR brings OAuth M2M and OIDC federation to the connector while preserving the existing per-namespace tenancy story, and refactors the existing PAT path so all four modes share one extension point.
What changes were proposed in this pull request?
A unified DatabricksAuth strategy abstraction in a new databricks_auth.py module, plus three new strategies and a refactor of the existing PAT path:
* PATAuth: databricks-token k8s secret in the workflow namespace, with FLYTE_DATABRICKS_ACCESS_TOKEN fallback
* OAuthM2MAuth: client_id + client_secret from a databricks-oauth k8s secret in the workflow namespace, with env-var fallback
* OIDCConnectorIRSAAuth (Model 1)
* OIDCNamespaceSAAuth (Model 2)
Each strategy implements auth_type, get_bearer_token(session), invalidate_cache(), and describe(). select_auth(...) resolves task config -> connector env var -> auto-detect; build_auth(...) reconstructs the strategy from DatabricksJobMetadata so long-running jobs can refresh their token transparently on a 401 response.
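A minimal sketch of the shape this abstraction takes, inferred from the description above; method bodies, typing, and the resolver internals are assumptions for illustration, and the real code lives in databricks_auth.py:

```python
# Sketch only: illustrates the strategy interface named in this PR description,
# not the actual implementation in flytekitplugins/spark/databricks_auth.py.
from abc import ABC, abstractmethod


class DatabricksAuth(ABC):
    """One strategy per auth mode; every mode resolves to a bearer token."""

    @property
    @abstractmethod
    def auth_type(self) -> str:
        """'pat', 'oauth_m2m', or 'oidc_federation'."""

    @abstractmethod
    def get_bearer_token(self, session) -> str:
        """Return a cached or freshly minted Databricks bearer token."""

    def invalidate_cache(self) -> None:
        """Drop any cached token so the next call re-authenticates (used on 401)."""

    def describe(self) -> str:
        """Human-readable summary for logs, without leaking secrets."""
        return f"{self.auth_type} auth"


def select_auth(task_config, namespace: str) -> DatabricksAuth:
    # Resolution order described above (details assumed):
    # 1. explicit per-task config (DatabricksV2 override fields)
    # 2. connector-level env var
    # 3. auto-detection based on which secrets / projected tokens are present
    ...
```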
Important note for reviewers: PAT is being refactored
This PR is more than a feature addition. It refactors the PAT support that landed via #3394 from a direct call into get_databricks_token(...) into a PATAuth strategy that lives alongside the new modes. The behaviour, env vars, custom-secret-name task field, and cross-namespace lookup semantics are preserved end to end (new tests cover all of those scenarios), but the code path now flows through select_auth -> PATAuth.get_bearer_token -> get_databricks_token. If you reviewed #3394 you may want to read connector.py and the new tests/test_databricks_token.py with this lens; the goal is one extension point for all four modes instead of two.
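For orientation, the PAT resolution order could be sketched as follows; the secret key name "token" and the kubernetes-client calls are assumptions, and the real logic is the existing get_databricks_token, now wrapped by the PATAuth strategy:

```python
# Sketch of the described PAT resolution order: per-namespace secret first,
# then the operator-level env-var fallback. Not the connector's actual code.
import base64
import os

from kubernetes import client, config


def resolve_pat(namespace: str, secret_name: str = "databricks-token") -> str:
    """Return a PAT from the workflow namespace's secret, else the env fallback."""
    try:
        config.load_incluster_config()
        secret = client.CoreV1Api().read_namespaced_secret(secret_name, namespace)
        # Secret values are base64-encoded; a single "token" key is assumed here.
        return base64.b64decode(secret.data["token"]).decode()
    except Exception:
        token = os.environ.get("FLYTE_DATABRICKS_ACCESS_TOKEN")
        if not token:
            raise
        return token
```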
Files changed
* plugins/flytekit-spark/flytekitplugins/spark/databricks_auth.py (new): strategy abstraction, four strategies, select_auth, build_auth, token cache, OIDC discovery cache, validators.
* plugins/flytekit-spark/flytekitplugins/spark/connector.py: DatabricksJobMetadata carries auth context for refresh; create calls select_auth; get and delete go through a new _request_with_auth helper that refreshes on 401 and falls back to the stored token for older job metadata (see the sketch after this list).
* plugins/flytekit-spark/flytekitplugins/spark/task.py: DatabricksV2 gains optional per-task overrides (databricks_auth_type, databricks_client_id, databricks_oauth_secret, databricks_oidc_token_file, databricks_oidc_audience). All optional, all backward compatible.
* plugins/flytekit-spark/tests/test_databricks_auth.py (new): 56 tests covering every strategy, the resolver, auto-detection, fallback rules, and the OIDC Model 2 SA discovery cache.
* plugins/flytekit-spark/tests/test_databricks_token.py: refactored to call into the strategy via select_auth while preserving the multi-tenant PAT scenarios from "Add multi-tenant Databricks token support via cross-namespace K8s secrets" #3394.
* plugins/flytekit-spark/tests/test_connector.py: updated DatabricksJobMetadata constructions for the new fields.
* plugins/flytekit-spark/README.md: full Authentication section, env-var matrix, RBAC manifest for OIDC Model 2, migration guide from PAT.
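A rough sketch of the refresh-on-401 behaviour the _request_with_auth helper provides; the helper's exact signature and the use of requests here are assumptions, and only the invalidate-and-retry-once flow is taken from the description above:

```python
# Illustrative only: one retry after invalidating the cached token on HTTP 401.
import requests


def request_with_auth(auth, method: str, url: str, **kwargs) -> requests.Response:
    """Send a request with a bearer token, refreshing the token once on HTTP 401."""
    session = requests.Session()
    extra_headers = kwargs.pop("headers", {})
    response = None
    for attempt in range(2):
        token = auth.get_bearer_token(session)
        headers = {**extra_headers, "Authorization": f"Bearer {token}"}
        response = session.request(method, url, headers=headers, **kwargs)
        if response.status_code != 401:
            break
        # Token likely expired for a long-running job: drop the cached token so
        # the strategy mints/fetches a fresh one, then retry once.
        auth.invalidate_cache()
    return response
```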
Zero workflow-code changes
A core constraint of this PR is that existing Databricks workflows keep working without edits. Operators flip auth modes by setting connector-level env vars; task authors only touch DatabricksV2 fields if a single workflow needs to diverge from the connector default.
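For illustration, a per-task override might look like the following; the new field names come from the task.py change listed above, while the job settings, workspace host, and task body are placeholder assumptions:

```python
# Hypothetical usage of the new optional DatabricksV2 override fields; tasks
# that omit them keep resolving auth from the connector-level defaults.
from flytekit import task
from flytekitplugins.spark import DatabricksV2


@task(
    task_config=DatabricksV2(
        databricks_conf={
            "run_name": "example run",  # placeholder job settings
            "new_cluster": {"spark_version": "13.3.x-scala2.12", "num_workers": 2},
        },
        databricks_instance="dbc-example.cloud.databricks.com",  # placeholder host
        # New optional overrides from this PR; set only when one workflow must
        # diverge from the connector default.
        databricks_auth_type="oauth_m2m",
        databricks_oauth_secret="databricks-oauth",
    )
)
def my_spark_job() -> None:
    ...
```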
How was this patch tested?
Unit tests
The full plugin suite (pytest tests/ --ignore=tests/test_remote_register.py) shows 119 passed. The 5 failures in test_wf.py and test_pyspark_transformers.py are a pre-existing PySpark + Java 18+ compatibility issue (Exception: getSubject is not supported) on the local dev machine running Java 25; they pass on upstream CI, which uses Java 11/17. None of those tests touch the files in this PR.
End-to-end on a development EKS cluster
Each auth mode was exercised against a real Databricks workspace by patching the connector deployment with an image built from this branch:
* pat: per-namespace databricks-token k8s secret. Existing path.
* oauth_m2m: per-namespace databricks-oauth k8s secret with client_id and client_secret keys, federated against a Databricks Service Principal.
* oidc_federation, Model 2: per-workflow-namespace ServiceAccount annotated with flyte.org/databricks-enabled, flyte.org/databricks-client-id, and flyte.org/databricks-audience. The connector minted a JWT via the Kubernetes TokenRequest API, exchanged it for a Databricks bearer token, and the resulting Jobs run in the Databricks console was attributed to the per-namespace Service Principal (verifying that Unity Catalog tenancy is preserved). A sketch of that flow follows this list.
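A condensed sketch of that Model 2 exchange. Assumptions: the kubernetes Python client is used for TokenRequest, the Databricks token endpoint and token-exchange grant shown here, a ServiceAccount named "default", and "true" as the enabled-annotation value; only the annotation keys and the overall JWT-for-bearer-token flow come from the description above.

```python
# Sketch only, under the assumptions named above; not the connector's actual code.
import requests
from kubernetes import client, config


def mint_databricks_token_model2(namespace: str, workspace_host: str) -> str:
    """Exchange a per-namespace ServiceAccount JWT for a Databricks bearer token."""
    config.load_incluster_config()  # the connector pod runs in-cluster
    core = client.CoreV1Api()

    # 1. Discover the annotated ServiceAccount in the workflow namespace.
    sa = core.read_namespaced_service_account("default", namespace)
    annotations = sa.metadata.annotations or {}
    if annotations.get("flyte.org/databricks-enabled") != "true":
        raise RuntimeError(f"Namespace {namespace} is not enabled for Databricks OIDC")
    client_id = annotations["flyte.org/databricks-client-id"]
    audience = annotations["flyte.org/databricks-audience"]

    # 2. Mint a short-lived JWT for that SA via the TokenRequest API.
    token_request = client.AuthenticationV1TokenRequest(
        spec=client.V1TokenRequestSpec(audiences=[audience], expiration_seconds=600)
    )
    jwt = core.create_namespaced_service_account_token(
        sa.metadata.name, namespace, token_request
    ).status.token

    # 3. Exchange the JWT for a Databricks OAuth token (assumed endpoint/grant).
    resp = requests.post(
        f"https://{workspace_host}/oidc/v1/token",
        data={
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            "subject_token": jwt,
            "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
            "client_id": client_id,
            "scope": "all-apis",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]
```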
Pre-commit
Pre-commit (ruff, ruff-format, codespell, pydoclint) is clean on the changed plugin files; pydoclint-errors-baseline.txt was not modified by this PR.
Setup process
The plugin README's Authentication section now documents the full operator setup, including the RBAC manifest for OIDC Model 2 and a migration guide from PAT. No infra changes are required for existing PAT users; the connector defaults still resolve PAT first.
Screenshots
n/a
Related PRs
* flyteorg#3394 (PAT multi-tenancy, refactored here)
* flyteorg#3392 (Databricks Serverless compute)
* flyteorg/flyte#6911 (original PAT multi-tenancy issue)
The changes to task.py were carried through this rebase.
Docs link
The plugin README in this PR (plugins/flytekit-spark/README.md) is the documentation; there is no separate docs site change.