
feat(spark): unify Databricks connector auth with PAT, OAuth M2M, and OIDC federation strategies #3429

Open

rohitrsh wants to merge 1 commit into flyteorg:master from rohitrsh:feat/databricks-m2m-oidc-auth

Conversation

@rohitrsh
Contributor

Tracking issue

Related to flyteorg/flyte#7319

Why are the changes needed?

Databricks has marked Personal Access Tokens (PATs) as a legacy authentication method and is steering customers toward OAuth machine-to-machine (M2M) credentials and OIDC workload-identity federation. Today the Flyte Databricks connector only supports PAT, which means operators who need to follow that guidance, or whose workspaces restrict PATs, have no supported alternative.

This PR brings OAuth M2M and OIDC federation to the connector while preserving the existing per-namespace tenancy story, and refactors the existing PAT path so all four modes share one extension point.

What changes were proposed in this pull request?

A unified DatabricksAuth strategy abstraction in a new databricks_auth.py module, plus three new strategies and a refactor of the existing PAT path:

| Strategy | What it uses | Where credentials live |
| --- | --- | --- |
| PATAuth | Personal Access Token (existing flow) | databricks-token k8s secret in the workflow namespace, with FLYTE_DATABRICKS_ACCESS_TOKEN fallback |
| OAuthM2MAuth | Service Principal client_id + client_secret | databricks-oauth k8s secret in the workflow namespace, with env-var fallback |
| OIDCConnectorIRSAAuth (Model 1) | The connector pod's own projected JWT (e.g. EKS IRSA) | Federation policy in Databricks bound to the connector SA |
| OIDCNamespaceSAAuth (Model 2) | Per-workflow-namespace ServiceAccount JWT minted via the Kubernetes TokenRequest API | Federation policy in Databricks bound to the workflow SA, discovered from SA labels and annotations |

Each strategy implements auth_type, get_bearer_token(session), invalidate_cache(), and describe(). select_auth(...) resolves task config -> connector env var -> auto-detect; build_auth(...) reconstructs the strategy from DatabricksJobMetadata so long-running jobs can refresh their token transparently on a 401 response.
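To make the extension point concrete, here is a minimal sketch of the interface using the method names listed above; the exact signatures, return types, sync-vs-async style, and resolver arguments in databricks_auth.py are assumptions for illustration only.

```python
# Illustrative sketch: method names follow the PR description, but signatures,
# return types, and resolver arguments are assumptions, not the real module.
from abc import ABC, abstractmethod


class DatabricksAuth(ABC):
    """Single extension point shared by PAT, OAuth M2M, and both OIDC modes."""

    @property
    @abstractmethod
    def auth_type(self) -> str:
        """Short identifier, e.g. 'pat', 'oauth_m2m', or 'oidc_federation'."""

    @abstractmethod
    def get_bearer_token(self, session) -> str:
        """Return a Databricks bearer token, typically served from a per-strategy cache."""

    @abstractmethod
    def invalidate_cache(self) -> None:
        """Drop any cached token so the next call fetches a fresh one (used after a 401)."""

    @abstractmethod
    def describe(self) -> str:
        """Human-readable summary for logs that never includes secret material."""


def select_auth(task_custom: dict, env: dict) -> DatabricksAuth:
    """Resolve the strategy in order: task config, then connector env var, then auto-detect."""
    raise NotImplementedError  # placeholder; the real resolver lives in databricks_auth.py
```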

Important note for reviewers: PAT is being refactored

This PR is more than a feature addition. It refactors the PAT support that landed via #3394 from a direct call into get_databricks_token(...) into a PATAuth strategy that lives alongside the new modes. The behaviour, env vars, custom-secret-name task field, and cross-namespace lookup semantics are preserved end to end (new tests cover all of those scenarios), but the code path now flows through select_auth -> PATAuth.get_bearer_token -> get_databricks_token. If you reviewed #3394 you may want to read connector.py and the new tests/test_databricks_token.py with this lens; the goal is one extension point for all four modes instead of two.

Files changed

  • plugins/flytekit-spark/flytekitplugins/spark/databricks_auth.py (new): strategy abstraction, four strategies, select_auth, build_auth, token cache, OIDC discovery cache, validators.
  • plugins/flytekit-spark/flytekitplugins/spark/connector.py: DatabricksJobMetadata carries auth context for refresh; create calls select_auth; get and delete go through a new _request_with_auth helper that refreshes on 401 (sketched after this list) and falls back to the stored token for older job metadata.
  • plugins/flytekit-spark/flytekitplugins/spark/task.py: DatabricksV2 gains optional per-task overrides (databricks_auth_type, databricks_client_id, databricks_oauth_secret, databricks_oidc_token_file, databricks_oidc_audience). All optional, all backward compatible.
  • plugins/flytekit-spark/tests/test_databricks_auth.py (new): 56 tests covering every strategy, the resolver, auto-detection, fallback rules, and the OIDC Model 2 SA discovery cache.
  • plugins/flytekit-spark/tests/test_databricks_token.py: refactored to call into the strategy via select_auth while preserving the multi-tenant PAT scenarios from #3394 (Add multi-tenant Databricks token support via cross-namespace K8s secrets).
  • plugins/flytekit-spark/tests/test_connector.py: updated DatabricksJobMetadata constructions for the new fields.
  • plugins/flytekit-spark/README.md: full Authentication section, env-var matrix, RBAC manifest for OIDC Model 2, migration guide from PAT.
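For reviewers skimming connector.py, the refresh-on-401 behaviour mentioned above is roughly the following; this is a hedged, synchronous requests-style sketch, whereas the real _request_with_auth may be async and also handles the fallback to the token stored in older DatabricksJobMetadata, which is omitted here.

```python
# Hedged sketch of the refresh-on-401 flow; not the connector's actual helper.
import requests


def request_with_auth(session: requests.Session, method: str, url: str, auth) -> requests.Response:
    """Send a Databricks REST call, refreshing the bearer token once on HTTP 401."""
    token = auth.get_bearer_token(session)
    resp = session.request(method, url, headers={"Authorization": f"Bearer {token}"})
    if resp.status_code == 401:
        # Long-running jobs can outlive an OAuth/OIDC token: drop the cache and retry once.
        auth.invalidate_cache()
        token = auth.get_bearer_token(session)
        resp = session.request(method, url, headers={"Authorization": f"Bearer {token}"})
    return resp
```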

Zero workflow-code changes

A core constraint of this PR is that existing Databricks workflows keep working without edits. Operators flip auth modes by setting connector-level env vars; task authors only touch DatabricksV2 fields if a single workflow needs to diverge from the connector default.
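As an illustration of that per-task escape hatch, a single workflow could override the connector default roughly as below; the cluster settings and identifiers are placeholders, and the databricks_* auth fields are the new optional ones introduced by this PR rather than released flytekit API.

```python
# Hypothetical per-task override using the optional fields added in this PR;
# values are placeholders, not working configuration.
from flytekit import task
from flytekitplugins.spark import DatabricksV2


@task(
    task_config=DatabricksV2(
        spark_conf={"spark.driver.memory": "2g"},
        databricks_instance="my-workspace.cloud.databricks.com",
        databricks_conf={
            "run_name": "flyte-demo",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "m5.xlarge",
                "num_workers": 2,
            },
        },
        # New optional fields from this PR; omit them to keep the connector default.
        databricks_auth_type="oauth_m2m",
        databricks_client_id="<service-principal-client-id>",
        databricks_oauth_secret="databricks-oauth",
    )
)
def my_spark_job() -> None:
    ...
```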

How was this patch tested?

Unit tests

$ cd plugins/flytekit-spark
$ pytest tests/test_databricks_auth.py tests/test_databricks_token.py tests/test_connector.py
============================ 100 passed in 3.04s ============================

The full plugin suite (pytest tests/ --ignore=tests/test_remote_register.py) shows 119 passed. The 5 failures in test_wf.py and test_pyspark_transformers.py are a pre-existing PySpark + Java 18+ compatibility issue (Exception: getSubject is not supported) on the local dev machine running Java 25; they pass on upstream CI which uses Java 11/17. None of those tests touch the files in this PR.

End-to-end on a development EKS cluster

Each auth mode was exercised against a real Databricks workspace by patching the connector deployment with an image built from this branch:

  • pat: per-namespace databricks-token k8s secret. Existing path.
  • oauth_m2m: per-namespace databricks-oauth k8s secret with client_id and client_secret keys, authenticating as a Databricks Service Principal.
  • oidc_federation Model 2: per-workflow-namespace ServiceAccount annotated with flyte.org/databricks-enabled, flyte.org/databricks-client-id, and flyte.org/databricks-audience. The connector minted a JWT via the Kubernetes TokenRequest API and exchanged it for a Databricks bearer token (a hedged sketch of that exchange follows this list), and the resulting job run in the Databricks console was attributed to the per-namespace Service Principal, verifying that Unity Catalog tenancy is preserved.
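For context on that Model 2 exchange, the flow is essentially the standard OAuth 2.0 token exchange; the endpoint path, scope, and parameter names below are assumptions based on Databricks' documented OAuth endpoints, not code from this PR, and minting the ServiceAccount JWT via TokenRequest is not shown.

```python
# Hedged sketch of exchanging a Kubernetes ServiceAccount JWT for a Databricks
# bearer token via OAuth 2.0 token exchange; endpoint and parameters are assumptions.
import requests


def exchange_sa_jwt(workspace_url: str, sa_jwt: str) -> str:
    """Trade a projected ServiceAccount JWT for a Databricks access token."""
    resp = requests.post(
        f"{workspace_url}/oidc/v1/token",
        data={
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
            "subject_token": sa_jwt,
            "scope": "all-apis",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]
```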

Pre-commit

$ pre-commit run --files <the seven changed files>
ruff..................Passed
ruff-format...........Passed
codespell.............Passed
pydoclint.............Passed

pydoclint-errors-baseline.txt was not modified by this PR.

Setup process

The plugin README's Authentication section now documents the full operator setup, including the RBAC manifest for OIDC Model 2 and a migration guide from PAT. No infra changes are required for existing PAT users; the connector defaults still resolve PAT first.

Screenshots

n/a

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

The plugin README in this PR (plugins/flytekit-spark/README.md) is the documentation; there is no separate docs site change.

feat(spark): unify Databricks connector auth with PAT, OAuth M2M, and OIDC federation strategies

Databricks has marked Personal Access Tokens (PATs) as a legacy auth
method (https://docs.databricks.com/aws/en/dev-tools/auth/pat) and is
steering customers toward OAuth machine-to-machine (M2M) and OIDC
workload-identity federation. This change brings both modern auth modes
to the Flyte Databricks connector and refactors the existing PAT support
into a shared strategy module so all four modes resolve identically.

What is added:

* OAuth M2M (client credentials) using a per-namespace 'databricks-oauth'
  K8s secret with operator-level fallbacks via env vars.
* OIDC federation, Model 1: the connector pod's own projected JWT
  (e.g. EKS IRSA) is exchanged for a Databricks bearer token.
* OIDC federation, Model 2: per-workflow-namespace ServiceAccount discovery
  driven by labels and annotations on the SA. The connector mints a JWT via
  the Kubernetes TokenRequest API and exchanges it for a Databricks token.
  This preserves the existing per-namespace tenancy model that PAT
  customers rely on for Unity Catalog access.
* A unified DatabricksAuth strategy abstraction in 'databricks_auth.py'
  with auto-detection, per-strategy token caching, and token refresh on
  401 responses for long-running jobs.

What changes for existing PAT users:

* This PR refactors the PAT support that was added in
  flyteorg#3394 from a direct function call into a 'PATAuth'
  strategy that lives alongside the new modes. The behaviour, env vars,
  and per-namespace 'databricks-token' lookup are preserved end-to-end.
  Reviewers may want to read 'connector.py' and the new tests with this
  refactor in mind: PAT now flows through the same 'select_auth' resolver
  as the new modes so we have one extension point instead of two.
* Workflow code is unchanged. 'DatabricksV2' gains optional override
  fields for power users, but existing tasks keep working without edits.

Validation:

* 'pytest plugins/flytekit-spark/tests/test_databricks_auth.py
   plugins/flytekit-spark/tests/test_databricks_token.py
   plugins/flytekit-spark/tests/test_connector.py' passes (100 tests).
* End-to-end tested on an EKS test cluster against a real Databricks
  workspace for PAT, OAuth M2M, and OIDC Model 2.
* Pre-commit (ruff, ruff-format, codespell, pydoclint) clean on the
  changed plugin files.

Tracking: flyteorg/flyte#7319

Related:

* flyteorg#3394 (PAT multi-tenancy, refactored here)
* flyteorg#3392 (Databricks Serverless compute)
* flyteorg/flyte#6911 (original PAT multi-tenancy issue)

Signed-off-by: Rohit Sharma <rohitrsh@gmail.com>
@codecov

codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.19%. Comparing base (39d4a9f) to head (ba19fbf).
⚠️ Report is 1 commit behind head on master.

❗ There is a different number of reports uploaded between BASE (39d4a9f) and HEAD (ba19fbf). Click for more details.

HEAD has 104 fewer uploads than BASE.

| Flag | BASE (39d4a9f) | HEAD (ba19fbf) |
| --- | --- | --- |
|  | 105 | 1 |
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #3429       +/-   ##
===========================================
- Coverage   84.08%   73.19%   -10.90%     
===========================================
  Files         388      216      -172     
  Lines       31182    22855     -8327     
  Branches     3016     3016               
===========================================
- Hits        26220    16729     -9491     
- Misses       4082     5287     +1205     
+ Partials      880      839       -41     

☔ View full report in Codecov by Sentry.
