Skip to content

AIR CLI Integration: Adding support for air run configuration#5657

Open
riddhibhagwat-db wants to merge 9 commits into
air-clifrom
air-integration-m2-2
Open

AIR CLI Integration: Adding support for air run configuration#5657
riddhibhagwat-db wants to merge 9 commits into
air-clifrom
air-integration-m2-2

Conversation

@riddhibhagwat-db

Copy link
Copy Markdown

Changes

Ports the air run YAML config schema and its structural validation from the Python CLI (cli/sdk/config.py) to Go, under experimental/air/cmd/.

  • Schema (runconfig.go): the top-level runConfig plus the nested environment (with docker_image), code_source/snapshot/git, and permission blocks. Reuses the compute model from the parent branch. Includes custom YAML unmarshalers for the three polymorphic fields that don't map to a single Go type: environment.dependencies (string path or inline list), environment.version (string or int), and git.remote (bool or remote-name string).
  • Loader (runconfig_load.go): loadRunConfig decodes a YAML file with KnownFields(true) — mirroring pydantic's extra="forbid" so unknown keys are rejected — then runs the validation pass.
  • Validation: every structural rule from the Python schema — required fields, the experiment_name/mlflow_run_name task-key regex and length caps, secret-ref scope/key format, the environment docker-image/dependencies/version exclusivity rules, git branch-xor-commit and remote-requires-branch rules, code_source snapshot requirements, and include_paths relative/no-traversal checks.

Two deliberate divergences from the Python schema, both following from the training-service-only port:

  • The compute.node_pool_id / compute.pool_name fields were already dropped on the parent branch.
  • The top-level priority field is dropped here: it's a node-pool queue-ordering knob (it requires a pool in Python) with no meaning for serverless workloads.

Why

"Structural" validation (types, required fields, format/cross-field rules) needs no workspace access, so it's a self-contained, fully unit-testable unit that's worth landing on its own ahead of the launch logic. Splitting it out keeps the upcoming handle_run PR focused on orchestration rather than mixing in ~900 lines of schema.

The extra="forbid" / KnownFields behavior is load-bearing: it's what turns a typo'd or stale config key into an actionable error instead of a silently-ignored field, so it's preserved faithfully. This is stacked on air-integration-m2-1 (the compute model).

Tests

New unit tests in runconfig_test.go (62 subtests, table-driven), covering:

  • Loading a minimal config and a full-featured config (all blocks populated).
  • Each polymorphic union decoding both of its forms (dependencies string vs list, git.remote bool vs string, default-unset).
  • Unknown-field rejection at top level and nested — including explicit cases asserting the dropped priority field and the not-yet-ported bases key surface as errors.
  • Every validation rule's failure mode, plus file-level errors (missing file, empty file).

go test ./experimental/air/... passes; ./task lint-q reports 0 issues.

Add the experimental `air` command group as the Go port surface for the
Python `air` CLI. Every subcommand (run, status, list, logs, cancel,
register-image) is registered as a stub that returns a not-implemented
error; the real implementations land in later milestones.

The package lives under experimental/air/cmd (imported as aircmd), matching
the layout of the other experimental features (aitools, genie, postgres);
cmd/experimental/ keeps only the dispatcher. TEST_PACKAGES in Taskfile.yml
gains ./experimental/air/... so the unit tests keep running after the move.

Includes unit tests for the command-tree wiring and the not-implemented
stubs, plus an acceptance test exercising the stubs end-to-end.

Co-authored-by: Isaac
Rename the run-details subcommand from `status` to `get`, matching the Python
air CLI's current `air get run` naming (it replaced `get status`). Renames the
file, constructor, command name, and updates the stub/help/unimplemented tests
and goldens accordingly.

Co-authored-by: Isaac
Implement the read-only run-details command (renamed from `status` to `get`).
It fetches a job run via the Jobs API and renders the run's status, start time,
duration, retries, experiment, accelerators, dashboard URL, MLflow deep-link,
and a foreach/sweep summary. Output is the air-style {v, ts, data} JSON envelope
under -o json, or a text view.

Renames the command-level identifiers (status -> get) while keeping the run's
"status" field/label. Adds format/mlflow/sweep/output helpers with unit tests
and an acceptance test, and drops `get` from the not-implemented stub coverage.

Co-authored-by: Isaac
Add compute.go: the gpuType model and compute-block validation the upcoming
`air run` config layer depends on. Defines the canonical GPU_* accelerator
types, parseGPUType (exact, case-sensitive), gpusPerNode (partition counts),
and computeConfig.validate (positive count, multiple-of-per-node, mutually
exclusive node_pool_id/pool_name).

Co-authored-by: Isaac
The training compute config no longer supports pool placement, so remove the
node_pool_id and pool_name fields and the validation that rejected setting both.

Co-authored-by: Isaac
Port the run YAML schema and its structural validation from the Python
CLI's sdk/config.py: the top-level runConfig plus the environment,
docker_image, code_source/snapshot/git, and permission blocks. loadRunConfig
decodes a YAML file with KnownFields (mirroring pydantic extra="forbid") and
runs the validation pass.

"Structural" covers types, required fields, and format/cross-field rules that
need no workspace access. Online checks (compute pool resolution, GPU
availability), git/filesystem checks, _bases_ composition, and CLI --override
handling are deferred to later milestones.

Two deliberate divergences from the Python schema, both following from the
training-service-only port: the compute pool fields were already dropped, and
the top-level priority field is dropped here since it is a node-pool
queue-ordering knob with no meaning for serverless workloads.

Co-authored-by: Isaac
@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Waiting for approval

Could not determine reviewers from git history.
Round-robin suggestion: @apeforest

Eligible reviewers: @apeforest, @ben-hansen-db, @bfontain, @lu-wang-dl, @maggiewang-db, @panchalhp-db, @pardis-beikzadeh-db, @vinchenzo-db

Suggestions based on git history. See OWNERS for ownership rules.

@eng-dev-ecosystem-bot

Copy link
Copy Markdown
Collaborator

Integration test report

Commit: 73088e7

Run: 27781953252

Env 🟨​KNOWN 💚​RECOVERED 🙈​SKIP ✅​pass 🙈​skip Time
🟨​ aws linux 7 13 264 1014 7:04
🟨​ aws windows 7 13 266 1012 8:27
💚​ aws-ucws linux 7 13 360 928 6:14
💚​ aws-ucws windows 7 13 362 926 7:20
💚​ azure linux 1 15 267 1012 5:47
💚​ azure windows 1 15 269 1010 5:30
💚​ azure-ucws linux 1 15 365 924 6:59
💚​ azure-ucws windows 1 15 367 922 7:57
💚​ gcp linux 1 15 263 1015 6:53
💚​ gcp windows 1 15 265 1013 8:06
20 interesting tests: 13 SKIP, 7 KNOWN
Test Name aws linux aws windows aws-ucws linux aws-ucws windows azure linux azure windows azure-ucws linux azure-ucws windows gcp linux gcp windows
🟨​ TestAccept 🟨​K 🟨​K 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/invariant/no_drift 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/permissions 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions 🟨​K 🟨​K 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct 🟨​K 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 🟨​K 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions 🟨​K 🟨​K 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct 🟨​K 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 🟨​K 🟨​K 💚​R 💚​R
🙈​ TestAccept/bundle/resources/postgres_branches/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/replace_existing 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/update_protected 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/without_branch_id 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_projects/update_display_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/synced_database_tables/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/ssh/connection 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
Top 21 slowest tests (at least 2 minutes):
duration env testname
5:00 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:50 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:23 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:15 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:37 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:28 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:22 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:01 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:56 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:56 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:56 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:51 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:51 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:49 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:48 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:47 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:46 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:33 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:30 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:23 gcp linux TestSecretsPutSecretStringValue
2:22 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants