AIR CLI Integration: Adding support for air run configuration #5657
Open
riddhibhagwat-db wants to merge 9 commits into
Add the experimental `air` command group as the Go port surface for the Python `air` CLI. Every subcommand (run, status, list, logs, cancel, register-image) is registered as a stub that returns a not-implemented error; the real implementations land in later milestones. The package lives under experimental/air/cmd (imported as aircmd), matching the layout of the other experimental features (aitools, genie, postgres); cmd/experimental/ keeps only the dispatcher. TEST_PACKAGES in Taskfile.yml gains ./experimental/air/... so the unit tests keep running after the move. Includes unit tests for the command-tree wiring and the not-implemented stubs, plus an acceptance test exercising the stubs end-to-end. Co-authored-by: Isaac
Rename the run-details subcommand from `status` to `get`, matching the Python air CLI's current `air get run` naming (it replaced `get status`). Renames the file, constructor, command name, and updates the stub/help/unimplemented tests and goldens accordingly. Co-authored-by: Isaac
Implement the read-only run-details command (renamed from `status` to `get`).
It fetches a job run via the Jobs API and renders the run's status, start time,
duration, retries, experiment, accelerators, dashboard URL, MLflow deep-link,
and a foreach/sweep summary. Output is the air-style {v, ts, data} JSON envelope
under -o json, or a text view.
Renames the command-level identifiers (status -> get) while keeping the run's
"status" field/label. Adds format/mlflow/sweep/output helpers with unit tests
and an acceptance test, and drops `get` from the not-implemented stub coverage.
Co-authored-by: Isaac
Add compute.go: the gpuType model and compute-block validation the upcoming `air run` config layer depends on. Defines the canonical GPU_* accelerator types, parseGPUType (exact, case-sensitive), gpusPerNode (partition counts), and computeConfig.validate (positive count, multiple-of-per-node, mutually exclusive node_pool_id/pool_name). Co-authored-by: Isaac
The training compute config no longer supports pool placement, so remove the node_pool_id and pool_name fields and the validation that rejected setting both. Co-authored-by: Isaac
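With the pool fields gone, the remaining compute checks might look like the sketch below. It is written under stated assumptions rather than copied from compute.go: the accelerator names (`GPU_EXAMPLE_A`, `GPU_EXAMPLE_B`), their per-node counts, the `perNode` lookup table, and the struct field names are hypothetical. Only the rules themselves (exact, case-sensitive parsing; positive count; multiple of the per-node count) come from the commit messages.

```go
package main

import (
	"errors"
	"fmt"
)

type gpuType string

// Hypothetical accelerator names; the PR's canonical GPU_* list is not
// reproduced here.
const (
	gpuExampleA gpuType = "GPU_EXAMPLE_A"
	gpuExampleB gpuType = "GPU_EXAMPLE_B"
)

// perNode holds assumed GPUs-per-node counts for the example types.
var perNode = map[gpuType]int{gpuExampleA: 8, gpuExampleB: 1}

// parseGPUType accepts only exact, case-sensitive matches.
func parseGPUType(s string) (gpuType, error) {
	g := gpuType(s)
	if _, ok := perNode[g]; !ok {
		return "", fmt.Errorf("unknown gpu type %q", s)
	}
	return g, nil
}

// computeConfig uses hypothetical field names.
type computeConfig struct {
	GPUType gpuType
	NumGPUs int
}

// validate enforces a positive GPU count that is a multiple of the
// per-node count for the selected type.
func (c computeConfig) validate() error {
	n, ok := perNode[c.GPUType]
	if !ok {
		return fmt.Errorf("unknown gpu type %q", string(c.GPUType))
	}
	if c.NumGPUs <= 0 {
		return errors.New("gpu count must be positive")
	}
	if c.NumGPUs%n != 0 {
		return fmt.Errorf("gpu count %d is not a multiple of %d", c.NumGPUs, n)
	}
	return nil
}

func main() {
	_, err := parseGPUType("gpu_example_a") // wrong case, so rejected
	fmt.Println(err)
	fmt.Println(computeConfig{GPUType: gpuExampleA, NumGPUs: 16}.validate())
	fmt.Println(computeConfig{GPUType: gpuExampleA, NumGPUs: 12}.validate())
}
```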
Port the run YAML schema and its structural validation from the Python CLI's sdk/config.py: the top-level runConfig plus the environment, docker_image, code_source/snapshot/git, and permission blocks. loadRunConfig decodes a YAML file with KnownFields (mirroring pydantic extra="forbid") and runs the validation pass. "Structural" covers types, required fields, and format/cross-field rules that need no workspace access. Online checks (compute pool resolution, GPU availability), git/filesystem checks, _bases_ composition, and CLI --override handling are deferred to later milestones. Two deliberate divergences from the Python schema, both following from the training-service-only port: the compute pool fields were already dropped, and the top-level priority field is dropped here since it is a node-pool queue-ordering knob with no meaning for serverless workloads. Co-authored-by: Isaac
Waiting for approval
Could not determine reviewers from git history. See OWNERS for ownership rules.
Integration test report
Commit: 73088e7
20 interesting tests: 13 SKIP, 7 KNOWN
Changes
Ports the air run YAML config schema and its structural validation from the Python CLI (cli/sdk/config.py) to Go, under experimental/air/cmd/.
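Two deliberate divergences from the Python schema, both following from the training-service-only port:
- The compute pool fields (node_pool_id, pool_name) were already dropped in the compute model.
- The top-level priority field is dropped, since it is a node-pool queue-ordering knob with no meaning for serverless workloads.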
Why
"Structural" validation (types, required fields, format/cross-field rules) needs no workspace access, so it's a self-contained, fully unit-testable unit that's worth landing on its own ahead of the launch logic. Splitting it out keeps the upcoming handle_run PR focused on orchestration rather than mixing in ~900 lines of schema.
The extra="forbid" / KnownFields behavior is load-bearing: it's what turns a typo'd or stale config key into an actionable error instead of a silently-ignored field, so it's preserved faithfully. This is stacked on air-integration-m2-1 (the compute model).
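To show why strict decoding matters, here is a small stand-in using only the standard library. The PR decodes YAML with `KnownFields`; this sketch swaps in `encoding/json` and its `DisallowUnknownFields`, which plays the analogous role for JSON. The `runConfig` fields and the `decodeStrict` helper are hypothetical and chosen only to demonstrate a typo'd key.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// runConfig is a hypothetical two-field schema used only for this demo.
type runConfig struct {
	Name    string `json:"name"`
	Command string `json:"command"`
}

// decodeStrict rejects any key that does not map to a struct field,
// rather than silently ignoring it.
func decodeStrict(raw []byte) (runConfig, error) {
	var cfg runConfig
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&cfg); err != nil {
		return runConfig{}, fmt.Errorf("invalid run config: %w", err)
	}
	return cfg, nil
}

func main() {
	// "comand" is a typo for "command"; strict decoding surfaces it.
	_, err := decodeStrict([]byte(`{"name": "demo", "comand": "python train.py"}`))
	fmt.Println(err)

	cfg, err := decodeStrict([]byte(`{"name": "demo", "command": "python train.py"}`))
	fmt.Println(cfg.Command, err)
}
```

Without the strict flag, the first document would decode cleanly with an empty `Command`, which is the silent failure the description warns about.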
Tests
New unit tests in runconfig_test.go (62 subtests, table-driven).
`go test ./experimental/air/...` passes; `./task lint-q` reports 0 issues.