feat: Implement kaggle benchmark client#955

Draft
dolaameng wants to merge 2 commits into main from dolaameng/benchmarks-cli-v2

Conversation


@dolaameng dolaameng commented Apr 3, 2026

Benchmarks CLI Reference

The benchmarks CLI manages benchmark tasks — registering evaluation code, scheduling runs against models, monitoring progress, and downloading results.

Aliases: kaggle benchmarks or kaggle b

All task subcommands are under kaggle benchmarks tasks (alias: kaggle b t).


Commands

push — Register a task

Upload a Python source file as a benchmark task definition. The file is expected to be a .py file with percent delimiters (e.g., # %%). The CLI converts it to an .ipynb file before uploading. If the task already exists, it creates a new version.
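
For illustration, a minimal task file in percent format might look like the sketch below. The `task` decorator here is a self-contained stand-in so the example runs on its own; in a real task file the decorator would come from the benchmark framework instead.

```python
# tasks/math_eval.py - illustrative percent-format task file (a sketch)

# %% [markdown]
# # math-eval benchmark task

# %%
# Stand-in decorator so this sketch is self-contained. The real decorator
# (imported from the benchmark framework) is an assumption here.
def task(name=None):
    def wrap(fn):
        fn.task_name = name or fn.__name__
        return fn
    return wrap

# %%
@task(name="math-eval")
def math_eval(model):
    """Score the given model on a small set of math prompts."""
    return {"accuracy": 0.0}

print(math_eval.task_name)  # math-eval
```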

kaggle b t push <task> -f <file>

| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| file | -f, --file | Yes | Path to the Python source file defining the task |

Behavior:

  1. Validates the file exists and has a .py extension.
  2. Reads the source file and parses it with Python's ast module to extract task names from @task decorators (supports both @task and @kbench.task styles, as well as @task(name="...") with explicit names).
  3. Validates that the file contains at least one @task decorator. If none are found, raises ValueError and stops.
  4. Validates that the given task name matches one of the decorated functions in the file.
  5. Checks the server for an existing task with the same slug:
    • If the task exists and its creation_state is QUEUED or RUNNING (i.e. a previous version is still being built), the push is rejected with ValueError.
    • If the task exists and is in COMPLETED or ERRORED state, the push proceeds (creates a new version).
    • If the task does not exist (404), the push proceeds (creates a new task).
  6. Converts the .py file content to .ipynb format (Jupyter Notebook) using jupytext (assuming percent format).
  7. Sends the notebook content (JSON string) to create_benchmark_task.
  8. On success, prints the task slug and its URL.
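
Steps 2-4 can be sketched with Python's ast module roughly as follows; `extract_task_names` is an illustrative helper, not necessarily the PR's exact code. The .py to .ipynb conversion in step 6 would then be roughly `jupytext.writes(jupytext.reads(source, fmt="py:percent"), fmt="ipynb")`.

```python
import ast

def extract_task_names(source: str) -> list[str]:
    """Collect names of functions decorated with @task or @kbench.task,
    honoring an explicit @task(name="...") keyword when present."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for decorator in node.decorator_list:
            # @task(...) is an ast.Call; plain @task is a bare name/attribute.
            func = decorator.func if isinstance(decorator, ast.Call) else decorator
            is_task = (isinstance(func, ast.Name) and func.id == "task") or (
                isinstance(func, ast.Attribute) and func.attr == "task")
            if not is_task:
                continue
            name = node.name  # default to the function's own name
            if isinstance(decorator, ast.Call):
                for kw in decorator.keywords:
                    if kw.arg == "name" and isinstance(kw.value, ast.Constant):
                        name = kw.value.value
            names.append(name)
    return names

source = '''
@task
def math_eval(model): ...

@kbench.task(name="code-eval")
def code_task(model): ...
'''
print(extract_task_names(source))  # ['math_eval', 'code-eval']
```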

Errors:

  • ValueError: File <path> does not exist — file path is invalid.
  • ValueError: File <path> must be a .py file — file is not a Python file.
  • ValueError: No @task decorators found in file <path>. The file must define at least one task. — the file does not contain any @task-decorated functions.
  • ValueError: Task '<name>' not found in file <path>. Found tasks: ... — the task name doesn't match any @task-decorated function in the file.
  • ValueError: Task '<name>' is currently being created (pending). Cannot push now. — a previous version of this task is still being processed by the server.
  • HTTPError — server-side error (e.g. authentication failure, permission denied).

Example:

kaggle b t push math-eval -f tasks/math_eval.py

list — List tasks

Display all benchmark tasks, optionally filtered by name pattern or creation status.

kaggle b t list [--regex <pattern>] [--status <status>]

| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| regex | --regex | No | Filter task names by regular expression |
| status | --status | No | Filter by creation status. Valid values: queued, running, completed, errored |

Behavior:

  1. Builds an ApiListBenchmarkTasksRequest. If --regex is provided, sets regex_filter. If --status is provided, sets status_filter. Both filters can be combined.
  2. Calls list_benchmark_tasks and displays a table with columns:
    • Task — the task slug (up to 40 chars)
    • Status — the task's creation state (e.g. COMPLETED, ERRORED)
    • Created — creation timestamp

Notes:

  • If no tasks match the filters, the table header is printed but with no rows.
  • The --status value is passed directly to the server as a string; the server performs the filtering.
  • The SDK does not support pagination for this endpoint — all matching tasks are returned in a single response.
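
The request wiring in step 1 amounts to something like the following sketch. `ApiListBenchmarkTasksRequest` is stubbed here so the example is self-contained; the real class comes from the generated API client.

```python
# Stub of the generated request class so this sketch runs standalone.
class ApiListBenchmarkTasksRequest:
    def __init__(self):
        self.regex_filter = None
        self.status_filter = None

def build_list_request(regex=None, status=None):
    """Set only the filters the user supplied; both may be combined."""
    request = ApiListBenchmarkTasksRequest()
    if regex:
        request.regex_filter = regex
    if status:
        # Passed through as-is; the server performs the filtering.
        request.status_filter = status
    return request

req = build_list_request(regex="^math", status="errored")
print(req.regex_filter, req.status_filter)  # ^math errored
```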

Examples:

# List all tasks
kaggle b t list

# Filter by name
kaggle b t list --regex "^math"

# Filter by status
kaggle b t list --status completed

# Combine filters
kaggle b t list --regex "^math" --status errored

status — Show task details and run status

Display task metadata and per-model run information including timing and errors.

kaggle b t status <task> [-m <model> ...]

| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| model | -m, --model | No | Filter to specific model(s). Accepts multiple space-separated values. If omitted, shows runs for all models |

Behavior:

  1. Fetches the task details via get_benchmark_task and prints a header:
    Task:     math-eval
    Status:   COMPLETED
    Created:  2026-04-06T10:00:00Z
    
  2. Fetches runs via list_benchmark_task_runs, optionally filtered to specific model slugs.
  3. If no runs exist, prints: No runs yet. Use 'kaggle b t run <task>' to start one.
  4. Otherwise displays a table with columns:
    • Model — the model slug (up to 20 chars)
    • Status — run state (e.g. RUNNING, COMPLETED, ERRORED)
    • Started — run start timestamp
    • Ended — run end timestamp (empty if still running)
    • URL — direct link to the run: https://www.kaggle.com/benchmarks/runs/<id>
  5. For errored runs, appends | Error: <message> to the row if error_message is present.

Errors:

  • HTTPError (404) — task does not exist on the server.
  • HTTPError — authentication or permission errors.

Examples:

# Show all runs for a task
kaggle b t status math-eval

# Filter to specific models
kaggle b t status math-eval -m gemini-pro gemma-2b

run — Schedule task runs

Schedule benchmark task execution against one or more models.

kaggle b t run <task> [-m <model> ...] [--wait]

| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| model | -m, --model | No | Model slug(s) to run against. Accepts multiple space-separated values |
| wait | --wait | No | Wait for runs to complete. Can specify a timeout in seconds (0 or omit = indefinite) |
| poll_interval | --poll-interval | No | Seconds between status polls when using --wait (default: 10) |

Behavior:

  1. Model selection: If no -m is provided, fetches the list of available benchmark models via list_benchmark_models and prompts the user interactively:

    No model specified. Available models:
      1. gemini-pro (Gemini Pro)
      2. gemma-2b (Gemma 2B)
    Enter model numbers (comma-separated), or 'all':
    
    • Enter comma-separated numbers (e.g. 1,3) to select specific models.
    • Enter all to run against every available model.
    • Invalid input (non-numeric, out-of-range index) raises ValueError.
    • If no benchmark models exist on the server, raises ValueError: No benchmark models available. Cannot schedule runs.
  2. Scheduling: Calls batch_schedule_benchmark_task_runs with the task slug and selected model slugs. Output:

    Submitted run(s) for task 'math-eval'.
      gemini-pro: Scheduled
      gemma-2b: Scheduled
      gemini-flash: Skipped (Already running)
    
  3. Waiting (--wait): After scheduling, if --wait is specified, polls list_benchmark_task_runs at a fixed interval (default 10 seconds, configurable via --poll-interval) until all runs reach a terminal state (COMPLETED or ERRORED) or the timeout is reached. Output while waiting:

    Waiting for run(s) to complete...
      2 run(s) still in progress...
      1 run(s) still in progress...
    All runs completed:
      gemini-pro: COMPLETED
      gemma-2b: ERRORED
    
    • If a timeout (in seconds) is specified and reached, it stops waiting and prints: Timed out waiting for runs after <timeout> seconds.
    • If 0 or no value is specified for --wait, it waits indefinitely.
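
The --wait polling described in step 3 amounts to a loop like the following sketch. The `list_runs` callable stands in for the real list_benchmark_task_runs call and is an assumption of this example.

```python
import time

TERMINAL_STATES = {"COMPLETED", "ERRORED"}

def wait_for_runs(list_runs, timeout=0, poll_interval=10, sleep=time.sleep):
    """Poll until every run is terminal, or the timeout elapses.

    list_runs() -> list of (model_slug, state) tuples.
    timeout=0 waits indefinitely. Returns the final run list,
    or None if the wait timed out.
    """
    deadline = time.monotonic() + timeout if timeout else None
    while True:
        runs = list_runs()
        pending = [r for r in runs if r[1] not in TERMINAL_STATES]
        if not pending:
            return runs
        if deadline is not None and time.monotonic() >= deadline:
            print(f"Timed out waiting for runs after {timeout} seconds.")
            return None
        print(f"  {len(pending)} run(s) still in progress...")
        sleep(poll_interval)

# Simulated server: the run completes on the second poll.
states = iter([[("gemini-pro", "RUNNING")], [("gemini-pro", "COMPLETED")]])
result = wait_for_runs(lambda: next(states), poll_interval=0)
print(result)  # [('gemini-pro', 'COMPLETED')]
```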

Errors:

  • ValueError: No benchmark models available. Cannot schedule runs. — no models exist on the server and none were specified via -m.
  • ValueError: Invalid selection: <input> — the user entered non-numeric or out-of-range input during interactive model selection.
  • HTTPError — server-side error (task not found, authentication failure, etc.).
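
The interactive selection validation behind the Invalid selection error could be sketched as follows; `parse_selection` is a hypothetical helper, not necessarily the PR's implementation.

```python
def parse_selection(raw: str, models: list[str]) -> list[str]:
    """Turn '1,3' or 'all' into a list of model slugs (1-based indices)."""
    raw = raw.strip()
    if raw.lower() == "all":
        return list(models)
    try:
        indices = [int(part) for part in raw.split(",")]
    except ValueError:
        raise ValueError(f"Invalid selection: {raw}")
    if any(i < 1 or i > len(models) for i in indices):
        raise ValueError(f"Invalid selection: {raw}")
    return [models[i - 1] for i in indices]

models = ["gemini-pro", "gemma-2b", "gemini-flash"]
print(parse_selection("1,3", models))  # ['gemini-pro', 'gemini-flash']
print(parse_selection("all", models))  # ['gemini-pro', 'gemma-2b', 'gemini-flash']
```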

Examples:

# Run against specific models
kaggle b t run math-eval -m gemini-pro gemma-2b

# Run and wait for completion
kaggle b t run math-eval -m gemini-pro --wait

# Wait with a custom poll interval (30 seconds)
kaggle b t run math-eval -m gemini-pro --wait --poll-interval 30

# Wait with a timeout (60 seconds)
kaggle b t run math-eval -m gemini-pro --wait 60

# Interactive model selection (prompts user)
kaggle b t run math-eval

download — Download run outputs

Download output files for completed and errored runs of a task.

kaggle b t download <task> [-m <model> ...] [-o <directory>]

| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| model | -m, --model | No | Filter to specific model(s). Accepts multiple space-separated values |
| output | -o, --output | No | Directory to save files. Default: ./<task>/output |

Behavior:

  1. Fetches runs via list_benchmark_task_runs. If -m is specified, filters by those model slugs; otherwise, fetches runs for all models.
  2. Iterates over runs and downloads output only for runs in terminal states:
    • COMPLETED — downloads the result output files.
    • ERRORED — downloads log files for debugging.
    • Runs in QUEUED or RUNNING state are silently skipped (no message printed).
  3. For each downloadable run:
    • Calls download_benchmark_task_run_output which returns a streamed HTTP response.
    • Saves to <output>/<model_slug>_<run_id> (the output directory is created automatically if it doesn't exist).
    • The download streams in 1MB chunks with automatic retry (up to 5 retries) on network errors and supports resume for interrupted downloads.
    • Prints progress: Downloading output for run <id> (<model_slug>)... and Downloaded output for <model_slug> to <path>.
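
The chunked, resumable download in step 3 could look roughly like this sketch using urllib; the real SDK's retry and resume details may differ, and the fake opener at the end exists only so the example runs without a network.

```python
import io
import os
import urllib.request

CHUNK_SIZE = 1024 * 1024  # 1MB chunks, as described above
MAX_RETRIES = 5

def download_with_resume(url, dest, opener=urllib.request.urlopen):
    """Stream url to dest in 1MB chunks, resuming a partial file on retry."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            resume_from = os.path.getsize(dest) if os.path.exists(dest) else 0
            req = urllib.request.Request(url)
            if resume_from:
                req.add_header("Range", f"bytes={resume_from}-")
            with opener(req) as resp:
                # 206 Partial Content means the server honored the Range header.
                mode = "ab" if resume_from and resp.status == 206 else "wb"
                with open(dest, mode) as fh:
                    while chunk := resp.read(CHUNK_SIZE):
                        fh.write(chunk)
            return dest
        except OSError:  # network errors surface as URLError (an OSError)
            if attempt == MAX_RETRIES:
                raise
    return dest

# Tiny in-memory fake so the sketch runs without a network connection.
class FakeResponse(io.BytesIO):
    status = 200

path = download_with_resume("https://example.invalid/run-output",
                            "run_output.bin",
                            opener=lambda req: FakeResponse(b"hello"))
print(open(path, "rb").read())  # b'hello'
```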

Notes:

  • If no runs match the filters or all runs are still in progress, nothing is downloaded and no error is raised.
  • The -m flag is useful when a task has many models but you only need output from specific ones.

Errors:

  • HTTPError — server-side error (authentication, task not found, download failure).
  • Network errors during download are retried up to 5 times with resume support.

Examples:

# Download all completed/errored outputs
kaggle b t download math-eval

# Download for specific models to a custom directory
kaggle b t download math-eval -m gemini-pro -o ./results

# Download multiple models
kaggle b t download math-eval -m gemini-pro gemma-2b

delete — Remove a task

Delete a benchmark task.

kaggle b t delete <task> [-y]

| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| no_confirm | -y, --yes | No | Skip confirmation prompt |

Behavior:

  • Currently a stub — always prints: Delete is not supported by the server yet.
  • The -y flag is accepted but has no effect since the delete operation is not implemented.
  • No API call is made.

Quick Reference

kaggle b t push     <task> -f <file>                        # Register a task
kaggle b t list     [--regex <pat>] [--status <state>]      # List tasks
kaggle b t status   <task> [-m <model> ...]                  # Show runs
kaggle b t run      <task> [-m <model> ...] [--wait]         # Schedule runs
kaggle b t download <task> [-m <model> ...] [-o <dir>]       # Download output
kaggle b t delete   <task> [-y]                              # Delete task (stub)

@dolaameng dolaameng marked this pull request as draft April 3, 2026 20:31
@dolaameng dolaameng force-pushed the dolaameng/benchmarks-cli-v2 branch 2 times, most recently from 0b99fa4 to 0c564be Compare April 6, 2026 20:51
@dolaameng dolaameng force-pushed the dolaameng/benchmarks-cli-v2 branch from 0c564be to db8afa9 Compare April 7, 2026 23:53
@dolaameng dolaameng requested review from andrewmwang and nl917 April 8, 2026 00:27
Comment thread: src/kaggle/api/kaggle_api_extended.py

request = ApiCreateBenchmarkTaskRequest()
request.slug = task
# Assume create_benchmark_task accepts ipynb content (JSON string)
Contributor Author


@andrewmwang can you confirm this?


yes can confirm, same format as kaggle kernels push

request.task_slugs = [task_slug_obj]
request.model_slugs = models

response = kaggle.benchmarks.benchmark_tasks_api_client.batch_schedule_benchmark_task_runs(request)
Contributor Author


Client-side check of the task status: only schedule runs when the task's creation state is successful.


@andrewmwang andrewmwang left a comment


Looks great! Didn't look too hard at the download flow yet, but the other ones LGTM

}

@staticmethod
def _make_task_slug(task: str) -> ApiBenchmarkTaskSlug:

[nit]: _make_api_task_slug or _make_api_task_slug_object

for decorator in node.decorator_list:
func = decorator.func if isinstance(decorator, ast.Call) else decorator

if not ((isinstance(func, ast.Name) and func.id == 'task') or

Nice! Would it also be possible to get the description field here? Then I could set up a big portion of the TDP before the first session completes running




response = kaggle.benchmarks.benchmark_tasks_api_client.create_benchmark_task(request)
print(f"Task '{task}' pushed.")
print(f"Task URL: {response.url}")

[super nit]: the response url could be better formatted to include https://kaggle.com/...

I could also just do that server side, so feel free to ignore

# Assume create_benchmark_task accepts ipynb content (JSON string)
request.text = notebook_content

response = kaggle.benchmarks.benchmark_tasks_api_client.create_benchmark_task(request)

need to add error handling here. You could read the ErrorMessage field in the response

def benchmarks_tasks_list_cli(self, regex=None, status=None):
request = ApiListBenchmarkTasksRequest()
if regex:
request.regex_filter = regex

@nl917 were we still going to allow regex?
