feat: Implement kaggle benchmark client#955

Draft
dolaameng wants to merge 2 commits into main from dolaameng/benchmarks-cli-v2

Conversation


@dolaameng dolaameng commented Apr 3, 2026

Benchmarks CLI Reference

The benchmarks CLI manages benchmark tasks — registering evaluation code, scheduling runs against models, monitoring progress, and downloading results.

Aliases: kaggle benchmarks or kaggle b

All task subcommands are under kaggle benchmarks tasks (alias: kaggle b t).


Commands

push — Register a task

Upload a Python source file as a benchmark task definition. The file is expected to be a .py file with percent delimiters (e.g., # %%). The CLI converts it to an .ipynb file before uploading. If the task already exists, it creates a new version.
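
For illustration, a minimal task file in percent format might look like the sketch below. The `task` decorator here is a self-contained stand-in so the example runs on its own; in a real task file the decorator would come from the benchmark framework instead.

```python
# tasks/math_eval.py - illustrative percent-format task file (a sketch)

# %% [markdown]
# # math-eval benchmark task

# %%
# Stand-in decorator so this sketch is self-contained. The real decorator
# (imported from the benchmark framework) is an assumption here.
def task(name=None):
    def wrap(fn):
        fn.task_name = name or fn.__name__
        return fn
    return wrap

# %%
@task(name="math-eval")
def math_eval(model):
    """Score the given model on a small set of math prompts."""
    return {"accuracy": 0.0}

print(math_eval.task_name)  # math-eval
```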

kaggle b t push <task> -f <file>

| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| file | -f, --file | Yes | Path to the Python source file defining the task |

Behavior:

  1. Validates the file exists and has a .py extension.
  2. Reads the source file and parses it with Python's ast module to extract task names from @task decorators (supports both @task and @kbench.task styles, as well as @task(name="...") with explicit names).
  3. Validates that the file contains at least one @task decorator. If none are found, raises ValueError and stops.
  4. Validates that the given task name matches one of the decorated functions in the file.
  5. Checks the server for an existing task with the same slug:
    • If the task exists and its creation_state is QUEUED or RUNNING (i.e. a previous version is still being built), the push is rejected with ValueError.
    • If the task exists and is in COMPLETED or ERRORED state, the push proceeds (creates a new version).
    • If the task does not exist (404), the push proceeds (creates a new task).
  6. Converts the .py file content to .ipynb format (Jupyter Notebook) using jupytext (assuming percent format).
  7. Sends the notebook content (JSON string) to create_benchmark_task.
  8. On success, prints the task slug and its URL.
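
Steps 2-4 can be sketched with Python's ast module roughly as follows; `extract_task_names` is an illustrative helper, not necessarily the PR's exact code. The .py to .ipynb conversion in step 6 would then be roughly `jupytext.writes(jupytext.reads(source, fmt="py:percent"), fmt="ipynb")`.

```python
import ast

def extract_task_names(source: str) -> list[str]:
    """Collect names of functions decorated with @task or @kbench.task,
    honoring an explicit @task(name="...") keyword when present."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for decorator in node.decorator_list:
            # @task(...) is an ast.Call; plain @task is a bare name/attribute.
            func = decorator.func if isinstance(decorator, ast.Call) else decorator
            is_task = (isinstance(func, ast.Name) and func.id == "task") or (
                isinstance(func, ast.Attribute) and func.attr == "task")
            if not is_task:
                continue
            name = node.name  # default to the function's own name
            if isinstance(decorator, ast.Call):
                for kw in decorator.keywords:
                    if kw.arg == "name" and isinstance(kw.value, ast.Constant):
                        name = kw.value.value
            names.append(name)
    return names

source = '''
@task
def math_eval(model): ...

@kbench.task(name="code-eval")
def code_task(model): ...
'''
print(extract_task_names(source))  # ['math_eval', 'code-eval']
```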

Errors:

  • ValueError: File <path> does not exist — file path is invalid.
  • ValueError: File <path> must be a .py file — file is not a Python file.
  • ValueError: No @task decorators found in file <path>. The file must define at least one task. — the file does not contain any @task-decorated functions.
  • ValueError: Task '<name>' not found in file <path>. Found tasks: ... — the task name doesn't match any @task-decorated function in the file.
  • ValueError: Task '<name>' is currently being created (pending). Cannot push now. — a previous version of this task is still being processed by the server.
  • HTTPError — server-side error (e.g. authentication failure, permission denied).

Example:

kaggle b t push math-eval -f tasks/math_eval.py

list — List tasks

Display all benchmark tasks, optionally filtered by name pattern or creation status.

kaggle b t list [--regex <pattern>] [--status <status>]

| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| regex | --regex | No | Filter task names by regular expression |
| status | --status | No | Filter by creation status. Valid values: queued, running, completed, errored |

Behavior:

  1. Builds an ApiListBenchmarkTasksRequest. If --regex is provided, sets regex_filter. If --status is provided, sets status_filter. Both filters can be combined.
  2. Calls list_benchmark_tasks and displays a table with columns:
    • Task — the task slug (up to 40 chars)
    • Status — the task's creation state (e.g. COMPLETED, ERRORED)
    • Created — creation timestamp

Notes:

  • If no tasks match the filters, the table header is printed but with no rows.
  • The --status value is passed directly to the server as a string; the server performs the filtering.
  • The SDK does not support pagination for this endpoint — all matching tasks are returned in a single response.
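
The request wiring in step 1 amounts to something like the following sketch. `ApiListBenchmarkTasksRequest` is stubbed here so the example is self-contained; the real class comes from the generated API client.

```python
# Stub of the generated request class so this sketch runs standalone.
class ApiListBenchmarkTasksRequest:
    def __init__(self):
        self.regex_filter = None
        self.status_filter = None

def build_list_request(regex=None, status=None):
    """Set only the filters the user supplied; both may be combined."""
    request = ApiListBenchmarkTasksRequest()
    if regex:
        request.regex_filter = regex
    if status:
        # Passed through as-is; the server performs the filtering.
        request.status_filter = status
    return request

req = build_list_request(regex="^math", status="errored")
print(req.regex_filter, req.status_filter)  # ^math errored
```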

Examples:

# List all tasks
kaggle b t list

# Filter by name
kaggle b t list --regex "^math"

# Filter by status
kaggle b t list --status completed

# Combine filters
kaggle b t list --regex "^math" --status errored

status — Show task details and run status

Display task metadata and per-model run information including timing and errors.

kaggle b t status <task> [-m <model> ...]

| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| model | -m, --model | No | Filter to specific model(s). Accepts multiple space-separated values. If omitted, shows runs for all models |

Behavior:

  1. Fetches the task details via get_benchmark_task and prints a header:
    Task:     math-eval
    Status:   COMPLETED
    Created:  2026-04-06T10:00:00Z
    
  2. Fetches runs via list_benchmark_task_runs, optionally filtered to specific model slugs.
  3. If no runs exist, prints: No runs yet. Use 'kaggle b t run <task>' to start one.
  4. Otherwise displays a table with columns:
    • Model — the model slug (up to 20 chars)
    • Status — run state (e.g. RUNNING, COMPLETED, ERRORED)
    • Started — run start timestamp
    • Ended — run end timestamp (empty if still running)
    • URL — direct link to the run: https://www.kaggle.com/benchmarks/runs/<id>
  5. For errored runs, appends | Error: <message> to the row if error_message is present.

Errors:

  • HTTPError (404) — task does not exist on the server.
  • HTTPError — authentication or permission errors.

Examples:

# Show all runs for a task
kaggle b t status math-eval

# Filter to specific models
kaggle b t status math-eval -m gemini-pro gemma-2b

run — Schedule task runs

Schedule benchmark task execution against one or more models.

kaggle b t run <task> [-m <model> ...] [--wait]

| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| model | -m, --model | No | Model slug(s) to run against. Accepts multiple space-separated values |
| wait | --wait | No | Wait for runs to complete. Can specify a timeout in seconds (0 or omit = indefinite) |
| poll_interval | --poll-interval | No | Seconds between status polls when using --wait (default: 10) |

Behavior:

  1. Model selection: If no -m is provided, fetches the list of available benchmark models via list_benchmark_models and prompts the user interactively:

    No model specified. Available models:
      1. gemini-pro (Gemini Pro)
      2. gemma-2b (Gemma 2B)
    Enter model numbers (comma-separated), or 'all':
    
    • Enter comma-separated numbers (e.g. 1,3) to select specific models.
    • Enter all to run against every available model.
    • Invalid input (non-numeric, out-of-range index) raises ValueError.
    • If no benchmark models exist on the server, raises ValueError: No benchmark models available. Cannot schedule runs.
  2. Scheduling: Calls batch_schedule_benchmark_task_runs with the task slug and selected model slugs. Output:

    Submitted run(s) for task 'math-eval'.
      gemini-pro: Scheduled
      gemma-2b: Scheduled
      gemini-flash: Skipped (Already running)
    
  3. Waiting (--wait): After scheduling, if --wait is specified, polls list_benchmark_task_runs at a fixed interval (default 10 seconds, configurable via --poll-interval) until all runs reach a terminal state (COMPLETED or ERRORED) or the timeout is reached. Output while waiting:

    Waiting for run(s) to complete...
      2 run(s) still in progress...
      1 run(s) still in progress...
    All runs completed:
      gemini-pro: COMPLETED
      gemma-2b: ERRORED
    
    • If a timeout (in seconds) is specified and reached, it stops waiting and prints: Timed out waiting for runs after <timeout> seconds.
    • If 0 or no value is specified for --wait, it waits indefinitely.
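
The --wait polling described in step 3 amounts to a loop like the following sketch. The `list_runs` callable stands in for the real list_benchmark_task_runs call and is an assumption of this example.

```python
import time

TERMINAL_STATES = {"COMPLETED", "ERRORED"}

def wait_for_runs(list_runs, timeout=0, poll_interval=10, sleep=time.sleep):
    """Poll until every run is terminal, or the timeout elapses.

    list_runs() -> list of (model_slug, state) tuples.
    timeout=0 waits indefinitely. Returns the final run list,
    or None if the wait timed out.
    """
    deadline = time.monotonic() + timeout if timeout else None
    while True:
        runs = list_runs()
        pending = [r for r in runs if r[1] not in TERMINAL_STATES]
        if not pending:
            return runs
        if deadline is not None and time.monotonic() >= deadline:
            print(f"Timed out waiting for runs after {timeout} seconds.")
            return None
        print(f"  {len(pending)} run(s) still in progress...")
        sleep(poll_interval)

# Simulated server: the run completes on the second poll.
states = iter([[("gemini-pro", "RUNNING")], [("gemini-pro", "COMPLETED")]])
result = wait_for_runs(lambda: next(states), poll_interval=0)
print(result)  # [('gemini-pro', 'COMPLETED')]
```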

Errors:

  • ValueError: No benchmark models available. Cannot schedule runs. — no models exist on the server and none were specified via -m.
  • ValueError: Invalid selection: <input> — the user entered non-numeric or out-of-range input during interactive model selection.
  • HTTPError — server-side error (task not found, authentication failure, etc.).
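
The interactive selection validation behind the Invalid selection error could be sketched as follows; `parse_selection` is a hypothetical helper, not necessarily the PR's implementation.

```python
def parse_selection(raw: str, models: list[str]) -> list[str]:
    """Turn '1,3' or 'all' into a list of model slugs (1-based indices)."""
    raw = raw.strip()
    if raw.lower() == "all":
        return list(models)
    try:
        indices = [int(part) for part in raw.split(",")]
    except ValueError:
        raise ValueError(f"Invalid selection: {raw}")
    if any(i < 1 or i > len(models) for i in indices):
        raise ValueError(f"Invalid selection: {raw}")
    return [models[i - 1] for i in indices]

models = ["gemini-pro", "gemma-2b", "gemini-flash"]
print(parse_selection("1,3", models))  # ['gemini-pro', 'gemini-flash']
print(parse_selection("all", models))  # ['gemini-pro', 'gemma-2b', 'gemini-flash']
```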

Examples:

# Run against specific models
kaggle b t run math-eval -m gemini-pro gemma-2b

# Run and wait for completion
kaggle b t run math-eval -m gemini-pro --wait

# Wait with a custom poll interval (30 seconds)
kaggle b t run math-eval -m gemini-pro --wait --poll-interval 30

# Wait with a timeout (60 seconds)
kaggle b t run math-eval -m gemini-pro --wait 60

# Interactive model selection (prompts user)
kaggle b t run math-eval

download — Download run outputs

Download output files for completed and errored runs of a task.

kaggle b t download <task> [-m <model> ...] [-o <directory>]

| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| model | -m, --model | No | Filter to specific model(s). Accepts multiple space-separated values |
| output | -o, --output | No | Directory to save files. Default: ./<task>/output |

Behavior:

  1. Fetches runs via list_benchmark_task_runs. If -m is specified, filters by those model slugs; otherwise, fetches runs for all models.
  2. Iterates over runs and downloads output only for runs in terminal states:
    • COMPLETED — downloads the result output files.
    • ERRORED — downloads log files for debugging.
    • Runs in QUEUED or RUNNING state are silently skipped (no message printed).
  3. For each downloadable run:
    • Calls download_benchmark_task_run_output which returns a streamed HTTP response.
    • Saves to <output>/<model_slug>_<run_id> (the output directory is created automatically if it doesn't exist).
    • The download streams in 1MB chunks with automatic retry (up to 5 retries) on network errors and supports resume for interrupted downloads.
    • Prints progress: Downloading output for run <id> (<model_slug>)... and Downloaded output for <model_slug> to <path>.
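
The chunked, resumable download in step 3 could look roughly like this sketch using urllib; the real SDK's retry and resume details may differ, and the fake opener at the end exists only so the example runs without a network.

```python
import io
import os
import urllib.request

CHUNK_SIZE = 1024 * 1024  # 1MB chunks, as described above
MAX_RETRIES = 5

def download_with_resume(url, dest, opener=urllib.request.urlopen):
    """Stream url to dest in 1MB chunks, resuming a partial file on retry."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            resume_from = os.path.getsize(dest) if os.path.exists(dest) else 0
            req = urllib.request.Request(url)
            if resume_from:
                req.add_header("Range", f"bytes={resume_from}-")
            with opener(req) as resp:
                # 206 Partial Content means the server honored the Range header.
                mode = "ab" if resume_from and resp.status == 206 else "wb"
                with open(dest, mode) as fh:
                    while chunk := resp.read(CHUNK_SIZE):
                        fh.write(chunk)
            return dest
        except OSError:  # network errors surface as URLError (an OSError)
            if attempt == MAX_RETRIES:
                raise
    return dest

# Tiny in-memory fake so the sketch runs without a network connection.
class FakeResponse(io.BytesIO):
    status = 200

path = download_with_resume("https://example.invalid/run-output",
                            "run_output.bin",
                            opener=lambda req: FakeResponse(b"hello"))
print(open(path, "rb").read())  # b'hello'
```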

Notes:

  • If no runs match the filters or all runs are still in progress, nothing is downloaded and no error is raised.
  • The -m flag is useful when a task has many models but you only need output from specific ones.

Errors:

  • HTTPError — server-side error (authentication, task not found, download failure).
  • Network errors during download are retried up to 5 times with resume support.

Examples:

# Download all completed/errored outputs
kaggle b t download math-eval

# Download for specific models to a custom directory
kaggle b t download math-eval -m gemini-pro -o ./results

# Download multiple models
kaggle b t download math-eval -m gemini-pro gemma-2b

delete — Remove a task

Delete a benchmark task.

kaggle b t delete <task> [-y]

| Parameter | Flag | Required | Description |
| --- | --- | --- | --- |
| task | positional | Yes | Task name (e.g. math-eval) |
| no_confirm | -y, --yes | No | Skip confirmation prompt |

Behavior:

  • Currently a stub — always prints: Delete is not supported by the server yet.
  • The -y flag is accepted but has no effect since the delete operation is not implemented.
  • No API call is made.

Quick Reference

kaggle b t push     <task> -f <file>                        # Register a task
kaggle b t list     [--regex <pat>] [--status <state>]      # List tasks
kaggle b t status   <task> [-m <model> ...]                  # Show runs
kaggle b t run      <task> [-m <model> ...] [--wait]         # Schedule runs
kaggle b t download <task> [-m <model> ...] [-o <dir>]       # Download output
kaggle b t delete   <task> [-y]                              # Delete task (stub)

@dolaameng dolaameng marked this pull request as draft April 3, 2026 20:31
@dolaameng dolaameng force-pushed the dolaameng/benchmarks-cli-v2 branch 2 times, most recently from 0b99fa4 to 0c564be Compare April 6, 2026 20:51
@dolaameng dolaameng force-pushed the dolaameng/benchmarks-cli-v2 branch from 0c564be to db8afa9 Compare April 7, 2026 23:53
@dolaameng dolaameng requested review from andrewmwang and nl917 April 8, 2026 00:27
Comment thread: src/kaggle/api/kaggle_api_extended.py

request = ApiCreateBenchmarkTaskRequest()
request.slug = task
# Assume create_benchmark_task accepts ipynb content (JSON string)
Contributor Author


@andrewmwang can you confirm this?


yes can confirm, same format as kaggle kernels push

request.task_slugs = [task_slug_obj]
request.model_slugs = models

response = kaggle.benchmarks.benchmark_tasks_api_client.batch_schedule_benchmark_task_runs(request)
Contributor Author


Client-side check of the task status: only schedule runs when the task's creation state is successful.


@andrewmwang andrewmwang left a comment


Looks great! Didn't look too hard at the download flow yet, but the other ones LGTM

}

@staticmethod
def _make_task_slug(task: str) -> ApiBenchmarkTaskSlug:

[nit]: _make_api_task_slug or _make_api_task_slug_object

for decorator in node.decorator_list:
func = decorator.func if isinstance(decorator, ast.Call) else decorator

if not ((isinstance(func, ast.Name) and func.id == 'task') or

Nice! Would it also be possible to get the description field here? Then I could set up a big portion of the TDP before the first session completes running




response = kaggle.benchmarks.benchmark_tasks_api_client.create_benchmark_task(request)
print(f"Task '{task}' pushed.")
print(f"Task URL: {response.url}")

[super nit]: the response url could be better formatted to include https://kaggle.com/...

I could also just do that server side, so feel free to ignore

# Assume create_benchmark_task accepts ipynb content (JSON string)
request.text = notebook_content

response = kaggle.benchmarks.benchmark_tasks_api_client.create_benchmark_task(request)

need to add error handling here. You could read the ErrorMessage field in the response

def benchmarks_tasks_list_cli(self, regex=None, status=None):
request = ApiListBenchmarkTasksRequest()
if regex:
request.regex_filter = regex

@nl917 were we still going to allow regex?
