feat: add --remote flag for cloud/remote API benchmarking #12

Merged: outsourc-e merged 3 commits into main from feat/remote-cloud-benchmarking on May 23, 2026

@outsourc-e
Owner

Summary

Adds first-class support for benchmarking cloud/remote LLM APIs (DashScope, OpenRouter, Together, OpenAI, etc.) without the local tok/s speed score dragging down the overall score.

Problem

The current scoring formula is 0.55 × quality + 0.20 × speed + 0.25 × reliability. For cloud APIs, speed_score is always 0 because the speed suite measures local inference throughput (tok/s). This drops cloud models to ~70 overall even with 90+ quality scores.

Changes

  • --remote CLI flag — explicit opt-in, also auto-detected when endpoint is not localhost
  • Reweighted scoring — remote runs use 0.65 × quality + 0.35 × reliability (speed weight redistributed)
  • is_remote field on BenchmarkRun — enables leaderboard to filter/separate cloud vs local runs
  • Skip model validation when /v1/models returns an empty list (DashScope and some other providers don't implement this endpoint)
  • Console report shows ☁ remote/cloud indicator and N/A (cloud) for speed
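The auto-detection described in the first bullet could look roughly like the sketch below. The helper name `is_remote_endpoint` and the `LOCAL_HOSTS` set are assumptions for illustration, not the code in `orchestrator.py`; the only behaviour taken from this PR is "explicit flag wins, otherwise any non-localhost endpoint counts as remote".

```python
from urllib.parse import urlparse

# Hypothetical set of hosts treated as local; the real list may differ.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1", "0.0.0.0"}


def is_remote_endpoint(endpoint: str, remote_flag: bool = False) -> bool:
    """Return True when a run should be treated as remote/cloud."""
    if remote_flag:
        # Explicit --remote opt-in always wins.
        return True
    host = urlparse(endpoint).hostname
    # If no hostname can be parsed (e.g. the URL has no scheme), this sketch
    # falls back to local rather than guessing.
    return host is not None and host not in LOCAL_HOSTS
```

With this sketch, the DashScope endpoint from the example below would be classified as remote even without `--remote`, while `http://localhost:11434/v1` would not.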

Example

benchloop run \
  --model qwen3.7-max \
  --provider openai_compat \
  --endpoint https://dashscope-intl.aliyuncs.com/compatible-mode \
  --remote \
  --suites toolcall,coding,dataextract,instructfollow,reasonmath

Before (without --remote):

QUALITY    │ 87.9
SPEED      │  0.0
RELIABILITY│ 80.6
OVERALL    │ 68.5 ⚠️

After (with --remote):

QUALITY    │ 87.9
SPEED      │ N/A (cloud)
RELIABILITY│ 80.6
OVERALL    │ 85.3 ✅
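Both OVERALL figures above follow directly from the weights. The sketch below reproduces them; `overall_score` is a hypothetical standalone function (the real logic lives in `compute_aggregates()`), and the third branch uses the 0.50/0.25/0.25 weights that a later commit in this PR applies when a cloud speed score is available.

```python
from typing import Optional


def overall_score(quality: float, reliability: float,
                  speed: Optional[float] = None, remote: bool = False) -> float:
    """Weighted overall score; illustrative only, not the project's code."""
    if not remote:
        # Local runs: 0.55 x quality + 0.20 x speed + 0.25 x reliability.
        return 0.55 * quality + 0.20 * (speed or 0.0) + 0.25 * reliability
    if speed is None:
        # Remote run without a speed measurement: speed weight redistributed.
        return 0.65 * quality + 0.35 * reliability
    # Remote run with a cloud speed score (TTFT + effective tok/s).
    return 0.50 * quality + 0.25 * speed + 0.25 * reliability


before = overall_score(87.9, 80.6, speed=0.0)    # about 68.5
after = overall_score(87.9, 80.6, remote=True)   # about 85.3
```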

Files changed

  • bench_loop/cli.py — added --remote flag
  • bench_loop/models.py — added is_remote field + reweighted compute_aggregates()
  • bench_loop/runner/orchestrator.py — auto-detect remote, skip empty model list validation
  • bench_loop/report/console.py — remote indicator + N/A speed display
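As a rough illustration of the "skip empty model list validation" change in `orchestrator.py`: the function name and signature below are assumptions, and the sketch only captures the rule that an empty `/v1/models` response is treated as "cannot validate" rather than "model not found".

```python
from typing import List


def should_accept_model(requested: str, listed_models: List[str]) -> bool:
    """Decide whether to proceed with the requested model name."""
    if not listed_models:
        # Provider returned nothing from /v1/models (e.g. DashScope):
        # skip validation instead of rejecting the run.
        return True
    return requested in listed_models
```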

Eric added 3 commits May 22, 2026 22:51
- Add --remote CLI flag (auto-detected from non-localhost endpoints)
- Reweight overall scoring for remote runs: 65% quality + 35% reliability
  (drops 20% speed weight since local tok/s is meaningless for cloud APIs)
- Skip model validation when /v1/models returns empty (common for DashScope, etc.)
- Show 'N/A (cloud)' for speed in summary output
- Add is_remote field to BenchmarkRun for leaderboard filtering
- Display remote mode indicator in console report
- Add chat_streaming() to openai_compat provider — measures TTFT and
  effective tok/s via SSE streaming (what cloud users actually feel)
- Add cloud speed scoring curve: 60% TTFT + 40% effective tok/s
  (TTFT 200ms->100, 500ms->86, 2000ms->40; tok/s 40->60, 100->85)
- Use streaming for speed tasks when remote=true
- Include cloud speed in overall when available (0.50Q + 0.25S + 0.25R)
- Show TTFT + tok/s in summary output for remote runs
- Add is_cloud_speed metadata to speed task results
- Track first reasoning token as TTFT (what users see first)
- Track first content token separately for throughput calculation
- Subtract reasoning_tokens from completion_tokens for content tok/s
- Measure content throughput from first content token, not TTFT
- Fixes qwen3.7-max and similar reasoning models showing inflated
  token counts and near-zero tok/s
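A minimal sketch of the cloud speed metrics described in these commits. The anchor points and the 60/40 blend come from the commit message; the piecewise-linear interpolation between them, the clamping outside them, and all function names are assumptions, since the actual curve shape is not shown here.

```python
from bisect import bisect_right

# Anchor points quoted in the commit message, as (input, score) pairs.
TTFT_POINTS = [(200.0, 100.0), (500.0, 86.0), (2000.0, 40.0)]  # TTFT in ms
TOKS_POINTS = [(40.0, 60.0), (100.0, 85.0)]                    # effective tok/s


def interpolate(points, x):
    """Piecewise-linear interpolation, clamped at both ends (assumed behaviour)."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, x)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def cloud_speed_score(ttft_ms, tok_per_s):
    """Blend: 60% TTFT score + 40% effective tok/s score."""
    return 0.60 * interpolate(TTFT_POINTS, ttft_ms) + 0.40 * interpolate(TOKS_POINTS, tok_per_s)


def content_tok_per_s(completion_tokens, reasoning_tokens, first_content_at, finished_at):
    """Content throughput, measured from the first content token, not from TTFT.

    Reasoning tokens are subtracted so that reasoning models do not report
    inflated token counts. Timestamps are in seconds.
    """
    content_tokens = completion_tokens - reasoning_tokens
    elapsed = finished_at - first_content_at
    if content_tokens <= 0 or elapsed <= 0:
        return 0.0
    return content_tokens / elapsed
```

For example, a response with 1200 completion tokens, of which 1000 are reasoning tokens, whose content streams for 4 seconds yields 50 content tok/s in this sketch, instead of a figure distorted by the reasoning phase.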
@outsourc-e outsourc-e merged commit b05eb3e into main May 23, 2026
@outsourc-e outsourc-e deleted the feat/remote-cloud-benchmarking branch May 23, 2026 05:52