feat: add --remote flag for cloud/remote API benchmarking by outsourc-e · Pull Request #12 · outsourc-e/bench-loop

outsourc-e · 2026-05-23T02:52:06Z

Summary

Adds first-class support for benchmarking cloud/remote LLM APIs (DashScope, OpenRouter, Together, OpenAI, etc.) without the local tok/s speed score tanking the overall score.

Problem

The current scoring formula is 0.55 × quality + 0.20 × speed + 0.25 × reliability. For cloud APIs, speed_score is always 0 because the speed suite measures local inference throughput (tok/s). This drops cloud models to ~70 overall even with 90+ quality scores.

Changes

- --remote CLI flag: explicit opt-in, also auto-detected when the endpoint is not localhost
- Reweighted scoring: remote runs use 0.65 × quality + 0.35 × reliability (the 0.20 speed weight is redistributed)
- is_remote field on BenchmarkRun: lets the leaderboard filter or separate cloud and local runs
- Model validation is skipped when /v1/models returns an empty list (DashScope and some other providers don't implement this endpoint)
- Console report shows a ☁ remote/cloud indicator and N/A (cloud) for speed
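The auto-detection rule above could look roughly like the sketch below. It is not the code from bench_loop/runner/orchestrator.py: the function name `is_remote_endpoint`, the `LOCAL_HOSTS` set, and the choice to treat unparseable endpoints as local are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hostnames treated as local. This set is an assumption, not taken from the PR.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1", "0.0.0.0"}


def is_remote_endpoint(endpoint: str, remote_flag: bool = False) -> bool:
    """Return True when --remote was passed or the endpoint host is not local."""
    if remote_flag:
        return True  # explicit opt-in always wins
    host = urlparse(endpoint).hostname  # None when the URL has no scheme/netloc
    if host is None:
        return False  # assumption: unparseable endpoints are treated as local
    return host.lower() not in LOCAL_HOSTS


print(is_remote_endpoint("https://dashscope-intl.aliyuncs.com/compatible-mode"))
print(is_remote_endpoint("http://localhost:11434/v1"))
```

Keeping the explicit flag as an override matters for cases the hostname check cannot see, such as a cloud API reached through a local proxy.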

Example

benchloop run \
  --model qwen3.7-max \
  --provider openai_compat \
  --endpoint https://dashscope-intl.aliyuncs.com/compatible-mode \
  --remote \
  --suites toolcall,coding,dataextract,instructfollow,reasonmath

Before (without --remote):

QUALITY    │ 87.9
SPEED      │  0.0
RELIABILITY│ 80.6
OVERALL    │ 68.5 ⚠️

After (with --remote):

QUALITY    │ 87.9
SPEED      │ N/A (cloud)
RELIABILITY│ 80.6
OVERALL    │ 85.3 ✅
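
The two OVERALL values follow directly from the weights quoted in this description. The sketch below only reproduces that arithmetic; `overall_score` is a hypothetical helper, not the actual compute_aggregates() in bench_loop/models.py.

```python
def overall_score(quality: float, reliability: float, speed: float = 0.0,
                  remote: bool = False) -> float:
    """Weighted overall score using the weights quoted in this PR description."""
    if remote:
        # Remote runs: the 0.20 speed weight is redistributed to the other two.
        return 0.65 * quality + 0.35 * reliability
    # Local runs: original formula, where a speed score of 0 costs up to 20 points.
    return 0.55 * quality + 0.20 * speed + 0.25 * reliability


before = overall_score(87.9, 80.6, speed=0.0)   # local weights, speed scored 0
after = overall_score(87.9, 80.6, remote=True)  # remote weights
print(f"{before:.1f} {after:.1f}")  # prints "68.5 85.3"
```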

Files changed

bench_loop/cli.py — added --remote flag
bench_loop/models.py — added is_remote field + reweighted compute_aggregates()
bench_loop/runner/orchestrator.py — auto-detect remote, skip empty model list validation
bench_loop/report/console.py — remote indicator + N/A speed display

- Add --remote CLI flag (auto-detected from non-localhost endpoints)
- Reweight overall scoring for remote runs: 65% quality + 35% reliability (drops 20% speed weight since local tok/s is meaningless for cloud APIs)
- Skip model validation when /v1/models returns empty (common for DashScope, etc.)
- Show 'N/A (cloud)' for speed in summary output
- Add is_remote field to BenchmarkRun for leaderboard filtering
- Display remote mode indicator in console report

- Add chat_streaming() to openai_compat provider: measures TTFT and effective tok/s via SSE streaming (what cloud users actually feel)
- Add cloud speed scoring curve: 60% TTFT + 40% effective tok/s (TTFT 200ms->100, 500ms->86, 2000ms->40; tok/s 40->60, 100->85)
- Use streaming for speed tasks when remote=true
- Include cloud speed in overall when available (0.50Q + 0.25S + 0.25R)
- Show TTFT + tok/s in summary output for remote runs
- Add is_cloud_speed metadata to speed task results
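
One way to read the cloud speed curve in this commit is as interpolation between the quoted anchor points. The commit message gives only the anchors and the 60/40 split, so the piecewise-linear shape, the clamping outside the anchor range, and every function name below are assumptions for illustration, not the merged implementation.

```python
from bisect import bisect_right

# Anchor points quoted in the commit message, as (input, score) pairs.
TTFT_ANCHORS_MS = [(200.0, 100.0), (500.0, 86.0), (2000.0, 40.0)]
TOKS_ANCHORS = [(40.0, 60.0), (100.0, 85.0)]


def interpolate(x: float, anchors: list[tuple[float, float]]) -> float:
    """Piecewise-linear interpolation, clamped to the first/last anchor score.

    Both the linear shape and the clamping are assumptions; the commit
    message does not say how values between or beyond the anchors are scored.
    """
    xs = [a[0] for a in anchors]
    if x <= xs[0]:
        return anchors[0][1]
    if x >= xs[-1]:
        return anchors[-1][1]
    i = bisect_right(xs, x)  # anchors[i - 1] is at or below x, anchors[i] above
    (x0, y0), (x1, y1) = anchors[i - 1], anchors[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def cloud_speed_score(ttft_ms: float, tok_per_s: float) -> float:
    """60% TTFT score + 40% effective tok/s score, per the commit message."""
    return (0.60 * interpolate(ttft_ms, TTFT_ANCHORS_MS)
            + 0.40 * interpolate(tok_per_s, TOKS_ANCHORS))


def overall_with_cloud_speed(quality: float, speed: float, reliability: float) -> float:
    """Overall score when a cloud speed score is available (0.50Q + 0.25S + 0.25R)."""
    return 0.50 * quality + 0.25 * speed + 0.25 * reliability


speed = cloud_speed_score(ttft_ms=500.0, tok_per_s=100.0)  # 0.6 * 86 + 0.4 * 85
print(round(speed, 1))
```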

- Track first reasoning token as TTFT (what users see first)
- Track first content token separately for throughput calculation
- Subtract reasoning_tokens from completion_tokens for content tok/s
- Measure content throughput from first content token, not TTFT
- Fixes qwen3.7-max and similar reasoning models showing inflated token counts and near-zero tok/s
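
The timing rules in this commit can be sketched as two small functions over timestamps collected while streaming. The `StreamTiming` container and both function names are hypothetical, and falling back to the first content token when no reasoning token arrives is an assumption. The sample numbers are invented to show the effect and are not measurements from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamTiming:
    request_start: float                    # seconds, e.g. from time.perf_counter()
    first_reasoning_token: Optional[float]  # None if the model emits no reasoning
    first_content_token: Optional[float]    # None if no content was produced
    stream_end: float
    completion_tokens: int                  # as reported, includes reasoning tokens
    reasoning_tokens: int = 0


def ttft_seconds(t: StreamTiming) -> Optional[float]:
    """TTFT is the first token of any kind: reasoning if present, else content."""
    seen = [x for x in (t.first_reasoning_token, t.first_content_token) if x is not None]
    return min(seen) - t.request_start if seen else None


def content_tok_per_s(t: StreamTiming) -> float:
    """Content throughput, measured from the first content token, not from TTFT."""
    if t.first_content_token is None:
        return 0.0
    content_tokens = max(t.completion_tokens - t.reasoning_tokens, 0)
    elapsed = t.stream_end - t.first_content_token
    return content_tokens / elapsed if elapsed > 0 else 0.0


# Invented example: 1000 reasoning tokens, then 100 content tokens in 2 seconds.
timing = StreamTiming(0.0, 0.5, 6.5, 8.5, completion_tokens=1100, reasoning_tokens=1000)
print(ttft_seconds(timing), content_tok_per_s(timing))  # prints "0.5 50.0"
```

Dividing all 1100 completion tokens by the time since the first reasoning token would mix reasoning output into the content rate, which is the kind of distortion this commit sets out to remove.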

Eric added 3 commits May 22, 2026 22:51

outsourc-e merged commit b05eb3e into main May 23, 2026

outsourc-e deleted the feat/remote-cloud-benchmarking branch May 23, 2026 05:52

outsourc-e mentioned this pull request May 23, 2026

[bug] Some Issues Regarding Benchloop #11

Closed
