AI Benchmark Knowledge Base: a comprehensive collection of the benchmark question sets that major AI companies use to test model performance.
Updated Apr 16, 2026
Saotri Bench — coding benchmark for evaluating LLM agents on multi-phase programming tasks with hidden requirements.
Benchmark of local LLMs (Qwen 3.6, Gemma 4) vs ChatGPT and Gemini on coding tasks. Apple M5, 32GB. Methodology, raw outputs, judge prompts, scores, and charts.
Raw logs of Claude Code running on local Qwen3.5-27B (llama.cpp). Builds a Python todo app with 50 tests. Real-world performance data: 30 min, cache thrashing, 38 t/s generation.
Open benchmark harness for latest major AI models on general reasoning, coding, tool use, and long-context tasks.
🔍 Evaluate LLM agents on multi-phase programming tasks with FluxCodeBench, focusing on hidden requirements, long-context retention, and iterative refinement.