A pluggable harness for benchmarking ROCm software stacks. Each ROCm build (typically a full container image: ROCm + OS) is exercised by a set of micro-benchmarks (GEMM, attention, collective/PCIe/HBM bandwidth, ...) that emit a uniform, machine-readable result format.
Status: scaffold. This repository currently contains the architecture, the project standards, and a non-functional core skeleton. The individual benchmark plugins are migrated in incrementally. See
DESIGN.md for the full refactoring plan.
The previous generation of the harness grew organically: ad-hoc per-test bash
and Python, results parsed out of mixed stdout, xlsx reports, duplicated
regression scripts, and committed throwaway artifacts. This rewrite replaces
that with one shared pipeline and a thin per-benchmark plugin.
- One pipeline, many plugins. Every benchmark follows build -> run -> parse -> result -> (optional) regression. The shared pipeline lives in benchtestkit/; each benchmark is a plugin under benchmarks/.
- Output is CSV only. The canonical result is a tidy/long CSV (one metric per row), ideal for ingestion by an external data platform. An optional human-facing wide CSV can be derived from the same data.
- Logs are separated from results. A run writes stdout.log/stderr.log next to the structured result.csv; they are never mixed.
- Regression is pluggable and off by default. Backends: none (default), local (baseline CSV comparison, for open-source users), and kish (external data platform, planned).
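To illustrate how a wide CSV can be derived from the tidy one, here is a minimal sketch using only the standard library. The column names (`benchmark`, `case`, `metric`, `value`) and the sample rows are assumptions made for this example; the actual schema is defined under docs/standards/.

```python
import csv
import io

# Hypothetical tidy/long rows: one metric per row. The column names and
# values are assumptions for illustration, not the project's actual schema.
TIDY_CSV = """benchmark,case,metric,value
peak_gemm,fp16,tflops,100.0
peak_gemm,fp16,latency_ms,2.0
peak_gemm,bf16,tflops,90.0
"""


def tidy_to_wide(tidy_text: str) -> str:
    """Pivot tidy rows into one row per (benchmark, case), one column per metric."""
    rows = list(csv.DictReader(io.StringIO(tidy_text)))
    metrics = []  # metric names in first-seen order
    wide = {}     # (benchmark, case) -> {metric: value}
    for row in rows:
        if row["metric"] not in metrics:
            metrics.append(row["metric"])
        key = (row["benchmark"], row["case"])
        wide.setdefault(key, {})[row["metric"]] = row["value"]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["benchmark", "case", *metrics])
    for (benchmark, case), values in wide.items():
        # Metrics that a case did not report are left as empty cells.
        writer.writerow([benchmark, case, *(values.get(m, "") for m in metrics)])
    return out.getvalue()


print(tidy_to_wide(TIDY_CSV))
```

Because the wide form is computed purely from the tidy rows, the tidy CSV remains the single source of truth and the wide file can be regenerated at any time.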
benchtestkit/ # core pipeline (knows nothing about any specific benchmark)
benchmarks/ # one plugin per benchmark (see benchmarks/_example)
vendor/ # third-party native sources without an upstream submodule
configs/ # global + per-benchmark parameter matrices
docs/standards/ # project standards (architecture, coding style, schema)
tests/ # unit tests for parsers, regression, etc.
runs/ # benchmark outputs (git-ignored)
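The split between a benchmark-agnostic core and thin plugins could look roughly like the sketch below. Every name in it (`Metric`, `BenchmarkPlugin`, `EchoBenchmark`, `run_pipeline`) is hypothetical and chosen for illustration; the real interfaces are described in DESIGN.md and docs/standards/.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Metric:
    """One tidy result row (hypothetical fields)."""
    benchmark: str
    case: str
    metric: str
    value: float


class BenchmarkPlugin(Protocol):
    """Hypothetical plugin interface: the core only calls these stages."""
    name: str

    def build(self) -> None: ...
    def run(self) -> str: ...
    def parse(self, stdout: str) -> list[Metric]: ...


class EchoBenchmark:
    """Toy plugin that returns canned output instead of launching a real tool."""
    name = "echo_example"

    def build(self) -> None:
        pass  # nothing to compile for the toy plugin

    def run(self) -> str:
        return "tflops 12.5\n"

    def parse(self, stdout: str) -> list[Metric]:
        metrics = []
        for line in stdout.splitlines():
            metric_name, value = line.split()
            metrics.append(Metric(self.name, "default", metric_name, float(value)))
        return metrics


def run_pipeline(plugin: BenchmarkPlugin) -> list[Metric]:
    """Shared pipeline: build -> run -> parse -> result, with no plugin-specific logic."""
    plugin.build()
    stdout = plugin.run()
    return plugin.parse(stdout)


print(run_pipeline(EchoBenchmark()))
```

The point of the shape is that `run_pipeline` never needs to change when a new benchmark is added; only a new plugin class is written.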
BenchtestKit uses a fork + pull request workflow with a strictly linear history
(every change lands via Rebase and merge). See
CONTRIBUTING.md for the full fork / branch / rebase / PR
flow.
Contributors must follow the project standards under
docs/standards/. They are surfaced to the Cursor agent as
thin rules in .cursor/rules/.
python -m pip install -e ".[dev]"
benchtestkit list                                 # list available benchmarks
benchtestkit check-env                            # verify required tools/deps
benchtestkit run sustained_gemm flash_attention   # run a subset
benchtestkit run --all --regression none
benchtestkit run --all --regression local         # gate against stored baselines

Results are written to runs/<run_id>/<benchmark>/ (tidy result.csv, an
optional result_wide.csv, stdout.log/stderr.log, and meta.json), with a
run-level run.json.
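As a rough illustration of what gating against a stored baseline CSV can mean, the sketch below compares two tidy CSVs and reports metrics that dropped by more than a tolerance. It is not the harness's implementation: the column names, the 5% tolerance, and the "higher is better" rule are all assumptions made for the example.

```python
import csv
import io


def load_metrics(text):
    """Map (benchmark, case, metric) -> float value from tidy CSV text."""
    return {
        (r["benchmark"], r["case"], r["metric"]): float(r["value"])
        for r in csv.DictReader(io.StringIO(text))
    }


def find_regressions(baseline_text, current_text, tolerance=0.05):
    """Return entries whose current value fell more than `tolerance` below baseline.

    Assumes higher is better for every metric, which is a simplification.
    """
    baseline = load_metrics(baseline_text)
    current = load_metrics(current_text)
    regressions = []
    for key, base in baseline.items():
        if key not in current:
            continue  # metrics missing from the current run are skipped here
        if current[key] < base * (1.0 - tolerance):
            regressions.append((key, base, current[key]))
    return regressions


# Hypothetical sample data, not real measurements.
BASELINE = """benchmark,case,metric,value
peak_gemm,fp16,tflops,100.0
hbm_bandwidth,copy,gb_per_s,1000.0
"""
CURRENT = """benchmark,case,metric,value
peak_gemm,fp16,tflops,90.0
hbm_bandwidth,copy,gb_per_s,990.0
"""

for key, base, cur in find_regressions(BASELINE, CURRENT):
    print(f"REGRESSION {key}: baseline={base} current={cur}")
```

A real backend would also need per-metric direction (latency is lower-is-better) and a policy for metrics that disappear between runs.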
Each benchmark is a plugin under benchmarks/<name>/ (see its README.md):
- pcie_bandwidth, hbm_bandwidth, p2p_bandwidth - PCIe / HBM / peer-to-peer bandwidth.
- allreduce, alltoall - RCCL collective bandwidth.
- paged_attention, flash_attention - paged-attention and flash-attention kernels.
- sustained_gemm, peak_gemm - GEMM shape sweep and peak throughput.
The harness orchestrates workloads but does not bundle them; the target host or
container must provide the relevant tools per benchmark: TransferBench,
hip-stream (built from vendor/), p2pBandwidthLatencyTest (built from
vendor/), hipblaslt-bench, mpirun + RCCL perf tests, and Python packages
torch / vllm / flash_attn for the attention benchmarks. Run
benchtestkit check-env to see what is missing before a run.
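For a sense of what such an environment check involves, the following sketch looks up executables on PATH and Python packages without importing them. It is an assumption-laden illustration, not the code behind `benchtestkit check-env`; the tool and package names are taken from the list above, and the function name `missing_requirements` is hypothetical.

```python
import importlib.util
import shutil

# Names taken from the prerequisites listed above. How benchtestkit maps
# them to individual benchmarks is not shown here.
REQUIRED_TOOLS = ["TransferBench", "hipblaslt-bench", "mpirun"]
REQUIRED_PACKAGES = ["torch", "vllm", "flash_attn"]


def missing_requirements(tools, packages):
    """Return (missing_tools, missing_packages) for the current environment."""
    # shutil.which returns None when the executable is not on PATH.
    missing_tools = [t for t in tools if shutil.which(t) is None]
    # find_spec locates a top-level package without importing it.
    missing_pkgs = [p for p in packages if importlib.util.find_spec(p) is None]
    return missing_tools, missing_pkgs


tools, pkgs = missing_requirements(REQUIRED_TOOLS, REQUIRED_PACKAGES)
print("missing tools:", tools)
print("missing packages:", pkgs)
```

Checking for a package with `find_spec` avoids the cost and side effects of importing heavy libraries such as torch just to confirm they are installed.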