TL;DR
I have a set of parser-only optimizations that make tomlkit.parse() roughly
2.4–2.9× faster (content-dependent) with no behavioural change: the full
test suite passes, including 680/680 of the toml-test conformance corpus,
and ruff + mypy --strict are clean.
Before opening a large PR, I'd like to check your appetite and preferred shape —
the work is naturally splittable into independent, separately-reviewable pieces.
Why this matters / why I'm being careful
tomlkit is a hot dependency of Poetry (and poetry-core), so I treated
correctness as non-negotiable: every change is gated on the conformance suite,
not just round-trip of valid files. (In fact, while validating I caught a subtle
EOF-sentinel collision a naive char-interning optimization can introduce — the
fix is included and is itself covered by toml-test invalid/control/*-null.)
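To make the collision concrete, here is a minimal sketch. It assumes the end-of-input check is an identity test against a NUL-valued sentinel. `Char`, `interned`, `EOF_NAIVE`, `EOF_SAFE` and `at_eof` are hypothetical names for illustration, not tomlkit's actual classes, and the dedicated sentinel is only one possible fix, not necessarily the one in the branch.

```python
from __future__ import annotations


class Char(str):
    """Hypothetical stand-in for a parser's character type."""


_CACHE: dict[str, Char] = {}


def interned(c: str) -> Char:
    """Naive interning: hand out one shared Char object per distinct character."""
    try:
        return _CACHE[c]
    except KeyError:
        obj = _CACHE[c] = Char(c)
        return obj


# Naive variant: the EOF sentinel is itself taken from the cache.
EOF_NAIVE = interned("\0")
# Safer variant (assumption): a dedicated object the cache never hands out.
EOF_SAFE = Char("\0")


def at_eof(current: Char, sentinel: Char) -> bool:
    """End-of-input test by identity, as an `is`-based fast path would do it."""
    return current is sentinel


# A literal NUL byte read from the *input* goes through the same cache...
nul_from_input = interned("\0")

# ...so with the naive sentinel it is indistinguishable from end-of-input,
# and an invalid document could be accepted as if it simply ended there.
print("naive sentinel treats input NUL as EOF:", at_eof(nul_from_input, EOF_NAIVE))
print("dedicated sentinel treats input NUL as EOF:", at_eof(nul_from_input, EOF_SAFE))
```

With an un-interned character type every input NUL is a fresh object, so the identity test never fires by accident; interning is what turns the input NUL and the sentinel into the same object.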
Reproducing the numbers
A dependency-free benchmark is attached (bench_standalone.py): run it on
master, then on the branch, and compare medians. Measured here (macOS,
CPython 3.11):
# master (8694e4d)
$ python bench_standalone.py
median : 791.6 ms (1583 us/parse)
# branch
$ python bench_standalone.py
median : 316.8 ms ( 634 us/parse) -> ~2.5x faster
Cross-checked with a drift-immune A/B (both versions timed, interleaved, in one
process) on other workloads:
| Workload | master | branch | factor |
|---|---|---|---|
| embedded sample (bench_standalone.py) | 792 ms | 317 ms | ×2.50 |
| number-heavy ~3 KB doc | ~2970 ms | ~1045 ms | ×2.85 |
| 262 distinct toml-test/valid files | 1747 ms | 717 ms | ×2.44 |
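The interleaved A/B idea can be sketched as follows. This is an illustration under stated assumptions, not the harness used for the table: `interleaved_ab`, `parse_a` and `parse_b` are hypothetical names, and how the two tomlkit versions get loaded side by side in one process is left out.

```python
import statistics
import time
from typing import Callable, List, Tuple


def interleaved_ab(
    parse_a: Callable[[str], object],
    parse_b: Callable[[str], object],
    content: str,
    iters: int = 200,
    rounds: int = 15,
) -> Tuple[float, float]:
    """Return (median_a_ms, median_b_ms) over `rounds` alternating batches.

    Each round times one batch of A immediately followed by one batch of B,
    so slow drift (thermal throttling, background load) hits both sides
    about equally instead of penalising whichever version ran later.
    """
    a_ms: List[float] = []
    b_ms: List[float] = []
    for _ in range(rounds):
        for fn, out in ((parse_a, a_ms), (parse_b, b_ms)):
            t0 = time.perf_counter()
            for _ in range(iters):
                fn(content)
            out.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(a_ms), statistics.median(b_ms)


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without tomlkit installed.
    a, b = interleaved_ab(str.split, str.splitlines, "k = 1\n" * 50, iters=50, rounds=5)
    print(f"A: {a:.3f} ms  B: {b:.3f} ms")
```

Comparing medians of interleaved batches is what makes the factor column robust: a one-off slow batch affects neither median much, and any sustained slowdown is shared by both versions.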
What the changes do (grouped, so you can pick)
The core idea: the parser advanced one character per call (inc() is called
~2.2 M times on the benchmark). The big wins replace those per-character calls
with bulk scans over the underlying string. The changes are grouped by theme,
and each theme is independently reviewable:
| # | Theme | Idea | Risk |
|---|---|---|---|
| A | Bulk char scanning | index-based Source; advance_while/advance_until replace per-char inc() loops in bare-keys, whitespace, numbers, comments, single-line strings | medium: hot path, but pure scan-equivalence |
| B | Dispatch micro-wins | frozenset membership, precomputed enum values, is vs set membership, hoisting loop invariants, binding Source delegates | low: mechanical |
| C | Object interning | module-level TOMLChar cache (+ the NUL/EOF fix) | medium: includes the correctness fix above |
| D | __deepcopy__ | hand-rolled Container/Trivia deepcopy to skip the reflective machinery (super-table merge path) | higher: touches copy/alias semantics, wants careful review |
Themes A + B alone deliver most of the gain at the lowest risk; C and D can be
deferred or dropped.
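For readers who want a feel for theme A, here is a minimal sketch of per-character advancing next to bulk scanning. `IndexedSource` and the signatures of `advance_while` and `advance_until` are assumptions made for illustration. The real Source in the branch may look quite different, and using a compiled regex for the scan is my choice for the sketch, not a claim about the branch.

```python
import re

# Character classes as compiled patterns; `*` means a match always succeeds.
BARE_KEY = re.compile(r"[A-Za-z0-9_-]*")
WS = re.compile(r"[ \t]*")
DIGITS = re.compile(r"[0-9]*")


class IndexedSource:
    """Hypothetical index-based source over an immutable string."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._idx = 0

    def inc(self) -> bool:
        """Per-character advance: one Python-level call per character."""
        if self._idx >= len(self._text):
            return False
        self._idx += 1
        return True

    def advance_while(self, pattern: "re.Pattern[str]") -> str:
        """Consume the longest run matching `pattern` in a single call."""
        m = pattern.match(self._text, self._idx)
        assert m is not None  # guaranteed by the `*` quantifier
        self._idx = m.end()
        return m.group()

    def advance_until(self, stop: str) -> str:
        """Consume everything up to (not including) `stop`, or to end of text."""
        start = self._idx
        end = self._text.find(stop, start)
        if end == -1:
            end = len(self._text)
        self._idx = end
        return self._text[start:end]


src = IndexedSource("port = 8080 # comment\n")
key = src.advance_while(BARE_KEY)   # bare key in one call instead of four inc()
src.advance_while(WS)
src.inc()                           # step over the '='
src.advance_while(WS)
number = src.advance_while(DIGITS)
rest = src.advance_until("\n")      # trailing whitespace + comment in one call
print(repr(key), repr(number), repr(rest))
```

The scan-equivalence argument is that each bulk call consumes exactly the characters the per-character loop would have consumed, so the parser state after the call is the same; only the number of Python-level calls changes.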
What I've verified — and what I haven't (yet)
- ✅ Full pytest tests green incl. toml-test (680/680), matching master.
- ✅ ruff check, ruff format, mypy --strict clean.
- ✅ Drift-immune A/B benchmark (numbers above).
- ⏳ Only CPython 3.11 / macOS so far — not the full CI matrix (3.9–3.14 ×
3 OS) nor the Poetry/poetry-core integration jobs. I'll run those before any PR
if you're interested.
Questions for the maintainers
- Is a parser speedup of this size something you'd consider merging?
- If so, do you prefer one PR per theme (A, then B, …) or a single curated
PR of A+B with C/D as follow-ups?
- Are there any themes you'd rather not take (e.g. the __deepcopy__ rewrite) on
maintainability grounds?
Happy to adapt to whatever keeps the hot path readable for you.
bench_standalone.py (zero-dependency, copy-paste & run)
#!/usr/bin/env python3
"""Standalone, dependency-free parse benchmark for tomlkit.

Run it on `master`, then on the optimization branch, and compare the medians:

    python bench_standalone.py                 # default: 500 iters x 15 runs
    python bench_standalone.py --iters 1000 --runs 25

Only stdlib + the locally-installed `tomlkit` are used, so the numbers are
trivially reproducible. The sample document below is embedded (no external
files) and exercises the common value kinds: strings, ints/floats, bools,
datetimes, arrays, inline tables, nested tables, AoT and comments.

Methodology: warmup, then `runs` timed batches of `iters` parses each;
we report the median batch time (robust to OS scheduling noise) and stdev.
"""

from __future__ import annotations

import argparse
import statistics
import time

import tomlkit

# Representative ~3 KB TOML covering the value kinds the parser dispatches on.
SAMPLE = """\
# A representative TOML document
title = "tomlkit benchmark"
version = 2
ratio = 3.14159
enabled = true
created = 2024-01-15T08:30:00Z
tags = ["alpha", "beta", "gamma", "delta"]
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
inline = { x = 1, y = 2, z = 3 }
[server]
host = "localhost"
port = 8080
timeout = 30.5
backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
[server.tls]
enabled = true
ciphers = ["TLS_AES_256", "TLS_CHACHA20"]
[database]
url = "postgres://localhost:5432/app" # inline comment
pool_size = 20
read_only = false
[[products]]
name = "Widget"
sku = 738594937
price = 19.99
[[products]]
name = "Gadget"
sku = 284758393
price = 49.95
in_stock = false
[logging]
level = "info"
format = "%(asctime)s %(message)s"
rotate_mb = 100
"""
def measure(content: str, iters: int) -> float:
    """Time `iters` consecutive parses of `content`; return elapsed milliseconds."""
    t0 = time.perf_counter()
    for _ in range(iters):
        tomlkit.parse(content)
    return (time.perf_counter() - t0) * 1000.0  # ms


def main() -> None:
    ap = argparse.ArgumentParser(description=__doc__)
    ap.add_argument("--iters", type=int, default=500)
    ap.add_argument("--runs", type=int, default=15)
    ap.add_argument("--warmup", type=int, default=3)
    args = ap.parse_args()

    # Untimed warmup batches so caches and allocator state settle first.
    for _ in range(args.warmup):
        measure(SAMPLE, args.iters)

    samples = sorted(measure(SAMPLE, args.iters) for _ in range(args.runs))
    # statistics.median also handles an even number of runs correctly.
    median = statistics.median(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0

    print(f"tomlkit {getattr(tomlkit, '__version__', '?')}")
    print(f"{args.iters} parses x {args.runs} runs (warmup x{args.warmup})")
    print(f" median : {median:8.1f} ms ({median / args.iters * 1000:.1f} us/parse)")
    print(f" stdev : {stdev:8.1f} ms")
    print(f" min/max: {samples[0]:.1f} / {samples[-1]:.1f} ms")


if __name__ == "__main__":
    main()