Skip to content

perf: eager-load params when streaming on cpu backend#1687

Open
fszontagh wants to merge 2 commits into
leejet:masterfrom
fszontagh:perf/auto-eager-streaming
Open

perf: eager-load params when streaming on cpu backend#1687
fszontagh wants to merge 2 commits into
leejet:masterfrom
fszontagh:perf/auto-eager-streaming

Conversation

@fszontagh

@fszontagh fszontagh commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Summary

After #1644, model weights are loaded from disk lazily on the first prepare_params call. For apps that pre-load the model (long-lived servers, batched generation) this means the actual disk I/O happens during the first sampling step rather than during the explicit model-load call. With --stream-layers on a large CPU-backed model, that first step pays the full multi-segment disk read while the host expects "the model is loaded, generation should be fast now."

This PR adds an --eager-load flag (off by default). When set, all registered params are loaded into the params backend right after metadata validation, so subsequent prepare_params calls fast-path and the I/O cost is paid at model-load time instead of during sampling. No behavior change for users who don't pass the flag.

Related

Follow-up to closed #1646 (same idea, was flag-named --eager-load-params).

Numbers

RTX 3060 12 GB, --offload-to-cpu --stream-layers --max-vram -1:

Workload Default (lazy) --eager-load
Z-Image bf16 1024x688 batch=2 9 steps generate_image 244 s 63 s
Qwen Image Edit Q8 1024x688 20 steps generate_image 40+ min stall 68 s

The disk-read work isn't avoided, only relocated from "first sampling step" to "model load." For interactive / long-lived apps this is the user-visible win.

Checklist

@leejet

leejet commented Jun 21, 2026

Copy link
Copy Markdown
Owner

I think it would be better to control this with a flag, such as --eager-load, and leave it disabled by default.

@fszontagh fszontagh force-pushed the perf/auto-eager-streaming branch from df69b60 to 5f817c0 Compare June 21, 2026 16:57
@fszontagh

Copy link
Copy Markdown
Contributor Author

Done in 5f817c0. Switched to a public --eager-load flag, defaults to false. The lazy path stays the upstream default; users opt in when they want the I/O paid at model-load time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants