AstraFlow is a dataflow-oriented reinforcement learning system designed for better flexibility and scalability.
AstraFlow natively supports the following for LLM RL training without any feature-specific system engineering:
- Fully async multi-policy collaborative RL
- Elastic heterogeneous cross-region rollouts
- Substitutable rollout and trainer services
- Composable data algorithms
Elastic Heterogeneous Cross-region Rollouts: RaaS instances on mixed hardware and across regions join and leave the rollout pool on demand, with no scheduler- or region-specific code.
Fully Async Multi-policy Collaborative RL Training: multiple policies train together, each as an independent trainer with its own data and weight stream.
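The two ideas above can be pictured with a toy sketch: rollout workers join and leave a pool at will, and each policy keeps its own data buffer and weight version. This is only an illustration under stated assumptions, not AstraFlow's API. The names `RolloutPool`, `PolicyStream`, `join`, `leave`, and `try_train_step`, as well as the region and hardware strings, are hypothetical and were chosen for this example.

```python
from collections import deque


class RolloutPool:
    """Hypothetical registry of rollout workers (not the AstraFlow API)."""

    def __init__(self):
        self.workers = {}

    def join(self, worker_id, region, hardware):
        # Any worker may register, regardless of region or hardware type.
        self.workers[worker_id] = {"region": region, "hardware": hardware}

    def leave(self, worker_id):
        # Leaving is idempotent: an unknown worker is simply ignored.
        self.workers.pop(worker_id, None)

    def active(self):
        return sorted(self.workers)


class PolicyStream:
    """Hypothetical per-policy data buffer plus weight version counter."""

    def __init__(self, name, batch_size):
        self.name = name
        self.batch_size = batch_size
        self.buffer = deque()
        self.weight_version = 0

    def push(self, sample):
        self.buffer.append(sample)

    def try_train_step(self):
        # Train only when this policy's own buffer holds a full batch;
        # other policies are never consulted or blocked.
        if len(self.buffer) < self.batch_size:
            return False
        for _ in range(self.batch_size):
            self.buffer.popleft()
        self.weight_version += 1  # stands in for publishing new weights
        return True


pool = RolloutPool()
pool.join("w0", region="us-east", hardware="gpu-a")
pool.join("w1", region="eu-west", hardware="gpu-b")
pool.leave("w0")  # the pool shrinks without touching the trainers

actor = PolicyStream("actor", batch_size=2)
verifier = PolicyStream("verifier", batch_size=3)
for i in range(4):
    actor.push({"policy": "actor", "sample": i})
verifier.push({"policy": "verifier", "sample": 0})

while actor.try_train_step():
    pass
verifier.try_train_step()  # not enough data yet, so no step is taken

print(pool.active(), actor.weight_version, verifier.weight_version)
```

In this sketch the actor advances two weight versions while the verifier stays at version 0, because each stream progresses only on its own data. A real system would replace the in-memory buffers with networked services.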
- [2026/05] AstraFlow v0.1.0 released — first public release of the full system. See the project website.
- [2026/05] AstraFlow paper is on arXiv.
AstraFlow currently supports the following recipes. Check the documentation for more detailed instructions.
| Recipe | Description |
|---|---|
| math/ | RLVR math reasoning: Qwen3-1.7B / 8B, M2PO, full and delta-weight transfer |
| math-multi-agent/ | Actor + verifier collaborative math training |
| math-efficient-data/ | Composable data algorithms: GRESO, dynamic sampling, buffer replay |
| code/ | Code-generation RL: Qwen3-8B, M2PO |
| code-multi-agent/ | Codegen + verifier competitive coding |
| search/ | Search-augmented agent training with local retrieval |
| alfworld/ | ALFWorld embodied household agent |
| webshop/ | WebShop web-navigation shopping agent |
Near-term focus:
- Offline cluster training — Support training on offline clusters without internet access.
- All-in-one launcher — A launcher helper that streamlines bringing up the AstraFlow, RaaS, and trainer services.
- MoE model support — Extend the training backends to Mixture-of-Experts models.
- Terminal-Bench training — Add a recipe for training agents on Terminal-Bench.
- Megatron backend — Add Megatron-LM as a training backend.
- vLLM rollout engine — Support vLLM alongside SGLang as a rollout engine.
If you find AstraFlow useful in your research, please cite:
@article{zheng2026astraflow,
title = {AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs},
author = {Zheng, Haizhong and Di, Yizhuo and Wang, Jiahui and Jin, Shuowei and
Liu, Xueshen and Wu, Yongji and Mao, Z. Morley and Stoica, Ion and
Zhao, Jiawei and Chen, Beidi},
journal = {arXiv preprint arXiv:2605.15565},
year = {2026}
}

We learned from the design of, and reused code from, the following projects: AReaL, verl, AgentBench, ASearcher, and M2PO.

