The DataFlow series is a data preparation and training system designed to parse, generate, process, and evaluate high-quality data from noisy sources (PDFs, plain text, low-quality QA pairs). The resulting data improves the performance of large language models (LLMs) in specific domains, either through targeted training (pre-training, supervised fine-tuning, RL training) or through retrieval-augmented generation (RAG) over a cleaned knowledge base.
Specifically, we construct diverse operators that leverage rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are integrated into distinct pipelines, which together form the DataFlow system. We also develop an intelligent DataFlow-agent that can dynamically assemble new pipelines by recombining existing operators on demand.
DataFlow-MM is the multimodal extension of the DataFlow repository.
First, clone the repository and install DataFlow-MM in editable mode:
cd ./DataFlow-MM
conda create -n dataflow-mm python=3.12
conda activate dataflow-mm
pip install -e .

Install additional dependencies based on your use case:

Audio environment
pip install -e ".[audio]"

Image environment
pip install -e ".[image]"

Create and initialize a DataFlow-MM workspace:
mkdir test_dataflow
cd test_dataflow
dataflowmm init

This command generates the basic directory structure and configuration files required to run DataFlow-MM pipelines.
To run the Image or Video examples, please download the corresponding demo datasets from Hugging Face (GitHub is not suitable for hosting large files):
- Image Examples: https://huggingface.co/datasets/OpenDCAI/dataflow-demo-image
- Video Examples: https://huggingface.co/datasets/OpenDCAI/dataflow-demo-video
After downloading, place the data in the "test_dataflow/example" directory as instructed in each example.
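One possible way to script the download is sketched below with the `huggingface_hub` package, which must be installed separately. The dataset repo ids come from the links above. The helper names `demo_target_dir` and `download_demo`, and the per-dataset subfolders `example/image` and `example/video`, are assumptions made for illustration; follow the layout each example actually asks for.

```python
from pathlib import Path

# Dataset repo ids taken from the Hugging Face links above.
DEMO_DATASETS = {
    "image": "OpenDCAI/dataflow-demo-image",
    "video": "OpenDCAI/dataflow-demo-video",
}


def demo_target_dir(workspace: str, kind: str) -> Path:
    """Return the local folder for one demo dataset.

    The "<workspace>/example/<kind>" layout is an assumption,
    not something the project prescribes.
    """
    if kind not in DEMO_DATASETS:
        raise ValueError(f"unknown demo kind: {kind!r}")
    return Path(workspace) / "example" / kind


def download_demo(workspace: str, kind: str) -> Path:
    """Download one demo dataset into the workspace (hypothetical helper)."""
    # Imported lazily so demo_target_dir works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = demo_target_dir(workspace, kind)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(
        repo_id=DEMO_DATASETS[kind],
        repo_type="dataset",  # these repos are datasets, not models
        local_dir=str(target),
    )
    return target
```

Calling `download_demo("test_dataflow", "image")` from the directory that contains the workspace would fetch the image demo; pass `"video"` for the video demo.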

