DataFlow-MM

🎉 If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

📰 1. News

🔍 2. Overview

The DataFlow series is a data preparation and training system designed to parse, generate, process, and evaluate high-quality data from noisy sources (PDF, plain text, low-quality QA). The goal is to improve the performance of large language models (LLMs) in specific domains, either through targeted training (pre-training, supervised fine-tuning, RL training) or through retrieval-augmented generation (RAG) backed by a cleaned knowledge base.

Specifically, we build diverse operators using rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are integrated into distinct pipelines that together form the DataFlow system. We also develop an intelligent DataFlow-agent that can assemble new pipelines on demand by recombining existing operators.

DataFlow-MM is the multimodal extension of the DataFlow repository.

Quick Start

Installation

First, clone the repository and install DataFlow-MM in editable mode:

git clone https://github.com/OpenDCAI/DataFlow-MM.git
cd ./DataFlow-MM
conda create -n dataflow-mm python=3.12
conda activate dataflow-mm
pip install -e .

Optional Dependencies

Install additional dependencies based on your use case:

Audio environment

pip install -e ".[audio]"

Image environment

pip install -e ".[image]"
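If you need both environments, the two extras can probably be requested in one step. The sketch below relies on standard pip syntax for comma-separated extras, which this README does not state explicitly; the variable names are illustrative only. It prints the command as a dry run instead of executing it.

```shell
# Extras to install; "audio" and "image" are the two extras named above.
EXTRAS="audio,image"

# Build the editable-install target for the current directory.
TARGET=".[${EXTRAS}]"

# Dry run: print the command, quoted so the shell will not treat the
# brackets as a glob pattern. Remove `echo` and the escaped quotes to
# actually install from the repository root.
echo pip install -e "\"${TARGET}\""
```

Run the printed command from the DataFlow-MM repository root with the dataflow-mm conda environment active.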

Initialize a DataFlow Workspace

Create and initialize a DataFlow-MM workspace:

mkdir test_dataflow
cd test_dataflow
dataflowmm init

This command will generate the basic directory structure and configuration files required to run DataFlow-MM pipelines.
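If the init step fails with a "command not found" error, a quick check like the one below may help narrow it down. It assumes the editable install exposes dataflowmm as a command on PATH, which the init step above implies; CLI_STATUS is a hypothetical variable used only for this check.

```shell
# Check whether the dataflowmm entry point is visible on PATH.
if command -v dataflowmm >/dev/null 2>&1; then
  CLI_STATUS="found"
  echo "dataflowmm is available at: $(command -v dataflowmm)"
else
  CLI_STATUS="missing"
  echo "dataflowmm not found on PATH."
  echo "Try: conda activate dataflow-mm, then rerun 'pip install -e .' in the repository root."
fi
```

A missing command usually means the conda environment is not active in the current shell or the install step did not complete.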

Demo Data

To run the Image or Video examples, please download the corresponding demo datasets from Hugging Face (GitHub is not suitable for hosting large files):

Image Examples: https://huggingface.co/datasets/OpenDCAI/dataflow-demo-image
Video Examples: https://huggingface.co/datasets/OpenDCAI/dataflow-demo-video

After downloading, place the data in the "test_dataflow/example" directory as instructed in each example.
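One possible way to fetch the demo data is the Hugging Face command-line tool. The sketch below only prints the download commands as a dry run. The repository IDs come from the URLs above; the use of huggingface-cli (installed with the huggingface_hub package) and the shared destination directory are assumptions, so adjust DEST to whatever each example instructs.

```shell
# Dataset repo IDs, taken from the URLs listed above.
IMAGE_REPO="OpenDCAI/dataflow-demo-image"
VIDEO_REPO="OpenDCAI/dataflow-demo-video"

# Assumption: commands are run from the parent of test_dataflow.
# The per-example layout is not specified here, so adjust as needed.
DEST="test_dataflow/example"

# Print the download command for one dataset repo (dry run).
# Running the printed command requires: pip install -U "huggingface_hub[cli]"
download_cmd() {
  echo "huggingface-cli download $1 --repo-type dataset --local-dir $DEST"
}

download_cmd "$IMAGE_REPO"
download_cmd "$VIDEO_REPO"
```

Copy the printed command for the dataset you need and run it once you have confirmed the target directory for that example.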

