Dataflow-MM, multi-media operators for Dataflow. We aim to prepare data for Multimodal Large Language Models.

OpenDCAI/DataFlow-MM

Dataflow-MM

🎉 If you like our project, please give us a star ⭐ on GitHub for the latest update.

📰 1. News

🔍 2. Overview

The DataFlow series is a data preparation and training system designed to parse, generate, process, and evaluate high-quality data from noisy sources (PDF, plain text, low-quality QA). It aims to improve the performance of large language models (LLMs) in specific domains, either through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG over a cleaned knowledge base.

Specifically, we build diverse operators based on rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are integrated into distinct pipelines, which together form the DataFlow system. We also develop a DataFlow-agent that can assemble new pipelines on demand by recombining existing operators.

DataFlow-MM is the multimodal extension of the DataFlow repository.

Quick Start

Installation

First, clone the repository and install DataFlow-MM in editable mode:

git clone https://github.com/OpenDCAI/DataFlow-MM.git
cd ./DataFlow-MM
conda create -n dataflow-mm python=3.12
conda activate dataflow-mm
pip install -e .
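
After the install finishes, a quick sanity check can confirm that the environment is active and the command-line entry point is reachable. The sketch below is not part of the project: `check_cli` is a hypothetical helper, and it only assumes that the `dataflowmm` command used later in this README ends up on `PATH` once the package is installed.

```shell
# Hypothetical helper (not shipped with DataFlow-MM): report whether a
# command is reachable on PATH in the current shell.
check_cli() {
    # $1: name of the command to look for
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check the interpreter and the CLI that the workspace step relies on.
check_cli python || echo "Hint: run 'conda activate dataflow-mm' first."
check_cli dataflowmm || echo "Hint: rerun 'pip install -e .' inside the dataflow-mm environment."
```

If either check reports `missing`, the usual cause is that the conda environment was not activated in the current shell.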

Optional Dependencies

Install additional dependencies based on your use case:

Audio environment

pip install -e ".[audio]"

Image environment

pip install -e ".[image]"
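
If both environments are needed, pip accepts several extras in one bracketed, comma-separated list. The sketch below only builds and prints the command rather than running it; `extras_spec` is a hypothetical helper, and the extra names `audio` and `image` are the ones listed above.

```shell
# Hypothetical helper: join extra names into pip's ".[a,b]" syntax.
# The body runs in a subshell so the IFS change does not leak out.
extras_spec() (
    IFS=,
    printf '.[%s]\n' "$*"
)

# Print the combined install command for review before running it.
echo "pip install -e \"$(extras_spec audio image)\""
```

Running the printed command from the repository root installs the base package together with both optional dependency groups.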

Initialize a DataFlow Workspace

Create and initialize a DataFlow-MM workspace:

mkdir test_dataflow
cd test_dataflow
dataflowmm init

This command will generate the basic directory structure and configuration files required to run DataFlow-MM pipelines.


Demo Data

To run the Image or Video examples, please download the corresponding demo datasets from Hugging Face (GitHub is not suitable for hosting large files):

After downloading, place the data in the "test_dataflow/example" directory as instructed in each example.
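
Before launching an example, it can help to confirm that the data landed where the examples expect it. This is a minimal sketch, not a project tool: `has_demo_data` is a hypothetical helper, and the script assumes it is run from the directory that contains `test_dataflow`.

```shell
# Hypothetical helper: succeed only if the directory exists and holds
# at least one entry (hidden files included).
has_demo_data() {
    dir="$1"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Path taken from the instructions above, relative to the parent of the workspace.
if has_demo_data "test_dataflow/example"; then
    echo "demo data found"
else
    echo "no demo data in test_dataflow/example yet"
fi
```

The check only looks for a non-empty directory; it does not validate which files each example needs.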
