Open-Bee/DataStudio

DataStudio Logo

Config-driven multimodal data processing pipeline for MLLMs


Paper | Project Page | HuggingFace

Simplified Chinese | Documentation


Introduction

DataStudio is a data processing toolkit for Multimodal Large Language Models (MLLMs). It supports both rule-based and MLLM-powered operators for filtering and rewriting data. DataStudio drives the HoneyPipe data cleaning pipeline that produced Honey-Data-15M — a high-quality dataset of 15 million QA pairs.

Key Features

  • Config-Driven: Define complete processing pipelines via Python config files with config inheritance (_base_)
  • Dual-Engine Operators: 16 built-in rule operators (7 filters + 9 rewriters), plus MLLM-powered operators for intelligent filtering and rewriting
  • High Performance: Multi-process asynchronous API requests (8192+ concurrent requests via MPOpenAIAPI), LMDB sharded image caching, and fork-based copy-on-write (COW) memory sharing
  • Checkpoint Resume: Automatic progress saving with checkpoint-based recovery after interruption
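
Config inheritance lets a pipeline config pull in shared defaults and override only the fields that change. A minimal sketch (the base path and the overridden fields below are illustrative, not a real config from the repo):

```python
# configs/my_pipeline.py -- inherits shared defaults via _base_,
# then overrides them. Paths and field values here are placeholders.
_base_ = ["@/_base_/dataset.py"]

# Fields defined here shadow the same-named fields from the base config.
work_dir = "./work_dirs/my_experiment"
dataloader = dict(dataset="/path/to/dataset.yaml", batch_size=500)
```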

Auxiliary Tools

  • LLMRouter: An OpenAI API-compatible, high-performance reverse proxy router built in Go. It supports multiple LLM backends with weighted load balancing, sliding-window RPM throttling, async health checks with automatic failover, and config hot-reload. See its README for details.
  • DataVis: A multimodal data visualization and analysis platform (React + FastAPI) for browsing, filtering, and analyzing processed datasets.
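
Because LLMRouter speaks the standard OpenAI API, any OpenAI-compatible client can point at it unchanged. A minimal sketch using only the standard library (the router address, model name, and API key are placeholder assumptions):

```python
import json
import urllib.request

# LLMRouter exposes the standard OpenAI chat-completions endpoint;
# the host/port, model name, and key below are placeholders.
ROUTER_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "my-backend-model",
    "messages": [
        {"role": "user", "content": "Describe this dataset in one line."}
    ],
}
request = urllib.request.Request(
    ROUTER_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer sk-placeholder",
    },
)
# response = urllib.request.urlopen(request)  # uncomment with a running router
```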

Installation

git clone https://github.com/Open-Bee/DataStudio.git
cd DataStudio
pip install -e .

Requirements: Python >= 3.10.


Quick Start

1. Prepare a Dataset YAML

# dataset.yaml
datasets:
  - json_path: /path/to/data.jsonl
    source: my_dataset
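
Each json_path points at a JSONL file with one sample per line. The exact schema is documented in the Quick Start Guide; the sketch below assumes a common conversations-style MLLM layout, which may differ from what DataStudio actually expects:

```python
import json

# One record per line. The field names ("image", "conversations",
# "from"/"value") follow a common MLLM convention and are an
# assumption -- check the Quick Start Guide for the real schema.
sample = {
    "image": "images/0001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown here?"},
        {"from": "gpt", "value": "A honeybee on a sunflower."},
    ],
}

with open("data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```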

2. Write a Config File

# configs/my_pipeline.py
_base_ = ["@/_base_/dataset.py"]

work_dir = "./work_dirs/my_experiment"
logger = dict(type="Logger", log_file="logs/process.log")
dataset_yaml = "/path/to/dataset.yaml"

dataloader = dict(dataset=dataset_yaml, batch_size=1000, use_image=False)
datasaver = dict(dataset=dataset_yaml, output_dir="./output", save_yaml_name="processed")

pipeline = dict(
    type="Pipeline",
    operations={
        "rule_filters": dict(
            cfg=dict(type="SubPipeline", operators=[
                dict(type="ConvLengthFilter", min_length=1, max_length=20),
                dict(type="ImageSizeFilter", min_size=14),
            ]),
            priority=0,
        ),
        "rewriters": dict(
            cfg=dict(type="SubPipeline", operators=[
                dict(type="RemoveThinkRewriter"),
            ]),
            priority=1,
        ),
    }
)

3. Run

python run.py -c configs/my_pipeline.py

For a complete tutorial covering data formats, MLLM operators, custom operators, performance tuning, and more, see the Quick Start Guide.
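
Custom operators follow the same filter/rewriter pattern as the built-ins: a filter is a callable that decides whether a sample is kept. The self-contained sketch below mimics that shape; the `Filter` base here is a stand-in, and in DataStudio you would instead subclass the real base classes from `datastudio/operators/core` and register the class so configs can reference it by `type` name:

```python
# Self-contained illustration of the rule-filter pattern. The Filter
# base class here is a stand-in for DataStudio's real API.
class Filter:
    def __call__(self, item: dict) -> bool:
        raise NotImplementedError


class MinTurnsFilter(Filter):
    """Keep only samples with at least `min_turns` conversation turns."""

    def __init__(self, min_turns: int = 1):
        self.min_turns = min_turns

    def __call__(self, item: dict) -> bool:
        return len(item.get("conversations", [])) >= self.min_turns


keep = MinTurnsFilter(min_turns=2)
print(keep({"conversations": [{"from": "human"}, {"from": "gpt"}]}))  # True
```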


Project Structure

DataStudio/
├── run.py                  # Main entry point (supports -c/--config, --cache-images)
├── datastudio/
│   ├── operators/          # Operators
│   │   ├── core/           # Base: Operator, Filter, Rewriter, DataItem, OperatorResult
│   │   ├── filters/        # 7 rule-based filters
│   │   ├── rewriters/      # 9 rule-based rewriters
│   │   └── mllm/           # MLLM operators: MLLMFilter, MLLMRewriter, SelectiveMLLMRewriter
│   ├── models/             # Model backends: OpenAIAPI, MPOpenAIAPI
│   ├── pipelines/          # Pipeline, SubPipeline
│   ├── datasets/           # Data loading/saving, format conversion, YAML generation
│   └── utils/              # Registry, config, logging, LMDB cache, checkpoint
├── configs/                # Configuration files
│   ├── _base_/             # Base configs (models, datasets, filters, rewriters)
│   └── examples/           # Ready-to-run example configurations with demo data
├── prompts/                # MLLM prompt templates
├── tools/                  # Auxiliary tools
│   ├── LLMRouter/          # OpenAI API-compatible reverse proxy with load balancing
│   └── DataVis/            # Multimodal data visualization and analysis platform
├── docs/                   # Documentation (Sphinx + Markdown guides)
└── tests/                  # Unit tests

About the Bee Project

DataStudio is the data cleaning engine of the Bee project. Through the HoneyPipe pipeline, it produced the Honey-Data-15M dataset.

| Component | Description | Link |
| --- | --- | --- |
| Honey-Data-15M | 15M high-quality QA pairs (with CoT) | HuggingFace |
| Bee-8B | Fully open-source 8B MLLM | HuggingFace |
| DataStudio | Data cleaning pipeline | GitHub |

Citation

@article{zhang2025bee,
  title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
  author={Zhang, Yi and Ni, Bolin and Chen, Xin-Sheng and Zhang, Heng-Rui and Rao, Yongming and Peng, Houwen and Lu, Qinglin and Hu, Han and Guo, Meng-Hao and Hu, Shi-Min},
  journal={arXiv preprint arXiv:2510.13795},
  year={2025}
}

Contributing

If you encounter any bugs or have feature requests, feel free to open an Issue or submit a Pull Request. Contributions are welcome!

License

Apache License 2.0

About

[ICLR 2026] Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
