Scripts to download HF dataset in arrayrecord or parquet by aireenmei · Pull Request #3315 · AI-Hypercomputer/maxtext

aireenmei · 2026-03-04T22:28:54Z

Description

Add scripts for downloading a Hugging Face dataset in ArrayRecord or Parquet format.
See the docstring in the script for details.

Tests

Use the scripts to download the OptimalScale/ClimbMix dataset

Download as ArrayRecord

python download_hf_dataset_as_arrayrecord.py \
    --dataset OptimalScale/ClimbMix \
    --output <path> \
    --workers 16

Log output: Done! Wrote 553,315,056 records to 1665 files in 16245.8s
Files are in gs://maxtext-dataset/array-record/climbmix/*.arrayrecord

Download as Parquet

python download_hf_dataset_as_parquet.py \
    --dataset OptimalScale/ClimbMix \
    --num-files 2048 \
    --row-group-size 50000 \
    --output <path> \
    --workers 16 \
    --split train

Log output: Done! Wrote 553,315,056 rows in 12535.5s
Files are in gs://maxtext-dataset/hf/climbmix/*.parquet
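For a quick comparison of the two runs, the reported totals can be turned into throughput and per-file averages. The sketch below is not part of the PR; `summarize` is a hypothetical helper. It assumes rows are spread evenly across files and that the Parquet run produced exactly the 2048 files requested via `--num-files`, neither of which the logs confirm. Under those assumptions the ArrayRecord run works out to roughly 34,000 records/s and the Parquet run to roughly 44,000 rows/s.

```python
import math


def summarize(rows, seconds, num_files, row_group_size=None):
    """Derive throughput and per-file averages from a run's reported totals."""
    stats = {
        "rows_per_second": rows / seconds,
        "hours": seconds / 3600,
        # Average only; the real per-file counts may be uneven.
        "avg_rows_per_file": rows / num_files,
    }
    if row_group_size:
        # Number of row groups needed to hold an average-sized file.
        stats["row_groups_per_file"] = math.ceil(
            stats["avg_rows_per_file"] / row_group_size
        )
    return stats


# Totals copied from the log lines and command flags above.
arrayrecord = summarize(553_315_056, 16245.8, 1665)
parquet = summarize(553_315_056, 12535.5, 2048, row_group_size=50_000)

for name, stats in (("arrayrecord", arrayrecord), ("parquet", parquet)):
    print(name, {key: round(value, 2) for key, value in stats.items()})
```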

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-03-04T22:32:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

github-actions · 2026-03-12T00:48:12Z

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

This PR adds two useful scripts for downloading and converting Hugging Face streaming datasets to ArrayRecord and Parquet formats. The implementation is well-structured and uses multiprocessing effectively for faster downloads.

🔍 General Feedback

State Restoration: Fixed a critical bug in the retry mechanism where dataset byte and record tracking variables were not restored from checkpoints, which would have led to duplicate records in the final dataset files.
Hugging Face Formatting: Suggested enforcing a minimum width of 5 digits when naming files to ensure strict adherence to standard Hugging Face dataset conventions.
GCS Checkpoints: Identified and resolved an issue where PyArrow Parquet checkpoints were improperly saving to an invalid local directory path (gs:/...) when using a GCS bucket. Utilizing the PyArrow fs filesystem handles this elegantly.
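Two of the points above can be illustrated with a minimal sketch. It is not the code from the PR: `shard_filename`, `collapsed_path`, and `resolve_output` are hypothetical helpers, and `gs://bucket/ckpt` is a placeholder URI. The first helper pads shard numbers to at least five digits, following the `train-00000-of-00010.parquet` pattern commonly seen in Hugging Face dataset repositories. The second shows one way a `gs:/...` path can arise: POSIX-style path handling collapses the double slash in a URI. Whether that was the actual cause in the script is an assumption. The third uses `pyarrow.fs.FileSystem.from_uri`, which returns a filesystem object together with the path inside it, so the URI is never treated as a local directory.

```python
from pathlib import PurePosixPath


def shard_filename(split: str, index: int, total: int, ext: str = "parquet") -> str:
    """Build a name like train-00007-of-02048.parquet (hypothetical helper)."""
    # :05d pads to at least 5 digits; larger numbers simply use more digits.
    return f"{split}-{index:05d}-of-{total:05d}.{ext}"


def collapsed_path(uri: str) -> str:
    """Show what POSIX path normalisation does to a URI (illustration only)."""
    return str(PurePosixPath(uri))


def resolve_output(uri: str):
    """Let PyArrow choose the filesystem for a URI instead of using local paths."""
    # Imported lazily so the helpers above need only the standard library.
    from pyarrow import fs

    # Returns (filesystem, path); for a gs:// URI the path is "bucket/key".
    return fs.FileSystem.from_uri(uri)


print(shard_filename("train", 7, 2048))
print(collapsed_path("gs://bucket/ckpt"))
```

With `resolve_output`, the returned filesystem can be passed to PyArrow writers through their `filesystem` argument, which avoids building the destination with local path utilities.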

tools/data_generation/download_hf_dataset_as_arrayrecord.py

tools/data_generation/download_hf_dataset_as_parquet.py

aireenmei force-pushed the aireen/hf_parquet branch from 07358c6 to 8fb1d8e Compare March 4, 2026 22:32

aireenmei force-pushed the aireen/hf_parquet branch 4 times, most recently from 90ceb4d to 5ee8a5a Compare March 12, 2026 00:46

aireenmei marked this pull request as ready for review March 12, 2026 00:46

aireenmei requested review from A9isha, NicoGrande, NuojCheng, RissyRan, SurbhiJainUSC, bvandermoon, dipannita08, gagika, gobbleturk, hengtaoguo, igorts-git, jesselu-google, jiangjy1982, khatwanimohit, richjames0, shralex, suexu1025 and vipannalla as code owners March 12, 2026 00:46

aireenmei added the gemini-review label Mar 12, 2026

github-actions bot reviewed Mar 12, 2026

aireenmei assigned gagika Mar 12, 2026

aireenmei changed the title ~~Script to download HF dataset and convert to parquet~~ Scripts to download HF dataset in arrayrecord or parquet Mar 12, 2026

aireenmei force-pushed the aireen/hf_parquet branch 3 times, most recently from 41b09d9 to 92c0be6 Compare March 12, 2026 19:38

NuojCheng approved these changes Mar 12, 2026

aireenmei force-pushed the aireen/hf_parquet branch from 92c0be6 to 5705275 Compare March 12, 2026 23:04

Script for download HF dataset and convert to parquet

86a9696

aireenmei force-pushed the aireen/hf_parquet branch from 5705275 to 86a9696 Compare March 12, 2026 23:14

aireenmei wants to merge 1 commit into main from aireen/hf_parquet

3 participants
