Scripts to download HF dataset in arrayrecord or parquet#3315

Open: aireenmei wants to merge 1 commit into main from aireen/hf_parquet

@aireenmei (Collaborator) commented Mar 4, 2026

Description

Add scripts for downloading a Hugging Face dataset in ArrayRecord or Parquet format.
See the docstring in each script for details.

Tests

Use the scripts to download the OptimalScale/ClimbMix dataset

  • Download as ArrayRecord
python download_hf_dataset_as_arrayrecord.py \
    --dataset OptimalScale/ClimbMix \
    --output <path> \
    --workers 16

Log output: Done! Wrote 553,315,056 records to 1665 files in 16245.8s
Files are in gs://maxtext-dataset/array-record/climbmix/*.arrayrecord

  • Download as Parquet
python download_hf_dataset_as_parquet.py \
    --dataset OptimalScale/ClimbMix \
    --num-files 2048 \
    --row-group-size 50000 \
    --output <path> \
    --workers 16 \
    --split train

Log output: Done! Wrote 553,315,056 rows in 12535.5s
Files are in gs://maxtext-dataset/hf/climbmix/*.parquet
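
As a rough sanity check on the two log lines above, the sketch below derives throughput and average shard size from the reported numbers. It assumes that `--num-files 2048` produced exactly 2048 evenly filled Parquet files and that every row group except the last in each file holds the full `--row-group-size`; neither is confirmed by the logs. The two runs may also have been made under different conditions, so the rates are not a controlled comparison.

```python
import math

TOTAL_ROWS = 553_315_056      # reported by both runs

# ArrayRecord run (values from the log line)
AR_FILES = 1665
AR_SECONDS = 16245.8

# Parquet run (file count is an assumption taken from --num-files)
PQ_FILES = 2048
PQ_ROW_GROUP_SIZE = 50_000
PQ_SECONDS = 12535.5

ar_rate = TOTAL_ROWS / AR_SECONDS          # about 34,059 records/s
pq_rate = TOTAL_ROWS / PQ_SECONDS          # about 44,140 rows/s
ar_rows_per_file = TOTAL_ROWS / AR_FILES   # about 332,321 records per file
pq_rows_per_file = TOTAL_ROWS / PQ_FILES   # about 270,173 rows per file

# Row groups per Parquet file under the even-fill assumption.
pq_groups_per_file = math.ceil(pq_rows_per_file / PQ_ROW_GROUP_SIZE)

print(f"ArrayRecord: {ar_rate:,.0f} records/s, {ar_rows_per_file:,.0f} records/file")
print(f"Parquet:     {pq_rate:,.0f} rows/s, {pq_rows_per_file:,.0f} rows/file, "
      f"{pq_groups_per_file} row groups/file")
```

In wall-clock terms the runs took roughly 4.5 hours (ArrayRecord) and 3.5 hours (Parquet) with 16 workers each.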

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@github-actions

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions github-actions bot left a comment

## 📋 Review Summary

This PR adds two useful scripts for downloading and converting Hugging Face streaming datasets to ArrayRecord and Parquet formats. The implementation is well-structured and uses multiprocessing effectively for faster downloads.

## 🔍 General Feedback

  • State Restoration: Fixed a critical bug in the retry mechanism where dataset byte and record tracking variables were not restored from checkpoints, which would have led to duplicate records in the final dataset files.
  • Hugging Face Formatting: Suggested enforcing a minimum width of 5 digits when naming files to ensure strict adherence to standard Hugging Face dataset conventions.
  • GCS Checkpoints: Identified and resolved an issue where PyArrow Parquet checkpoints were saved to an invalid local directory path (gs:/...) when the output was a GCS bucket. Using the PyArrow fs filesystem API avoids this.
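
The three points above can be illustrated with a minimal standard-library sketch. It is not the code in this PR: `WriterState`, `save_state`, `load_state`, `shard_name`, and `join_uri` are hypothetical names, the `<split>-<index>-of-<total>` file name layout is an assumption about the intended convention, and path normalization is shown as only one plausible way a `gs://` URI can collapse to `gs:/`.

```python
import json
import posixpath
from dataclasses import asdict, dataclass


@dataclass
class WriterState:
    """Hypothetical per-worker progress record; field names are assumptions."""
    shard_index: int = 0
    records_written: int = 0
    bytes_written: int = 0


def save_state(state: WriterState, path: str) -> None:
    # Persist every counter, not only the shard index.
    with open(path, "w") as f:
        json.dump(asdict(state), f)


def load_state(path: str) -> WriterState:
    # Restoring all counters lets a retry resume after the last recorded
    # record instead of starting its tracking from zero.
    with open(path) as f:
        return WriterState(**json.load(f))


def shard_name(split: str, index: int, total: int, ext: str = "parquet") -> str:
    # Zero-pad both numbers to a minimum width of 5 digits.
    return f"{split}-{index:05d}-of-{total:05d}.{ext}"


def join_uri(base: str, name: str) -> str:
    # Plain string concatenation keeps the "gs://" scheme intact.
    return base.rstrip("/") + "/" + name


# Path normalization treats the URI as a file path and collapses "//".
print(posixpath.normpath("gs://bucket/ckpt"))      # gs:/bucket/ckpt
print(join_uri("gs://bucket/ckpt", "state.json"))  # gs://bucket/ckpt/state.json
print(shard_name("train", 3, 2048))                # train-00003-of-02048.parquet
```

With PyArrow, `pyarrow.fs.FileSystem.from_uri` resolves a URI into a filesystem object and a path within it, which sidesteps local path handling entirely; the sketch leaves it out so that it runs without PyArrow or GCS credentials.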

@aireenmei changed the title from "Script to download HF dataset and convert to parquet" to "Scripts to download HF dataset in arrayrecord or parquet" on Mar 12, 2026
@aireenmei force-pushed the aireen/hf_parquet branch 3 times, most recently from 41b09d9 to 92c0be6, on March 12, 2026 19:38