[Question]: [rank0]: FileNotFoundError: [Errno 2] No such file or directory: 'traj_data/r2r' #349

@pqnhoang

Question

Environment

  • Running on local machine (no SLURM)
  • Script: scripts/train/qwenvl_train/train_system2_local.sh
  • 2x GPU setup via torchrun --nproc_per_node=2

Problem

After downloading the dataset and placing it according to the documentation, training fails with:

FileNotFoundError: [Errno 2] No such file or directory: 'traj_data/r2r'

The error originates from get_annotations_from_lerobot_data in internnav/dataset/internvla_n1_lerobot_dataset.py at line 762:

scene_ids = [d for d in os.listdir(data_path) if os.path.isdir(os.path.join(data_path, d))]

The data_path is hardcoded as a relative path (traj_data/r2r) in the dataset config dictionary (lines 51–100). A relative path is resolved against each process's current working directory, and torchrun does not guarantee that this directory is the project root, so the lookup fails.
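One possible workaround is to convert the relative path to an absolute one before it reaches os.listdir. The sketch below is only an illustration: resolve_data_path is a hypothetical helper that does not exist in InternNav, and reading PROJECT_ROOT_PATH from an environment variable is an assumption, since I do not know how the project actually defines it.

```python
import os
from pathlib import Path


def resolve_data_path(data_path, project_root=None):
    """Return an absolute path for data_path (hypothetical helper).

    Absolute inputs are returned unchanged. Relative inputs are joined to
    project_root, falling back to the PROJECT_ROOT_PATH environment variable
    (an assumption) and finally to the current working directory.
    """
    path = Path(data_path)
    if path.is_absolute():
        return path
    root = Path(project_root or os.environ.get("PROJECT_ROOT_PATH") or Path.cwd())
    return (root / path).resolve()


if __name__ == "__main__":
    # Prints the absolute location that 'traj_data/r2r' would resolve to.
    print(resolve_data_path("traj_data/r2r"))
```

Printing the resolved path from each rank would also show which working directory the torchrun workers actually use.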

Steps to Reproduce

  1. Download the InternNav N1 dataset following the official instructions
  2. Place data at <project_root>/traj_data/r2r and <project_root>/traj_data/rxr
  3. Run bash scripts/train/qwenvl_train/train_system2_local.sh from the project root
  4. Training crashes immediately with FileNotFoundError

Additional Issue: Data structure mismatch

Even after working around the path issue (e.g. via symlinks), a second error appears:

ValueError: num_samples should be a positive integer value, but got num_samples=0

This is because the downloaded data has the structure:

traj_data/r2r/<scene_id>/trajectory_N/meta/episodes.jsonl

But get_annotations_from_lerobot_data expects:

traj_data/r2r/<scene_id>/meta/episodes.jsonl

The code only iterates one level deep (scene_ids), missing the trajectory_N subdirectory level entirely, resulting in 0 episodes loaded.
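To illustrate the mismatch, here is a minimal sketch of a lookup that accepts both layouts. find_episode_files is a hypothetical function, not part of InternNav, and it assumes that every directory directly under a scene which contains meta/episodes.jsonl is a trajectory. It is not the loader's real logic, only a way to check how many episode files each layout yields.

```python
import os


def find_episode_files(data_path):
    """Collect meta/episodes.jsonl files under each scene directory.

    Handles both <scene_id>/meta/episodes.jsonl (the layout the loader
    appears to expect) and <scene_id>/<subdir>/meta/episodes.jsonl
    (the layout of the downloaded data, e.g. trajectory_N).
    """
    found = []
    for scene_id in sorted(os.listdir(data_path)):
        scene_dir = os.path.join(data_path, scene_id)
        if not os.path.isdir(scene_dir):
            continue
        # Flat layout: the episodes file sits directly under the scene.
        flat = os.path.join(scene_dir, "meta", "episodes.jsonl")
        if os.path.isfile(flat):
            found.append(flat)
            continue
        # Nested layout: look one level deeper, e.g. trajectory_N/.
        for sub in sorted(os.listdir(scene_dir)):
            nested = os.path.join(scene_dir, sub, "meta", "episodes.jsonl")
            if os.path.isfile(nested):
                found.append(nested)
    return found
```

Running this against traj_data/r2r and checking whether the returned list is empty is a quick way to confirm which layout is on disk before starting a training run.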

Expected Behavior

  • The training script should resolve data_path relative to the project root regardless of working directory, or use absolute paths
  • The dataset loader should either document the exact expected folder structure, or handle the trajectory_N subdirectory level

Questions

  1. What is the exact expected directory structure of the downloaded data for traj_data/r2r?
  2. Should data_path be set as an absolute path or resolved relative to PROJECT_ROOT_PATH?
  3. Is there a data preprocessing step needed before training (e.g. to flatten the trajectory_N structure)?

Thank you!
