fix: prevent duplicate dataset uploads using file_id check#1288

Open
Shamarvey1 wants to merge 1 commit into openml:develop from Shamarvey1:fix-duplicate-dataset
Conversation

@Shamarvey1 Shamarvey1 commented Apr 16, 2026

Fix: Prevent duplicate dataset uploads (#1192)

Closes #1192

Problem

Currently, the same dataset can be uploaded multiple times, resulting in duplicate entries with different dataset IDs. This leads to redundancy and inconsistency in the dataset repository.

Solution

This PR introduces a duplicate check in the data_upload() method before inserting a new dataset.

  • It checks if a dataset with the same file_id already exists.
  • If a match is found, the upload is rejected with an appropriate error message.
  • Otherwise, the dataset insertion proceeds as usual.
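The check described above can be sketched as follows. This is an illustrative sketch only, not the actual OpenML code: the `data_upload()` name and `file_id` field come from the PR description, but the in-memory `datasets` store, the ID counter, and the `DuplicateDatasetError` exception are hypothetical stand-ins for the real storage layer.

```python
# Hypothetical stand-ins for the real database layer.
datasets = {}      # dataset_id -> metadata dict
_next_id = [1]     # mutable counter standing in for an auto-increment key


class DuplicateDatasetError(Exception):
    """Raised when a dataset with the same file_id was already uploaded."""


def data_upload(file_id, metadata):
    """Insert a dataset unless one with the same file_id already exists."""
    # Duplicate check: scan existing datasets for a matching file_id.
    for dataset_id, meta in datasets.items():
        if meta["file_id"] == file_id:
            raise DuplicateDatasetError(
                f"Dataset {dataset_id} already uses file_id {file_id!r}"
            )
    # No match found: proceed with the insertion as usual.
    dataset_id = _next_id[0]
    _next_id[0] += 1
    datasets[dataset_id] = {"file_id": file_id, **metadata}
    return dataset_id
```

In a real deployment the scan would be a single indexed `SELECT ... WHERE file_id = ?` (or a unique constraint on the column) rather than an in-memory loop, but the control flow is the same: look up first, insert only on a miss.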

Why this approach

The system already assigns a unique file_id to each uploaded dataset file, making it a reliable identifier for detecting duplicate uploads. This avoids the need for additional hashing or database schema changes.

Result

  • Prevents duplicate dataset uploads
  • Ensures data consistency
  • Keeps implementation minimal and aligned with existing codebase

Note

This approach detects duplicates based on identical file uploads. Further enhancements (e.g., content-based hashing) can be explored if needed.
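For the content-based enhancement mentioned above, a hash of the file bytes would catch re-uploads even when the system assigns a fresh `file_id` to each upload. A minimal sketch using SHA-256 (the `content_hash` helper name is an assumption, not part of the codebase):

```python
import hashlib


def content_hash(path, chunk_size=8192):
    """Return the SHA-256 hex digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so large dataset files never have to
        # fit in memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two byte-identical files produce the same digest regardless of their assigned `file_id`, so storing the digest alongside each dataset and checking it before insertion would detect duplicates that a `file_id` comparison alone would miss.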

@Shamarvey1
Author

Hi, I’ve implemented a duplicate dataset check based on file_id before insertion in the data_upload() flow.

This ensures that identical dataset files are not uploaded multiple times while keeping the solution minimal and consistent with the existing system.

Please let me know if you’d prefer a different approach (e.g., content-based hashing) or any adjustments. Thanks!

Development

Successfully merging this pull request may close these issues.

Same dataset can be uploaded multiple times