fix: prevent duplicate dataset uploads using file_id check#1288

Open
Shamarvey1 wants to merge 1 commit into openml:develop from Shamarvey1:fix-duplicate-dataset
Conversation

@Shamarvey1 Shamarvey1 commented Apr 16, 2026

Fix: Prevent duplicate dataset uploads (#1192)

Closes #1192

Problem

Currently, the same dataset can be uploaded multiple times, resulting in duplicate entries with different dataset IDs. This leads to redundancy and inconsistency in the dataset repository.

Solution

This PR introduces a duplicate check in the data_upload() method before inserting a new dataset.

  • It checks if a dataset with the same file_id already exists.
  • If a match is found, the upload is rejected with an appropriate error message.
  • Otherwise, the dataset insertion proceeds as usual.
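The check described above can be sketched as follows. This is an illustrative sketch only, not the actual OpenML code: the `data_upload()` name and `file_id` field come from the PR description, but the in-memory `datasets` store, the ID counter, and the `DuplicateDatasetError` exception are hypothetical stand-ins for the real storage layer.

```python
# Hypothetical stand-ins for the real database layer.
datasets = {}      # dataset_id -> metadata dict
_next_id = [1]     # mutable counter standing in for an auto-increment key


class DuplicateDatasetError(Exception):
    """Raised when a dataset with the same file_id was already uploaded."""


def data_upload(file_id, metadata):
    """Insert a dataset unless one with the same file_id already exists."""
    # Duplicate check: scan existing datasets for a matching file_id.
    for dataset_id, meta in datasets.items():
        if meta["file_id"] == file_id:
            raise DuplicateDatasetError(
                f"Dataset {dataset_id} already uses file_id {file_id!r}"
            )
    # No match found: proceed with the insertion as usual.
    dataset_id = _next_id[0]
    _next_id[0] += 1
    datasets[dataset_id] = {"file_id": file_id, **metadata}
    return dataset_id
```

In a real deployment the scan would be a single indexed `SELECT ... WHERE file_id = ?` (or a unique constraint on the column) rather than an in-memory loop, but the control flow is the same: look up first, insert only on a miss.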

Why this approach

The system already assigns a unique file_id to each uploaded dataset file, making it a reliable identifier for detecting duplicate uploads. This avoids the need for additional hashing or database schema changes.

Result

  • Prevents duplicate dataset uploads
  • Ensures data consistency
  • Keeps implementation minimal and aligned with existing codebase

Note

This approach detects duplicates based on identical file uploads. Further enhancements (e.g., content-based hashing) can be explored if needed.
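For the content-based enhancement mentioned above, a hash of the file bytes would catch re-uploads even when the system assigns a fresh `file_id` to each upload. A minimal sketch using SHA-256 (the `content_hash` helper name is an assumption, not part of the codebase):

```python
import hashlib


def content_hash(path, chunk_size=8192):
    """Return the SHA-256 hex digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so large dataset files never have to
        # fit in memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two byte-identical files produce the same digest regardless of their assigned `file_id`, so storing the digest alongside each dataset and checking it before insertion would detect duplicates that a `file_id` comparison alone would miss.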

@Shamarvey1
Author

Hi, I’ve implemented a duplicate dataset check based on file_id before insertion in the data_upload() flow.

This ensures that identical dataset files are not uploaded multiple times while keeping the solution minimal and consistent with the existing system.

Please let me know if you’d prefer a different approach (e.g., content-based hashing) or any adjustments. Thanks!

Development

Successfully merging this pull request may close these issues.

Same dataset can be uploaded multiple times