
feat(datapoints): retry datapoint creation up to maxRetries #586

Open
RapidPoseidon wants to merge 1 commit into main from feat(datapoints)/retry-datapoint-creation

Conversation

@RapidPoseidon
Contributor

Summary

The SDK previously failed a datapoint upload immediately on the first error from the create-datapoint POST. This PR wraps the call in a retry loop so transient failures (network blips, brief 5xx responses) no longer drop the datapoint.

  • Up to rapidata_config.upload.maxRetries attempts (default 3, already exposed via env/config).
  • Exponential backoff between attempts: 1s, 2s, 4s.
  • Intermediate failures logged at DEBUG via suppress_rapidata_error_logging to avoid spamming users on retried errors; the final exception still propagates so the existing orchestrator-level failure tracking in _rapidata_dataset.py keeps working.
  • Pattern mirrors the existing sample-upload retry in benchmark/participant/participant.py.

Asset upload is left outside the retry loop — assets are uploaded ahead of time by the orchestrator and cached, so only the datapoint creation API call itself is the new retry surface.
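The shape of the loop described above can be sketched as follows. This is an illustrative sketch, not the SDK's actual code: `create_datapoint` stands in for the real create-datapoint POST, and `max_retries` for `rapidata_config.upload.maxRetries`; the injectable `sleep` exists only to make the backoff testable.

```python
import time


def create_with_retries(create_datapoint, max_retries=3, sleep=time.sleep):
    """Retry a datapoint-creation call with exponential backoff.

    Sketch of the pattern in this PR: up to `max_retries` attempts,
    sleeping 1s, 2s, 4s, ... between failed attempts, and re-raising
    the last exception so callers still see the original failure.
    """
    last_exception = None
    for attempt in range(max_retries):
        try:
            return create_datapoint()
        except Exception as e:
            last_exception = e
            if attempt < max_retries - 1:
                # Exponential backoff: 2**0 = 1s, 2**1 = 2s, ...
                sleep(2 ** attempt)
    # All attempts failed: propagate the final error unchanged.
    raise last_exception
```

Because the final exception is re-raised rather than swallowed, orchestrator-level failure tracking downstream continues to work unchanged.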

Notes

  • I could not run pyright src/rapidata/rapidata_client from the Poseidon container (Python tooling not present); please run it locally before merging as required by the repo's CLAUDE.md.

Test plan

  • Local: kick off an order with a small dataset and confirm successful path is unchanged.
  • Simulate a transient failure (e.g. point at a flaky endpoint or briefly stop the dataset service) and confirm the SDK retries and eventually succeeds within 3 attempts.
  • Force a persistent failure and confirm the datapoint ends up in the failed-uploads list with the original exception.
  • pyright src/rapidata/rapidata_client clean.

🔗 Session: https://session-e125c012.poseidon.rapidata.internal/

Previously a transient failure on the datapoint creation POST would
fail the upload immediately. Wrap the call in the same retry loop
already used by participant sample uploads: up to
`rapidata_config.upload.maxRetries` attempts (default 3) with
exponential backoff (1s, 2s, 4s) and suppressed intermediate logs.

Co-Authored-By: luca <luca@rapidata.ai>
@claude

claude Bot commented May 11, 2026

Code Review

Overview

This PR adds a retry loop with exponential backoff (1s, 2s, 4s) around the dataset_dataset_id_datapoint_post call, using the existing rapidata_config.upload.maxRetries config and mirroring the retry pattern already in participant.py. The refactor to extract the payload object first is a clean improvement.


Issues

🔴 Retrying non-retryable (4xx) errors

The except Exception block catches every exception including RapidataError with a 4xx status. Retrying a 400 Bad Request, 409 Conflict, or 422 Unprocessable Entity wastes up to 7 seconds of sleep (default 3 retries) and will always fail — the payload does not change between attempts.

# Suggested fix: skip retries for client errors
except Exception as e:
    last_exception = e
    is_client_error = (
        isinstance(e, RapidataError)
        and e.status_code is not None
        and 400 <= e.status_code < 500
    )
    if attempt < max_retries - 1 and not is_client_error:
        ...  # backoff and retry
    else:
    break  # give up; last_exception is re-raised after the loop

The same issue exists in participant.py; fixing it here is a good opportunity to establish the right pattern. participant.py could be updated in a follow-up.


🟡 suppress_rapidata_error_logging applied on the final attempt too

The context manager is re-entered on every loop iteration, including the last one. This means the API-client-level error log is silenced even for the terminal failure. The exception still propagates, so the orchestrator tracks it, but any structured logging emitted by RapidataApiClient.call_api (traceId, status code, etc.) is lost on the final attempt.

Consider wrapping only non-final attempts:

for attempt in range(max_retries):
    try:
        if attempt < max_retries - 1:
            with suppress_rapidata_error_logging():
                return self.openapi_service.dataset.datapoints_api...
        else:
            return self.openapi_service.dataset.datapoints_api...
    except Exception as e:
        ...

Alternatively, move the suppression so it is entered only when attempt < max_retries - 1, keeping the final attempt's error logging intact.


🟡 assert last_exception is not None as a type narrowing workaround

After the loop last_exception is always non-None (a successful return exits early), so the assert is purely for the type checker. The idiomatic pattern in this codebase (see participant.py) is raise last_exception with a # type: ignore comment, or initialise as Exception("unreachable"). The assert is not wrong, but it will silently do nothing in optimised (-O) mode. A raise last_exception # type: ignore[misc] is clearer about intent.
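For comparison, the two alternatives side by side. This is a sketch with placeholder names, not the SDK's actual code; it only illustrates the narrowing patterns discussed above.

```python
def retry_assert(fn, max_retries=3):
    """Variant using assert for type narrowing (a no-op under python -O)."""
    last_exception = None
    for _ in range(max_retries):
        try:
            return fn()
        except Exception as e:
            last_exception = e
    assert last_exception is not None  # stripped when running with -O
    raise last_exception


def retry_raise(fn, max_retries=3):
    """Variant that needs no assert: last_exception is never Optional."""
    last_exception: Exception = Exception("unreachable")
    for _ in range(max_retries):
        try:
            return fn()
        except Exception as e:
            last_exception = e
    raise last_exception  # always bound; intent is explicit
```

Both behave identically at runtime on the normal paths; the second avoids relying on an assertion that optimised mode removes.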


Positive aspects

  • Consistency — mirrors the retry pattern from participant.py closely.
  • Payload extraction — pulling CreateDatapointEndpointInput(...) out of the call site is a clean improvement regardless of the retry change.
  • max_retries read once — reading rapidata_config.upload.maxRetries once before the loop (rather than on every iteration as in participant.py) is a minor but correct improvement.
  • Debug logging format — using %s-style logging rather than f-strings avoids string interpolation when the log level is not active.
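The last point can be demonstrated in isolation: with %s-style logging, argument formatting is deferred until a handler actually emits the record, so a DEBUG call below the active level never renders its arguments. The logger name below is illustrative, not the SDK's.

```python
import logging

log = logging.getLogger("rapidata.example")  # illustrative name
log.setLevel(logging.INFO)  # DEBUG records are filtered out


class Expensive:
    """Tracks whether its string form was ever computed."""

    def __init__(self):
        self.rendered = False

    def __str__(self):
        self.rendered = True
        return "big payload"


arg = Expensive()
# Below the INFO threshold: logging returns early, %s never runs.
log.debug("datapoint retry failed: %s", arg)
print(arg.rendered)  # False
```

An f-string version (`log.debug(f"... {arg}")`) would pay the formatting cost unconditionally, even when the record is discarded.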

Checklist note

The PR notes that pyright src/rapidata/rapidata_client was not run. Per CLAUDE.md this is a required step before merging — please ensure it passes clean before approving.


Summary: The core idea is sound and the implementation is mostly clean. The main ask before merging is to guard against retrying 4xx client errors, which will never succeed and add unnecessary latency. The suppress_rapidata_error_logging scope and the assert pattern are lower-priority but worth addressing for debuggability.

