
feat(datapoints): retry datapoint creation up to maxRetries #586

Open
RapidPoseidon wants to merge 1 commit into main from feat(datapoints)/retry-datapoint-creation

Conversation

@RapidPoseidon
Contributor

Summary

The SDK previously failed a datapoint upload immediately on the first error from the create-datapoint POST. This PR wraps the call in a retry loop so transient failures (network blips, brief 5xx responses) no longer drop the datapoint.

  • Up to rapidata_config.upload.maxRetries attempts (default 3, already exposed via env/config).
  • Exponential backoff between attempts: 1s, 2s, 4s.
  • Intermediate failures logged at DEBUG via suppress_rapidata_error_logging to avoid spamming users on retried errors; the final exception still propagates so the existing orchestrator-level failure tracking in _rapidata_dataset.py keeps working.
  • Pattern mirrors the existing sample-upload retry in benchmark/participant/participant.py.

Asset upload is left outside the retry loop — assets are uploaded ahead of time by the orchestrator and cached, so only the datapoint creation API call itself is the new retry surface.
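The shape of the loop described above can be sketched as follows. This is an illustrative sketch, not the SDK's actual code: `create_datapoint` stands in for the real create-datapoint POST, and `max_retries` for `rapidata_config.upload.maxRetries`; the injectable `sleep` exists only to make the backoff testable.

```python
import time


def create_with_retries(create_datapoint, max_retries=3, sleep=time.sleep):
    """Retry a datapoint-creation call with exponential backoff.

    Sketch of the pattern in this PR: up to `max_retries` attempts,
    sleeping 1s, 2s, 4s, ... between failed attempts, and re-raising
    the last exception so callers still see the original failure.
    """
    last_exception = None
    for attempt in range(max_retries):
        try:
            return create_datapoint()
        except Exception as e:
            last_exception = e
            if attempt < max_retries - 1:
                # Exponential backoff: 2**0 = 1s, 2**1 = 2s, ...
                sleep(2 ** attempt)
    # All attempts failed: propagate the final error unchanged.
    raise last_exception
```

Because the final exception is re-raised rather than swallowed, orchestrator-level failure tracking downstream continues to work unchanged.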

Notes

  • I could not run pyright src/rapidata/rapidata_client from the Poseidon container (Python tooling not present); please run it locally before merging as required by the repo's CLAUDE.md.

Test plan

  • Local: kick off an order with a small dataset and confirm successful path is unchanged.
  • Simulate a transient failure (e.g. point at a flaky endpoint or briefly stop the dataset service) and confirm the SDK retries and eventually succeeds within 3 attempts.
  • Force a persistent failure and confirm the datapoint ends up in the failed-uploads list with the original exception.
  • pyright src/rapidata/rapidata_client clean.

🔗 Session: https://session-e125c012.poseidon.rapidata.internal/

Previously a transient failure on the datapoint creation POST would
fail the upload immediately. Wrap the call in the same retry loop
already used by participant sample uploads: up to
`rapidata_config.upload.maxRetries` attempts (default 3) with
exponential backoff (1s, 2s, 4s) and suppressed intermediate logs.

Co-Authored-By: luca <luca@rapidata.ai>
@claude

claude Bot commented May 11, 2026

Code Review

Overview

This PR adds a retry loop with exponential backoff (1s, 2s, 4s) around the dataset_dataset_id_datapoint_post call, using the existing rapidata_config.upload.maxRetries config and mirroring the retry pattern already in participant.py. The refactor to extract the payload object first is a clean improvement.


Issues

🔴 Retrying non-retryable (4xx) errors

The except Exception block catches every exception including RapidataError with a 4xx status. Retrying a 400 Bad Request, 409 Conflict, or 422 Unprocessable Entity wastes up to 7 seconds of sleep (default 3 retries) and will always fail — the payload does not change between attempts.

# Suggested fix: skip retries for client errors
except Exception as e:
    last_exception = e
    is_client_error = (
        isinstance(e, RapidataError)
        and e.status_code is not None
        and 400 <= e.status_code < 500
    )
    if attempt < max_retries - 1 and not is_client_error:
        ...  # backoff and retry
    else:
    break  # give up; last_exception is re-raised after the loop

The same issue exists in participant.py; fixing it here is a good opportunity to establish the right pattern. participant.py could be updated in a follow-up.


🟡 suppress_rapidata_error_logging applied on the final attempt too

The context manager is re-entered on every loop iteration, including the last one. This means the API-client-level error log is silenced even for the terminal failure. The exception still propagates, so the orchestrator tracks it, but any structured logging emitted by RapidataApiClient.call_api (traceId, status code, etc.) is lost on the final attempt.

Consider wrapping only non-final attempts:

for attempt in range(max_retries):
    try:
        if attempt < max_retries - 1:
            with suppress_rapidata_error_logging():
                return self.openapi_service.dataset.datapoints_api...
        else:
            return self.openapi_service.dataset.datapoints_api...
    except Exception as e:
        ...

Alternatively, move the suppression so it is entered only when attempt < max_retries - 1, keeping the final attempt's error logging intact.


🟡 assert last_exception is not None as a type narrowing workaround

After the loop last_exception is always non-None (a successful return exits early), so the assert is purely for the type checker. The idiomatic pattern in this codebase (see participant.py) is raise last_exception with a # type: ignore comment, or initialise as Exception("unreachable"). The assert is not wrong, but it will silently do nothing in optimised (-O) mode. A raise last_exception # type: ignore[misc] is clearer about intent.
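For comparison, the two alternatives side by side. This is a sketch with placeholder names, not the SDK's actual code; it only illustrates the narrowing patterns discussed above.

```python
def retry_assert(fn, max_retries=3):
    """Variant using assert for type narrowing (a no-op under python -O)."""
    last_exception = None
    for _ in range(max_retries):
        try:
            return fn()
        except Exception as e:
            last_exception = e
    assert last_exception is not None  # stripped when running with -O
    raise last_exception


def retry_raise(fn, max_retries=3):
    """Variant that needs no assert: last_exception is never Optional."""
    last_exception: Exception = Exception("unreachable")
    for _ in range(max_retries):
        try:
            return fn()
        except Exception as e:
            last_exception = e
    raise last_exception  # always bound; intent is explicit
```

Both behave identically at runtime on the normal paths; the second avoids relying on an assertion that optimised mode removes.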


Positive aspects

  • Consistency — mirrors the retry pattern from participant.py closely.
  • Payload extraction — pulling CreateDatapointEndpointInput(...) out of the call site is a clean improvement regardless of the retry change.
  • max_retries read once — reading rapidata_config.upload.maxRetries once before the loop (rather than on every iteration as in participant.py) is a minor but correct improvement.
  • Debug logging format — using %s-style logging rather than f-strings avoids string interpolation when the log level is not active.
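The last point can be demonstrated in isolation: with %s-style logging, argument formatting is deferred until a handler actually emits the record, so a DEBUG call below the active level never renders its arguments. The logger name below is illustrative, not the SDK's.

```python
import logging

log = logging.getLogger("rapidata.example")  # illustrative name
log.setLevel(logging.INFO)  # DEBUG records are filtered out


class Expensive:
    """Tracks whether its string form was ever computed."""

    def __init__(self):
        self.rendered = False

    def __str__(self):
        self.rendered = True
        return "big payload"


arg = Expensive()
# Below the INFO threshold: logging returns early, %s never runs.
log.debug("datapoint retry failed: %s", arg)
print(arg.rendered)  # False
```

An f-string version (`log.debug(f"... {arg}")`) would pay the formatting cost unconditionally, even when the record is discarded.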

Checklist note

The PR notes that pyright src/rapidata/rapidata_client was not run. Per CLAUDE.md this is a required step before merging — please ensure it passes clean before approving.


Summary: The core idea is sound and the implementation is mostly clean. The main ask before merging is to guard against retrying 4xx client errors, which will never succeed and add unnecessary latency. The suppress_rapidata_error_logging scope and the assert pattern are lower-priority but worth addressing for debuggability.

