Skip to content

trickle_publisher: _resolve_next_seq fallback drops first video segment + logs noisy warning #12

@rickstaa

Description

@rickstaa

Symptom

Two issues from the same code path in trickle_publisher.py:_resolve_next_seq:

(1) Noisy warning at every session start

WARNING livepeer_gateway.trickle_publisher: Trickle /next missing Lp-Trickle-Latest header

Fires once per publisher (events + data + media-out) at every LivePipeline session start. Visible in every example pipeline (live_grayscale, live_depth, live_transcribe).

(2) First video segment dropped on the -out channel

Real bug, not just noise. Orchestrator log signature:

POST segment stream=<sid>-out idx=0 next=0      ← URL /-1, server resolved to slot 0
POST segment stream=<sid>-out idx=0 next=0      ← URL /0,  duplicate write to slot 0
POST segment stream=<sid>-out idx=0 next=0      ← retry after the drop
POST segment stream=<sid>-out idx=1 next=1      ← in-sync from here onwards

Runner-side log:

WARNING ... MediaPublish[<sid>-out] dropped segment seq=0 mid-stream; draining pipe
livepeer_gateway.trickle_publisher.TrickleSegmentWriteError: Trickle segment writer timed out after 5.0s
WARNING ... Trickle segment close suppressed seq=0

The first MPEG-TS video segment never makes it through. Subsequent segments are fine because the publisher's local counter is back in sync with the server's nextWrite from segment 1.

Root cause

The publisher's _resolve_next_seq() (line 346) does:

async def _resolve_next_seq(self) -> int:
    url = f"{self.url}/next"
    try:
        resp = await self._session.get(url)
        latest = resp.headers.get("Lp-Trickle-Latest")
        resp.release()
        if latest is not None:
            return int(latest)
        else:
            _LOG.warning("Trickle /next missing Lp-Trickle-Latest header")
    except Exception:
        _LOG.warning("Trickle /next request failed", exc_info=True)
    return -1

The /next endpoint is in livepeer/go-livepeer#3884 (Josh's "Serverless" PR, still open) — both sides authored as a designed pair on the same day:

When Where Author
2026-03-26 01:11 PDT _resolve_next_seq lands in this repo (commit 5007e2c) Josh Allmann
2026-03-26 12:21 PDT /next endpoint added to go-livepeer (commit 8da1c692, in PR #3884) Josh Allmann

The Python side merged to main. The Go side hasn't merged to master. So on master, every probe hits a 400 (no /next route), no header is set, the warning fires, and _resolve_next_seq returns -1.

The caller in next():

if self.seq < 0:
    send_reset = True
    self.seq = await self._resolve_next_seq()    # = -1
self._next_state = await self.preconnect(self.seq, send_reset=send_reset)  # POST /-1
self.seq += 1                                    # = 0
preconnect_task = asyncio.create_task(self._preconnect_task(self.seq))     # POST /0 in background

Server-side, getForWrite(idx == -1) resolves locally to nextWrite (= 0 for a fresh channel) and creates segment[0]. The publisher's next POST then targets URL /0 — server's getForWrite(0) finds the existing segment[0] and returns it. Two HTTP handlers now both writing to segment[0]. On the events channel the writes are tiny JSON heartbeats and the race is invisible. On the video -out channel the first POST is mid-stream when the second arrives → the server's segment-write semantics race → 5 s later the first writer trips its timeout → seq=0 dropped.

Fix

One line, both symptoms covered:

async def _resolve_next_seq(self) -> int:
    url = f"{self.url}/next"
    try:
        resp = await self._session.get(url)
        latest = resp.headers.get("Lp-Trickle-Latest")
        resp.release()
        if latest is not None:
            return int(latest)
        # Common pre-#3884 — server has no /next endpoint, so we get no header.
        # Treat as fresh channel and start at slot 0; matches pytrickle's approach.
        _LOG.debug("Trickle /next missing Lp-Trickle-Latest header — assuming fresh channel")
    except Exception:
        _LOG.warning("Trickle /next request failed", exc_info=True)
    return 0  # fresh-channel default; the -1 sentinel race causes duplicate POSTs to slot 0

Two changes:

  1. return 0 (was -1) — eliminates the duplicate POST → first video segment delivers cleanly.
  2. debug (was warning) — eliminates the operator-visible noise, matches what PR #6 already does for symptom (1).

Both pre-#3884 (probe fails, fresh-channel default kicks in) and post-#3884 (probe succeeds, returns server-reported nextWrite) work correctly.

Why not just wait for #3884

Option, but unbounded timeline. PR #6 already partially addresses (1) by demoting the warning, but doesn't touch the return -1 line — so even after #6 merges into our branches, symptom (2) persists. The return 0 change is the load-bearing fix.

Caveat — resume semantics

Returning 0 on probe failure means a publisher reconnecting on a non-fresh channel pre-#3884 would overwrite slot 0 instead of resuming from nextWrite. No regression: today's -1 path also breaks resume on master (the sentinel resolves to nextWrite server-side, then the publisher's local counter increments from -1 to 0, posting to /0 and overwriting all existing segments). Resume only works post-#3884 anyway.

Out of scope

  • Removing _resolve_next_seq() entirely — it's correct and useful post-#3884 for resume.
  • Server-side fixes — covered by livepeer/go-livepeer#3884.

Confirmed impact on the data channel

Observable in live_transcribe's SSE-based caller test (examples/runner/live_transcribe/test.sh). Runner emits 5 transcripts; SSE subscriber on the gateway's data_url proxy receives only 3:

runner _LOG (all 5):                        SSE subscriber output (3 of 5):
  transcript[0]: ask not                      transcript[0]: ask not
  transcript[1]: What your country can do…    (missing — `seq=0` race)
  transcript[2]: ask what you can do for…     transcript[2]: ask what you can do for…
  transcript[3]: And so am I fellow Americans transcript[3]: And so am I fellow Americans
  transcript[4]: Ask!                         (missing — separate teardown race)

transcript[1] is dropped by exactly the bug above: the publisher's first POST goes to URL /-1 (server resolves to slot 0), and the second POST goes to URL /0 — server finds the existing segment[0] and races it. transcript[0]'s write wins; transcript[1]'s overwrites or is dropped depending on timing.

This matches the video-channel seq=0 dropped + writer timed out 5.0s we saw before, just with smaller payloads and different visibility. Same root cause, same one-line fix.

(transcript[4] is a separate teardown timing issue between the runner's on_stream_stop flush and the gateway's SSE end signal — not part of this issue.)

Context

Surfaced while testing live_transcribe end-to-end. The noisy warning is in every example pipeline; the duplicate-POST loss is in any pipeline that emits on the video output channel OR the data channel. The data-channel impact is now directly observable via the SSE test in live_transcribe.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions