Skip to content

TricklePublisher.SegmentWriter masks CancelledError as TrickleSegmentWriteError on shutdown #10

@rickstaa

Description

@rickstaa

Symptom

Normal session shutdown produces tracebacks that look like errors but happen on every clean teardown:

livepeer_gateway.trickle_publisher.TrickleSegmentWriteError:
    Trickle segment writer timed out after 5.0s
Trickle segment close suppressed seq=0
...
TimeoutError

Root cause (revised after investigation)

Not an asyncio.wait_for cancellation race (initial hypothesis was wrong). Verified via timestamped logs after migrating to asyncio.timeout — the timeouts still fire because they're legitimate:

  1. Mid-stream, the HTTP post-body consumer for the active segment becomes wedged or closed (the orchestrator-side reader stops reading, or the connection drops).
  2. SegmentWriter's internal queue fills up.
  3. The next await self.queue.put(data) in _stream_pipe_to_trickle blocks waiting for queue space.
  4. After 5s, the timeout fires → TrickleSegmentWriteErrorMediaPublish logs "dropped segment ... mid-stream" and enters drain mode.
  5. On /stream/stop, SegmentWriter.close() calls queue.put(None) to signal end-of-stream — but the consumer is gone, so it hangs.
  6. After 5s the timeout fires → "Trickle segment close suppressed" warning logged.

So during shutdown, every active segment with a wedged consumer adds 5s + a misleading-looking traceback before teardown completes. PR #7 / live_grayscale/test.sh reproduces it consistently.

Proposed fix (needs work)

SegmentWriter needs to know when its associated HTTP post-task is dead and short-circuit close() / write() rather than block on queue.put forever. Options:

  • Track the post-task on the segment; on close() check post_task.done() before queue.put(None).
  • Use put_nowait() and absorb QueueFull.
  • Reduce the close-path timeout to e.g. 0.5s (still misleading but faster).

What PR #11 does not fix

PR #11 migrates wait_forasyncio.timeout which is modernization, not a fix for this issue. The shutdown noise persists after that migration; investigation here is separate.

Reproduction

cd examples/runner/live_grayscale
docker compose up -d --wait --build
SKIP_VIEWER=1 bash test.sh        # passes
docker logs live_grayscale        # shows the tracebacks during shutdown

Severity

Cosmetic — teardown completes correctly, but the noise causes false alarms during debugging and adds 5s × N-active-segments to shutdown latency.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions