Skip to content

Fixed silent loss of reliable data on send failure and resume replay#921

Draft
adrian-niculescu wants to merge 3 commits intolivekit:mainfrom
adrian-niculescu:fix-senddata-ignored-result
Draft

Fixed silent loss of reliable data on send failure and resume replay#921
adrian-niculescu wants to merge 3 commits intolivekit:mainfrom
adrian-niculescu:fix-senddata-ignored-result

Conversation

@adrian-niculescu
Copy link
Copy Markdown
Contributor

@adrian-niculescu adrian-niculescu commented Apr 30, 2026

Summary

We rely on reliable LocalParticipant.publishData(...) for application-level signaling where every message must land, which is what drew my attention here.

DataChannel.send(Buffer) returns a Boolean that is false when the SCTP send queue is full, the channel is not OPEN, or the native layer rejects the buffer. RTCEngine.sendData discarded that result and always returned Result.success(Unit), so callers using LocalParticipant.publishData(...).getOrThrow() to drive retry logic could not see locally rejected packets and would treat them as delivered.

While there I also looked at the resume replay path. The Android WebRTC wrapper drains a Buffer via ByteBuffer.get(byte[]), which advances the buffer's position to its limit. reliableMessageBuffer was storing each item as a ByteBuffer, and resendReliableMessagesForResume passed the same ByteBuffer instance into DataChannel.send on every replay. If a buffered item survived more than one resume attempt without server progress on lastMessageSeq (back-to-back reconnects), the second and subsequent replays sent an empty buffer and DataChannel.send returned true, so no caller could detect the loss.

resendReliableMessagesForResume also discarded DataChannel.send's boolean return value during replay, and the reconnect caller discarded the function's Result. An SCTP-rejected replay would therefore be silently dropped while the reconnect continued to client.onPCConnected(), marking the connection resumed.

Fix

  • Check the return of channel.send(buf) and surface Result.failure(RoomException.ConnectException(...)) on false.

  • For RELIABLE packets, defer reliableMessageBuffer queueing and reliableDataSequence advancement until after a successful send. A failed send no longer burns a sequence number or leaves a stale entry that would replay on resume; a caller retry reuses the same sequence.

  • The RECONNECTING fast-path (queue and return success without sending) is unchanged.

  • Change DataPacketItem.data from ByteBuffer to ByteArray (cursorless), and have resendReliableMessagesForResume wrap a fresh ByteBuffer per send. ByteArray matches the intent — these are stored payload bytes, not active stream cursors.

  • In resendReliableMessagesForResume, check channel.send(...)'s boolean per replay and return Result.failure(RoomException.ConnectException(...)) on the first rejected item. The reconnect caller now logs the failure via LKLog.w instead of discarding the Result. Buffered items remain queued for the next legitimate soft reconnect.

    Whether the reconnect path should also react to a failed replay (e.g., await bufferedAmount low and retry in place, escalate to onEngineDisconnected, or have the full-reconnect path attempt reliable replay too) is a policy/behavior decision worth its own discussion. Note: simply continue-ing the reconnect loop on failure is not equivalent to "retry the resume" — lastMessageSeq is only set in the soft-reconnect branch (line 614), so the next iteration would upgrade to a full reconnect (line 574, retries != 0) and silently skip the replay (line 673), leaving buffered items orphaned. Open to direction from maintainers on what they'd prefer here; happy to do it in a follow-up.

  • Add @VisibleForTesting to resendReliableMessagesForResume so the regression tests can drive it directly.

Test plan

  • publishDataReturnsFailureWhenDataChannelSendFails in LocalParticipantMockE2ETest asserts:
    • failed send() produces Result.failure with RoomException.ConnectException
    • on retry after recovery, sentBuffers.size == 2 and the retried packet's sequence is 1 (proving the failed attempt did not consume a sequence).
  • resendReliableMessagesReplaysFullPayloadAcrossMultipleResumes in RTCEngineMockE2ETest publishes a reliable message and calls resendReliableMessagesForResume(0) twice with consumeSentBuffer = true (mirrors WebRTC's drain). It parses both replayed payloads back into DataPacket and asserts the original bytes survive repeated replay.
  • resendReliableMessagesReturnsFailureWhenReplaySendFails in RTCEngineMockE2ETest flips MockDataChannel.sendResult = false after a publish and asserts resendReliableMessagesForResume(0) returns Result.failure with RoomException.ConnectException.
  • MockDataChannel gained a sentPayloads list (snapshot of bytes visible at send time) and a consumeSentBuffer flag.
  • DataPacketBufferTest updated for the ByteArray storage type; all 6 cases pass.
  • ./gradlew livekit-android-test:testDebugUnitTest --tests "io.livekit.android.room.RTCEngineMockE2ETest" --tests "io.livekit.android.room.participant.LocalParticipantMockE2ETest" --tests "io.livekit.android.room.RoomDataMockE2ETest" --tests "io.livekit.android.room.RoomReconnectionMockE2ETest" --rerun-tasks passes.
  • ./gradlew spotlessApply clean.

DataChannel.send returns a boolean that is false when the SCTP send
queue is full, the channel is not OPEN, or the native layer rejects
the buffer. RTCEngine.sendData discarded that result and always
returned Result.success(Unit), so callers using
publishData(...).getOrThrow() to drive retry logic could never see
locally rejected packets and would treat them as delivered.

Now the boolean is checked and surfaced as
Result.failure(RoomException.ConnectException). Additionally, for
RELIABLE packets the reliableMessageBuffer queueing and
reliableDataSequence advancement are deferred until after a
successful send, so a failed send no longer burns a sequence number
or leaves a stale entry that would replay on resume; a caller retry
reuses the same sequence. The RECONNECTING fast path still queues
without sending.

Added a regression test in LocalParticipantMockE2ETest covering both
the failure surface and the sequence-reuse-on-retry property, plus a
sendResult hook on MockDataChannel.
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Apr 30, 2026

🦋 Changeset detected

Latest commit: 95910b7

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
client-sdk-android Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

…umes

DataPacketItem stored its payload as a ByteBuffer, and
resendReliableMessagesForResume passed the same ByteBuffer instance
into DataChannel.send across replays. The Android WebRTC wrapper
drains the buffer via ByteBuffer.get(byte[]), which advances the
position to the limit. If a buffered item survived more than one
resume attempt without server progress on lastMessageSeq (e.g.,
back-to-back reconnects), the second and subsequent replays sent an
empty buffer and DataChannel.send returned true, so no caller could
detect the loss.

Changed DataPacketItem.data to ByteArray (cursorless) and made
resendReliableMessagesForResume wrap a fresh ByteBuffer per send.
The construction sites in sendData and the DataPacketBufferTest were
updated accordingly. resendReliableMessagesForResume gained a
@VisibleForTesting annotation so the regression test can call it
directly.

Added a regression test in RTCEngineMockE2ETest that publishes a
reliable message, calls resendReliableMessagesForResume twice with
the same lastMessageSeq, and parses both replayed payloads to assert
the original bytes survive repeated replay. MockDataChannel gained a
sentPayloads list (snapshot of bytes visible at send time) and a
consumeSentBuffer flag that mirrors the real wrapper.
@adrian-niculescu adrian-niculescu changed the title Fixed RTCEngine.sendData ignoring DataChannel.send return value Fixed silent loss of reliable data on send failure and resume replay Apr 30, 2026
resendReliableMessagesForResume looped over the buffered items and
called channel.send without inspecting its boolean. If SCTP rejected
a buffered item (queue full, channel state-flap, native rejection),
the replay would silently skip that packet and the function would
still return Result.success. The reconnect caller was also discarding
the returned Result, so the connection would be marked resumed even
when buffered messages had not actually been replayed.

Now the loop returns Result.failure(RoomException.ConnectException)
on the first rejected replay, and the reconnect caller logs the
failure with LKLog.w. The buffered items remain queued for the next
resume rather than being silently lost. Whether a failed replay
should also retry the reconnect attempt or trigger a full reconnect
is left as a separate policy decision.

Added a regression test in RTCEngineMockE2ETest that publishes a
reliable message, flips MockDataChannel.sendResult to false, and
asserts resendReliableMessagesForResume returns a ConnectException
failure.
@adrian-niculescu adrian-niculescu marked this pull request as draft April 30, 2026 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant