Fixed silent loss of reliable data on send failure and resume replay#921
Draft
adrian-niculescu wants to merge 3 commits intolivekit:mainfrom
Draft
Fixed silent loss of reliable data on send failure and resume replay#921adrian-niculescu wants to merge 3 commits intolivekit:mainfrom
adrian-niculescu wants to merge 3 commits intolivekit:mainfrom
Conversation
DataChannel.send returns a boolean that is false when the SCTP send queue is full, the channel is not OPEN, or the native layer rejects the buffer. RTCEngine.sendData discarded that result and always returned Result.success(Unit), so callers using publishData(...).getOrThrow() to drive retry logic could never see locally rejected packets and would treat them as delivered. Now the boolean is checked and surfaced as Result.failure(RoomException.ConnectException). Additionally, for RELIABLE packets the reliableMessageBuffer queueing and reliableDataSequence advancement are deferred until after a successful send, so a failed send no longer burns a sequence number or leaves a stale entry that would replay on resume; a caller retry reuses the same sequence. The RECONNECTING fast path still queues without sending. Added a regression test in LocalParticipantMockE2ETest covering both the failure surface and the sequence-reuse-on-retry property, plus a sendResult hook on MockDataChannel.
🦋 Changeset detectedLatest commit: 95910b7 The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
…umes DataPacketItem stored its payload as a ByteBuffer, and resendReliableMessagesForResume passed the same ByteBuffer instance into DataChannel.send across replays. The Android WebRTC wrapper drains the buffer via ByteBuffer.get(byte[]), which advances the position to the limit. If a buffered item survived more than one resume attempt without server progress on lastMessageSeq (e.g., back-to-back reconnects), the second and subsequent replays sent an empty buffer and DataChannel.send returned true, so no caller could detect the loss. Changed DataPacketItem.data to ByteArray (cursorless) and made resendReliableMessagesForResume wrap a fresh ByteBuffer per send. The construction sites in sendData and the DataPacketBufferTest were updated accordingly. resendReliableMessagesForResume gained a @VisibleForTesting annotation so the regression test can call it directly. Added a regression test in RTCEngineMockE2ETest that publishes a reliable message, calls resendReliableMessagesForResume twice with the same lastMessageSeq, and parses both replayed payloads to assert the original bytes survive repeated replay. MockDataChannel gained a sentPayloads list (snapshot of bytes visible at send time) and a consumeSentBuffer flag that mirrors the real wrapper.
resendReliableMessagesForResume looped over the buffered items and called channel.send without inspecting its boolean. If SCTP rejected a buffered item (queue full, channel state-flap, native rejection), the replay would silently skip that packet and the function would still return Result.success. The reconnect caller was also discarding the returned Result, so the connection would be marked resumed even when buffered messages had not actually been replayed. Now the loop returns Result.failure(RoomException.ConnectException) on the first rejected replay, and the reconnect caller logs the failure with LKLog.w. The buffered items remain queued for the next resume rather than being silently lost. Whether a failed replay should also retry the reconnect attempt or trigger a full reconnect is left as a separate policy decision. Added a regression test in RTCEngineMockE2ETest that publishes a reliable message, flips MockDataChannel.sendResult to false, and asserts resendReliableMessagesForResume returns a ConnectException failure.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
We rely on reliable
LocalParticipant.publishData(...)for application-level signaling where every message must land, which is what drew my attention here.DataChannel.send(Buffer)returns aBooleanthat isfalsewhen the SCTP send queue is full, the channel is notOPEN, or the native layer rejects the buffer.RTCEngine.sendDatadiscarded that result and always returnedResult.success(Unit), so callers usingLocalParticipant.publishData(...).getOrThrow()to drive retry logic could not see locally rejected packets and would treat them as delivered.While there I also looked at the resume replay path. The Android WebRTC wrapper drains a
BufferviaByteBuffer.get(byte[]), which advances the buffer's position to its limit.reliableMessageBufferwas storing each item as aByteBuffer, andresendReliableMessagesForResumepassed the sameByteBufferinstance intoDataChannel.sendon every replay. If a buffered item survived more than one resume attempt without server progress onlastMessageSeq(back-to-back reconnects), the second and subsequent replays sent an empty buffer andDataChannel.sendreturnedtrue, so no caller could detect the loss.resendReliableMessagesForResumealso discardedDataChannel.send's boolean return value during replay, and the reconnect caller discarded the function'sResult. An SCTP-rejected replay would therefore be silently dropped while the reconnect continued toclient.onPCConnected(), marking the connection resumed.Fix
Check the return of
channel.send(buf)and surfaceResult.failure(RoomException.ConnectException(...))onfalse.For RELIABLE packets, defer
reliableMessageBufferqueueing andreliableDataSequenceadvancement until after a successful send. A failed send no longer burns a sequence number or leaves a stale entry that would replay on resume; a caller retry reuses the same sequence.The
RECONNECTINGfast-path (queue and return success without sending) is unchanged.Change
DataPacketItem.datafromByteBuffertoByteArray(cursorless), and haveresendReliableMessagesForResumewrap a freshByteBufferper send.ByteArraymatches the intent — these are stored payload bytes, not active stream cursors.In
resendReliableMessagesForResume, checkchannel.send(...)'s boolean per replay and returnResult.failure(RoomException.ConnectException(...))on the first rejected item. The reconnect caller now logs the failure viaLKLog.winstead of discarding theResult. Buffered items remain queued for the next legitimate soft reconnect.Whether the reconnect path should also react to a failed replay (e.g., await
bufferedAmountlow and retry in place, escalate toonEngineDisconnected, or have the full-reconnect path attempt reliable replay too) is a policy/behavior decision worth its own discussion. Note: simplycontinue-ing the reconnect loop on failure is not equivalent to "retry the resume" —lastMessageSeqis only set in the soft-reconnect branch (line 614), so the next iteration would upgrade to a full reconnect (line 574,retries != 0) and silently skip the replay (line 673), leaving buffered items orphaned. Open to direction from maintainers on what they'd prefer here; happy to do it in a follow-up.Add
@VisibleForTestingtoresendReliableMessagesForResumeso the regression tests can drive it directly.Test plan
publishDataReturnsFailureWhenDataChannelSendFailsinLocalParticipantMockE2ETestasserts:send()producesResult.failurewithRoomException.ConnectExceptionsentBuffers.size == 2and the retried packet's sequence is1(proving the failed attempt did not consume a sequence).resendReliableMessagesReplaysFullPayloadAcrossMultipleResumesinRTCEngineMockE2ETestpublishes a reliable message and callsresendReliableMessagesForResume(0)twice withconsumeSentBuffer = true(mirrors WebRTC's drain). It parses both replayed payloads back intoDataPacketand asserts the original bytes survive repeated replay.resendReliableMessagesReturnsFailureWhenReplaySendFailsinRTCEngineMockE2ETestflipsMockDataChannel.sendResult = falseafter a publish and assertsresendReliableMessagesForResume(0)returnsResult.failurewithRoomException.ConnectException.MockDataChannelgained asentPayloadslist (snapshot of bytes visible at send time) and aconsumeSentBufferflag.DataPacketBufferTestupdated for theByteArraystorage type; all 6 cases pass../gradlew livekit-android-test:testDebugUnitTest --tests "io.livekit.android.room.RTCEngineMockE2ETest" --tests "io.livekit.android.room.participant.LocalParticipantMockE2ETest" --tests "io.livekit.android.room.RoomDataMockE2ETest" --tests "io.livekit.android.room.RoomReconnectionMockE2ETest" --rerun-taskspasses../gradlew spotlessApplyclean.