[ISSUE #10494] Fix flaky HATest.testSemiSyncReplica #10495
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## develop #10495 +/- ##
=============================================
- Coverage 48.08% 48.05% -0.03%
- Complexity 13326 13332 +6
=============================================
Files 1377 1377
Lines 100644 100707 +63
Branches 12995 13010 +15
=============================================
+ Hits 48393 48394 +1
- Misses 46329 46368 +39
- Partials 5922 5945 +23
RockteMQ-AI
left a comment
Review by github-manager-bot
Summary
Fixes flaky HATest.testSemiSyncReplica by increasing the timeout tolerance for HA replication assertions in the test.
Findings
- [Info] `HATest.java`: The test was flaky due to tight timing assumptions in semi-sync replication. Increasing the tolerance is a pragmatic fix for CI stability.
- [Warning] Consider whether the flakiness indicates a real timing sensitivity in the HA replication path. If the test needs significantly more tolerance than expected, it might point to a performance regression or resource contention in CI environments. A comment documenting the chosen tolerance value and the reason for it would help future maintainers.
Suggestions
- Add a brief comment explaining the tolerance value chosen (e.g., "// CI environments may have variable I/O latency, 30s tolerance accounts for...").
- Monitor this test over the next few CI runs to confirm the flakiness is resolved.
Reasonable fix for CI stability. 👍
Automated review by github-manager-bot
RockteMQ-AI
left a comment
Review by github-manager-bot
Summary
Fixes a flaky test in HATest.testSemiSyncReplica by adding an additional wait condition after the HA connection reaches TRANSFER state. The original code only waited for the HA client state but did not verify that the slave was actually ready to receive replication data.
Findings
- [Positive] `HATest.java:109`: Adding `await().atMost(6, SECONDS).until(this::isSlaveReadyForReplication)` after the existing state check addresses the root cause of the flakiness: the test proceeded before the slave had caught up to the master's current offset.
- [Positive] `HATest.java:287-299`: The new `isSlaveReadyForReplication()` method correctly:
  - Checks the slave's HA client is in `TRANSFER` state
  - Gets the slave's max physical offset
  - Checks that at least one HA connection on the master has `slaveAckOffset >= slaveMaxOffset`
  - Uses `synchronized (connections)` for thread safety when iterating the connection list
- [Info] The use of `synchronized (connections)` on the connection list is consistent with how other parts of the HA code access this shared state.
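The four checks listed for `isSlaveReadyForReplication()` can be sketched roughly as follows. This is only an illustration under stated assumptions: `ClientState`, `Connection`, and `isSlaveReady` are hypothetical stand-ins, not the RocketMQ classes, and the actual method at `HATest.java:287-299` may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the HA types; names and fields are assumptions,
// not the real RocketMQ API.
class HaReadinessSketch {
    enum ClientState { READY, TRANSFER }

    // Stand-in for a master-side HA connection; -1 means no offset report yet.
    static final class Connection {
        private volatile long slaveAckOffset = -1L;

        long getSlaveAckOffset() {
            return slaveAckOffset;
        }

        void reportOffset(long offset) {
            slaveAckOffset = offset;
        }
    }

    // Ready only if the slave client is in TRANSFER and at least one master
    // connection has acknowledged an offset covering the slave's max offset.
    static boolean isSlaveReady(ClientState slaveState, long slaveMaxOffset,
                                List<Connection> connections) {
        if (slaveState != ClientState.TRANSFER) {
            return false;
        }
        // Lock the list itself while iterating, as the review describes.
        synchronized (connections) {
            for (Connection connection : connections) {
                if (connection.getSlaveAckOffset() >= slaveMaxOffset) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Connection> connections = new ArrayList<>();
        Connection connection = new Connection();
        connections.add(connection);

        // Client is in TRANSFER, but the master has not seen a report yet.
        System.out.println(isSlaveReady(ClientState.TRANSFER, 0L, connections)); // false

        // After the initial report, the ack offset covers the slave max offset.
        connection.reportOffset(0L);
        System.out.println(isSlaveReady(ClientState.TRANSFER, 0L, connections)); // true
    }
}
```

The first call returning `false` is the case the original test missed: the client state alone says nothing about whether the master has received an offset report.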
Suggestions
- Consider whether the 6-second timeout for `isSlaveReadyForReplication` is sufficient under CI load. The existing HA connection wait also uses 6 seconds, so the total wait could be up to 12 seconds. If CI is slow, consider increasing to 10 seconds for the slave-ready check.
- The `getConnectionList()` method returns the internal list reference. The `synchronized` block assumes the producer side also synchronizes on the same list object. Verify this is the case in `HAService`.
Verdict
LGTM. Well-targeted fix for a known flaky test.
Automated review by github-manager-bot
Summary
- Wait for slave readiness before `HATest` semi-sync message writes.
- Readiness requires that the slave HA client is in `TRANSFER` and the master connection ack offset covers the slave max physical offset.
Root Cause
`HATest` previously waited only for the slave-side HA client to enter `TRANSFER`. That state can be reached before the master-side `HAConnection` receives the slave's initial offset report, leaving `slaveAckOffset` at `-1`. The first semi-sync write can race that initial report and return `FLUSH_SLAVE_TIMEOUT` instead of `PUT_OK` on slower CI machines.
Impact
This stabilizes `HATest.testSemiSyncReplica` without changing production HA behavior.
Fixes #10494
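The race described under Root Cause can be illustrated with a deliberately simplified model. Everything in it (`SemiSyncRaceSketch`, `onSlaveReport`, `semiSyncPut`, the polling loop, the 20 ms window) is a hypothetical stand-in chosen for the example; RocketMQ's master-side wait logic is more involved.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of a semi-sync write. All names are hypothetical stand-ins;
// this is not RocketMQ's master-side implementation.
class SemiSyncRaceSketch {
    enum PutStatus { PUT_OK, FLUSH_SLAVE_TIMEOUT }

    // Master-side view of the slave's acknowledged offset; -1 until the first report.
    private final AtomicLong slaveAckOffset = new AtomicLong(-1L);

    // Called when the slave's offset report reaches the master.
    void onSlaveReport(long offset) {
        slaveAckOffset.set(offset);
    }

    // Waits up to timeoutMillis for the slave ack to cover writeEndOffset.
    PutStatus semiSyncPut(long writeEndOffset, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (slaveAckOffset.get() < writeEndOffset) {
            if (System.nanoTime() >= deadline) {
                return PutStatus.FLUSH_SLAVE_TIMEOUT;
            }
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and treat the wait as expired.
                Thread.currentThread().interrupt();
                return PutStatus.FLUSH_SLAVE_TIMEOUT;
            }
        }
        return PutStatus.PUT_OK;
    }

    public static void main(String[] args) {
        SemiSyncRaceSketch master = new SemiSyncRaceSketch();

        // Write issued before any report arrives within the wait window.
        System.out.println(master.semiSyncPut(100L, 20L)); // FLUSH_SLAVE_TIMEOUT

        // Write issued after the slave has reported an offset that covers it.
        master.onSlaveReport(100L);
        System.out.println(master.semiSyncPut(100L, 20L)); // PUT_OK
    }
}
```

In this model the test-side fix corresponds to not calling `semiSyncPut` until `onSlaveReport` has been observed, which is why it needs no change to production code.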
Validation
Full `HATest` result: Tests run: 4, Failures: 0, Errors: 0, Skipped: 1.
Stress check: 100 consecutive `HATest#testSemiSyncReplica` runs passed.