You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Solid hardening PR. Three real fixes (FakeClock Drop, thread::scope deadlock in compare, more correct value-source preservation), plus much stronger test coverage (compare semantics, chain-split, fuzzing, idle cancellation). I have no blocking concerns. A handful of questions and suggestions below — most are nits or things worth documenting.
Strengths
Drop for FakeClock removal is a real bug fix. Previously, because FakeClock is Clone with shared Arc<Mutex<…>>, Drop::drop ran on every clone going out of scope and indiscriminately cancelled all timers via inner.clients.clear(). Closures or threads that captured a clone would silently kill production timers in tests. Tests now always do explicit clock.cancel() at termination/panic sites, which is the right shape.
compare() now uses thread::spawn + a child CancellationTokenSource instead of thread::scope. The previous thread::scope blocked the entire run loop until the callback returned — so a buggy/blocking Compare could deadlock the algorithm even when the round timer fired. The new design lets the timer arm win and propagates cancellation to the callback via compare_cts. Good parity with Charon's context.WithCancel(ctx) pattern. (Tested directly in compare_timeout_does_not_wait_for_blocked_callback and compare_parent_cancel_cancels_callback_token.)
compare() drain-on-err is more correct than Charon. When compare_err_rx fires with Ok(()), the new code drains any pending compare_value_rx first (crates/core/src/qbft/mod.rs:660-662). Charon's select is non-deterministic and can return the oldinputValueSource when the goroutine wrote both channels before the parent picked. The new fixture compare_success_error_cached_value_source_and_timeout at internal_test.rs:1442-1461 exercises this exact race. Worth a one-line code comment noting this is a deliberate divergence so the next reader doesn't "fix" it back.
panic!("bug: expected only …") → Err(QbftError::UnexpectedCompareError) (mod.rs:515). Returning an error instead of panicking is the right call inside a hot loop running under thread::scope. Good.
New fixtures fill real gaps: idle_run_returns_when_cancelled, the chain-split halt case (zz_chain_split_no_consensus_halt), the fuzzer tests, and the compare_* matrix all match Charon's test intent.
Issues & Questions
1. compare() busy-cancels after ct.is_canceled() (minor)
crates/core/src/qbft/mod.rs:687-691 — once the parent token is observed cancelled, the loop calls compare_cts.cancel() on every 1 ms tick until the callback finally drains. Idempotent but wasteful and a little noisy in profiling. Consider tracking a bool so cancel runs once, and dropping the default-arm to mpmc::never()-equivalent behavior afterwards. Not a correctness issue.
mod.rs:591-595 — the cancellation path simply breaks, then Ok(()). Charon's equivalent is case <-ctx.Done(): return ctx.Err() (charon/core/qbft/qbft.go:428). idle_run_returns_when_cancelled codifies the Pluto behaviour. Is the divergence intentional? Callers downstream might rely on observing the cancellation. If intentional, a one-line comment at the break would help.
3. RUN_CANCELLATION_POLL_INTERVAL is used by both run and compare (naming)
mod.rs:36 — the name says "run" but the constant is the poll cadence in both functions. Either generalize the name (e.g. CANCELLATION_POLL_INTERVAL) or define a separate constant for compare.
4. Definition::compare is Arc<…> while every other callback is Box<dyn Fn …> (API asymmetry)
mod.rs:121 — the asymmetry is justified (compare must clone into a spawned thread), but it's a footgun for users who already constructed a Box. Worth a short doc comment noting "Arc, because the callback is shared with a spawned worker thread."
5. UnexpectedCompareError is opaque
mod.rs:57-58 — it always means "compare returned a ChannelError or some other surprise." Consider UnexpectedCompareError(QbftError) so debugging logs carry the inner error. Optional.
6. Leader function leaves rounds with no leader at round % N == 0
internal_test.rs:600 — (instance + round) % n == process with processes 1..=N means round 4, 8, 12… have leader 0 (no one). For should_halt test this just wastes timeouts, but it's surprising. Charon's IsLeader test helper uses process == round%n with process 0-indexed; Pluto uses 1-indexed processes. If you want exact parity, switch to (instance + round - 1) % n + 1.
7. CodeQL alerts on 0x5142… and 0x4348… seeds
internal_test.rs:16, 586 — the latest commit ("removed hard coded salt in tests") suggests these were addressed, but both constants are still present in the file. CodeQL appears to be flagging the literal pattern. Two options: (a) mark the alerts as false-positives in the GitHub UI (they're test-only PRNG seeds, not crypto material), or (b) derive them at runtime (e.g. via std::process::id() or env var with deterministic override) so the literal pattern goes away. (a) is fine if the team accepts.
8. Spawned compare callback can leak the thread on a misbehaving callback
mod.rs:638-647 — if d.compare ignores the token and never writes to compare_err/compare_value, the spawned thread outlives compare(). Matches Charon's "leaked goroutine" semantics, so this is parity, but worth a doc note since thread::spawn doesn't carry the same intuition as a goroutine.
9. Test infra: result_chan_tx bounded N then re-sent N times
internal_test.rs:91 — capacity is N (4). With consensus, exactly 4 sends happen, so it fits. If a future change adds extra decide callbacks (e.g. retry/duplicate), this becomes a silent deadlock waiting for the receiver. Consider unbounded since this is a test-only channel.
fake_clock.rs:126, 153 — ordering between the two sender threads isn't guaranteed; sorting before compare would be safer if the assertion ever expanded to non-uniform values. For true, true it's fine today.
Verdict
LGTM after addressing the cancellation-return-value parity question (#2) and the CodeQL alerts (#7). Everything else is optional polish. Nice incremental hardening of QBFT and a meaningful test surface improvement.
· Branch: iamquang95/qbft
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR