Skip to content

fix: QBFT#417

Open
iamquang95 wants to merge 10 commits into
mainfrom
iamquang95/qbft
Open

fix: QBFT#417
iamquang95 wants to merge 10 commits into
mainfrom
iamquang95/qbft

Conversation

@iamquang95
Copy link
Copy Markdown
Collaborator

This PR

  • Hardens the QBFT tests. Makes the test parity with charon: Previously, some params are misconfigured.
  • Add fuzz tests for QBFT matches with charon.
  • Add some more unit tests which helps the fix QBFT stricter.
  • Add more chain split test.

@iamquang95 iamquang95 changed the title fix: qbft fix: QBFT May 15, 2026
Comment thread crates/core/src/qbft/internal_test.rs Fixed
Comment thread crates/core/src/qbft/internal_test.rs Fixed
Comment thread crates/core/src/qbft/internal_test.rs Fixed
Comment thread crates/core/src/qbft/internal_test.rs Fixed
Comment thread crates/core/src/qbft/internal_test.rs Fixed
Comment thread crates/core/src/qbft/internal_test.rs Fixed
Comment thread crates/core/src/qbft/internal_test.rs Fixed
Comment thread crates/core/src/qbft/internal_test.rs Fixed
Comment thread crates/core/src/qbft/internal_test.rs Fixed
@iamquang95
Copy link
Copy Markdown
Collaborator Author

@claude review the changes

@claude
Copy link
Copy Markdown

claude Bot commented May 18, 2026

Claude finished @iamquang95's task in 7m 6s —— View job


Claude PR Review — fix: QBFT

  • Read PR diff and understand the scope of changes
  • Review crates/core/src/qbft/mod.rs (core QBFT logic changes)
  • Review crates/core/src/qbft/fake_clock.rs (test clock changes)
  • Review crates/core/src/qbft/internal_test.rs (tests/fuzz)
  • Cross-check with Charon Go source for parity
  • Post review feedback

Summary

Solid hardening PR. Three real fixes (FakeClock Drop, thread::scope deadlock in compare, more correct value-source preservation), plus much stronger test coverage (compare semantics, chain-split, fuzzing, idle cancellation). I have no blocking concerns. A handful of questions and suggestions below — most are nits or things worth documenting.


Strengths

  1. Drop for FakeClock removal is a real bug fix. Previously, because FakeClock is Clone with shared Arc<Mutex<…>>, Drop::drop ran on every clone going out of scope and indiscriminately cancelled all timers via inner.clients.clear(). Closures or threads that captured a clone would silently kill production timers in tests. Tests now always do explicit clock.cancel() at termination/panic sites, which is the right shape.

  2. compare() now uses thread::spawn + a child CancellationTokenSource instead of thread::scope. The previous thread::scope blocked the entire run loop until the callback returned — so a buggy/blocking Compare could deadlock the algorithm even when the round timer fired. The new design lets the timer arm win and propagates cancellation to the callback via compare_cts. Good parity with Charon's context.WithCancel(ctx) pattern. (Tested directly in compare_timeout_does_not_wait_for_blocked_callback and compare_parent_cancel_cancels_callback_token.)

  3. compare() drain-on-err is more correct than Charon. When compare_err_rx fires with Ok(()), the new code drains any pending compare_value_rx first (crates/core/src/qbft/mod.rs:660-662). Charon's select is non-deterministic and can return the old inputValueSource when the goroutine wrote both channels before the parent picked. The new fixture compare_success_error_cached_value_source_and_timeout at internal_test.rs:1442-1461 exercises this exact race. Worth a one-line code comment noting this is a deliberate divergence so the next reader doesn't "fix" it back.

  4. panic!("bug: expected only …")Err(QbftError::UnexpectedCompareError) (mod.rs:515). Returning an error instead of panicking is the right call inside a hot loop running under thread::scope. Good.

  5. New fixtures fill real gaps: idle_run_returns_when_cancelled, the chain-split halt case (zz_chain_split_no_consensus_halt), the fuzzer tests, and the compare_* matrix all match Charon's test intent.


Issues & Questions

1. compare() busy-cancels after ct.is_canceled() (minor)

crates/core/src/qbft/mod.rs:687-691 — once the parent token is observed cancelled, the loop calls compare_cts.cancel() on every 1 ms tick until the callback finally drains. Idempotent but wasteful and a little noisy in profiling. Consider tracking a bool so cancel runs once, and dropping the default-arm to mpmc::never()-equivalent behavior afterwards. Not a correctness issue.

Fix this →

2. qbft::run returns Ok(()) on cancellation, Charon returns ctx.Err() (parity question)

mod.rs:591-595 — the cancellation path simply breaks, then Ok(()). Charon's equivalent is case <-ctx.Done(): return ctx.Err() (charon/core/qbft/qbft.go:428). idle_run_returns_when_cancelled codifies the Pluto behaviour. Is the divergence intentional? Callers downstream might rely on observing the cancellation. If intentional, a one-line comment at the break would help.

3. RUN_CANCELLATION_POLL_INTERVAL is used by both run and compare (naming)

mod.rs:36 — the name says "run" but the constant is the poll cadence in both functions. Either generalize the name (e.g. CANCELLATION_POLL_INTERVAL) or define a separate constant for compare.

4. Definition::compare is Arc<…> while every other callback is Box<dyn Fn …> (API asymmetry)

mod.rs:121 — the asymmetry is justified (compare must clone into a spawned thread), but it's a footgun for users who already constructed a Box. Worth a short doc comment noting "Arc, because the callback is shared with a spawned worker thread."

5. UnexpectedCompareError is opaque

mod.rs:57-58 — it always means "compare returned a ChannelError or some other surprise." Consider UnexpectedCompareError(QbftError) so debugging logs carry the inner error. Optional.

6. Leader function leaves rounds with no leader at round % N == 0

internal_test.rs:600(instance + round) % n == process with processes 1..=N means round 4, 8, 12… have leader 0 (no one). For should_halt test this just wastes timeouts, but it's surprising. Charon's IsLeader test helper uses process == round%n with process 0-indexed; Pluto uses 1-indexed processes. If you want exact parity, switch to (instance + round - 1) % n + 1.

7. CodeQL alerts on 0x5142… and 0x4348… seeds

internal_test.rs:16, 586 — the latest commit ("removed hard coded salt in tests") suggests these were addressed, but both constants are still present in the file. CodeQL appears to be flagging the literal pattern. Two options: (a) mark the alerts as false-positives in the GitHub UI (they're test-only PRNG seeds, not crypto material), or (b) derive them at runtime (e.g. via std::process::id() or env var with deterministic override) so the literal pattern goes away. (a) is fine if the team accepts.

8. Spawned compare callback can leak the thread on a misbehaving callback

mod.rs:638-647 — if d.compare ignores the token and never writes to compare_err/compare_value, the spawned thread outlives compare(). Matches Charon's "leaked goroutine" semantics, so this is parity, but worth a doc note since thread::spawn doesn't carry the same intuition as a goroutine.

9. Test infra: result_chan_tx bounded N then re-sent N times

internal_test.rs:91 — capacity is N (4). With consensus, exactly 4 sends happen, so it fits. If a future change adds extra decide callbacks (e.g. retry/duplicate), this becomes a silent deadlock waiting for the receiver. Consider unbounded since this is a test-only channel.

10. Nit: assert_eq!(vec![true, true], done_rx.try_iter().collect::<Vec<_>>())

fake_clock.rs:126, 153 — ordering between the two sender threads isn't guaranteed; sorting before compare would be safer if the assertion ever expanded to non-uniform values. For true, true it's fine today.


Verdict

LGTM after addressing the cancellation-return-value parity question (#2) and the CodeQL alerts (#7). Everything else is optional polish. Nice incremental hardening of QBFT and a meaningful test surface improvement.
· Branch: iamquang95/qbft

@iamquang95 iamquang95 marked this pull request as ready for review May 18, 2026 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants