fix: QBFT by iamquang95 · Pull Request #417 · NethermindEth/pluto

iamquang95 · 2026-05-15T05:59:50Z

This PR

Hardens the QBFT tests and brings them to parity with Charon; previously, some parameters were misconfigured.
Adds fuzz tests for QBFT that match Charon's.
Adds more unit tests that help make the QBFT fix stricter.
Adds more chain-split tests.

iamquang95 · 2026-05-18T03:43:48Z

@claude review the changes

claude · 2026-05-18T03:44:02Z

Claude finished @iamquang95's task in 7m 6s

Claude PR Review — fix: QBFT

Read PR diff and understand the scope of changes
Review crates/core/src/qbft/mod.rs (core QBFT logic changes)
Review crates/core/src/qbft/fake_clock.rs (test clock changes)
Review crates/core/src/qbft/internal_test.rs (tests/fuzz)
Cross-check with Charon Go source for parity
Post review feedback

Summary

Solid hardening PR. Three real fixes (FakeClock Drop, thread::scope deadlock in compare, more correct value-source preservation), plus much stronger test coverage (compare semantics, chain-split, fuzzing, idle cancellation). I have no blocking concerns. A handful of questions and suggestions below — most are nits or things worth documenting.

Strengths

Drop for FakeClock removal is a real bug fix. Previously, because FakeClock is Clone with shared Arc<Mutex<…>>, Drop::drop ran on every clone going out of scope and indiscriminately cancelled all timers via inner.clients.clear(). Closures or threads that captured a clone would silently kill production timers in tests. Tests now always do explicit clock.cancel() at termination/panic sites, which is the right shape.
compare() now uses thread::spawn + a child CancellationTokenSource instead of thread::scope. The previous thread::scope blocked the entire run loop until the callback returned — so a buggy/blocking Compare could deadlock the algorithm even when the round timer fired. The new design lets the timer arm win and propagates cancellation to the callback via compare_cts. Good parity with Charon's context.WithCancel(ctx) pattern. (Tested directly in compare_timeout_does_not_wait_for_blocked_callback and compare_parent_cancel_cancels_callback_token.)
compare() drain-on-err is more correct than Charon. When compare_err_rx fires with Ok(()), the new code drains any pending compare_value_rx first (crates/core/src/qbft/mod.rs:660-662). Charon's select is non-deterministic and can return the old inputValueSource when the goroutine wrote both channels before the parent picked. The new fixture compare_success_error_cached_value_source_and_timeout at internal_test.rs:1442-1461 exercises this exact race. Worth a one-line code comment noting this is a deliberate divergence so the next reader doesn't "fix" it back.
panic!("bug: expected only …") → Err(QbftError::UnexpectedCompareError) (mod.rs:515). Returning an error instead of panicking is the right call inside a hot loop running under thread::scope. Good.
New fixtures fill real gaps: idle_run_returns_when_cancelled, the chain-split halt case (zz_chain_split_no_consensus_halt), the fuzzer tests, and the compare_* matrix all match Charon's test intent.
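
To illustrate the first point, here is a minimal sketch of the clone-plus-Drop hazard. `SharedTimers` and `DropClearsTimers` are hypothetical stand-ins written for this comment, not Pluto's actual `FakeClock`; the only thing they share with it is the shape described above (a `Clone` type over shared state whose `Drop` clears that state).

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in: clones share one timer list through an Arc.
#[derive(Clone)]
struct SharedTimers {
    clients: Arc<Mutex<Vec<u64>>>,
}

impl SharedTimers {
    fn new() -> Self {
        Self { clients: Arc::new(Mutex::new(Vec::new())) }
    }

    fn register(&self, id: u64) {
        self.clients.lock().unwrap().push(id);
    }

    fn active(&self) -> usize {
        self.clients.lock().unwrap().len()
    }

    /// Explicit cancellation: the caller decides when all timers die.
    fn cancel(&self) {
        self.clients.lock().unwrap().clear();
    }
}

/// Same state, but with the problematic Drop: every clone clears the shared list.
#[derive(Clone)]
struct DropClearsTimers(SharedTimers);

impl Drop for DropClearsTimers {
    fn drop(&mut self) {
        self.0.cancel();
    }
}

/// A short-lived clone (for example one captured by a closure) wipes the timers.
fn timers_left_with_drop_impl() -> usize {
    let clock = DropClearsTimers(SharedTimers::new());
    clock.0.register(1);
    {
        let _captured = clock.clone();
    } // the clone is dropped here and clears the shared list
    clock.0.active()
}

/// Without the Drop impl, dropping a clone leaves the timers alone.
fn timers_left_without_drop_impl() -> usize {
    let clock = SharedTimers::new();
    clock.register(1);
    {
        let _captured = clock.clone();
    }
    clock.active()
}
```

With the `Drop` impl the registered timer is gone as soon as any clone goes out of scope; without it, the timer survives until someone calls `cancel()` explicitly.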

Issues & Questions

1. `compare()` busy-cancels after `ct.is_canceled()` (minor)

crates/core/src/qbft/mod.rs:687-691 — once the parent token is observed cancelled, the loop calls compare_cts.cancel() on every 1 ms tick until the callback finally drains. Idempotent but wasteful and a little noisy in profiling. Consider tracking a bool so cancel runs once, and dropping the default-arm to mpmc::never()-equivalent behavior afterwards. Not a correctness issue.
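
A rough sketch of the one-shot idea, assuming a poll loop that keeps ticking after the parent token is already cancelled. `CountingToken` and `poll_after_parent_cancel` are hypothetical names used only to count calls; they are not the types in `mod.rs`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical child token that only records how often cancel() is called.
struct CountingToken {
    cancel_calls: AtomicUsize,
}

impl CountingToken {
    fn new() -> Self {
        Self { cancel_calls: AtomicUsize::new(0) }
    }

    fn cancel(&self) {
        self.cancel_calls.fetch_add(1, Ordering::SeqCst);
    }

    fn cancel_calls(&self) -> usize {
        self.cancel_calls.load(Ordering::SeqCst)
    }
}

/// Simulates `ticks` poll iterations during which the parent token is
/// assumed to be cancelled already. With `one_shot`, cancellation is
/// forwarded to the child exactly once.
fn poll_after_parent_cancel(child: &CountingToken, ticks: usize, one_shot: bool) {
    let mut forwarded = false;
    for _ in 0..ticks {
        if !(one_shot && forwarded) {
            child.cancel();
            forwarded = true;
        }
        // ... keep waiting for the callback to drain ...
    }
}
```

The flag costs one `bool` and makes the intent visible: forward the parent's cancellation once, then only wait.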

2. `qbft::run` returns `Ok(())` on cancellation, Charon returns `ctx.Err()` (parity question)

mod.rs:591-595 — the cancellation path simply breaks, then Ok(()). Charon's equivalent is case <-ctx.Done(): return ctx.Err() (charon/core/qbft/qbft.go:428). idle_run_returns_when_cancelled codifies the Pluto behaviour. Is the divergence intentional? Callers downstream might rely on observing the cancellation. If intentional, a one-line comment at the break would help.
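
For comparison, a minimal sketch of what surfacing the cancellation could look like. `RunError::Cancelled` and `run_loop` are hypothetical; the real loop is unbounded and uses Pluto's own token and error types, so treat this only as the shape of the return value.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical error type with a dedicated cancellation variant.
#[derive(Debug, PartialEq)]
enum RunError {
    Cancelled,
}

/// Bounded stand-in for the run loop: returns Err(Cancelled) as soon as
/// the flag is observed, mirroring Go's `return ctx.Err()`.
fn run_loop(cancelled: &AtomicBool, max_ticks: usize) -> Result<(), RunError> {
    for _ in 0..max_ticks {
        if cancelled.load(Ordering::SeqCst) {
            return Err(RunError::Cancelled);
        }
        // ... handle messages and timers ...
    }
    Ok(())
}
```

Callers can then tell a cancelled run from a normal return with a plain `match`, instead of inferring it from side effects.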

3. `RUN_CANCELLATION_POLL_INTERVAL` is used by both `run` and `compare` (naming)

mod.rs:36 — the name says "run" but the constant is the poll cadence in both functions. Either generalize the name (e.g. CANCELLATION_POLL_INTERVAL) or define a separate constant for compare.

4. `Definition::compare` is `Arc<…>` while every other callback is `Box<dyn Fn …>` (API asymmetry)

mod.rs:121 — the asymmetry is justified (compare must clone into a spawned thread), but it's a footgun for users who already constructed a Box. Worth a short doc comment noting "Arc, because the callback is shared with a spawned worker thread."
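
A small sketch of why `Arc` is needed here, assuming a simplified callback signature. The `Compare` alias and `run_compare_on_worker` are hypothetical and much simpler than the real `Definition::compare`.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical, simplified callback type. `Send + Sync` lets the Arc cross threads.
type Compare = Arc<dyn Fn(u64) -> bool + Send + Sync>;

/// Clones the Arc into a spawned worker. A `Box<dyn Fn>` could only be moved
/// into the thread once, so the owner could not call it again afterwards.
fn run_compare_on_worker(compare: &Compare, value: u64) -> bool {
    let cb = Arc::clone(compare);
    thread::spawn(move || (*cb)(value))
        .join()
        .expect("worker panicked")
}
```

The caller keeps its own handle, so the same callback can be dispatched to a fresh worker on every round.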

5. `UnexpectedCompareError` is opaque

mod.rs:57-58 — it always means "compare returned a ChannelError or some other surprise." Consider UnexpectedCompareError(QbftError) so debugging logs carry the inner error. Optional.
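
One possible shape, sketched with hypothetical types. A variant that wraps `QbftError` directly would be recursive, so this sketch assumes the inner error is boxed; the enums below are illustrative mirrors, not the definitions in `mod.rs`.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical inner error, standing in for whatever compare surfaced.
#[derive(Debug)]
enum ChannelError {
    Disconnected,
}

impl fmt::Display for ChannelError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "channel disconnected")
    }
}

impl Error for ChannelError {}

/// Hypothetical mirror of the error enum with a variant that keeps its cause.
#[derive(Debug)]
enum QbftError {
    UnexpectedCompareError(Box<dyn Error + Send + Sync>),
}

impl fmt::Display for QbftError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            QbftError::UnexpectedCompareError(inner) => {
                write!(f, "unexpected compare error: {inner}")
            }
        }
    }
}

impl Error for QbftError {
    /// Exposes the wrapped cause so logging layers can walk the chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            QbftError::UnexpectedCompareError(inner) => {
                let cause: &(dyn Error + 'static) = &**inner;
                Some(cause)
            }
        }
    }
}
```

Boxing keeps the enum small and lets any error type through, at the cost of one allocation on a path that should be rare.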

6. Leader function leaves rounds with no leader at `round % N == 0`

internal_test.rs:600 — (instance + round) % n == process with processes 1..=N means round 4, 8, 12… have leader 0 (no one). For should_halt test this just wastes timeouts, but it's surprising. Charon's IsLeader test helper uses process == round%n with process 0-indexed; Pluto uses 1-indexed processes. If you want exact parity, switch to (instance + round - 1) % n + 1.
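
The arithmetic can be checked in isolation. Both functions below are hypothetical helpers written for this comment (the second assumes `round >= 1`); they return the leader's process number rather than a boolean.

```rust
/// Formula as described above: with processes numbered 1..=n this can return 0,
/// which matches no process.
fn leader_modulo(instance: u64, round: u64, n: u64) -> u64 {
    (instance + round) % n
}

/// Suggested variant: shifts into 1..=n so every round has a leader.
/// Assumes round >= 1, otherwise the subtraction underflows for instance 0.
fn leader_one_indexed(instance: u64, round: u64, n: u64) -> u64 {
    (instance + round - 1) % n + 1
}
```

For `instance = 0` and `n = 4`, round 4 maps to process 0 under the first formula and to process 4 under the second; round 5 wraps back to process 1.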

7. CodeQL alerts on `0x5142…` and `0x4348…` seeds

internal_test.rs:16, 586 — the latest commit ("removed hard coded salt in tests") suggests these were addressed, but both constants are still present in the file. CodeQL appears to be flagging the literal pattern. Two options: (a) mark the alerts as false-positives in the GitHub UI (they're test-only PRNG seeds, not crypto material), or (b) derive them at runtime (e.g. via std::process::id() or env var with deterministic override) so the literal pattern goes away. (a) is fine if the team accepts.
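
A third variant, sketched under assumptions: derive the seed from a descriptive string so no hex literal appears in the file. `seed_from_label` and the labels are hypothetical. Note that `DefaultHasher`'s algorithm is unspecified and may change between Rust releases, so the derived seed is only stable for a given toolchain.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derives a test-only PRNG seed from a human-readable label.
/// Not suitable for anything cryptographic.
fn seed_from_label(label: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    label.hash(&mut hasher);
    hasher.finish()
}
```

If cross-toolchain reproducibility of fuzz failures matters, a hand-written hash with a fixed specification would be the safer choice.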

8. Spawned `compare` callback can leak the thread on a misbehaving callback

mod.rs:638-647 — if d.compare ignores the token and never writes to compare_err/compare_value, the spawned thread outlives compare(). Matches Charon's "leaked goroutine" semantics, so this is parity, but worth a doc note since thread::spawn doesn't carry the same intuition as a goroutine.
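
A minimal sketch of the pattern and its leak, using `std::sync::mpsc` as a stand-in for the channels in `mod.rs`. `call_with_timeout` is a hypothetical helper, not the actual `compare()`.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `callback` on a detached thread and waits at most `timeout` for its
/// result. On timeout the caller moves on, but the worker keeps running
/// until the callback returns on its own.
fn call_with_timeout<F>(callback: F, timeout: Duration) -> Option<u64>
where
    F: FnOnce() -> u64 + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails harmlessly if the caller already gave up.
        let _ = tx.send(callback());
    });
    rx.recv_timeout(timeout).ok()
}
```

Nothing joins the worker, so a callback that never returns pins an OS thread for the life of the process. That is the cost worth calling out in the doc note.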

9. Test infra: `result_chan_tx` bounded N then re-sent N times

internal_test.rs:91 — capacity is N (4). With consensus, exactly 4 sends happen, so it fits. If a future change adds extra decide callbacks (e.g. retry/duplicate), this becomes a silent deadlock waiting for the receiver. Consider unbounded since this is a test-only channel.
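
The capacity edge can be shown with the standard library's bounded channel, used here as a stand-in for whatever channel type the test uses. `accepted_without_receiver` is a hypothetical helper that uses `try_send` so the overflow is observable instead of blocking.

```rust
use std::sync::mpsc;

/// Counts how many of `sends` messages a bounded channel of `capacity`
/// accepts while nothing is draining it. A blocking `send` would hang
/// on the first message past capacity.
fn accepted_without_receiver(capacity: usize, sends: usize) -> usize {
    // Keep the receiver alive so failures mean "full", not "disconnected".
    let (tx, _rx) = mpsc::sync_channel::<usize>(capacity);
    (0..sends).filter(|&i| tx.try_send(i).is_ok()).count()
}
```

With capacity 4, four sends fit and a fifth is rejected; in the test that fifth send would block instead.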

10. Nit: `assert_eq!(vec![true, true], done_rx.try_iter().collect::<Vec<_>>())`

fake_clock.rs:126, 153 — ordering between the two sender threads isn't guaranteed; sorting before compare would be safer if the assertion ever expanded to non-uniform values. For true, true it's fine today.
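
A sketch of the order-insensitive form, assuming the values implement `Ord`. `collect_sorted` is a hypothetical helper, not code from `fake_clock.rs`.

```rust
use std::sync::mpsc;
use std::thread;

/// Sends each value from its own thread, then returns what arrived in
/// sorted order so the result does not depend on thread scheduling.
fn collect_sorted(values: Vec<u32>) -> Vec<u32> {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = values
        .into_iter()
        .map(|v| {
            let tx = tx.clone();
            thread::spawn(move || tx.send(v).expect("receiver alive"))
        })
        .collect();
    for handle in handles {
        handle.join().expect("sender panicked");
    }
    drop(tx);
    let mut received: Vec<u32> = rx.try_iter().collect();
    received.sort_unstable();
    received
}
```

Sorting both sides of the assertion keeps the test stable if the senders ever start reporting distinct values.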

Verdict

LGTM after addressing the cancellation-return-value parity question (#2) and the CodeQL alerts (#7). Everything else is optional polish. Nice incremental hardening of QBFT and a meaningful test surface improvement.
· Branch: iamquang95/qbft

test: update qbft test

606d5bc

iamquang95 changed the title ~~fix: qbft~~ fix: QBFT May 15, 2026

github-advanced-security AI found potential problems May 15, 2026

iamquang95 added 2 commits May 18, 2026 10:22

fix: compare run on retached thread

0ec38de

fix: removed hard coded salt in tests

b752caa

iamquang95 added 7 commits May 18, 2026 12:52

fix: return error on Context cancelled

ed912ee

fix: hash from string, not magic number

e01f64f

fix: one shot cancel when parent is cancelled

b22fb41

fix: fix make_is_leader test

081d8a4

fix: minors naming and comments

1a0a337

Merge remote-tracking branch 'origin/main' into iamquang95/qbft

183f48e

fix: linter

44b9a13

iamquang95 marked this pull request as ready for review May 18, 2026 07:54

iamquang95 wants to merge 10 commits into main from iamquang95/qbft
