feat: distributed probe voting for HA failover coordination #396

Open
Paragrf wants to merge 1 commit into apache:unstable from Paragrf:ha

Paragrf commented May 25, 2026

Background
To make Kvrocks failover more stable, this PR implements an HA (High Availability) mechanism based on multi-controller consensus.

Key Changes

  • Voting mechanism: Each controller independently probes every Kvrocks node. Before promoting a new master, the leader collects unanimous votes from all peer controllers via POST /internal/vote. A single NO, or an unreachable peer whose lease is still live, blocks the failover.

  • Peer discovery: Controllers register themselves in the store with a heartbeat. ListActivePeers returns peers whose leases are still alive. Expired peers are excluded from quorum rather than blocking failover.

  • Observability: Structured zap logging at every vote decision point (round start, per-peer vote with diagnostic fields, round outcome). Seven new Prometheus metrics track failover proposals, completions, blocks, vote round duration, peer count, per-node failure counts, and individual probe failures.

Related Issues
#392

Implement a multi-controller voting mechanism that prevents split-brain
during failover: each controller independently probes Kvrocks nodes and
the leader collects peer votes before promoting a new master.

- Add VoteCoordinator with concurrent HTTP peer voting and quorum logic
- Add ClusterChecker.ShouldVote with soft threshold and probe freshness
- Add peer registration/discovery and timestamp-based TTL in store layer
- Add /internal/vote endpoint with leader-redirect bypass
- Add Prometheus metrics for vote rounds, failover blocks, probe failures
- Add structured logging for vote diagnostics (failure counts, thresholds)
- Fix probe state pruning, redundant store calls, mutex protection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>