DF-23489 multinode: tighten polling criteria for liveness#101

Open
vlfig wants to merge 2 commits into rm-racy-assertions from hysteretic-pollfailures

@vlfig vlfig commented Apr 13, 2026

Description

Having ship-shape RPCs is crucial for keeping the odds of missing a transmit as low as possible, which is itself crucial for SVR. We suspect (and telemetry concurs) that our nodes are too lenient with RPCs that fail polling, and that there are gains to be had in booting those out of the alive pool. In particular, our nodes do not detect grey failures, where an RPC fails polling intermittently but below a certain rate.

This changes the behaviour in both directions between the unreachable and alive states:

  1. In the alive loop, successful polls now only decrement the failure counter instead of resetting it. This way, we keep the same behavior for error rates above 5/6 ≈ 0.833, where an RPC gets booted after 5 probes (on average), but we now also eventually declare unreachable any node that sustains an error rate between 1:1 (one failure per success, i.e. 0.5) and that rate, which we'd previously tolerate;
  2. In the unreachable loop, we add a polling phase to the journey back into alive, mirroring the other direction and ensuring a node doesn't make it back into alive in the same condition that got it kicked out in the first place. However, instead of following the same "decaying" counter, we opted for a stricter flow where a single poll failure resets the node back to square one (dialing). This new phase is governed by a new config property, PollSuccessThreshold, which defaults to 0; that default matches current behavior and allows for progressive, explicit rollout.
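
For reviewers who want to reason about the two rules without reading the diff, here is a minimal sketch of both. It is not the code in this PR: `failureCounter`, `recoveryGate`, `gateResult` and `pollsUntilUnreachable` are hypothetical names, and the failure threshold of 5 is assumed from the "5 probes" figure above. Only `PollSuccessThreshold` comes from the change itself.

```go
package main

import "fmt"

// failureCounter is a hypothetical stand-in for the alive loop's bookkeeping.
type failureCounter struct {
	failures  uint32
	threshold uint32 // assumed failure threshold; 0 disables the check
}

// observe records one poll result and reports whether the node should be
// declared unreachable.
func (c *failureCounter) observe(ok bool) bool {
	if ok {
		if c.failures > 0 {
			c.failures-- // decay by one instead of resetting to zero
		}
		return false
	}
	c.failures++
	return c.threshold > 0 && c.failures >= c.threshold
}

// pollsUntilUnreachable replays a repeating poll pattern (true = success) and
// returns the 1-based poll at which the node is booted, or -1 if it survives.
func pollsUntilUnreachable(pattern []bool, threshold uint32, maxPolls int) int {
	c := failureCounter{threshold: threshold}
	for i := 0; i < maxPolls; i++ {
		if c.observe(pattern[i%len(pattern)]) {
			return i + 1
		}
	}
	return -1
}

type gateResult int

const (
	keepPolling gateResult = iota
	backToAlive
	redial
)

// recoveryGate is a hypothetical stand-in for the unreachable loop's new
// polling phase; threshold plays the role of PollSuccessThreshold.
type recoveryGate struct {
	successes uint32
	threshold uint32
}

// observe applies the strict rule: any failure sends the node back to dialing.
func (g *recoveryGate) observe(ok bool) gateResult {
	if g.threshold == 0 {
		return backToAlive // phase disabled, matching the current behavior
	}
	if !ok {
		g.successes = 0
		return redial
	}
	g.successes++
	if g.successes >= g.threshold {
		return backToAlive
	}
	return keepPolling
}

func main() {
	// Two failures per success (error rate 2/3): never 5 failures in a row,
	// but the decaying counter still climbs to the threshold.
	fmt.Println(pollsUntilUnreachable([]bool{false, false, true}, 5, 1000)) // 11
	// Exactly 1:1: the counter oscillates and the node stays alive.
	fmt.Println(pollsUntilUnreachable([]bool{false, true}, 5, 1000)) // -1

	g := recoveryGate{threshold: 3}
	fmt.Println(g.observe(true) == keepPolling, g.observe(false) == redial)
}
```

The asymmetry is deliberate in this sketch as in the description: leaving alive tolerates occasional failures, while re-entering alive requires an unbroken run of successes.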

Implements DF-23489.

Requires Dependencies

None.

Resolves Dependencies

None.

github-actions bot commented Apr 13, 2026

⚠️ API Diff Results - github.com/smartcontractkit/chainlink-framework/multinode

⚠️ Breaking Changes (1)

NodeConfig (1)
  • PollSuccessThreshold — ➕ Added

✅ Compatible Changes (3)

config.(*MultiNodeConfig) (1)
  • PollSuccessThreshold — ➕ Added
config.MultiNode (1)
  • PollSuccessThreshold — ➕ Added
config.MultiNodeConfig (1)
  • PollSuccessThreshold — ➕ Added

📄 View full apidiff report
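
A note on why the `NodeConfig` entry is likely flagged as breaking while the others are not: assuming `NodeConfig` is an interface (the report does not say so explicitly), adding a method to it means existing implementers outside this module no longer satisfy it. The sketch below illustrates that with a trimmed, hypothetical version of the interface; `legacyConfig`, `updatedConfig`, and the `PollFailureThreshold` method shown here are illustrative and not taken from the report.

```go
package main

import "fmt"

// NodeConfig is a trimmed, hypothetical version of the interface after this
// change; the real one presumably has more methods.
type NodeConfig interface {
	PollFailureThreshold() uint32
	PollSuccessThreshold() uint32 // the newly added method
}

// legacyConfig is a hypothetical downstream implementer written before the
// method was added.
type legacyConfig struct{}

func (legacyConfig) PollFailureThreshold() uint32 { return 5 }

// updatedConfig adds the new method; returning 0 keeps the current behavior.
type updatedConfig struct{ legacyConfig }

func (updatedConfig) PollSuccessThreshold() uint32 { return 0 }

// Compile-time check that the updated type satisfies the interface. The same
// line with legacyConfig{} would fail to compile.
var _ NodeConfig = updatedConfig{}

func main() {
	// The runtime type assertion shows the same thing without breaking the build.
	_, legacyOK := interface{}(legacyConfig{}).(NodeConfig)
	_, updatedOK := interface{}(updatedConfig{}).(NodeConfig)
	fmt.Println(legacyOK, updatedOK) // false true
}
```

Downstream implementers would need to add the method (returning 0 to opt out) before picking up this version.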

@vlfig vlfig force-pushed the hysteretic-pollfailures branch from a99c652 to 6a44457 Compare April 14, 2026 11:25
vlfig added 2 commits April 14, 2026 14:06
Successful polls now decrement the failure count, don't fully reset it.
So that nodes (RPCs) are eventually declared unreachable if they sustain poll error rates above 1:1.
@vlfig vlfig force-pushed the hysteretic-pollfailures branch from 6a44457 to e242962 Compare April 14, 2026 13:07
@vlfig vlfig changed the base branch from main to rm-racy-assertions April 14, 2026 13:11
@vlfig vlfig marked this pull request as ready for review April 14, 2026 13:12
@vlfig vlfig requested a review from a team as a code owner April 14, 2026 13:12