Skip to content

HDDS-14989. Delay follower SCM DN server start until Ratis log catch-up#10059

Open
xichen01 wants to merge 1 commit intoapache:masterfrom
xichen01:HDDS-14989
Open

HDDS-14989. Delay follower SCM DN server start until Ratis log catch-up#10059
xichen01 wants to merge 1 commit intoapache:masterfrom
xichen01:HDDS-14989

Conversation

@xichen01
Copy link
Copy Markdown
Contributor

@xichen01 xichen01 commented Apr 9, 2026

What changes were proposed in this pull request?

Context

Fixed a bug where reading the key could result in a NO_REPLICA_FOUND error during SCM restart and leader transfer.
For details:
https://issues.apache.org/jira/browse/HDDS-14989
Or can see testFollowerCatchupAfterContainerClose for detailed reproduction way

Fix

  • When the Follower SCM starts, it starts the DatanodeProtocolServer to receive FCR and ICR from the Datanode after catching up with the leader's committed log entries.
  • Only allow the leader SCM to update the container via Ratis by executing updateContainerState.

What is the link to the Apache JIRA'

https://issues.apache.org/jira/browse/HDDS-14989

How was this patch tested?

new test

Copy link
Copy Markdown
Contributor

@ivandika3 ivandika3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xichen01 for the optimization. This optimization would allow to reduce the possibility to NO_REPLICA_FOUND when doing startup (e.g. we are doing rolling SCM restarts).

However, I think this is a fundamental design issue of SCM where some information need to be replicated, usually user-facing ones (e.g. creating Container Info and updating Container State), but some information is only updated locally from the heartbeat and updated in best-effort (e.g. container replica, the DN -> List mappings between node and containers). This hybrid combinations of eventually consistent heartbeat mechanism and strongly consistent Ratis potentially can introduce similar subtle bugs. For example

  • The container replica might be recorded under the NodeStateManager, but not in ContainerStateManager (e.g. due NotLeaderException).
  • If container 1000 has been created in the leader but follower has not seen it, when the datanode sends a report on the replica for container 1000, the follower will reject it with ContainerNotFoundException. If the follower becomes a leader immediately after, container 1000 will not record this replica.

In the future, we need to revisit the SCM heartbeat and Ratis design for container management.

This patch should improve the situation. Nevertheless, after the initial assertion before starting the DN RPC server, there is no guarantee that this issue will not happen under network partition between SCMs (e.g. SCM leader and SCM follower is partitioned, but the SCM follower can still have heartbeats with the DNs.

We might need to discuss for a more long-term solution.

for (RaftProtos.CommitInfoProto info : division.getCommitInfos()) {
if (info.getServer().getId().equals(leaderId.toByteString())) {
long leaderCommit = info.getCommitIndex();
boolean caughtUp = lastAppliedIndex >= leaderCommit;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's apply the fix where we need to take into account the metadata entry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants