HDDS-14989. Delay follower SCM DN server start until Ratis log catch-up by xichen01 · Pull Request #10059 · apache/ozone

xichen01 · 2026-04-09T10:15:51Z

What changes were proposed in this pull request?

Context

Fixed a bug where reading the key could result in a NO_REPLICA_FOUND error during SCM restart and leader transfer.
For details:
https://issues.apache.org/jira/browse/HDDS-14989
Or can see testFollowerCatchupAfterContainerClose for detailed reproduction way

Fix

When the Follower SCM starts, it starts the DatanodeProtocolServer to receive FCR and ICR from the Datanode after catching up with the leader's committed log entries.
Only allow the leader SCM to update the container via Ratis by executing updateContainerState.

What is the link to the Apache JIRA'

https://issues.apache.org/jira/browse/HDDS-14989

How was this patch tested?

new test

ivandika3

Thanks @xichen01 for the optimization. This optimization would allow to reduce the possibility to NO_REPLICA_FOUND when doing startup (e.g. we are doing rolling SCM restarts).

However, I think this is a fundamental design issue of SCM where some information need to be replicated, usually user-facing ones (e.g. creating Container Info and updating Container State), but some information is only updated locally from the heartbeat and updated in best-effort (e.g. container replica, the DN -> List mappings between node and containers). This hybrid combinations of eventually consistent heartbeat mechanism and strongly consistent Ratis potentially can introduce similar subtle bugs. For example

The container replica might be recorded under the NodeStateManager, but not in ContainerStateManager (e.g. due NotLeaderException).
If container 1000 has been created in the leader but follower has not seen it, when the datanode sends a report on the replica for container 1000, the follower will reject it with ContainerNotFoundException. If the follower becomes a leader immediately after, container 1000 will not record this replica.

In the future, we need to revisit the SCM heartbeat and Ratis design for container management.

This patch should improve the situation. Nevertheless, after the initial assertion before starting the DN RPC server, there is no guarantee that this issue will not happen under network partition between SCMs (e.g. SCM leader and SCM follower is partitioned, but the SCM follower can still have heartbeats with the DNs.

We might need to discuss for a more long-term solution.

ivandika3 · 2026-04-22T03:59:54Z

+        for (RaftProtos.CommitInfoProto info : division.getCommitInfos()) {
+          if (info.getServer().getId().equals(leaderId.toByteString())) {
+            long leaderCommit = info.getCommitIndex();
+            boolean caughtUp = lastAppliedIndex >= leaderCommit;


Let's apply the fix where we need to take into account the metadata entry.

HDDS-14989. Delay follower SCM DN server start until Ratis log catch-up

9a10d34

adoroszlai requested a review from szetszwo April 9, 2026 18:29

ivandika3 requested review from ashishkumar50 and sumitagrawl April 10, 2026 04:22

szetszwo requested a review from sadanand48 April 19, 2026 17:53

ivandika3 reviewed Apr 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDDS-14989. Delay follower SCM DN server start until Ratis log catch-up#10059

HDDS-14989. Delay follower SCM DN server start until Ratis log catch-up#10059
xichen01 wants to merge 1 commit intoapache:masterfrom
xichen01:HDDS-14989

xichen01 commented Apr 9, 2026 •

edited

Loading

Uh oh!

ivandika3 left a comment •

edited

Loading

Uh oh!

ivandika3 Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

xichen01 commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Context

Fix

What is the link to the Apache JIRA'

How was this patch tested?

Uh oh!

ivandika3 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ivandika3 Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xichen01 commented Apr 9, 2026 •

edited

Loading

ivandika3 left a comment •

edited

Loading