
HDDS-14646. SCM should not close Ratis pipelines on Finalize#9779

Merged
sodonnel merged 12 commits into apache:HDDS-14496-zdu from sodonnel:close-pipelines
Feb 24, 2026

Conversation

@sodonnel
Contributor

What changes were proposed in this pull request?

When SCM finalizes an upgrade, it should no longer close all the Ratis pipelines on the datanodes. Instead, they should be kept open. The SCM finalize command now needs to wait for all healthy datanodes to report matching SLV and MLV versions to know they have been finalized. Only when all datanodes are finalized should SCM complete the finalization process.

This change mostly removes the existing close-pipeline code and anything else that depended on it. The only new code added is the wait for DNs to report SLV == MLV; previously, the creation of new pipelines was the trigger to complete the process.
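The gate described above (complete finalization only once every healthy datanode reports SLV == MLV) can be sketched roughly as follows. This is a hypothetical illustration, not the actual Ozone/SCM API: `DatanodeReport`, its fields, and `allHealthyDatanodesFinalized` are invented names for the sketch.

```java
import java.util.List;

public class FinalizeWait {
  // Hypothetical datanode view: software layout version (SLV) and
  // metadata layout version (MLV) as reported via heartbeats.
  record DatanodeReport(String uuid, int slv, int mlv, boolean healthy) {}

  // Finalization may complete only when every healthy datanode
  // reports SLV == MLV; unhealthy nodes are not waited on.
  static boolean allHealthyDatanodesFinalized(List<DatanodeReport> reports) {
    return reports.stream()
        .filter(DatanodeReport::healthy)
        .allMatch(dn -> dn.slv() == dn.mlv());
  }

  public static void main(String[] args) {
    List<DatanodeReport> reports = List.of(
        new DatanodeReport("dn1", 8, 8, true),
        new DatanodeReport("dn2", 8, 7, true));  // still finalizing
    System.out.println(allHealthyDatanodesFinalized(reports)); // prints false
  }
}
```

In the real flow this check would be re-evaluated as heartbeats arrive, rather than polled over a static list.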

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14646

How was this patch tested?

Existing tests, some of which had to be adapted slightly.

@github-actions github-actions bot added the zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496 label Feb 17, 2026
@sodonnel
Contributor Author

sodonnel commented Feb 18, 2026

The integration tests seem to fail with:

2026-02-18 10:03:04,129 [main] ERROR volume.MutableVolumeSet (MutableVolumeSet.java:initializeVolumeSet(193)) - Failed to parse the storage location: file:///home/runner/work/ozone/ozone/hadoop-hdds/container-service/target/tmp/dfs/data
org.apache.hadoop.ozone.common.InconsistentStorageStateException: Mismatched DatanodeUUIDs. Version File : /home/runner/work/ozone/ozone/hadoop-hdds/container-service/target/tmp/dfs/data/hdds/VERSION has datanodeUuid: 37fe7285-51fc-49f0-8ced-b4814363d903 and Datanode has datanodeUuid: f2bcb13b-daa7-4306-9fad-7488b660dff6
	at org.apache.hadoop.ozone.container.common.utils.StorageVolumeUtil.getDatanodeUUID(StorageVolumeUtil.java:122)
	at org.apache.hadoop.ozone.container.common.volume.StorageVolume.readVersionFile(StorageVolume.java:381)
	at org.apache.hadoop.ozone.container.common.volume.StorageVolume.initializeImpl(StorageVolume.java:239)
	at org.apache.hadoop.ozone.container.common.volume.StorageVolume.initialize(StorageVolume.java:214)
	at org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:155)
	at org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:82)

This results in no storage locations being available on the DN. It was supposed to be fixed by HDDS-14632, which is on the branch, but it's still failing. It passes locally.

EDIT: the failure was from an old run. The latest doesn't fail with this error!

@adoroszlai
Contributor

This was supposed to be fixed by HDDS-14632, which is on the branch, but it's still failing.

It failed for this commit:

HEAD is now at 246e9a4 Merge 9acf575a86b50ff3c59360f1d2d539ec266de1f5 into af22a5c3a435f771c17307d325d7e80eae7361ec

where af22a5c was the target branch state, from two weeks ago.

It passes after feature branch HDDS-14496-zdu was updated to current master.

@sodonnel sodonnel requested a review from errose28 February 18, 2026 19:21
Contributor

@errose28 errose28 left a comment


Thanks for working on this @sodonnel. I assume we are planning to remove the healthy readonly state and finalization checkpoints in follow up changes. Can you file those Jiras to clarify the scope of this PR?

// Containers can now be in any state since we no longer close pipelines
LOG.info("testPostUpgradeConditionsSCM: container state is {}",
    ciState.name());
assertTrue((ciState == HddsProtos.LifeCycleState.CLOSED) ||
Contributor


Can we remove this test on container states completely? The new upgrade flow should be independent of any container states. There's a few other places in this test where I think we can remove similar checks.

@sodonnel
Contributor Author

Thanks for working on this @sodonnel. I assume we are planning to remove the healthy readonly state and finalization checkpoints in follow up changes. Can you file those Jiras to clarify the scope of this PR?

I have filed the following Jiras to continue this work:

HDDS-14669 Waiting in finalization should not block a server thread
HDDS-14670 SCM should not finalize unless it is out of
HDDS-14671 Remove healthy_readonly state from SCM
HDDS-14672 Remove finalization checkpoints

Contributor

@errose28 errose28 left a comment


Functionality LGTM given the follow-up tasks we have coming. All comments are just code cleanup that can be done here or later. Additionally we now have the following unused methods:

  • SCMUpgradeFinalizationContext#getConfiguration
  • SCMUpgradeFinalizationContext#getLayoutVersionManager
  • SCMUpgradeFinalizationContext#getPipelineManager

Comment on lines 115 to 118
int countContainers = 0;
for (ContainerInfo ci : scm.getContainerManager().getContainers()) {
HddsProtos.LifeCycleState ciState = ci.getState();
LOG.info("testPostUpgradeConditionsSCM: container state is {}",
ciState.name());
assertTrue((ciState == HddsProtos.LifeCycleState.CLOSED) ||
(ciState == HddsProtos.LifeCycleState.CLOSING) ||
(ciState == HddsProtos.LifeCycleState.DELETING) ||
(ciState == HddsProtos.LifeCycleState.DELETED) ||
(ciState == HddsProtos.LifeCycleState.QUASI_CLOSED));
for (ContainerInfo ignored : scm.getContainerManager().getContainers()) {
countContainers++;
}
Contributor


nit.

Suggested change
int countContainers = 0;
for (ContainerInfo ci : scm.getContainerManager().getContainers()) {
HddsProtos.LifeCycleState ciState = ci.getState();
LOG.info("testPostUpgradeConditionsSCM: container state is {}",
ciState.name());
assertTrue((ciState == HddsProtos.LifeCycleState.CLOSED) ||
(ciState == HddsProtos.LifeCycleState.CLOSING) ||
(ciState == HddsProtos.LifeCycleState.DELETING) ||
(ciState == HddsProtos.LifeCycleState.DELETED) ||
(ciState == HddsProtos.LifeCycleState.QUASI_CLOSED));
for (ContainerInfo ignored : scm.getContainerManager().getContainers()) {
countContainers++;
}
int countContainers = scm.getContainerManager().getContainers().size();

assertEquals(0, dnVersionManager.getMetadataLayoutVersion());
}

int countContainers = 0;
Contributor


This is only called by TestHDDSUpgrade, which is not run in CI; we should remove the container state check here for correctness even though the current CI run won't flag it.

Comment on lines +183 to 186
// Verify containers are in acceptable states (OPEN is now allowed).
for (Container<?> ignored :
dsm.getContainer().getController().getContainers()) {
assertTrue(closeStates.stream().anyMatch(
state -> container.getContainerState().equals(state)),
"Container had unexpected state " + container.getContainerState());
countContainers++;
Contributor


The comment here is stale as well.

Suggested change
// Verify containers are in acceptable states (OPEN is now allowed).
for (Container<?> ignored :
dsm.getContainer().getController().getContainers()) {
assertTrue(closeStates.stream().anyMatch(
state -> container.getContainerState().equals(state)),
"Container had unexpected state " + container.getContainerState());
countContainers++;
countContainers += dsm.getContainer().getContainerSet().containerCount();

@sodonnel sodonnel merged commit bb546a2 into apache:HDDS-14496-zdu Feb 24, 2026
44 checks passed
@sodonnel
Contributor Author

@errose28 Thanks for the review. I will clean up the nits on the next change.

errose28 added a commit to errose28/ozone that referenced this pull request Feb 26, 2026
* HDDS-14496-zdu:
  HDDS-14712. Rename DatanodeVersion to HDDSVersion (apache#9822)
  HDDS-14646. SCM should not close Ratis pipelines on Finalize (apache#9779)