HDDS-14646. SCM should not close Ratis pipelines on Finalize#9779
HDDS-14646. SCM should not close Ratis pipelines on Finalize#9779sodonnel merged 12 commits intoapache:HDDS-14496-zdufrom
Conversation
c8172b8 to
fe8fff6
Compare
.../server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java
Show resolved
Hide resolved
.../server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java
Show resolved
Hide resolved
… all finalize before finalization is complete
9acf575 to
9f6bac6
Compare
|
The Integration tests seem to fail with: Resulting in no storage locations available on the DN. This was supposed to be fixed by HDDS-14632, which is on the branch, but its still failing. Passes locally. EDIT: the failure was from an old run. The latest doesn't fail with this error! |
It failed for this commit: where af22a5c was the target branch state, from two weeks ago. It passes after feature branch |
...ervice/src/main/java/org/apache/hadoop/ozone/container/upgrade/DataNodeUpgradeFinalizer.java
Outdated
Show resolved
Hide resolved
...cm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/FinalizationStateManagerImpl.java
Show resolved
Hide resolved
.../server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/FinalizationManager.java
Show resolved
Hide resolved
.../server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java
Show resolved
Hide resolved
| LOG.info("testPostUpgradeConditionsSCM: container state is {}", | ||
| ciState.name()); | ||
| assertTrue((ciState == HddsProtos.LifeCycleState.CLOSED) || | ||
| // Containers can now be in any state since we no longer close pipelines |
There was a problem hiding this comment.
Can we remove this test on container states completely? The new upgrade flow should be independent of any container states. There's a few other places in this test where I think we can remove similar checks.
I have filed the follow to continue this work: HDDS-14669 Waiting in finalization should not block a server thread |
There was a problem hiding this comment.
Functionality LGTM given the follow-up tasks we have coming. All comments are just code cleanup that can be done here or later. Additionally we now have the following unused methods:
SCMUpgradeFinalizationContext#getConfigurationSCMUpgradeFinalizationContext#getLayoutVersionManagerSCMUpgradeFinalizationContext#getPipelineManager
| int countContainers = 0; | ||
| for (ContainerInfo ci : scm.getContainerManager().getContainers()) { | ||
| HddsProtos.LifeCycleState ciState = ci.getState(); | ||
| LOG.info("testPostUpgradeConditionsSCM: container state is {}", | ||
| ciState.name()); | ||
| assertTrue((ciState == HddsProtos.LifeCycleState.CLOSED) || | ||
| (ciState == HddsProtos.LifeCycleState.CLOSING) || | ||
| (ciState == HddsProtos.LifeCycleState.DELETING) || | ||
| (ciState == HddsProtos.LifeCycleState.DELETED) || | ||
| (ciState == HddsProtos.LifeCycleState.QUASI_CLOSED)); | ||
| for (ContainerInfo ignored : scm.getContainerManager().getContainers()) { | ||
| countContainers++; | ||
| } |
There was a problem hiding this comment.
nit.
| int countContainers = 0; | |
| for (ContainerInfo ci : scm.getContainerManager().getContainers()) { | |
| HddsProtos.LifeCycleState ciState = ci.getState(); | |
| LOG.info("testPostUpgradeConditionsSCM: container state is {}", | |
| ciState.name()); | |
| assertTrue((ciState == HddsProtos.LifeCycleState.CLOSED) || | |
| (ciState == HddsProtos.LifeCycleState.CLOSING) || | |
| (ciState == HddsProtos.LifeCycleState.DELETING) || | |
| (ciState == HddsProtos.LifeCycleState.DELETED) || | |
| (ciState == HddsProtos.LifeCycleState.QUASI_CLOSED)); | |
| for (ContainerInfo ignored : scm.getContainerManager().getContainers()) { | |
| countContainers++; | |
| } | |
| int countContainers = scm.getContainerManager().getContainers().size(); |
| assertEquals(0, dnVersionManager.getMetadataLayoutVersion()); | ||
| } | ||
|
|
||
| int countContainers = 0; |
There was a problem hiding this comment.
This is only called by TestHDDSUpgrade which is not run in CI, we should remove the container state check here for correctness even though the current CI run won't flag it.
| // Verify containers are in acceptable states (OPEN is now allowed). | ||
| for (Container<?> ignored : | ||
| dsm.getContainer().getController().getContainers()) { | ||
| assertTrue(closeStates.stream().anyMatch( | ||
| state -> container.getContainerState().equals(state)), | ||
| "Container had unexpected state " + container.getContainerState()); | ||
| countContainers++; |
There was a problem hiding this comment.
The comment here is stale as well.
| // Verify containers are in acceptable states (OPEN is now allowed). | |
| for (Container<?> ignored : | |
| dsm.getContainer().getController().getContainers()) { | |
| assertTrue(closeStates.stream().anyMatch( | |
| state -> container.getContainerState().equals(state)), | |
| "Container had unexpected state " + container.getContainerState()); | |
| countContainers++; | |
| countContainers += dsm.getContainer().getContainerSet().containerCount(); |
|
@errose28 Thanks for the review. I will clean up the nits on the next change. |
* HDDS-14496-zdu: HDDS-14712. Rename DatanodeVersion to HDDSVersion (apache#9822) HDDS-14646. SCM should not close Ratis pipelines on Finalize (apache#9779)
What changes were proposed in this pull request?
When SCM finalizes an upgrade, it should no longer close all the Ratis pipelines on the datanodes. Instead they should be kept open. The SCM finalize command now needs to wait for all healthy datanodes to report matching SLV and MLV versions to know they have been finalized. Only when all datanodes are finalized, should SCM complete the finalization process.
This change is mostly to remove the existing close pipeline code and anything else that depended on it. The only new code added is the change to wait in DNs reporting SVL == MVL, as before new pipelines being created was the trigger to complete the process.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14646
How was this patch tested?
Existing tests, some of which had to be adapted slightly.