HDDS-14758. Mismatch in Recon container count between Cluster State and Container Summary #10074
devmadhuu wants to merge 11 commits into apache:master
Conversation
…d Container Summary Totals - Initial Commit.
sumitagrawl
left a comment
@devmadhuu Thanks for working on this. A few points to be considered:
- We support moving a container back from deleted to closed / quasi-closed based on state reported by the DN.
- Open containers can be synced incrementally based on the last synced container ID, since new containers are created in increasing ID order. But closed / quasi-closed containers need a full sync, and full syncs can be 3 hours apart or more.
- Handling CLOSING may not be required, as it is a temporary state lasting a few minutes; DN sync can cover it.
- For a stale DN or a volume failure, there can be a sudden spike of container mismatches, e.g. containers moving from open to closing. We need to consider whether a full DB sync is required for open-container differences -- maybe it is not. For quasi-closed/closed state, we are syncing container IDs anyway.
Do we really need a full DB sync only for quasi-closed/closed, or also for the OPEN container state?
Thanks @sumitagrawl for your review. Kindly have a re-look at the code. I have pushed the changes, and OPEN containers are now synced incrementally as you suggested.
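The incremental OPEN-container sync discussed in this thread can be sketched as below. This is a minimal illustration, not Recon's actual code: the class, record, and method names are hypothetical. It relies only on the property stated above, that SCM allocates container IDs in increasing order, so Recon only needs containers with an ID greater than the last one it has seen.

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/**
 * Minimal sketch of incremental OPEN-container sync. All names here
 * (IncrementalOpenSync, ContainerRecord, syncOpenIncrementally) are
 * illustrative, not the actual Ozone/Recon APIs.
 */
public class IncrementalOpenSync {
  public enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED }

  public record ContainerRecord(long id, State state) {}

  // Recon's local view, keyed by container ID; TreeMap gives us lastKey().
  private final NavigableMap<Long, ContainerRecord> localDb = new TreeMap<>();

  /** Adds OPEN containers newer than the last locally known ID. */
  public int syncOpenIncrementally(List<ContainerRecord> scmContainers) {
    long lastSyncedId = localDb.isEmpty() ? -1 : localDb.lastKey();
    int added = 0;
    for (ContainerRecord c : scmContainers) {
      // Only containers beyond the last synced ID are new; since IDs are
      // allocated in increasing order, everything older is already known.
      if (c.id() > lastSyncedId && c.state() == State.OPEN) {
        localDb.put(c.id(), c);
        added++;
      }
    }
    return added;
  }

  public int size() {
    return localDb.size();
  }
}
```

The full sync for CLOSED/QUASI_CLOSED remains separate, since those states are reached out of ID order and cannot be tracked by a high-water mark alone.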
priyeshkaratha
left a comment
Thanks @devmadhuu for working on this. Overall changes LGTM
…mplementation for getContainerCount API.
  containers.addContainer(container);
- if (container.getState() == LifeCycleState.OPEN) {
+ if (container.getState() == LifeCycleState.OPEN
+     && container.getPipelineID() != null) {
Any specific reason for adding this check? Is it possible that a container is reported open but its pipeline does not exist (closed due to some error)? That could have some impact with this check.
      equals(LifeCycleState.OPEN)) {
-   // Pipeline should exist, but not
-   throw new PipelineNotFoundException();
+ if (pipelineID != null) {
A null pipelineID also represents a PipelineNotFoundException.
  try {
    ContainerInfo info = scm.getContainerManager()
        .getContainer(ContainerID.valueOf(containerID));
    cpList.add(new ContainerWithPipeline(info, null));
Are we returning the pipeline as null here?
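The diff above deliberately returns a null pipeline instead of failing, which ties into root causes 5 and 6 in the description (the batch API dropping containers on pipeline-lookup failure). A simplified, self-contained sketch of that null-pipeline-safe behavior is below; the record types and the map-based pipeline lookup are stand-ins for the real SCM ContainerManager/PipelineManager, not the actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Sketch of a null-pipeline-safe batch lookup: when a container's pipeline
 * cannot be resolved (e.g. the pipeline was closed), the container is still
 * returned with a null pipeline rather than being dropped from the batch.
 * All types here are simplified illustrations, not Ozone's real classes.
 */
public class NullSafeBatchLookup {
  public record ContainerInfo(long id, String pipelineId) {}
  public record Pipeline(String id) {}
  public record ContainerWithPipeline(ContainerInfo info, Pipeline pipeline) {}

  public static List<ContainerWithPipeline> getBatch(
      List<ContainerInfo> containers, Map<String, Pipeline> pipelines) {
    List<ContainerWithPipeline> result = new ArrayList<>();
    for (ContainerInfo info : containers) {
      // Previously a failed pipeline lookup could throw and drop the
      // container from the response; here we degrade to a null pipeline
      // so the caller still sees the container and its state.
      Pipeline p = info.pipelineId() == null
          ? null : pipelines.get(info.pipelineId());
      result.add(new ContainerWithPipeline(info, p));
    }
    return result;
  }
}
```

The design trade-off is that callers must now tolerate a null pipeline, but the batch result is complete, so container counts no longer silently shrink.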
What changes were proposed in this pull request?
This PR addresses the issue of large deviations between Recon and SCM in container counts, container IDs, and their respective states.
Largely, the PR addresses two issues: Recon's DN-report path could not advance containers past CLOSING, and Recon could keep a container in OPEN, CLOSING, or QUASI_CLOSED after SCM had already advanced it. The implementation now has two SCM sync mechanisms:
Full snapshot sync
This remains the safety net. Recon replaces its SCM DB view from a fresh SCM
checkpoint on the existing snapshot schedule.
Incremental targeted sync
This now runs on its own schedule and decides between NO_ACTION, TARGETED_SYNC, and FULL_SNAPSHOT.
The targeted sync is implemented as a four-pass workflow:
Pass 1: CLOSED
Add missing CLOSED containers and correct stale OPEN/CLOSING/QUASI_CLOSED containers to CLOSED.
Pass 2: OPEN
Add missing OPEN containers only. No downgrades and no state correction.
Pass 3: QUASI_CLOSED
Add missing QUASI_CLOSED containers and correct stale OPEN/CLOSING containers up to QUASI_CLOSED.
Pass 4: retirement for DELETING/DELETED
Start from Recon's own CLOSED and QUASI_CLOSED containers and move them forward only when SCM explicitly returns DELETING or DELETED.
Root Causes Addressed
1. DN-report path could not advance beyond CLOSING
2. Sync used to be add-only for CLOSED
3. Recon could miss OPEN and QUASI_CLOSED containers entirely
4. Recon never retired stale live states based on SCM deletion progress
5. SCM batch API could drop containers when pipeline lookup failed
6. Recon add path and SCM state manager were not null-pipeline safe
7. Open-container count per pipeline could drift
8. decideSyncAction() became a real tiered decision
Current logic:
- Total drift above ozone.recon.scm.container.threshold: FULL_SNAPSHOT
- Total drift > 0: TARGETED_SYNC (across OPEN, QUASI_CLOSED, CLOSED, and the remainder)
- Any per-state drift above ozone.recon.scm.per.state.drift.threshold: TARGETED_SYNC
- Otherwise: NO_ACTION
9. Incremental sync got its own schedule
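The tiered decideSyncAction() decision described above can be sketched as follows. This is an illustrative reading of the PR description, not the actual Recon implementation: the class name, method signature, and the way thresholds are passed in are assumptions, while the two configuration keys are the ones named in the description.

```java
/**
 * Sketch of the tiered sync decision. The configuration keys
 * ozone.recon.scm.container.threshold and
 * ozone.recon.scm.per.state.drift.threshold come from the PR description;
 * everything else here (names, signature) is illustrative.
 */
public class SyncDecision {
  public enum SyncAction { NO_ACTION, TARGETED_SYNC, FULL_SNAPSHOT }

  /**
   * @param totalDrift         total container-count difference SCM vs Recon
   * @param maxPerStateDrift   largest per-state (OPEN/QUASI_CLOSED/CLOSED/rest) drift
   * @param containerThreshold value of ozone.recon.scm.container.threshold
   * @param perStateThreshold  value of ozone.recon.scm.per.state.drift.threshold
   */
  public static SyncAction decideSyncAction(long totalDrift, long maxPerStateDrift,
      long containerThreshold, long perStateThreshold) {
    if (totalDrift > containerThreshold) {
      // Too far apart to repair in place: rebuild from a fresh SCM checkpoint.
      return SyncAction.FULL_SNAPSHOT;
    }
    if (totalDrift > 0 || maxPerStateDrift > perStateThreshold) {
      // Repairable drift: run the four-pass targeted sync.
      return SyncAction.TARGETED_SYNC;
    }
    // Recon and SCM agree; nothing to do until the next scheduled check.
    return SyncAction.NO_ACTION;
  }
}
```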
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-14758
How was this patch tested?
Tests added or updated:
- TestReconSCMContainerSyncIntegration.java
- TestReconStorageContainerSyncHelper.java
- TestTriggerDBSyncEndpoint.java
- TestUnhealthyContainersDerbyPerformance.java
- TestReconContainerHealthSummaryEndToEnd.java