fix(controllers): replace a persistently crash-looping member by androndo · Pull Request #336 · cozystack/etcd-operator

Andrey Kolkov (androndo) · 2026-06-15T16:21:53Z

A non-bootstrap PVC member whose etcd cannot start — classically because its data dir was lost while the cluster membership moved on, leaving its frozen --initial-cluster stale (etcd refuses to boot: "error validating peerURLs ... member count is unequal") — crash-loops forever with no recovery path. The existing self-heal covers only memory-medium members (pod lost => data lost); for PVC members the PVC survives pod restarts, so "pod lost" != "data lost" and the stale --initial-cluster is never escaped.

Detect such a member (etcd container not ready and restarted past a threshold, excluding OOMKilled) and delete it for replacement — but only when the rest of the cluster still has quorum, so a cluster-wide outage never cascades into mass deletion (the finalizer's MemberRemove is quorum-gated too). The cluster controller then gap-fills a fresh member with a current --initial-cluster.

Also extend the Kamaji e2e to wait for readyMembers==3, then churn all three members one at a time and re-verify the tenant API still roundtrips through the fully replaced (hash-named) member set — guarding that member naming/replacement stays transparent to a Kamaji DataStore (stable -client Service + wildcard server-cert SAN).

Summary by CodeRabbit

Bug Fixes
- Implemented automatic detection and recovery for persistently crashing etcd members, automatically replacing them when the cluster can safely maintain quorum.

A non-bootstrap PVC member whose etcd cannot start — classically because its data dir was lost while the cluster membership moved on, leaving its frozen --initial-cluster stale (etcd refuses to boot: "error validating peerURLs ... member count is unequal") — crash-loops forever with no recovery path. The existing self-heal covers only memory-medium members (pod lost => data lost); for PVC members the PVC survives pod restarts, so "pod lost" != "data lost" and the stale --initial-cluster is never escaped. Detect such a member (etcd container not ready and restarted past a threshold, excluding OOMKilled) and delete it for replacement — but only when the rest of the cluster still has quorum, so a cluster-wide outage never cascades into mass deletion (the finalizer's MemberRemove is quorum-gated too). The cluster controller then gap-fills a fresh member with a current --initial-cluster. Also extend the Kamaji e2e to wait for readyMembers==3, then churn all three members one at a time and re-verify the tenant API still roundtrips through the fully replaced (hash-named) member set — guarding that member naming/replacement stays transparent to a Kamaji DataStore (stable <cluster>-client Service + wildcard server-cert SAN). Signed-off-by: Andrey Kolkov <androndo@gmail.com>

coderabbitai · 2026-06-15T16:22:06Z

📝 Walkthrough

Walkthrough

Adds self-healing for etcd members stuck in persistent crash loops. The controller gains etcdContainerStuck and clusterHasQuorumWithout helpers; updateStatus now deletes non-bootstrap EtcdMember CRs when the etcd container is persistently failing and the remaining cluster retains quorum. Unit tests and an e2e member-churn scenario are included.

Changes

Crash-loop self-heal for EtcdMember

Layer / File(s)	Summary
Self-heal helpers and updateStatus integration `controllers/etcdmember_controller.go`	Introduces `dataLossRestartThreshold`, `etcdContainerStuck` (detects stuck non-OOMKilled etcd container), and `clusterHasQuorumWithout` (quorum check excluding the current member). `updateStatus` deletes a non-bootstrap PVC-backed EtcdMember when both conditions hold.
Unit tests for stuck detection and updateStatus `controllers/etcdmember_controller_test.go`	Adds `TestEtcdContainerStuck` for threshold/OOMKilled cases, `crashLoopPod`/`clusterWithReady` test helpers, and three `updateStatus` scenario tests: replace-on-quorum, preserve-without-quorum, preserve-bootstrap.
E2E member churn test `test/e2e/kamaji_datastore_test.go`	Extends `TestKamajiDataStore` with a sequential member-deletion churn phase, waiting for each member to be gone and the cluster to recover to 3 ready members, then asserting no original names remain and tenant etcd writes survive. Adds `memberNames` helper.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 A bunny watched the etcd loop,
Crash, restart, crash — oh what a droop!
"Enough!" cried the controller with a hop,
"Quorum's safe, so let this member drop!"
New pod springs up, fresh from the ground,
And all the cluster data is found. 🌿

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely describes the main objective of the PR—introducing self-healing logic to replace persistently crash-looping etcd members.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/fail-pvc-on-member

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces self-healing capabilities for unrecoverable etcd members by automatically deleting and replacing persistently crash-looping members, provided the rest of the cluster maintains quorum. It includes helper functions to detect stuck containers and verify quorum, along with comprehensive unit and E2E tests. Feedback on the changes suggests checking the Pod's deletion timestamp and current container state to avoid false-positive self-healing triggers during graceful termination or active OOMKilled events. Additionally, it is recommended to account for stale status by decrementing the ready member count if the member under evaluation is still marked as ready, preventing potential race conditions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-15T16:23:41Z

+func etcdContainerStuck(pod *corev1.Pod) bool {
+	for _, cs := range pod.Status.ContainerStatuses {
+		if cs.Name != "etcd" {
+			continue
+		}
+		if cs.Ready || cs.RestartCount < dataLossRestartThreshold {
+			return false
+		}
+		if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
+			return false
+		}
+		return true
+	}
+	return false
+}


Issue 1: Missing check for Pod deletion state

If a Pod is currently being deleted (e.g., due to a manual restart, node drain, or eviction), its DeletionTimestamp is set, and its containers will eventually terminate. If we evaluate it as "stuck" during this graceful termination window, we might prematurely delete the EtcdMember CR and trigger an expensive member replacement instead of letting the Pod naturally restart or reschedule.

Issue 2: Missing check for current OOMKilled state

The current implementation only checks cs.LastTerminationState.Terminated for the "OOMKilled" reason. However, if the container has just been OOMKilled and is currently in the Terminated state (but has not yet transitioned to Waiting / CrashLoopBackOff), the "OOMKilled" reason will be in cs.State.Terminated instead. Checking both states ensures we don't falsely trigger self-healing for resource-limit issues.

func etcdContainerStuck(pod *corev1.Pod) bool { if pod.DeletionTimestamp != nil { return false } for _, cs := range pod.Status.ContainerStatuses { if cs.Name != "etcd" { continue } if cs.Ready || cs.RestartCount < dataLossRestartThreshold { return false } if t := cs.State.Terminated; t != nil && t.Reason == "OOMKilled" { return false } if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" { return false } return true } return false }

gemini-code-assist · 2026-06-15T16:23:41Z

+func (r *EtcdMemberReconciler) clusterHasQuorumWithout(ctx context.Context, member *lll.EtcdMember) bool {
+	cluster, err := r.clusterFor(ctx, member)
+	if err != nil {
+		return false
+	}
+	desired := 0
+	if cluster.Status.Observed != nil {
+		desired = int(cluster.Status.Observed.Replicas)
+	}
+	if desired == 0 && cluster.Spec.Replicas != nil {
+		desired = int(*cluster.Spec.Replicas)
+	}
+	if desired == 0 {
+		return false
+	}
+	return int(cluster.Status.ReadyMembers) >= desired/2+1
+}


Issue: Potential race condition with stale ReadyMembers count

cluster.Status.ReadyMembers is updated asynchronously by the cluster controller. If this member was previously healthy (and thus counted as ready in ReadyMembers), but has recently started crash-looping, cluster.Status.ReadyMembers might still include it if the cluster controller hasn't reconciled yet.

If another member is also down, ReadyMembers might incorrectly indicate that we still have quorum to perform a deletion, when in reality we do not. To prevent cascading outages during multi-member failures, we should decrement the ready count if this member's CR status still lists it as Ready.

func (r *EtcdMemberReconciler) clusterHasQuorumWithout(ctx context.Context, member *lll.EtcdMember) bool { cluster, err := r.clusterFor(ctx, member) if err != nil { return false } desired := 0 if cluster.Status.Observed != nil { desired = int(cluster.Status.Observed.Replicas) } if desired == 0 && cluster.Spec.Replicas != nil { desired = int(*cluster.Spec.Replicas) } if desired == 0 { return false } readyMembers := int(cluster.Status.ReadyMembers) for _, cond := range member.Status.Conditions { if cond.Type == lll.MemberReady && cond.Status == metav1.ConditionTrue { readyMembers-- break } } return readyMembers >= desired/2+1 }

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/kamaji_datastore_test.go`:
- Around line 181-183: The memberNames helper function calls t.Fatalf internally
when encountering List errors, which causes the test to fail immediately when
invoked inside the waitFor callback, bypassing the retry logic. To fix this,
create a variant of memberNames that returns an error value instead of calling
t.Fatalf, or inline the list call directly within the waitFor callback body to
handle errors gracefully and allow waitFor to retry on transient API failures.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 76cd05b0-e161-44f0-8a10-4521832db9dd

📥 Commits

Reviewing files that changed from the base of the PR and between 4248ade and b54a52f.

📒 Files selected for processing (3)

controllers/etcdmember_controller.go
controllers/etcdmember_controller_test.go
test/e2e/kamaji_datastore_test.go

coderabbitai · 2026-06-15T16:26:58Z

+				if names := memberNames(ctx, t); len(names) != 3 {
+					return fmt.Errorf("have %d members, want 3: %v", len(names), names)
+				}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

memberNames calls t.Fatalf inside waitFor callback, bypassing retry logic.

The memberNames helper (line 406) calls t.Fatalf on List errors. When invoked inside the waitFor callback, a transient API error will immediately fail the test instead of allowing waitFor to retry. Consider using a local error-returning variant or inline the list call with error return.

Proposed inline fix

- if names := memberNames(ctx, t); len(names) != 3 { - return fmt.Errorf("have %d members, want 3: %v", len(names), names) - } + list := &etcdv1alpha2.EtcdMemberList{} + if err := kube.List(ctx, list, client.InNamespace(e2eNamespace), + client.MatchingLabels{"etcd-operator.cozystack.io/cluster": clusterName}); err != nil { + return err + } + if len(list.Items) != 3 { + return fmt.Errorf("have %d members, want 3", len(list.Items)) + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if names := memberNames(ctx, t); len(names) != 3 {

return fmt.Errorf("have %d members, want 3: %v", len(names), names)

}

list := &etcdv1alpha2.EtcdMemberList{}

if err := kube.List(ctx, list, client.InNamespace(e2eNamespace),

client.MatchingLabels{"etcd-operator.cozystack.io/cluster": clusterName}); err != nil {

return err

}

if len(list.Items) != 3 {

return fmt.Errorf("have %d members, want 3", len(list.Items))

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/kamaji_datastore_test.go` around lines 181 - 183, The memberNames helper function calls t.Fatalf internally when encountering List errors, which causes the test to fail immediately when invoked inside the waitFor callback, bypassing the retry logic. To fix this, create a variant of memberNames that returns an error value instead of calling t.Fatalf, or inline the list call directly within the waitFor callback body to handle errors gracefully and allow waitFor to retry on transient API failures.

github-actions Bot added bugfix controllers labels Jun 15, 2026

gemini-code-assist Bot reviewed Jun 15, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(controllers): replace a persistently crash-looping member#336

fix(controllers): replace a persistently crash-looping member#336
Andrey Kolkov (androndo) wants to merge 1 commit into
mainfrom
fix/fail-pvc-on-member

Andrey Kolkov (androndo) commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-				if names := memberNames(ctx, t); len(names) != 3 {
-					return fmt.Errorf("have %d members, want 3: %v", len(names), names)
-				}
+			list := &etcdv1alpha2.EtcdMemberList{}
+			if err := kube.List(ctx, list, client.InNamespace(e2eNamespace),
+				client.MatchingLabels{"etcd-operator.cozystack.io/cluster": clusterName}); err != nil {
+				return err
+			}
+			if len(list.Items) != 3 {
+				return fmt.Errorf("have %d members, want 3", len(list.Items))
+			}

Conversation

Andrey Kolkov (androndo) commented Jun 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Choose a reason for hiding this comment

Issue 1: Missing check for Pod deletion state

Issue 2: Missing check for current OOMKilled state

Uh oh!

gemini-code-assist Bot Jun 15, 2026

Choose a reason for hiding this comment

Issue: Potential race condition with stale ReadyMembers count

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Andrey Kolkov (androndo) commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

Issue: Potential race condition with stale `ReadyMembers` count