fix(controllers): replace a persistently crash-looping member#336
fix(controllers): replace a persistently crash-looping member#336Andrey Kolkov (androndo) wants to merge 1 commit into
Conversation
A non-bootstrap PVC member whose etcd cannot start — classically because its data dir was lost while the cluster membership moved on, leaving its frozen --initial-cluster stale (etcd refuses to boot: "error validating peerURLs ... member count is unequal") — crash-loops forever with no recovery path. The existing self-heal covers only memory-medium members (pod lost => data lost); for PVC members the PVC survives pod restarts, so "pod lost" != "data lost" and the stale --initial-cluster is never escaped. Detect such a member (etcd container not ready and restarted past a threshold, excluding OOMKilled) and delete it for replacement — but only when the rest of the cluster still has quorum, so a cluster-wide outage never cascades into mass deletion (the finalizer's MemberRemove is quorum-gated too). The cluster controller then gap-fills a fresh member with a current --initial-cluster. Also extend the Kamaji e2e to wait for readyMembers==3, then churn all three members one at a time and re-verify the tenant API still roundtrips through the fully replaced (hash-named) member set — guarding that member naming/replacement stays transparent to a Kamaji DataStore (stable <cluster>-client Service + wildcard server-cert SAN). Signed-off-by: Andrey Kolkov <androndo@gmail.com>
📝 WalkthroughWalkthroughAdds self-healing for etcd members stuck in persistent crash loops. The controller gains ChangesCrash-loop self-heal for EtcdMember
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces self-healing capabilities for unrecoverable etcd members by automatically deleting and replacing persistently crash-looping members, provided the rest of the cluster maintains quorum. It includes helper functions to detect stuck containers and verify quorum, along with comprehensive unit and E2E tests. Feedback on the changes suggests checking the Pod's deletion timestamp and current container state to avoid false-positive self-healing triggers during graceful termination or active OOMKilled events. Additionally, it is recommended to account for stale status by decrementing the ready member count if the member under evaluation is still marked as ready, preventing potential race conditions.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| func etcdContainerStuck(pod *corev1.Pod) bool { | ||
| for _, cs := range pod.Status.ContainerStatuses { | ||
| if cs.Name != "etcd" { | ||
| continue | ||
| } | ||
| if cs.Ready || cs.RestartCount < dataLossRestartThreshold { | ||
| return false | ||
| } | ||
| if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" { | ||
| return false | ||
| } | ||
| return true | ||
| } | ||
| return false | ||
| } |
There was a problem hiding this comment.
Issue 1: Missing check for Pod deletion state
If a Pod is currently being deleted (e.g., due to a manual restart, node drain, or eviction), its DeletionTimestamp is set, and its containers will eventually terminate. If we evaluate it as "stuck" during this graceful termination window, we might prematurely delete the EtcdMember CR and trigger an expensive member replacement instead of letting the Pod naturally restart or reschedule.
Issue 2: Missing check for current OOMKilled state
The current implementation only checks cs.LastTerminationState.Terminated for the "OOMKilled" reason. However, if the container has just been OOMKilled and is currently in the Terminated state (but has not yet transitioned to Waiting / CrashLoopBackOff), the "OOMKilled" reason will be in cs.State.Terminated instead. Checking both states ensures we don't falsely trigger self-healing for resource-limit issues.
func etcdContainerStuck(pod *corev1.Pod) bool {
if pod.DeletionTimestamp != nil {
return false
}
for _, cs := range pod.Status.ContainerStatuses {
if cs.Name != "etcd" {
continue
}
if cs.Ready || cs.RestartCount < dataLossRestartThreshold {
return false
}
if t := cs.State.Terminated; t != nil && t.Reason == "OOMKilled" {
return false
}
if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
return false
}
return true
}
return false
}| func (r *EtcdMemberReconciler) clusterHasQuorumWithout(ctx context.Context, member *lll.EtcdMember) bool { | ||
| cluster, err := r.clusterFor(ctx, member) | ||
| if err != nil { | ||
| return false | ||
| } | ||
| desired := 0 | ||
| if cluster.Status.Observed != nil { | ||
| desired = int(cluster.Status.Observed.Replicas) | ||
| } | ||
| if desired == 0 && cluster.Spec.Replicas != nil { | ||
| desired = int(*cluster.Spec.Replicas) | ||
| } | ||
| if desired == 0 { | ||
| return false | ||
| } | ||
| return int(cluster.Status.ReadyMembers) >= desired/2+1 | ||
| } |
There was a problem hiding this comment.
Issue: Potential race condition with stale ReadyMembers count
cluster.Status.ReadyMembers is updated asynchronously by the cluster controller. If this member was previously healthy (and thus counted as ready in ReadyMembers), but has recently started crash-looping, cluster.Status.ReadyMembers might still include it if the cluster controller hasn't reconciled yet.
If another member is also down, ReadyMembers might incorrectly indicate that we still have quorum to perform a deletion, when in reality we do not. To prevent cascading outages during multi-member failures, we should decrement the ready count if this member's CR status still lists it as Ready.
func (r *EtcdMemberReconciler) clusterHasQuorumWithout(ctx context.Context, member *lll.EtcdMember) bool {
cluster, err := r.clusterFor(ctx, member)
if err != nil {
return false
}
desired := 0
if cluster.Status.Observed != nil {
desired = int(cluster.Status.Observed.Replicas)
}
if desired == 0 && cluster.Spec.Replicas != nil {
desired = int(*cluster.Spec.Replicas)
}
if desired == 0 {
return false
}
readyMembers := int(cluster.Status.ReadyMembers)
for _, cond := range member.Status.Conditions {
if cond.Type == lll.MemberReady && cond.Status == metav1.ConditionTrue {
readyMembers--
break
}
}
return readyMembers >= desired/2+1
}There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/kamaji_datastore_test.go`:
- Around line 181-183: The memberNames helper function calls t.Fatalf internally
when encountering List errors, which causes the test to fail immediately when
invoked inside the waitFor callback, bypassing the retry logic. To fix this,
create a variant of memberNames that returns an error value instead of calling
t.Fatalf, or inline the list call directly within the waitFor callback body to
handle errors gracefully and allow waitFor to retry on transient API failures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 76cd05b0-e161-44f0-8a10-4521832db9dd
📒 Files selected for processing (3)
controllers/etcdmember_controller.gocontrollers/etcdmember_controller_test.gotest/e2e/kamaji_datastore_test.go
| if names := memberNames(ctx, t); len(names) != 3 { | ||
| return fmt.Errorf("have %d members, want 3: %v", len(names), names) | ||
| } |
There was a problem hiding this comment.
memberNames calls t.Fatalf inside waitFor callback, bypassing retry logic.
The memberNames helper (line 406) calls t.Fatalf on List errors. When invoked inside the waitFor callback, a transient API error will immediately fail the test instead of allowing waitFor to retry. Consider using a local error-returning variant or inline the list call with error return.
Proposed inline fix
- if names := memberNames(ctx, t); len(names) != 3 {
- return fmt.Errorf("have %d members, want 3: %v", len(names), names)
- }
+ list := &etcdv1alpha2.EtcdMemberList{}
+ if err := kube.List(ctx, list, client.InNamespace(e2eNamespace),
+ client.MatchingLabels{"etcd-operator.cozystack.io/cluster": clusterName}); err != nil {
+ return err
+ }
+ if len(list.Items) != 3 {
+ return fmt.Errorf("have %d members, want 3", len(list.Items))
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if names := memberNames(ctx, t); len(names) != 3 { | |
| return fmt.Errorf("have %d members, want 3: %v", len(names), names) | |
| } | |
| list := &etcdv1alpha2.EtcdMemberList{} | |
| if err := kube.List(ctx, list, client.InNamespace(e2eNamespace), | |
| client.MatchingLabels{"etcd-operator.cozystack.io/cluster": clusterName}); err != nil { | |
| return err | |
| } | |
| if len(list.Items) != 3 { | |
| return fmt.Errorf("have %d members, want 3", len(list.Items)) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/kamaji_datastore_test.go` around lines 181 - 183, The memberNames
helper function calls t.Fatalf internally when encountering List errors, which
causes the test to fail immediately when invoked inside the waitFor callback,
bypassing the retry logic. To fix this, create a variant of memberNames that
returns an error value instead of calling t.Fatalf, or inline the list call
directly within the waitFor callback body to handle errors gracefully and allow
waitFor to retry on transient API failures.
A non-bootstrap PVC member whose etcd cannot start — classically because its data dir was lost while the cluster membership moved on, leaving its frozen --initial-cluster stale (etcd refuses to boot: "error validating peerURLs ... member count is unequal") — crash-loops forever with no recovery path. The existing self-heal covers only memory-medium members (pod lost => data lost); for PVC members the PVC survives pod restarts, so "pod lost" != "data lost" and the stale --initial-cluster is never escaped.
Detect such a member (etcd container not ready and restarted past a threshold, excluding OOMKilled) and delete it for replacement — but only when the rest of the cluster still has quorum, so a cluster-wide outage never cascades into mass deletion (the finalizer's MemberRemove is quorum-gated too). The cluster controller then gap-fills a fresh member with a current --initial-cluster.
Also extend the Kamaji e2e to wait for readyMembers==3, then churn all three members one at a time and re-verify the tenant API still roundtrips through the fully replaced (hash-named) member set — guarding that member naming/replacement stays transparent to a Kamaji DataStore (stable -client Service + wildcard server-cert SAN).
Summary by CodeRabbit