Skip to content

replace-pod / observe-image: envtest cache-lag coverage + NotFound→Terminal escalation #218

@bdchatham

Description

@bdchatham

Two non-blocking follow-ups from the final cross-review of #216 (replace-pod task). Both are correctness/observability hardening on the same pair of tasks.

1. envtest harness for cache-lag regression coverage

Background. #216 fixed a same-reconcile read-after-write race where replace-pod and observe-image read the StatefulSet through the cached client (mgr.GetClient()) immediately after apply-statefulset SSA-wrote it. The cache-lag could return pre-apply state and cause replace-pod to no-op (deadlocking the chain-upgrade rollout). The fix plumbed mgr.GetAPIReader() through ExecutionConfig and switched both task reads to APIReader.Get.

Gap. Unit tests use sigs.k8s.io/controller-runtime/pkg/client/fake, which is synchronous and has no informer cache. So the unit tests cannot distinguish KubeClient.Get from APIReader.Get — a future regression that flips one back to KubeClient would pass the suite.

Ask. Add an envtest harness that:

  • Spins up a real apiserver + informer cache (controller-runtime's envtest package).
  • Drives apply-statefulset followed immediately by replace-pod / observe-image in the same reconcile pass, before the watch event has had a chance to update the cache.
  • Asserts the second task observes the post-apply spec (i.e., cache-lag does not produce false-completion).

This is the only way to lock in the fix at the test layer.

2. Escalate STS NotFound to Terminal in replace-pod and observe-image

Current behavior. Both tasks treat apierrors.IsNotFound on the StatefulSet Get as a transient wait (return nil; caller re-runs). Rationale was that apply-statefulset precedes them in the plan, so a NotFound is just informer-cache lag and the next reconcile will see it.

Concern. With APIReader plumbed in (#216), the read is strongly consistent. A NotFound now genuinely means the StatefulSet is missing — not a cache artifact. The most plausible explanation is external deletion mid-rollout (operator action, GC, namespace teardown). In that case the plan will spin forever returning ExecutionRunning, with no signal to the user.

Ask. Switch both NotFound paths to Terminal(fmt.Errorf(\"statefulset %q missing during rollout (apply-statefulset preceded us); external deletion?\", name)). The planner already surfaces Terminal errors as plan failure with a Condition; this turns a silent stall into an actionable surface.

Out of scope. Init plans where the STS is created later in the same plan — this only affects NodeUpdate plans where apply-statefulset runs upstream of these two tasks.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions