Problem
#247 adds a terminal-failure path on Certificate.Ready=False/Reason=Failed, which catches the most common stuck case (bad IssuerName). Two gaps remain:
- No wall-clock deadline. Other terminal cert-manager states (e.g., issuer CA crashed mid-issuance, network partition to a remote ACME server) leave Certificate.Ready=False with reasons other than Failed. In those cases the task polls forever.
- Transient diagnostic invisibility. While Certificate.Ready=False/Reason=Issuing is in progress, operators see only condition=NodeUpdateInProgress, reason=TLSToggleStarted, with no insight into why issuance is slow. cert-manager's message field on the Ready condition would help.
Proposed scope
- Add a per-task deadline (configurable; default ~10min for cert-manager issuance) and return Terminal if Secret.tls.crt is still empty past the deadline
- Surface Certificate.Ready.message into the task error (non-Terminal) on each poll so it lands in status.plan.tasks[i].error and operators can grep for it via kubectl describe seinode
Why deferred from #247
The Terminal-on-Failed fix lands the primary blocker. The deadline and diagnostic surfacing are defense-in-depth and a UX improvement, not a correctness gap on the rollout path.
References
internal/task/wait_for_sidecar_tls_secret.go