Problem
Surfaced by security + Kubernetes cross-review on #258 (round 3). LLD §4 documents the intended behavior: if the operator-provisioned TLS Secret is deleted (or its SANs drift to a mismatch), kube-rbac-proxy keeps serving on its currently-bound cert until the pod cycles for unrelated reasons (node drain, OOM, etc.). The plan-creation gate prevents the controller from cycling the pod itself, so the unhealthy state can persist unnoticed.
The controller publishes `SidecarTLSSecretReady=False`, so `kubectl describe seinode` reveals the problem, but there is no Prometheus/Alertmanager-level signal. Operators who are not actively inspecting per-SeiNode status would miss it.
Proposed scope
- Emit a controller metric `seinode_sidecar_tls_secret_ready{namespace, name}` with the condition status as a gauge value.
- Define an alert rule: `seinode_sidecar_tls_secret_ready == 0 for 5m` → page.
- Document the expected operator response in the runbook.
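A minimal sketch of how the condition status could map onto the proposed gauge, using only the Go standard library. The type and function names (`ConditionStatus`, `readyGaugeValue`, `exposeLine`) are hypothetical and not taken from the controller, and treating `Unknown` as 0 is an assumption so that the `== 0` alert also covers an undetermined state. A real implementation would more likely register a gauge vector with a Prometheus client library than render the exposition line by hand.

```go
package main

import (
	"fmt"
	"strings"
)

// ConditionStatus mirrors the three values a Kubernetes condition can take.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// readyGaugeValue maps the SidecarTLSSecretReady condition to a gauge value.
// Assumption: only True counts as ready (1); False and Unknown both map to 0.
func readyGaugeValue(s ConditionStatus) float64 {
	if s == ConditionTrue {
		return 1
	}
	return 0
}

// escapeLabel escapes a label value for the Prometheus text exposition
// format: backslash, double quote, and newline.
func escapeLabel(v string) string {
	return strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`).Replace(v)
}

// exposeLine renders one sample of the proposed metric for a single SeiNode,
// labeled by namespace and name as in the proposed scope.
func exposeLine(namespace, name string, s ConditionStatus) string {
	return fmt.Sprintf(
		`seinode_sidecar_tls_secret_ready{namespace="%s",name="%s"} %g`,
		escapeLabel(namespace), escapeLabel(name), readyGaugeValue(s),
	)
}
```

Under these assumptions, a SeiNode whose condition is `False` would expose a sample with value 0, which is what the proposed `== 0 for 5m` rule keys on.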
Why deferred from #258
Observability tooling lives in the platform repo, not the controller; the controller's job here is to publish the signal. The metric + alert rule are platform follow-up work.
References