You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
#258 landed the externalized-cert + immutable-spec model: spec.sidecar.tls is fixed at creation, and toggling TLS on an existing SeiNode requires delete + recreate with PVC retention (LLD §5). The deferral was deliberate at ~10 nodes — but the operational pain is real per-node, and the externalized-cert foundation #258 built actually makes in-place enable much simpler than the original PR #254 attempt was.
LLD §6 explicitly flagged this as "re-open as a separate design if the fleet scale or operational model changes." This issue is that reopen.
classifyPlan task-list-sniffing got brittle as new plan types accumulated.
Under #258's foundation, all three of those are dissolved:
Cert is operator-provisioned externally; controller doesn't create it.
No Issuer fields in spec; trust-anchor selection lives entirely outside the controller's blast radius.
The preflight already validates the cert; the plan can rely on SidecarTLSSecretReady=True as a precondition rather than reproducing validation logic inline.
The residual surface is a focused planner extension plus one transport-mode consistency fix.
Proposed approach
Relax CEL. Allow only the tls: nil → tls: {secretName: X} transition. Still block disable (set → nil) and swap (X → Y); those are the genuinely irreversible operations that should require delete + recreate. The current rule:
First branch flips from "require self also has no TLS" to "accept anything" — that's the enable case. Second branch unchanged.
Reintroduce Status.CurrentSidecarTLS mirror. Just {SecretName string} now — way simpler than PR feat(planner): TLS toggle on Running SeiNodes via NodeUpdate plan #254's {IssuerName, IssuerKind} struct. Source of truth for what TLS state the live pod is actually serving (vs. what spec wants).
Drift detector. Narrow trigger: Spec.Sidecar.TLS != nil && Status.CurrentSidecarTLS == nil && ConditionSidecarTLSSecretReady == True. Only fires when (a) operator added TLS post-creation, (b) preflight already validated the Secret. Disable / swap remain CEL-blocked so the detector never sees those cases.
Plan composition. Extend buildNodeUpdatePlan to compose ApplyRBACProxyConfig → ApplyStatefulSet → ApplyService → ReplacePod → ObserveSidecarTLS → MarkReady when TLS-enable drift fires. Image drift and TLS-enable drift co-compose in a single pod cycle (same pattern PR feat(planner): TLS toggle on Running SeiNodes via NodeUpdate plan #254 explored, but without the cert-creation pre-tasks).
Transport-mode source switch. This is the load-bearing fix that PR feat(planner): TLS toggle on Running SeiNodes via NodeUpdate plan #254 missed: SidecarURLForNode currently reads Spec.Sidecar.TLS to decide HTTP-vs-HTTPS. Mid-rollout, spec disagrees with the live pod, so controller→sidecar calls fail (controller tries HTTPS against an HTTP-only pod, or vice versa). Switch the function to read from Status.CurrentSidecarTLS so the controller's client uses whatever transport the rollout has actually landed.
classifyPlan label. Add node-update-tls-enable for plans triggered by TLS-enable drift only; node-update (image-only) and the co-drift case (node-update-tls-enable+image-update?) follow.
Tests. Drift-detector matrix, plan composition across (image-only / tls-enable-only / both), CEL allows-enable / blocks-disable / blocks-swap, transport-mode-source switch under each plan phase.
Estimated size: ~250–400 LOC.
Acceptance criteria
CEL relaxed; envtest confirms enable allowed, disable + swap rejected
Status.CurrentSidecarTLS field + DeepCopy regen
sidecarTLSEnableDrift detector in internal/planner/planner.go
Swap path (tls: A → tls: B). Issuer-rotation use case; same reasoning as disable. Operationally rare.
SND-level rolling enable orchestrated by the parent controller. Once per-SeiNode enable works, the SND can drive it by editing the template; this issue does not address SND-level orchestration.
Problem
#258 landed the externalized-cert + immutable-spec model:
spec.sidecar.tlsis fixed at creation, and toggling TLS on an existing SeiNode requires delete + recreate with PVC retention (LLD §5). The deferral was deliberate at ~10 nodes — but the operational pain is real per-node, and the externalized-cert foundation #258 built actually makes in-place enable much simpler than the original PR #254 attempt was.LLD §6 explicitly flagged this as "re-open as a separate design if the fleet scale or operational model changes." This issue is that reopen.
Why this is easier than PR #254 was
The original in-place toggle PR was rejected because:
Certificateresources → coupled cert lifecycle to plan lifecycle (drift detector + observer + wait task + cert-manager-Ready terminal logic).IssuerName/IssuerKindwere operator-spec-supplied → security gap (Sidecar TLS trust-anchor hardening: pin Issuer + pod-attestation observer #251).classifyPlantask-list-sniffing got brittle as new plan types accumulated.Under #258's foundation, all three of those are dissolved:
SidecarTLSSecretReady=Trueas a precondition rather than reproducing validation logic inline.The residual surface is a focused planner extension plus one transport-mode consistency fix.
Proposed approach
Relax CEL. Allow only the
tls: nil → tls: {secretName: X}transition. Still block disable (set → nil) and swap (X → Y); those are the genuinely irreversible operations that should require delete + recreate. The current rule:becomes:
First branch flips from "require self also has no TLS" to "accept anything" — that's the enable case. Second branch unchanged.
Reintroduce
Status.CurrentSidecarTLSmirror. Just{SecretName string}now — way simpler than PR feat(planner): TLS toggle on Running SeiNodes via NodeUpdate plan #254's{IssuerName, IssuerKind}struct. Source of truth for what TLS state the live pod is actually serving (vs. what spec wants).Drift detector. Narrow trigger:
Spec.Sidecar.TLS != nil && Status.CurrentSidecarTLS == nil && ConditionSidecarTLSSecretReady == True. Only fires when (a) operator added TLS post-creation, (b) preflight already validated the Secret. Disable / swap remain CEL-blocked so the detector never sees those cases.Plan composition. Extend
buildNodeUpdatePlanto composeApplyRBACProxyConfig → ApplyStatefulSet → ApplyService → ReplacePod → ObserveSidecarTLS → MarkReadywhen TLS-enable drift fires. Image drift and TLS-enable drift co-compose in a single pod cycle (same pattern PR feat(planner): TLS toggle on Running SeiNodes via NodeUpdate plan #254 explored, but without the cert-creation pre-tasks).ObserveSidecarTLStask. StampsStatus.CurrentSidecarTLS = {SecretName: Spec.Sidecar.TLS.SecretName}after rollout convergence. MirrorsObserveImage.Transport-mode source switch. This is the load-bearing fix that PR feat(planner): TLS toggle on Running SeiNodes via NodeUpdate plan #254 missed:
SidecarURLForNodecurrently readsSpec.Sidecar.TLSto decide HTTP-vs-HTTPS. Mid-rollout, spec disagrees with the live pod, so controller→sidecar calls fail (controller tries HTTPS against an HTTP-only pod, or vice versa). Switch the function to read fromStatus.CurrentSidecarTLSso the controller's client uses whatever transport the rollout has actually landed.classifyPlanlabel. Addnode-update-tls-enablefor plans triggered by TLS-enable drift only;node-update(image-only) and the co-drift case (node-update-tls-enable+image-update?) follow.Tests. Drift-detector matrix, plan composition across (image-only / tls-enable-only / both), CEL allows-enable / blocks-disable / blocks-swap, transport-mode-source switch under each plan phase.
Estimated size: ~250–400 LOC.
Acceptance criteria
Status.CurrentSidecarTLSfield + DeepCopy regensidecarTLSEnableDriftdetector ininternal/planner/planner.goObserveSidecarTLStask + registry entry + paramsForTaskType entrybuildNodeUpdatePlancomposes pre-tasks (RBAC ConfigMap) and observer when TLS drift presentSidecarURLForNodereads from status, not spec — verified by a test that exercises the mid-rollout stateclassifyPlanemits the new label(s)spec.sidecar.tls.secretNameto a Running SeiNode with a pre-provisioned Secret; pod cycles; serves on:8443Out of scope (deferred)
tls: set → nil). Still requires delete + recreate. Same as Sidecar TLS disable path on Running SeiNodes #250 was — reopen if it becomes load-bearing.tls: A → tls: B). Issuer-rotation use case; same reasoning as disable. Operationally rare.References