Skip to content

Enable TLS on existing Running SeiNodes via in-place plan #262

@bdchatham

Description

@bdchatham

Problem

#258 landed the externalized-cert + immutable-spec model: spec.sidecar.tls is fixed at creation, and toggling TLS on an existing SeiNode requires delete + recreate with PVC retention (LLD §5). The deferral was deliberate at ~10 nodes — but the operational pain is real per-node, and the externalized-cert foundation #258 built actually makes in-place enable much simpler than the original PR #254 attempt was.

LLD §6 explicitly flagged this as "re-open as a separate design if the fleet scale or operational model changes." This issue is that reopen.

Why this is easier than PR #254 was

The original in-place toggle PR was rejected because:

  1. Controller was minting cert-manager Certificate resources → coupled cert lifecycle to plan lifecycle (drift detector + observer + wait task + cert-manager-Ready terminal logic).
  2. IssuerName/IssuerKind were operator-spec-supplied → security gap (Sidecar TLS trust-anchor hardening: pin Issuer + pod-attestation observer #251).
  3. classifyPlan task-list-sniffing got brittle as new plan types accumulated.

Under #258's foundation, all three of those are dissolved:

  • Cert is operator-provisioned externally; controller doesn't create it.
  • No Issuer fields in spec; trust-anchor selection lives entirely outside the controller's blast radius.
  • The preflight already validates the cert; the plan can rely on SidecarTLSSecretReady=True as a precondition rather than reproducing validation logic inline.

The residual surface is a focused planner extension plus one transport-mode consistency fix.

Proposed approach

  1. Relax CEL. Allow only the tls: nil → tls: {secretName: X} transition. Still block disable (set → nil) and swap (X → Y); those are the genuinely irreversible operations that should require delete + recreate. The current rule:

    (!has(oldSelf.sidecar) || !has(oldSelf.sidecar.tls)) ? (!has(self.sidecar) || !has(self.sidecar.tls)) : (...equality...)
    

    becomes:

    (!has(oldSelf.sidecar) || !has(oldSelf.sidecar.tls)) ? true : (has(self.sidecar) && has(self.sidecar.tls) && self.sidecar.tls == oldSelf.sidecar.tls)
    

    First branch flips from "require self also has no TLS" to "accept anything" — that's the enable case. Second branch unchanged.

  2. Reintroduce Status.CurrentSidecarTLS mirror. Just {SecretName string} now — way simpler than PR feat(planner): TLS toggle on Running SeiNodes via NodeUpdate plan #254's {IssuerName, IssuerKind} struct. Source of truth for what TLS state the live pod is actually serving (vs. what spec wants).

  3. Drift detector. Narrow trigger: Spec.Sidecar.TLS != nil && Status.CurrentSidecarTLS == nil && ConditionSidecarTLSSecretReady == True. Only fires when (a) operator added TLS post-creation, (b) preflight already validated the Secret. Disable / swap remain CEL-blocked so the detector never sees those cases.

  4. Plan composition. Extend buildNodeUpdatePlan to compose ApplyRBACProxyConfig → ApplyStatefulSet → ApplyService → ReplacePod → ObserveSidecarTLS → MarkReady when TLS-enable drift fires. Image drift and TLS-enable drift co-compose in a single pod cycle (same pattern PR feat(planner): TLS toggle on Running SeiNodes via NodeUpdate plan #254 explored, but without the cert-creation pre-tasks).

  5. ObserveSidecarTLS task. Stamps Status.CurrentSidecarTLS = {SecretName: Spec.Sidecar.TLS.SecretName} after rollout convergence. Mirrors ObserveImage.

  6. Transport-mode source switch. This is the load-bearing fix that PR feat(planner): TLS toggle on Running SeiNodes via NodeUpdate plan #254 missed: SidecarURLForNode currently reads Spec.Sidecar.TLS to decide HTTP-vs-HTTPS. Mid-rollout, spec disagrees with the live pod, so controller→sidecar calls fail (controller tries HTTPS against an HTTP-only pod, or vice versa). Switch the function to read from Status.CurrentSidecarTLS so the controller's client uses whatever transport the rollout has actually landed.

  7. classifyPlan label. Add node-update-tls-enable for plans triggered by TLS-enable drift only; node-update (image-only) and the co-drift case (node-update-tls-enable+image-update?) follow.

  8. Tests. Drift-detector matrix, plan composition across (image-only / tls-enable-only / both), CEL allows-enable / blocks-disable / blocks-swap, transport-mode-source switch under each plan phase.

Estimated size: ~250–400 LOC.

Acceptance criteria

  • CEL relaxed; envtest confirms enable allowed, disable + swap rejected
  • Status.CurrentSidecarTLS field + DeepCopy regen
  • sidecarTLSEnableDrift detector in internal/planner/planner.go
  • ObserveSidecarTLS task + registry entry + paramsForTaskType entry
  • buildNodeUpdatePlan composes pre-tasks (RBAC ConfigMap) and observer when TLS drift present
  • SidecarURLForNode reads from status, not spec — verified by a test that exercises the mid-rollout state
  • classifyPlan emits the new label(s)
  • Unit tests for drift matrix + plan composition
  • envtest for CEL transitions
  • End-to-end on harbor: add spec.sidecar.tls.secretName to a Running SeiNode with a pre-provisioned Secret; pod cycles; serves on :8443

Out of scope (deferred)

  • Disable path (tls: set → nil). Still requires delete + recreate. Same as Sidecar TLS disable path on Running SeiNodes #250 was — reopen if it becomes load-bearing.
  • Swap path (tls: A → tls: B). Issuer-rotation use case; same reasoning as disable. Operationally rare.
  • SND-level rolling enable orchestrated by the parent controller. Once per-SeiNode enable works, the SND can drive it by editing the template; this issue does not address SND-level orchestration.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions