Skip to content

feat(seinode): sidecar TLS via externally-provisioned Secret#258

Open
bdchatham wants to merge 2 commits into
mainfrom
feat/seinode-sidecar-tls-externalized-secret
Open

feat(seinode): sidecar TLS via externally-provisioned Secret#258
bdchatham wants to merge 2 commits into
mainfrom
feat/seinode-sidecar-tls-externalized-secret

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Summary

  • Implements Sidecar TLS via externally-provisioned Secret #255 per the merged LLD at docs/design-seinode-sidecar-tls-toggle-lld.md.
  • Externalizes cert ownership: operator/platform tooling provisions the kubernetes.io/tls Secret; the controller validates SANs in a steady-state preflight branch and gates init-plan creation on ConditionSidecarTLSSecretReady. Mirrors SigningKeyReady / NodeKeyReady / OperatorKeyringReady / imported-PVC patterns already in the controller.
  • spec.sidecar.tls is immutable post-creation via CRD CEL. Toggle = delete + recreate.

Closes #255.

Design at a glance

Surface Before After
SidecarTLSSpec {IssuerName, IssuerKind} {SecretName}
Cert provisioning Controller mints cert-manager.io/v1 Certificate via ApplySidecarCert plan task Operator provisions kubernetes.io/tls Secret externally
Cert validation None (cert-manager Ready was the gate) Controller preflight: type, non-empty tls.crt/tls.key, x509.ParseCertificate, SAN superset check
Status Nothing status.sidecarTLS.{secretName, requiredDNSNames} published as machine-readable contract
Condition None ConditionSidecarTLSSecretReady with TLSSecretReady / TLSSecretNotFound / TLSSecretMalformed / TLSSecretSANsMismatch
In-place toggle Silently ignored on Running CEL rejects mutation; toggle requires delete + recreate

What's deleted

  • internal/task/apply_sidecar_cert.go (task + deserializer registry entry)
  • GenerateSidecarCertificate from internal/noderesource/noderesource.go
  • SidecarTLSSpec.IssuerName / IssuerKind fields

Operator workflow

  1. Pre-compute the required DNS names from the target spec: {name}.{ns}.svc.cluster.local + {name}-0.{name}.{ns}.svc.cluster.local. (Also published in status.sidecarTLS.requiredDNSNames after creation for verification.)
  2. Provision a kubernetes.io/tls Secret with those SANs (e.g., via a cert-manager Certificate in GitOps).
  3. Create the SeiNode with spec.sidecar.tls.secretName: <name>.
  4. SeiNode stays Pending with SidecarTLSSecretReady=False until the Secret resolves; transitions to Running once SANs validate.

To toggle TLS on existing nodes: delete + recreate with PVC retained via spec.dataVolume.import.pvcName. At ~10 nodes this is manual; documented in LLD §5.

Cross-review

platform-engineer ran against the diff. Verdict: Blockers → fixed; Ready with non-blocker fixups applied.

Blocker (fixed in commit 2):

  • CEL immutability rule was too permissive — the (!has(oldSelf.sidecar.tls)) ? true : ... short-circuit allowed enabling TLS on an existing Running SeiNode via spec mutation, which would publish status.sidecarTLS and gate init plans but the planner's phase != Running bypass meant buildRunningPlan only handled image drift — StatefulSet pod template (no proxy container, no Secret mount) never re-applied. Status would lie about transport mode. Tightened the rule: TLS state at creation is final.

Non-blocker fixups (commit 2):

  • Reason string values prefixed (ReadyTLSSecretReady, etc.) to match the existing SigningKey* / NodeKey* / OperatorKeyring* family for SRE grep consistency.
  • Transient-error message in validateTLSSecret prefixed so operators who find the Secret present can disambiguate from genuine NotFound.
  • Doc-comment on SidecarTLSStatus.RequiredDNSNames clarifies the names are pre-computable; published for verification, not as a prerequisite handshake.
  • Doc-comment on SidecarTLSSecretName notes callers should gate on SidecarTLSEnabled.

Deferred (non-blocker, tracked as future work or LLD-aligned intent):

  • No event/log breadcrumb on the planner gate. The SidecarTLSSecretReady=False condition is the breadcrumb; operators see it via kubectl describe. If kubectl get consumers complain, add an event in a follow-up.
  • Running-phase image-drift rollout doesn't re-gate on SidecarTLSSecretReady. LLD §4 explicitly allows this — TLS readiness is observability on Running nodes, not enforcement. Pod cycles on image drift even if the cert went bad. Operators should check the condition before rolling.
  • Cert/key cryptographic pairing not validated by preflight. Intentional per LLD §2 — kube-rbac-proxy detects mismatch at bind time; we validate cert SHAPE only.

Test plan

  • make test — full suite green
  • golangci-lint run --new-from-rev=origin/main ./... — 0 issues
  • make manifests generate — diff clean, CEL rule reflects the tightened immutability
  • Unit: validateTLSSecret matrix (Ready, NotFound, WrongType, EmptyCert, EmptyKey, NotPEM, PEMButUnparseable, SANsMismatch, PartialSANsMismatch, SupersetSANsOK)
  • Unit: requiredDNSNames for service + pod0 SANs, varying namespace/name
  • Unit: reconcileSidecarTLSReady lifecycle (TLS-enabled+ready, missing, SANs-mismatch, disabled-clears, enable→disable transition)
  • envtest: CEL rejects post-creation spec.sidecar.tls mutation (deferred — adds envtest infra cost; CEL rule is human-verifiable in the generated CRD YAML)
  • End-to-end on harbor: create SeiNode with externally-applied Secret; verify pod serves on :8443 and controller TLS client connects. Delete + recreate with TLS removed; verify plain HTTP on legacy port.

References

🤖 Generated with Claude Code

bdchatham and others added 2 commits May 16, 2026 10:56
Implements #255 per the merged LLD at docs/design-seinode-sidecar-tls-toggle-lld.md.

Cert ownership moves out of the controller. Operator/platform tooling
provisions a kubernetes.io/tls Secret; the controller references it by
name, validates its presence + cert SANs in a steady-state preflight
branch, and gates init-plan creation on ConditionSidecarTLSSecretReady.

Mirrors signingKey / nodeKey / operatorKeyring / imported-PVC patterns
already in the controller. spec.sidecar.tls is immutable post-creation
(CRD CEL); toggle = delete + recreate.

API:
- SidecarTLSSpec rewritten: drop IssuerName/IssuerKind, add SecretName
- SidecarTLSStatus added: {SecretName, RequiredDNSNames} — machine-
  readable contract for platform tooling, replacing naming-convention docs
- SeiNodeStatus.SidecarTLS *SidecarTLSStatus
- ConditionSidecarTLSSecretReady with reasons Ready / NotFound /
  Malformed / SANsMismatch
- CEL immutability rule on spec.sidecar.tls

Controller:
- reconcileSidecarTLSReady method in internal/controller/node/preflight_sidecar_tls.go
- Reads Secret via APIReader (bypass cache), validates type + non-empty
  tls.crt/tls.key + PEM-decode + x509.ParseCertificate + SAN superset
- Publishes status.sidecarTLS.requiredDNSNames whenever TLS is enabled
- APIReader plumbed through SeiNodeReconciler

Planner:
- ResolvePlan gates init-plan creation when TLS enabled and condition
  not Ready; steady-state (Running) plans bypass the gate so image-drift
  rollouts proceed independently of TLS posture
- buildBasePlan / buildBootstrapPlan / buildGenesisPlan emit only
  ApplyRBACProxyConfig under TLS (ApplySidecarCert removed)

Code deletion:
- internal/task/apply_sidecar_cert.go (task + deserializer registry entry)
- GenerateSidecarCertificate from internal/noderesource/noderesource.go

Tests:
- preflight_sidecar_tls_test.go: validateTLSSecret matrix, requiredDNSNames,
  reconcileSidecarTLSReady lifecycle (enable/disable transitions, missing
  Secret, SANs mismatch). Uses in-memory ECDSA P-256 self-signed certs.
- Existing sidecar_tls / plan_execution / reconciler / noderesource tests
  updated for the new spec shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… + doc

Addresses platform-engineer cross-review against the implementation
commit. One blocker (CEL too permissive), rest are non-blocker /
ergonomics fixups.

- api: tighten the spec.sidecar.tls immutability CEL rule. The prior
  shape short-circuited to true whenever oldSelf.sidecar or
  oldSelf.sidecar.tls was unset, which silently permitted enabling TLS
  on an existing Running SeiNode via spec mutation. Under the new rule,
  TLS state at creation is final: a SeiNode that started without TLS
  stays without TLS for its lifetime, and a SeiNode that started with
  TLS stays at the same TLS spec for its lifetime. Matches LLD §0.1(b).
- api: prefix reason string values (TLSSecretReady, TLSSecretNotFound,
  TLSSecretMalformed, TLSSecretSANsMismatch) to match the existing
  SigningKey/NodeKey/OperatorKeyring family so SRE tooling can grep
  across condition types consistently.
- task: prefix the transient-error message in validateTLSSecret so an
  operator who sees Reason=NotFound but finds the Secret present can
  disambiguate "transient error" from "Secret missing" without grepping
  controller logs.
- api: doc-comment on SidecarTLSStatus.RequiredDNSNames notes the names
  are pre-computable from the target spec; published for verification,
  not as a prerequisite handshake.
- noderesource: doc-comment on SidecarTLSSecretName notes callers
  should gate on SidecarTLSEnabled before relying on the empty return.
- test: clarify the TransitionEnabledToDisabled test comment now that
  the CEL rule actually blocks the mutation in prod.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented May 16, 2026

You have used all Bugbot PR reviews included in your free trial for your GitHub account on this workspace.

To continue using Bugbot reviews, enable Bugbot for your team in the Cursor dashboard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sidecar TLS via externally-provisioned Secret

1 participant