feat(buildsecrets): sync tenant build-secrets from AWS Secrets Manager (no ESO)#10
Open
nicacioliveira wants to merge 5 commits into
Open
feat(buildsecrets): sync tenant build-secrets from AWS Secrets Manager (no ESO)#10nicacioliveira wants to merge 5 commits into
nicacioliveira wants to merge 5 commits into
Conversation
Adds a BuildSecretsReconciler that materialises an ExternalSecret per opted-in site namespace. The EE pulls `sites/<site>/build` from the cluster-local AWS Secrets Manager into a K8s Secret named `build-secrets`, which the builder Job already consumes via envFrom with optional:true (admin PR #3201). Opt-in is the namespace annotation `deco.sites/build-secrets-managed: enabled`. Removing it (or deleting the namespace) deletes the EE and ESO cascades the K8s Secret via `creationPolicy: Owner`. Encapsulates the sync mechanism behind a single buildsecrets package so swapping away from ESO touches one file. The Reconciler watches Namespace + ExternalSecret (unstructured) and re-reconciles on either. Force-sync recipes documented inline: - Per site: `kubectl annotate es build-secrets force-sync=$(date +%s) --overwrite` - All sites: filter by label `deco.sites/feature=build-secrets`
There was a problem hiding this comment.
1 issue found across 6 files
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
…-name filter
siteNameFromNamespace also strips Valkey-reserved usernames ('default',
'redis-root') — irrelevant to build-secrets reconciliation. A site
namespace called 'sites-default' would have been silently skipped from
sync due to the Valkey coupling. Strip the prefix inline instead.
Identified by cubic.
Pivot away from the ESO ExternalSecret intermediary. The reconciler
now talks to the upstream secret backend (AWS Secrets Manager today)
through a small Source interface and materialises the K8s Secret
`build-secrets` itself.
Why:
- Per-tenant secrets have a different lifecycle than the shared
secrets that justify ESO: "upstream not provisioned yet" is a
normal state, not an error to surface in `kubectl get es`.
- One component in the critical path instead of two. Easier to
reason about, fewer failure modes.
- The Source interface lets us swap the backend (GCP Secret Manager,
Vault, even ESO if we ever want to revert) with a single new file
— the rest of the operator and the K8s Secret contract stay put.
What changed:
- internal/buildsecrets/source.go — interface + constants
- internal/buildsecrets/aws_source.go — AWS SM implementation
- internal/buildsecrets/secret.go — Sync/Remove of the K8s Secret
- internal/buildsecrets/*_test.go — fake Source + mock SM client
- internal/controller/buildsecrets_controller.go — uses Source, watches
Namespace + Secret
- cmd/main.go — instantiates Source via --build-secrets-backend
(env BUILD_SECRETS_BACKEND, default "aws"; "disabled"
bypasses the reconciler entirely)
- RBAC: drops externalsecrets.external-secrets.io, adds Secret update
delete patch (regenerated via `make manifests && make helm`)
- go.mod: adds aws-sdk-go-v2/service/secretsmanager, bumps the SDK
Adoption safety: a Secret named `build-secrets` that lacks the
operator's labels is treated as user-managed. The reconciler will
not adopt, update, or delete it — useful while migrating away from
the manual Secrets we already created by hand (e.g. frigidaire-*).
Force-sync interface (annotation on the Namespace) preserved.
Companion PR (deploys before this): IAM in terraform-eks-cluster
adding secretsmanager:GetSecretValue to the operator Pod Identity.
There was a problem hiding this comment.
1 issue found across 13 files (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
All tests use the same testNamespace; param was always 'sites-acme'.
json.Unmarshal('null') into a map[string]string leaves the map nil
without erroring. Previously we'd return (nil, true, nil) and the
reconciler would happily create an empty K8s Secret, which is not
what the tenant meant by 'null' in their payload. Treat it as a
malformed upstream and surface the error.
Empty object {} stays valid — yields a non-nil empty map and a
materialised but empty Secret.
Identified by cubic.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Operator reconciles a K8s Secret named
build-secretsin each opted-in site namespace, sourcing data from AWS Secrets Manager via the cluster-local account. Builder Jobs already consume the Secret opportunistically viaenvFrom: { optional: true }(admin #3201), so the chain is end-to-end with no further admin changes.Previous approach (delegated to ESO via ExternalSecret) was scrapped — see "Why direct, not ESO" below.
How it fits in the bigger picture
flowchart LR subgraph p3 ["Phase 3 — NOT DONE"] UI["Admin UI / API"] end subgraph aws ["AWS Secrets Manager"] SM["sites/<site>/build<br/>JSON: { TENANT_TOKEN: ..., FOO: ... }"] end subgraph p2 ["Phase 2 — this PR"] OP["Operator<br/>BuildSecretsReconciler"] end subgraph cluster ["Site namespace (sites-<site>)"] NS["Namespace<br/>annotation: deco.sites/build-secrets-managed=enabled"] SEC["K8s Secret<br/>build-secrets<br/>(labels: managed-by=operator)"] JOB["Build Job"] end subgraph p1 ["Phase 1 — already merged (admin #3201)"] ENVFROM["envFrom: { secretRef, optional: true }"] end UI -.->|"writes (not yet)"| SM NS -->|"watch"| OP OP -->|"GetSecretValue (lazy, on reconcile)"| SM OP -->|"create / update / delete"| SEC ENVFROM -.->|"declared in admin Job spec"| JOB SEC -->|"env vars at pod start"| JOBPhase status
deco-sites/adminenvFromopportunistic mount ofbuild-secretsdecocms/operator(this PR)deco-sites/admin(future)Why direct, not ESO
ESO is great for shared secrets that always exist (cloudflare-token, github-token). Per-tenant secrets are the opposite shape — "no upstream data yet" is normal, not an error to surface in
kubectl get externalsecret. Direct-source gives us:SecretSyncedErrornoise for un-provisioned tenants.Sourceinterface lives in this repo. Swapping backend (GCP SM, Vault, even back to ESO via anESOSource) is a new file, not a CRD migration.The interface:
AWSSourceimplements it viaaws-sdk-go-v2/service/secretsmanager. Selected via env var:Reconciliation rules
Adoption safety: A K8s Secret named
build-secretswithout labelsdeco.sites/managed-by=operator+deco.sites/feature=build-secretsis treated as user-managed. The operator refuses to touch it (returnsErrNotOwned). This protects the Secrets currently created by hand (frigidaire-pr,frigidaire-es, etc.) — they keep working unchanged after this lands.Force-sync (no HTTP endpoint, annotation only — see controller doc comment)
Re-fetch ONE site from AWS immediately, without waiting for
ResyncPeriod(15min default):Re-fetch ALL managed sites at once:
kubectl annotate ns -l deco.sites/build-secrets-managed=enabled \ deco.sites/build-secrets-sync=$(date +%s) --overwriteFiles
internal/buildsecrets/source.go—Sourceinterface + shared constantsinternal/buildsecrets/aws_source.go—AWSSource(AWS Secrets Manager)internal/buildsecrets/secret.go—Sync/Removeof the K8s Secretinternal/buildsecrets/*_test.go— fakeSource+ mock SM client, 81% coverageinternal/controller/buildsecrets_controller.go— Reconciler, watches Namespace + Secretcmd/main.go— wires Source via env var, registers reconciler conditionallyconfig/rbac/role.yaml+chart/templates/clusterrole-operator-manager-role.yaml— drops ESO verbs, adds Secret CRUD (regenerated)go.mod/go.sum— addsaws-sdk-go-v2/service/secretsmanager, bumps SDK to v1.41.7Companion PR (deploy first)
Operator needs
secretsmanager:GetSecretValueon the cluster-local account. To be added interraform-eks-cluster/eks-setupon the deco-operator Pod Identity, scoped toarn:aws:secretsmanager:us-west-2:<account>:secret:sites/*/build*. Will open separately.Test plan
After IAM PR merges + this operator chart bump lands in
infra_applications/provisioning/deco-operator/main/values.yaml:sites/build-igorteste/build={"TENANT_TEST_KEY":"from-sm","TENANT_FROM_AWS":"yes"}kubectl --context eks-envs annotate ns sites-build-igorteste deco.sites/build-secrets-managed=enabledSecret/build-secretsmaterialises with labelsdeco.sites/managed-by=operator,deco.sites/feature=build-secrets, data matches SMbuild-igorteste— confirm env vars present in builder containerbuild-secrets(e.g.sites-frigidaire-pr), annotate it. Confirm operator logsSkipping unowned Secretand the manual Secret stays intact (existing builds keep working)Out of scope