feat(buildsecrets): sync tenant build-secrets from AWS Secrets Manager (no ESO)#10

Open
nicacioliveira wants to merge 5 commits into main from feat/tenant-build-secrets-sync

@nicacioliveira nicacioliveira commented May 20, 2026

Summary

Operator reconciles a K8s Secret named build-secrets in each opted-in site namespace, sourcing data from AWS Secrets Manager via the cluster-local account. Builder Jobs already consume the Secret opportunistically via envFrom: { optional: true } (admin #3201), so the chain is end-to-end with no further admin changes.

Previous approach (delegated to ESO via ExternalSecret) was scrapped — see "Why direct, not ESO" below.

How it fits in the bigger picture

flowchart LR
    subgraph p3 ["Phase 3 — NOT DONE"]
        UI["Admin UI / API"]
    end

    subgraph aws ["AWS Secrets Manager"]
        SM["sites/&lt;site&gt;/build<br/>JSON: { TENANT_TOKEN: ..., FOO: ... }"]
    end

    subgraph p2 ["Phase 2 — this PR"]
        OP["Operator<br/>BuildSecretsReconciler"]
    end

    subgraph cluster ["Site namespace (sites-&lt;site&gt;)"]
        NS["Namespace<br/>annotation: deco.sites/build-secrets-managed=enabled"]
        SEC["K8s Secret<br/>build-secrets<br/>(labels: managed-by=operator)"]
        JOB["Build Job"]
    end

    subgraph p1 ["Phase 1 — already merged (admin #3201)"]
        ENVFROM["envFrom: { secretRef, optional: true }"]
    end

    UI -.->|"writes (not yet)"| SM
    NS -->|"watch"| OP
    OP -->|"GetSecretValue (lazy, on reconcile)"| SM
    OP -->|"create / update / delete"| SEC
    ENVFROM -.->|"declared in admin Job spec"| JOB
    SEC -->|"env vars at pod start"| JOB

Phase status

| Phase | Repo | What | Status |
|---|---|---|---|
| 1 | deco-sites/admin | Builder Job envFrom opportunistic mount of build-secrets | ✅ merged (admin #3201) |
| 2 | decocms/operator (this PR) | Reconciler syncs SM ↔ K8s Secret based on namespace annotation | 🟡 this PR |
| 3 | deco-sites/admin (future) | UI/API for tenant to write into AWS SM + set annotation | ❌ not started |

Why direct, not ESO

ESO is great for shared secrets that always exist (cloudflare-token, github-token). Per-tenant secrets are the opposite shape — "no upstream data yet" is normal, not an error to surface in kubectl get externalsecret. Direct-source gives us:

  • Lazy creation: operator only materialises the K8s Secret when the upstream key actually exists. No SecretSyncedError noise for un-provisioned tenants.
  • Single component in the critical path: easier to debug, fewer failure modes.
  • Cleaner abstraction: Source interface lives in this repo. Swapping backend (GCP SM, Vault, even back to ESO via an ESOSource) is a new file, not a CRD migration.
  • Phase 3 simpler: admin writes to SM, sets the annotation. Operator handles the rest. No second sync layer to orchestrate.

The interface:

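type Source interface {
    // Get fetches the key/value data stored under key.
    // exists reports whether the upstream key is present.
    Get(ctx context.Context, key string) (data map[string]string, exists bool, err error)
}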
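
To illustrate the contract, here is a minimal sketch of an in-memory implementation, similar in spirit to the fake Source mentioned under Files. The `mapSource` type and the sample key are hypothetical and not taken from the PR. It assumes that a missing key is reported as `exists == false` with a nil error rather than as a failure.

```go
package main

import (
	"context"
	"fmt"
)

// Source mirrors the interface shown in the PR description.
type Source interface {
	Get(ctx context.Context, key string) (data map[string]string, exists bool, err error)
}

// mapSource is a hypothetical in-memory Source keyed by upstream secret name.
type mapSource map[string]map[string]string

// Get returns a copy of the stored data. A missing key is not an error:
// it is reported through exists == false (an assumption of this sketch).
func (m mapSource) Get(_ context.Context, key string) (map[string]string, bool, error) {
	data, ok := m[key]
	if !ok {
		return nil, false, nil
	}
	// Copy so callers cannot mutate the backing store.
	out := make(map[string]string, len(data))
	for k, v := range data {
		out[k] = v
	}
	return out, true, nil
}

func main() {
	var src Source = mapSource{
		"sites/acme/build": {"TENANT_TOKEN": "abc"},
	}
	for _, key := range []string{"sites/acme/build", "sites/other/build"} {
		data, exists, err := src.Get(context.Background(), key)
		fmt.Println(key, data, exists, err)
	}
}
```

Keeping "not provisioned yet" out of the error path is what lets a reconciler treat it as a normal state instead of a failure to retry.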

AWSSource implements it via aws-sdk-go-v2/service/secretsmanager. Selected via env var:

BUILD_SECRETS_BACKEND=aws        # default
BUILD_SECRETS_BACKEND=disabled   # bypass the reconciler entirely
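
As a rough sketch of how that selection could look in Go: `selectSource` and `stubAWSSource` are hypothetical names, the real wiring lives in cmd/main.go, and rejecting unknown values is an assumption of this example rather than documented behaviour.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
)

// Source mirrors the interface shown in the PR description.
type Source interface {
	Get(ctx context.Context, key string) (data map[string]string, exists bool, err error)
}

// stubAWSSource is a hypothetical placeholder for the real AWSSource.
type stubAWSSource struct{}

func (stubAWSSource) Get(context.Context, string) (map[string]string, bool, error) {
	return nil, false, errors.New("stub: not wired to AWS")
}

// selectSource maps the backend name to a Source.
// A nil Source with a nil error means the reconciler should not be registered.
func selectSource(backend string) (Source, error) {
	switch backend {
	case "", "aws": // empty falls back to the default
		return stubAWSSource{}, nil
	case "disabled":
		return nil, nil
	default:
		// Assumption: unknown values fail fast instead of silently defaulting.
		return nil, fmt.Errorf("unknown BUILD_SECRETS_BACKEND %q", backend)
	}
}

func main() {
	src, err := selectSource(os.Getenv("BUILD_SECRETS_BACKEND"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if src == nil {
		fmt.Println("build-secrets reconciler disabled")
		return
	}
	fmt.Println("build-secrets reconciler enabled")
}
```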

Reconciliation rules

| Annotation | Upstream SM | Local K8s Secret | Reconcile action |
|---|---|---|---|
| off | | | no-op |
| off | | operator-managed | delete |
| off | | unowned | leave alone |
| on | missing | | no-op (status: upstream-missing) |
| on | missing | operator-managed | delete |
| on | exists | | create with operator labels |
| on | exists | operator-managed, data drifted | update |
| on | exists | operator-managed, data matches | skip write (idempotent) |
| on | exists | unowned | log warning, no-op |
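
The rules above can be read as a pure decision function. The sketch below is one possible encoding, not the reconciler's actual code: all type, constant, and function names are hypothetical. The combination "on / missing / unowned" is not listed in the rules, so this sketch assumes it is also left alone.

```go
package main

import "fmt"

// LocalState describes the build-secrets Secret currently in the namespace.
type LocalState int

const (
	LocalAbsent  LocalState = iota // no Secret named build-secrets
	LocalManaged                   // carries the operator labels
	LocalUnowned                   // exists without the operator labels
)

// Action is the write the reconciler would perform.
type Action string

const (
	ActionNoop         Action = "noop"
	ActionCreate       Action = "create"
	ActionUpdate       Action = "update"
	ActionDelete       Action = "delete"
	ActionLeaveUnowned Action = "leave-unowned"
)

// decide encodes the reconciliation rules as a side-effect-free function.
func decide(annotationOn, upstreamExists bool, local LocalState, dataMatches bool) Action {
	// Unowned Secrets are never touched, whatever the annotation says.
	if local == LocalUnowned {
		return ActionLeaveUnowned
	}
	// Opted out, or nothing upstream: remove what the operator created.
	if !annotationOn || !upstreamExists {
		if local == LocalManaged {
			return ActionDelete
		}
		return ActionNoop
	}
	if local == LocalAbsent {
		return ActionCreate
	}
	if dataMatches {
		return ActionNoop // idempotent: skip the write
	}
	return ActionUpdate
}

func main() {
	fmt.Println(decide(true, true, LocalAbsent, false))   // create
	fmt.Println(decide(true, false, LocalManaged, false)) // delete
	fmt.Println(decide(false, true, LocalUnowned, false)) // leave-unowned
}
```

Separating the decision from the Kubernetes and AWS calls keeps every rule testable without a cluster.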

Adoption safety: A K8s Secret named build-secrets without labels deco.sites/managed-by=operator + deco.sites/feature=build-secrets is treated as user-managed. The operator refuses to touch it (returns ErrNotOwned). This protects the Secrets currently created by hand (frigidaire-pr, frigidaire-es, etc.) — they keep working unchanged after this lands.
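
A minimal sketch of that ownership check, assuming both labels must be present with exactly these values. The label keys and values come from this description. The function names and the error text are hypothetical, and the real `ErrNotOwned` is defined in the operator.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	labelManagedBy = "deco.sites/managed-by"
	labelFeature   = "deco.sites/feature"
)

// ErrNotOwned stands in for the operator's sentinel error; the message is illustrative.
var ErrNotOwned = errors.New("build-secrets Secret is not operator-managed")

// isOperatorManaged reports whether a Secret's labels mark it as owned by the operator.
// Reading from a nil map is safe in Go, so unlabeled Secrets simply return false.
func isOperatorManaged(labels map[string]string) bool {
	return labels[labelManagedBy] == "operator" && labels[labelFeature] == "build-secrets"
}

// guardOwnership returns ErrNotOwned for any Secret the operator must not touch.
func guardOwnership(labels map[string]string) error {
	if !isOperatorManaged(labels) {
		return ErrNotOwned
	}
	return nil
}

func main() {
	manual := map[string]string{"app": "frigidaire-pr"}
	managed := map[string]string{
		labelManagedBy: "operator",
		labelFeature:   "build-secrets",
	}
	fmt.Println(guardOwnership(manual))  // not owned
	fmt.Println(guardOwnership(managed)) // <nil>
}
```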

Force-sync (no HTTP endpoint, annotation only — see controller doc comment)

Re-fetch ONE site from AWS immediately, without waiting for ResyncPeriod (15min default):

kubectl annotate ns sites-<site> \
  deco.sites/build-secrets-sync=$(date +%s) --overwrite

Re-fetch ALL managed sites at once:

kubectl annotate ns -l deco.sites/build-secrets-managed=enabled \
  deco.sites/build-secrets-sync=$(date +%s) --overwrite

Files

  • internal/buildsecrets/source.go — Source interface + shared constants
  • internal/buildsecrets/aws_source.go — AWSSource (AWS Secrets Manager)
  • internal/buildsecrets/secret.go — Sync / Remove of the K8s Secret
  • internal/buildsecrets/*_test.go — fake Source + mock SM client, 81% coverage
  • internal/controller/buildsecrets_controller.go — Reconciler, watches Namespace + Secret
  • cmd/main.go — wires Source via env var, registers reconciler conditionally
  • config/rbac/role.yaml + chart/templates/clusterrole-operator-manager-role.yaml — drops ESO verbs, adds Secret CRUD (regenerated)
  • go.mod / go.sum — adds aws-sdk-go-v2/service/secretsmanager, bumps SDK to v1.41.7

Companion PR (deploy first)

Operator needs secretsmanager:GetSecretValue on the cluster-local account. To be added in terraform-eks-cluster/eks-setup on the deco-operator Pod Identity, scoped to arn:aws:secretsmanager:us-west-2:<account>:secret:sites/*/build*. Will open separately.

Test plan

After IAM PR merges + this operator chart bump lands in infra_applications/provisioning/deco-operator/main/values.yaml:

  • Create AWS SM key (account 610552715391, us-west-2 — eks-envs first): sites/build-igorteste/build = {"TENANT_TEST_KEY":"from-sm","TENANT_FROM_AWS":"yes"}
  • kubectl --context eks-envs annotate ns sites-build-igorteste deco.sites/build-secrets-managed=enabled
  • Confirm Secret/build-secrets materialises with labels deco.sites/managed-by=operator, deco.sites/feature=build-secrets, data matches SM
  • Trigger a build of build-igorteste — confirm env vars present in builder container
  • Update SM with new JSON, force-sync via annotation bump, confirm Secret updated within seconds
  • Remove namespace annotation — confirm Secret deleted; next build runs without the env vars
  • Re-annotate — Secret recreated from SM
  • On a namespace with a manually-created build-secrets (e.g. sites-frigidaire-pr), annotate it. Confirm operator logs Skipping unowned Secret and the manual Secret stays intact (existing builds keep working)
  • Repeat on eks-serverless (account 520819256705)

Out of scope

  • Admin UI for tenants to manage these secrets (Phase 3)
  • Migration of existing manually-created Secrets into operator-managed (intentional — they keep working as-is until someone deletes them and lets the operator recreate from SM)

Adds a BuildSecretsReconciler that materialises an ExternalSecret per
opted-in site namespace. The ExternalSecret pulls `sites/<site>/build`
from the cluster-local AWS Secrets Manager into a K8s Secret named
`build-secrets`, which the builder Job already consumes via envFrom
with optional:true (admin PR #3201).

Opt-in is the namespace annotation `deco.sites/build-secrets-managed:
enabled`. Removing it (or deleting the namespace) deletes the
ExternalSecret, and ESO cascades the deletion to the K8s Secret via
`creationPolicy: Owner`.

Encapsulates the sync mechanism behind a single buildsecrets package
so swapping away from ESO touches one file. The Reconciler watches
Namespace + ExternalSecret (unstructured) and re-reconciles on either.

Force-sync recipes documented inline:
- Per site:   `kubectl annotate es build-secrets force-sync=$(date +%s) --overwrite`
- All sites:  filter by label `deco.sites/feature=build-secrets`

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 6 files

Comment thread internal/controller/buildsecrets_controller.go Outdated
…-name filter

siteNameFromNamespace also strips Valkey-reserved usernames ('default',
'redis-root') — irrelevant to build-secrets reconciliation. A site
namespace called 'sites-default' would have been silently skipped from
sync due to the Valkey coupling. Strip the prefix inline instead.

Identified by cubic.
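
A sketch of what stripping the prefix inline might look like, assuming the `sites-` namespace prefix shown in the PR description. The helper name is hypothetical, and rejecting an empty site name is an assumption of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// siteFromNamespace strips the "sites-" prefix without consulting any
// Valkey-reserved names, so "sites-default" maps to the site "default".
func siteFromNamespace(ns string) (string, bool) {
	site, ok := strings.CutPrefix(ns, "sites-")
	if !ok || site == "" {
		return "", false
	}
	return site, true
}

func main() {
	for _, ns := range []string{"sites-acme", "sites-default", "kube-system"} {
		site, ok := siteFromNamespace(ns)
		fmt.Printf("%q -> %q (%v)\n", ns, site, ok)
	}
}
```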
Pivot away from the ESO ExternalSecret intermediary. The reconciler
now talks to the upstream secret backend (AWS Secrets Manager today)
through a small Source interface and materialises the K8s Secret
`build-secrets` itself.

Why:
- Per-tenant secrets have a different lifecycle than the shared
  secrets that justify ESO: "upstream not provisioned yet" is a
  normal state, not an error to surface in `kubectl get es`.
- One component in the critical path instead of two. Easier to
  reason about, fewer failure modes.
- The Source interface lets us swap the backend (GCP Secret Manager,
  Vault, even ESO if we ever want to revert) with a single new file
  — the rest of the operator and the K8s Secret contract stay put.

What changed:
- internal/buildsecrets/source.go        — interface + constants
- internal/buildsecrets/aws_source.go    — AWS SM implementation
- internal/buildsecrets/secret.go        — Sync/Remove of the K8s Secret
- internal/buildsecrets/*_test.go        — fake Source + mock SM client
- internal/controller/buildsecrets_controller.go — uses Source, watches
                                            Namespace + Secret
- cmd/main.go — instantiates Source via --build-secrets-backend
                (env BUILD_SECRETS_BACKEND, default "aws"; "disabled"
                bypasses the reconciler entirely)
- RBAC: drops externalsecrets.external-secrets.io, adds the update,
        delete, and patch verbs on Secrets (regenerated via
        `make manifests && make helm`)
- go.mod: adds aws-sdk-go-v2/service/secretsmanager, bumps the SDK

Adoption safety: a Secret named `build-secrets` that lacks the
operator's labels is treated as user-managed. The reconciler will
not adopt, update, or delete it — useful while migrating away from
the manual Secrets we already created by hand (e.g. frigidaire-*).

Force-sync interface (annotation on the Namespace) preserved.

Companion PR (deploys before this): IAM in terraform-eks-cluster
adding secretsmanager:GetSecretValue to the operator Pod Identity.
@nicacioliveira changed the title from "feat(buildsecrets): sync tenant build-secrets via ExternalSecret" to "feat(buildsecrets): sync tenant build-secrets from AWS Secrets Manager (no ESO)" on May 20, 2026

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 13 files (changes from recent commits).

Comment thread internal/buildsecrets/aws_source.go
All tests use the same testNamespace; param was always 'sites-acme'.

json.Unmarshal('null') into a map[string]string leaves the map nil
without erroring. Previously we'd return (nil, true, nil) and the
reconciler would happily create an empty K8s Secret, which is not
what the tenant meant by 'null' in their payload. Treat it as a
malformed upstream and surface the error.

Empty object {} stays valid — yields a non-nil empty map and a
materialised but empty Secret.

Identified by cubic.
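
The behaviour described above can be reproduced with the standard library alone. This is a sketch of the guard, not the code in aws_source.go: `parsePayload` and its error messages are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// parsePayload decodes an upstream secret string into env-style key/value pairs.
// json.Unmarshal accepts the literal null for a map target and leaves the map nil,
// so the nil check is what turns null into an explicit error.
func parsePayload(raw string) (map[string]string, error) {
	var data map[string]string
	if err := json.Unmarshal([]byte(raw), &data); err != nil {
		return nil, fmt.Errorf("decode secret payload: %w", err)
	}
	if data == nil {
		return nil, errors.New("secret payload is JSON null, expected an object")
	}
	return data, nil
}

func main() {
	for _, raw := range []string{`{"TENANT_TOKEN":"abc"}`, `{}`, `null`} {
		data, err := parsePayload(raw)
		fmt.Printf("%-24s len=%d nil=%v err=%v\n", raw, len(data), data == nil, err)
	}
}
```

An empty object still decodes to a non-nil, zero-length map, which matches the "materialised but empty Secret" case.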