176 changes: 176 additions & 0 deletions docs/plans/2026-03-13-kms-encryption-design.md
@@ -0,0 +1,176 @@
# KMS Encryption for GitOps

## Overview

Replace git-crypt with a SOPS-inspired envelope encryption system backed by AWS KMS. Files are encrypted inline at the value level, so the YAML structure stays readable and diffable. A crypt daemon caches unwrapped data encryption keys (DEKs) in memory to minimise KMS calls.

## Phase 1: Manual CLI Commands

### Core Encryption Model

Envelope encryption with AES-256-GCM and KMS-wrapped DEKs.

Each encrypted file gets one DEK (Data Encryption Key). The DEK is generated via KMS `GenerateDataKey`, giving both plaintext and ciphertext (wrapped) forms. Individual YAML values are encrypted locally using AES-256-GCM with the plaintext DEK — each value gets a unique IV and auth tag.

The wrapped DEK is stored in a YAML comment header in the file. The header is base64-encoded JSON containing one wrapped copy of the DEK per KMS key in the matching rule's key list.

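**Deterministic IV:** The IV for each value is derived via `HMAC-SHA256(DEK, plaintext)` truncated to 12 bytes. Identical plaintext values encrypted with the same DEK therefore produce identical ciphertext, which keeps diffs stable and makes duplicates detectable. The trade-off is that equal values within a file are visibly equal. Within a file, a (key, IV) pair repeats only when the plaintext is the same (barring a collision in the truncated HMAC), and because the DEK is unique per file, no (key, IV) pair is shared across files.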
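
A minimal sketch of the IV derivation described above, using only the Python standard library. The function name `derive_iv` and the randomly generated sample key are illustrative assumptions, not part of the design, and the AES-256-GCM step itself is left out.

```python
import hashlib
import hmac
import os

IV_LENGTH = 12  # bytes; the 96-bit IV size used with GCM in this design


def derive_iv(dek: bytes, plaintext: bytes) -> bytes:
    """Return HMAC-SHA256(dek, plaintext) truncated to 12 bytes."""
    return hmac.new(dek, plaintext, hashlib.sha256).digest()[:IV_LENGTH]


# Hypothetical 32-byte DEK; in the design it would come from KMS GenerateDataKey.
dek = os.urandom(32)

iv_a = derive_iv(dek, b"hunter2")
iv_b = derive_iv(dek, b"hunter2")  # same key and plaintext, so the same IV
iv_c = derive_iv(dek, b"hunter3")  # different plaintext, so a different IV
```

Because the IV depends only on the DEK and the plaintext, re-encrypting an unchanged value yields byte-identical output, which is what keeps diffs quiet.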

### File Format

After encryption:

```yaml
# gitops-crypt: eyJ2ZXJzaW9uIjogMSwgImRla3MiOiB7InByb2R1Y3Rpb24iOiAiYmFzZTY0Li4uIiwgImNpIjogImJhc2U2NC4uLiJ9fQ==
secrets:
  DB_PASSWORD: ENC[data:U2FsdGVk==,iv:AAAAAAAAAA==,tag:BBBBBBBBBB==]
  API_KEY: ENC[data:Y2lwaGVy==,iv:CCCCCCCCCC==,tag:DDDDDDDDDD==]
  NEW_SECRET: my-plaintext-value
```

Header payload (base64-decoded):

```json
{
  "version": 1,
  "deks": {
    "production": "base64-wrapped-dek...",
    "ci": "base64-wrapped-dek..."
  }
}
```

- `version` — format version for future changes
- `deks` — the same DEK wrapped by each KMS key, keyed by name from `.gitops.toml`
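
The header round trip can be sketched as follows. The helper names `encode_header` and `decode_header`, the sample wrapped-DEK strings, and the strict version check are assumptions made for illustration; only the `# gitops-crypt: ` prefix and the `version`/`deks` fields come from the format above.

```python
import base64
import json

HEADER_PREFIX = "# gitops-crypt: "


def encode_header(deks: dict, version: int = 1) -> str:
    """Build the comment header line from a mapping of key name to wrapped DEK."""
    payload = json.dumps({"version": version, "deks": deks})
    return HEADER_PREFIX + base64.b64encode(payload.encode()).decode()


def decode_header(first_line: str) -> dict:
    """Parse a header line back into its JSON payload."""
    if not first_line.startswith(HEADER_PREFIX):
        raise ValueError("not a gitops-crypt header")
    encoded = first_line[len(HEADER_PREFIX):].strip()
    payload = json.loads(base64.b64decode(encoded, validate=True))
    if payload.get("version") != 1:
        # Assumed behaviour: refuse versions this sketch does not understand.
        raise ValueError(f"unsupported header version: {payload.get('version')!r}")
    return payload


# Placeholder wrapped DEKs, not real KMS ciphertext.
line = encode_header({"production": "d3JhcHBlZA==", "ci": "d3JhcHBlZDI="})
header = decode_header(line)
```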

Encrypted value format:

```
ENC[data:<base64>,iv:<base64>,tag:<base64>]
```

- `data` — AES-256-GCM ciphertext
- `iv` — 12-byte deterministic IV (HMAC-derived)
- `tag` — 16-byte GCM authentication tag
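
As an illustration of the value format, the sketch below formats and parses `ENC[...]` strings. The names `EncryptedValue`, `format_enc`, and `parse_enc` are hypothetical, and the length checks are an assumption based on the field sizes listed above.

```python
import base64
import re
from typing import NamedTuple, Optional

ENC_PATTERN = re.compile(
    r"^ENC\[data:(?P<data>[A-Za-z0-9+/=]*),"
    r"iv:(?P<iv>[A-Za-z0-9+/=]+),"
    r"tag:(?P<tag>[A-Za-z0-9+/=]+)\]$"
)


class EncryptedValue(NamedTuple):
    data: bytes  # AES-256-GCM ciphertext
    iv: bytes    # 12-byte IV
    tag: bytes   # 16-byte authentication tag


def format_enc(value: EncryptedValue) -> str:
    """Render the three fields as ENC[data:...,iv:...,tag:...]."""
    data, iv, tag = (base64.b64encode(raw).decode() for raw in value)
    return f"ENC[data:{data},iv:{iv},tag:{tag}]"


def parse_enc(text: str) -> Optional[EncryptedValue]:
    """Return the decoded fields, or None if text is not an ENC[...] value."""
    match = ENC_PATTERN.match(text)
    if match is None:
        return None
    value = EncryptedValue(*(base64.b64decode(match[name]) for name in ("data", "iv", "tag")))
    if len(value.iv) != 12 or len(value.tag) != 16:
        raise ValueError("unexpected IV or tag length")
    return value


sample = EncryptedValue(data=b"cipher", iv=bytes(12), tag=bytes(16))
text = format_enc(sample)
```

Returning `None` for non-matching input lets the same function double as the `ENC[...]` detector.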

### Configuration: `.gitops.toml`

```toml
[encryption]
daemon_idle_timeout = 3600 # seconds, default 1 hour

[encryption.keys]
production = "arn:aws:kms:ap-southeast-2:123456789012:alias/gitops-prod"
ci = "arn:aws:kms:ap-southeast-2:123456789012:alias/gitops-ci"
internal = "arn:aws:kms:ap-southeast-2:123456789012:alias/gitops-internal"

[[encryption.rules]]
pattern = "apps/internal/*/secrets.yml"
keys = ["internal", "ci"]

[[encryption.rules]]
pattern = "apps/*/secrets.yml"
keys = ["production", "ci"]
```

- First matching rule wins (order matters — most specific first)
- `--keys` CLI flag overrides the rule's key list for that invocation
- Files not matching any rule are ignored
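
A small sketch of first-match-wins rule resolution. It assumes shell-style matching via `fnmatch`, which the design does not specify, and the function name `resolve_keys` is hypothetical. Under `fnmatch`, `*` also matches `/`, so the broader `apps/*/secrets.yml` pattern would match internal paths too, which is one reason rule order matters.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Mirrors the [[encryption.rules]] tables above, in file order.
RULES = [
    {"pattern": "apps/internal/*/secrets.yml", "keys": ["internal", "ci"]},
    {"pattern": "apps/*/secrets.yml", "keys": ["production", "ci"]},
]


def resolve_keys(path: str, rules: list, override: Optional[list] = None) -> Optional[list]:
    """Return the key names for path, or None when no rule matches."""
    for rule in rules:
        if fnmatchcase(path, rule["pattern"]):
            # --keys replaces the matched rule's key list for this invocation.
            return override if override is not None else rule["keys"]
    return None  # unmatched files are ignored


internal_keys = resolve_keys("apps/internal/wiki/secrets.yml", RULES)
billing_keys = resolve_keys("apps/billing/secrets.yml", RULES)
ignored = resolve_keys("README.md", RULES)
overridden = resolve_keys("apps/billing/secrets.yml", RULES, override=["production", "ci", "new-team"])
```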

### CLI Interface

```bash
# Encrypt all files matching rules in .gitops.toml
gitops encrypt

# Decrypt all files matching rules
gitops decrypt

# Operate on specific files
gitops encrypt apps/billing/secrets.yml
gitops decrypt apps/billing/secrets.yml

# Override which keys wrap the DEK
gitops encrypt apps/billing/secrets.yml --keys production,ci

# Re-wrap DEK with a different key set (values unchanged)
gitops encrypt apps/billing/secrets.yml --keys production,ci,new-team --rekey
```

Behaviour:
- `encrypt` — encrypts plaintext values, skips already-encrypted ones (`ENC[...]` detection)
- `decrypt` — decrypts `ENC[...]` values to plaintext, skips already-plaintext ones
- `--rekey` — unwraps DEK with an existing key, re-wraps with the new key set. Values untouched.
- `encrypt` and `decrypt` both connect to the daemon for DEK caching, auto-starting it if needed
- Progress output consistent with existing CLI style
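
The skip-if-already-encrypted behaviour makes `encrypt` idempotent. A rough sketch, assuming a flat mapping of string values and a caller-supplied encryption function; `placeholder_encrypt` is a stand-in that performs no cryptography, and all names here are hypothetical.

```python
from typing import Callable


def is_encrypted(value: str) -> bool:
    """Cheap ENC[...] detection; a real check would also validate the fields."""
    return value.startswith("ENC[") and value.endswith("]")


def encrypt_values(values: dict, encrypt_one: Callable[[str], str]) -> dict:
    """Encrypt plaintext values and pass already-encrypted ones through unchanged."""
    return {
        name: value if is_encrypted(value) else encrypt_one(value)
        for name, value in values.items()
    }


def placeholder_encrypt(plaintext: str) -> str:
    # Stand-in only: marks the value as encrypted without doing any cryptography.
    return "ENC[data:AAAA,iv:AAAA,tag:AAAA]"


doc = {
    "DB_PASSWORD": "ENC[data:U2FsdGVk,iv:AAAA,tag:BBBB]",
    "NEW_SECRET": "my-plaintext-value",
}
once = encrypt_values(doc, placeholder_encrypt)
twice = encrypt_values(once, placeholder_encrypt)  # second pass changes nothing
```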

### Crypt Daemon

Auto-starts on first `encrypt`/`decrypt` call. Caches unwrapped DEKs in memory keyed by hash of wrapped DEK ciphertext.

```
┌─────────────┐      Unix Socket       ┌──────────────────┐
│ gitops CLI  │ <────────────────────> │  crypt-daemon    │
│             │  request: wrapped DEK  │                  │
│  encrypt/   │  response: plain DEK   │  In-memory cache │
│  decrypt    │                        │  {hash → DEK}    │
└─────────────┘                        │                  │
                                       │  KMS client      │
                                       └────────┬─────────┘
                                                │ only on cache miss
                                                v
                                       ┌──────────────────┐
                                       │  AWS KMS         │
                                       └──────────────────┘
```

- Socket: `~/.gitops/crypt-daemon.sock`
- PID file: `~/.gitops/crypt-daemon.pid`
- Auto-starts when CLI can't connect to socket
- Auto-exits after configurable idle timeout (default 1 hour)
- DEKs only in memory, never written to disk
- Supports batch requests — CLI sends all unique wrapped DEKs in one call, daemon resolves cache hits instantly and makes concurrent KMS calls for misses
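
The cache and batch behaviour listed above might look roughly like this. `DekCache` and `resolve_batch` are hypothetical names, `unwrap` stands in for a KMS `Decrypt` call, and the thread pool is just one assumed way to make the misses concurrent; the socket layer and idle timeout are omitted.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class DekCache:
    """In-memory map from SHA-256(wrapped DEK) to plaintext DEK."""

    def __init__(self, unwrap: Callable[[bytes], bytes]):
        self._unwrap = unwrap  # stand-in for a KMS Decrypt call
        self._cache = {}

    @staticmethod
    def _key(wrapped: bytes) -> str:
        return hashlib.sha256(wrapped).hexdigest()

    def resolve_batch(self, wrapped_deks: list) -> dict:
        """Return {wrapped DEK: plaintext DEK} for every DEK in the batch."""
        unique = set(wrapped_deks)
        misses = [w for w in unique if self._key(w) not in self._cache]
        if misses:
            # Unwrap cache misses concurrently; hits never reach the unwrap function.
            with ThreadPoolExecutor(max_workers=min(8, len(misses))) as pool:
                for wrapped, plain in zip(misses, pool.map(self._unwrap, misses)):
                    self._cache[self._key(wrapped)] = plain
        return {w: self._cache[self._key(w)] for w in unique}


unwrap_calls = []


def fake_unwrap(wrapped: bytes) -> bytes:
    # Test double: records the call and returns a fake "plaintext" DEK.
    unwrap_calls.append(wrapped)
    return wrapped[::-1]


cache = DekCache(fake_unwrap)
first = cache.resolve_batch([b"dek-a", b"dek-b", b"dek-a"])  # two misses
second = cache.resolve_batch([b"dek-a", b"dek-c"])           # one hit, one miss
```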

### Batch Pipeline

`gitops decrypt` (no args) flow:

1. Scan all files matching `.gitops.toml` rules
2. Parse headers, collect all unique wrapped DEKs
3. Send all unique wrapped DEKs to daemon in one batch request
4. Daemon returns all plaintext DEKs (cache hits instant, misses go to KMS concurrently)
5. Decrypt all values locally with concurrency (implementation TBD — profile to pick threads vs multiprocessing vs asyncio)

For example, 100 files sharing 3 unique DEKs need at most 3 KMS calls on a cold cache and 0 on a warm one.
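
Steps 2 and 3 amount to deduplicating wrapped DEKs before talking to the daemon. A sketch under the assumption that the caller picks one key name it can use; the function name `collect_wrapped_deks` and the generated sample headers are hypothetical.

```python
def collect_wrapped_deks(headers: dict, key_name: str) -> dict:
    """Group file paths by the wrapped DEK stored under key_name."""
    by_dek = {}
    for path, header in headers.items():
        wrapped = header["deks"].get(key_name)
        if wrapped is None:
            continue  # this file was not wrapped for key_name
        by_dek.setdefault(wrapped, []).append(path)
    return by_dek


# 100 parsed headers that happen to share 3 distinct wrapped DEKs.
headers = {
    f"apps/app{i}/secrets.yml": {"version": 1, "deks": {"ci": f"wrapped-{i % 3}"}}
    for i in range(100)
}
groups = collect_wrapped_deks(headers, "ci")
batch_request = sorted(groups)  # the unique wrapped DEKs sent to the daemon
```

The size of `batch_request` is the upper bound on KMS calls for the run.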

### Server Integration

The gitops server (`gitops_server/`) runs continuously and decrypts on every webhook-triggered deploy. It uses the daemon for DEK caching — the daemon runs alongside the server process.

Post-clone flow:
1. Clone repo (as today)
2. Run `gitops decrypt` (replaces both `git-crypt unlock` and `helm-secrets` decryption)
3. Deploy from plaintext working tree

Helm-secrets is no longer needed — all decryption happens via `gitops decrypt` before helm runs.

### Migration from git-crypt

Incremental, file-by-file. Both systems coexist during migration:

1. Add `.gitops.toml` with encryption rules
2. For each file: `git-crypt unlock` → file is plaintext → `gitops encrypt` → commit
3. Update server's `clone_repo()` to run `gitops decrypt` after clone
4. Once all files migrated, remove `git-crypt` dependency and `GIT_CRYPT_KEY_FILE` config

## Phase 2: Git Filter Integration (Future)

Wire up `.gitattributes` for automatic encrypt-on-add / decrypt-on-checkout:

```
apps/*/secrets.yml filter=gitops-crypt diff=gitops-crypt
```

Use git's **long-running filter process** protocol (`filter.gitops-crypt.process`) rather than per-file smudge/clean invocations. A single persistent process handles all files in a checkout via stdin/stdout, avoiding process startup overhead per file. The crypt daemon serves double duty — it acts as both the DEK cache and the long-running filter process.
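
The long-running filter protocol frames every message as a pkt-line: four hex digits giving the total length (including those four bytes), then the payload, with `0000` as a flush packet. A minimal framing sketch follows; the helper names are hypothetical and the capability negotiation and content streaming that follow the handshake are not shown.

```python
import io


def write_pkt_line(stream, payload: bytes) -> None:
    """Write one pkt-line: 4 hex digits of total length, then the payload."""
    stream.write(f"{len(payload) + 4:04x}".encode() + payload)


def write_flush(stream) -> None:
    """Write a flush packet, which ends a list of pkt-lines."""
    stream.write(b"0000")


def read_pkt_line(stream):
    """Return the payload of the next pkt-line, or None for a flush packet."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("stream closed mid-packet")
    length = int(header.decode(), 16)
    if length == 0:
        return None
    return stream.read(length - 4)


# The opening lines git sends to a filter process, written to an in-memory buffer.
buf = io.BytesIO()
write_pkt_line(buf, b"git-filter-client\n")
write_pkt_line(buf, b"version=2\n")
write_flush(buf)

buf.seek(0)
packets = [read_pkt_line(buf) for _ in range(3)]
```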

The Phase 1 design (`.gitops.toml` config, file format, daemon) is built to accommodate this without rework.
4 changes: 3 additions & 1 deletion gitops/common/app.py
@@ -179,8 +179,9 @@ class Chart:
type: git
git_sha: develop
git_repo_url: https://github.com/uptick/workforce
+ path: charts/myapp # optional subpath within the repo

- of
+ or
chart:
type: helm
helm_repo: brigade
@@ -197,6 +198,7 @@ def __init__(self, definition: dict[str, Any] | str):
     self.type = "git"
     self.git_sha = None
     self.git_repo_url = definition or None
+    self.path = None
 elif isinstance(definition, dict):
     self.type = definition["type"]
     self.git_sha = definition.get("git_sha")
5 changes: 3 additions & 2 deletions gitops_server/workers/deployer/deploy.py
@@ -178,8 +178,9 @@ async def _update_app_deployment(self, app: App) -> UpdateAppResult | None:
 span.set_attribute("gitops.chart.type", "git")
 assert app.chart.git_repo_url
 async with temp_repo(app.chart.git_repo_url, ref=app.chart.git_sha) as chart_folder_path:
+    chart_path = f"{chart_folder_path}/{app.chart.path}" if app.chart.path else chart_folder_path
     with tracer.start_as_current_span("helm_dependency_build"):
-        await run(f"cd {chart_folder_path}; helm dependency build")
+        await run(f"cd {chart_path}; helm dependency build")

     with tempfile.NamedTemporaryFile(suffix=".yml") as cfg:
         cfg.write(json.dumps(app.values).encode())
@@ -198,7 +199,7 @@ async def upgrade_helm_git() -> RunOutput:
     f" -f {cfg.name}"
     f" --namespace={app.namespace}"
     f" {app.name}"
-    f" {chart_folder_path}",
+    f" {chart_path}",
     suppress_errors=True,
 )
 return result
35 changes: 35 additions & 0 deletions tests/test_deploy.py
@@ -82,6 +82,41 @@ async def test_deployer_update_helm_app(self, temp_repo_mock, post_mock, run_moc
         )
         assert post_mock.call_count == 1

+    @patch("gitops_server.workers.deployer.deploy.run")
+    @patch("gitops_server.workers.deployer.deploy.Deployer.post_result")
+    @patch("gitops_server.workers.deployer.deploy.load_app_definitions", mock_load_app_definitions)
+    @patch("gitops_server.workers.deployer.deploy.temp_repo")
+    async def test_deployer_git_with_subpath(self, temp_repo_mock, post_mock, run_mock):
+        """Deploy a git chart with a subpath specified."""
+        run_mock.return_value = {"exit_code": 0, "output": ""}
+        temp_repo_mock.return_value.__aenter__.return_value = "mock-repo"
+        git_app_with_path = App(
+            "git_app",
+            deployments={
+                "chart": {
+                    "type": "git",
+                    "git_repo_url": "https://github.com/some/repo",
+                    "git_sha": "main",
+                    "path": "charts/myapp",
+                },
+                "namespace": "mynamespace",
+                "tags": ["tag1"],
+                "cluster": "test-cluster",
+            },
+        )
+
+        semaphore_manager = AppSemaphoreManager()
+        deployer = await Deployer.from_push_event(SAMPLE_GITHUB_PAYLOAD, semaphore_manager)
+        await deployer.update_app_deployment(git_app_with_path)
+
+        assert run_mock.call_count == 2
+        assert run_mock.call_args_list[0][0][0] == "cd mock-repo/charts/myapp; helm dependency build"
+        assert re.match(
+            r"helm secrets upgrade --create-namespace --history-max 3 --install --timeout=600s -f .+\.yml"
+            r" --namespace=mynamespace git_app mock-repo/charts/myapp",
+            run_mock.call_args_list[1][0][0],
+        )
+
     @patch("gitops_server.workers.deployer.deploy.run")
     @patch("gitops_server.utils.slack.post")
     @patch("gitops_server.workers.deployer.deploy.load_app_definitions", mock_load_app_definitions)