diff --git a/.agents/skills/build-from-issue/SKILL.md b/.agents/skills/build-from-issue/SKILL.md
index 7b906fc2f..c91c712b6 100644
--- a/.agents/skills/build-from-issue/SKILL.md
+++ b/.agents/skills/build-from-issue/SKILL.md
@@ -402,29 +402,24 @@ git diff --name-only main -- e2e/
If there are no changes under `e2e/`, skip this phase entirely.
-If E2E files were modified, deploy to the local cluster and run the E2E test suite:
+If E2E files were modified, run the relevant E2E lane for the driver touched by the change:
```bash
-# Deploy all changes to the local k3s cluster
-mise run cluster:deploy
-
-# Run the E2E sandbox tests
-mise run test:e2e:sandbox
+# Docker-backed gateway smoke E2E
+mise run e2e:docker
```
-`mise run test:e2e:sandbox` depends on `cluster:deploy` and `python:proto`, then runs `uv run pytest -o python_files='test_*.py' e2e/python`. However, since the cluster may need explicit deploy for code changes beyond just E2E test files, always run `mise run cluster:deploy` first as a separate step to ensure all sandbox/proxy/policy changes are live on the cluster before running E2E tests.
+Use `mise run e2e:podman`, `mise run e2e:vm`, or a Helm-backed Kubernetes E2E lane when the change targets those drivers.
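+A quick way to choose the lane is to look at the changed paths first; the mapping from path names to drivers below is an assumption, so adjust it to the actual `e2e/` layout:
+```bash
+# List the E2E files touched by this branch, then run the matching lane
+git diff --name-only main -- e2e/
+mise run e2e:docker   # or e2e:podman / e2e:vm, depending on the driver the changes touch
+```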
**E2E retry loop** (up to 3 attempts):
-1. Run `mise run cluster:deploy` (only on the first attempt, or if code was changed between attempts).
-2. Run `mise run test:e2e:sandbox`.
-3. If tests fail:
+1. Run the selected E2E lane.
+2. If tests fail:
- Read the pytest output carefully — identify which tests failed and why.
- Distinguish between **test bugs** (the test itself is wrong) and **implementation bugs** (the code under test is wrong).
- Fix the failing code or tests.
- - If code changes were made (not just test fixes), re-run `mise run cluster:deploy` before retrying.
- Decrement the retry counter and try again.
-4. If tests pass, Phase 2 is green.
+3. If tests pass, Phase 2 is green.
**If all 3 E2E attempts fail**, stop and report to the user:
- Which E2E tests are failing
@@ -576,8 +571,8 @@ Local E2E tests passed. CI does not currently run E2E tests, so this comment ser
| Field | Value |
|-------|-------|
| **Commit** | `` |
-| **Command** | `mise run test:e2e:sandbox` |
-| **Cluster deploy** | `mise run cluster:deploy` (completed before test run) |
+| **Command** | `<e2e lane command, e.g. mise run e2e:docker>` |
+| **Gateway mode** | `<docker / podman / vm / kubernetes>` |
| **Result** | ✅ All passed |
### Test Summary
@@ -645,8 +640,9 @@ If the `state:in-progress` label is present, the skill was previously started bu
| `gh pr create --title "..." --body "..."` | Create a pull request |
| `gh api user --jq '.login'` | Get current GitHub username |
| `mise run pre-commit` | Run pre-commit checks (includes unit tests, lint, format) |
-| `mise run cluster:deploy` | Deploy all changes to local k3s cluster |
-| `mise run test:e2e:sandbox` | Run E2E sandbox tests (depends on cluster:deploy) |
+| `mise run e2e:docker` | Run smoke E2E against a standalone Docker-backed gateway |
+| `mise run e2e:podman` | Run smoke E2E against a Podman-backed gateway |
+| `mise run e2e:vm` | Run smoke E2E against the VM compute driver |
## Example Usage
diff --git a/.agents/skills/debug-inference/SKILL.md b/.agents/skills/debug-inference/SKILL.md
index 26f87b916..6770da598 100644
--- a/.agents/skills/debug-inference/SKILL.md
+++ b/.agents/skills/debug-inference/SKILL.md
@@ -174,7 +174,7 @@ openshell sandbox create -- curl https://inference.local/v1/chat/completions --j
Interpretation:
-- **`cluster inference is not configured`**: set the managed route with `openshell inference set`
+- **`cluster inference is not configured`**: set the managed gateway route with `openshell inference set`
- **`connection not allowed by policy`** on `inference.local`: unsupported method or path
- **`no compatible route`**: provider type and client API shape do not match
- **Connection refused / upstream unavailable / verification failures**: base URL, bind address, topology, or credentials are wrong
@@ -232,7 +232,7 @@ In this case, OpenShell routing is usually working correctly. The failing hop is
This is not the same issue as the Colima CoreDNS fix.
-OpenShell injects `host.docker.internal` and `host.openshell.internal` into sandbox pods with `hostAliases`. That path bypasses cluster DNS lookup. If the request still times out, the usual cause is host firewall or network policy, not CoreDNS.
+OpenShell injects `host.docker.internal` and `host.openshell.internal` into sandbox workloads when the selected compute platform supports it. That path bypasses runtime DNS lookup. If the request still times out, the usual cause is host firewall or network policy, not DNS.
### Verify the Problem
@@ -248,40 +248,41 @@ OpenShell injects `host.docker.internal` and `host.openshell.internal` into sand
curl -sS http://172.17.0.1:11434/v1/models
```
-3. Test the same endpoint from the OpenShell cluster container:
+3. Test the same endpoint from a gateway or sandbox container on the Docker network:
```bash
- docker exec openshell-cluster- wget -qO- -T 5 http://host.docker.internal:11434/v1/models
+ docker ps --filter name=openshell --format '{{.Names}}'
+   docker exec <container> wget -qO- -T 5 http://host.docker.internal:11434/v1/models
```
If steps 1 and 2 succeed but step 3 times out, the host firewall or network configuration is blocking the container-to-host path.
### Fix
-Allow the Docker bridge network used by the OpenShell cluster to reach the host-local inference port. The exact command depends on your firewall tooling (iptables, nftables, firewalld, UFW, etc.), but the rule should allow:
+Allow the Docker bridge network used by the OpenShell gateway and sandbox containers to reach the host-local inference port. The exact command depends on your firewall tooling (iptables, nftables, firewalld, UFW, etc.), but the rule should allow:
-- **Source**: the Docker bridge subnet used by the OpenShell cluster container (commonly `172.18.0.0/16`)
-- **Destination**: the host gateway IP injected into sandbox pods for `host.docker.internal` (commonly `172.17.0.1`)
+- **Source**: the Docker bridge subnet used by OpenShell containers (commonly `172.18.0.0/16`)
+- **Destination**: the host gateway IP injected into sandbox workloads for `host.docker.internal` (commonly `172.17.0.1`)
- **Port**: the inference server port (e.g. `11434/tcp` for Ollama)
To find the actual values on your system:
```bash
-# Docker bridge subnet for the OpenShell cluster network
+# Docker bridge subnet for the OpenShell network
docker network inspect $(docker network ls --filter name=openshell -q) --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'
# Host gateway IP visible from inside the container
-docker exec openshell-cluster- cat /etc/hosts | grep host.docker.internal
+docker exec <container> cat /etc/hosts | grep host.docker.internal
```
Adjust the source subnet, destination IP, or port to match your local Docker network layout.
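+As a sketch only, using the common default values above (substitute the subnet, IP, and port you actually discovered), the rule could look like this with UFW or iptables:
+```bash
+# UFW: allow the OpenShell Docker bridge subnet to reach the host inference port
+sudo ufw allow from 172.18.0.0/16 to 172.17.0.1 port 11434 proto tcp
+
+# iptables equivalent: insert an ACCEPT rule ahead of any REJECT rules
+sudo iptables -I INPUT -s 172.18.0.0/16 -d 172.17.0.1 -p tcp --dport 11434 -j ACCEPT
+```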
### Verify the Fix
-1. Re-run the cluster container check:
+1. Re-run the container network check:
```bash
- docker exec openshell-cluster- wget -qO- -T 5 http://host.docker.internal:11434/v1/models
+   docker exec <container> wget -qO- -T 5 http://host.docker.internal:11434/v1/models
```
2. Re-test from a sandbox:
diff --git a/.agents/skills/debug-openshell-cluster/SKILL.md b/.agents/skills/debug-openshell-cluster/SKILL.md
index b883631b7..fa01993d1 100644
--- a/.agents/skills/debug-openshell-cluster/SKILL.md
+++ b/.agents/skills/debug-openshell-cluster/SKILL.md
@@ -1,432 +1,204 @@
---
name: debug-openshell-cluster
-description: Debug why a openshell cluster failed to start or is unhealthy. Use when the user has a failed `openshell gateway start`, cluster health check failure, or wants to diagnose cluster infrastructure issues. Trigger keywords - debug cluster, cluster failing, cluster not starting, deploy failed, cluster troubleshoot, cluster health, cluster diagnose, why won't my cluster start, health check failed, gateway start failed, gateway not starting.
+description: Debug why an OpenShell gateway deployment is unhealthy, unreachable, or unable to create sandboxes. Use when the user has a gateway health failure, Docker/Podman runtime issue, Helm install failure, Kubernetes scheduling issue, TLS secret issue, VM driver issue, or sandbox startup problem. Trigger keywords - debug gateway, gateway failing, deployment failing, helm install failing, cluster health, gateway health, gateway not starting, health check failed, sandbox pending, docker driver, podman driver, vm driver.
---
-# Debug OpenShell Cluster
+# Debug OpenShell Gateway Deployment
-Diagnose why a openshell cluster failed to start after `openshell gateway start`.
+Diagnose a gateway and its selected compute platform. Do not assume OpenShell provisions Kubernetes or runs a k3s container. OpenShell targets a reachable gateway endpoint backed by Docker, Podman, Kubernetes, or the experimental VM driver.
-Use **only** `openshell` CLI commands (`openshell status`, `openshell doctor logs`, `openshell doctor exec`) to inspect and fix the cluster. Do **not** use raw `docker`, `podman`, `ssh`, or `kubectl` commands directly — always go through the `openshell doctor` interface. The CLI auto-resolves local vs remote gateways, so the same commands work everywhere.
+Use `openshell` first to identify the active endpoint. Then use the platform tools that match the gateway's compute driver: `docker`, `podman`, `kubectl`/`helm`, or VM driver logs.
## Overview
-`openshell gateway start` creates a container (via Docker or Podman) running k3s with the OpenShell server deployed via Helm. The build and deploy scripts use a container-engine abstraction layer (`tasks/scripts/container-engine.sh`) that auto-detects Docker or Podman and provides a unified `ce` command interface. Set `CONTAINER_ENGINE=docker` or `CONTAINER_ENGINE=podman` to override auto-detection. The deployment stages, in order, are:
+The target deployment flow is:
-1. **Pre-deploy check**: `openshell gateway start` in interactive mode prompts to **reuse** (keep volume, clean stale nodes) or **recreate** (destroy everything, fresh start). `mise run cluster` always recreates before deploy.
-2. Ensure cluster image is available (local build or remote pull)
-3. Create container network (`openshell-cluster`) and volume (`openshell-cluster-{name}`)
-4. Create and start a privileged container (`openshell-cluster-{name}`)
-5. Wait for k3s to generate kubeconfig (up to 60s)
-6. **Clean stale nodes**: Remove any `NotReady` k3s nodes left over from previous container instances that reused the same persistent volume
-7. **Prepare local images** (if `OPENSHELL_PUSH_IMAGES` is set): In `internal` registry mode, bootstrap waits for the in-cluster registry and pushes tagged images there. In `external` mode, bootstrap uses legacy `ctr -n k8s.io images import` push-mode behavior.
-8. **Reconcile TLS PKI**: Load existing TLS secrets from the cluster; if missing, incomplete, or malformed, generate fresh PKI (CA + server + client certs). Apply secrets to cluster. If rotation happened and the OpenShell workload is already running, rollout restart and wait for completion (failed rollout aborts deploy).
-9. **Store CLI mTLS credentials**: Persist client cert/key/CA locally for CLI authentication.
-10. Wait for cluster health checks to pass (up to 6 min):
- - k3s API server readiness (`/readyz`)
- - `openshell` statefulset ready in `openshell` namespace
- - TLS secrets `openshell-server-tls` and `openshell-client-tls` exist in `openshell` namespace
- - Sandbox supervisor binary exists at `/opt/openshell/bin/openshell-sandbox` (emits `HEALTHCHECK_MISSING_SUPERVISOR` marker if absent)
+1. Operator starts or deploys the gateway.
+2. Operator configures the compute driver.
+3. Operator provides TLS and SSH relay material for the deployment mode.
+4. The CLI registers a reachable gateway endpoint with `openshell gateway add`.
+5. The gateway creates sandboxes through the selected compute driver.
-For local deploys, metadata endpoint selection depends on the container engine and its connectivity:
-
-- default local Docker socket (`unix:///var/run/docker.sock`): `https://127.0.0.1:{port}` (default port 8080)
-- TCP Docker daemon (`DOCKER_HOST=tcp://:`): `https://:{port}` for non-loopback hosts
-- Podman (rootless): `https://127.0.0.1:{port}` via `host.containers.internal` (the Podman equivalent of `host.docker.internal`)
-
-The host port is configurable via `--port` on `openshell gateway start` (default 8080) and is stored in `ClusterMetadata.gateway_port`.
-
-The TCP host is also added as an extra gateway TLS SAN so mTLS hostname validation succeeds.
-
-The default cluster name is `openshell`. The container is `openshell-cluster-{name}`.
+For local evaluation only, TLS may be disabled and the gateway can be reached through `http://127.0.0.1:<port>`.
## Prerequisites
-- Docker or Podman must be running (locally or on the remote host). The build system auto-detects which engine is available; set `CONTAINER_ENGINE=docker` or `CONTAINER_ENGINE=podman` to override.
-- For rootless Podman: ensure the Podman socket is active (`systemctl --user start podman.socket`) and subuid/subgid ranges are configured (`sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $(whoami)`)
-- The `openshell` CLI must be available
-- For remote clusters: SSH access to the remote host
-
-## Tools Available
-
-All diagnostics go through three `openshell` commands. They auto-resolve local vs remote gateways — the same commands work for both:
-
-```bash
-# Quick connectivity check
-openshell status
-
-# Fetch container logs
-openshell doctor logs --lines 100
-openshell doctor logs --tail # stream live
-
-# Run any command inside the gateway container (KUBECONFIG is pre-configured)
-openshell doctor exec -- kubectl get pods -A
-openshell doctor exec -- kubectl -n openshell logs statefulset/openshell --tail=100
-openshell doctor exec -- cat /etc/rancher/k3s/registries.yaml
-openshell doctor exec -- df -h /
-openshell doctor exec -- free -h
-openshell doctor exec -- sh # interactive shell
-```
+- The `openshell` CLI must be available for endpoint checks.
+- Know the active gateway name and endpoint, or be able to inspect local gateway metadata.
+- Know the compute platform: Docker, Podman, Kubernetes, or VM.
+- For Kubernetes: `kubectl` must target the cluster that hosts OpenShell and Helm 3 must be available.
+- For Docker or Podman: the runtime socket must be reachable from the gateway host.
## Workflow
-When the user asks to debug a cluster failure, **run diagnostics automatically** through the steps below in order. Stop and report findings as soon as a root cause is identified. Do not ask the user to choose which checks to run.
+Run diagnostics in order and stop once the root cause is clear.
-### Determine Context
-
-Before running commands, establish:
-
-1. **Cluster name**: Default is `openshell`, giving container name `openshell-cluster-openshell`
-2. **Remote or local**: The `openshell doctor` commands auto-resolve this from gateway metadata — no special flags needed for the active gateway
-3. **Config directory**: `~/.config/openshell/gateways/{name}/`
-
-### Step 0: Quick Connectivity Check
-
-Run `openshell status` first. This immediately reveals:
-- Which gateway and endpoint the CLI is targeting
-- Whether the CLI can reach the server (mTLS handshake success/failure)
-- The server version if connected
-
-Common errors at this stage:
-- **`tls handshake eof`**: The server isn't running or mTLS credentials are missing/mismatched
-- **`connection refused`**: The container isn't running or port mapping is broken
-- **`No gateway configured`**: No gateway has been deployed yet
-
-### Step 1: Check Container Logs
-
-Get recent container logs to identify startup failures:
+### Step 1: Check CLI Reachability
```bash
-openshell doctor logs --lines 100
+openshell gateway info
+openshell status
```
-Look for:
-
-- DNS resolution failures in the entrypoint script
-- k3s startup errors (certificate issues, port binding failures)
-- Manifest copy errors from `/opt/openshell/manifests/`
-- `iptables` or `cgroup` errors (privilege/capability issues)
-- `Warning: br_netfilter does not appear to be loaded` — this is advisory only; many kernels work without the explicit module. Only act on it if you also see DNS failures or pod-to-service connectivity problems (see Common Failure Patterns).
-
-### Step 2: Check k3s Cluster Health
-
-Verify k3s itself is functional:
+Common findings:
-```bash
-# API server readiness
-openshell doctor exec -- kubectl get --raw="/readyz"
-
-# Node status
-openshell doctor exec -- kubectl get nodes -o wide
+- `No active gateway`: register one with `openshell gateway add <endpoint>`.
+- Connection refused: gateway process is not running, service exposure is wrong, or a port-forward/proxy is not active.
+- TLS/certificate errors: CLI mTLS bundle does not match the gateway CA, or the gateway is running with unexpected TLS settings.
-# All pods
-openshell doctor exec -- kubectl get pods -A -o wide
-```
+### Step 2: Identify the Compute Platform
-If `/readyz` fails, k3s is still starting or has crashed. Check container logs (Step 1).
+Use gateway metadata, deployment values, or the user's setup notes to identify the driver.
-If pods are in `CrashLoopBackOff`, `ImagePullBackOff`, or `Pending`, investigate those pods specifically.
+| Platform | Primary checks |
+|---|---|
+| Docker | Gateway process/container logs, Docker daemon health, sandbox containers, image pulls. |
+| Podman | Podman socket, rootless networking, sandbox containers, image pulls. |
+| Kubernetes | Helm release, StatefulSet, service, secrets, sandbox pods, events. |
+| VM | VM driver logs, rootfs availability, host virtualization support. |
-Also check for node pressure conditions that cause the kubelet to evict pods and reject scheduling:
+### Step 3: Check Docker-Backed Gateways
```bash
-# Check node conditions (DiskPressure, MemoryPressure, PIDPressure)
-openshell doctor exec -- kubectl get nodes -o jsonpath="{range .items[*]}{.metadata.name}{range .status.conditions[*]} {.type}={.status}{end}{\"\n\"}{end}"
-
-# Check disk usage inside the container
-openshell doctor exec -- df -h /
-
-# Check memory usage
-openshell doctor exec -- free -h
+docker info
+docker ps --filter name=openshell
+docker logs --tail=200 <gateway-container>
+openshell status
```
-If any pressure condition is `True`, pods will be evicted and new ones rejected. The bootstrap now detects `HEALTHCHECK_NODE_PRESSURE` markers from the health-check script and aborts early with a clear diagnosis. To fix: free disk/memory on the host, then recreate the gateway.
+Common findings:
-### Step 3: Check OpenShell Server StatefulSet
+- Docker daemon unavailable: start Docker Desktop or Docker Engine.
+- Gateway container stopped: inspect exit status and logs.
+- Sandbox image missing or pull denied: verify image reference and registry credentials.
+- Sandbox never registers: check gateway logs and supervisor callback endpoint.
-The OpenShell server is deployed via a HelmChart CR as a StatefulSet named `openshell` in the `openshell` namespace. Check its status:
+For source checkout development, restart the local gateway with:
```bash
-# StatefulSet status
-openshell doctor exec -- kubectl -n openshell get statefulset/openshell -o wide
-
-# OpenShell pod logs
-openshell doctor exec -- kubectl -n openshell logs statefulset/openshell --tail=100
-
-# Describe statefulset for events
-openshell doctor exec -- kubectl -n openshell describe statefulset/openshell
-
-# Helm install job logs (the job that installs the OpenShell chart)
-openshell doctor exec -- kubectl -n kube-system logs -l job-name=helm-install-openshell --tail=200
+mise run gateway:docker
```
-Common issues:
-
-- **Replicas 0/0**: The StatefulSet has been scaled to zero — no pods are running. This can happen after a failed deploy, manual scale-down, or Helm values misconfiguration. Fix: `openshell doctor exec -- kubectl -n openshell scale statefulset openshell --replicas=1`
-- **ImagePullBackOff**: The component image failed to pull. In `internal` mode, verify internal registry readiness and pushed image tags (Step 5). In `external` mode, check `/etc/rancher/k3s/registries.yaml` credentials/endpoints and DNS (Step 8). Default external registry is `ghcr.io/nvidia/openshell/` (public, no auth required). If using a private registry, ensure `--registry-username` and `--registry-token` (or `OPENSHELL_REGISTRY_USERNAME`/`OPENSHELL_REGISTRY_TOKEN`) were provided during deploy.
-- **CrashLoopBackOff**: The server is crashing. Check pod logs for the actual error.
-- **Pending**: Insufficient resources or scheduling constraints.
-
-### Step 4: Check Networking
-
-The OpenShell server is exposed via a NodePort service on port `30051`:
+### Step 4: Check Podman-Backed Gateways
```bash
-# Service status
-openshell doctor exec -- kubectl -n openshell get service/openshell
+podman info
+podman ps --filter name=openshell
+podman logs --tail=200 <gateway-container>
+openshell status
```
-Expected port: `30051/tcp` (mapped to configurable host port, default 8080; set via `--port` on deploy).
-
-### Step 5: Check Image Availability
+Common findings:
-Component images (server, sandbox) can reach kubelet via two paths:
+- Podman socket unavailable: start or expose the user socket (see the sketch after this list).
+- Rootless networking unavailable: inspect Podman network configuration.
+- Sandbox image missing or pull denied: verify image reference and registry credentials.
+- Supervisor cannot call back: check callback endpoint and gateway logs.
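+If the user socket is the problem, a minimal recovery sketch for systemd-based hosts (the socket unit name is assumed to be the standard `podman.socket`):
+```bash
+# Enable the rootless Podman API socket for the current user and confirm it is active
+systemctl --user enable --now podman.socket
+systemctl --user status podman.socket
+podman info
+```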
-**Local/external pull mode** (default local via `mise run cluster`): Local images are tagged to the configured local registry base (default `127.0.0.1:5000/openshell/*`), pushed to that registry, and pulled by k3s via `registries.yaml` mirror endpoint (typically `host.docker.internal:5000`). The `cluster` task pushes prebuilt local tags (`openshell/*:dev`, falling back to `localhost:5000/openshell/*:dev` or `127.0.0.1:5000/openshell/*:dev`).
-
-Gateway image builds now stage a partial Rust workspace from `deploy/docker/Dockerfile.images`. If cargo fails with a missing manifest under `/build/crates/...`, or an imported symbol exists locally but is missing in the image build, verify that every current gateway dependency crate (including `openshell-driver-docker`, `openshell-driver-kubernetes`, and `openshell-ocsf`) is copied into the staged workspace there.
+### Step 5: Check Kubernetes Helm Gateways
```bash
-# Verify image refs currently used by openshell deployment
-openshell doctor exec -- kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec.containers[*].image}"
-
-# Verify registry mirror/auth endpoint configuration
-openshell doctor exec -- cat /etc/rancher/k3s/registries.yaml
+helm -n openshell status openshell
+helm -n openshell get values openshell
+kubectl -n openshell get statefulset,pod,svc,pvc
+kubectl -n openshell logs statefulset/openshell --tail=200
+kubectl -n openshell rollout status statefulset/openshell
```
-**Legacy push mode**: Images are imported into the k3s containerd `k8s.io` namespace.
+Look for failed installs, unexpected values, missing namespace, wrong image tag, TLS settings that do not match the registered endpoint, and scheduling failures.
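+A quick way to surface scheduling and rollout problems, assuming the release and namespace names used throughout this guide:
+```bash
+# Describe the gateway StatefulSet and show the most recent namespace events
+kubectl -n openshell describe statefulset/openshell
+kubectl -n openshell get events --sort-by=.lastTimestamp | tail -n 30
+```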
-```bash
-# Check if images were imported into containerd (k3s default namespace is k8s.io)
-openshell doctor exec -- ctr -a /run/k3s/containerd/containerd.sock images ls | grep openshell
-```
-
-**External pull mode** (remote deploy, or local with `OPENSHELL_REGISTRY_HOST`/`IMAGE_REPO_BASE` pointing at a non-local registry): Images are pulled from an external registry at runtime. The entrypoint generates `/etc/rancher/k3s/registries.yaml`.
+Check required Helm deployment secrets:
```bash
-# Verify registries.yaml exists and has credentials
-openshell doctor exec -- cat /etc/rancher/k3s/registries.yaml
-
-# Test pulling an image manually from inside the cluster
-openshell doctor exec -- crictl pull ghcr.io/nvidia/openshell/gateway:latest
+kubectl -n openshell get secret \
+ openshell-ssh-handshake \
+ openshell-server-tls \
+ openshell-server-client-ca \
+ openshell-client-tls
```
-If `registries.yaml` is missing or has wrong values, verify env wiring (`OPENSHELL_REGISTRY_HOST`, `OPENSHELL_REGISTRY_INSECURE`, username/password for authenticated registries).
-
-### Step 6: Check mTLS / PKI
-
-TLS certificates are generated by the `openshell-bootstrap` crate (using `rcgen`) and stored as K8s secrets before the Helm release installs. There is no PKI job or cert-manager — certificates are applied directly via `kubectl apply`.
+For plaintext local evaluation, confirm the chart has:
```bash
-# Check if the three TLS secrets exist
-openshell doctor exec -- kubectl -n openshell get secret openshell-server-tls openshell-server-client-ca openshell-client-tls
-
-# Inspect server cert expiry (if openssl is available in the container)
-openshell doctor exec -- sh -c 'kubectl -n openshell get secret openshell-server-tls -o jsonpath="{.data.tls\.crt}" | base64 -d | openssl x509 -noout -dates 2>/dev/null || echo "openssl not available"'
-
-# Check if CLI-side mTLS files exist locally
-ls -la ~/.config/openshell/gateways//mtls/
+helm -n openshell get values openshell | grep -E 'disableTls|grpcEndpoint'
```
-On redeploy, bootstrap reuses existing secrets if they are valid PEM. If secrets are missing or malformed, fresh PKI is generated and the OpenShell workload is automatically restarted. If the rollout restart fails after rotation, the deploy aborts and CLI-side certs are not updated. Certificates use rcgen defaults (effectively never expire).
-
-If the local mTLS files are missing but the secrets exist in the cluster, you can extract them manually:
+Expected shape:
-```bash
-mkdir -p ~/.config/openshell/gateways//mtls
-openshell doctor exec -- kubectl -n openshell get secret openshell-client-tls -o jsonpath='{.data.ca\.crt}' | base64 -d > ~/.config/openshell/gateways//mtls/ca.crt
-openshell doctor exec -- kubectl -n openshell get secret openshell-client-tls -o jsonpath='{.data.tls\.crt}' | base64 -d > ~/.config/openshell/gateways//mtls/tls.crt
-openshell doctor exec -- kubectl -n openshell get secret openshell-client-tls -o jsonpath='{.data.tls\.key}' | base64 -d > ~/.config/openshell/gateways//mtls/tls.key
+```yaml
+server:
+ disableTls: true
+ grpcEndpoint: http://openshell.openshell.svc.cluster.local:8080
```
-Common mTLS issues:
-- **Secrets missing**: The `openshell` namespace may not have been created yet (Helm controller race). Bootstrap waits up to 2 minutes for the namespace.
-- **mTLS mismatch after manual secret deletion**: Delete all three secrets and redeploy — bootstrap will regenerate and restart the workload.
-- **CLI can't connect after redeploy**: Check that `~/.config/openshell/gateways//mtls/` contains `ca.crt`, `tls.crt`, `tls.key` and that they were updated at deploy time.
-- **Local mTLS files missing**: The gateway was deployed but CLI credentials weren't persisted (e.g., interrupted deploy). Extract from the cluster secret as shown above.
-
-### Step 7: Check Kubernetes Events
-
-Events catch scheduling failures, image pull errors, and resource issues:
+Check service exposure:
```bash
-openshell doctor exec -- kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
+kubectl -n openshell get svc openshell -o wide
+kubectl -n openshell get endpoints openshell
```
-Look for:
-
-- `FailedScheduling` — resource constraints
-- `ImagePullBackOff` / `ErrImagePull` — registry auth failure or DNS issue (check `/etc/rancher/k3s/registries.yaml`)
-- `CrashLoopBackOff` — application crashes
-- `OOMKilled` — memory limits too low
-- `FailedMount` — volume issues
-
-### Step 8: Check GPU Device Plugin and CDI (GPU gateways only)
-
-Skip this step for non-GPU gateways.
-
-The NVIDIA device plugin DaemonSet must be running and healthy before GPU sandboxes can be created. It uses CDI injection (`deviceListStrategy: cdi-cri`) to inject GPU devices into sandbox pods — no `runtimeClassName` is set on sandbox pods.
+For local port-forward testing:
```bash
-# DaemonSet status — numberReady must be >= 1
-openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin
-
-# Device plugin pod logs — look for "CDI" lines confirming CDI mode is active
-openshell doctor exec -- kubectl logs -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin --tail=50
-
-# List CDI devices registered by the device plugin (requires nvidia-ctk in the cluster image).
-# Device plugin CDI entries use the vendor string "k8s.device-plugin.nvidia.com" so entries
-# will be prefixed "k8s.device-plugin.nvidia.com/gpu=". If the list is empty, CDI spec
-# generation has not completed yet.
-openshell doctor exec -- nvidia-ctk cdi list
-
-# Verify CDI spec files were generated on the node
-openshell doctor exec -- ls /var/run/cdi/
-
-# Helm install job logs for the device plugin chart
-openshell doctor exec -- kubectl -n kube-system logs -l job-name=helm-install-nvidia-device-plugin --tail=100
-
-# Confirm a GPU sandbox pod has no runtimeClassName (CDI injection, not runtime class)
-openshell doctor exec -- kubectl get pod -n openshell -o jsonpath='{range .items[*]}{.metadata.name}{" runtimeClassName="}{.spec.runtimeClassName}{"\n"}{end}'
+kubectl -n openshell port-forward svc/openshell 8080:8080
+openshell gateway add http://127.0.0.1:8080 --local --name local
+openshell status
```
-Common issues:
-
-- **DaemonSet 0/N ready**: The device plugin chart may still be deploying (k3s Helm controller can take 1–2 min) or the pod is crashing. Check pod logs.
-- **`nvidia-ctk cdi list` returns no `k8s.device-plugin.nvidia.com/gpu=` entries**: CDI spec generation has not completed. The device plugin may still be starting or the `cdi-cri` strategy isn't active. Verify `deviceListStrategy: cdi-cri` is in the rendered Helm values.
-- **No CDI spec files at `/var/run/cdi/`**: Same as above — device plugin hasn't written CDI specs yet.
-- **`HEALTHCHECK_GPU_DEVICE_PLUGIN_NOT_READY` in health check logs**: Device plugin has no ready pods. Check DaemonSet events and pod logs.
-
-### Step 9: Check DNS Resolution
-
-DNS misconfiguration is a common root cause, especially on remote/Linux hosts:
+If the gateway is healthy but sandbox creation fails:
```bash
-# Check the resolv.conf k3s is using for pod DNS
-openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf
-
-# Test DNS from inside the container
-openshell doctor exec -- sh -c 'nslookup google.com || wget -q -O /dev/null http://google.com && echo "network ok" || echo "network unreachable"'
+kubectl -n openshell get pods
+kubectl -n openshell get events --sort-by=.lastTimestamp | tail -n 50
+kubectl -n openshell logs statefulset/openshell --tail=200
```
-Check the entrypoint's DNS decision in the container logs:
+Check the configured sandbox namespace:
```bash
-openshell doctor logs --lines 20
+helm -n openshell get values openshell | grep sandboxNamespace
```
-The entrypoint (`cluster-entrypoint.sh`) sets k3s pod DNS via a single strategy with one fallback:
-
-1. **Docker DNS proxy** (`setup_dns_proxy()`): reads the `DOCKER_OUTPUT` iptables chain to discover where Docker's embedded DNS (`127.0.0.11`) is actually listening, installs DNAT rules so k3s pods can reach it via the container's `eth0` IP, and writes `nameserver ` to `/etc/rancher/k3s/resolv.conf`. On success logs: `Setting up DNS proxy: :53 -> 127.0.0.11`.
-2. **Public DNS fallback**: if `setup_dns_proxy()` fails for any reason, logs `Warning: Could not discover Docker DNS ports from iptables` and writes `nameserver 8.8.8.8` / `nameserver 8.8.4.4` to `/etc/rancher/k3s/resolv.conf`.
-
-After either path, `verify_dns()` runs `nslookup` (5 retries) against the configured registry host (default `ghcr.io`). On failure it emits `DNS_PROBE_FAILED` into the logs. The Rust-side bootstrap (`runtime.rs` / `lib.rs`) watches for this marker and aborts early rather than spinning for the full 6-minute timeout.
+Then inspect sandbox resources in that namespace.
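+For example, if the values output showed a hypothetical `sandboxNamespace: openshell-sandboxes`:
+```bash
+# Inspect sandbox pods and recent events in the configured sandbox namespace
+kubectl -n openshell-sandboxes get pods -o wide
+kubectl -n openshell-sandboxes get events --sort-by=.lastTimestamp | tail -n 30
+```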
-**Important:** there are two independent DNS paths inside the cluster container. The entrypoint only writes `/etc/rancher/k3s/resolv.conf` (pod DNS). The container's system `/etc/resolv.conf` (used by containerd for image pulls and by `nslookup`) is set by Docker or Podman at container start and is never touched by the entrypoint. These can point at different nameservers.
+### Step 6: Check VM-Backed Gateways
-**Under Podman:** `setup_dns_proxy()` always fails — Podman does not create a `DOCKER_OUTPUT` chain. k3s pod DNS always falls back to `8.8.8.8`/`8.8.4.4`. The cluster container runs on a named Podman bridge network, which uses **netavark + aardvark-dns**. Aardvark-dns listens on the bridge gateway IP (e.g. `10.89.x.1`) and forwards external queries to the host resolver. Podman sets the container's system `/etc/resolv.conf` to that address — so `nslookup ghcr.io` works fine even when `8.8.8.8` is blocked. This means **`DNS_PROBE_FAILED` is never emitted under Podman** even when pod-level DNS is broken: the entrypoint's `verify_dns()` and the Rust-side `probe_container_dns()` both call bare `nslookup`, which hits aardvark-dns via the system resolv.conf, not the k3s resolv.conf. Pod DNS failures surface later as CoreDNS upstream forwarding timeouts, not as an early bootstrap abort.
+Use the VM driver logs and host diagnostics available in the user's environment. Verify:
-To debug Podman pod DNS failures: check `/etc/rancher/k3s/resolv.conf` confirms `8.8.8.8` is there, then verify `8.8.8.8:53` UDP is reachable from the host with `nc -vzu 8.8.8.8 53`.
+- The VM driver process is running and reachable by the gateway.
+- The runtime rootfs exists and matches the expected architecture.
+- Host virtualization support is enabled (see the host checks after this list).
+- The sandbox supervisor can establish its callback connection to the gateway.
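+The exact requirements depend on the VM driver; as a generic host-level sketch:
+```bash
+# Linux: the KVM device and modules should be present
+ls -l /dev/kvm
+lsmod | grep -E 'kvm(_intel|_amd)?'
+
+# macOS: the Hypervisor framework should report as supported
+sysctl -n kern.hv_support
+```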
-If DNS is broken, all image pulls from the distribution registry will fail, as will pods that need external network access.
-
-**Sandbox agent DNS is a separate enforcement layer.** The cluster DNS above controls what k3s pods and containerd can resolve. It has no bearing on what agent workloads inside sandboxes can reach. The sandbox supervisor (`openshell-sandbox`) creates an isolated Linux network namespace for each agent process with a veth pair, then installs iptables rules inside that namespace that REJECT all outbound UDP — including port 53 — except traffic destined for the supervisor's CONNECT proxy. Agent workloads cannot make raw DNS queries regardless of what nameservers are configured anywhere in the cluster. DNS must go through the HTTP CONNECT proxy. This is a kernel-enforced boundary at the netns level, not a configuration setting. The bypass monitor detects and logs any direct DNS attempt with the hint: `"DNS queries should route through the sandbox proxy; check resolver configuration"`.
-
-## Common Failure Patterns
-
-| Symptom | Likely Cause | Fix |
-|---------|-------------|-----|
-| `tls handshake eof` from `openshell status` | Server not running or mTLS credentials missing/mismatched | Check StatefulSet replicas (Step 3) and mTLS files (Step 6) |
-| StatefulSet `0/0` replicas | StatefulSet scaled to zero (failed deploy, manual scale-down, or Helm misconfiguration) | `openshell doctor exec -- kubectl -n openshell scale statefulset openshell --replicas=1` |
-| Local mTLS files missing | Deploy was interrupted before credentials were persisted | Extract from cluster secret `openshell-client-tls` (Step 6) |
-| Container not found | Image not built | `mise run docker:build:cluster` (local, works with both Docker and Podman) or re-deploy (remote) |
-| Container exited, OOMKilled | Insufficient memory | Increase host memory or reduce workload |
-| Container exited, non-zero exit | k3s crash, port conflict, privilege issue | Check `openshell doctor logs` for details |
-| `/readyz` fails | k3s still starting or crashed | Wait longer or check container logs for k3s errors |
-| OpenShell pods `Pending` | Insufficient CPU/memory for scheduling, or PVC not bound | `openshell doctor exec -- kubectl describe pod -n openshell` and `openshell doctor exec -- kubectl get pvc -n openshell` |
-| OpenShell pods `CrashLoopBackOff` | Server application error | `openshell doctor exec -- kubectl -n openshell logs statefulset/openshell` |
-| OpenShell pods `ImagePullBackOff` (push mode) | Images not imported or wrong containerd namespace | Check `openshell doctor exec -- ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images ls` (Step 5) |
-| OpenShell pods `ImagePullBackOff` (pull mode) | Registry auth or DNS issue | Check `openshell doctor exec -- cat /etc/rancher/k3s/registries.yaml` and DNS (Step 8) |
-| Image import fails | Corrupt tar stream or containerd not ready | Retry after k3s is fully started; check container logs |
-| Push mode images not found by kubelet | Imported into wrong containerd namespace | Must use `k3s ctr -n k8s.io images import`, not `k3s ctr images import` |
-| mTLS secrets missing | Bootstrap couldn't apply secrets (namespace not ready) | Check deploy logs and verify `openshell` namespace exists (Step 6) |
-| mTLS mismatch after redeploy | PKI rotated but workload not restarted, or rollout failed | Check that all three TLS secrets exist and that the openshell pod restarted after cert rotation (Step 6) |
-| Helm install job failed | Chart values error or dependency issue | `openshell doctor exec -- kubectl -n kube-system logs -l job-name=helm-install-openshell` |
-| NFD/GFD DaemonSets present (`node-feature-discovery`, `gpu-feature-discovery`) | Cluster was deployed before NFD/GFD were disabled (pre-simplify-device-plugin change) | These are harmless but add overhead. Clean up: `openshell doctor exec -- kubectl delete daemonset -n nvidia-device-plugin -l app.kubernetes.io/name=node-feature-discovery` and similarly for GFD. The `nvidia.com/gpu.present` node label is no longer applied; device plugin scheduling no longer requires it. |
-| Podman socket not found | Rootless Podman service not started | `systemctl --user start podman.socket` and verify with `podman info` |
-| Container creation fails with subuid/subgid error (Podman) | Missing user namespace ID mappings | `sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $(whoami)` then `podman system migrate` |
-| Cgroups v1 detected (Podman) | Podman driver requires unified cgroup hierarchy | Set `systemd.unified_cgroup_hierarchy=1` kernel parameter and reboot |
-| `--restart=always` ignored (Podman rootless) | Rootless Podman does not support `--restart=always` for containers | Use a systemd user service instead: `loginctl enable-linger $(whoami)` then create a `~/.config/systemd/user/` unit |
-| Architecture mismatch (remote) | Built on arm64, deploying to amd64 | Cross-build the image for the target architecture |
-| Port conflict | Another service on the configured gateway host port (default 8080) | Stop conflicting service or use `--port` on `openshell gateway start` to pick a different host port |
-| gRPC connect refused to `127.0.0.1:443` in CI | Docker daemon is remote (`DOCKER_HOST=tcp://...`) but metadata still points to loopback | Verify metadata endpoint host matches `DOCKER_HOST` and includes non-loopback host |
-| DNS failures inside container (Docker) | `setup_dns_proxy()` failed to find `DOCKER_OUTPUT` iptables chain | `openshell doctor logs --lines 20` for `Warning: Could not discover Docker DNS ports`; try `docker network prune -f` and redeploy |
-| Pod external name resolution fails (Podman) | `setup_dns_proxy()` always fails under Podman; k3s resolv.conf falls back to `8.8.8.8`/`8.8.4.4`, which is blocked on this network | `DNS_PROBE_FAILED` will NOT appear — entrypoint and Rust-side probes resolve via aardvark-dns (system `/etc/resolv.conf`), not the k3s resolv.conf; check `openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf` to confirm fallback; verify `8.8.8.8:53` UDP reachable from host via `nc -vzu 8.8.8.8 53` |
-| Pods can't reach kube-dns / ClusterIP services | `br_netfilter` not loaded; bridge traffic bypasses iptables DNAT rules | `sudo modprobe br_netfilter` on the host, then `echo br_netfilter \| sudo tee /etc/modules-load.d/br_netfilter.conf` to persist. Known to be required on Jetson Linux 5.15-tegra; other kernels (e.g. standard x86/aarch64 Linux) may have bridge netfilter built in and work without the module. The entrypoint logs a warning when `/proc/sys/net/bridge/bridge-nf-call-iptables` is absent but does not abort — only act on it if DNS or service connectivity is actually broken. |
-| Node DiskPressure / MemoryPressure / PIDPressure | Insufficient disk, memory, or PIDs on host | Free disk (`docker system prune -a --volumes`), increase memory, or expand host resources. Bootstrap auto-detects via `HEALTHCHECK_NODE_PRESSURE` marker |
-| Pods evicted with "The node had condition: [DiskPressure]" | Host disk full, kubelet evicting pods | Free disk space on host, then `openshell gateway destroy && openshell gateway start` |
-| `metrics-server` errors in logs | Normal k3s noise, not the root cause | These errors are benign — look for the actual failing health check component |
-| Stale NotReady nodes from previous deploys | Volume reused across container recreations | The deploy flow now auto-cleans stale nodes; if it still fails, manually delete NotReady nodes (see Step 2) or choose "Recreate" when prompted |
-| gRPC `UNIMPLEMENTED` for newer RPCs in push mode | Helm values still point at older pulled images instead of the pushed refs | Verify rendered `openshell-helmchart.yaml` uses the expected push refs (`server`, `sandbox`, `pki-job`) and not `:latest` |
-| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` target in `deploy/docker/Dockerfile.images`. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker |
-| `HEALTHCHECK_MISSING_SUPERVISOR` in health check logs | `/opt/openshell/bin/openshell-sandbox` not found in gateway container | Rebuild cluster image: `mise run docker:build:cluster`, then `openshell gateway destroy && openshell gateway start` |
-| `nvidia-ctk cdi list` returns no `k8s.device-plugin.nvidia.com/gpu=` entries | CDI specs not yet generated by device plugin | Device plugin may still be starting; wait and retry, or check pod logs (Step 8) |
-
-## Full Diagnostic Dump
-
-Run all diagnostics at once for a comprehensive report:
+Then run:
```bash
-echo "=== Connectivity Check ==="
openshell status
+openshell logs
+```
-echo "=== Container Logs (last 50 lines) ==="
-openshell doctor logs --lines 50
-
-echo "=== k3s Readiness ==="
-openshell doctor exec -- kubectl get --raw='/readyz'
-
-echo "=== Nodes ==="
-openshell doctor exec -- kubectl get nodes -o wide
-
-echo "=== Node Conditions ==="
-openshell doctor exec -- kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{range .status.conditions[*]} {.type}={.status}{end}{"\n"}{end}'
-
-echo "=== Disk Usage ==="
-openshell doctor exec -- df -h /
-
-echo "=== All Pods ==="
-openshell doctor exec -- kubectl get pods -A -o wide
-
-echo "=== Failing Pods ==="
-openshell doctor exec -- kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
-
-echo "=== OpenShell StatefulSet ==="
-openshell doctor exec -- kubectl -n openshell get statefulset/openshell -o wide
-
-echo "=== OpenShell Service ==="
-openshell doctor exec -- kubectl -n openshell get service/openshell
-
-echo "=== TLS Secrets ==="
-openshell doctor exec -- kubectl -n openshell get secret openshell-server-tls openshell-server-client-ca openshell-client-tls
-
-echo "=== Recent Events ==="
-openshell doctor exec -- kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
-
-echo "=== Helm Install OpenShell Logs ==="
-openshell doctor exec -- kubectl -n kube-system logs -l job-name=helm-install-openshell --tail=100
-
-echo "=== Registry Configuration ==="
-openshell doctor exec -- cat /etc/rancher/k3s/registries.yaml
-
-echo "=== Supervisor Binary ==="
-openshell doctor exec -- ls -la /opt/openshell/bin/openshell-sandbox
-
-echo "=== DNS Configuration ==="
-openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf
+## Common Failure Patterns
-# GPU gateways only
-echo "=== GPU Device Plugin ==="
-openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin
-openshell doctor exec -- nvidia-ctk cdi list
-```
+| Symptom | Likely cause | Check |
+|---|---|---|
+| `openshell status` fails | Gateway endpoint unreachable or auth mismatch | `openshell gateway info`, gateway logs |
+| Gateway starts but sandbox create fails | Compute driver cannot reach runtime | Docker/Podman/Kubernetes/VM driver logs |
+| Docker or Podman sandbox never registers | Wrong callback endpoint or supervisor startup failure | Gateway logs and sandbox container logs |
+| Kubernetes gateway pod pending | PVC unbound, taint, selector, or insufficient resources | `kubectl -n openshell describe pod <pod>` |
+| Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs statefulset/openshell` |
+| CLI TLS error | Local mTLS bundle does not match server cert/CA | Check `~/.config/openshell/gateways/<name>/mtls/` |
+| Image pull failure | Gateway or sandbox image cannot be pulled | Runtime events and image pull credentials |
+
+## Reporting
+
+When handing results back to the user, include:
+
+- Active gateway endpoint and auth mode.
+- Compute platform and driver.
+- Gateway process or workload status.
+- Recent gateway log summary.
+- Missing or malformed TLS or SSH relay material.
+- Service exposure status.
+- Sandbox workload status.
+- The exact command that failed and the shortest fix.
diff --git a/.agents/skills/generate-sandbox-policy/SKILL.md b/.agents/skills/generate-sandbox-policy/SKILL.md
index 7149bbd61..95767fe3f 100644
--- a/.agents/skills/generate-sandbox-policy/SKILL.md
+++ b/.agents/skills/generate-sandbox-policy/SKILL.md
@@ -439,7 +439,7 @@ network_policies:
#
```
-The `filesystem_policy`, `landlock`, and `process` sections above are sensible defaults. Tell the user these are defaults and may need adjustment for their environment. Cluster inference is configured separately through `openshell cluster inference set/get`. The generated `network_policies` block is the primary output.
+The `filesystem_policy`, `landlock`, and `process` sections above are sensible defaults. Tell the user these are defaults and may need adjustment for their environment. Gateway inference is configured separately through `openshell inference set/get`. The generated `network_policies` block is the primary output.
If the user provides a file path, write to it. Otherwise, ask where to place it. A common convention is a project-local policy file (e.g., `sandbox-policy.yaml`) passed to `openshell sandbox create --policy ` or set via the `OPENSHELL_SANDBOX_POLICY` env var.
diff --git a/.agents/skills/generate-sandbox-policy/examples.md b/.agents/skills/generate-sandbox-policy/examples.md
index ede511134..b6acbee8b 100644
--- a/.agents/skills/generate-sandbox-policy/examples.md
+++ b/.agents/skills/generate-sandbox-policy/examples.md
@@ -858,7 +858,7 @@ network_policies:
- { path: /usr/local/bin/claude }
```
-The agent notes that `filesystem_policy`, `landlock`, and `process` are sensible defaults that may need adjustment, and that cluster inference is configured separately via `openshell cluster inference set/get` rather than an `inference` policy block.
+The agent notes that `filesystem_policy`, `landlock`, and `process` are sensible defaults that may need adjustment, and that gateway inference is configured separately via `openshell inference set/get` rather than an `inference` policy block.
---
diff --git a/.agents/skills/openshell-cli/SKILL.md b/.agents/skills/openshell-cli/SKILL.md
index d639e82b5..39af09e08 100644
--- a/.agents/skills/openshell-cli/SKILL.md
+++ b/.agents/skills/openshell-cli/SKILL.md
@@ -1,6 +1,6 @@
---
name: openshell-cli
-description: Guide agents through using the OpenShell CLI (openshell) for sandbox management, provider configuration, policy iteration, BYOC workflows, and inference routing. Covers basic through advanced multi-step workflows. Trigger keywords - openshell, sandbox create, sandbox connect, logs, provider create, policy set, policy get, image push, forward, port forward, BYOC, bring your own container, use openshell, run openshell, CLI usage, manage sandbox, manage provider, gateway start, gateway select.
+description: Guide agents through using the OpenShell CLI (openshell) for sandbox management, gateway registration, provider configuration, policy iteration, BYOC workflows, and inference routing. Covers basic through advanced multi-step workflows. Trigger keywords - openshell, sandbox create, sandbox connect, logs, provider create, policy set, policy get, image push, forward, port forward, BYOC, bring your own container, use openshell, run openshell, CLI usage, manage sandbox, manage provider, gateway add, gateway select.
---
# OpenShell CLI
@@ -26,8 +26,9 @@ This is your primary fallback. Use it freely -- the CLI's help output is authori
## Prerequisites
- `openshell` is on the PATH (install via `cargo install --path crates/openshell-cli`)
-- Docker is running (required for gateway operations and BYOC)
-- For remote clusters: SSH access to the target host
+- A reachable OpenShell gateway backed by Docker, Podman, Kubernetes, or the experimental VM driver
+- Docker running (required only for BYOC local builds and Docker-backed development workflows)
+- For Kubernetes deployments: `kubectl` and Helm access to the target cluster
## Command Reference
@@ -37,29 +38,23 @@ See [cli-reference.md](cli-reference.md) for the full command tree with all flag
## Workflow 1: Getting Started
-Use this workflow when no cluster exists yet and the user wants to get a sandbox running for the first time.
+Use this workflow when the user has a gateway endpoint and wants to get a sandbox running for the first time.
-### Step 1: Bootstrap a cluster
+### Step 1: Register a gateway
```bash
-openshell gateway start
+openshell gateway add http://127.0.0.1:8080 --local --name local
```
-This provisions a local k3s cluster in Docker. The CLI will prompt interactively if a cluster already exists. The cluster is automatically set as the active gateway.
+Use an `http://` endpoint only for trusted local port-forwarding or a protected private path. For a gateway behind an authenticated reverse proxy, register its HTTPS endpoint with `openshell gateway add https://gateway.example.com`.
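+For example, for a Kubernetes-hosted gateway, a trusted local path can be a `kubectl` port-forward; the namespace, service name, and port below follow the conventions used elsewhere in this guide, so adjust them to your deployment:
+```bash
+# Forward the gateway service to localhost, then register that endpoint
+kubectl -n openshell port-forward svc/openshell 8080:8080 &
+openshell gateway add http://127.0.0.1:8080 --local --name local
+```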
-For remote deployment:
-
-```bash
-openshell gateway start --remote user@host --ssh-key ~/.ssh/id_rsa
-```
-
-### Step 2: Verify the cluster
+### Step 2: Verify the gateway
```bash
openshell status
```
-Confirm the cluster is reachable and shows a version.
+Confirm the gateway is reachable and shows a version.
### Step 3: Create a sandbox
@@ -69,7 +64,7 @@ The simplest way to get a sandbox running:
openshell sandbox create
```
-This creates a sandbox with defaults and drops you into an interactive shell. The CLI auto-bootstraps a cluster if none exists.
+This creates a sandbox with defaults and drops you into an interactive shell.
**Shortcut for known tools**: When the trailing command is a recognized tool, the CLI auto-creates the required provider from local credentials:
@@ -325,7 +320,7 @@ openshell sandbox create --from ./Dockerfile --name my-app
The `--from` flag accepts a Dockerfile path, a directory containing a Dockerfile, a full image reference (e.g. `myregistry.com/img:tag`), or a community sandbox name (e.g. `openclaw`).
-When given a Dockerfile or directory, the image is built locally via Docker and imported directly into the cluster's containerd runtime. No external registry needed.
+When given a Dockerfile or directory, the image is built locally via Docker and delivered through the selected compute driver. Docker and Podman-backed gateways can use local images directly. Kubernetes gateways usually need the image available to the cluster through a registry or driver-supported image push path.
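+One hedged path for a Kubernetes-backed gateway is to push the image to a registry the cluster can already pull from and pass the full reference; the registry host and image name below are examples only:
+```bash
+# Build and push to a registry reachable from the cluster, then reference it directly
+docker build -t registry.example.com/team/my-app:dev .
+docker push registry.example.com/team/my-app:dev
+openshell sandbox create --from registry.example.com/team/my-app:dev --name my-app
+```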
When `--from` is specified, the CLI:
- Clears default `run_as_user`/`run_as_group` (custom images may not have the `sandbox` user)
@@ -471,8 +466,8 @@ openshell inference get
### How sandboxes use it
- Agents send HTTPS requests to `inference.local`.
-- The sandbox intercepts those requests locally and routes them through the cluster inference config.
-- Sandbox policy is separate from cluster inference configuration.
+- The sandbox intercepts those requests locally and routes them through the gateway inference config.
+- Sandbox policy is separate from gateway inference configuration.
---
@@ -482,35 +477,29 @@ openshell inference get
```bash
openshell gateway select # See all gateways (no args shows list)
-openshell gateway select my-cluster # Switch active gateway
+openshell gateway select production # Switch active gateway
openshell status # Verify connectivity
```
-### Lifecycle
+### Registration
```bash
-openshell gateway start # Start local cluster
-openshell gateway stop # Stop (preserves state)
-openshell gateway start # Restart (reuses state)
-openshell gateway destroy # Destroy permanently
+openshell gateway add http://127.0.0.1:8080 --local --name local
+openshell gateway add https://gateway.example.com --name production
+openshell gateway destroy --name local # Remove local registration
```
-### Remote clusters
+### Platform-specific deployment inspection
```bash
-# Deploy to remote host
-openshell gateway start --remote user@host --ssh-key ~/.ssh/id_rsa --name remote-cluster
-
-# View gateway container logs
-openshell doctor logs --name remote-cluster
-
-# Run kubectl inside the remote gateway container
-openshell doctor exec --name remote-cluster -- kubectl get pods -A
-
-# Get cluster info
-openshell gateway info --name remote-cluster
+# Inspect a Kubernetes Helm release and gateway pod
+helm -n openshell status openshell
+kubectl -n openshell get pods,svc
+kubectl -n openshell logs statefulset/openshell --tail=100
```
+For Docker, Podman, and VM-backed gateways, inspect the gateway process or container logs and the selected runtime directly.
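+For a Docker-backed gateway, for example (the container name varies with your setup):
+```bash
+docker ps --filter name=openshell
+docker logs --tail=100 <gateway-container>
+```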
+
---
## Self-Teaching via `--help`
@@ -539,8 +528,8 @@ $ openshell sandbox upload --help
| Task | Command |
|------|---------|
-| Deploy local cluster | `openshell gateway start` |
-| Check cluster health | `openshell status` |
+| Register local port-forwarded gateway | `openshell gateway add http://127.0.0.1:8080 --local --name local` |
+| Check gateway health | `openshell status` |
| List/switch gateways | `openshell gateway select [name]` |
| Create sandbox (interactive) | `openshell sandbox create` |
| Create sandbox with tool | `openshell sandbox create -- claude` |
@@ -560,7 +549,7 @@ $ openshell sandbox upload --help
| Configure gateway inference | `openshell inference set --provider P --model M` |
| View gateway inference | `openshell inference get` |
| Delete sandbox | `openshell sandbox delete ` |
-| Destroy cluster | `openshell gateway destroy` |
+| Remove gateway registration | `openshell gateway destroy --name <name>` |
| Self-teach any command | `openshell --help` |
## Companion Skills
@@ -568,6 +557,6 @@ $ openshell sandbox upload --help
| Skill | When to use |
|-------|------------|
| `generate-sandbox-policy` | Creating or modifying policy YAML content (network rules, L7 inspection, access presets, endpoint configuration) |
-| `debug-openshell-cluster` | Diagnosing cluster startup or health failures |
+| `debug-openshell-cluster` | Diagnosing gateway deployment, runtime, or health failures |
| `debug-inference` | Diagnosing `inference.local`, host-backed local inference, and provider base URL issues |
| `tui-development` | Developing features for the OpenShell TUI (`openshell term`) |
diff --git a/.agents/skills/openshell-cli/cli-reference.md b/.agents/skills/openshell-cli/cli-reference.md
index bdbb35572..6a22617d2 100644
--- a/.agents/skills/openshell-cli/cli-reference.md
+++ b/.agents/skills/openshell-cli/cli-reference.md
@@ -25,8 +25,8 @@ Quick-reference for the `openshell` command-line interface. For workflow guidanc
```
openshell
├── gateway
-│ ├── start [opts]
-│ ├── stop [opts]
+│   ├── add <endpoint> [opts]
+│ ├── login [name]
│ ├── destroy [opts]
│ ├── info [--name]
│ └── select [name]
@@ -73,40 +73,37 @@ openshell
## Gateway Commands
-### `openshell gateway start`
+### `openshell gateway add <endpoint>`
-Provision or start a cluster (local or remote).
-
-| Flag | Default | Description |
-|------|---------|-------------|
-| `--name ` | `openshell` | Cluster name |
-| `--remote ` | none | SSH destination for remote deployment |
-| `--ssh-key ` | none | SSH private key for remote deployment |
-| `--port ` | 8080 | Host port mapped to gateway |
-| `--gateway-host ` | none | Override gateway host in metadata |
-| `--recreate` | false | Destroy and recreate from scratch if a gateway already exists (skips interactive prompt) |
-
-### `openshell gateway stop`
-
-Stop a cluster (preserves state for later restart).
+Register an existing gateway endpoint.
| Flag | Description |
|------|-------------|
-| `--name ` | Cluster name (defaults to active) |
-| `--remote ` | SSH destination |
-| `--ssh-key ` | SSH private key |
+| `--name <name>` | Gateway name |
+| `--local` | Register a local endpoint, commonly a trusted port-forward |
+| `--remote <user@host>` | Register a remote gateway associated with an SSH destination |
+| `--ssh-key <path>` | SSH private key for the remote host |
+
+Examples:
+
+- `openshell gateway add http://127.0.0.1:8080 --local --name local`
+- `openshell gateway add https://gateway.example.com --name production`
### `openshell gateway destroy`
-Destroy a cluster and all its state. Same flags as `stop`.
+Remove a gateway registration. For Helm deployments this affects local CLI metadata only; it does not uninstall the Helm release.
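+
+For example, to drop a registration and, separately, uninstall a Helm-backed deployment (the release and namespace follow the Helm examples elsewhere in this reference):
+
+```bash
+# Remove the CLI-side registration only
+openshell gateway destroy --name production
+
+# Separately uninstall the Kubernetes release, if desired
+helm -n openshell uninstall openshell
+```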
+
+### `openshell gateway login [name]`
+
+Refresh browser-based authentication for a gateway behind an edge proxy.
### `openshell gateway info`
-Show deployment details: endpoint and remote host.
+Show gateway details: endpoint, auth mode, and remote host metadata when present.
| Flag | Description |
|------|-------------|
-| `--name ` | Cluster name (defaults to active) |
+| `--name <name>` | Gateway name (defaults to active) |
### `openshell gateway select [name]`
@@ -118,7 +115,7 @@ Set the active gateway. Writes to `~/.config/openshell/active_gateway`. When cal
### `openshell doctor logs`
-Fetch logs from the gateway Docker container.
+Fetch logs from a legacy gateway container when metadata supports it. For Helm deployments, prefer `kubectl -n openshell logs statefulset/openshell`.
| Flag | Default | Description |
|------|---------|-------------|
@@ -130,8 +127,7 @@ Fetch logs from the gateway Docker container.
### `openshell doctor exec -- `
-Run a command inside the gateway container with KUBECONFIG pre-configured.
-Launches an interactive `docker exec` session (tunnelled over SSH for remote gateways).
+Run a command inside a legacy gateway container when metadata supports it. For Helm deployments, prefer direct `kubectl` and `helm` commands.
| Flag | Default | Description |
|------|---------|-------------|
@@ -140,9 +136,9 @@ Launches an interactive `docker exec` session (tunnelled over SSH for remote gat
| `--ssh-key ` | none | SSH private key for remote gateways |
Examples:
-- `openshell doctor exec -- kubectl get pods -A`
-- `openshell doctor exec -- k9s`
-- `openshell doctor exec -- sh` (interactive shell)
+- `kubectl -n openshell get pods`
+- `kubectl -n openshell logs statefulset/openshell`
+- `helm -n openshell status openshell`
---
@@ -158,7 +154,7 @@ Show server connectivity and version for the active gateway.
### `openshell sandbox create [OPTIONS] [-- COMMAND...]`
-Create a sandbox, wait for readiness, then connect or execute the trailing command. Auto-bootstraps a cluster if none exists.
+Create a sandbox through the active gateway, wait for readiness, then connect or execute the trailing command.
| Flag | Description |
|------|-------------|
@@ -169,12 +165,8 @@ Create a sandbox, wait for readiness, then connect or execute the trailing comma
| `--provider ` | Provider to attach (repeatable) |
| `--policy ` | Path to custom policy YAML |
| `--forward ` | Forward local port to sandbox (keeps the sandbox alive) |
-| `--remote ` | SSH destination for auto-bootstrap |
-| `--ssh-key ` | SSH private key for auto-bootstrap |
| `--tty` | Force pseudo-terminal allocation |
| `--no-tty` | Disable pseudo-terminal allocation |
-| `--bootstrap` | Auto-bootstrap a gateway if none is available (skips interactive prompt) |
-| `--no-bootstrap` | Never auto-bootstrap; error immediately if no gateway is available |
| `--auto-providers` | Auto-create missing providers from local credentials (skips interactive prompt) |
| `--no-auto-providers` | Never auto-create providers; skip missing providers silently |
| `[-- COMMAND...]` | Command to execute (defaults to interactive shell) |
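+
+A typical invocation (the policy path is illustrative):
+
+```bash
+# Create a sandbox with a custom policy, then run claude inside it
+openshell sandbox create --policy ./policy.yaml -- claude
+```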
diff --git a/.agents/skills/triage-issue/SKILL.md b/.agents/skills/triage-issue/SKILL.md
index b54b92e52..54e726a5d 100644
--- a/.agents/skills/triage-issue/SKILL.md
+++ b/.agents/skills/triage-issue/SKILL.md
@@ -110,7 +110,7 @@ Based on the sub-agent's analysis, also attempt to validate the report directly:
- For bug reports: check the relevant code paths, look for the described failure mode
- For feature requests: assess feasibility against the existing architecture
-- For cluster/infrastructure issues: reference the `debug-openshell-cluster` skill's known failure patterns
+- For gateway deployment or infrastructure issues: reference the `debug-openshell-cluster` skill's known failure patterns
- For inference and provider-topology issues: reference the `debug-inference` skill's known failure patterns
- For CLI/usage issues: reference the `openshell-cli` skill's command reference
diff --git a/.agents/skills/tui-development/SKILL.md b/.agents/skills/tui-development/SKILL.md
index 80984759c..bbd9f1ecd 100644
--- a/.agents/skills/tui-development/SKILL.md
+++ b/.agents/skills/tui-development/SKILL.md
@@ -19,7 +19,7 @@ The OpenShell TUI is a ratatui-based terminal UI for the OpenShell platform. It
- `tonic` with TLS — gRPC client for the OpenShell gateway
- `tokio` — async runtime for event loop, spawned tasks, and mpsc channels
- `openshell-core` — proto-generated types (`OpenShellClient`, request/response structs)
- - `openshell-bootstrap` — cluster discovery (`list_clusters()`)
+ - `openshell-bootstrap` — gateway discovery (`list_gateways()`)
- **Theme:** Adaptive dark/light via `Theme` struct — NVIDIA-branded green accents. Controlled by `--theme` flag, `OPENSHELL_THEME` env var, or auto-detection.
## 2. Domain Object Hierarchy
@@ -33,7 +33,7 @@ Gateway (discovered via openshell_bootstrap::list_gateways())
```
- **Gateways** are discovered from on-disk config via `openshell_bootstrap::list_gateways()`. Each gateway has a name, endpoint, and local/remote flag.
-- **Sandboxes** belong to the active cluster. Fetched via `ListSandboxes` gRPC call with a periodic tick refresh. Each sandbox has: `id`, `name`, `phase`, `created_at_ms`, and `spec.template.image`.
+- **Sandboxes** belong to the active gateway. Fetched via `ListSandboxes` gRPC call with a periodic tick refresh. Each sandbox has: `id`, `name`, `phase`, `created_at_ms`, and `spec.template.image`.
- **Logs** belong to a single sandbox. Initial batch fetched via `GetSandboxLogs` (500 lines), then live-tailed via `WatchSandbox` with `follow_logs: true`.
The **title bar** always reflects this hierarchy, reading left-to-right from general to specific:
@@ -90,7 +90,7 @@ Every frame renders four vertical regions:
```
┌─────────────────────────────────────────────┐
-│ Title bar (1 row) — brand + cluster + context│
+│ Title bar (1 row) — brand + gateway + context│
├─────────────────────────────────────────────┤
│ │
│ Main content (flexible) │
@@ -270,7 +270,7 @@ TUI actions should parallel `openshell` CLI commands so users have familiar ment
| `openshell sandbox list` | Sandbox table on Dashboard |
| `openshell sandbox delete ` | `[d]` on sandbox detail, then `[y]` to confirm |
| `openshell logs ` | `[l]` on sandbox detail to open log viewer |
-| `openshell status` | Status in title bar + cluster list |
+| `openshell status` | Status in title bar + gateway list |
When adding new TUI features, check what the CLI offers and maintain consistency.
@@ -364,7 +364,7 @@ lib.rs (event loop, gRPC, async tasks)
├── theme.rs (colors + styles)
└── ui/
├── mod.rs (draw dispatcher, chrome)
- ├── dashboard.rs (cluster list + sandbox table layout)
+ ├── dashboard.rs (gateway list + sandbox table layout)
├── sandboxes.rs (sandbox table widget)
├── sandbox_detail.rs (detail view)
└── sandbox_logs.rs (log viewer)
@@ -415,7 +415,7 @@ All gRPC calls use a 5-second timeout:
tokio::time::timeout(Duration::from_secs(5), client.health(req)).await
```
-The connect timeout for cluster switching is 10 seconds with HTTP/2 keepalive at 10-second intervals.
+The connect timeout for gateway switching is 10 seconds with HTTP/2 keepalive at 10-second intervals.
### Log streaming lifecycle
@@ -443,7 +443,7 @@ The connect timeout for cluster switching is 10 seconds with HTTP/2 keepalive at
# Build the crate
cargo build -p openshell-tui
-# Run the TUI against the active cluster
+# Run the TUI against the active gateway
mise run term
# Run with cargo-watch for hot-reload during development
@@ -466,13 +466,21 @@ mise run pre-commit
### Gateway changes
-If you change sandbox or server code that affects the backend, redeploy the gateway:
+If you change sandbox or server code that affects the backend, restart or redeploy the gateway for the compute platform you are using.
+
+For Docker-backed local development:
+
+```bash
+mise run gateway:docker
+```
+
+For Kubernetes Helm deployments:
```bash
-mise run cluster:deploy all
+helm upgrade --install openshell deploy/helm/openshell --namespace openshell
```
-To pick up new sandbox images after changing sandbox code, delete the pod manually so it gets recreated:
+For Kubernetes, pick up new sandbox images after changing sandbox code by deleting the pod manually so it gets recreated:
```bash
kubectl delete pod <pod-name> -n <namespace>
diff --git a/AGENTS.md b/AGENTS.md
index 858e104f5..93062fd5f 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -34,7 +34,7 @@ These pipelines connect skills into end-to-end workflows. Individual skill files
| `crates/openshell-sandbox/` | Sandbox runtime | Container supervision, policy-enforced egress routing |
| `crates/openshell-policy/` | Policy engine | Filesystem, network, process, and inference constraints |
| `crates/openshell-router/` | Privacy router | Privacy-aware LLM routing |
-| `crates/openshell-bootstrap/` | Cluster bootstrap | K3s cluster setup, image loading, mTLS PKI |
+| `crates/openshell-bootstrap/` | Gateway metadata | Gateway registration metadata, mTLS bundle storage, legacy bootstrap helpers |
| `crates/openshell-ocsf/` | OCSF logging | OCSF v1.7.0 event types, builders, shorthand/JSONL formatters, tracing layers |
| `crates/openshell-core/` | Shared core | Common types, configuration, error handling |
| `crates/openshell-providers/` | Provider management | Credential provider backends |
@@ -154,7 +154,7 @@ ocsf_emit!(event);
## Sandbox Infra Changes
-- If you change sandbox infrastructure, ensure `mise run sandbox` succeeds.
+- If you change sandbox infrastructure, ensure the relevant sandbox E2E lane succeeds (see the example below).
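+
+For example, run only the lane that matches the driver the change touches:
+
+```bash
+mise run e2e:docker   # Docker-backed gateway
+mise run e2e:podman   # Podman-backed gateway
+mise run e2e:vm       # VM compute driver
+```
+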
## Commits
@@ -172,7 +172,7 @@ ocsf_emit!(event);
- `mise run pre-commit` — Lint, format, license headers. Run before every commit.
- `mise run test` — Unit test suite. Run after code changes.
-- `mise run e2e` — End-to-end tests against a running cluster. Run for infrastructure, sandbox, or policy changes.
+- `mise run e2e` — End-to-end tests against a running gateway. Run for infrastructure, sandbox, or policy changes.
- `mise run ci` — Full local CI (lint + compile/type checks + tests). Run before opening a PR.
## Python
@@ -185,7 +185,7 @@ ocsf_emit!(event);
## Cluster Infrastructure Changes
-- If you change cluster bootstrap infrastructure (e.g., `openshell-bootstrap` crate, `deploy/docker/Dockerfile.images`, `cluster-entrypoint.sh`, `cluster-healthcheck.sh`, deploy logic in `openshell-cli`), update the `debug-openshell-cluster` skill in `.agents/skills/debug-openshell-cluster/SKILL.md` to reflect those changes.
+- If you change gateway deployment infrastructure (e.g., Helm values/templates, gateway image packaging, or deploy logic in `openshell-cli`), update the `debug-openshell-cluster` skill in `.agents/skills/debug-openshell-cluster/SKILL.md` to reflect those changes.
## Documentation
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 3a01d75eb..fad43e82f 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -41,7 +41,7 @@ All open issues are actionable — if it's in the issue tracker, it's ready to b
This project ships with [agent skills](#agent-skills-for-contributors) that can diagnose problems, explore the codebase, generate policies, and walk you through common workflows. Before filing an issue:
1. Clone the repo and point your coding agent at it.
-2. Load the relevant skill - `debug-openshell-cluster` for cluster problems, `debug-inference` for inference setup problems, `openshell-cli` for usage questions, `generate-sandbox-policy` for policy help.
+2. Load the relevant skill - `debug-openshell-cluster` for gateway or deployment problems, `debug-inference` for inference setup problems, `openshell-cli` for usage questions, `generate-sandbox-policy` for policy help.
3. Have your agent investigate. Let it run diagnostics, read the architecture docs, and attempt a fix.
4. If the agent cannot resolve it, open an issue **with the agent's diagnostic output attached**. The issue template requires this.
@@ -49,7 +49,7 @@ This project ships with [agent skills](#agent-skills-for-contributors) that can
- A real bug that your agent confirmed and could not fix.
- A feature proposal with a design — not a "please build this" request.
-- An infrastructure problem that the `debug-openshell-cluster` skill could not resolve.
+- An infrastructure problem that the `debug-openshell-cluster` skill could not resolve.
- An inference setup problem that the `debug-inference` skill could not resolve.
- Security vulnerabilities must follow [SECURITY.md](SECURITY.md) — **not** GitHub issues.
@@ -66,7 +66,7 @@ Skills live in `.agents/skills/`. Your agent's harness can discover and load the
| Category | Skill | Purpose |
| --------------- | ------------------------- | --------------------------------------------------------------------------------------------------- |
| Getting Started | `openshell-cli` | CLI usage, sandbox lifecycle, provider management, BYOC workflows |
-| Getting Started | `debug-openshell-cluster` | Diagnose cluster startup failures and health issues |
+| Getting Started | `debug-openshell-cluster` | Diagnose gateway deployment and health issues |
| Getting Started | `debug-inference` | Diagnose `inference.local`, host-backed local inference, and direct external inference setup issues |
| Contributing | `create-spike` | Investigate a problem, produce a structured GitHub issue |
| Contributing | `build-from-issue` | Plan and implement work from a GitHub issue (maintainer workflow) |
@@ -151,8 +151,8 @@ cargo build -p openshell-prover --features bundled-z3
# One-time trust
mise trust
-# Launch a sandbox (deploys a cluster if one isn't running)
-mise run sandbox
+# Run a standalone Docker-backed gateway for local development
+mise run gateway:docker
```
## Building the `openshell` CLI
@@ -169,16 +169,16 @@ openshell --help
openshell sandbox create -- codex
```
-### Cluster debugging helpers
+### Kubernetes debugging helpers
-Two additional scripts in `scripts/bin/` provide gateway-aware wrappers for cluster debugging:
+Two additional scripts in `scripts/bin/` provide gateway-aware wrappers for Kubernetes debugging:
| Script | What it does |
| --------- | ------------------------------------------------------------------------------------ |
-| `kubectl` | Runs `kubectl` inside the active gateway's k3s container via `openshell doctor exec` |
-| `k9s` | Runs `k9s` inside the active gateway's k3s container via `openshell doctor exec` |
+| `kubectl` | Runs `kubectl` using the active development context |
+| `k9s` | Opens a Kubernetes terminal UI for the active development context |
-These work for both local and remote gateways (SSH is handled automatically). Examples:
+Use these when developing against a Kubernetes-backed gateway. Examples:
```bash
kubectl get pods -A
@@ -193,8 +193,8 @@ These are the primary `mise` tasks for day-to-day development:
| Task | Purpose |
| ------------------ | ------------------------------------------------------- |
-| `mise run cluster` | Bootstrap or incremental deploy |
-| `mise run sandbox` | Create a sandbox on the running cluster |
+| `mise run gateway:docker` | Run a standalone Docker-backed gateway for local development |
+| `mise run sandbox` | Create or reconnect to the dev sandbox |
| `mise run test` | Default test suite |
| `mise run e2e` | Default end-to-end test lane |
| `mise run ci` | Full local CI checks (lint, compile/type checks, tests) |
diff --git a/README.md b/README.md
index aaa851452..a73c7a4e9 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
OpenShell is the safe, private runtime for autonomous AI agents. It provides sandboxed execution environments that protect your data, credentials, and infrastructure — governed by declarative YAML policies that prevent unauthorized file access, data exfiltration, and uncontrolled network activity.
-OpenShell is built agent-first. The project ships with agent skills for everything from cluster debugging to policy generation, and we expect contributors to use them.
+OpenShell is built agent-first. The project ships with agent skills for everything from gateway troubleshooting to policy generation, and we expect contributors to use them.
> **Alpha software — single-player mode.** OpenShell is proof-of-life: one developer, one environment, one gateway. We are building toward multi-tenant enterprise deployments, but the starting point is getting your own environment up and running. Expect rough edges. Bring your agent.
@@ -16,7 +16,8 @@ OpenShell is built agent-first. The project ships with agent skills for everythi
### Prerequisites
-- **Docker** — Docker Desktop (or a Docker daemon) must be running.
+- **A reachable OpenShell gateway** backed by Docker, Podman, Kubernetes, or the experimental MicroVM runtime.
+- **Docker** — Docker Desktop or Docker Engine must be running for local Docker-backed gateways and local image builds.
### Install
@@ -34,14 +35,30 @@ uv tool install -U openshell
Both methods install the latest stable release by default. To install a specific version, set `OPENSHELL_VERSION` (binary) or pin the version with `uv tool install openshell==`. A [`dev` release](https://github.com/NVIDIA/OpenShell/releases/tag/dev) is also available that tracks the latest commit on `main`.
+### Register a gateway
+
+For local development from a source checkout, start a Docker-backed gateway:
+
+```bash
+git clone https://github.com/NVIDIA/OpenShell.git
+cd OpenShell
+mise trust
+mise run gateway:docker
+```
+
+In another terminal, register the endpoint:
+
+```bash
+openshell gateway add http://127.0.0.1:18080 --local --name local
+openshell status
+```
+
### Create a sandbox
```bash
openshell sandbox create -- claude # or opencode, codex, copilot
```
-A gateway is created automatically on first use. To deploy on a remote host instead, pass `--remote user@host` to the create command.
-
The sandbox container includes the following tools by default:
| Category | Tools |
@@ -99,7 +116,7 @@ OpenShell isolates each sandbox in its own container with policy-enforced egress
| **Policy Engine** | Enforces filesystem, network, and process constraints from application layer down to kernel. |
| **Privacy Router** | Privacy-aware LLM routing that keeps sensitive context on sandbox compute. |
-Under the hood, all these components run as a [K3s](https://k3s.io/) Kubernetes cluster inside a single Docker container — no separate K8s install required. The `openshell gateway` commands take care of provisioning the container and cluster.
+Under the hood, OpenShell runs a gateway control plane that manages sandbox lifecycle through a configured compute driver. Supported compute platforms include Docker, Podman, Kubernetes through the Helm chart in `deploy/helm/openshell`, and the experimental MicroVM runtime.
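+
+For the Kubernetes path, a minimal install from a source checkout looks like this (the namespace and release name are assumptions; adjust for your cluster):
+
+```bash
+helm upgrade --install openshell deploy/helm/openshell --namespace openshell --create-namespace
+```
+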
## Protection Layers
@@ -128,7 +145,7 @@ OpenShell can pass host GPUs into sandboxes for local inference, fine-tuning, or
openshell sandbox create --gpu --from [gpu-enabled-sandbox] -- claude
```
-The CLI auto-bootstraps a GPU-enabled gateway on first use, auto-selecting CDI when available and otherwise falling back to Docker's NVIDIA GPU request path (`--gpus all`). GPU intent is also inferred automatically for community images with `gpu` in the name.
+Docker-backed GPU sandboxes auto-select CDI when available and otherwise fall back to Docker's NVIDIA GPU request path (`--gpus all`). GPU intent is also inferred automatically for community images with `gpu` in the name.
**Requirements:** NVIDIA drivers and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) must be installed on the host. The sandbox image itself must include the appropriate GPU drivers and libraries for your workload — the default `base` image does not. See the [BYOC example](https://github.com/NVIDIA/OpenShell/tree/main/examples/bring-your-own-container) for building a custom sandbox image with GPU support.
@@ -171,7 +188,7 @@ openshell term
-The TUI gives you a live, keyboard-driven view of your cluster. Navigate with `Tab` to switch panels, `j`/`k` to move through lists, `Enter` to select, and `:` for command mode. Cluster health and sandbox status auto-refresh every two seconds.
+The TUI gives you a live, keyboard-driven view of your gateway and sandboxes. Navigate with `Tab` to switch panels, `j`/`k` to move through lists, `Enter` to select, and `:` for command mode. Gateway health and sandbox status auto-refresh every two seconds.
## Community Sandboxes and BYOC
@@ -195,7 +212,7 @@ cd OpenShell
# Point your agent here — it will discover the skills in .agents/skills/ automatically
```
-Your agent can load skills for CLI usage (`openshell-cli`), cluster troubleshooting (`debug-openshell-cluster`), inference troubleshooting (`debug-inference`), policy generation (`generate-sandbox-policy`), and more. See [CONTRIBUTING.md](CONTRIBUTING.md) for the full skills table.
+Your agent can load skills for CLI usage (`openshell-cli`), gateway troubleshooting (`debug-openshell-cluster`), inference troubleshooting (`debug-inference`), policy generation (`generate-sandbox-policy`), and more. See [CONTRIBUTING.md](CONTRIBUTING.md) for the full skills table.
## Built With Agents
diff --git a/architecture/README.md b/architecture/README.md
index 36b0a4978..0a8b098d5 100644
--- a/architecture/README.md
+++ b/architecture/README.md
@@ -6,11 +6,11 @@ This project is a platform for securely running AI agents in isolated sandbox en
This platform solves that problem by creating sandboxed execution environments where agents run with exactly the permissions they need and nothing more. Every sandbox is governed by a policy that defines which files the agent can access, which network hosts it can reach, and which system operations it can perform. All outbound network traffic is forced through a controlled proxy that inspects and enforces access rules in real time.
-The platform packages the entire infrastructure -- orchestration gateway, sandbox runtime, networking, and Kubernetes cluster -- into a single deployable unit. A user can go from zero to a running, secured sandbox in two commands. The system handles cluster provisioning, credential management, policy enforcement, and secure remote access without requiring the user to configure Kubernetes, networking, or security policies manually.
+The platform runs a gateway control plane and uses a configured compute driver to run agents in isolated sandbox environments. Supported compute platforms include Docker, Podman, Kubernetes, and the experimental MicroVM runtime. The system handles credential management, policy enforcement, and secure remote access while leaving runtime and cluster provisioning to the operator.
## How the Subsystems Fit Together
-The following diagram shows how the major subsystems interact at a high level. Users interact through the CLI, which communicates with a central gateway. The gateway manages sandbox lifecycle in Kubernetes, and each sandbox enforces its own policy locally. Inference API calls to `inference.local` are routed locally within the sandbox by an embedded inference router, without traversing the gateway at request time.
+The following diagram shows how the major subsystems interact at a high level. Users interact through the CLI, which communicates with a central gateway. The gateway manages sandbox lifecycle through a compute driver, and each sandbox enforces its own policy locally. Inference API calls to `inference.local` are routed locally within the sandbox by an embedded inference router, without traversing the gateway at request time.
```mermaid
flowchart TB
@@ -18,11 +18,12 @@ flowchart TB
CLI["Command-Line Interface"]
end
- subgraph CLUSTER["Kubernetes Cluster (single Docker container)"]
+ subgraph PLATFORM["Compute Platform"]
SERVER["Gateway / Control Plane"]
DB["Database (SQLite or Postgres)"]
+ DRIVER["Compute Driver
(Docker, Podman,
Kubernetes, VM)"]
- subgraph SBX["Sandbox Pod"]
+ subgraph SBX["Sandbox Workload"]
SUPERVISOR["Sandbox Supervisor"]
PROXY["Network Proxy"]
ROUTER["Inference Router"]
@@ -40,7 +41,8 @@ flowchart TB
CLI -- "gRPC / HTTPS" --> SERVER
CLI -- "SSH over HTTP CONNECT" --> SERVER
SERVER -- "CRUD + Watch" --> DB
- SERVER -- "Create / Delete Pods" --> SBX
+ SERVER -- "Create / Delete / Watch" --> DRIVER
+ DRIVER -- "Manage sandbox workload" --> SBX
SUPERVISOR -- "Fetch Policy + Credentials + Inference Bundle" --> SERVER
SUPERVISOR -- "Spawn + Restrict" --> CHILD
CHILD -- "All network traffic" --> PROXY
@@ -92,42 +94,41 @@ For more detail, see [Sandbox Architecture](sandbox.md) (Proxy Routing section).
### Gateway / Control Plane
-The gateway is the central orchestration service. It provides the API that the CLI talks to and manages the lifecycle of sandboxes in Kubernetes.
+The gateway is the central orchestration service. It provides the API that the CLI talks to and manages sandbox lifecycle through the selected compute driver.
Key responsibilities:
-- **Sandbox lifecycle management**: Creating, deleting, and monitoring sandboxes. When a user creates a sandbox, the gateway provisions a Kubernetes pod with the correct container image, policy, and environment configuration.
+- **Sandbox lifecycle management**: Creating, deleting, and monitoring sandboxes. When a user creates a sandbox, the gateway asks the selected compute driver to provision a workload with the correct image, policy, and environment configuration.
- **gRPC and HTTP APIs**: The gateway exposes a gRPC API for structured operations (sandbox CRUD, provider management, SSH session creation) and HTTP endpoints for health checks. Both protocols share a single network port through protocol multiplexing.
- **Data persistence**: Sandbox records, provider credentials, SSH sessions, and inference routes are stored in a database (SQLite by default, Postgres as an option).
- **TLS termination**: The gateway supports TLS with automatic protocol negotiation, so gRPC and HTTP clients can connect securely on the same port.
- **SSH tunnel gateway**: The gateway provides the entry point for SSH connections into sandboxes (see Sandbox Connect below).
- **Real-time updates**: The gateway streams sandbox status changes to the CLI, so users see live progress when a sandbox is starting up.
-- **Inference bundle resolution**: The gateway stores cluster-level inference configuration (provider name + model ID) and resolves it into bundles containing endpoint URLs, API keys, supported protocols, provider type, and auth metadata. Sandboxes fetch these bundles at startup and refresh them periodically. The gateway does not proxy inference traffic at request time -- it only provides configuration.
+- **Inference bundle resolution**: The gateway stores gateway-level inference configuration (provider name + model ID) and resolves it into bundles containing endpoint URLs, API keys, supported protocols, provider type, and auth metadata. Sandboxes fetch these bundles at startup and refresh them periodically. The gateway does not proxy inference traffic at request time -- it only provides configuration.
For more detail, see [Gateway Architecture](gateway.md).
-### Cluster Bootstrap and Infrastructure
+### Gateway Deployment Infrastructure
-The entire platform -- Kubernetes, the gateway, networking, and pre-loaded container images -- is packaged into a single Docker container. This means the only dependency a user needs is Docker.
+The gateway can run as a standalone process, a container, or a Kubernetes workload installed by the Helm chart in `deploy/helm/openshell`. Operators supply the compute platform and configure the driver that the gateway should use for sandboxes.
-The bootstrap system handles:
+The deployment layer handles:
-- **Provisioning**: Creating the Docker container with an embedded Kubernetes (k3s) cluster, pre-loaded with all required images and Helm charts.
-- **Local and remote deployment**: The same bootstrap flow works for local development (Docker on the user's machine) and remote deployment (Docker on a remote host, accessed via SSH).
-- **Health monitoring**: After starting the cluster, the system polls for readiness -- waiting for Kubernetes to start, for components to deploy, and for health checks to pass.
-- **Credential management**: If TLS is enabled, the bootstrap process automatically extracts client certificates and stores them locally for the CLI to use.
-- **Idempotent operation**: Running the deploy command again is safe. It reuses existing infrastructure or recreates only what changed.
+- **Gateway startup**: Running the gateway process or installing the Kubernetes Helm release.
+- **Runtime configuration**: Supplying image references, service exposure, sandbox runtime configuration, callback endpoints, and TLS material.
+- **Credential distribution**: Providing TLS and SSH relay material to the gateway and sandbox workloads.
+- **Compute driver configuration**: Selecting Docker, Podman, Kubernetes, or VM-backed sandbox execution.
-The target onboarding experience is two commands:
+The target onboarding experience is:
```bash
-pip install
-openshell sandbox create --remote user@host -- claude
+mise run gateway:docker
+openshell gateway add <url>
```
-The first command installs the CLI. The second command bootstraps the cluster on the remote host (if needed) and launches a sandbox running the specified agent.
+The first command is one example of the local development path; Kubernetes operators run `helm upgrade --install openshell ./deploy/helm/openshell --namespace openshell` instead. The second command registers the reachable gateway endpoint with the CLI.
-For more detail, see [Cluster Bootstrap Architecture](cluster-single-node.md).
+For more detail, see [Gateway Deployment and Compute Platforms](gateway-single-node.md).
### Sandbox Connect (SSH Tunneling)
@@ -171,7 +172,7 @@ The inference routing system transparently intercepts AI inference API calls fro
**How it works end-to-end:**
-1. An operator configures cluster-level inference via `openshell cluster inference set --provider --model `. This stores a reference to the named provider and model on the gateway.
+1. An operator configures gateway-level inference via `openshell inference set --provider <provider> --model <model>`. This stores a reference to the named provider and model on the gateway.
2. When a sandbox starts, the supervisor fetches an inference bundle from the gateway via the `GetInferenceBundle` RPC. The gateway resolves the stored provider reference into a complete route: endpoint URL, API key, supported protocols, provider type, and auth metadata. The sandbox refreshes this bundle eagerly in the background every 5 seconds by default (override with `OPENSHELL_ROUTE_REFRESH_INTERVAL_SECS`).
3. The agent sends requests to `https://inference.local` using standard OpenAI or Anthropic SDK calls.
4. The sandbox proxy intercepts the HTTPS CONNECT to `inference.local` (bypassing OPA policy evaluation), TLS-terminates the connection using the sandbox's ephemeral CA, and parses the HTTP request.
@@ -184,9 +185,9 @@ The inference routing system transparently intercepts AI inference API calls fro
- The sandbox never sees the real API key for the backend -- credential isolation is maintained through the gateway's bundle resolution.
- Routing is explicit via `inference.local`; OPA network policy is not involved in inference routing.
- Provider-specific behavior (auth header style, default headers, supported protocols) is centralized in `InferenceProviderProfile` definitions in `openshell-core`. Supported inference provider types are openai, anthropic, and nvidia.
-- Cluster inference is managed via CLI (`openshell cluster inference set/get`).
+- Gateway inference is managed via CLI (`openshell inference set/get`).
-**Inference routes** are stored on the gateway as protobuf objects (`InferenceRoute` in `proto/inference.proto`). Cluster inference uses a managed singleton route entry keyed by `inference.local` and configured from provider + model settings. Endpoint, credentials, and protocols are resolved from the referenced provider record at bundle fetch time, so rotating a provider's API key takes effect on the next bundle refresh without reconfiguring the route.
+**Inference routes** are stored on the gateway as protobuf objects (`InferenceRoute` in `proto/inference.proto`). Gateway inference uses a managed singleton route entry keyed by `inference.local` and configured from provider + model settings. Endpoint, credentials, and protocols are resolved from the referenced provider record at bundle fetch time, so rotating a provider's API key takes effect on the next bundle refresh without reconfiguring the route.
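+
+For example, configuring and inspecting the managed route (provider and model names are placeholders):
+
+```bash
+openshell inference set --provider <provider-name> --model <model-id>
+openshell inference get
+```
+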
**Components involved:**
@@ -196,20 +197,19 @@ The inference routing system transparently intercepts AI inference API calls fro
| Inference pattern detection | `crates/openshell-sandbox/src/l7/inference.rs` | Matches HTTP method + path against known inference API patterns |
| Local inference router | `crates/openshell-router/src/lib.rs` | Selects a compatible route by protocol and proxies to the backend |
| Provider profiles | `crates/openshell-core/src/inference.rs` | Centralized auth, headers, protocols, and endpoint defaults per provider type |
-| Gateway inference service | `crates/openshell-server/src/inference.rs` | Stores cluster inference config, resolves bundles with credentials from provider records |
+| Gateway inference service | `crates/openshell-server/src/inference.rs` | Stores gateway inference config, resolves bundles with credentials from provider records |
| Proto definitions | `proto/inference.proto` | `ClusterInferenceConfig`, `ResolvedRoute`, bundle RPCs |
### Container and Build System
-The platform produces three container images:
+The platform publishes the gateway image and relies on community-maintained sandbox images:
| Image | Purpose |
|---|---|
-| **Sandbox** | Runs inside each sandbox pod. Contains the sandbox supervisor binary, Python runtime, and agent tooling. Uses a multi-user setup (privileged supervisor, restricted agent user). |
| **Gateway** | Runs the control plane. Contains the gateway binary, database migrations, and an embedded SSH client for sandbox management. |
-| **Cluster** | An airgapped Kubernetes image with k3s, pre-loaded sandbox and gateway images, Helm charts, and an API gateway. This is the single container that users deploy. |
+| **Sandbox** | Runs each sandbox workload. Maintained in the OpenShell Community repository or supplied by the user. |
-Builds use multi-stage Dockerfiles with caching to keep rebuild times fast. A Helm chart handles Kubernetes-level configuration (service ports, health checks, security contexts, resource limits). Build automation is managed through mise tasks.
+Builds use multi-stage Dockerfiles with caching to keep rebuild times fast. The Helm chart handles Kubernetes-level configuration such as service ports, health checks, security contexts, resource limits, storage, and TLS secret mounts. Docker, Podman, and VM-backed deployments configure equivalent runtime concerns through their driver-specific gateway configuration.
For more detail, see [Container Management](build-containers.md).
@@ -222,7 +222,7 @@ Sandbox behavior is governed by policies written in YAML and evaluated by an emb
- **Process privileges**: What user/group the agent runs as.
- **L7 inspection rules**: Protocol-level constraints on HTTP API calls for specific endpoints.
-Inference routing to `inference.local` is configured separately at the cluster level and does not require network policy entries. The OPA engine evaluates only explicit network policies; `inference.local` connections bypass OPA entirely and are handled by the proxy's dedicated inference interception path.
+Inference routing to `inference.local` is configured separately at the gateway level and does not require network policy entries. The OPA engine evaluates only explicit network policies; `inference.local` connections bypass OPA entirely and are handled by the proxy's dedicated inference interception path.
Policies are not intended to be hand-edited by end users in normal operation. They are associated with sandboxes at creation time and fetched by the sandbox supervisor at startup via gRPC. For development and testing, policies can also be loaded from local files. A gateway-global policy can override all sandbox policies via `openshell policy set --global`.
@@ -234,17 +234,17 @@ For more detail on the policy language, see [Policy Language](security-policy.md
The CLI is the primary way users interact with the platform. It provides commands organized into four groups:
-- **Gateway management** (`openshell gateway`): Deploy, stop, destroy, and inspect clusters. Supports both local and remote (SSH) targets.
+- **Gateway management** (`openshell gateway`): Register, select, and inspect gateway endpoints.
- **Sandbox management** (`openshell sandbox`): Create sandboxes (with optional file upload and provider auto-discovery), connect to sandboxes via SSH, and delete sandboxes.
-- **Top-level commands**: `openshell status` (cluster health), `openshell logs` (sandbox logs), `openshell forward` (port forwarding), `openshell policy` (sandbox policy management), `openshell settings` (effective sandbox settings and global/sandbox key updates).
+- **Top-level commands**: `openshell status` (gateway health), `openshell logs` (sandbox logs), `openshell forward` (port forwarding), `openshell policy` (sandbox policy management), `openshell settings` (effective sandbox settings and global/sandbox key updates).
- **Provider management** (`openshell provider`): Create, update, list, and delete external service credentials.
-- **Inference management** (`openshell cluster inference`): Configure cluster-level inference by specifying a provider and model. The gateway resolves endpoint and credential details from the named provider record.
+- **Inference management** (`openshell inference`): Configure gateway-level inference by specifying a provider and model. The gateway resolves endpoint and credential details from the named provider record.
The CLI resolves which gateway to operate on through a priority chain: explicit `--gateway` flag, then the `OPENSHELL_GATEWAY` environment variable, then the active gateway set by `openshell gateway select`. Gateway names are exposed to shell completion from local metadata, and `openshell gateway select` opens an interactive chooser on a TTY while falling back to a printed list in non-interactive use. The CLI supports TLS client certificates for mutual authentication with the gateway.
## How Users Get Started
-The onboarding flow is designed to require minimal setup. Docker is the only prerequisite.
+The onboarding flow starts from a reachable gateway endpoint.
**Step 1: Install the CLI.**
@@ -258,22 +258,7 @@ pip install
openshell sandbox create -- claude
```
-If no cluster exists, the CLI automatically bootstraps one. It provisions a local Kubernetes cluster inside a Docker container, waits for it to become healthy, discovers the user's AI provider credentials from local configuration files, uploads them to the gateway, and launches a sandbox running the specified agent -- all from a single command.
-
-For remote deployment (running the sandbox on a different machine):
-
-```bash
-openshell sandbox create --remote user@hostname -- claude
-```
-
-This performs the same bootstrap flow on the remote host via SSH.
-
-For development and testing against the current checkout, use
-`scripts/remote-deploy.sh` instead. That helper syncs the local repository to
-an SSH-reachable machine, builds the CLI and Docker images on the remote host,
-and then runs `openshell gateway start` there. It defaults to secure gateway
-startup and only enables `--plaintext`, `--disable-gateway-auth`, or
-`--recreate` when explicitly requested.
+Before creating a sandbox, start or deploy the gateway on the selected compute platform and register the reachable endpoint with the CLI.
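+
+For example, for the local Docker-backed path (the endpoint and name are illustrative; use the address your gateway actually listens on):
+
+```bash
+mise run gateway:docker
+openshell gateway add http://127.0.0.1:8080 --local --name local
+openshell status
+```
+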
**Step 3: Connect to a running sandbox.**
@@ -287,10 +272,10 @@ This opens an interactive SSH session into the sandbox, with all provider creden
| Document | Description |
|---|---|
-| [Cluster Bootstrap](cluster-single-node.md) | How the platform bootstraps a Kubernetes cluster from a single Docker container, for local and remote targets. |
+| [Gateway Deployment and Compute Platforms](gateway-single-node.md) | How the gateway runs across Docker, Podman, Kubernetes with Helm, and the experimental VM driver. |
| [Gateway Architecture](gateway.md) | The control plane gateway: API multiplexing, gRPC services, persistence, TLS, and sandbox orchestration. |
| [Gateway Communication](gateway-deploy-connect.md) | How the CLI resolves a gateway and communicates with it over mTLS, plaintext HTTP/2, or an edge-authenticated WebSocket tunnel. |
-| [Gateway Security](gateway-security.md) | mTLS enforcement, PKI bootstrap, certificate hierarchy, and the gateway trust model. |
+| [Gateway Security](gateway-security.md) | mTLS enforcement, PKI provisioning, certificate hierarchy, and the gateway trust model. |
| [Sandbox Architecture](sandbox.md) | The sandbox execution environment: policy enforcement, Landlock, seccomp, network namespaces, and the network proxy. |
| [Container Management](build-containers.md) | Container images, Dockerfiles, Helm charts, build tasks, and CI/CD. |
| [Sandbox Connect](sandbox-connect.md) | SSH tunneling into sandboxes through the gateway. |
diff --git a/architecture/build-containers.md b/architecture/build-containers.md
index e619534b2..8d4a1ec8f 100644
--- a/architecture/build-containers.md
+++ b/architecture/build-containers.md
@@ -1,25 +1,43 @@
-# Container Images
+# Container Images and Deployment Packaging
-OpenShell produces two container images, both published for `linux/amd64` and `linux/arm64`.
+OpenShell publishes the gateway container image and keeps Kubernetes Helm packaging in this repository. Sandbox images are maintained in the separate OpenShell Community repository.
-## Gateway (`openshell/gateway`)
+## Gateway Image
-The gateway runs the control plane API server. It is deployed as a StatefulSet inside the cluster container via a bundled Helm chart.
+The gateway image runs the control plane API server. Kubernetes deployments use it through the Helm chart. Standalone container deployments can use the same image with driver-specific runtime configuration.
- **Docker target**: `gateway` in `deploy/docker/Dockerfile.images`
- **Registry**: `ghcr.io/nvidia/openshell/gateway:latest`
-- **Pulled when**: Cluster startup (the Helm chart triggers the pull)
-- **Entrypoint**: `openshell-gateway --port 8080` (gRPC + HTTP, mTLS)
+- **Pulled when**: Helm install or upgrade, or standalone container deployment
+- **Entrypoint**: `openshell-gateway --port 8080`
-## Cluster (`openshell/cluster`)
+The image contains the gateway binary and database migrations. Runtime configuration is supplied by Helm values and Kubernetes secrets for Kubernetes, or by driver-specific configuration for standalone gateway deployments.
-The cluster image is a single-container Kubernetes distribution that bundles the Helm charts, Kubernetes manifests, and the `openshell-sandbox` supervisor binary needed to bootstrap the control plane.
+## Helm Chart
-- **Docker target**: `cluster` in `deploy/docker/Dockerfile.images`
-- **Registry**: `ghcr.io/nvidia/openshell/cluster:latest`
-- **Pulled when**: `openshell gateway start`
+The Helm chart at `deploy/helm/openshell` owns Kubernetes deployment concerns:
-The supervisor binary (`openshell-sandbox`) is built by the shared `supervisor-builder` stage in `deploy/docker/Dockerfile.images` and placed at `/opt/openshell/bin/openshell-sandbox`. It is exposed to sandbox pods at runtime via a read-only `hostPath` volume mount — it is not baked into sandbox images.
+- Gateway StatefulSet and persistent volume claim.
+- Service account, RBAC, and service.
+- Gateway service exposure.
+- TLS secret mounts and environment variables.
+- Sandbox namespace, default sandbox image, and callback endpoint configuration.
+- NetworkPolicy restricting sandbox SSH ingress to the gateway.
+
+The chart remains the supported deployment artifact for Kubernetes. OpenShell no longer publishes a k3s cluster image that bundles Kubernetes, manifests, and charts into a single Docker container.
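+
+A typical install or upgrade from a source checkout (`my-values.yaml` is a hypothetical override file; the namespace and release name are assumptions):
+
+```bash
+helm upgrade --install openshell deploy/helm/openshell \
+  --namespace openshell --create-namespace \
+  -f my-values.yaml
+helm -n openshell status openshell
+```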
+
+## Supervisor Delivery
+
+The `openshell-sandbox` supervisor is delivered by the selected compute driver:
+
+| Driver | Supervisor delivery |
+|---|---|
+| Kubernetes | Sandbox pod image or Kubernetes driver pod template configuration. |
+| Docker | Local supervisor binary or supervisor image extraction configured by the gateway. |
+| Podman | Read-only OCI image volume from the `supervisor-output` image. |
+| VM | Embedded in the VM runtime rootfs. |
+
+The old k3s hostPath model, where the supervisor binary was side-loaded from a node filesystem baked into the cluster image, is no longer part of the target architecture.
## Standalone Gateway Binary
@@ -29,7 +47,6 @@ OpenShell also publishes a standalone `openshell-gateway` binary as a GitHub rel
- **Artifact name**: `openshell-gateway-.tar.gz`
- **Targets**: `x86_64-unknown-linux-gnu`, `aarch64-unknown-linux-gnu`, `aarch64-apple-darwin`
- **Release workflows**: `.github/workflows/release-dev.yml`, `.github/workflows/release-tag.yml`
-- **Installer**: None yet. The binary is a manual-download asset.
Both the standalone artifact and the deployed container image use the `openshell-gateway` binary.
@@ -44,7 +61,7 @@ OpenShell also publishes Python wheels for `linux/amd64`, `linux/arm64`, and mac
## Sandbox Images
-Sandbox images are **not built in this repository**. They are maintained in the [openshell-community](https://github.com/nvidia/openshell-community) repository and pulled from `ghcr.io/nvidia/openshell-community/sandboxes/` at runtime.
+Sandbox images are not built in this repository. They are maintained in the [openshell-community](https://github.com/nvidia/openshell-community) repository and pulled from `ghcr.io/nvidia/openshell-community/sandboxes/` at runtime.
The default sandbox image is `ghcr.io/nvidia/openshell-community/sandboxes/base:latest`. To use a named community sandbox:
@@ -56,38 +73,14 @@ This pulls `ghcr.io/nvidia/openshell-community/sandboxes/:latest`.
## Local Development
-`mise run cluster` is the primary development command. It bootstraps a cluster if one doesn't exist, then performs incremental deploys for subsequent runs.
-
-The incremental deploy (`cluster-deploy-fast.sh`) fingerprints local Git changes and only rebuilds components whose files have changed:
+Use the workflow that matches the driver you are changing:
-| Changed files | Rebuild triggered |
+| Area | Typical local command |
|---|---|
-| Cargo manifests, proto definitions, cross-build script | Gateway + supervisor |
-| `crates/openshell-server/*`, `crates/openshell-ocsf/*`, `deploy/docker/Dockerfile.images` | Gateway |
-| `crates/openshell-sandbox/*`, `crates/openshell-policy/*` | Supervisor |
-| `deploy/helm/openshell/*` | Helm upgrade |
-
-When no local changes are detected, the command is a no-op.
-
-**Gateway updates** are pushed to a local registry and the StatefulSet is restarted. **Supervisor updates** are copied directly into the running cluster container via `docker cp` — new sandbox pods pick up the updated binary immediately through the hostPath mount, with no image rebuild or cluster restart required.
-
-Fingerprints are stored in `.cache/cluster-deploy-fast.state`. You can also target specific components explicitly:
-
-```bash
-mise run cluster -- gateway # rebuild gateway only
-mise run cluster -- supervisor # rebuild supervisor only
-mise run cluster -- chart # helm upgrade only
-mise run cluster -- all # rebuild everything
-```
-
-To validate incremental routing and BuildKit cache reuse locally, run:
-
-```bash
-mise run cluster:test:fast-deploy-cache
-```
-
-The harness runs isolated scenarios in temporary git worktrees, keeps its own state and cache under `.cache/cluster-deploy-fast-test/`, and writes a Markdown summary with:
+| Gateway image or chart | `mise run helm:lint` and `mise run docker:build:gateway` |
+| Docker driver | `mise run gateway:docker` or `mise run e2e:docker` |
+| Podman driver | `mise run e2e:podman` |
+| VM driver | `mise run e2e:vm` |
+| Published docs | `mise run docs` |
-- auto-detection checks for gateway-only, supervisor-only, shared, Helm-only, unrelated, and explicit-target changes
-- cold vs warm rebuild comparisons for gateway and supervisor code changes
-- container-ID invalidation coverage to verify gateway + Helm are retriggered when the cluster container changes
+Kubernetes chart changes should be validated with `helm lint deploy/helm/openshell` and, when possible, by installing the chart into a disposable Kubernetes namespace.
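+
+For example (the disposable namespace and release name are assumptions):
+
+```bash
+helm lint deploy/helm/openshell
+helm upgrade --install openshell-dev deploy/helm/openshell --namespace openshell-dev --create-namespace
+helm -n openshell-dev uninstall openshell-dev
+kubectl delete namespace openshell-dev
+```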
diff --git a/architecture/custom-vm-runtime.md b/architecture/custom-vm-runtime.md
index 045ee2e9a..e96feae5e 100644
--- a/architecture/custom-vm-runtime.md
+++ b/architecture/custom-vm-runtime.md
@@ -78,9 +78,9 @@ The rootfs tarball the driver embeds starts from the same minimal Ubuntu base
used across the project, and is **rewritten into a supervisor-only sandbox
guest** during extraction:
-- k3s state and Kubernetes manifests are stripped out
+- Kubernetes state and manifests are stripped out
- `/srv/openshell-vm-sandbox-init.sh` is installed as the guest entrypoint
-- the guest boots directly into `openshell-sandbox` — no k3s, no kube-proxy,
+- the guest boots directly into `openshell-sandbox` -- no Kubernetes control plane, no kube-proxy,
no CNI plugins
See `crates/openshell-driver-vm/src/rootfs.rs` for the rewrite logic and
diff --git a/architecture/gateway-security.md b/architecture/gateway-security.md
index a32c3fb52..59f0032c9 100644
--- a/architecture/gateway-security.md
+++ b/architecture/gateway-security.md
@@ -2,17 +2,17 @@
## Overview
-By default, communication with the OpenShell gateway is secured by mutual TLS (mTLS). The CLI, SDK, and sandbox pods present certificates signed by the cluster CA before they reach any application handler. The PKI is bootstrapped automatically during cluster deployment, and certificates are distributed to Kubernetes secrets and the local filesystem without manual configuration.
+By default, communication with the OpenShell gateway is secured by mutual TLS (mTLS). The CLI, SDK, and sandbox workloads present certificates signed by the deployment CA before they reach any application handler. In Helm deployments, operators provide the certificate bundle as Kubernetes secrets and place the CLI client bundle in the local gateway credential directory. Non-Kubernetes deployments provide equivalent certificate files to the gateway and sandbox runtime.
The gateway also supports Cloudflare-fronted deployments where the edge, not the gateway, is the first authentication boundary. In that mode the gateway either keeps TLS enabled but allows no-certificate client handshakes (`allow_unauthenticated=true`) and relies on application-layer Cloudflare JWTs, or disables gateway TLS entirely and serves plaintext behind a trusted reverse proxy or tunnel.
-This document covers the certificate hierarchy, the bootstrap process, how gateway transport security modes are enforced, how sandboxes and the CLI consume their certificates, and the broader security model of the gateway.
+This document covers the certificate hierarchy, how gateway transport security modes are enforced, how sandboxes and the CLI consume their certificates, and the broader security model of the gateway.
## Architecture Diagram
```mermaid
graph TD
- subgraph PKI["PKI (generated at bootstrap)"]
+ subgraph PKI["PKI (operator provided)"]
CA["openshell-ca
(self-signed root)"]
SERVER_CERT["openshell-server cert
(signed by CA)"]
CLIENT_CERT["openshell-client cert
(signed by CA, shared)"]
@@ -20,17 +20,17 @@ graph TD
CA --> CLIENT_CERT
end
- subgraph CLUSTER["Kubernetes Cluster"]
+ subgraph CLUSTER["Kubernetes Helm Deployment"]
S1["openshell-server-tls
Secret (server cert+key)"]
S2["openshell-server-client-ca
Secret (CA cert)"]
S3["openshell-client-tls
Secret (client cert+key+CA)"]
GW["Gateway Process
(tokio-rustls)"]
- SBX["Sandbox Pod"]
+ SBX["Sandbox Workload"]
end
subgraph HOST["User's Machine"]
CLI["CLI"]
- MTLS_DIR["~/.config/openshell/
clusters/<name>/mtls/"]
+ MTLS_DIR["~/.config/openshell/
gateways/<name>/mtls/"]
end
SERVER_CERT --> S1
@@ -49,7 +49,7 @@ graph TD
## Certificate Hierarchy
-The PKI is a single-tier CA hierarchy generated by the `openshell-bootstrap` crate using `rcgen`. All certificates are created in a single pass at cluster bootstrap time.
+The default PKI shape is a single-tier CA hierarchy. Operators can generate this bundle with their internal PKI tooling, cert-manager, or a local development CA.
```text
openshell-ca (Self-signed Root CA, O=openshell, CN=openshell-ca)
@@ -60,28 +60,26 @@ openshell-ca (Self-signed Root CA, O=openshell, CN=openshell-ca)
│ + extra SANs for remote deployments
│
└── openshell-client (Leaf cert, CN=openshell-client)
- Shared by the CLI and all sandbox pods.
+ Shared by the CLI and all sandbox workloads.
```
Key design decisions:
-- **Single client certificate**: One client cert is shared by the CLI and every sandbox pod. This simplifies secret management. Individual sandbox identity is not expressed at the TLS layer; post-authentication identification uses the `x-sandbox-id` gRPC header.
-- **Long-lived certificates**: Certificates use `rcgen` defaults (validity ~1975--4096), which effectively never expire. This is appropriate for an internal dev-cluster PKI where certificates are ephemeral to the cluster's lifetime.
-- **CA key not persisted**: The CA private key is used only during generation and is not stored in any Kubernetes secret. Re-signing requires regenerating the entire PKI.
-
-See `crates/openshell-bootstrap/src/pki.rs:35` for the `generate_pki()` implementation and `crates/openshell-bootstrap/src/pki.rs:18` for the default SAN list.
+- **Single client certificate**: One client cert is shared by the CLI and every sandbox workload. This simplifies secret management. Individual sandbox identity is not expressed at the TLS layer; post-authentication identification uses the `x-sandbox-id` gRPC header.
+- **Certificate lifetime**: Certificate validity is owned by the operator's PKI policy.
+- **CA key not stored in OpenShell**: The chart consumes certificates and CA bundles, but it does not need the CA private key.
## Kubernetes Secret Distribution
-The PKI bundle is distributed as three Kubernetes secrets in the `openshell` namespace:
+In Helm deployments, the PKI bundle is distributed as three Kubernetes secrets in the `openshell` namespace:
| Secret Name | Type | Contents | Consumed By |
|---|---|---|---|
| `openshell-server-tls` | `kubernetes.io/tls` | `tls.crt` (server cert), `tls.key` (server key) | Gateway StatefulSet |
| `openshell-server-client-ca` | `Opaque` | `ca.crt` (CA cert) | Gateway StatefulSet (client verification) |
-| `openshell-client-tls` | `Opaque` | `tls.crt` (client cert), `tls.key` (client key), `ca.crt` (CA cert) | Sandbox pods, CLI (via local filesystem) |
+| `openshell-client-tls` | `Opaque` | `tls.crt` (client cert), `tls.key` (client key), `ca.crt` (CA cert) | Sandbox workloads, CLI (via local filesystem) |
-Secret names are defined as constants in `crates/openshell-bootstrap/src/constants.rs:6-10`.
+Secret names are chart values under `server.tls.*` in `deploy/helm/openshell/values.yaml`.
### Gateway Mounts
@@ -100,9 +98,9 @@ OPENSHELL_TLS_KEY=/etc/openshell-tls/server/tls.key
OPENSHELL_TLS_CLIENT_CA=/etc/openshell-tls/client-ca/ca.crt
```
-### Sandbox Pod Mounts
+### Sandbox Workload TLS Material
-When the gateway creates a sandbox pod (`crates/openshell-server/src/sandbox/mod.rs:681`), it injects:
+When the Kubernetes driver creates a sandbox pod, it injects:
- A volume backed by the `openshell-client-tls` secret.
- A read-only mount at `/etc/openshell-tls/client/` on the agent container.
@@ -128,41 +126,31 @@ $XDG_CONFIG_HOME/openshell/gateways//mtls/
Files are written atomically using a temp-dir -> validate -> rename strategy with backup and rollback on failure. See `crates/openshell-bootstrap/src/mtls.rs:10`.
-## PKI Bootstrap Sequence
+## PKI Provisioning Sequence
+
+PKI provisioning is operator-driven:
-PKI provisioning occurs during `deploy_cluster_with_logs()` (`crates/openshell-bootstrap/src/lib.rs:284`). The full sequence:
+1. Generate or obtain a server certificate, server key, client certificate, client key, and CA certificate.
+2. Provide the server certificate and client CA to the gateway process.
+3. Provide the client certificate bundle to sandbox workloads through the selected compute driver.
+4. Store the same client bundle under `~/.config/openshell/gateways/<name>/mtls/` so the CLI can authenticate to the gateway.
-1. **Cluster container launched** -- a k3s container is created via Docker with a persistent volume.
-2. **k3s readiness** -- the bootstrap waits for k3s to become ready inside the container.
-3. **Extra SANs computed** -- for remote deployments, the SSH destination hostname and its resolved IP are added to the server certificate's SANs. For local deployments, the detected gateway host (if any) is added.
-4. **`reconcile_pki()` called** (`crates/openshell-bootstrap/src/lib.rs:515`):
- 1. Wait for the `openshell` namespace to exist (created by the Helm controller).
- 2. Attempt to load existing PKI from the three K8s secrets via `kubectl get secret` exec'd inside the container. Each field is base64-decoded and validated for PEM markers.
- 3. **If secrets exist and are valid**: reuse them and return `rotated=false`.
- 4. **If secrets are missing, incomplete, or malformed**: generate fresh PKI via `generate_pki()`, apply all three secrets via `kubectl apply`, and return `rotated=true`.
-5. **Workload restart on rotation** -- if `rotated=true` and the openshell StatefulSet already exists, the bootstrap performs `kubectl rollout restart` and waits for completion. This ensures the server picks up new TLS secrets before the CLI writes its local copy.
-6. **CLI-side credential storage** -- `store_pki_bundle()` writes `ca.crt`, `tls.crt`, `tls.key` to the local filesystem.
+For Helm deployments, steps 2 and 3 are satisfied by creating the `openshell-server-tls`, `openshell-server-client-ca`, and `openshell-client-tls` Kubernetes secrets before installing or upgrading the chart.
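+
+A sketch of steps 2-4 under those assumptions; the PEM file names and the local gateway name are illustrative:
+
+```bash
+# Steps 2-3: publish the operator-provided PEM material as the three chart secrets.
+kubectl -n openshell create secret tls openshell-server-tls \
+  --cert=server.crt --key=server.key
+kubectl -n openshell create secret generic openshell-server-client-ca \
+  --from-file=ca.crt=ca.crt
+kubectl -n openshell create secret generic openshell-client-tls \
+  --from-file=tls.crt=client.crt --from-file=tls.key=client.key \
+  --from-file=ca.crt=ca.crt
+
+# Step 4: give the CLI the same client bundle (gateway name "my-gateway" is illustrative).
+mkdir -p ~/.config/openshell/gateways/my-gateway/mtls
+cp ca.crt     ~/.config/openshell/gateways/my-gateway/mtls/ca.crt
+cp client.crt ~/.config/openshell/gateways/my-gateway/mtls/tls.crt
+cp client.key ~/.config/openshell/gateways/my-gateway/mtls/tls.key
+```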
```mermaid
sequenceDiagram
- participant CLI as nav deploy
- participant Docker as Cluster Container
- participant K8s as k3s / K8s API
-
- CLI->>Docker: Create container, wait for k3s
- CLI->>K8s: Wait for openshell namespace
- CLI->>K8s: Read existing TLS secrets
- alt Secrets valid
- CLI->>CLI: Reuse existing PKI
- else Secrets missing/invalid
- CLI->>CLI: generate_pki(extra_sans)
- CLI->>K8s: kubectl apply (3 secrets)
- alt Workload exists
- CLI->>K8s: kubectl rollout restart
- CLI->>K8s: Wait for rollout complete
- end
- end
- CLI->>CLI: store_pki_bundle() to local filesystem
+ participant O as Operator
+ participant K8s as Kubernetes API
+ participant Helm as Helm
+ participant GW as Gateway Pod
+ participant CLI as CLI
+
+ O->>K8s: Create TLS and SSH handshake secrets
+ O->>Helm: helm upgrade --install
+ Helm->>K8s: Apply StatefulSet and mounts
+ K8s->>GW: Start gateway with mounted certs
+ O->>CLI: Store client cert bundle locally
+ CLI->>GW: Connect with mTLS
```
## Gateway TLS Enforcement
@@ -216,7 +204,7 @@ The e2e test suite (`e2e/python/test_security_tls.py`) validates four scenarios:
## Sandbox-to-Gateway mTLS
-Sandbox pods connect back to the gateway at startup to fetch their policy and provider credentials. The gRPC client (`crates/openshell-sandbox/src/grpc_client.rs:18`) reads three environment variables to configure mTLS:
+Sandbox workloads connect back to the gateway at startup to fetch their policy and provider credentials. The gRPC client (`crates/openshell-sandbox/src/grpc_client.rs:18`) reads three environment variables to configure mTLS:
| Env Var | Value |
|---|---|
@@ -226,7 +214,7 @@ Sandbox pods connect back to the gateway at startup to fetch their policy and pr
These are used to build a `tonic::transport::ClientTlsConfig` with:
-- `ca_certificate()` -- verifies the server's certificate against the cluster CA.
+- `ca_certificate()` -- verifies the server's certificate against the deployment CA.
- `identity()` -- presents the shared client certificate for mTLS.
The sandbox calls two RPCs over this authenticated channel:
@@ -281,13 +269,13 @@ Per-connection flow:
2. Gateway calls `SupervisorSessionRegistry::open_relay(sandbox_id, ...)`, which allocates a `channel_id` (UUID) and sends a `RelayOpen` message to the supervisor over the already-established `ConnectSupervisor` stream. If no session is registered yet, it polls with exponential backoff up to a bounded timeout (30 s for `/connect/ssh`, 15 s for `ExecSandbox`).
3. The supervisor opens a new `RelayStream` RPC on the same `Channel` — a new HTTP/2 stream, no new TCP connection and no new TLS handshake. The first `RelayFrame` is a `RelayInit { channel_id }` that claims the pending slot on the gateway.
4. `claim_relay` pairs the gateway-side waiter with the supervisor-side RPC via a `tokio::io::duplex(64 KiB)` pair. Subsequent `RelayFrame::data` frames carry raw SSH bytes in both directions. The supervisor is a dumb byte bridge: it has no protocol awareness of the SSH bytes flowing through.
-5. Inside the sandbox pod, the supervisor connects the relay to sshd over a Unix domain socket at `/run/openshell/ssh.sock` (see `crates/openshell-driver-kubernetes/src/main.rs`).
+5. Inside the sandbox workload, the supervisor connects the relay to sshd over a Unix domain socket at `/run/openshell/ssh.sock`.
Security properties of this model:
- **One auth boundary.** mTLS on the `ConnectSupervisor` stream is the only identity check between gateway and sandbox. Every relay rides that same authenticated HTTP/2 connection.
-- **No inbound network path into the sandbox.** The sandbox exposes no TCP port for gateway ingress; all relays are supervisor-initiated. The pod only needs egress to the gateway.
-- **In-pod access control is filesystem permissions on the Unix socket.** sshd listens on `/run/openshell/ssh.sock` with the parent directory at `0700` and the socket itself at `0600`, both owned by the supervisor (root). The sandbox entrypoint runs as an unprivileged user and cannot open either. Any process in the supervisor's filesystem view that can open the socket can reach sshd — same trust model as any local Unix socket with `0600` permissions. See `crates/openshell-sandbox/src/ssh.rs:55-83`.
+- **No inbound network path into the sandbox.** The sandbox exposes no TCP port for gateway ingress; all relays are supervisor-initiated. The workload only needs egress to the gateway.
+- **In-workload access control is filesystem permissions on the Unix socket.** sshd listens on `/run/openshell/ssh.sock` with the parent directory at `0700` and the socket itself at `0600`, both owned by the supervisor (root). The sandbox entrypoint runs as an unprivileged user and cannot open either. Any process in the supervisor's filesystem view that can open the socket can reach sshd; this is the same trust model as any local Unix socket with `0600` permissions. See `crates/openshell-sandbox/src/ssh.rs:55-83`.
- **Supersede race is closed.** A supervisor reconnect registers a new `session_id` for the same sandbox id. Cleanup on the old session's task uses `remove_if_current(sandbox_id, session_id)` so a late-finishing old task cannot evict the new registration or serve relays meant for the new instance. See `SupervisorSessionRegistry::remove_if_current` in `crates/openshell-server/src/supervisor_session.rs`.
- **Pending-relay reaper.** A background task sweeps `pending_relays` entries older than 10 s (`RELAY_PENDING_TIMEOUT`). If the supervisor acknowledges `RelayOpen` but never initiates `RelayStream` — crash, deadlock, or adversarial stall — the gateway-side slot does not pin indefinitely.
- **Client-side keepalives.** The CLI's `ssh` invocation sets `ServerAliveInterval=15` / `ServerAliveCountMax=3` (`crates/openshell-cli/src/ssh.rs:150`), so a silently-dropped relay (gateway restart, supervisor restart, or adversarial TCP drop) surfaces to the user within roughly 45 s rather than hanging.
@@ -296,17 +284,16 @@ Observability (sandbox side, OCSF): `session_established`, `session_closed`, `se
## Port Configuration
-Traffic flows through several layers from the host to the gateway process:
+Traffic flows through the configured gateway exposure path to the gateway process. Kubernetes deployments use the Helm-managed service; standalone deployments bind the gateway port directly or place it behind an operator-managed proxy.
| Layer | Port | Configurable Via |
|---|---|---|
-| Host (Docker) | `8080` (default) | `--port` flag on `nav deploy` |
-| Container | `30051` | Hardcoded in `crates/openshell-bootstrap/src/docker.rs` |
-| k3s NodePort | `30051` | `deploy/helm/openshell/values.yaml` (`service.nodePort`) |
-| k3s Service | `8080` | `deploy/helm/openshell/values.yaml` (`service.port`) |
+| External ingress / port-forward / load balancer / reverse proxy | Operator choice | Platform-specific service or proxy configuration |
+| Kubernetes Service | `8080` by default | `deploy/helm/openshell/values.yaml` (`service.port`) |
+| NodePort, when enabled | `30051` by default | `deploy/helm/openshell/values.yaml` (`service.nodePort`) |
| Server bind | `8080` | `--port` flag / `OPENSHELL_SERVER_PORT` env var |
-Docker maps `host_port → 30051/tcp`. Inside k3s, the NodePort service maps `30051 → 8080 (pod port)`. The server binds `0.0.0.0:8080`.
+The server binds `0.0.0.0:8080` by default. The chart maps the service port to the gateway container's `grpc` port for Kubernetes deployments.
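+
+For Kubernetes deployments, a port-forward is the simplest exposure path from a workstation; the service and namespace names below are assumptions, so substitute the names from the actual Helm release:
+
+```bash
+# Forward the chart's service port (8080 by default) to the workstation.
+kubectl -n openshell port-forward svc/openshell 8080:8080
+
+# Register the forwarded endpoint with the CLI (argument shape is illustrative).
+openshell gateway add https://127.0.0.1:8080
+```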
## Security Model Summary
@@ -324,16 +311,16 @@ graph LR
API["gRPC + HTTP API"]
end
- subgraph KUBE["Kubernetes"]
- SBX["Sandbox Pod"]
+ subgraph PLATFORM["Compute Platform"]
+ SBX["Sandbox Workload"]
end
subgraph INET["Internet"]
HOSTS["Allowed Hosts"]
end
- CLI -- "mTLS
(cluster CA)" --> TLS
- SDK -- "mTLS
(cluster CA)" --> TLS
+ CLI -- "mTLS
(deployment CA)" --> TLS
+ SDK -- "mTLS
(deployment CA)" --> TLS
TLS --> API
SBX -- "mTLS + ConnectSupervisor
(supervisor-initiated)" --> TLS
API -- "RelayStream
(HTTP/2 on same mTLS conn)" --> SBX
@@ -344,10 +331,10 @@ graph LR
| Boundary | Mechanism |
|---|---|
-| External → Gateway | mTLS with cluster CA by default, or trusted reverse-proxy/Cloudflare boundary in edge mode |
+| External → Gateway | mTLS with deployment CA by default, or trusted reverse-proxy/Cloudflare boundary in edge mode |
| Sandbox → Gateway | mTLS with shared client cert (supervisor-initiated `ConnectSupervisor` stream) |
-| Gateway → Sandbox (SSH/exec) | Rides the supervisor's mTLS `ConnectSupervisor` HTTP/2 connection as a `RelayStream` — no separate gateway-to-pod connection |
-| Supervisor → in-pod sshd | Unix-socket filesystem permissions (`/run/openshell/ssh.sock`, 0700 parent / 0600 socket) |
+| Gateway → Sandbox (SSH/exec) | Rides the supervisor's mTLS `ConnectSupervisor` HTTP/2 connection as a `RelayStream`; no separate gateway-to-sandbox network connection |
+| Supervisor → workload sshd | Unix-socket filesystem permissions (`/run/openshell/ssh.sock`, 0700 parent / 0600 socket) |
| Sandbox → External (network) | OPA policy + process identity binding via `/proc` |
### What Is Not Authenticated (by Design)
@@ -388,18 +375,18 @@ This section defines the primary attacker profiles, what the current design prot
|---|---|
| Network attacker | Can observe/modify traffic between clients and gateway |
| Unauthorized external client | Can reach gateway port but has no valid client cert |
-| Compromised sandbox workload | Has code execution inside one sandbox pod |
-| Malicious in-cluster pod | Can attempt direct pod-to-pod connections |
+| Compromised sandbox workload | Has code execution inside one sandbox workload |
+| Malicious platform peer | Can attempt direct workload-to-workload connections |
| Stolen CLI credentials | Has copied `ca.crt`/`tls.crt`/`tls.key` from a developer machine |
### Primary Defenses
| Threat | Existing Defense | Notes |
|---|---|---|
-| MITM or passive interception of gateway traffic | Mandatory mTLS with cluster CA, or trusted reverse-proxy boundary in Cloudflare mode | Default mode is direct mTLS; reverse-proxy mode shifts the outer trust boundary upstream |
+| MITM or passive interception of gateway traffic | Mandatory mTLS with deployment CA, or trusted reverse-proxy boundary in Cloudflare mode | Default mode is direct mTLS; reverse-proxy mode shifts the outer trust boundary upstream |
| Unauthenticated API/health access | mTLS by default, or Cloudflare/reverse-proxy auth in edge mode | `/health*` are direct-mTLS only in the default deployment mode |
| Forged SSH tunnel connection to sandbox | Session token validation at the gateway; only the supervisor's authenticated mTLS `ConnectSupervisor` stream can carry a `RelayStream` to its sandbox | Forging a relay requires stealing a valid mTLS client identity |
-| Direct access to sandbox sshd from cluster peers | sshd listens on a Unix socket (`0700` parent / `0600` socket) inside the pod | No network path exists to sshd from cluster peers |
+| Direct access to sandbox sshd from platform peers | sshd listens on a Unix socket (`0700` parent / `0600` socket) inside the workload | No network path exists to sshd from platform peers |
| Stale or reconnecting supervisor serves relays for a new instance | `session_id`-scoped `remove_if_current` on the registry | Old session cleanup cannot evict a newer registration |
| Supervisor acknowledges `RelayOpen` but never initiates `RelayStream` | Gateway-side pending-relay reaper (10 s timeout) | Prevents indefinite resource pinning by a buggy or malicious supervisor |
| Silent TCP drop of an in-flight relay | CLI `ServerAliveInterval=15` / `ServerAliveCountMax=3` | Client detects a dead relay within ~45 s instead of hanging |
@@ -418,35 +405,35 @@ This section defines the primary attacker profiles, what the current design prot
### Out of Scope / Not Defended By This Layer
-- A fully compromised Kubernetes control plane or cluster-admin account.
-- A malicious actor with direct access to Kubernetes secrets in the `openshell` namespace.
+- A fully compromised compute platform, such as a Kubernetes control plane, container host, or VM host.
+- A malicious actor with direct access to deployment secrets for the gateway or sandbox runtime.
- Host-level compromise of the developer workstation running the CLI.
- Application-layer authorization bugs after mTLS authentication succeeds.
### Trust Assumptions
-- The cluster CA is generated and distributed without interception during bootstrap.
-- Kubernetes secret access is restricted to intended workloads and operators.
+- The deployment CA is generated and distributed without interception during provisioning.
+- Secret access is restricted to intended workloads and operators.
- Gateway and sandbox container images are trusted and not tampered with.
-- The sandbox pod's filesystem is trusted: only the supervisor process (root) can open `/run/openshell/ssh.sock`, which is enforced by the `0700` parent directory and `0600` socket permissions set at sshd start.
+- The sandbox workload's filesystem is trusted: only the supervisor process (root) can open `/run/openshell/ssh.sock`, which is enforced by the `0700` parent directory and `0600` socket permissions set at sshd start.
## Sandbox Outbound TLS (L7 Inspection)
-Separate from the cluster mTLS infrastructure, each sandbox has an independent TLS capability for inspecting outbound HTTPS traffic. This is documented here for completeness because it involves a distinct, per-sandbox PKI.
+Separate from the gateway mTLS infrastructure, each sandbox has an independent TLS capability for inspecting outbound HTTPS traffic. This is documented here for completeness because it involves a distinct, per-sandbox PKI.
The sandbox proxy automatically detects and terminates TLS on outbound HTTPS connections by peeking the first bytes of each tunnel. This enables credential injection and L7 inspection without requiring explicit policy configuration. The proxy performs TLS man-in-the-middle inspection:
-1. **Ephemeral sandbox CA**: a per-sandbox CA (`CN=OpenShell Sandbox CA, O=OpenShell`) is generated at sandbox startup. This CA is completely independent of the cluster mTLS CA.
+1. **Ephemeral sandbox CA**: a per-sandbox CA (`CN=OpenShell Sandbox CA, O=OpenShell`) is generated at sandbox startup. This CA is completely independent of the gateway mTLS CA.
2. **Trust injection**: the sandbox CA is written to the sandbox filesystem and injected via `NODE_EXTRA_CA_CERTS` and `SSL_CERT_FILE` so processes inside the sandbox trust it.
3. **Dynamic leaf certs**: for each target hostname, the proxy generates and caches a leaf certificate signed by the sandbox CA (up to 256 entries).
-4. **Upstream verification**: the proxy verifies upstream server certificates against Mozilla root CAs (`webpki-roots`) and system CA certificates from the container's trust store, not against the cluster CA. Custom sandbox images can add corporate/internal CAs via `update-ca-certificates`.
+4. **Upstream verification**: the proxy verifies upstream server certificates against Mozilla root CAs (`webpki-roots`) and system CA certificates from the container's trust store, not against the gateway mTLS CA. Custom sandbox images can add corporate/internal CAs via `update-ca-certificates`.
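+
+For example, a custom sandbox image built on a Debian-style base can extend that trust store at image build time; the CA file name is illustrative:
+
+```bash
+# Run during the custom sandbox image build (Debian/Ubuntu-style trust store).
+# corp-root-ca.crt stands in for the corporate/internal CA bundle.
+cp corp-root-ca.crt /usr/local/share/ca-certificates/corp-root-ca.crt
+update-ca-certificates
+```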
This capability is orthogonal to gateway mTLS -- it operates only on sandbox-to-internet traffic and uses entirely separate key material. See [Policy Language](security-policy.md) for configuration details.
## Cross-References
- [Gateway Architecture](gateway.md) -- protocol multiplexing, gRPC services, persistence, and SSH tunneling
-- [Cluster Bootstrap](cluster-single-node.md) -- cluster provisioning, Docker container lifecycle, and credential management
+- [Gateway Deployment and Compute Platforms](gateway-single-node.md) -- gateway deployment modes, compute platform inputs, and removed k3s responsibilities
- [Sandbox Architecture](sandbox.md) -- sandbox-side isolation, proxy, and policy enforcement
- [Sandbox Connect](sandbox-connect.md) -- client-side SSH connection flow through the gateway
- [Policy Language](security-policy.md) -- YAML/Rego policy system including L7 TLS inspection configuration
diff --git a/architecture/gateway-single-node.md b/architecture/gateway-single-node.md
index 01b69b2f5..ef4780b1f 100644
--- a/architecture/gateway-single-node.md
+++ b/architecture/gateway-single-node.md
@@ -1,471 +1,143 @@
-# Gateway Bootstrap Architecture
+# Gateway Deployment and Compute Platforms
-This document describes how OpenShell bootstraps a single-node k3s gateway inside a Docker container, for both local and remote (SSH) targets.
+This document describes the target OpenShell gateway deployment model after removing the embedded k3s implementation. OpenShell no longer centers deployment on a Docker-wrapped Kubernetes cluster image. Operators run a gateway endpoint and configure the compute driver that should create sandboxes.
+
+The Helm chart remains in this repository as the supported Kubernetes deployment artifact. Docker, Podman, and the experimental MicroVM runtime remain first-class compute platforms for local and specialized deployments.
## Goals and Scope
-- Provide a single bootstrap flow through `openshell-bootstrap` for local and remote gateway lifecycle.
-- Keep Docker as the only runtime dependency for provisioning and lifecycle operations.
-- Package the OpenShell gateway as one container image, transferred to the target host via registry pull.
-- Support idempotent `deploy` behavior (safe to re-run).
-- Persist gateway access artifacts (metadata, mTLS certs) in the local XDG config directory.
-- Track the active gateway so most CLI commands resolve their target automatically.
+- Keep the gateway deployable as a standard process, container, or Kubernetes Helm release.
+- Keep the Helm chart for Kubernetes deployments.
+- Remove the published k3s cluster image and Docker-wrapped Kubernetes bootstrap flow from the documented architecture.
+- Keep the gateway image independent from the compute runtime.
+- Make compute-platform dependencies explicit.
+- Preserve CLI gateway registration and selection as the way users target an already-running gateway.
Out of scope:
-- Multi-node orchestration.
+- Provisioning Kubernetes, Docker, Podman, or VM host infrastructure.
+- Replacing all existing legacy CLI bootstrap code.
+- Defining a new one-command mTLS import flow for every deployment type.
## Components
-- `crates/openshell-cli/src/main.rs`: CLI entry point; `clap`-based command parsing.
-- `crates/openshell-cli/src/run.rs`: CLI command implementations (`gateway_start`, `gateway_stop`, `gateway_destroy`, `gateway_info`, `doctor_logs`).
-- `crates/openshell-cli/src/bootstrap.rs`: Auto-bootstrap helpers for `sandbox create` (offers to deploy a gateway when one is unreachable).
-- `crates/openshell-bootstrap/src/lib.rs`: Gateway lifecycle orchestration (`deploy_gateway`, `deploy_gateway_with_logs`, `gateway_handle`, `check_existing_deployment`).
-- `crates/openshell-bootstrap/src/docker.rs`: Docker API wrappers (per-gateway network, volume, container, image operations).
-- `crates/openshell-bootstrap/src/image.rs`: Remote image registry pull with XOR-obfuscated distribution credentials.
-- `crates/openshell-bootstrap/src/runtime.rs`: In-container operations via `docker exec` (health polling, stale node cleanup, deployment restart).
-- `crates/openshell-bootstrap/src/metadata.rs`: Gateway metadata creation, storage, and active gateway tracking.
-- `crates/openshell-bootstrap/src/mtls.rs`: Gateway TLS detection and CLI mTLS bundle extraction.
-- `crates/openshell-bootstrap/src/push.rs`: Local development image push into k3s containerd.
-- `crates/openshell-bootstrap/src/paths.rs`: XDG path resolution.
-- `crates/openshell-bootstrap/src/constants.rs`: Shared constants (image name, container/volume/network naming).
-- `deploy/docker/Dockerfile.images` (target `cluster`): Container image definition (k3s base + Helm charts + manifests + entrypoint).
-- `deploy/docker/cluster-entrypoint.sh`: Container entrypoint (DNS proxy, registry config, manifest injection).
-- `deploy/docker/cluster-healthcheck.sh`: Docker HEALTHCHECK script.
-- Docker daemon(s):
- - Local daemon for local deploys.
- - Remote daemon over SSH for remote deploy container operations.
-
-## CLI Commands
-
-All gateway lifecycle commands live under `openshell gateway`:
-
-| Command | Description |
-|---|---|
-| `openshell gateway start [--name NAME] [--remote user@host] [--ssh-key PATH]` | Provision or update a gateway |
-| `openshell gateway stop [--name NAME] [--remote user@host]` | Stop the container (preserves state) |
-| `openshell gateway destroy [--name NAME] [--remote user@host]` | Destroy container, attached volumes, per-gateway network, and metadata |
-| `openshell gateway info [--name NAME]` | Show deployment details (endpoint, SSH host) |
-| `openshell status` | Show gateway health via gRPC/HTTP |
-| `openshell doctor logs [--name NAME] [--remote user@host] [--tail N]` | Fetch gateway container logs |
-| `openshell doctor exec [--name NAME] [--remote user@host] -- ` | Run a command inside the gateway container |
-| `openshell gateway select ` | Set the active gateway |
-| `openshell gateway select` | Open an interactive chooser on a TTY, or list all gateways in non-interactive mode |
-
-The `--name` flag defaults to `"openshell"`. When omitted on commands that accept it, the CLI resolves the active gateway via: `--gateway` flag, then `OPENSHELL_GATEWAY` env, then `~/.config/openshell/active_gateway` file.
-
-For remote dev/test deploys from a local checkout, `scripts/remote-deploy.sh`
-wraps a different workflow: it rsyncs the repository to a remote host, builds
-the release CLI plus cluster/server/sandbox images on that machine, and then
-invokes `openshell gateway start` with explicit flags such as `--recreate`,
-`--plaintext`, or `--disable-gateway-auth` only when requested.
+- `crates/openshell-server`: Gateway API server, persistence, inference route management, SSH relay, and compute-driver integration.
+- `crates/openshell-driver-kubernetes`: Kubernetes compute driver for sandbox pods and Kubernetes resources.
+- `crates/openshell-driver-docker`: Docker compute driver for local sandbox containers.
+- `crates/openshell-driver-vm`: VM compute driver for libkrun-backed sandboxes.
+- Podman driver path: rootless container execution compatible with the Podman runtime model.
+- `deploy/helm/openshell`: Helm chart for deploying the gateway and Kubernetes driver configuration.
+- `deploy/docker/Dockerfile.images` target `gateway`: Builds the published gateway image.
+- `crates/openshell-cli`: CLI commands that register, select, and talk to gateways.
-## Local Task Flows (`mise`)
-
-Development task entrypoints split bootstrap behavior:
-
-| Task | Behavior |
-|---|---|
-| `mise run cluster` | Bootstrap or incremental deploy: creates gateway if needed (fast recreate), then detects changed files and rebuilds/pushes only impacted components |
-
-For `mise run cluster`, `.env` acts as local source-of-truth for `GATEWAY_NAME`, `GATEWAY_PORT`, and `OPENSHELL_GATEWAY`. Missing keys are appended; existing values are preserved. If `GATEWAY_PORT` is missing, the task selects a free local port and persists it.
-Fast mode ensures a local registry (`127.0.0.1:5000`) is running and configures k3s to mirror pulls via `host.docker.internal:5000`, so the cluster task can push/pull local component images consistently.
-
-## Bootstrap Sequence Diagram
+## Deployment Flow
```mermaid
sequenceDiagram
- participant U as User
- participant C as openshell-cli
- participant B as openshell-bootstrap
- participant L as Local Docker daemon
- participant R as Remote Docker daemon (SSH)
-
- U->>C: openshell gateway start --remote user@host
- C->>B: deploy_gateway(DeployOptions)
-
- B->>B: create_ssh_docker_client (ssh://, 600s timeout)
- B->>R: pull_remote_image (registry auth, platform-aware)
- B->>R: tag image to local ref
-
- Note over B,R: Docker socket APIs only, no extra host dependencies
-
- B->>B: resolve SSH host for extra TLS SANs
- B->>R: ensure_network (per-gateway bridge, attachable)
- B->>R: ensure_volume
- B->>R: ensure_container (privileged, k3s server)
- B->>R: start_container
- B->>R: clean_stale_nodes (kubectl delete node)
- B->>R: wait_for_gateway_ready (180 attempts, 2s apart)
- B->>R: poll for secret openshell-cli-client (90 attempts, 2s apart)
- R-->>B: ca.crt, tls.crt, tls.key
- B->>B: atomically store mTLS bundle
- B->>B: create and persist gateway metadata JSON
- B->>B: save_active_gateway
- B-->>C: GatewayHandle (metadata, docker client)
-```
-
-## End-State Connectivity Diagram
-
-```mermaid
-flowchart LR
- subgraph WS[User workstation]
- NAV[openshell-cli]
- MTLS[mTLS bundle ca.crt, tls.crt, tls.key]
- end
-
- subgraph HOST[Target host]
- DOCKER[Docker daemon]
- K3S[openshell-cluster-NAME single k3s container]
- G8080[Gateway :port -> :30051 mTLS default 8080]
- SBX[Sandbox runtime]
- end
-
- NAV --> G8080
- MTLS -. client cert auth .-> G8080
-
- DOCKER --> K3S
- K3S --> G8080
- K3S --> G80
- K3S --> G443
-```
-
-## Deploy Flow
-
-### 1) Entry and client selection
-
-`deploy_gateway(DeployOptions)` in `crates/openshell-bootstrap/src/lib.rs` chooses execution mode:
-
-- `DeployOptions` fields: `name: String`, `image_ref: Option`, `remote: Option`, `port: u16` (default 8080).
-- `RemoteOptions` fields: `destination: String`, `ssh_key: Option`.
-- **Local deploy**: Create one Docker client with `Docker::connect_with_local_defaults()`.
-- **Remote deploy**: Create SSH Docker client via `Docker::connect_with_ssh()` with a 600-second timeout (for large image transfers). The destination is prefixed with `ssh://` if not already present.
-
-The `deploy_gateway_with_logs` variant accepts an `FnMut(String)` callback for progress reporting. The CLI wraps this in a `GatewayDeployLogPanel` for interactive terminals.
-
-**Pre-deploy check** (CLI layer in `gateway_start`): In interactive terminals, `check_existing_deployment` inspects whether a container or volume already exists. If found, the user is prompted to destroy and recreate or reuse the existing gateway.
-
-### 2) Image readiness
-
-Image ref resolution in `default_gateway_image_ref()`:
-
-1. If `OPENSHELL_CLUSTER_IMAGE` is set and non-empty, use it verbatim.
-2. Otherwise, use the published distribution image base (`/openshell/cluster`) with its default tag behavior.
-
-- **Local deploy**: `ensure_image()` inspects the image on the local daemon and pulls from the configured registry if missing (using built-in distribution credentials when pulling from the default distribution host).
-- **Remote deploy**: `pull_remote_image()` queries the remote daemon's architecture via `Docker::version()`, pulls the matching platform variant from the distribution registry (with XOR-decoded credentials), and tags the pulled image to the expected local ref (for example `openshell/cluster:dev` when an explicit local tag is requested).
-
-### 3) Runtime infrastructure
-
-For the target daemon (local or remote):
-
-1. **Ensure bridge network** `openshell-cluster-{name}` (attachable, bridge driver) via `ensure_network()`. Each gateway gets its own isolated Docker network.
-2. **Ensure volume** `openshell-cluster-{name}` via `ensure_volume()`.
-3. **Compute extra TLS SANs**:
- - For **local deploys**: Check `DOCKER_HOST` for a non-loopback `tcp://` endpoint (e.g., `tcp://docker:2375` in CI). If found, extract the host as an extra SAN. The function `local_gateway_host_from_docker_host()` skips `localhost`, `127.0.0.1`, and `::1`.
- - For **remote deploys**: Extract the host from the SSH destination (handles `user@host`, `ssh://user@host`), resolve via `ssh -G` to get the canonical hostname/IP. Include both the resolved host and original SSH host (if different) as extra SANs.
-4. **Ensure container** `openshell-cluster-{name}` via `ensure_container()`:
- - k3s server command: `server --disable=traefik --tls-san=127.0.0.1 --tls-san=localhost --tls-san=host.docker.internal` plus computed extra SANs.
- - Privileged mode.
- - Volume bind mount: `openshell-cluster-{name}:/var/lib/rancher/k3s`.
- - Network: `openshell-cluster-{name}` (per-gateway bridge network).
- - Extra host: `host.docker.internal:host-gateway`.
- - The cluster entrypoint prefers the resolved IPv4 for `host.docker.internal` when populating sandbox pod `hostAliases`, then falls back to the container default gateway. This keeps sandbox host aliases working on Docker Desktop, where the host-reachable IP differs from the bridge gateway.
- - Port mappings:
-
- | Container Port | Host Port | Purpose |
- |---|---|---|
- | 30051/tcp | configurable (default 8080) | OpenShell service NodePort (mTLS) |
-
- - Container environment variables (see [Container Environment Variables](#container-environment-variables) below).
- - If the container exists with a different image ID (compared by inspecting the content-addressable ID), it is stopped, force-removed, and recreated. If the image matches, the existing container is reused.
-5. **Start container** via `start_container()`. Tolerates already-running 409 conflict.
-
-### 4) Readiness and artifact extraction
-
-After the container starts:
-
-1. **Clean stale nodes**: `clean_stale_nodes()` finds nodes whose name does not match the deterministic k3s `--node-name` and deletes them. That node name is derived from the gateway name but normalized to a Kubernetes-safe lowercase form so existing gateway names that contain `_`, `.`, or uppercase characters still produce a valid node identity. This cleanup is needed when a container is recreated but reuses the persistent volume -- old node entries can persist in etcd. Non-fatal on error; returns the count of removed nodes.
-2. **Push local images** (optional, local deploy only): If `OPENSHELL_PUSH_IMAGES` is set, the comma-separated image refs are exported from the local Docker daemon as a single tar, uploaded into the container via `docker put_archive`, and imported into containerd via `ctr images import` in the `k8s.io` namespace. After import, `kubectl rollout restart deployment/openshell openshell` is run, followed by `kubectl rollout status --timeout=180s` to wait for completion. See `crates/openshell-bootstrap/src/push.rs`.
-3. **Wait for gateway health**: `wait_for_gateway_ready()` polls the Docker HEALTHCHECK status up to 180 times, 2 seconds apart (6 min total). A background task streams container logs during this wait. Failure modes:
- - Container exits during polling: error includes recent log lines.
- - Container has no HEALTHCHECK instruction: fails immediately.
- - HEALTHCHECK reports unhealthy on final attempt: error includes recent logs.
-
-The gateway StatefulSet also uses a Kubernetes `startupProbe` on the gRPC port before steady-state liveness and readiness checks begin. This gives single-node k3s boots extra time to absorb early networking and flannel initialization delay without restarting the gateway pod too aggressively.
-
-### 5) mTLS bundle capture
-
-TLS is always required. `fetch_and_store_cli_mtls()` polls for Kubernetes secret `openshell-cli-client` in namespace `openshell` (90 attempts, 2 seconds apart, 3 min total). Each attempt checks the container is still running. The secret's base64-encoded `ca.crt`, `tls.crt`, and `tls.key` fields are decoded and stored.
-
-Storage location: `~/.config/openshell/gateways/{name}/mtls/`
-
-Write is atomic: write to `.tmp` directory, validate all three files are non-empty, rename existing directory to `.bak`, rename `.tmp` to final path, then remove `.bak`.
-
-### 6) Metadata persistence
-
-`create_gateway_metadata()` produces a `GatewayMetadata` struct:
-
-- **Local**: endpoint `https://127.0.0.1:{port}` by default, or `https://{docker_host}:{port}` when `DOCKER_HOST` is a non-loopback `tcp://` endpoint. `is_remote=false`.
-- **Remote**: endpoint `https://{resolved_host}:{port}`, `is_remote=true`, plus SSH destination and resolved host.
-
-Metadata fields:
-
-| Field | Type | Description |
-|---|---|---|
-| `name` | `String` | Gateway name |
-| `gateway_endpoint` | `String` | HTTPS endpoint with port (e.g., `https://127.0.0.1:8080`) |
-| `is_remote` | `bool` | Whether gateway is remote |
-| `gateway_port` | `u16` | Host port mapped to the gateway NodePort |
-| `remote_host` | `Option` | SSH destination (e.g., `user@host`) |
-| `resolved_host` | `Option` | Resolved hostname/IP from `ssh -G` |
-
-Metadata location: `~/.config/openshell/gateways/{name}_metadata.json`
-
-Note: metadata is stored at the `gateways/` level (not nested inside `{name}/` like mTLS).
-
-After deploy, the CLI calls `save_active_gateway(name)`, writing the gateway name to `~/.config/openshell/active_gateway`. Subsequent commands that don't specify `--gateway` or `OPENSHELL_GATEWAY` resolve to this active gateway.
-
-## Container Image
-
-The cluster image is defined by target `cluster` in `deploy/docker/Dockerfile.images`:
-
-```text
-Base: rancher/k3s:v1.35.2-k3s1
+ participant O as Operator
+ participant G as Gateway
+ participant D as Compute Driver
+ participant P as Compute Platform
+ participant C as openshell CLI
+
+ O->>G: Start gateway with driver configuration
+ G->>D: Initialize selected compute driver
+ D->>P: Verify runtime or cluster access
+ O->>C: Register reachable gateway endpoint
+ C->>G: gRPC / HTTP requests
+ G->>D: Create / delete / watch sandboxes
+ D->>P: Create sandbox workload
```
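+
+A rough sketch of the first step in that flow, starting a standalone gateway with the Docker compute driver; the binary name `openshell-server` is assumed from the crate name, and the flags are the ones documented in [Gateway Architecture](gateway.md):
+
+```bash
+# Hedged sketch: standalone gateway using the Docker compute driver.
+# The binary name is an assumption; flags and env fallbacks are documented in gateway.md.
+openshell-server \
+  --db-url "sqlite:/var/openshell/openshell.db" \
+  --drivers docker \
+  --port 8080
+```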
-Layers added:
-
-1. Custom entrypoint: `deploy/docker/cluster-entrypoint.sh` -> `/usr/local/bin/cluster-entrypoint.sh`
-2. Healthcheck script: `deploy/docker/cluster-healthcheck.sh` -> `/usr/local/bin/cluster-healthcheck.sh`
-3. Packaged Helm charts: `deploy/docker/.build/charts/*.tgz` -> `/var/lib/rancher/k3s/server/static/charts/`
-4. Kubernetes manifests: `deploy/kube/manifests/*.yaml` -> `/opt/openshell/manifests/`
-
-Bundled manifests include:
-
-- `openshell-helmchart.yaml` (OpenShell Helm chart auto-deploy)
-- `envoy-gateway-helmchart.yaml` (Envoy Gateway for Gateway API)
-- `agent-sandbox.yaml`
-
-The HEALTHCHECK is configured as: `--interval=5s --timeout=5s --start-period=20s --retries=60`.
-
-## Entrypoint Script
-
-`deploy/docker/cluster-entrypoint.sh` runs before k3s starts. It performs:
-
-### DNS proxy setup
-
-On Docker custom networks, `/etc/resolv.conf` contains `127.0.0.11` (Docker's internal DNS). k3s detects this loopback and falls back to `8.8.8.8`, which does not work on Docker Desktop. The entrypoint solves this by:
+## Supported Compute Platforms
-1. Discovering Docker's real DNS listener ports from the `DOCKER_OUTPUT` iptables chain.
-2. Getting the container's `eth0` IP as a routable address.
-3. Adding DNAT rules in PREROUTING to forward DNS from pod namespaces through to Docker's DNS.
-4. Writing a custom resolv.conf pointing to the container IP.
-5. Passing `--kubelet-arg=resolv-conf=/etc/rancher/k3s/resolv.conf` to k3s.
+| Platform | Gateway shape | Sandbox workload | Primary dependencies |
+|---|---|---|---|
+| Docker | Standalone gateway process or container on a host with Docker access. | Local containers. | Docker daemon, image pull/build access, local networking. |
+| Podman | Standalone gateway process with Podman socket access. | Rootless or user-scoped containers. | Podman socket, rootless networking, image pull/build access. |
+| Kubernetes | Gateway StatefulSet installed by Helm. | Sandbox pods. | Kubernetes API, namespace, service account, RBAC, storage, secrets. |
+| MicroVM | Gateway process with VM driver access. | VM-backed sandboxes. | VM runtime rootfs, libkrun-based driver, host virtualization support. |
-Falls back to `8.8.8.8` / `8.8.4.4` if iptables detection fails.
+## Kubernetes Helm Deployment
-### Registry configuration
+The Helm chart at `deploy/helm/openshell` owns Kubernetes deployment concerns:
-Writes `/etc/rancher/k3s/registries.yaml` from `REGISTRY_HOST`, `REGISTRY_ENDPOINT`, `REGISTRY_USERNAME`, `REGISTRY_PASSWORD`, and `REGISTRY_INSECURE` environment variables so that k3s/containerd can authenticate when pulling component images at runtime. When no explicit credentials are provided (the default for public GHCR repos), the auth block is omitted and images are pulled anonymously.
+- Gateway StatefulSet and persistent volume claim.
+- Service account and RBAC.
+- Gateway Service and its exposure (`service.type`, `service.port`, `service.nodePort`).
+- TLS secret mounts and environment variables.
+- Sandbox namespace, default sandbox image, and callback endpoint configuration.
+- NetworkPolicy restricting sandbox SSH ingress to the gateway.
-### Manifest injection
-
-Copies bundled manifests from `/opt/openshell/manifests/` to `/var/lib/rancher/k3s/server/manifests/`. This is needed because the volume mount on `/var/lib/rancher/k3s` overwrites any files baked into that path at image build time.
-
-### Image configuration overrides
-
-When environment variables are set, the entrypoint modifies the HelmChart manifest at `/var/lib/rancher/k3s/server/manifests/openshell-helmchart.yaml`:
-
-- `IMAGE_REPO_BASE`: Rewrites `repository:`, `sandboxImage:`, and `jobImage:` in the HelmChart.
-- `PUSH_IMAGE_REFS`: In push mode, parses comma-separated image refs and rewrites the exact gateway, sandbox, and pki-job image references (matching on path component `/gateway:`, `/sandbox:`, `/pki-job:`).
-- `IMAGE_TAG`: Replaces `:latest` tags with the specified tag on gateway, sandbox, and pki-job images. Handles both quoted and unquoted `tag: latest` formats.
-- `IMAGE_PULL_POLICY`: Replaces `pullPolicy: Always` with the specified policy (e.g., `IfNotPresent`).
-- `SSH_GATEWAY_HOST` / `SSH_GATEWAY_PORT`: Replaces `__SSH_GATEWAY_HOST__` and `__SSH_GATEWAY_PORT__` placeholders.
-- `EXTRA_SANS`: Builds a YAML flow-style list from the comma-separated SANs and replaces `extraSANs: []`.
-
-## Healthcheck Script
-
-`deploy/docker/cluster-healthcheck.sh` validates cluster readiness through a series of checks:
-
-1. **Kubernetes API**: `kubectl get --raw='/readyz'`
-2. **OpenShell StatefulSet**: Checks that `statefulset/openshell` in namespace `openshell` exists and has 1 ready replica.
-3. **Gateway**: Checks that `gateway/openshell-gateway` in namespace `openshell` has the `Programmed` condition.
-4. **mTLS secret** (conditional): If `NAV_GATEWAY_TLS_ENABLED` is true (or inferred from the HelmChart manifest using the same two-path detection logic as the bootstrap code), checks that secret `openshell-cli-client` exists with non-empty `ca.crt`, `tls.crt`, and `tls.key` data.
-
-## GPU Enablement
-
-GPU support is part of the single-node gateway bootstrap path rather than a separate architecture.
-
-- `openshell gateway start --gpu` threads GPU device options through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`.
-- When enabled, the cluster container is created with Docker `DeviceRequests`. The injection mechanism is selected based on whether CDI is enabled on the daemon (`SystemInfo.CDISpecDirs` via `GET /info`):
- - **CDI enabled** (daemon reports non-empty `CDISpecDirs`): CDI device injection — `driver="cdi"` with `nvidia.com/gpu=all`. Specs are expected to be pre-generated on the host (e.g. automatically by the `nvidia-cdi-refresh.service` or manually via `nvidia-ctk generate`).
- - **CDI not enabled**: `--gpus all` device request — `driver="nvidia"`, `count=-1`, which relies on the NVIDIA Container Runtime hook.
-- `deploy/docker/Dockerfile.images` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image.
-- `deploy/docker/cluster-entrypoint.sh` checks `GPU_ENABLED=true` and copies GPU-only manifests from `/opt/openshell/gpu-manifests/` into k3s's manifests directory.
-- `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`. NFD and GFD are disabled; the device plugin's default `nodeAffinity` (which requires `feature.node.kubernetes.io/pci-10de.present=true` or `nvidia.com/gpu.present=true` from NFD/GFD) is overridden to empty so the DaemonSet schedules on the single-node cluster without requiring those labels. The chart is configured with `deviceListStrategy: cdi-cri` so the device plugin injects devices via direct CDI device requests in the CRI.
-- k3s auto-detects `nvidia-container-runtime` on `PATH`, registers the `nvidia` containerd runtime, and creates the `nvidia` `RuntimeClass` automatically.
-- The OpenShell Helm chart grants the gateway service account cluster-scoped read access to `node.k8s.io/runtimeclasses` and core `nodes` so GPU sandbox admission can verify both the `nvidia` `RuntimeClass` and allocatable GPU capacity before creating a sandbox.
-
-The runtime chain is:
-
-```text
-Host GPU drivers & NVIDIA Container Toolkit
- └─ Docker: DeviceRequests (CDI when enabled, --gpus all otherwise)
- └─ k3s/containerd: nvidia-container-runtime on PATH -> auto-detected
- └─ k8s: nvidia-device-plugin DaemonSet advertises nvidia.com/gpu
- └─ Pods: request nvidia.com/gpu in resource limits (CDI injection — no runtimeClassName needed)
-```
+The chart expects these operator-provided inputs:
-### `--gpu` flag
-
-The `--gpu` flag on `gateway start` enables GPU passthrough. OpenShell auto-selects CDI when enabled on the daemon and falls back to Docker's NVIDIA GPU request path (`--gpus all`) otherwise.
+| Input | Purpose |
+|---|---|
+| Namespace | Release namespace and default sandbox namespace. |
+| `openshell-ssh-handshake` Secret | HMAC key used by the SSH relay handshake. |
+| `openshell-server-tls` Secret | Server certificate and key when TLS is enabled. |
+| `openshell-server-client-ca` Secret | CA bundle used by the gateway to verify client certificates. |
+| `openshell-client-tls` Secret | Client certificate bundle mounted into sandbox pods. |
+| StorageClass / PVC support | Persistent gateway SQLite data when using the default `server.dbUrl`. |
+| Service exposure | Port-forward, ingress, load balancer, or NodePort for CLI access. |
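+
+A sketch of the pre-install setup for two of those inputs; the data key for the handshake secret is an assumption, so confirm it against the chart's templates:
+
+```bash
+# Namespace for the release and sandboxes.
+kubectl create namespace openshell
+
+# SSH relay handshake HMAC key. The data key name ("hmac-key") is an assumption.
+kubectl -n openshell create secret generic openshell-ssh-handshake \
+  --from-literal=hmac-key="$(openssl rand -hex 32)"
+```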
-Device injection uses CDI (`deviceListStrategy: cdi-cri`): the device plugin injects devices via direct CDI device requests in the CRI. Sandbox pods only need `nvidia.com/gpu: 1` in their resource limits, and GPU pods do not set `runtimeClassName`.
+For local Kubernetes evaluation, TLS may be disabled with `server.disableTls=true` and the service can be reached through `kubectl port-forward`. Production deployments should keep TLS enabled or place the gateway behind a trusted TLS-terminating access proxy.
-The expected smoke test is a plain pod requesting `nvidia.com/gpu: 1` without `runtimeClassName` and running `nvidia-smi`.
+Key Helm values:
-## Remote Image Transfer
+| Value | Effect |
+|---|---|
+| `image.repository`, `image.tag` | Select the gateway image. |
+| `service.type`, `service.port`, `service.nodePort` | Expose the gateway service. |
+| `server.dbUrl` | Select SQLite or Postgres persistence. |
+| `server.sandboxNamespace` | Namespace for sandbox resources. |
+| `server.sandboxImage` | Default sandbox image. |
+| `server.grpcEndpoint` | Endpoint sandbox supervisors use to call back to the gateway. |
+| `server.sshGatewayHost`, `server.sshGatewayPort` | Host and port returned to CLI clients for SSH proxy connections. |
+| `server.disableTls`, `server.disableGatewayAuth` | Transport/authentication mode. |
+| `server.tls.*` | Names of TLS secrets mounted into the gateway and sandboxes. |
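+
+An illustrative install using a few of those values; the repository, tag, and sandbox image are placeholders:
+
+```bash
+helm upgrade --install openshell deploy/helm/openshell \
+  --namespace openshell \
+  --set image.repository=ghcr.io/example/openshell/gateway \
+  --set image.tag=v0.1.0 \
+  --set server.sandboxNamespace=openshell \
+  --set server.sandboxImage=ghcr.io/example/openshell/sandbox:v0.1.0
+```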
+
+## Runtime Shape
```mermaid
flowchart LR
- B[bootstrap] -->|query| RP[Remote platform: Docker version API]
- RP --> Auth[Authenticate with distribution registry]
- Auth --> Pull[Pull platform-specific image layers]
- Pull --> Tag[Tag to local image ref]
- Tag --> OK[Image available as openshell/cluster:TAG]
+ CLI[openshell CLI] -->|gRPC / HTTP| GW[Gateway]
+ GW --> DB[(SQLite or Postgres)]
+ GW --> DRIVER[Compute Driver]
+ DRIVER --> DOCKER[Docker]
+ DRIVER --> PODMAN[Podman]
+ DRIVER --> K8S[Kubernetes API]
+ DRIVER --> VM[MicroVM Driver]
+ DOCKER --> SBX1[Sandbox Container]
+ PODMAN --> SBX2[Sandbox Container]
+ K8S --> SBX3[Sandbox Pod]
+ VM --> SBX4[Sandbox VM]
+ SBX1 --> GW
+ SBX2 --> GW
+ SBX3 --> GW
+ SBX4 --> GW
```
-- Remote platform is queried via `Docker::version()` and normalized (e.g., `x86_64` -> `amd64`, `aarch64` -> `arm64`).
-- Distribution registry credentials are XOR-encoded in the binary (lightweight obfuscation, not a security boundary).
-- If the image ref looks local (no `/` in repository), the `latest` tag is used from the distribution registry regardless of the local `IMAGE_TAG`.
-
-## Access Model
-
-### Gateway endpoint exposure
-
-- Local: `https://127.0.0.1:{port}` (or `https://{docker_host}:{port}` when `DOCKER_HOST` is a non-loopback TCP endpoint). Default port is 8080.
-- Remote: `https://:{port}`.
-- The host port (configurable via `--port`, default 8080) maps to container port 30051 (OpenShell service NodePort).
-
-## Lifecycle Operations
-
-### stop
-
-`GatewayHandle::stop()` calls `stop_container()`, which tolerates 404 (not found) and 409 (already stopped).
-
-### destroy
-
-**Bootstrap layer** (`GatewayHandle::destroy()` -> `destroy_gateway_resources()`):
-
-1. Stop the container.
-2. Remove the container (`force=true`). Tolerates 404.
-3. Remove the volume (`force=true`). Tolerates 404.
-4. Force-remove the per-gateway network via `force_remove_network()`, disconnecting any stale endpoints first.
-
-**CLI layer** (`gateway_destroy()` in `run.rs` additionally):
-
-5. Remove the metadata JSON file via `remove_gateway_metadata()`.
-6. Clear the active gateway reference if it matches the destroyed gateway.
-
+The gateway process manages all OpenShell control-plane APIs. It persists records in SQLite or Postgres, watches sandbox state through the selected compute driver, and brokers SSH access through supervisor-initiated relay streams.
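+
+The two persistence forms correspond to `server.dbUrl` in Helm deployments or `--db-url` / `OPENSHELL_DB_URL` for a standalone gateway; the Postgres URL below is illustrative:
+
+```bash
+# Chart default: SQLite on the gateway's persistent volume.
+OPENSHELL_DB_URL="sqlite:/var/openshell/openshell.db"
+
+# External Postgres (illustrative credentials and host).
+OPENSHELL_DB_URL="postgres://openshell:secret@db.internal:5432/openshell"
+```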
-## Idempotency and Error Behavior
+## Removed k3s Responsibilities
-- Re-running deploy is safe:
- - Network is recreated on each deploy to guarantee a clean state; volume is reused (inspect before create).
- - If a container exists with the same image ID, it is reused; if the image changed, the container is recreated.
- - `start_container` tolerates already-running state (409).
-- In interactive terminals, the CLI prompts the user to optionally destroy and recreate an existing gateway before redeploying.
-- Error handling surfaces:
- - Docker API failures from inspect/create/start/remove.
- - SSH connection failures when creating the remote Docker client.
- - Health check timeout (6 min) with recent container logs.
- - Container exit during any polling phase (health, mTLS) with diagnostic information (exit code, OOM status, recent logs).
- - mTLS secret polling timeout (3 min).
- - Local image ref without registry prefix: clear error with build instructions rather than a failed Docker Hub pull.
+The target architecture removes these responsibilities from OpenShell:
-## Auto-Bootstrap from `sandbox create`
-
-When `openshell sandbox create` cannot connect to a gateway (connection refused, DNS error, missing default TLS certs), the CLI offers to bootstrap one automatically:
-
-1. `should_attempt_bootstrap()` in `crates/openshell-cli/src/bootstrap.rs` checks the error type. It returns `true` for connectivity errors and missing default TLS materials, but `false` for TLS handshake/auth errors.
-2. If running in a terminal, the user is prompted to confirm.
-3. `run_bootstrap()` deploys a gateway named `"openshell"`, sets it as active, and returns fresh `TlsOptions` pointing to the newly-written mTLS certs.
-4. When `sandbox create` requests GPU explicitly (`--gpu`) or infers it from an image whose final name component contains `gpu` (such as `nvidia-gpu`), the bootstrap path enables gateway GPU support before retrying sandbox creation, using the same CDI-or-fallback selection as `gateway start --gpu`.
-
-## Container Environment Variables
-
-Variables set on the container by `ensure_container()` in `docker.rs`:
-
-| Variable | Value | When Set |
-|---|---|---|
-| `REGISTRY_MODE` | `"external"` | Always |
-| `REGISTRY_HOST` | Distribution registry host (or `OPENSHELL_REGISTRY_HOST` override) | Always |
-| `REGISTRY_INSECURE` | `"true"` or `"false"` | Always |
-| `IMAGE_REPO_BASE` | `{registry_host}/{namespace}` (or `IMAGE_REPO_BASE`/`OPENSHELL_IMAGE_REPO_BASE` override) | Always |
-| `REGISTRY_ENDPOINT` | Custom endpoint URL | When `OPENSHELL_REGISTRY_ENDPOINT` is set |
-| `REGISTRY_USERNAME` | Registry auth username | When explicit credentials provided via `--registry-username`/`--registry-token` or env vars |
-| `REGISTRY_PASSWORD` | Registry auth password | When explicit credentials provided via `--registry-username`/`--registry-token` or env vars |
-| `EXTRA_SANS` | Comma-separated extra TLS SANs | When extra SANs computed |
-| `SSH_GATEWAY_HOST` | Resolved remote hostname/IP | Remote deploys only |
-| `SSH_GATEWAY_PORT` | Configured host port (default `8080`) | Remote deploys only |
-| `IMAGE_TAG` | Image tag (e.g., `"dev"`) | When `IMAGE_TAG` env is set or push mode |
-| `IMAGE_PULL_POLICY` | `"IfNotPresent"` | Push mode only |
-| `PUSH_IMAGE_REFS` | Comma-separated image refs | Push mode only |
-
-## Host-Side Environment Variables
-
-Environment variables that affect bootstrap behavior when set on the host:
-
-| Variable | Effect |
-|---|---|
-| `OPENSHELL_CLUSTER_IMAGE` | Overrides entire image ref if set and non-empty |
-| `IMAGE_TAG` | Sets image tag (default: `"dev"`) when `OPENSHELL_CLUSTER_IMAGE` is not set |
-| `NAV_GATEWAY_TLS_ENABLED` | Overrides HelmChart manifest for TLS enabled check (`true`/`1`/`yes`/`false`/`0`/`no`) |
-| `XDG_CONFIG_HOME` | Base config directory (default: `$HOME/.config`) |
-| `DOCKER_HOST` | When `tcp://` and non-loopback, the host is added as a TLS SAN and used as the gateway endpoint |
-| `OPENSHELL_PUSH_IMAGES` | Comma-separated image refs to push into the gateway's containerd (local deploy only) |
-| `OPENSHELL_REGISTRY_HOST` | Override the distribution registry host |
-| `OPENSHELL_REGISTRY_NAMESPACE` | Override the registry namespace (default: `"openshell"`) |
-| `IMAGE_REPO_BASE` / `OPENSHELL_IMAGE_REPO_BASE` | Override the image repository base path |
-| `OPENSHELL_REGISTRY_INSECURE` | Use HTTP instead of HTTPS for registry mirror |
-| `OPENSHELL_REGISTRY_ENDPOINT` | Custom registry mirror endpoint |
-| `OPENSHELL_REGISTRY_USERNAME` | Override registry auth username |
-| `OPENSHELL_REGISTRY_PASSWORD` | Override registry auth password |
-| `OPENSHELL_GATEWAY` | Set the active gateway name for CLI commands |
-
-## File System Layout
-
-Artifacts stored under `$XDG_CONFIG_HOME/openshell/` (default `~/.config/openshell/`):
-
-```text
-openshell/
- active_gateway # plain text: active gateway name
- gateways/
- {name}_metadata.json # GatewayMetadata JSON
- {name}/
- mtls/ # mTLS bundle (when TLS enabled)
- ca.crt
- tls.crt
- tls.key
-```
+- Publishing `ghcr.io/nvidia/openshell/cluster`.
+- Running a k3s server inside a privileged Docker container.
+- Copying Helm charts into `/var/lib/rancher/k3s/server/static/charts`.
+- Auto-installing manifests through the k3s Helm controller.
+- Importing images into k3s containerd.
+- Side-loading the supervisor binary from a k3s node hostPath baked into the cluster image.
+- Running k3s-specific DNS proxy, stale-node cleanup, and healthcheck scripts.
-## Implementation References
+## Operational Notes
-- `crates/openshell-bootstrap/src/lib.rs` -- public API, deploy orchestration
-- `crates/openshell-bootstrap/src/docker.rs` -- Docker API wrappers
-- `crates/openshell-bootstrap/src/image.rs` -- registry pull, XOR credentials
-- `crates/openshell-bootstrap/src/runtime.rs` -- exec, health polling, stale node cleanup
-- `crates/openshell-bootstrap/src/metadata.rs` -- metadata CRUD, active gateway, SSH resolution
-- `crates/openshell-bootstrap/src/mtls.rs` -- TLS detection, secret extraction, atomic write
-- `crates/openshell-bootstrap/src/push.rs` -- local image push into k3s containerd
-- `crates/openshell-bootstrap/src/constants.rs` -- naming conventions
-- `crates/openshell-bootstrap/src/paths.rs` -- XDG path helpers
-- `crates/openshell-cli/src/main.rs` -- CLI command definitions
-- `crates/openshell-cli/src/run.rs` -- CLI command implementations
-- `crates/openshell-cli/src/bootstrap.rs` -- auto-bootstrap from sandbox create
-- `deploy/docker/Dockerfile.images` -- shared image build definition (cluster target)
-- `deploy/docker/cluster-entrypoint.sh` -- container entrypoint script
-- `deploy/docker/cluster-healthcheck.sh` -- Docker HEALTHCHECK script
-- `deploy/kube/manifests/openshell-helmchart.yaml` -- OpenShell Helm chart manifest
-- `deploy/kube/manifests/envoy-gateway-helmchart.yaml` -- Envoy Gateway manifest
+- Gateway endpoint registration should use `openshell gateway add <endpoint>` regardless of compute platform.
+- Kubernetes chart changes should be validated with `helm lint deploy/helm/openshell` and an install into a disposable namespace when possible.
+- Docker driver changes should be validated with `mise run gateway:docker` or `mise run e2e:docker`.
+- Podman driver changes should be validated with `mise run e2e:podman`.
+- VM driver changes should be validated with `mise run e2e:vm`.
+- Gateway image changes should be validated by building `deploy/docker/Dockerfile.images` target `gateway`.
+- Published docs should describe gateway deployment and endpoint registration, not cluster-image bootstrap.
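+
+Grouped as commands, the registration and validation steps above look like this; the gateway endpoint and the image build context are illustrative:
+
+```bash
+# Register an already-running gateway endpoint with the CLI.
+openshell gateway add https://gateway.example.com:8080
+
+# Chart validation.
+helm lint deploy/helm/openshell
+
+# Driver smoke E2E lanes.
+mise run e2e:docker
+mise run e2e:podman
+mise run e2e:vm
+
+# Gateway image build check (build context assumed to be the repo root).
+docker build -f deploy/docker/Dockerfile.images --target gateway .
+```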
diff --git a/architecture/gateway.md b/architecture/gateway.md
index e83640a43..eb79dbd23 100644
--- a/architecture/gateway.md
+++ b/architecture/gateway.md
@@ -66,7 +66,7 @@ graph TD
| Protocol mux | `crates/openshell-server/src/multiplex.rs` | `MultiplexService`, `MultiplexedService`, `GrpcRouter`, `BoxBody`, HTTP/2 adaptive-window tuning |
| gRPC: OpenShell | `crates/openshell-server/src/grpc/mod.rs` | `OpenShellService` trait impl -- dispatches to per-concern handlers |
| gRPC: Sandbox/Exec | `crates/openshell-server/src/grpc/sandbox.rs` | Sandbox CRUD, `ExecSandbox`, SSH session handlers, relay-backed exec proxy |
-| gRPC: Inference | `crates/openshell-server/src/inference.rs` | `InferenceService` -- cluster inference config and sandbox bundle delivery |
+| gRPC: Inference | `crates/openshell-server/src/inference.rs` | `InferenceService` -- gateway inference config and sandbox bundle delivery |
| Supervisor sessions | `crates/openshell-server/src/supervisor_session.rs` | `SupervisorSessionRegistry`, `handle_connect_supervisor`, `handle_relay_stream`, reaper |
| HTTP | `crates/openshell-server/src/http.rs` | Health endpoints, merged with SSH tunnel router |
| Browser auth | `crates/openshell-server/src/auth.rs` | Cloudflare browser login relay at `/auth/connect` |
@@ -133,7 +133,7 @@ All configuration is via CLI flags with environment variable fallbacks. The `--d
| `--db-url` | `OPENSHELL_DB_URL` | *required* | Database URL (`sqlite:...` or `postgres://...`). The Helm chart defaults to `sqlite:/var/openshell/openshell.db` (persistent volume). In-memory SQLite (`sqlite::memory:?cache=shared`) works for ephemeral/test environments but data is lost on restart. |
| `--sandbox-namespace` | `OPENSHELL_SANDBOX_NAMESPACE` | `default` | Kubernetes namespace for sandbox CRDs |
| `--sandbox-image` | `OPENSHELL_SANDBOX_IMAGE` | None | Default container image for sandbox pods |
-| `--grpc-endpoint` | `OPENSHELL_GRPC_ENDPOINT` | None | gRPC endpoint reachable from within the cluster (for supervisor callbacks) |
+| `--grpc-endpoint` | `OPENSHELL_GRPC_ENDPOINT` | None | gRPC endpoint reachable from sandbox workloads for supervisor callbacks |
| `--drivers` | `OPENSHELL_DRIVERS` | `kubernetes` | Compute backend to use. Current options are `kubernetes`, `docker`, and `vm`. |
| `--vm-driver-state-dir` | `OPENSHELL_VM_DRIVER_STATE_DIR` | `target/openshell-vm-driver` | Host directory for VM sandbox rootfs, console logs, and runtime state |
| `--driver-dir` | `OPENSHELL_DRIVER_DIR` | unset | Override directory for `openshell-driver-vm`. When unset, the gateway searches `~/.local/libexec/openshell`, `/usr/local/libexec/openshell`, `/usr/local/libexec`, then a sibling binary. |
@@ -211,7 +211,7 @@ When TLS is enabled (`crates/openshell-server/src/tls.rs`):
- `--disable-tls` removes gateway-side TLS entirely and serves plaintext HTTP behind a trusted reverse proxy or tunnel.
- Supports PKCS#1, PKCS#8, and SEC1 private key formats.
- The TLS handshake happens before the stream reaches Hyper's auto builder, so ALPN negotiation and HTTP version detection work together transparently.
-- Certificates are generated at cluster bootstrap time by the `openshell-bootstrap` crate using `rcgen`, not by a Helm Job. The bootstrap reconciles three K8s secrets: `openshell-server-tls` (server cert+key), `openshell-server-client-ca` (CA cert), and `openshell-client-tls` (client cert+key+CA, shared by CLI and sandbox pods).
+- Certificates are operator-provided in the target deployment model. Helm deployments consume three K8s secrets: `openshell-server-tls` (server cert+key), `openshell-server-client-ca` (CA cert), and `openshell-client-tls` (client cert+key+CA, shared by CLI and sandbox workloads).
- Sandbox supervisors reuse the shared client cert to authenticate their `ConnectSupervisor` and `RelayStream` calls over the same mTLS channel.
## Supervisor Sessions
@@ -395,27 +395,27 @@ These RPCs support the sandbox-initiated policy recommendation pipeline. The san
Defined in `proto/inference.proto`, implemented in `crates/openshell-server/src/inference.rs` as `InferenceService`.
-The gateway acts as the control plane for inference configuration. It stores a single managed cluster inference route (named `inference.local`) and delivers resolved route bundles to sandbox pods. The gateway does not execute inference requests -- sandboxes connect directly to inference backends using the credentials and endpoints provided in the bundle.
+The gateway acts as the control plane for inference configuration. It stores a single managed gateway inference route (named `inference.local`) and delivers resolved route bundles to sandbox pods. The gateway does not execute inference requests -- sandboxes connect directly to inference backends using the credentials and endpoints provided in the bundle.
-#### Cluster Inference Configuration
+#### Gateway Inference Configuration
-The gateway manages a single cluster-wide inference route that maps to a provider record. When set, the route stores only a `provider_name` and `model_id` reference. At bundle resolution time, the gateway looks up the referenced provider and derives the endpoint URL, API key, protocols, and provider type from it. This late-binding design means provider credential rotations are automatically reflected in the next bundle fetch without updating the route itself.
+The gateway manages a single gateway-wide inference route that maps to a provider record. When set, the route stores only a `provider_name` and `model_id` reference. At bundle resolution time, the gateway looks up the referenced provider and derives the endpoint URL, API key, protocols, and provider type from it. This late-binding design means provider credential rotations are automatically reflected in the next bundle fetch without updating the route itself.
| RPC | Description |
|-----|-------------|
-| `SetClusterInference` | Configures the cluster inference route. Validates `provider_name` and `model_id` are non-empty, verifies the named provider exists and has a supported type for inference (openai, anthropic, nvidia), validates the provider has a usable API key, then upserts the `inference.local` route record. Increments a monotonic `version` on each update. Returns the configured `provider_name`, `model_id`, and `version`. |
-| `GetClusterInference` | Returns the current cluster inference configuration (`provider_name`, `model_id`, `version`). Returns `NotFound` if no cluster inference is configured, or `FailedPrecondition` if the stored route has empty provider/model metadata. |
+| `SetClusterInference` | Configures the gateway inference route. Validates `provider_name` and `model_id` are non-empty, verifies the named provider exists and has a supported type for inference (openai, anthropic, nvidia), validates the provider has a usable API key, then upserts the `inference.local` route record. Increments a monotonic `version` on each update. Returns the configured `provider_name`, `model_id`, and `version`. |
+| `GetClusterInference` | Returns the current gateway inference configuration (`provider_name`, `model_id`, `version`). Returns `NotFound` if no gateway inference is configured, or `FailedPrecondition` if the stored route has empty provider/model metadata. |
| `GetInferenceBundle` | Returns the resolved inference route bundle for sandbox consumption. See [Route Bundle Delivery](#route-bundle-delivery) below. |
#### Route Bundle Delivery
-The `GetInferenceBundle` RPC resolves the managed cluster route into a `GetInferenceBundleResponse` containing fully materialized route data that sandboxes can use directly.
+The `GetInferenceBundle` RPC resolves the managed gateway route into a `GetInferenceBundleResponse` containing fully materialized route data that sandboxes can use directly.
The trait method delegates to `resolve_inference_bundle(store)` (`crates/openshell-server/src/inference.rs`), which takes `&Store` instead of `&self`. This extraction decouples bundle resolution from `ServerState`, enabling direct unit testing against an in-memory SQLite store without constructing a full server.
The `GetInferenceBundleResponse` includes:
-- **`routes`** -- a list of `ResolvedRoute` messages containing base URL, model ID, API key, protocols, and provider type. Currently contains zero or one routes (the managed cluster route).
+- **`routes`** -- a list of `ResolvedRoute` messages containing base URL, model ID, API key, protocols, and provider type. Currently contains zero or one routes (the managed gateway route).
- **`revision`** -- a hex-encoded hash computed from route contents. Sandboxes compare this value to detect when their route set has changed.
- **`generated_at_ms`** -- epoch milliseconds when the bundle was assembled.
@@ -578,7 +578,7 @@ The `generate_name()` function produces random 6-character lowercase alphabetic
### Deployment Storage
-The gateway runs as a Kubernetes **StatefulSet** with a `volumeClaimTemplate` that provisions a 1Gi `ReadWriteOnce` PersistentVolumeClaim mounted at `/var/openshell`. On k3s clusters this uses the built-in `local-path-provisioner` StorageClass (the cluster default). The SQLite database file at `/var/openshell/openshell.db` survives pod restarts and rescheduling.
+The gateway runs as a Kubernetes **StatefulSet** with a `volumeClaimTemplate` that provisions a 1Gi `ReadWriteOnce` PersistentVolumeClaim mounted at `/var/openshell`. The cluster's default StorageClass supplies the volume unless an operator customizes the chart. The SQLite database file at `/var/openshell/openshell.db` survives pod restarts and rescheduling.
The Helm chart template is at `deploy/helm/openshell/templates/statefulset.yaml`.
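+
+A quick way to confirm the claim and volume after an install (the `openshell` namespace and release name follow the quickstart and are assumptions):
+
+```bash
+# The StatefulSet should be ready and its PersistentVolumeClaim bound
+kubectl -n openshell get statefulset openshell
+kubectl -n openshell get pvc
+```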
diff --git a/architecture/inference-routing.md b/architecture/inference-routing.md
index 4d7b0f517..01b6a853a 100644
--- a/architecture/inference-routing.md
+++ b/architecture/inference-routing.md
@@ -62,7 +62,7 @@ File: `crates/openshell-server/src/inference.rs`
The gateway implements the `Inference` gRPC service defined in `proto/inference.proto`.
-### Cluster inference set/get
+### Gateway inference set/get
`SetClusterInference` takes a `provider_name` and `model_id`. It:
@@ -73,7 +73,7 @@ The gateway implements the `Inference` gRPC service defined in `proto/inference.
5. Builds a managed route spec that stores only `provider_name` and `model_id`. The spec intentionally leaves `base_url`, `api_key`, and `protocols` empty -- these are resolved dynamically at bundle time from the provider record.
6. Upserts the route with name `inference.local`. Version starts at 1 and increments monotonically on each update.
-`GetClusterInference` returns `provider_name`, `model_id`, and `version` for the managed route. Returns `NOT_FOUND` if cluster inference is not configured.
+`GetClusterInference` returns `provider_name`, `model_id`, and `version` for the managed route. Returns `NOT_FOUND` if gateway inference is not configured.
### Bundle delivery
@@ -87,7 +87,7 @@ The gateway implements the `Inference` gRPC service defined in `proto/inference.
Because resolution happens at request time, credential rotation and endpoint changes on the provider record take effect on the next bundle fetch without re-running `SetClusterInference`.
-An empty route list is valid and indicates cluster inference is not yet configured.
+An empty route list is valid and indicates gateway inference is not yet configured.
### Proto definitions
@@ -109,7 +109,7 @@ Files:
- `crates/openshell-sandbox/src/lib.rs` -- inference context initialization, route refresh
- `crates/openshell-sandbox/src/grpc_client.rs` -- `fetch_inference_bundle()`
-In cluster mode, the sandbox starts a background refresh loop as soon as the inference context is created. The loop polls the gateway every 5 seconds by default (`OPENSHELL_ROUTE_REFRESH_INTERVAL_SECS` override) and uses the bundle revision hash to skip no-op cache writes. The revision hash covers all route fields including `timeout_secs`, so any configuration change (provider, model, or timeout) triggers a cache update on the next poll.
+In gateway bundle mode, the sandbox starts a background refresh loop as soon as the inference context is created. The loop polls the gateway every 5 seconds by default (`OPENSHELL_ROUTE_REFRESH_INTERVAL_SECS` override) and uses the bundle revision hash to skip no-op cache writes. The revision hash covers all route fields including `timeout_secs`, so any configuration change (provider, model, or timeout) triggers a cache update on the next poll.
### Interception flow
@@ -146,9 +146,9 @@ If no pattern matches, the proxy returns `403 Forbidden` with `{"error": "connec
### Route cache
- `InferenceContext` holds a `Router`, the pattern list, and an `Arc>>` route cache.
-- In cluster mode, `spawn_route_refresh()` polls `GetInferenceBundle` every 5 seconds (`OPENSHELL_ROUTE_REFRESH_INTERVAL_SECS`). On failure, stale routes are kept.
+- In gateway bundle mode, `spawn_route_refresh()` polls `GetInferenceBundle` every 5 seconds (`OPENSHELL_ROUTE_REFRESH_INTERVAL_SECS`). On failure, stale routes are kept.
- In file mode (`--inference-routes`), routes load once at startup from YAML. No refresh task is spawned.
-- In cluster mode, an empty initial bundle still enables the inference context so the refresh task can pick up later configuration.
+- In gateway bundle mode, an empty initial bundle still enables the inference context so the refresh task can pick up later configuration.
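+
+For example, to slow the poll in gateway bundle mode (a sketch -- how the supervisor's environment is populated depends on the compute driver):
+
+```bash
+# Refresh the route cache every 30 seconds instead of the 5-second default
+export OPENSHELL_ROUTE_REFRESH_INTERVAL_SECS=30
+```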
### Bundle-to-route conversion
@@ -280,7 +280,7 @@ Validation at load time requires either `api_key` or `api_key_env` to resolve, a
| Status | Condition |
|--------|-----------|
| `403` | Request on `inference.local` does not match a recognized inference API pattern |
-| `503` | Pattern matched but route cache is empty (cluster inference not configured) |
+| `503` | Pattern matched but route cache is empty (gateway inference not configured) |
| `400` | No compatible route for the detected source protocol |
| `401` | Upstream returned unauthorized |
| `502` | Upstream protocol error or internal router error |
@@ -328,15 +328,15 @@ The system route is stored as a separate `InferenceRoute` record in the gateway
## CLI Surface
-Cluster inference commands:
+Gateway inference commands:
-- `openshell inference set --provider <provider-name> --model <model-id> [--timeout <seconds>]` -- configures user-facing cluster inference
+- `openshell inference set --provider <provider-name> --model <model-id> [--timeout <seconds>]` -- configures user-facing gateway inference
- `openshell inference set --system --provider <provider-name> --model <model-id> [--timeout <seconds>]` -- configures system inference
- `openshell inference update [--provider <provider-name>] [--model <model-id>] [--timeout <seconds>]` -- updates individual fields without resetting others
- `openshell inference get` -- displays both user and system inference configuration
- `openshell inference get --system` -- displays only the system inference configuration
-The `--provider` flag references a provider record name (not a provider type). The provider must already exist in the cluster and have a supported inference type (`openai`, `anthropic`, or `nvidia`).
+The `--provider` flag references a provider record name (not a provider type). The provider must already exist on the gateway and have a supported inference type (`openai`, `anthropic`, or `nvidia`).
The `--timeout` flag sets the per-request timeout in seconds for upstream inference calls. When omitted or set to `0`, the default of 60 seconds applies. Timeout changes propagate to running sandboxes within the route refresh interval (5 seconds by default).
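+
+For example, configuring user-facing gateway inference against an existing provider record and then checking it (the provider and model names are illustrative):
+
+```bash
+openshell inference set --provider my-openai --model gpt-4o-mini --timeout 120
+openshell inference get
+```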
diff --git a/architecture/podman-rootless-networking.md b/architecture/podman-rootless-networking.md
index b267cfffa..c6216d25b 100644
--- a/architecture/podman-rootless-networking.md
+++ b/architecture/podman-rootless-networking.md
@@ -339,7 +339,7 @@ Gateway (host, port 8080)
| TLS | mTLS via K8s secrets | Disabled by default (loopback-only, `--disable-tls`) |
| DNS | Kubernetes CoreDNS | Podman bridge DNS (aardvark-dns, `dns_enabled: true`) |
| Network policy | K8s NetworkPolicy (ingress restricted to gateway) | iptables inside inner sandbox netns |
-| Supervisor delivery | hostPath volume from k3s node | OCI image volume mount (FROM scratch image) |
+| Supervisor delivery | Pod image/template managed by the Kubernetes driver | OCI image volume mount (FROM scratch image) |
| Secrets | K8s Secret volume mount (TLS certs); SSH handshake secret via env var | Podman `secret_env` API (hidden from `podman inspect`) |
Both drivers use the same reverse gRPC relay (`ConnectSupervisor` + `RelayStream`) for SSH transport. The most significant difference is network reachability: in rootless Podman, the bridge network is not routable from the host, so all communication between host and container goes through either pasta port forwarding (`portmappings`) or the `host.containers.internal` hostname (resolved to `169.254.1.2` by pasta).
diff --git a/architecture/sandbox-connect.md b/architecture/sandbox-connect.md
index 499532fb9..bf654b249 100644
--- a/architecture/sandbox-connect.md
+++ b/architecture/sandbox-connect.md
@@ -216,7 +216,7 @@ sequenceDiagram
- Resolves sandbox name to ID via `GetSandbox` gRPC.
- Creates an SSH session via `CreateSshSession` gRPC.
- Builds a `ProxyCommand` string: `<openshell-binary> ssh-proxy --gateway <gateway> --sandbox-id <sandbox-id> --token <token> --gateway-name <gateway-name>`.
- - If the gateway host is loopback but the cluster endpoint is not, `resolve_ssh_gateway()` overrides the host with the cluster endpoint's host.
+ - If the SSH gateway host is loopback but the registered gateway endpoint is not, `resolve_ssh_gateway()` overrides the host with the registered endpoint's host.
3. `sandbox_connect()` builds an `ssh` command with:
- `-o ProxyCommand=...`
- `-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o GlobalKnownHostsFile=/dev/null` (ephemeral host keys)
@@ -251,7 +251,7 @@ When `openshell sandbox create` launches a `--no-keep` command or shell, it keep
- With `-d`/`--background`, SSH forks after auth and the CLI exits. The PID is tracked in `~/.config/openshell/forwards/-.pid` along with sandbox id metadata.
- `openshell forward stop ` validates PID ownership and then kills a background forward.
- `openshell forward list` shows all tracked forwards.
-- `openshell forward stop` and `openshell forward list` are local operations and do not require resolving an active cluster.
+- `openshell forward stop` and `openshell forward list` are local operations and do not require resolving an active gateway.
- `openshell sandbox create --forward ` starts a background forward before connect/exec, including when no trailing command is provided.
- `openshell sandbox delete` auto-stops any active forwards for the deleted sandbox.
@@ -544,9 +544,9 @@ The gateway builds the remote command by shell-escaping arguments, prepending so
**File**: `crates/openshell-core/src/forward.rs` -- `resolve_ssh_gateway()`
-When the gateway returns a loopback address (`127.0.0.1`, `0.0.0.0`, `localhost`, or `::1`), the client overrides it with the host from the cluster endpoint URL. This handles the common case where the gateway defaults to `127.0.0.1` but the cluster is running on a remote machine.
+When the gateway returns a loopback address (`127.0.0.1`, `0.0.0.0`, `localhost`, or `::1`), the client overrides it with the host from the registered gateway endpoint URL. This handles the common case where the gateway defaults to `127.0.0.1` but is actually running on a remote machine.
-The override only applies if the cluster endpoint itself is not also a loopback address. If both are loopback, the original address is kept.
+The override only applies if the registered gateway endpoint itself is not also a loopback address. If both are loopback, the original address is kept.
This function is shared between the CLI and TUI via the `openshell-core::forward` module.
diff --git a/architecture/sandbox-custom-containers.md b/architecture/sandbox-custom-containers.md
index 5d482ffe0..071790346 100644
--- a/architecture/sandbox-custom-containers.md
+++ b/architecture/sandbox-custom-containers.md
@@ -9,7 +9,7 @@ The `--from` flag accepts four kinds of input:
| Input | Example | Behavior |
|-------|---------|----------|
| **Community sandbox name** | `--from openclaw` | Resolves to `ghcr.io/nvidia/openshell-community/sandboxes/openclaw:latest` |
-| **Dockerfile path** | `--from ./Dockerfile` | Builds the image, pushes it into the cluster, then creates the sandbox |
+| **Dockerfile path** | `--from ./Dockerfile` | Builds the image, publishes it to a registry reachable by the compute backend, then creates the sandbox |
| **Directory with Dockerfile** | `--from ./my-sandbox/` | Uses the directory as the build context |
| **Full image reference** | `--from myregistry.com/img:tag` | Uses the image directly |
@@ -33,39 +33,39 @@ The community registry prefix defaults to `ghcr.io/nvidia/openshell-community/sa
When `--from` points to a Dockerfile or directory, the CLI:
1. Builds the image locally via the Docker daemon (respecting `.dockerignore`).
-2. Pushes it into the cluster's containerd runtime using `docker save` / `ctr import`.
+2. Publishes it to a registry reachable by the compute backend.
3. Creates the sandbox with the resulting image tag.
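+
+For example, creating a sandbox from a local build context (the directory name is illustrative):
+
+```bash
+openshell sandbox create --from ./my-sandbox/
+```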
## How It Works
-The supervisor binary (`openshell-sandbox`) is **always side-loaded** from the k3s node filesystem via a read-only `hostPath` volume. It is never baked into sandbox images. This applies to all sandbox pods — whether using the default community base image, a custom image, or a user-built Dockerfile.
+The supervisor binary (`openshell-sandbox`) must be delivered by the selected compute driver. In the target deployment model, delivery does not depend on a k3s node hostPath or a cluster image.
```mermaid
flowchart TB
- subgraph node["K3s Node"]
- bin["/opt/openshell/bin/openshell-sandbox
- (built into cluster image, updatable via docker cp)"]
+ subgraph delivery["Supervisor delivery"]
+ bin["openshell-sandbox
+ (image, image volume, local binary, or VM rootfs)"]
end
- node -- "hostPath (readOnly)" --> agent
+ delivery --> agent
subgraph pod["Pod"]
subgraph agent["Agent Container"]
agent_desc["Image: community base or custom image
Command: /opt/openshell/bin/openshell-sandbox
- Volume: /opt/openshell/bin (ro hostPath)
+ Supervisor path configured by compute driver
Env: OPENSHELL_SANDBOX_ID, OPENSHELL_ENDPOINT, ...
Caps: SYS_ADMIN, NET_ADMIN, SYS_PTRACE"]
end
end
```
-The server applies these transforms to every sandbox pod template (`sandbox/mod.rs`):
+For Kubernetes-backed sandboxes, the driver must ensure every pod template has:
-1. Adds a `hostPath` volume named `openshell-supervisor-bin` pointing to `/opt/openshell/bin` on the node.
-2. Mounts it read-only at `/opt/openshell/bin` in the agent container.
-3. Overrides the agent container's `command` to `/opt/openshell/bin/openshell-sandbox`.
-4. Sets `runAsUser: 0` so the supervisor has root privileges for namespace creation, proxy setup, and Landlock/seccomp.
+1. A resolvable `openshell-sandbox` entrypoint.
+2. Gateway callback environment variables such as `OPENSHELL_SANDBOX_ID`, `OPENSHELL_ENDPOINT`, and `OPENSHELL_SSH_SOCKET_PATH`.
+3. TLS and SSH handshake materials when the gateway requires them.
+4. The capabilities needed for namespace creation, proxy setup, and Landlock/seccomp.
-These transforms apply to every generated pod template.
+These requirements apply to every generated pod template.
@@ -109,16 +109,16 @@ The `openshell-sandbox` supervisor adapts to arbitrary environments:
| Community name resolution | Bare names like `openclaw` expand to the GHCR community registry, making the common case simple |
| Auto build+push for Dockerfiles | Eliminates the two-step `image push` + `create` workflow for local development |
| `OPENSHELL_COMMUNITY_REGISTRY` env var | Allows organizations to host their own community sandbox registry |
-| hostPath side-load | Supervisor binary lives on the node filesystem — no init container, no emptyDir, no extra image pull. Faster pod startup. |
-| Read-only mount in agent | The supervisor binary is mounted read-only, and the startup seccomp prelude blocks the remount syscalls that would otherwise reopen it for writes once privileged bootstrap has completed. |
+| Driver-owned supervisor delivery | Each compute driver decides how to deliver `openshell-sandbox` without depending on a k3s cluster image. |
+| Read-only supervisor delivery | The supervisor should be mounted or packaged read-only where the driver supports it, and the startup seccomp prelude blocks remount syscalls that would otherwise reopen it for writes once privileged bootstrap has completed. |
| Command override | Ensures `openshell-sandbox` is the entrypoint regardless of the image's default CMD |
| Clear `run_as_user/group` for custom images | Prevents startup failure when the image lacks the default `sandbox` user |
| Non-fatal log file init | `/var/log/openshell.log` may be unwritable in arbitrary images; falls back to stdout |
-| `docker save` / `ctr import` for push | Avoids requiring a registry for local dev; images land directly in the k3s containerd store |
+| Registry publication for built images | Kubernetes and remote compute backends need image references that their runtime can pull. |
| Optional `iptables` for bypass detection | Core network isolation works via routing alone (`iproute2`); `iptables` only adds fast-fail (`ECONNREFUSED`) and diagnostic LOG entries. Making it optional avoids hard failures in minimal images that lack `iptables` while giving better UX when it is available. |
## Limitations
- Distroless / `FROM scratch` images are not supported (the supervisor needs glibc and `/proc`)
- Missing `iproute2` (or required capabilities) blocks startup in proxy mode because namespace isolation is mandatory
-- The supervisor binary must be present on the k3s node at `/opt/openshell/bin/openshell-sandbox` (embedded in the cluster image at build time)
+- The selected compute driver must provide an `openshell-sandbox` binary compatible with the sandbox image and host architecture
diff --git a/architecture/sandbox.md b/architecture/sandbox.md
index 756baf88c..98f2bb8c7 100644
--- a/architecture/sandbox.md
+++ b/architecture/sandbox.md
@@ -898,29 +898,29 @@ pub struct InferenceContext {
#### Design decision: standalone capability
-The sandbox is designed to operate both as part of a cluster and as a standalone component without any cluster infrastructure. This is intentional -- it enables local development workflows (e.g., a developer running a sandbox against a local LLM server without deploying the full stack), CI/CD environments where sandboxes run as isolated test harnesses, and air-gapped deployments where the gateway is not available. Everything the sandbox needs -- policy, inference routes -- can be provided without any dependency on the control plane.
+The sandbox is designed to operate both under a gateway-managed compute platform and as a standalone component without gateway infrastructure. This is intentional -- it enables local development workflows (e.g., a developer running a sandbox against a local LLM server without deploying the full stack), CI/CD environments where sandboxes run as isolated test harnesses, and air-gapped deployments where the gateway is not available. Everything the sandbox needs -- policy, inference routes -- can be provided without any dependency on the control plane.
#### Route sources (priority order)
-1. **Route file (standalone mode)**: `--inference-routes` / `OPENSHELL_INFERENCE_ROUTES` points to a YAML file parsed by `RouterConfig::load_from_file()`. Routes are resolved via `config.resolve_routes()`. File loading or parsing errors are fatal (fail-fast), but an empty route list gracefully disables inference routing (returns `None`). The route file always takes precedence -- if both a route file and cluster credentials are present, the route file wins and the cluster bundle is not fetched.
+1. **Route file (standalone mode)**: `--inference-routes` / `OPENSHELL_INFERENCE_ROUTES` points to a YAML file parsed by `RouterConfig::load_from_file()`. Routes are resolved via `config.resolve_routes()`. File loading or parsing errors are fatal (fail-fast), but an empty route list gracefully disables inference routing (returns `None`). The route file always takes precedence -- if both a route file and gateway credentials are present, the route file wins and the gateway bundle is not fetched.
-2. **Cluster bundle (cluster mode)**: When `openshell_endpoint` is available (and no route file is configured), routes are fetched from the gateway via `grpc_client::fetch_inference_bundle()`, which calls the `GetInferenceBundle` gRPC RPC on the `Inference` service. The RPC takes no arguments (the bundle is cluster-scoped, not per-sandbox). The gateway returns a `GetInferenceBundleResponse` containing resolved `ResolvedRoute` entries for the managed cluster route. These proto messages are converted to router `ResolvedRoute` structs by `bundle_to_resolved_routes()`, which maps provider types to auth headers and default headers via `openshell_core::inference::auth_for_provider_type()`.
+2. **Gateway bundle (gateway mode)**: When `openshell_endpoint` is available (and no route file is configured), routes are fetched from the gateway via `grpc_client::fetch_inference_bundle()`, which calls the `GetInferenceBundle` gRPC RPC on the `Inference` service. The RPC takes no arguments (the bundle is gateway-scoped, not per-sandbox). The gateway returns a `GetInferenceBundleResponse` containing resolved `ResolvedRoute` entries for the managed gateway route. These proto messages are converted to router `ResolvedRoute` structs by `bundle_to_resolved_routes()`, which maps provider types to auth headers and default headers via `openshell_core::inference::auth_for_provider_type()`.
-3. **No source**: If neither route file nor cluster credentials are configured, `build_inference_context()` returns `None` and inference routing is disabled.
+3. **No source**: If neither route file nor gateway credentials are configured, `build_inference_context()` returns `None` and inference routing is disabled.
-#### Cluster mode graceful degradation
+#### Gateway mode graceful degradation
-In cluster mode, `fetch_inference_bundle()` failures are handled based on the error type:
+In gateway mode, `fetch_inference_bundle()` failures are handled based on the error type:
- gRPC `PermissionDenied` or `NotFound` (detected via error message string matching): sandbox has no inference policy -- inference routing is silently disabled.
- Other errors: logged as a warning, inference routing is disabled.
- Empty initial route bundle: inference routing stays enabled with an empty cache and background refresh continues.
-Route sources handle empty route lists differently: file mode disables inference routing when the file resolves to zero routes, while cluster mode keeps inference routing active with an empty cache so refresh can pick up routes created later. File *loading errors* (missing file, parse failure) are fatal, while cluster *fetch errors* are non-fatal.
+Route sources handle empty route lists differently: file mode disables inference routing when the file resolves to zero routes, while gateway mode keeps inference routing active with an empty cache so refresh can pick up routes created later. File *loading errors* (missing file, parse failure) are fatal, while gateway *fetch errors* are non-fatal.
#### Background route cache refresh
-In cluster mode (when no route file is configured), `spawn_route_refresh()` starts a background tokio task that refreshes the route cache every 30 seconds (`ROUTE_REFRESH_INTERVAL_SECS`). The task calls `fetch_inference_bundle()` on each tick and replaces the `RwLock>` contents. On fetch failure, the task logs a warning and keeps the stale routes. The `MissedTickBehavior::Skip` policy prevents refresh storms after temporary gateway outages.
+In gateway mode (when no route file is configured), `spawn_route_refresh()` starts a background tokio task that refreshes the route cache every 30 seconds (`ROUTE_REFRESH_INTERVAL_SECS`). The task calls `fetch_inference_bundle()` on each tick and replaces the `RwLock>` contents. On fetch failure, the task logs a warning and keeps the stale routes. The `MissedTickBehavior::Skip` policy prevents refresh storms after temporary gateway outages.
```mermaid
flowchart TD
@@ -940,7 +940,7 @@ flowchart TD
M -- Yes --> L
M -- No --> N[Warn + None]
H -- No --> L
- F --> O[spawn_route_refresh if cluster mode]
+ F --> O[spawn_route_refresh if gateway mode]
G --> O
```
@@ -1558,8 +1558,8 @@ The sandbox uses `miette` for error reporting and `thiserror` for typed errors.
| SSRF: DNS resolution failure | Deny the specific CONNECT request |
| Inference route file load/parse error | Fatal -- sandbox startup aborts |
| Inference route file with empty routes | Inference routing disabled (graceful) |
-| Inference cluster bundle with empty routes | Inference routing stays enabled with empty cache; refresh can activate routes later |
-| Inference cluster bundle fetch failure | Warn + inference routing disabled (graceful) |
+| Inference gateway bundle with empty routes | Inference routing stays enabled with empty cache; refresh can activate routes later |
+| Inference gateway bundle fetch failure | Warn + inference routing disabled (graceful) |
| Inference interception: missing InferenceContext | Denied outcome + structured CONNECT deny log |
| Inference interception: missing TLS state | Denied outcome + structured CONNECT deny log |
| Inference interception: TLS handshake failure | Denied outcome + structured CONNECT deny log |
diff --git a/architecture/security-policy.md b/architecture/security-policy.md
index 8afef7ae3..aeaac9fb9 100644
--- a/architecture/security-policy.md
+++ b/architecture/security-policy.md
@@ -31,7 +31,7 @@ The YAML data file is preprocessed before loading into the OPA engine: L7 polici
### gRPC Mode (Production)
-When the sandbox runs inside a managed cluster, it fetches its typed protobuf policy from the gateway:
+When the sandbox runs under a gateway-managed compute platform, it fetches its typed protobuf policy from the gateway:
```bash
openshell-sandbox \
@@ -664,7 +664,7 @@ network_policies:
Inference routing to `inference.local` is handled by the proxy's `InferenceContext`, not by the OPA policy engine or an `inference` block in the policy YAML. The proxy intercepts HTTPS CONNECT requests to `inference.local` and routes matching inference API requests (e.g., `POST /v1/chat/completions`, `POST /v1/messages`) through the sandbox-local `openshell-router`. See [Inference Routing](inference-routing.md) for details on route configuration and the router architecture.
-The proxy always runs in proxy mode so that `inference.local` is addressable from within the sandbox's network namespace. Inference route sources are configured separately from policy: via `--inference-routes` (file mode) or fetched from the gateway's inference bundle (cluster mode). See `crates/openshell-sandbox/src/proxy.rs` -- `InferenceContext`, `crates/openshell-sandbox/src/l7/inference.rs`.
+The proxy always runs in proxy mode so that `inference.local` is addressable from within the sandbox's network namespace. Inference route sources are configured separately from policy: via `--inference-routes` (file mode) or fetched from the gateway's inference bundle (gateway mode). See `crates/openshell-sandbox/src/proxy.rs` -- `InferenceContext`, `crates/openshell-sandbox/src/l7/inference.rs`.
---
diff --git a/architecture/system-architecture.md b/architecture/system-architecture.md
index 5c7fcdcf7..b271cdd67 100644
--- a/architecture/system-architecture.md
+++ b/architecture/system-architecture.md
@@ -9,23 +9,23 @@ graph TB
CLI["OpenShell CLI
(openshell)"]
TUI["OpenShell TUI
(openshell term)"]
SDK["Python SDK
(openshell)"]
- LocalConfig["~/.config/openshell/
clusters, mTLS certs,
active_cluster"]
+ LocalConfig["~/.config/openshell/
gateways, mTLS certs,
active_gateway"]
end
%% ============================================================
- %% KUBERNETES CLUSTER (single Docker container)
+ %% GATEWAY AND COMPUTE PLATFORM
%% ============================================================
- subgraph Cluster["OpenShell Cluster Container (Docker)"]
+ subgraph Cluster["Gateway and Compute Platform"]
- subgraph K3s["k3s (v1.35.2-k3s1)"]
- KubeAPI["Kubernetes API
:6443"]
- HelmController["Helm Controller"]
- LocalPathProv["local-path-provisioner"]
- end
+ ComputeDriver["Compute Driver
(Docker, Podman,
Kubernetes, VM)"]
+ DockerAPI["Docker API"]
+ PodmanAPI["Podman API"]
+ KubeAPI["Kubernetes API
(optional)"]
+ VMDriver["VM Driver
(experimental)"]
- subgraph NSNamespace["openshell namespace"]
+ subgraph NSNamespace["Gateway runtime"]
- subgraph GatewayPod["Gateway StatefulSet"]
+ subgraph GatewayPod["Gateway Process"]
Gateway["openshell-server
:8080
(gRPC + HTTP, mTLS)"]
SQLite[("SQLite DB
/var/openshell/
openshell.db")]
SupRegistry["SupervisorSessionRegistry
(live sessions + pending relays)"]
@@ -33,7 +33,7 @@ graph TB
LogBus["TracingLogBus
(in-memory broadcast)"]
end
- subgraph SandboxPod["Sandbox Pod (1 per sandbox)"]
+ subgraph SandboxPod["Sandbox Workload
(container, pod, or VM)"]
subgraph Supervisor["Sandbox Supervisor
(privileged user)"]
SSHServer["Embedded SSH
Server (russh)
Unix socket
/run/openshell/ssh.sock"]
@@ -54,7 +54,7 @@ graph TB
end
end
- subgraph ASNamespace["agent-sandbox-system namespace"]
+ subgraph ASNamespace["Kubernetes driver only"]
CRDController["Agent Sandbox
CRD Controller"]
end
@@ -91,7 +91,7 @@ graph TB
%% ============================================================
%% CONNECTIONS: User Machine --> Cluster
%% ============================================================
- CLI -- "gRPC over HTTPS (mTLS)
:30051 NodePort" --> Gateway
+ CLI -- "gRPC over HTTPS (mTLS)
service / ingress / port-forward" --> Gateway
TUI -- "gRPC polling (mTLS)
every 2s" --> Gateway
SDK -- "gRPC over HTTPS (mTLS)" --> Gateway
CLI -- "HTTP CONNECT upgrade
/connect/ssh (mTLS)" --> Gateway
@@ -102,8 +102,12 @@ graph TB
%% ============================================================
Gateway --> SQLite
Gateway --> SupRegistry
- Gateway -- "Watch + CRUD
Sandbox CRDs" --> KubeAPI
- KubeAPI -- "compute-driver events
(status, platform events)" --> Gateway
+ Gateway -- "Create / delete / watch
sandboxes" --> ComputeDriver
+ ComputeDriver --> DockerAPI
+ ComputeDriver --> PodmanAPI
+ ComputeDriver --> KubeAPI
+ ComputeDriver --> VMDriver
+ ComputeDriver -- "status, platform events" --> Gateway
%% ============================================================
%% CONNECTIONS: Supervisor session (inbound from sandbox)
@@ -146,9 +150,9 @@ graph TB
InferenceRouter -- "HTTPS" --> NVIDIA_API
%% ============================================================
- %% CONNECTIONS: Cluster bootstrap
+ %% CONNECTIONS: Image pulls
%% ============================================================
- K3s -- "pulls images
at runtime" --> GHCR
+ ComputeDriver -- "pulls or schedules workloads
that pull images" --> GHCR
%% ============================================================
%% CLIENT SSH / EXEC (bytes tunneled via supervisor relay)
@@ -175,7 +179,7 @@ graph TB
class Agent,Landlock,Seccomp,NetNS agent
class SQLite datastore
class Anthropic,OpenAI,NVIDIA_API,GitHub,GitLab,PyPI,NPM,LMStudio,VLLM,GHCR external
- class KubeAPI,HelmController,LocalPathProv,CRDController k8s
+ class ComputeDriver,DockerAPI,PodmanAPI,KubeAPI,VMDriver,CRDController k8s
class LocalConfig config
```
@@ -188,12 +192,12 @@ graph TB
| Green | Sandbox supervisor | SSH server, HTTP CONNECT proxy, OPA engine, inference router |
| Purple | Agent process & isolation | AI agent, Landlock, Seccomp, network namespace |
| Indigo | Data stores | SQLite database |
-| Dark blue | Kubernetes infrastructure | K8s API, Helm controller, CRD controller |
+| Dark blue | Compute infrastructure | Docker API, Podman API, K8s API, VM driver |
| Gray | External systems | AI APIs, code hosting, package registries, inference backends |
## Key Communication Flows
-1. **CLI/SDK to Gateway**: All control-plane traffic uses gRPC over HTTPS with mutual TLS (mTLS). Single multiplexed port (8080 inside cluster, 30051 NodePort).
+1. **CLI/SDK to Gateway**: Control-plane traffic uses gRPC over HTTPS with mutual TLS (mTLS) unless the gateway is explicitly deployed in plaintext mode behind a trusted transport. The gateway listens on one multiplexed service port.
2. **Supervisor Session (inbound from sandbox)**: Each sandbox supervisor opens a persistent `ConnectSupervisor` bidi gRPC stream to the gateway over mTLS. The gateway tracks these in `SupervisorSessionRegistry`. When SSH or exec access is needed, the gateway sends `RelayOpen { channel_id }` on that stream; the supervisor responds by initiating a `RelayStream` RPC on the same HTTP/2 connection whose first frame is a `RelayInit { channel_id }`. Subsequent frames carry raw bytes in both directions. The gateway never dials the sandbox pod.
diff --git a/architecture/tui.md b/architecture/tui.md
index 00e53a829..850cebc88 100644
--- a/architecture/tui.md
+++ b/architecture/tui.md
@@ -1,10 +1,10 @@
# OpenShell TUI
-The OpenShell TUI is a terminal user interface for OpenShell, inspired by [k9s](https://k9scli.io/). Instead of typing individual CLI commands to check cluster health, list sandboxes, and manage resources, the TUI gives you a real-time, keyboard-driven dashboard — everything updates automatically and you navigate with a few keystrokes.
+The OpenShell TUI is a terminal user interface for OpenShell, inspired by [k9s](https://k9scli.io/). Instead of typing individual CLI commands to check gateway health, list sandboxes, and manage resources, the TUI gives you a real-time, keyboard-driven dashboard — everything updates automatically and you navigate with a few keystrokes.
## Launching the TUI
-The TUI is a subcommand of the OpenShell CLI, so it inherits all your existing configuration — cluster selection, TLS settings, and verbosity flags all work the same way.
+The TUI is a subcommand of the OpenShell CLI, so it inherits all your existing configuration — gateway selection, TLS settings, and verbosity flags all work the same way.
```bash
openshell term # launch against the active gateway
@@ -27,7 +27,7 @@ The TUI divides the terminal into four horizontal regions:
```text
┌─────────────────────────────────────────────────────────────────┐
-│ OpenShell ─ my-cluster ─ Dashboard ● Healthy │ ← title bar
+│ OpenShell ─ my-gateway ─ Dashboard ● Healthy │ ← title bar
├─────────────────────────────────────────────────────────────────┤
│ │
│ (view content — Dashboard or Sandboxes) │ ← main area
@@ -39,7 +39,7 @@ The TUI divides the terminal into four horizontal regions:
└─────────────────────────────────────────────────────────────────┘
```
-- **Title bar** — shows the OpenShell logo, cluster name, current view, and live cluster health status.
+- **Title bar** — shows the OpenShell logo, gateway name, current view, and live gateway health status.
- **Main area** — the active view (Dashboard or Sandboxes).
- **Navigation bar** — lists available views with their shortcut keys, plus Help and Quit.
- **Command bar** — appears when you press `:` to type a command (like vim).
@@ -48,20 +48,20 @@ The TUI divides the terminal into four horizontal regions:
### Dashboard (press `1`)
-The Dashboard is the home screen. It shows your cluster at a glance.
+The Dashboard is the home screen. It shows your gateway at a glance.
The dashboard is divided into a top info pane and a middle pane with two tabs:
-- **Top pane**: Cluster name, gateway endpoint, health status, sandbox count.
+- **Top pane**: Gateway name, gateway endpoint, health status, sandbox count.
- **Middle pane**: Tabbed view toggled with `Tab`:
- - **Providers** — provider configurations attached to the cluster.
+ - **Providers** — provider configurations attached to the gateway.
- **Global Settings** — gateway-global runtime settings (fetched via `GetGatewaySettings`).
**Health status** indicators:
- `●` **Healthy** (green) — everything is running normally.
-- `◐` **Degraded** (yellow) — the cluster is up but something needs attention.
-- `○` **Unhealthy** (red) — the cluster is not operating correctly.
+- `◐` **Degraded** (yellow) — the gateway is up but something needs attention.
+- `○` **Unhealthy** (red) — the gateway is not operating correctly.
- `…` — still connecting or status unknown.
**Global policy indicator**: When a global policy is active, the gateway row shows `Global Policy Active (vN)` in yellow (the `status_warn` style). The TUI detects this by polling `ListSandboxPolicies` with `global: true, limit: 1` on each tick and checking if the latest revision has `PolicyStatus::Loaded`. See `crates/openshell-tui/src/ui/dashboard.rs`.
@@ -81,7 +81,7 @@ Both edit and delete operations display a confirmation modal before applying. Ch
### Sandboxes (press `2`)
-The Sandboxes view shows a table of all sandboxes in the cluster:
+The Sandboxes view shows a table of all sandboxes managed by the gateway:
| Column | Description |
|--------|-------------|
@@ -150,7 +150,7 @@ Press `Esc` to cancel and return to Normal mode. `Backspace` deletes characters
## Data Refresh
-The TUI automatically polls the cluster every **2 seconds**. Cluster health, the sandbox list, and global settings all update on each tick, so the display stays current without manual refreshing. This uses the same gRPC calls as the CLI — no additional server-side setup is required.
+The TUI automatically polls the gateway every **2 seconds**. Gateway health, the sandbox list, and global settings all update on each tick, so the display stays current without manual refreshing. This uses the same gRPC calls as the CLI — no additional server-side setup is required.
When viewing a sandbox, the policy pane auto-refreshes when a new policy version is detected. The sandbox list response includes `current_policy_version` for each sandbox; on every tick the TUI compares this against the currently displayed policy version and re-fetches the full policy only when they differ. This avoids extra RPCs during normal operation while ensuring policy updates appear within the polling interval. The user's scroll position is preserved across auto-refreshes.
diff --git a/docs/CONTRIBUTING.mdx b/docs/CONTRIBUTING.mdx
index f2426873a..fd539575f 100644
--- a/docs/CONTRIBUTING.mdx
+++ b/docs/CONTRIBUTING.mdx
@@ -126,7 +126,7 @@ These patterns are common in LLM-generated text and erode trust with technical r
- Use `shell` code blocks for copyable CLI examples. Do not prefix commands with `$`:
```shell
- openshell gateway start
+ openshell gateway add http://127.0.0.1:18080 --local --name local
```
- Use `text` code blocks for transcripts, log output, and examples that should not be copied verbatim.
@@ -168,7 +168,7 @@ docs: update gateway deployment instructions
If your doc change accompanies a code change, include both in the same PR and use the code change's commit type:
```text
-feat(cli): add --gpu flag to gateway start
+feat(cli): add gateway registration flag
```
## Reviewing Doc PRs
diff --git a/docs/about/architecture.mdx b/docs/about/architecture.mdx
index a88e6f490..ac17a8d4e 100644
--- a/docs/about/architecture.mdx
+++ b/docs/about/architecture.mdx
@@ -8,7 +8,7 @@ keywords: "Generative AI, Cybersecurity, AI Agents, Sandboxing, Security, Archit
position: 2
---
-OpenShell runs inside a Docker container. Each sandbox is an isolated environment managed through the gateway. Three components work together to keep agents secure.
+OpenShell runs sandboxes through a gateway that can use several compute platforms, including Kubernetes, Docker, Podman, and the experimental MicroVM runtime. Each sandbox is an isolated environment managed through the gateway. Three components work together to keep agents secure.

@@ -42,7 +42,7 @@ For REST endpoints, the proxy auto-detects Transport Layer Security (TLS) by pee
## Gateway Lifecycle
-OpenShell preserves sandbox state across gateway restarts. After `openshell gateway stop` and `openshell gateway start`, running sandboxes resume from their saved state. Existing SSH sessions reconnect without re-authenticating.
+OpenShell preserves gateway state in the configured gateway database. Kubernetes deployments commonly use the Helm chart's persistent volume, while standalone gateway deployments can use a local SQLite file or Postgres. Sandbox state depends on the selected compute driver and the storage configured for sandbox workloads.
## Observability
@@ -50,13 +50,15 @@ The sandbox emits operational decisions as Open Cybersecurity Schema Framework (
## Deployment Modes
-OpenShell can run locally, on a remote host, or behind a cloud proxy. The architecture is identical in all cases. Only the Docker container location and authentication mode change.
+OpenShell is deployed by running or exposing a gateway and then registering the gateway's endpoint with the CLI. The gateway's compute driver determines where sandboxes run.
| Mode | Description | Command |
|---|---|---|
-| **Local** | The gateway runs inside Docker on your workstation. The CLI provisions it automatically on first use. | `openshell gateway start` |
-| **Remote** | The gateway runs on a remote host via SSH. Only Docker is required on the remote machine. | `openshell gateway start --remote user@host` |
-| **Cloud** | A gateway already running behind a reverse proxy (e.g. Cloudflare Access). Register and authenticate via browser. | `openshell gateway add https://gateway.example.com` |
+| **Docker local** | The gateway uses Docker to create local sandbox containers. | `mise run gateway:docker` |
+| **Podman local** | The gateway uses a Podman-compatible container runtime, commonly rootless. | Driver-specific gateway configuration |
+| **Kubernetes service** | The Helm chart deploys the gateway and configures the Kubernetes compute driver. | `helm upgrade --install ... --values values.yaml` |
+| **MicroVM** | The gateway uses the experimental VM compute driver. | Driver-specific gateway configuration |
+| **Cloud proxy** | A gateway already running behind a reverse proxy, such as Cloudflare Access. Register and authenticate via browser. | `openshell gateway add https://gateway.example.com` |
You can register multiple gateways and switch between them with `openshell gateway select`. For the full deployment and management workflow, refer to the [Gateways](/sandboxes/manage-gateways) section.
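+
+For example, registering a second gateway and switching to it (the endpoint and name are illustrative, and passing the gateway name to `select` is an assumption):
+
+```shell
+openshell gateway add https://gateway.example.com --name prod
+openshell gateway select prod
+openshell status
+```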
diff --git a/docs/get-started/quickstart.mdx b/docs/get-started/quickstart.mdx
index 2f26c7bfb..d5e2727ef 100644
--- a/docs/get-started/quickstart.mdx
+++ b/docs/get-started/quickstart.mdx
@@ -2,18 +2,20 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: "Quickstart"
-description: "Install the OpenShell CLI and create your first sandboxed AI agent in two commands."
-keywords: "Generative AI, Cybersecurity, AI Agents, Sandboxing, Installation, Quickstart"
+description: "Install the OpenShell CLI, connect to a gateway, and create your first sandboxed AI agent."
+keywords: "Generative AI, Cybersecurity, AI Agents, Sandboxing, Installation, Quickstart, Gateway, Docker, Kubernetes, Podman"
position: 1
---
-This page gets you from zero to a running, policy-enforced sandbox in two commands.
+This page gets you from a reachable OpenShell gateway to a running, policy-enforced sandbox.
## Prerequisites
Before you begin, make sure you have:
-- Docker Desktop running on your machine.
+- A reachable OpenShell gateway.
+- At least one compute platform configured for the gateway: Kubernetes, Docker, Podman, or MicroVM.
+- The OpenShell CLI installed on your workstation.
For a complete list of requirements, refer to [Support Matrix](/reference/support-matrix).
@@ -31,13 +33,74 @@ If you prefer [uv](https://docs.astral.sh/uv/):
uv tool install -U openshell
```
-After installing the CLI, run `openshell --help` in your terminal to see the full CLI reference, including all commands and flags.
+After installing the CLI, run `openshell --help` in your terminal to see the full CLI reference.
You can also clone the [NVIDIA OpenShell GitHub repository](https://github.com/NVIDIA/OpenShell) and use the `/openshell-cli` skill to load the CLI reference into your agent.
-
+## Connect to a Gateway
+
+OpenShell sandboxes run through a gateway. The gateway can use different compute platforms:
+
+| Platform | Typical use |
+|---|---|
+| Kubernetes | Shared clusters and Helm-managed deployments. |
+| Docker | Local development and single-machine gateway testing. |
+| Podman | Rootless local or workstation deployments. |
+| MicroVM | Stronger isolation experiments using the VM driver. |
+
+For a local source checkout, you can start a Docker-backed gateway for evaluation:
+
+```shell
+git clone https://github.com/NVIDIA/OpenShell.git
+cd OpenShell
+mise trust
+mise run gateway:docker
+```
+
+In another terminal, register the gateway:
+
+```shell
+openshell gateway add http://127.0.0.1:18080 --local --name local
+openshell status
+```
+
+For Kubernetes deployments, install the Helm chart into a cluster you manage:
+
+```shell
+kubectl create namespace openshell --dry-run=client -o yaml | kubectl apply -f -
+kubectl -n openshell create secret generic openshell-ssh-handshake \
+ --from-literal=secret="$(openssl rand -hex 32)" \
+ --dry-run=client -o yaml | kubectl apply -f -
+```
+
+```shell
+helm upgrade --install openshell ./deploy/helm/openshell \
+ --namespace openshell \
+ --set service.type=ClusterIP \
+ --set server.disableTls=true \
+ --set server.grpcEndpoint=http://openshell.openshell.svc.cluster.local:8080
+```
+
+Then wait for the gateway and start a local port-forward:
+
+```shell
+kubectl -n openshell rollout status statefulset/openshell
+kubectl -n openshell port-forward svc/openshell 8080:8080
+```
+
+In another terminal, register the gateway:
+
+```shell
+openshell gateway add http://127.0.0.1:8080 --local --name local
+openshell status
+```
+
+
+Plaintext examples are for trusted local evaluation. For shared deployments, keep TLS enabled or expose the gateway through a trusted access proxy. See [Gateway Authentication](/reference/gateway-auth).
+
+
## Create Your First OpenShell Sandbox
Create a sandbox and launch an agent inside it.
@@ -100,8 +163,8 @@ Each definition bundles a container image, a tailored policy, and optional skill
-Use the `--from` flag to pull other OpenShell sandbox images from the [NVIDIA Container Registry](https://registry.nvidia.com/).
-For example, to pull the `base` image, run the following command:
+Use the `--from` flag to pull other OpenShell sandbox images from the [OpenShell Community](https://github.com/NVIDIA/OpenShell-Community) catalog.
+For example, to pull the `base` image, run:
```shell
openshell sandbox create --from base
@@ -110,46 +173,3 @@ openshell sandbox create --from base
-
-## Deploy a Gateway (Optional)
-
-Running `openshell sandbox create` without a gateway auto-bootstraps a local one.
-To start the gateway explicitly or deploy to a remote host, choose the tab that matches your setup.
-
-
-
-
-
-Deploy an OpenShell gateway on Brev by clicking **Deploy** on the [OpenShell Launchable](https://brev.nvidia.com/launchable/deploy/now?launchableID=env-3Ap3tL55zq4a8kew1AuW0FpSLsg).
-
-
-
-After the instance starts running, find the gateway URL in the Brev console under **Using Secure Links**.
-Copy the shareable URL for **port 8080**, which is the gateway endpoint.
-
-```shell
-openshell gateway add https://.brevlab.com
-openshell status
-```
-
-
-
-
-
-
-Set up your Spark with NVIDIA Sync first, or make sure SSH access is configured (such as SSH keys added to the host).
-
-
-
-Deploy to a DGX Spark machine over SSH:
-
-```shell
-openshell gateway start --remote @.local
-openshell status
-```
-
-After `openshell status` shows the gateway as healthy, all subsequent commands route through the SSH tunnel.
-
-
-
-
diff --git a/docs/get-started/tutorials/github-sandbox.mdx b/docs/get-started/tutorials/github-sandbox.mdx
index ee3e16759..021a7c348 100644
--- a/docs/get-started/tutorials/github-sandbox.mdx
+++ b/docs/get-started/tutorials/github-sandbox.mdx
@@ -334,7 +334,7 @@ The push completes successfully. The `openshell term` dashboard now shows `l7_de
## Clean Up
-When you are finished, delete the sandbox to free cluster resources:
+When you are finished, delete the sandbox to free gateway compute resources:
```shell
openshell sandbox delete
diff --git a/docs/reference/gateway-auth.mdx b/docs/reference/gateway-auth.mdx
index dc06a94b8..db96a332b 100644
--- a/docs/reference/gateway-auth.mdx
+++ b/docs/reference/gateway-auth.mdx
@@ -24,9 +24,9 @@ The CLI loads gateway metadata from disk to determine the endpoint URL and authe
The CLI uses one of three connection modes depending on the gateway's authentication configuration.
-### mTLS (local and remote gateways)
+### mTLS
-The default mode for self-deployed gateways. When you run `gateway start` or `gateway add https://... --local` / `gateway add https://... --remote`, the CLI extracts mTLS certificates from the running container and stores them locally. Every subsequent request presents a client certificate to prove identity.
+The default mode for self-managed gateways. Every CLI request presents a client certificate to prove identity. Kubernetes Helm deployments provide the server certificate, client CA, and client certificate bundle through Kubernetes secrets. Standalone Docker, Podman, and MicroVM-backed gateways provide equivalent certificate files through their runtime configuration.
The CLI loads three PEM files from `~/.config/openshell/gateways//mtls/`:
@@ -44,6 +44,24 @@ The connection flow:
4. The gateway verifies the client certificate against its CA.
5. An HTTP/2 channel is established. All CLI commands use this channel.
+For Helm deployments, create these Kubernetes secrets before installing the chart:
+
+```shell
+kubectl -n openshell create secret tls openshell-server-tls \
+ --cert server.crt \
+ --key server.key
+
+kubectl -n openshell create secret generic openshell-server-client-ca \
+ --from-file=ca.crt=ca.crt
+
+kubectl -n openshell create secret generic openshell-client-tls \
+ --from-file=ca.crt=ca.crt \
+ --from-file=tls.crt=client.crt \
+ --from-file=tls.key=client.key
+```
+
+The same `ca.crt`, `client.crt`, and `client.key` files must be placed in the CLI mTLS directory as `ca.crt`, `tls.crt`, and `tls.key`.
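+
+As a minimal sketch, assuming the gateway is registered under the name `local` (the directory name must match your registration), the copy step can look like:
+
+```shell
+# Hypothetical gateway name "local"; adjust the path segment to the registered gateway name.
+mkdir -p ~/.config/openshell/gateways/local/mtls
+cp ca.crt     ~/.config/openshell/gateways/local/mtls/ca.crt
+cp client.crt ~/.config/openshell/gateways/local/mtls/tls.crt
+cp client.key ~/.config/openshell/gateways/local/mtls/tls.key
+```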
+
### Edge JWT (cloud gateways)
For gateways behind a reverse proxy that handles authentication (e.g. Cloudflare Access), the CLI uses a browser-based login flow and routes traffic through a WebSocket tunnel.
@@ -70,7 +88,7 @@ This is transparent to the user. All CLI commands work the same regardless of wh
### Plaintext
-When a gateway is deployed with `--plaintext`, TLS is disabled entirely. The CLI connects over plain HTTP/2. This mode is intended for gateways behind a trusted reverse proxy or tunnel that handles TLS termination externally.
+When a gateway is deployed with `server.disableTls=true`, TLS is disabled entirely. The CLI connects over plain HTTP/2. This mode is intended for local port-forwarding or gateways behind a trusted reverse proxy or tunnel that handles TLS termination externally.
Register a plaintext gateway with an explicit `http://` endpoint:
diff --git a/docs/reference/support-matrix.mdx b/docs/reference/support-matrix.mdx
index c5eeee567..1e0a3c25a 100644
--- a/docs/reference/support-matrix.mdx
+++ b/docs/reference/support-matrix.mdx
@@ -6,11 +6,11 @@ description: ""
position: 4
---
-This page lists the platform, software, runtime, and kernel requirements for running OpenShell.
+This page lists the host platform, compute platform, software, runtime, and kernel requirements for running OpenShell.
## Supported Platforms
-OpenShell publishes multi-architecture container images for `linux/amd64` and `linux/arm64`. The CLI and standalone gateway binary are supported on the following host platforms:
+OpenShell publishes multi-architecture gateway container images for `linux/amd64` and `linux/arm64`. The CLI and standalone gateway binary are supported on the following host platforms:
| Platform | Architecture | Status |
| -------------------------------- | --------------------- | --------- |
@@ -29,15 +29,30 @@ OpenShell publishes standalone `openshell-gateway` release assets for manual dow
| Linux aarch64 (arm64) | `openshell-gateway-aarch64-unknown-linux-gnu` |
| macOS Apple Silicon | `openshell-gateway-aarch64-apple-darwin` |
-These artifacts are attached to GitHub releases. `openshell gateway start` continues to use the published cluster and gateway container images.
+These artifacts are attached to GitHub releases. Kubernetes deployments should use the Helm chart and the published gateway image.
+
+## Compute Platforms
+
+The gateway can manage sandboxes on several compute platforms.
+
+| Compute platform | Status | Notes |
+|---|---|---|
+| Docker | Supported for local development and single-machine gateways. | Requires Docker Desktop or Docker Engine on the gateway host. |
+| Podman | Supported for rootless local and workstation workflows. | Requires a Podman-compatible socket and rootless networking setup. |
+| Kubernetes | Supported through the Helm chart in `deploy/helm/openshell`. | Requires a Kubernetes cluster supplied by the operator. |
+| MicroVM | Experimental. | Uses the VM compute driver and libkrun-based runtime work. |
## Software Prerequisites
-The following software must be installed on the host before using the OpenShell CLI:
+Install the software for the compute platform you use:
-| Component | Minimum Version | Notes |
-| ------------------------------- | --------------- | ----------------------------------------------- |
-| Docker Desktop or Docker Engine | 28.04 | Must be running before any `openshell` command. |
+| Component | Minimum Version | Notes |
+|---|---|---|
+| Docker Desktop or Docker Engine | 28.04 | Required for Docker-backed gateways, local image builds, and Docker development workflows. |
+| Podman | 5.x | Required for Podman-backed gateways. |
+| Kubernetes | 1.29 | Required for Helm deployments and Kubernetes sandbox scheduling. |
+| Helm | 3.x | Required to install `deploy/helm/openshell`. |
+| kubectl | Compatible with your cluster | Required for Kubernetes operational inspection and secret creation. |
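+
+As a quick sanity check, confirm the tools for your chosen compute platform are installed; only the rows that apply to your setup matter:
+
+```shell
+# Verify whichever prerequisites your compute platform needs.
+docker --version          # Docker-backed gateways
+podman --version          # Podman-backed gateways
+kubectl version --client  # Kubernetes deployments
+helm version              # Helm chart installs
+```
+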
## Sandbox Runtime Versions
@@ -45,23 +60,22 @@ Sandbox container images are maintained in the [openshell-community](https://git
## Container Images
-OpenShell publishes two container images. Both are published for `linux/amd64` and `linux/arm64`.
+OpenShell publishes the gateway container image for `linux/amd64` and `linux/arm64`.
-| Image | Reference | Pulled When |
-| ------- | ----------------------------------------- | -------------------------------- |
-| Cluster | `ghcr.io/nvidia/openshell/cluster:latest` | `openshell gateway start` |
-| Gateway | `ghcr.io/nvidia/openshell/gateway:latest` | Cluster startup (via Helm chart) |
+| Image | Reference | Pulled When |
+|---|---|---|
+| Gateway | `ghcr.io/nvidia/openshell/gateway:latest` | Helm chart install or upgrade, or standalone container deployment |
-The cluster image bundles the Helm charts, Kubernetes manifests, and the `openshell-sandbox` supervisor binary required to bootstrap the control plane. The supervisor binary is side-loaded into sandbox pods at runtime through a read-only host volume mount. The gateway image is pulled at cluster startup and runs the API server.
+The Helm chart in `deploy/helm/openshell` deploys the gateway StatefulSet, service account, service, persistent storage, and network policy for Kubernetes. OpenShell no longer publishes a k3s cluster image.
Sandbox images are maintained separately in the [openshell-community](https://github.com/nvidia/openshell-community) repository.
-To override the default image references, set the following environment variables:
+To override the default image references, use Helm values:
-| Variable | Purpose |
-| ------------------------------ | --------------------------------------------------- |
-| `OPENSHELL_CLUSTER_IMAGE` | Override the cluster image reference. |
-| `OPENSHELL_COMMUNITY_REGISTRY` | Override the registry for community sandbox images. |
+| Helm value | Purpose |
+|---|---|
+| `image.repository` / `image.tag` | Override the gateway image reference. |
+| `server.sandboxImage` | Override the default sandbox image. |
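+
+For example, a sketch of overriding both images at install time; the registry, repository, and tags below are placeholders:
+
+```shell
+# Hypothetical image references; substitute your own mirror and tags.
+helm upgrade --install openshell ./deploy/helm/openshell \
+  --namespace openshell \
+  --set image.repository=registry.example.com/openshell/gateway \
+  --set image.tag=v1.2.3 \
+  --set server.sandboxImage=registry.example.com/openshell/sandbox-base:latest
+```
+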
## Kernel Requirements
diff --git a/docs/sandboxes/about.mdx b/docs/sandboxes/about.mdx
index 92f4e20fe..e52346e16 100644
--- a/docs/sandboxes/about.mdx
+++ b/docs/sandboxes/about.mdx
@@ -10,20 +10,21 @@ position: 1
Every OpenShell deployment starts with a *gateway* and one or more *sandboxes*. The gateway is the control plane that manages sandbox lifecycle, providers, and policies. A sandbox is the data plane, a safe, private execution environment where an AI agent runs. Each sandbox runs with multiple layers of protection that prevent unauthorized data access, credential exposure, and network exfiltration. Protection layers include filesystem restrictions (Landlock), system call filtering (seccomp), network namespace isolation, and a privacy-enforcing HTTP CONNECT proxy.
-## Gateway Types
+## Gateway Compute Platforms
-A gateway provisions sandboxes, brokers CLI requests, enforces policies, and manages provider credentials. OpenShell supports three deployment models, so the gateway can run wherever your workload requires.
+A gateway provisions sandboxes, brokers CLI requests, enforces policies, and manages provider credentials. OpenShell supports multiple compute platforms, so the gateway can run sandboxes wherever your workload requires.
-| Type | Where It Runs | Best For |
+| Platform | Where Sandboxes Run | Best For |
|---|---|---|
-| **Local** | Docker on your workstation | Solo development and quick iteration. The CLI auto-bootstraps a local gateway if none exists. |
-| **Remote** | Docker on a remote host via SSH | Running sandboxes on a more powerful machine (for example, a DGX Spark) while keeping the CLI on your laptop. |
-| **Cloud** | Behind a reverse proxy (for example, Cloudflare Access) | Individual users accessing OpenShell behind a cloud VM. Cloud gateways are not yet intended for shared team access. |
+| Docker | Containers on the gateway host | Solo development, quick iteration, and single-machine gateways. |
+| Podman | Rootless containers on the gateway host | Workstations that avoid a rootful Docker daemon. |
+| Kubernetes | Pods in an operator-managed cluster | Shared clusters and cloud environments. |
+| MicroVM | VM-backed sandboxes | Experimental stronger-isolation workflows. |
-All three types expose the same API surface. Sandboxes, policies, and providers work identically regardless of where the gateway runs. The only difference is how the CLI reaches the gateway, whether through a direct Docker socket, SSH tunnel, or HTTPS through a proxy.
+All compute platforms expose the same gateway API surface. Sandboxes, policies, and providers work the same after the CLI registers the gateway endpoint. The difference is how the gateway creates sandbox workloads and how operators expose the gateway to users.
-You do not need to deploy a gateway manually. Running `openshell sandbox create` without a gateway auto-bootstraps a local one for you.
+For local development, start a Docker-backed gateway from a source checkout with `mise run gateway:docker`, then register it with `openshell gateway add http://127.0.0.1:18080 --local --name local`.
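+
+The condensed flow, using the default local port and a gateway named `local`:
+
+```shell
+# Start the Docker-backed development gateway from a source checkout.
+mise run gateway:docker
+
+# In another terminal, register it and confirm it is healthy.
+openshell gateway add http://127.0.0.1:18080 --local --name local
+openshell status
+```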
diff --git a/docs/sandboxes/manage-gateways.mdx b/docs/sandboxes/manage-gateways.mdx
index a20a27c20..f3f1eaa88 100644
--- a/docs/sandboxes/manage-gateways.mdx
+++ b/docs/sandboxes/manage-gateways.mdx
@@ -3,8 +3,8 @@
# SPDX-License-Identifier: Apache-2.0
title: "Deploy and Manage Gateways"
sidebar-title: "Gateways"
-description: "Deploy local and remote gateways, register cloud gateways, and manage multiple gateway environments."
-keywords: "Generative AI, Cybersecurity, Gateway, Deployment, Remote Gateway, CLI"
+description: "Deploy or register OpenShell gateways, choose a compute platform, and manage multiple gateway environments."
+keywords: "Generative AI, Cybersecurity, Gateway, Deployment, Docker, Podman, Kubernetes, Helm, MicroVM, CLI"
position: 3
---
@@ -13,123 +13,150 @@ The gateway is the control plane for OpenShell. All control-plane traffic betwee
The gateway is responsible for:
- Provisioning and managing sandboxes, including creation, deletion, and status monitoring.
-- Storing provider credentials (API keys, tokens) and delivering them to sandboxes at startup.
+- Storing provider credentials and delivering them to sandboxes at startup.
- Delivering network and filesystem policies to sandboxes. Policy enforcement itself happens inside each sandbox through the proxy, OPA, Landlock, and seccomp.
- Managing inference configuration and serving inference bundles so sandboxes can route requests to the correct backend.
- Providing the SSH tunnel endpoint so you can connect to sandboxes without exposing them directly.
-The gateway runs inside a Docker container and exposes a single port (gRPC and HTTP multiplexed), secured by mTLS by default. No separate Kubernetes installation is required. It can be deployed locally, on a remote host via SSH, or behind a cloud reverse proxy.
+OpenShell separates gateway access from the compute platform that runs sandboxes. The CLI talks to a gateway endpoint. The gateway then uses a configured compute driver to create sandboxes on Kubernetes, Docker, Podman, or the experimental MicroVM runtime.
-## Deploy a Local Gateway
+## Choose a Compute Platform
-Deploy a gateway on your workstation. The only prerequisite is a running Docker daemon.
+Use the compute platform that matches where you want sandboxes to run.
+
+| Platform | Deployment shape | Typical use |
+|---|---|---|
+| Docker | Gateway process or container talks to a Docker daemon. | Local development, single-machine evaluation, and quick driver testing. |
+| Podman | Gateway uses rootless Podman-compatible container execution. | Workstations that avoid a rootful Docker daemon. |
+| Kubernetes | Gateway runs from the Helm chart and manages sandbox pods. | Shared clusters, cloud environments, and team infrastructure. |
+| MicroVM | Gateway uses the VM compute driver. | Experimental stronger-isolation workflows. |
+
+All platforms expose the same CLI workflow after the gateway endpoint is registered.
+
+## Run a Local Docker-Backed Gateway
+
+From a source checkout, use the Docker-backed development gateway for local evaluation:
```shell
-openshell gateway start
+git clone https://github.com/NVIDIA/OpenShell.git
+cd OpenShell
+mise trust
+mise run gateway:docker
```
-The gateway becomes reachable at `https://127.0.0.1:8080`. Verify it is healthy:
+In another terminal, register the gateway:
```shell
+openshell gateway add http://127.0.0.1:18080 --local --name local
openshell status
```
-
-You do not need to deploy a gateway manually. If you run `openshell sandbox create` without a gateway, the CLI auto-bootstraps a local gateway for you.
+Use this path when you are testing the CLI, policy behavior, sandbox images, or Docker driver changes on one machine.
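+
+Once registered, sandbox commands target this gateway. For example, a quick smoke test that mirrors the sandbox quick start:
+
+```shell
+# Create a sandbox against the newly registered local gateway.
+openshell sandbox create -- claude
+```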
+
+## Deploy with Helm
-
+Use Helm when the gateway should run on a Kubernetes cluster you manage. Run these commands from a checkout of the OpenShell repository so Helm can read `deploy/helm/openshell`.
-To use a different port or name:
+Create the namespace and the SSH handshake secret required by the sandbox SSH relay:
```shell
-openshell gateway start --port 9090
-openshell gateway start --name dev-local
+kubectl create namespace openshell --dry-run=client -o yaml | kubectl apply -f -
+kubectl -n openshell create secret generic openshell-ssh-handshake \
+ --from-literal=secret="$(openssl rand -hex 32)" \
+ --dry-run=client -o yaml | kubectl apply -f -
```
-## Deploy a Remote Gateway
-
-Deploy a gateway on a remote machine accessible via SSH. The only dependency on the remote host is Docker.
+For TLS-enabled deployments, create the mTLS secrets described in [Gateway Authentication](/reference/gateway-auth) before installing. Then install the chart:
```shell
-openshell gateway start --remote user@hostname
+helm upgrade --install openshell ./deploy/helm/openshell \
+ --namespace openshell
```
-The gateway is reachable at `https://:8080`.
-
-To specify an SSH key:
+Wait for the gateway:
```shell
-openshell gateway start --remote user@hostname --ssh-key ~/.ssh/my_key
+kubectl -n openshell rollout status statefulset/openshell
```
-
-For DGX Spark, use your Spark's mDNS hostname:
+For local evaluation on a trusted workstation, you can disable TLS and use `kubectl port-forward`:
```shell
-openshell gateway start --remote @.local
+helm upgrade --install openshell ./deploy/helm/openshell \
+ --namespace openshell \
+ --set service.type=ClusterIP \
+ --set server.disableTls=true \
+ --set server.grpcEndpoint=http://openshell.openshell.svc.cluster.local:8080
```
-
-
-## Register an Existing Gateway
-
-Use `openshell gateway add` to register a gateway that is already running.
-
-### Cloud Gateway
-
-Register a gateway behind a reverse proxy such as Cloudflare Access:
+Start a local port-forward:
```shell
-openshell gateway add https://gateway.example.com
+kubectl -n openshell port-forward svc/openshell 8080:8080
```
-This opens your browser for the proxy's login flow. After authentication, the CLI stores a bearer token and sets the gateway as active.
-
-To give the gateway a specific name instead of deriving it from the hostname, use `--name`:
+In another terminal, register the gateway:
```shell
-openshell gateway add https://gateway.example.com --name production
+openshell gateway add http://127.0.0.1:8080 --local --name local
+openshell status
```
-If the token expires later, re-authenticate with:
+
+Plaintext mode should be limited to local port-forwarding or a trusted private path. For shared environments, keep TLS enabled and terminate public traffic at your ingress, load balancer, or access proxy.
+
-```shell
-openshell gateway login
-```
+## Configure Chart Values
+
+The most commonly changed values are:
-### Remote Gateway
+| Value | Purpose |
+|---|---|
+| `image.repository` / `image.tag` | Gateway container image. Defaults to `ghcr.io/nvidia/openshell/gateway:latest`. |
+| `service.type` | Kubernetes service type for the gateway. Use `ClusterIP`, `NodePort`, or your platform default. |
+| `server.dbUrl` | Gateway database URL. Defaults to SQLite on the chart-managed persistent volume. |
+| `server.sandboxNamespace` | Namespace where sandbox resources are created. |
+| `server.sandboxImage` | Default sandbox image used when a sandbox does not specify one. |
+| `server.grpcEndpoint` | Endpoint that sandbox supervisors use to call back to the gateway. |
+| `server.sshGatewayHost` / `server.sshGatewayPort` | Public host and port returned to CLI clients for SSH proxy connections. |
+| `server.disableTls` | Run the gateway over plaintext HTTP. Use only behind a trusted transport. |
+| `server.tls.*` | Secret names for server and client mTLS materials. |
-Register a gateway on a remote host you have SSH access to:
+Use a values file for repeatable deployments:
```shell
-openshell gateway add https://remote-host:8080 --remote user@remote-host
+helm upgrade --install openshell ./deploy/helm/openshell \
+ --namespace openshell \
+ --values values.yaml
```
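+
+A minimal sketch of such a values file, using keys from the table above; the values shown are placeholders to adjust for your environment:
+
+```shell
+# Hypothetical values.yaml for a small deployment.
+cat > values.yaml <<'EOF'
+image:
+  repository: ghcr.io/nvidia/openshell/gateway
+  tag: latest
+service:
+  type: ClusterIP
+server:
+  sandboxNamespace: openshell
+EOF
+```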
-Or use the `ssh://` scheme to combine the SSH destination and gateway port:
+## Register an Existing Gateway
+
+Use `openshell gateway add` to register any reachable gateway endpoint so the CLI can target it.
+
+Register a plaintext local endpoint, such as a trusted port-forward:
```shell
-openshell gateway add ssh://user@remote-host:8080
+openshell gateway add http://127.0.0.1:8080 --local --name local
```
-### Local Gateway
-
-Register a local mTLS gateway:
+Register a gateway behind an authenticated reverse proxy:
```shell
-openshell gateway add https://127.0.0.1:8080 --local
+openshell gateway add https://gateway.example.com --name production
```
-Register a local plaintext gateway with no auth:
+This opens your browser for the proxy's login flow when the gateway uses edge authentication. If the token expires later, re-authenticate with:
```shell
-openshell gateway add http://127.0.0.1:8080 --local
+openshell gateway login production
```
-When the endpoint uses `http://`, the CLI stores a direct plaintext registration. It does not extract mTLS certificates and it does not open the browser auth flow.
+For direct mTLS endpoints, place the CLI client certificate bundle in the gateway credential directory described in [Gateway Authentication](/reference/gateway-auth), then register or select that gateway name.
## Manage Multiple Gateways
-One gateway is always the active gateway. All CLI commands target it by default. Both `gateway start` and `gateway add` automatically set the new gateway as active.
+One gateway is always the active gateway. All CLI commands target it by default. `gateway add` sets the new gateway as active.
List all registered gateways:
@@ -140,53 +167,20 @@ openshell gateway select
Switch the active gateway:
```shell
-openshell gateway select my-remote-cluster
+openshell gateway select production
```
Override the active gateway for a single command with `-g`:
```shell
-openshell status -g my-other-cluster
+openshell status -g staging
```
-Show deployment details for a gateway, including endpoint, auth mode, and port:
+Show gateway details:
```shell
openshell gateway info
-openshell gateway info --name my-remote-cluster
-```
-
-## Advanced Start Options
-
-| Flag | Purpose |
-|---|---|
-| `--gpu` | Enable NVIDIA GPU passthrough. Requires NVIDIA drivers and the Container Toolkit on the host. OpenShell auto-selects CDI when enabled on the daemon and falls back to Docker's NVIDIA GPU request path (`--gpus all`) otherwise. |
-| `--plaintext` | Listen on HTTP instead of mTLS. Use behind a TLS-terminating reverse proxy. |
-| `--disable-gateway-auth` | Skip mTLS client certificate checks. Use when a reverse proxy cannot forward client certs. |
-| `--registry-username` | Username for registry authentication. Defaults to `__token__` when `--registry-token` is set. Only needed for private registries. Also configurable with `OPENSHELL_REGISTRY_USERNAME`. |
-| `--registry-token` | Authentication token for pulling container images. For GHCR, a GitHub PAT with `read:packages` scope. Only needed for private registries. Also configurable with `OPENSHELL_REGISTRY_TOKEN`. |
-
-## Stop and Destroy
-
-Stop a gateway while preserving its state for later restart:
-
-```shell
-openshell gateway stop
-```
-
-Permanently destroy a gateway and all its state:
-
-```shell
-openshell gateway destroy
-```
-
-For cloud gateways, `gateway destroy` removes only the local registration. It does not affect the remote deployment.
-
-Target a specific gateway with `--name`:
-
-```shell
-openshell gateway stop --name my-gateway
-openshell gateway destroy --name my-gateway
+openshell gateway info --name production
```
## Troubleshoot
@@ -195,28 +189,35 @@ Check gateway health:
```shell
openshell status
+openshell gateway info
```
-View gateway logs:
+For Docker-backed local gateways, re-run the local workflow to make sure the gateway is up, then confirm it reports healthy:
```shell
-openshell doctor logs
-openshell doctor logs --tail # stream live
-openshell doctor logs --lines 50 # last 50 lines
+mise run gateway:docker
+openshell status
```
-Run a command inside the gateway container for deeper inspection:
+For Kubernetes gateways, inspect the Helm release and gateway workload:
```shell
-openshell doctor exec -- kubectl get pods -A
-openshell doctor exec -- sh
+kubectl -n openshell get pods
+kubectl -n openshell logs statefulset/openshell
+helm -n openshell get values openshell
+helm -n openshell status openshell
```
-If the gateway is in a bad state, recreate it:
+For Podman or MicroVM gateways, inspect the driver-specific gateway logs and confirm the gateway endpoint is reachable with `openshell status`.
-```shell
-openshell gateway start --recreate
-```
+For sandbox startup failures, inspect the selected compute platform:
+
+| Platform | What to check |
+|---|---|
+| Docker | Docker daemon health, image availability, gateway logs, and sandbox container state. |
+| Podman | Podman socket availability, rootless networking, image availability, and sandbox container state. |
+| Kubernetes | Events and sandbox pods in the namespace configured by `server.sandboxNamespace`. |
+| MicroVM | VM driver logs, rootfs availability, and gateway logs. |
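+
+For the Kubernetes row, a typical inspection sequence might look like the following, assuming sandboxes land in the `openshell` namespace; substitute the namespace set by `server.sandboxNamespace` and the pod name reported by `get pods`:
+
+```shell
+# Look at sandbox pods and recent events in the sandbox namespace.
+kubectl -n openshell get pods
+kubectl -n openshell get events --sort-by=.lastTimestamp | tail -n 20
+kubectl -n openshell describe pod <sandbox-pod-name>
+```
+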
## Next Steps
diff --git a/docs/sandboxes/manage-providers.mdx b/docs/sandboxes/manage-providers.mdx
index fbfc4d380..b39f86f59 100644
--- a/docs/sandboxes/manage-providers.mdx
+++ b/docs/sandboxes/manage-providers.mdx
@@ -49,7 +49,7 @@ This looks up the current value of `$API_KEY` in your shell and stores it.
## Manage Providers
-List, inspect, update, and delete providers from the active cluster.
+List, inspect, update, and delete providers from the active gateway.
List all providers:
diff --git a/docs/sandboxes/manage-sandboxes.mdx b/docs/sandboxes/manage-sandboxes.mdx
index fb24bae9b..761c06cc7 100644
--- a/docs/sandboxes/manage-sandboxes.mdx
+++ b/docs/sandboxes/manage-sandboxes.mdx
@@ -11,9 +11,7 @@ position: 2
This page covers creating sandboxes and managing them. For background on what sandboxes are and how isolation works, refer to [About Sandboxes](/sandboxes/about).
-Docker must be running before you create a gateway or sandbox. If it is not, the CLI
-returns a connection-refused error (`os error 61`) without explaining
-the cause. Start Docker and try again.
+You need an active gateway before creating a sandbox. Docker-backed gateways require Docker to be running. Podman, Kubernetes, and MicroVM-backed gateways require their respective runtime or cluster to be reachable from the gateway.
@@ -25,11 +23,16 @@ Create a sandbox with a single command. For example, to create a sandbox with Cl
openshell sandbox create -- claude
```
-Every sandbox requires a gateway. If you run `openshell sandbox create` without a gateway, the CLI auto-bootstraps a local gateway.
+Every sandbox requires a gateway. Register or select one before running sandbox commands:
+
+```shell
+openshell gateway add http://127.0.0.1:18080 --local --name local
+openshell gateway select local
+```
### Remote Gateways
-If you plan to run sandboxes on a remote host or a cloud-hosted gateway, set up the gateway first. Refer to [Manage Gateways](/sandboxes/manage-gateways) for deployment options and multi-gateway management.
+If you plan to run sandboxes on a remote host, Kubernetes cluster, or cloud-hosted gateway, set up the gateway first. Refer to [Manage Gateways](/sandboxes/manage-gateways) for deployment options and multi-gateway management.
### GPU Resources
diff --git a/docs/security/best-practices.mdx b/docs/security/best-practices.mdx
index cd7a627bc..4140ca0b9 100644
--- a/docs/security/best-practices.mdx
+++ b/docs/security/best-practices.mdx
@@ -238,16 +238,16 @@ The agent never receives the provider API key.
## Gateway Security
-The gateway secures communication between the CLI, sandbox pods, and external clients with mutual TLS and token-based authentication.
+The gateway secures communication between the CLI, sandbox workloads, and external clients with mutual TLS and token-based authentication.
### mTLS
-Communication between the CLI, sandbox pods, and the gateway is secured by mutual TLS.
-OpenShell generates a cluster CA at bootstrap and distributes it through Kubernetes secrets.
+Communication between the CLI, sandbox workloads, and the gateway is secured by mutual TLS.
+For Helm deployments, provide a CA and certificate bundle through the Kubernetes secrets consumed by the chart.
| Aspect | Detail |
|---|---|
-| Default | mTLS required. Both client and server present certificates that the cluster CA signed. |
+| Default | mTLS required. Both client and server present certificates that the deployment CA signed. |
| What you can change | Enable dual-auth mode (`allow_unauthenticated=true`) for Cloudflare Tunnel deployments, or disable TLS entirely for trusted reverse-proxy setups. |
| Risk if relaxed | Dual-auth mode accepts clients without certificates and defers authentication to the HTTP layer (Cloudflare JWT). Disabling TLS removes transport-level authentication entirely. |
| Recommendation | Use mTLS (the default) unless deploying behind Cloudflare or a trusted reverse proxy. |