28 changes: 12 additions & 16 deletions .agents/skills/build-from-issue/SKILL.md
@@ -402,29 +402,24 @@ git diff --name-only main -- e2e/

If there are no changes under `e2e/`, skip this phase entirely.

-If E2E files were modified, deploy to the local cluster and run the E2E test suite:
+If E2E files were modified, run the relevant E2E lane for the driver touched by the change:

```bash
-# Deploy all changes to the local k3s cluster
-mise run cluster:deploy
-
-# Run the E2E sandbox tests
-mise run test:e2e:sandbox
+# Docker-backed gateway smoke E2E
+mise run e2e:docker
```

-`mise run test:e2e:sandbox` depends on `cluster:deploy` and `python:proto`, then runs `uv run pytest -o python_files='test_*.py' e2e/python`. However, since the cluster may need explicit deploy for code changes beyond just E2E test files, always run `mise run cluster:deploy` first as a separate step to ensure all sandbox/proxy/policy changes are live on the cluster before running E2E tests.
+Use `mise run e2e:podman`, `mise run e2e:vm`, or a Helm-backed Kubernetes E2E lane when the change targets those drivers.
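The lane selection described above can be sketched as a small helper. The path-to-driver mapping assumed here is illustrative only, not a documented project convention:

```bash
# Hypothetical helper: map changed file paths to an E2E lane.
# The e2e/podman and e2e/vm path layout assumed below is an illustration.
select_e2e_lane() {
  changed="$1"   # newline-separated output of `git diff --name-only main -- e2e/`
  case "$changed" in
    *e2e/podman*) echo "mise run e2e:podman" ;;
    *e2e/vm*)     echo "mise run e2e:vm" ;;
    *)            echo "mise run e2e:docker" ;;   # default: Docker smoke lane
  esac
}

# Usage (inside a checkout):
#   select_e2e_lane "$(git diff --name-only main -- e2e/)"
```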

**E2E retry loop** (up to 3 attempts):

-1. Run `mise run cluster:deploy` (only on the first attempt, or if code was changed between attempts).
-2. Run `mise run test:e2e:sandbox`.
-3. If tests fail:
+1. Run the selected E2E lane.
+2. If tests fail:
- Read the pytest output carefully — identify which tests failed and why.
- Distinguish between **test bugs** (the test itself is wrong) and **implementation bugs** (the code under test is wrong).
- Fix the failing code or tests.
-   - If code changes were made (not just test fixes), re-run `mise run cluster:deploy` before retrying.
   - Decrement the retry counter and try again.
-4. If tests pass, Phase 2 is green.
+3. If tests pass, Phase 2 is green.
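The retry loop can be sketched in shell; `run_e2e_lane` is a stand-in for whichever `mise run e2e:*` command was selected, not a real task name:

```bash
# Sketch of the 3-attempt E2E retry loop described above.
run_with_retries() {
  attempts=3
  while [ "$attempts" -gt 0 ]; do
    if run_e2e_lane; then
      echo "Phase 2 is green"
      return 0
    fi
    # On failure: read the pytest output, fix the code or the tests, retry.
    attempts=$((attempts - 1))
  done
  echo "all E2E attempts failed; stop and report to the user" >&2
  return 1
}
```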

**If all 3 E2E attempts fail**, stop and report to the user:
- Which E2E tests are failing
@@ -576,8 +571,8 @@ Local E2E tests passed. CI does not currently run E2E tests, so this comment ser
| Field | Value |
|-------|-------|
| **Commit** | `<commit-sha>` |
-| **Command** | `mise run test:e2e:sandbox` |
-| **Cluster deploy** | `mise run cluster:deploy` (completed before test run) |
+| **Command** | `<selected e2e command>` |
+| **Gateway mode** | `<docker / podman / vm / helm>` |
| **Result** | ✅ All passed |

### Test Summary
Expand Down Expand Up @@ -645,8 +640,9 @@ If the `state:in-progress` label is present, the skill was previously started bu
| `gh pr create --title "..." --body "..."` | Create a pull request |
| `gh api user --jq '.login'` | Get current GitHub username |
| `mise run pre-commit` | Run pre-commit checks (includes unit tests, lint, format) |
-| `mise run cluster:deploy` | Deploy all changes to local k3s cluster |
-| `mise run test:e2e:sandbox` | Run E2E sandbox tests (depends on cluster:deploy) |
+| `mise run e2e:docker` | Run smoke E2E against a standalone Docker-backed gateway |
+| `mise run e2e:podman` | Run smoke E2E against a Podman-backed gateway |
+| `mise run e2e:vm` | Run smoke E2E against the VM compute driver |

## Example Usage

23 changes: 12 additions & 11 deletions .agents/skills/debug-inference/SKILL.md
@@ -174,7 +174,7 @@ openshell sandbox create -- curl https://inference.local/v1/chat/completions --j

Interpretation:

-- **`cluster inference is not configured`**: set the managed route with `openshell inference set`
+- **`cluster inference is not configured`**: set the managed gateway route with `openshell inference set`
- **`connection not allowed by policy`** on `inference.local`: unsupported method or path
- **`no compatible route`**: provider type and client API shape do not match
- **Connection refused / upstream unavailable / verification failures**: base URL, bind address, topology, or credentials are wrong
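As a rough illustration, the interpretation table above could be expressed as a small, hypothetical triage helper that maps error strings to likely causes:

```bash
# Hypothetical triage helper for the gateway error messages listed above.
triage() {
  case "$1" in
    *"cluster inference is not configured"*) echo "set the managed route: openshell inference set" ;;
    *"connection not allowed by policy"*)    echo "unsupported method or path for inference.local" ;;
    *"no compatible route"*)                 echo "provider type does not match client API shape" ;;
    *)                                       echo "check base URL, bind address, topology, credentials" ;;
  esac
}
```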
@@ -232,7 +232,7 @@ In this case, OpenShell routing is usually working correctly. The failing hop is

This is not the same issue as the Colima CoreDNS fix.

-OpenShell injects `host.docker.internal` and `host.openshell.internal` into sandbox pods with `hostAliases`. That path bypasses cluster DNS lookup. If the request still times out, the usual cause is host firewall or network policy, not CoreDNS.
+OpenShell injects `host.docker.internal` and `host.openshell.internal` into sandbox workloads when the selected compute platform supports it. That path bypasses runtime DNS lookup. If the request still times out, the usual cause is host firewall or network policy, not DNS.

### Verify the Problem

@@ -248,40 +248,41 @@
curl -sS http://172.17.0.1:11434/v1/models
```

-3. Test the same endpoint from the OpenShell cluster container:
+3. Test the same endpoint from a gateway or sandbox container on the Docker network:

```bash
-docker exec openshell-cluster-<gateway> wget -qO- -T 5 http://host.docker.internal:11434/v1/models
+docker ps --filter name=openshell --format '{{.Names}}'
+docker exec <container-name> wget -qO- -T 5 http://host.docker.internal:11434/v1/models
```

If steps 1 and 2 succeed but step 3 times out, the host firewall or network configuration is blocking the container-to-host path.

### Fix

-Allow the Docker bridge network used by the OpenShell cluster to reach the host-local inference port. The exact command depends on your firewall tooling (iptables, nftables, firewalld, UFW, etc.), but the rule should allow:
+Allow the Docker bridge network used by the OpenShell gateway and sandbox containers to reach the host-local inference port. The exact command depends on your firewall tooling (iptables, nftables, firewalld, UFW, etc.), but the rule should allow:

-- **Source**: the Docker bridge subnet used by the OpenShell cluster container (commonly `172.18.0.0/16`)
-- **Destination**: the host gateway IP injected into sandbox pods for `host.docker.internal` (commonly `172.17.0.1`)
+- **Source**: the Docker bridge subnet used by OpenShell containers (commonly `172.18.0.0/16`)
+- **Destination**: the host gateway IP injected into sandbox workloads for `host.docker.internal` (commonly `172.17.0.1`)
- **Port**: the inference server port (e.g. `11434/tcp` for Ollama)

To find the actual values on your system:

```bash
-# Docker bridge subnet for the OpenShell cluster network
+# Docker bridge subnet for the OpenShell network
docker network inspect $(docker network ls --filter name=openshell -q) --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'

# Host gateway IP visible from inside the container
-docker exec openshell-cluster-<gateway> cat /etc/hosts | grep host.docker.internal
+docker exec <container-name> cat /etc/hosts | grep host.docker.internal
```

Adjust the source subnet, destination IP, or port to match your local Docker network layout.
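As one concrete, purely illustrative shape of such a rule, assuming iptables and the "commonly" cited defaults from this section (verify all three values with the commands above before applying anything):

```bash
# Illustrative only: 172.18.0.0/16 (source, Docker bridge subnet),
# 172.17.0.1 (destination, host gateway behind host.docker.internal),
# and 11434/tcp (Ollama's default port) are assumptions to replace
# with the values discovered on your own system.
sudo iptables -I INPUT -s 172.18.0.0/16 -d 172.17.0.1 -p tcp --dport 11434 -j ACCEPT
```

With nftables, firewalld, or UFW, express the same source/destination/port triple in that tool's syntax instead.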

### Verify the Fix

-1. Re-run the cluster container check:
+1. Re-run the container network check:

```bash
-docker exec openshell-cluster-<gateway> wget -qO- -T 5 http://host.docker.internal:11434/v1/models
+docker exec <container-name> wget -qO- -T 5 http://host.docker.internal:11434/v1/models
```

2. Re-test from a sandbox: