feat: Goe rewrite #5
Open
jacobseunglee wants to merge 9 commits into
Implements the GoE v2 foundation: Pydantic v2 models for the full entity graph + procedure DSL, a step-by-step procedure executor with interpolation/assertions/output capture, and a thin TestEnvironment adapter over v1 TestEnvironmentTool. 74 tests passing (3 browser/listen xfailed with documented root causes).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g, and SUID privesc Signed-off-by: Akshay Rohatgi <52616034+Akshay-Rohatgi@users.noreply.github.com>
…ations implemented and tests for attack procedures added Signed-off-by: Akshay Rohatgi <52616034+Akshay-Rohatgi@users.noreply.github.com>
…ng working: detached background processes, attacker container reset on retry, chromium browser installed via PPA Signed-off-by: Akshay Rohatgi <52616034+Akshay-Rohatgi@users.noreply.github.com>
…self-review to address commonly seen custom app development pitfalls Signed-off-by: Akshay Rohatgi <52616034+Akshay-Rohatgi@users.noreply.github.com>
…g tests and visualizers Signed-off-by: Akshay Rohatgi <52616034+Akshay-Rohatgi@users.noreply.github.com>
* feat: model change
* delete game_of_everything/goe/jacobtest.yaml
* fix: workflow
* fix: switch off plan determined runtime
Adds the v2 evaluation suite (goe/eval), metrics instrumentation (goe/metrics), single-system orchestration/packaging (goe/flow, goe/packaging), workflow artifacts (goe/artifacts), and their tests and fixtures.
Correctness fixes from code review:
- build.py: a non-zero deploy exit no longer falls through to a possible PASS; it routes into the retry loop as a design_flaw.
- runtimes/registry.py: create parent dirs for nested source-file paths (set -e no longer aborts); raise on unknown db_type.
- eval/llm_judge.py: print_judge_result tolerates missing keys.
- eval/golden.py: edge coverage requires a real connecting edge, not independent provides/requires matches.
- eval/runner.py: capture real run start time (durations were ~0).
- flow/orchestrator.py + checkpoint.py: persist and restore failure_category on the resume path.
Cleanups:
- metrics/collector.py: drop dead capture_artifacts ternary.
- bedrock.py: cache the bedrock-runtime client per region/creds instead of rebuilding it on every call.
- runtimes: consolidate per-runtime knowledge into the template YAMLs (target_image, deps_install_template, pre_start); deploy() is now table-driven and _RUNTIME_IMAGES is removed.
Co-authored-by: Jacob Lee <66867022+jacobseunglee@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
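The eval/golden.py fix can be illustrated with a minimal sketch. The `Entity` and `Edge` shapes and the `edge_covered` helper are assumptions made for illustration and are not taken from the actual goe/eval code. The point is only that an expected edge counts as covered when a real edge connects the two entities, not when a provider and a requirer merely exist somewhere in the plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    # Hypothetical shape: capability names an entity provides / requires.
    name: str
    provides: frozenset
    requires: frozenset


@dataclass(frozen=True)
class Edge:
    # Hypothetical shape: a directed link carrying one capability.
    source: str
    target: str
    capability: str


def edge_covered(expected: Edge, entities: dict, edges: set) -> bool:
    """Return True only if a real edge connects source and target.

    Independent provides/requires matches are not sufficient on their own.
    """
    src = entities.get(expected.source)
    dst = entities.get(expected.target)
    if src is None or dst is None:
        return False
    return (
        expected in edges
        and expected.capability in src.provides
        and expected.capability in dst.requires
    )
```

With this check, a plan where one entity provides `shell` and another requires it still scores zero for that edge until an `Edge` actually links the two.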
Combines planner improvements (atom catalog + grading step) with Phase 4
increment 1 (multi-entity chain test + multi-system packaging).
## Planner Improvements
Fixes the planner choosing wrong runtimes (apache_php for SSH) and inventing
atoms by grounding all prompts in the actual atom inventory with few-shot examples.
**New pipeline structure:**
- Step 1: plan_entities → includes runtime + atoms in stubs
- Step 1.5: grade_stubs → LLM validates/corrects stubs (5-check rubric)
- Step 2: specify_entities → adds edges with validated runtime/atoms
**Atom catalog** (goe/planner/_atom_catalog.py):
- Parses 13 web vuln atoms from atoms/web_vulnerabilities/*.md
- Extracts descriptions, compatible runtimes (from code examples), capabilities
- Provides rich markdown table for prompt injection
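As a rough sketch of the table-rendering step of the atom catalog: the `Atom` fields, the `render_catalog` name, and the column layout below are assumptions for illustration, and the parsing of atoms/web_vulnerabilities/*.md is left out. The example atom values are hypothetical as well.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    # Hypothetical record for one parsed atom file.
    name: str
    description: str
    runtimes: list = field(default_factory=list)
    capabilities: list = field(default_factory=list)


def render_catalog(atoms: list) -> str:
    """Render atoms as a markdown table suitable for pasting into a prompt."""
    header = "| Atom | Description | Runtimes | Capabilities |"
    divider = "| --- | --- | --- | --- |"
    rows = [
        f"| {a.name} | {a.description} | {', '.join(a.runtimes)} | {', '.join(a.capabilities)} |"
        for a in sorted(atoms, key=lambda a: a.name)  # stable ordering across runs
    ]
    return "\n".join([header, divider, *rows])
```

Sorting by name keeps the injected table identical between runs, which makes prompt diffs easier to read.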
**Prompts rewritten** (all 4 steps + grading):
- design_systems: port-to-runtime mapping + 2 examples
- plan_entities: atom catalog injection, runtime selection rules, 2 examples
- grade_stubs (new): 5-check rubric (atom exists, runtime matches, web vs
system, single responsibility, chain logic)
- specify_entities: rich atom table, runtime affinity rules, edge consistency
- connect_edges: edge type selection guide, 2 examples
**Fixes:**
- resolve.py: match structural port to exposed ports (not always first)
- topology_environment.py: create containers on the network directly (instead of starting them with network none and connecting afterwards)
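The resolve.py fix might look roughly like the following. The function name, its signature, and the fallback to the first exposed port are assumptions; the source only states that the structural port is matched against the exposed ports rather than always taking the first one.

```python
from typing import Optional, Sequence


def resolve_port(structural_port: Optional[int], exposed_ports: Sequence[int]) -> int:
    """Pick the exposed port that matches the structural port.

    Assumption: fall back to the first exposed port only when there is no
    structural port or it is not among the exposed ones.
    """
    if not exposed_ports:
        raise ValueError("entity exposes no ports")
    if structural_port is not None and structural_port in exposed_ports:
        return structural_port
    return exposed_ports[0]
```

For an entity exposing `[80, 22]` with structural port 22, this returns 22 instead of 80.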
## Phase 4: Multi-System Orchestration
**Chain test** (goe/flow/chain_test.py):
- L3 validation after all entities pass L2
- Gates overall run (FAILED → RunResult.success=False, CLI exits non-zero)
- TopologyEnvironment: one ubuntu:22.04 container per system + shared Kali attacker
- Chain attacker agent (Opus) synthesizes end-to-end procedure
- Retries up to 2× on failure
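The gating and retry behaviour described above can be sketched as follows. Both function names are hypothetical, and "retries up to 2×" is read here as one initial attempt plus at most two retries, which is an assumption about the intended count.

```python
from typing import Callable


def run_chain_test(attempt: Callable[[int], bool], max_retries: int = 2) -> bool:
    """Run the chain test, retrying on failure; return the final verdict."""
    for attempt_index in range(max_retries + 1):
        if attempt(attempt_index):
            return True
    return False


def exit_code(entities_passed: bool, chain_passed: bool) -> int:
    """Non-zero unless both the per-entity checks and the chain test pass."""
    return 0 if (entities_passed and chain_passed) else 1
```

A failed chain test therefore fails the whole run even when every entity passed L2 on its own.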
**Multi-system packaging** (goe/packaging/packager.py):
- Single-system: unchanged (deploy.sh + playbook.yaml)
- Multi-system: per-system deploy scripts + docker-compose.yml + chain_playbook.yaml
- Port collision detection scoped per system_id
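One way to scope port collision detection per system_id is to key on the (system_id, port) pair. The `find_port_collisions` helper and the tuple layout of its input are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Iterable


def find_port_collisions(bindings: Iterable) -> dict:
    """Group entity ids by (system_id, port) and keep groups with duplicates.

    `bindings` is assumed to yield (system_id, entity_id, port) tuples.
    """
    owners = defaultdict(list)
    for system_id, entity_id, port in bindings:
        owners[(system_id, port)].append(entity_id)
    return {key: ids for key, ids in owners.items() if len(ids) > 1}
```

Two entities on different systems may both bind port 80 without being flagged, while two entities on the same system may not.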
**Cross-system addressing** (goe/executor/interpolation.py):
- ${system.<system_id>.host} / ${system.<system_id>.port}
- Existing ${target_host}, ${edge.*}, ${steps.*} unchanged
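A minimal sketch of how the new placeholders could be expanded. The regular expression, the `interpolate_systems` name, and the shape of the `systems` mapping are assumptions and do not reflect the actual goe/executor/interpolation.py code. Placeholders of other kinds are deliberately left untouched.

```python
import re

# Matches ${system.<system_id>.host} and ${system.<system_id>.port}.
# The allowed system_id character set is an assumption.
_SYSTEM_REF = re.compile(r"\$\{system\.([A-Za-z0-9_-]+)\.(host|port)\}")


def interpolate_systems(text: str, systems: dict) -> str:
    """Replace cross-system placeholders using a {system_id: {host, port}} map."""

    def replace(match):
        system_id, attr = match.group(1), match.group(2)
        try:
            return str(systems[system_id][attr])
        except KeyError:
            # Fail loudly rather than leave a dangling reference in a command.
            raise KeyError(f"unknown system reference: {match.group(0)}") from None

    return _SYSTEM_REF.sub(replace, text)
```

For example, with `{"db": {"host": "10.0.0.5", "port": 22}}`, the string `ssh ${system.db.host} -p ${system.db.port}` becomes `ssh 10.0.0.5 -p 22`, while `${target_host}` passes through unchanged for the existing interpolation to handle.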
**Orchestrator** (goe/flow/orchestrator.py):
- Runs chain test when len(built) > 1
- Chain test result gates success
- Persists chain_test in checkpoint
**CLI** (goe/flow/__main__.py):
- goe flow test <output_dir> — replays packaged runs
- Auto-detects chain_playbook.yaml for multi-system replay
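The auto-detection could be as simple as the sketch below. It assumes both playbook files sit at the top level of the output directory, which the source does not state, and `select_playbook` is a hypothetical name.

```python
from pathlib import Path


def select_playbook(output_dir: Path) -> Path:
    """Prefer the multi-system chain playbook, else the single-system one."""
    chain = output_dir / "chain_playbook.yaml"
    if chain.is_file():
        return chain
    single = output_dir / "playbook.yaml"
    if single.is_file():
        return single
    raise FileNotFoundError(f"no playbook found in {output_dir}")
```

Checking for chain_playbook.yaml first means a multi-system package is always replayed as a chain.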
## Verification
End-to-end test: "SQLi → SSH" scenario that was failing before:
- ✓ SSH entity now has runtime=ubuntu (was apache_php)
- ✓ Both entities build successfully
- ✓ L3 chain test completes (synthesizes SQLi→SSH attack chain)
- ✓ Output includes chain_playbook.yaml with cross-system addressing
Co-Authored-By: Jacob Lee <66867022+jacobseunglee@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Akshay Rohatgi <52616034+Akshay-Rohatgi@users.noreply.github.com>
No description provided.