Description
Summary
Implement comprehensive evaluation gates and a clear state management system to ensure agent reliability, enable recovery from failures, and maintain execution integrity across sessions.
Problem statement
The current agent implementation lacks sufficient safeguards and recovery mechanisms:
- Insufficient evaluation gates: Code changes are not adequately validated before execution, leading to potential runtime errors, security vulnerabilities, and broken builds
- No clear state tracking: The agent operates without explicit state management, making it impossible to recover from failures or resume interrupted tasks
- Missing sandboxing: Executed code runs without proper isolation, risking system integrity
- Limited recovery options: When failures occur, the agent cannot roll back or resume from a known good state
- Quality regression risk: Without automated quality gates (lint, test, static analysis), code quality degrades over time
This compromises the reliability and safety of the autonomous agent system, especially for production use cases.
Proposed solution
1. Multi-layered Evaluation Gates
Implement a mandatory pipeline before any code execution:
Pre-execution gates:
- Static Analysis: Run clippy, cargo-audit, cargo-deny for Rust code
- Linting: Enforce strict formatting with rustfmt, custom lint rules
- Type Checking: Ensure compilation passes before execution attempt
- Security Scan: Scan for secrets, unsafe blocks, and known vulnerabilities
Test gates:
- Unit Tests: Mandatory passing test suite for modified modules
- Integration Tests: Verify component interactions
- Sandbox Tests: Execute in isolated environment first
Post-execution validation:
- Smoke Tests: Quick verification that build produces working binary
- Contract Tests: Validate API contracts and interfaces
- Performance Regression: Check for significant performance degradation
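The ordering of the gates above can be sketched as a fail-fast pipeline. This is a minimal illustration, not the proposed implementation: `Gate`, `GateOutcome`, `PipelineReport`, and `run_pipeline` are hypothetical names, and each gate is modeled as a plain closure where a real gate would presumably invoke a tool such as clippy or the test runner.

```rust
/// Outcome of a single gate. `Fail` carries a message intended to be actionable.
#[derive(Debug, Clone, PartialEq)]
pub enum GateOutcome {
    Pass,
    Fail(String),
}

/// A named check. The closure stands in for an external tool invocation
/// (assumption: real gates would shell out to clippy, cargo-audit, etc.).
pub struct Gate {
    pub name: &'static str,
    pub check: Box<dyn Fn() -> GateOutcome>,
}

/// Summary of one pipeline run.
#[derive(Debug, PartialEq)]
pub struct PipelineReport {
    /// Gates that passed, in execution order.
    pub passed: Vec<&'static str>,
    /// The first gate that failed, with its message, if any.
    pub failed: Option<(&'static str, String)>,
}

impl PipelineReport {
    pub fn is_ok(&self) -> bool {
        self.failed.is_none()
    }
}

/// Runs gates in order and stops at the first failure, so cheaper gates
/// (lint, type check) can shield more expensive ones (tests, sandbox runs).
pub fn run_pipeline(gates: &[Gate]) -> PipelineReport {
    let mut passed = Vec::new();
    for gate in gates {
        match (gate.check)() {
            GateOutcome::Pass => passed.push(gate.name),
            GateOutcome::Fail(reason) => {
                return PipelineReport {
                    passed,
                    failed: Some((gate.name, reason)),
                };
            }
        }
    }
    PipelineReport { passed, failed: None }
}
```

Stopping at the first failure keeps latency low for obviously broken changes; a variant that collects every failure would give fuller reports at the cost of running all gates.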
2. State Management & Recovery System
Implement explicit state tracking with recovery capabilities:
State capture:
// Assumes `Uuid` from the `uuid` crate and `DateTime<Utc>` from `chrono`;
// `Checkpoint`, `Dependency`, and `ValidationResult` are types to be defined.
pub struct AgentState {
    pub task_id: Uuid,
    pub phase: ExecutionPhase,   // Planning, Executing, Validating, Completed
    pub checkpoint: Checkpoint,  // Filesystem snapshot + metadata
    pub dependencies: Vec<Dependency>,
    pub validation_results: Vec<ValidationResult>,
    pub created_at: DateTime<Utc>,
    pub updated_at: DateTime<Utc>,
}
Recovery mechanisms:
- Automatic snapshots: Create filesystem snapshots before major operations
- Checkpoint resume: Ability to resume from last successful checkpoint
- Rollback capability: Revert to previous state on validation failure
- Session persistence: Store state in SQLite for cross-session recovery
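To make the checkpoint and rollback behavior concrete, here is a small in-memory sketch of the phase state machine. It is an assumption-laden illustration: `PhaseTracker` is a hypothetical name, a checkpoint here records only the phase (the proposal calls for a filesystem snapshot plus metadata), and nothing is persisted to SQLite.

```rust
/// Execution phases named in the `AgentState` comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionPhase {
    Planning,
    Executing,
    Validating,
    Completed,
}

impl ExecutionPhase {
    /// The phase that follows this one, or `None` once the task is done.
    fn next(self) -> Option<ExecutionPhase> {
        match self {
            ExecutionPhase::Planning => Some(ExecutionPhase::Executing),
            ExecutionPhase::Executing => Some(ExecutionPhase::Validating),
            ExecutionPhase::Validating => Some(ExecutionPhase::Completed),
            ExecutionPhase::Completed => None,
        }
    }
}

/// Hypothetical tracker: keeps the current phase and a stack of checkpoints.
pub struct PhaseTracker {
    current: ExecutionPhase,
    checkpoints: Vec<ExecutionPhase>,
}

impl PhaseTracker {
    pub fn new() -> Self {
        Self { current: ExecutionPhase::Planning, checkpoints: Vec::new() }
    }

    pub fn current(&self) -> ExecutionPhase {
        self.current
    }

    /// Checkpoints the current phase, then moves to the next one.
    pub fn advance(&mut self) -> Result<ExecutionPhase, String> {
        let next = self
            .current
            .next()
            .ok_or_else(|| "task already completed".to_string())?;
        self.checkpoints.push(self.current);
        self.current = next;
        Ok(next)
    }

    /// Restores the most recent checkpoint, e.g. after a validation failure.
    /// Returns `None` if there is nothing to roll back to.
    pub fn rollback(&mut self) -> Option<ExecutionPhase> {
        let restored = self.checkpoints.pop()?;
        self.current = restored;
        Some(restored)
    }
}
```

Taking the checkpoint inside `advance` means every transition has a known good state behind it, which is one way to satisfy "captures checkpoints at key transitions".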
3. Sandboxing & Isolation
Enhanced sandbox controls:
- Filesystem isolation: Restrict file access to designated workspace
- Network sandboxing: Control network access per operation type
- Resource limits: CPU, memory, and time constraints
- Process isolation: Run builds/tests in separate processes with cleanup guarantees
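Filesystem isolation ultimately needs OS-level enforcement, but a first line of defense is rejecting paths that resolve outside the workspace. The sketch below is a purely lexical check under stated assumptions: `resolve_in_workspace` is a hypothetical helper, the workspace path is assumed to be absolute and already normalized, and symlinks are not followed, so it cannot replace a real sandbox.

```rust
use std::path::{Component, Path, PathBuf};

/// Resolves `requested` relative to `workspace` and returns the normalized
/// path only if it stays inside the workspace. Purely lexical: `.` and `..`
/// are collapsed without touching the filesystem, so symlinks are NOT checked.
pub fn resolve_in_workspace(workspace: &Path, requested: &Path) -> Option<PathBuf> {
    // `join` replaces the base entirely when `requested` is absolute,
    // so absolute paths are judged on their own.
    let joined = workspace.join(requested);

    let mut normalized = PathBuf::new();
    for component in joined.components() {
        match component {
            Component::CurDir => {}
            Component::ParentDir => {
                // `pop` returns false at the filesystem root: reject.
                if !normalized.pop() {
                    return None;
                }
            }
            other => normalized.push(other.as_os_str()),
        }
    }

    if normalized.starts_with(workspace) {
        Some(normalized)
    } else {
        None
    }
}
```

For example, with a workspace of `/work/agent`, the request `src/../Cargo.toml` resolves to `/work/agent/Cargo.toml`, while `../../etc/passwd` and `/etc/passwd` are both rejected.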
4. Implementation phases
Phase 1: Core gates (P0)
- Integrate cargo-clippy with custom configuration
- Add cargo-audit for security scanning
- Implement pre-execution validation pipeline
- Add basic smoke test gate
Phase 2: State management (P0)
- Design and implement AgentState data model
- Add SQLite persistence for state tracking
- Implement checkpoint creation mechanism
- Add rollback capability
Phase 3: Enhanced testing (P1)
- Integrate test runner with gate system
- Add integration test automation
- Implement test coverage thresholds
Phase 4: Advanced sandboxing (P1)
- Enhanced filesystem isolation
- Network access controls
- Resource monitoring and limits
Non-goals / out of scope
- Real-time collaboration features
- Multi-agent coordination state sharing
- Complex visual state debugging UI
- Integration with external CI/CD systems (initially)
- Automatic code quality scoring metrics
- Predictive failure analysis using ML
Alternatives considered
- Post-merge validation: Run gates after code is committed
  - Rejected: Too late to catch issues, pollutes git history
- Manual approval gates: Require human review for all changes
  - Rejected: Defeats purpose of autonomous agent, too slow
- External CI integration: Rely entirely on GitHub Actions
  - Rejected: Increases latency, requires network, not suitable for local dev
- Simple validation only: Run tests but no state management
  - Rejected: Does not address recovery and resume requirements
Acceptance criteria
- All code changes pass static analysis, linting, and type checking before execution
- Automated test suite runs with configurable coverage thresholds
- AgentState tracks execution phase and captures checkpoints at key transitions
- Failed executions can be rolled back to last known good state
- Session state persists to SQLite and survives process restarts
- Sandbox restricts filesystem and network access appropriately
- Recovery mechanism documented in AGENTS.md
- Benchmark shows <10% overhead from gate execution
- Failed gates produce clear, actionable error messages
Architecture impact
- src/agent/: Add state machine and gate orchestration
- src/eval/: New module for evaluation gates (lint, test, analysis)
- src/state/: New module for state management and persistence
- src/sandbox/: Enhance existing sandbox with stricter controls
- src/store/: Extend SQLite schema for state checkpoint storage
- AGENTS.md: Document new evaluation and recovery workflows
Risk and rollback
Risk: Increased latency due to gate execution.
Mitigation: Implement parallel gate execution, caching, and incremental validation.
Risk: Overly restrictive gates block legitimate changes.
Mitigation: Configurable gate severity, bypass with explicit user approval, escape hatches.
Risk: State storage growth (disk space).
Mitigation: Automatic cleanup of old checkpoints, compression, retention policies.
Risk: Sandbox escape vulnerabilities.
Mitigation: Defense in depth, regular security audits, use established sandbox technologies.
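As one possible shape for the checkpoint retention policy mentioned under state storage growth, the sketch below selects checkpoints for deletion. The names (`CheckpointMeta`, `RetentionPolicy`, `select_for_deletion`) and the rule itself (always keep the newest N, delete the rest once they exceed a maximum age) are assumptions for illustration; the proposal does not specify the policy.

```rust
use std::time::Duration;

/// Minimal checkpoint metadata; `age` is time elapsed since creation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CheckpointMeta {
    pub id: u64,
    pub age: Duration,
}

/// Hypothetical policy: the newest `keep_latest` checkpoints are always
/// retained; any other checkpoint older than `max_age` is eligible for cleanup.
pub struct RetentionPolicy {
    pub keep_latest: usize,
    pub max_age: Duration,
}

/// Returns the ids of checkpoints to delete, ordered from youngest to oldest.
pub fn select_for_deletion(
    checkpoints: &[CheckpointMeta],
    policy: &RetentionPolicy,
) -> Vec<u64> {
    let mut by_age: Vec<&CheckpointMeta> = checkpoints.iter().collect();
    by_age.sort_by_key(|c| c.age); // youngest first
    by_age
        .into_iter()
        .skip(policy.keep_latest)
        .filter(|c| c.age > policy.max_age)
        .map(|c| c.id)
        .collect()
}
```

Always keeping the newest few checkpoints regardless of age ensures a rollback target exists even for tasks that have been idle for a long time.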
Rollback:
- Disable gates via feature flag: eval_gates_enabled: false
- Revert to in-memory state only: state_persistence: memory
- Emergency bypass: User confirmation to skip gates for critical fixes
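The two rollback settings could be read from a simple `key: value` configuration. The following sketch is only a guess at how that might look: `RollbackConfig`, `StatePersistence`, and `parse_config` are hypothetical, the `sqlite` value and the defaults (gates enabled, SQLite persistence) are assumptions, and the real configuration format is not specified in this proposal.

```rust
/// Where agent state is stored. `Sqlite` as a config value is an assumption;
/// the proposal only shows `memory`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatePersistence {
    Sqlite,
    Memory,
}

/// Hypothetical container for the two rollback switches.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RollbackConfig {
    pub eval_gates_enabled: bool,
    pub state_persistence: StatePersistence,
}

impl Default for RollbackConfig {
    /// Assumed defaults: gates on, state persisted to SQLite.
    fn default() -> Self {
        Self {
            eval_gates_enabled: true,
            state_persistence: StatePersistence::Sqlite,
        }
    }
}

/// Parses `key: value` lines, starting from the defaults.
/// Unknown keys or values are rejected rather than silently ignored.
pub fn parse_config(text: &str) -> Result<RollbackConfig, String> {
    let mut config = RollbackConfig::default();
    for line in text.lines().map(str::trim).filter(|l| !l.is_empty()) {
        let (key, value) = line
            .split_once(':')
            .ok_or_else(|| format!("expected `key: value`, got `{}`", line))?;
        match (key.trim(), value.trim()) {
            ("eval_gates_enabled", "true") => config.eval_gates_enabled = true,
            ("eval_gates_enabled", "false") => config.eval_gates_enabled = false,
            ("state_persistence", "sqlite") => {
                config.state_persistence = StatePersistence::Sqlite
            }
            ("state_persistence", "memory") => {
                config.state_persistence = StatePersistence::Memory
            }
            (k, v) => return Err(format!("unsupported setting `{}: {}`", k, v)),
        }
    }
    Ok(config)
}
```

Rejecting unknown values matters for a safety switch: a typo such as `eval_gates_enabled: flase` should surface as an error instead of leaving the operator unsure whether gates are active.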
Breaking change?
Yes
This introduces mandatory evaluation gates that may reject code that was previously accepted. Migration path (rollout stages, distinct from the implementation phases above):
- Phase 1: Gates run but don't block (warning mode)
- Phase 2: Gates block by default with override option
- Phase 3: Gates mandatory (current proposal)
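The three migration stages differ only in how a failed gate is handled, which can be captured in a single decision function. This is a sketch with hypothetical names (`EnforcementMode`, `Decision`, `decide`); it assumes that in the mandatory stage a user override is not honored, and it does not model the separate emergency bypass described under rollback.

```rust
/// Rollout stage for gate enforcement, mirroring the migration path.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EnforcementMode {
    /// Stage 1: gates run but never block.
    Warn,
    /// Stage 2: gates block unless the user explicitly overrides.
    BlockWithOverride,
    /// Stage 3: gates always block on failure.
    Mandatory,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Proceed,
    ProceedWithWarning,
    Block,
}

/// Decides what to do with a change given the gate result and current stage.
pub fn decide(mode: EnforcementMode, gate_passed: bool, user_override: bool) -> Decision {
    if gate_passed {
        return Decision::Proceed;
    }
    match mode {
        EnforcementMode::Warn => Decision::ProceedWithWarning,
        EnforcementMode::BlockWithOverride if user_override => Decision::ProceedWithWarning,
        EnforcementMode::BlockWithOverride | EnforcementMode::Mandatory => Decision::Block,
    }
}
```

Keeping the stage as a single enum value means moving from one stage to the next is a configuration change rather than a code change.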
Data hygiene checks
- I removed personal/sensitive data from examples, payloads, and logs.
- I used neutral, project-focused wording and placeholders.