[Feature]: Implement robust evaluation gates and state recovery mechanism

## Summary

Implement comprehensive evaluation gates and a clear state management system to ensure agent reliability, enable recovery from failures, and maintain execution integrity across sessions.

## Problem statement

The current agent implementation lacks sufficient safeguards and recovery mechanisms:

1. **Insufficient evaluation gates**: Code changes are not adequately validated before execution, leading to potential runtime errors, security vulnerabilities, and broken builds
2. **No clear state tracking**: The agent operates without explicit state management, making it impossible to recover from failures or resume interrupted tasks
3. **Missing sandboxing**: Executed code runs without proper isolation, risking system integrity
4. **Limited recovery options**: When failures occur, the agent cannot roll back to or resume from a known good state
5. **Quality regression risk**: Without automated quality gates (lint, test, static analysis), code quality degrades over time

This compromises the reliability and safety of the autonomous agent system, especially for production use cases.

## Proposed solution

### 1. Multi-layered Evaluation Gates

Implement a mandatory pipeline that must pass before any code is executed:

**Pre-execution gates:**
- **Static Analysis**: Run clippy, cargo-audit, and cargo-deny on Rust code
- **Formatting and Linting**: Enforce strict formatting with rustfmt and apply custom lint rules
- **Type Checking**: Ensure the code compiles before any execution attempt
- **Security Scan**: Scan for secrets, `unsafe` blocks, and known vulnerabilities

**Test gates:**
- **Unit Tests**: Mandatory passing test suite for modified modules
- **Integration Tests**: Verify component interactions
- **Sandbox Tests**: Execute in isolated environment first

**Post-execution validation:**
- **Smoke Tests**: Quick verification that the build produces a working binary
- **Contract Tests**: Validate API contracts and interfaces
- **Performance Regression**: Check for significant performance degradation
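
As a rough illustration of the pre-execution stage, the sketch below runs gates in order and stops at the first failure. The `Gate`, `GateOutcome`, `pre_execution_gates`, `run_command`, and `run_pipeline` names are hypothetical and not part of the existing codebase. The cargo invocations assume that clippy, rustfmt, and cargo-audit are installed. The executor is passed in as a closure so the ordering logic can be tested without invoking cargo.

```rust
use std::process::Command;

/// Result of running the whole pipeline (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum GateOutcome {
    Passed,
    Failed { gate: String, detail: String },
}

/// One gate: a named external command (hypothetical type).
pub struct Gate {
    pub name: &'static str,
    pub program: &'static str,
    pub args: &'static [&'static str],
}

/// Assumed gate order: cheapest checks first.
pub fn pre_execution_gates() -> Vec<Gate> {
    vec![
        Gate { name: "format", program: "cargo", args: &["fmt", "--check"] },
        Gate { name: "clippy", program: "cargo", args: &["clippy", "--", "-D", "warnings"] },
        Gate { name: "type-check", program: "cargo", args: &["check"] },
        Gate { name: "audit", program: "cargo", args: &["audit"] },
    ]
}

/// Default executor: run the command and return stderr as the failure detail.
pub fn run_command(gate: &Gate) -> Result<(), String> {
    let output = Command::new(gate.program)
        .args(gate.args)
        .output()
        .map_err(|e| format!("could not start {}: {e}", gate.program))?;
    if output.status.success() {
        Ok(())
    } else {
        Err(String::from_utf8_lossy(&output.stderr).into_owned())
    }
}

/// Run gates in order; the first failure stops the pipeline.
pub fn run_pipeline<F>(gates: &[Gate], mut exec: F) -> GateOutcome
where
    F: FnMut(&Gate) -> Result<(), String>,
{
    for gate in gates {
        if let Err(detail) = exec(gate) {
            return GateOutcome::Failed { gate: gate.name.to_string(), detail };
        }
    }
    GateOutcome::Passed
}
```

In the agent this would be called as `run_pipeline(&pre_execution_gates(), run_command)`. Carrying the gate name and stderr in `GateOutcome::Failed` is one way to give the caller enough context for an actionable error message.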

### 2. State Management & Recovery System

Implement explicit state tracking with recovery capabilities:

**State capture:**
```rust
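// Assumes the `chrono` and `uuid` crates; the original snippet left the imports implicit.
use chrono::{DateTime, Utc};
use uuid::Uuid;

/// Progress of a single task, persisted so that a later session can resume it.
/// `ExecutionPhase`, `Checkpoint`, `Dependency`, and `ValidationResult` are
/// still to be defined as part of this proposal.
pub struct AgentState {
    pub task_id: Uuid,
    pub phase: ExecutionPhase, // Planning, Executing, Validating, Completed
    pub checkpoint: Checkpoint, // Filesystem snapshot + metadata
    pub dependencies: Vec<Dependency>,
    pub validation_results: Vec<ValidationResult>,
    pub created_at: DateTime<Utc>,
    pub updated_at: DateTime<Utc>,
}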
```
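
The four phases named in the comment above could be modelled as an explicit state machine. The sketch below is a minimal version using only the standard library. The transition rules, including the `Validating` to `Planning` edge for re-planning after a failed validation, are assumptions and not something this issue specifies.

```rust
/// Phases taken from the `AgentState` comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionPhase {
    Planning,
    Executing,
    Validating,
    Completed,
}

impl ExecutionPhase {
    /// Assumed transition rules: a linear flow, plus a path back to
    /// Planning when validation fails and the work has been rolled back.
    pub fn can_transition_to(self, next: ExecutionPhase) -> bool {
        use ExecutionPhase::*;
        matches!(
            (self, next),
            (Planning, Executing)
                | (Executing, Validating)
                | (Validating, Completed)
                | (Validating, Planning)
        )
    }

    /// Return the next phase, or an error describing the rejected transition.
    pub fn advance(self, next: ExecutionPhase) -> Result<ExecutionPhase, String> {
        if self.can_transition_to(next) {
            Ok(next)
        } else {
            Err(format!("invalid transition: {self:?} -> {next:?}"))
        }
    }
}
```

Rejecting invalid transitions at this level means a persisted `AgentState` can only hold a phase that was reached through a known path, which simplifies resume logic.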

**Recovery mechanisms:**
- **Automatic snapshots**: Create filesystem snapshots before major operations
- **Checkpoint resume**: Ability to resume from last successful checkpoint
- **Rollback capability**: Revert to previous state on validation failure
- **Session persistence**: Store state in SQLite for cross-session recovery
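
To make the snapshot and rollback bullets concrete, here is a minimal sketch that captures a list of files in memory and restores them if an operation fails. This `Checkpoint` is a simplified stand-in for the "filesystem snapshot + metadata" described above, and `with_rollback` is a hypothetical helper. A real implementation would more likely snapshot to disk or the SQLite store instead of holding file contents in memory.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// In-memory copy of a set of files. `None` marks a path that did not
/// exist when the checkpoint was taken.
pub struct Checkpoint {
    files: Vec<(PathBuf, Option<Vec<u8>>)>,
}

impl Checkpoint {
    /// Record the current contents of each path.
    pub fn capture(paths: &[PathBuf]) -> io::Result<Self> {
        let mut files = Vec::new();
        for path in paths {
            let contents = match fs::read(path) {
                Ok(bytes) => Some(bytes),
                Err(e) if e.kind() == io::ErrorKind::NotFound => None,
                Err(e) => return Err(e),
            };
            files.push((path.clone(), contents));
        }
        Ok(Self { files })
    }

    /// Restore captured contents and delete files created after the capture.
    pub fn rollback(&self) -> io::Result<()> {
        for (path, contents) in &self.files {
            match contents {
                Some(bytes) => fs::write(path, bytes)?,
                None => match fs::remove_file(path) {
                    Ok(()) => {}
                    Err(e) if e.kind() == io::ErrorKind::NotFound => {}
                    Err(e) => return Err(e),
                },
            }
        }
        Ok(())
    }
}

/// Checkpoint `paths`, run `op`, and roll back if `op` returns an error.
/// The outer `io::Result` reports failures of the checkpoint itself.
pub fn with_rollback<T, E>(
    paths: &[PathBuf],
    op: impl FnOnce() -> Result<T, E>,
) -> io::Result<Result<T, E>> {
    let checkpoint = Checkpoint::capture(paths)?;
    let result = op();
    if result.is_err() {
        checkpoint.rollback()?;
    }
    Ok(result)
}
```

The sketch only protects paths listed up front. Files the operation touches outside that list are not restored, which is one reason to pair checkpoints with the filesystem isolation proposed in the next section.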

### 3. Sandboxing & Isolation

Enhanced sandbox controls:

- **Filesystem isolation**: Restrict file access to designated workspace
- **Network sandboxing**: Control network access per operation type
- **Resource limits**: CPU, memory, and time constraints
- **Process isolation**: Run builds/tests in separate processes with cleanup guarantees
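
The time-constraint and process-isolation bullets can be approximated with the standard library alone. The sketch below runs a command in a separate process, rooted in the workspace directory, and kills it when a wall-clock limit expires. `run_isolated` and `RunResult` are hypothetical names. CPU and memory limits, and true filesystem or network isolation, need OS-level facilities that are not shown here.

```rust
use std::io;
use std::path::Path;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// How the child process ended (hypothetical type).
#[derive(Debug)]
pub enum RunResult {
    Exited(ExitStatus),
    TimedOut,
}

/// Run `program` with `workspace` as its working directory and enforce a
/// wall-clock limit. The child is killed and reaped on timeout so that no
/// stray process is left behind.
pub fn run_isolated(
    program: &str,
    args: &[&str],
    workspace: &Path,
    limit: Duration,
) -> io::Result<RunResult> {
    let mut child = Command::new(program)
        .args(args)
        .current_dir(workspace)
        .stdin(Stdio::null()) // the child cannot read from the agent's stdin
        .spawn()?;

    let deadline = Instant::now() + limit;
    loop {
        // try_wait returns immediately: Some(status) once the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(RunResult::Exited(status));
        }
        if Instant::now() >= deadline {
            child.kill()?;
            child.wait()?; // reap the killed process
            return Ok(RunResult::TimedOut);
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

For example, `run_isolated("cargo", &["test"], workspace, Duration::from_secs(300))` would abandon a hung test run after five minutes. Setting the working directory does not prevent the child from reading or writing elsewhere, so it complements filesystem isolation but does not replace it.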

### 4. Implementation phases

**Phase 1: Core gates (P0)**
- [ ] Integrate cargo-clippy with custom configuration
- [ ] Add cargo-audit for security scanning
- [ ] Implement pre-execution validation pipeline
- [ ] Add basic smoke test gate

**Phase 2: State management (P0)**
- [ ] Design and implement AgentState data model
- [ ] Add SQLite persistence for state tracking
- [ ] Implement checkpoint creation mechanism
- [ ] Add rollback capability

**Phase 3: Enhanced testing (P1)**
- [ ] Integrate test runner with gate system
- [ ] Add integration test automation
- [ ] Implement test coverage thresholds

**Phase 4: Advanced sandboxing (P1)**
- [ ] Enhanced filesystem isolation
- [ ] Network access controls
- [ ] Resource monitoring and limits

## Non-goals / out of scope

- Real-time collaboration features
- Multi-agent coordination state sharing
- Complex visual state debugging UI
- Integration with external CI/CD systems (initially)
- Automatic code quality scoring metrics
- Predictive failure analysis using ML

## Alternatives considered

1. **Post-merge validation**: Run gates after code is committed
   - Rejected: Too late to catch issues, pollutes git history

2. **Manual approval gates**: Require human review for all changes
   - Rejected: Defeats purpose of autonomous agent, too slow

3. **External CI integration**: Rely entirely on GitHub Actions
   - Rejected: Increases latency, requires network, not suitable for local dev

4. **Simple validation only**: Run tests but no state management
   - Rejected: Does not address recovery and resume requirements

## Acceptance criteria

- [ ] All code changes pass static analysis, linting, and type checking before execution
- [ ] Automated test suite runs with configurable coverage thresholds
- [ ] AgentState tracks execution phase and captures checkpoints at key transitions
- [ ] Failed executions can be rolled back to last known good state
- [ ] Session state persists to SQLite and survives process restarts
- [ ] Sandbox restricts filesystem and network access appropriately
- [ ] Recovery mechanism documented in AGENTS.md
- [ ] Benchmark shows that gate execution adds less than 10% overhead compared with running the same task with gates disabled
- [ ] Failed gates produce clear, actionable error messages

## Architecture impact

- **src/agent/**: Add state machine and gate orchestration
- **src/eval/**: New module for evaluation gates (lint, test, analysis)
- **src/state/**: New module for state management and persistence
- **src/sandbox/**: Enhance existing sandbox with stricter controls
- **src/store/**: Extend SQLite schema for state checkpoint storage
- **AGENTS.md**: Document new evaluation and recovery workflows

## Risk and rollback

**Risk**: Increased latency due to gate execution.
**Mitigation**: Implement parallel gate execution, caching, and incremental validation.

**Risk**: Overly restrictive gates block legitimate changes.
**Mitigation**: Configurable gate severity, bypass with explicit user approval, escape hatches.

**Risk**: State storage growth (disk space).
**Mitigation**: Automatic cleanup of old checkpoints, compression, retention policies.

**Risk**: Sandbox escape vulnerabilities.
**Mitigation**: Defense in depth, regular security audits, use established sandbox technologies.
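
The caching mitigation for gate latency could look roughly like the sketch below: a gate result is stored under the gate name and a fingerprint of the input, so unchanged content is not re-validated. `GateCache` and `fingerprint` are hypothetical names. `DefaultHasher` is used only to keep the example dependency-free. Its output is not guaranteed to be stable across Rust releases, so a cache persisted to disk would need a fixed content hash instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// In-process cache of gate results (hypothetical type).
#[derive(Default)]
pub struct GateCache {
    results: HashMap<(String, u64), bool>,
}

/// Fingerprint of the content a gate validates.
fn fingerprint(content: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

impl GateCache {
    /// Return the cached result for (gate, content), calling `run` only
    /// when this combination has not been seen before.
    pub fn check<F>(&mut self, gate: &str, content: &[u8], run: F) -> bool
    where
        F: FnOnce() -> bool,
    {
        let key = (gate.to_string(), fingerprint(content));
        *self.results.entry(key).or_insert_with(run)
    }
}
```

Failures are cached as well as passes in this sketch. Whether that is desirable depends on how deterministic a gate is, and a flaky network-dependent gate such as an advisory lookup probably should not be cached this way.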

**Rollback**:
- Disable gates via feature flag: `eval_gates_enabled: false`
- Revert to in-memory state only: `state_persistence: memory`
- Emergency bypass: User confirmation to skip gates for critical fixes

## Breaking change?

Yes

This introduces mandatory evaluation gates that may reject code that was previously accepted. Migration path (these rollout stages are separate from the implementation phases above):
- Stage 1: Gates run but don't block (warning mode)
- Stage 2: Gates block by default with override option
- Stage 3: Gates mandatory (current proposal)
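
The three rollout steps in the migration path could be expressed as an enforcement mode that decides what happens when a gate fails. The sketch below is illustrative only. `GateMode`, `Decision`, and `decide` are hypothetical names, and treating an approved override as "proceed with warning" is an assumption.

```rust
/// Enforcement level, one variant per rollout step (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateMode {
    Warn,              // gates run but don't block
    BlockWithOverride, // gates block unless the user explicitly overrides
    Mandatory,         // gates always block on failure
}

/// What the agent should do after a gate has run (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Proceed,
    ProceedWithWarning,
    Blocked,
}

/// Map a gate result to a decision under the given mode.
pub fn decide(mode: GateMode, gate_passed: bool, user_override: bool) -> Decision {
    if gate_passed {
        return Decision::Proceed;
    }
    match mode {
        GateMode::Warn => Decision::ProceedWithWarning,
        GateMode::BlockWithOverride if user_override => Decision::ProceedWithWarning,
        GateMode::BlockWithOverride | GateMode::Mandatory => Decision::Blocked,
    }
}
```

Keeping the mode as a single configuration value would let each step of the migration be rolled out, or reverted, without code changes.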

## Data hygiene checks

- [x] I removed personal/sensitive data from examples, payloads, and logs.
- [x] I used neutral, project-focused wording and placeholders.

