Most voice agent benchmarks evaluate either what the agent does or how it sounds. EVA evaluates both.
EVA is an open-source evaluation framework for conversational voice agents that scores complete, multi-turn spoken conversations across two fundamental dimensions:
- 🎯 EVA-A (Accuracy): Did the agent complete the task correctly and faithfully?
- ✨ EVA-X (Experience): Was the interaction natural, concise, and appropriate for spoken dialogue?
Using a realistic bot-to-bot architecture, EVA runs fully automated evaluations without human listeners: end to end, from speech in to judgment out.
- Metrics for both EVA-A and EVA-X, fully documented and validated, including judge prompts and code
- 50 airline scenarios spanning flight rebooking, cancellations, vouchers, and more
- Results for 20 cascade and audio-native systems (speech-to-speech models, large audio language models); see Experiment Setup for model configurations.
Agents that score well on task completion tend to score worse on conversational experience, and vice versa. The accuracy-experience tradeoff is real, consistent, and previously unmeasured.
If you're only interested in running the latest stable version of EVA, you can clone with `--branch latest`, and optionally speed things up with `--depth 1 --no-tags --single-branch`:

```bash
git clone https://github.com/ServiceNow/eva.git --branch latest --depth 1 --no-tags --single-branch
```

Otherwise, for development, you can clone the default branch, `main`:

```bash
git clone https://github.com/ServiceNow/eva.git
```

We recommend using uv for fast, reliable dependency management. If you don't have uv installed, see the uv installation guide.
This project requires Python 3.11–3.13 (set via `requires-python` in `pyproject.toml`). uv will automatically select a compatible version. If you're using pip, make sure you're running a supported Python version.
```bash
cd eva

# Install all dependencies (uv automatically creates a virtual environment)
uv sync --all-extras

# Copy environment template
cp .env.example .env

# Edit .env with your API keys (ELEVENLABS_API_KEY, OPENAI_API_KEY required)
```

After installation, you can run EVA using either:
- `eva`: the CLI entry point (e.g., `eva --help`)
- `python main.py`: the script at the repo root (e.g., `python main.py --help`)
If using an IDE, point your Python interpreter to `.venv/bin/python` so commands run in the virtual environment automatically. Otherwise, prefix commands with `uv run` or activate the environment with `source .venv/bin/activate`.
Alternative: using pip
This project requires Python 3.11–3.13. If you need to manage multiple Python versions, consider using pyenv.
```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -e ".[dev]"
```

Required:
- `OPENAI_API_KEY` (or another LLM provider): powers the assistant LLM and text judge metrics
- `EVA_MODEL_LIST`: model deployments that reference your API key (see `.env.example`). Also configurable via the `--model-list` CLI flag. Only used for regular LLMs.
- `ELEVENLABS_API_KEY` + agent IDs: for user simulation
- STT/TTS API key and model: passed via `EVA_MODEL__STT_PARAMS` / `EVA_MODEL__TTS_PARAMS` (the default provider is Cartesia)
For all metrics:

- `OPENAI_API_KEY`: GPT-5.2 for text judge metrics (task completion, conciseness, turn taking, etc.)
- `GOOGLE_APPLICATION_CREDENTIALS`: Gemini via Vertex AI (audio judge metrics)
- `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY`: Claude via Bedrock (faithfulness metric)
Key Environment Variables:
```bash
# Framework Configuration
EVA_DOMAIN=airline
EVA_MAX_CONCURRENT_CONVERSATIONS=5
EVA_DEBUG=false                  # Run only 1 record for testing when enabled
EVA_RECORD_IDS=1.2.1,1.2.2       # Run specific records only (remove to run all records)

# Pipeline Model Configuration (nested under EVA_MODEL__)
EVA_MODEL__LLM=gpt-5-mini        # LLM model name (must match EVA_MODEL_LIST)
EVA_MODEL__STT=deepgram          # deepgram | openai_whisper
EVA_MODEL__TTS=cartesia          # cartesia | elevenlabs
EVA_MODEL__STT_PARAMS={"api_key":"", "alias": "deepgram-nova-3", "model": "nova-3"}
EVA_MODEL__TTS_PARAMS={"api_key":"", "alias": "cartesia-sonic-3", "model": "sonic-3"}

# Or speech-to-speech model (mutually exclusive with LLM)
# EVA_MODEL__S2S=gpt-realtime-mini # Audio-native model name (S2S, S2T+TTS)

# Logging
EVA_LOG_LEVEL=INFO               # DEBUG | INFO | WARNING | ERROR
```

See `.env.example` for the complete list of configuration options.
CLI arguments take precedence over environment variables, which in turn take precedence over the `.env` file.
```bash
eva --domain airline --model.llm gpt-5-mini --max-concurrent-conversations 10
```

Here is an example of a shell loop to sweep over domains, models, or any combination of parameters. Each iteration is an independent `eva` run. The loop continues on failure and exits with the last non-zero exit code.
```bash
exit_code=0;
for domain in airline itsm medical_hr; do
  for llm in gpt-5-mini gpt-5; do
    eva --domain "$domain" --model.llm "$llm" || exit_code=$?;
  done;
done;
exit $exit_code
```

💡 If you need a single command, like in Docker, you can wrap the shell script with `sh -c '...'`.
Re-run specific metrics on an existing run:
```bash
eva \
  --run-id <existing_run_id> \
  --metrics task_completion,faithfulness,conciseness
```

EVA includes a Streamlit analysis app for visualizing and comparing results:
```bash
streamlit run apps/analysis.py
```

The app reads from the `output/` directory by default and provides three views: cross-run comparison, run overview, and per-record detail (transcripts, audio, metrics, conversation traces). See `apps/README.md` for full documentation.
```bash
# Build the image
docker compose build

# Run a benchmark
docker compose run --rm benchmark
```

Install pre-commit hooks to lint and format code:
```bash
pre-commit install
```

Install the `[dev]` extra dependencies as shown in the Installation section.
```bash
# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_postprocessor_transcript.py -v

# Run with coverage
pytest tests/ --cov=eva

# Run metrics tests
pytest tests/integration/test_metrics.py -v
```

Existing benchmarks evaluate voice agent components in isolation (speech understanding, TTS quality, or conversational dynamics), but none assess the full pipeline end to end. In real deployed systems, errors compound across modules and failure modes interact in ways that component-level evaluation cannot capture. EVA addresses this by treating voice agent quality as an integrated whole, evaluating accuracy and experience jointly across complete multi-turn spoken conversations.
| Framework | Interaction Mode | Multi-turn | Tool Calling | Goal Completion | Experience Metrics | Pass@k / Pass^k | Supported Systems |
|---|---|---|---|---|---|---|---|
| EVA | Live bot-to-bot | ✅ | ✅ | ✅ Task Completion, Speech Fidelity, Faithfulness | ✅ Conciseness, Turn-taking, Latency, Progression | ✅ | Audio-native, Cascade |
| VoiceAgentBench | Static, TTS-synthesized | ❌ | ✅ | ❌ | ❌ | ❌ | Audio-native, Cascade |
| CAVA | Partial simulation | ❌ | ✅ | ❌ | ✅ Latency, Tone-awareness | ❌ | Audio-native, Cascade |
| FDB-v2 | Live, automated examiner | ✅ | ❌ | ❌ | ✅ Turn-taking fluency, Correction handling, Safety | ❌ | Audio-native |
| FDB-v1 | Static, pre-recorded | ❌ | ❌ | ❌ | ✅ Turn-taking, Backchanneling, Interruption | ❌ | Audio-native |
| FD-Bench | Live, simulated | ✅ | ❌ | ❌ | ✅ Interruption, Delay, Robustness | ❌ | Audio-native |
| Talking Turns | Static, curated | ❌ | ❌ | ❌ | ✅ Turn change, Backchannel, Interruption | ❌ | Audio-native, Cascade |
EVA evaluates agents using a bot-to-bot audio architecture: no human listeners, no text replays. Two conversational AIs speak to each other over a live WebSocket connection, producing realistic speech-to-speech interactions that capture real STT behavior and turn-taking dynamics.
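To make the loop concrete, here is a conceptual sketch of a bot-to-bot audio exchange over a live WebSocket. This is not EVA's implementation (EVA pairs an ElevenLabs user simulator with a Pipecat assistant); the handler names and placeholder audio frames are hypothetical, and the only dependency is the third-party `websockets` package.

```python
# Toy bot-to-bot loop: a "user simulator" client streams audio frames to a
# "voice agent" server over a WebSocket, and the server streams audio back.
import asyncio
import websockets

async def voice_agent(ws):
    """Stand-in for the system under test: consume caller audio, reply in audio."""
    async for user_frame in ws:
        # A real agent would run STT -> LLM (+ tools) -> TTS here.
        reply_frame = b"\x00" * len(user_frame)  # placeholder "speech"
        await ws.send(reply_frame)

async def user_simulator(uri):
    """Stand-in for the caller: stream audio frames, collect the agent's replies."""
    async with websockets.connect(uri) as ws:
        for _ in range(3):
            await ws.send(b"\x01" * 320)  # pretend 20 ms of 8 kHz 16-bit PCM
            agent_audio = await ws.recv()
            print(f"agent replied with {len(agent_audio)} bytes of audio")

async def main():
    async with websockets.serve(voice_agent, "localhost", 8765):
        await user_simulator("ws://localhost:8765")

asyncio.run(main())
```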
| Component | Role |
|---|---|
| 👤 User Simulator (ElevenAgent) | Plays the role of a caller with a defined goal and persona |
| 🤖 Voice Agent (Pipecat) | The system under evaluation; supports cascade (STT→LLM→TTS) and speech-to-speech models |
| 🔧 Tool Executor | Provides deterministic, reproducible tool responses via custom Python functions that query and modify a predefined per-scenario database |
| ✅ Validators | Automated checks that verify conversations are complete and that the user simulator faithfully reproduced its intended goal; no human annotation required. Conversations that fail validation are automatically regenerated, so only clean, correctly executed runs enter evaluation |
| 📊 Metrics Engine | Scores each conversation using the audio recording, transcripts, and tool call logs |
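As an illustration of the Tool Executor's design, here is a hedged sketch of what a deterministic airline tool could look like. The function names, database schema, and file layout are hypothetical; EVA's real tools live under `src/eva/assistant/tools/` and the per-scenario databases under `data/airline_scenarios/`.

```python
# Hypothetical deterministic tools: every answer comes from the scenario's own
# database, so the same call always yields the same, reproducible response.
import json
from pathlib import Path

def load_scenario_db(record_id: str, root: str = "data/airline_scenarios") -> dict:
    """Load one record's private database so every run starts from the same state."""
    return json.loads((Path(root) / f"{record_id}.json").read_text())

def lookup_reservation(db: dict, confirmation_code: str) -> dict:
    """Read-only tool: answer from the scenario database, never an external API."""
    reservation = db.get("reservations", {}).get(confirmation_code)
    if reservation is None:
        return {"error": f"no reservation found for {confirmation_code}"}
    return reservation

def cancel_reservation(db: dict, confirmation_code: str) -> dict:
    """Mutating tool: update the in-memory scenario state deterministically."""
    reservation = db.get("reservations", {}).get(confirmation_code)
    if reservation is None:
        return {"error": f"no reservation found for {confirmation_code}"}
    reservation["status"] = "cancelled"
    return {"ok": True, "reservation": reservation}
```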
```
output/<run_id>/
├── config.json                  # Run configuration snapshot
├── results.csv                  # Quick results table
├── metrics_summary.json         # Aggregate metrics (after metrics run)
├── metrics_summary.csv          # Per-category metrics breakdown
└── records/<record_id>/
    ├── result.json              # Conversation result
    ├── audio_assistant.wav      # Assistant audio channel
    ├── audio_user.wav           # User audio channel
    ├── audio_mixed.wav          # Mixed stereo audio
    ├── transcript.jsonl         # Turn-by-turn transcript
    ├── audit_log.json           # Complete interaction log
    ├── pipecat_logs.jsonl       # Pipecat framework events
    ├── elevenlabs_events.jsonl  # ElevenLabs events
    └── metrics.json             # Per-record metric scores and details
```
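Because every artifact is a plain file, runs are easy to post-process outside the framework. A minimal sketch, assuming only the directory layout above (the file schemas here are simplified for illustration):

```python
# Walk a finished run: print run-level aggregates, then per-record summaries.
import json
from pathlib import Path

run_dir = Path("output") / "<run_id>"  # substitute a real run id

# Run-level aggregates
summary = json.loads((run_dir / "metrics_summary.json").read_text())
print("aggregate metrics:", summary)

# Per-record artifacts
for record_dir in sorted((run_dir / "records").iterdir()):
    metrics = json.loads((record_dir / "metrics.json").read_text())
    # transcript.jsonl holds one JSON object per line (one per turn)
    turns = [json.loads(line) for line in (record_dir / "transcript.jsonl").open()]
    print(record_dir.name, f"{len(turns)} turns", metrics)
```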
| 🎯 EVA-A · Accuracy | ✨ EVA-X · Experience |
|---|---|
| Did the agent complete the task correctly? | Was the conversational experience high quality? |
| Task Completion · Deterministic | Turn Taking · LLM Judge BETA |
| Agent Speech Fidelity · Audio LLM Judge BETA | Conciseness · LLM Judge |
| Faithfulness · LLM Judge | Conversation Progression · LLM Judge |
See the Metrics documentation for detailed scoring rubrics and judge prompts. For the data structures that metrics operate on, see MetricContext documentation.
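For intuition, here is a schematic sketch of what an LLM-judge experience metric does. The interface is hypothetical (EVA's actual base classes are in `src/eva/metrics/base.py` and its judge prompts in `configs/prompts/judge.yaml`); only the overall shape, prompting a judge model and returning a normalized score, reflects the framework.

```python
# Hypothetical LLM-judge metric: prompt a judge model with the transcript and
# a rubric, then normalize its rating into a [0, 1] score.
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class MetricResult:
    name: str
    score: float       # normalized to [0, 1]
    explanation: str

def judge_conciseness(
    transcript: str,
    llm_judge: Callable[[str], Tuple[int, str]],  # assumed: prompt -> (rating, why)
) -> MetricResult:
    """Ask a judge LLM to rate how concise the assistant's spoken turns were."""
    prompt = (
        "Rate the assistant's conciseness in this spoken conversation from "
        "1 (verbose) to 5 (concise), and justify briefly.\n\n" + transcript
    )
    rating, explanation = llm_judge(prompt)
    return MetricResult("conciseness", (rating - 1) / 4, explanation)
```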
We created three datasets in different enterprise domains, each selected to target a distinct axis of difficulty for voice agents. All three require accurate transcription of structured named entities over voice (e.g., confirmation codes and employee identifiers), but differ in their primary challenge. Airline Customer Service Management (CSM) tests temporal reasoning and complex policy adherence in high-stakes flight rebooking scenarios. Healthcare Human Resources Service Delivery (HRSD) stresses entity density, requiring callers to communicate multiple registration and license numbers across clinical and administrative HR workflows. Enterprise Information Technology Service Management (ITSM) introduces branching conversational flows (e.g., incident resolution attempts must fail before ticket escalation is permitted) and tiered authentication reflecting the access sensitivity of different workflows.
Within each domain, scenarios span three dimensions: Single-Intent (one workflow per call), Multi-Intent (one to four concurrent workflows, testing compositional task completion without context loss), and Adversarial (hard policy constraints under social pressure, e.g., refusing compensation to an ineligible caller).
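For a rough sense of what a scenario encodes, here is a hypothetical record sketch. Field names and values are illustrative only; the authoritative schema lives in `data/airline_dataset.jsonl` and the Data documentation.

```python
# Illustrative (not actual) shape of a single evaluation record.
record = {
    "record_id": "1.2.1",                 # matches the EVA_RECORD_IDS format above
    "domain": "airline",
    "dimension": "multi_intent",          # single_intent | multi_intent | adversarial
    "goals": [
        "Rebook the delayed flight under confirmation code K7Q2XN",
        "Request a meal voucher for the delay",
    ],
    "persona": "Polite but hurried business traveler",
}
```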
See the Data documentation for a detailed breakdown of the data structure and scenario design, and the Database & Tool Schema for the airline scenario database format.
```
eva/
├── main.py                      # Main entry point
├── pyproject.toml               # Python project configuration
├── apps/                        # Streamlit apps
├── Dockerfile                   # Docker configuration
├── compose.yaml                 # Docker Compose configuration
├── src/eva/
│   ├── cli.py                   # CLI interface
│   ├── run_benchmark.py         # Benchmark runner
│   ├── models/                  # Pydantic data models
│   ├── orchestrator/            # Framework execution
│   │   ├── runner.py            # Main orchestrator
│   │   ├── worker.py            # Per-conversation worker
│   │   ├── validation_runner.py # Validation runner
│   │   └── port_pool.py         # Port management
│   ├── assistant/               # Pipecat-based assistant
│   │   ├── agentic/             # Agent orchestration
│   │   ├── tools/               # Python-based tool implementations
│   │   ├── pipeline/            # Audio/LLM processing pipeline
│   │   └── services/            # STT/TTS/LLM factories
│   ├── user_simulator/          # ElevenLabs user simulator
│   ├── metrics/                 # Evaluation metrics
│   │   ├── base.py              # Base metric classes
│   │   ├── processor.py         # Metrics context processor
│   │   ├── runner.py            # Metrics execution
│   │   ├── registry.py          # Metric registry
│   │   ├── aggregation.py       # Metric aggregation
│   │   ├── accuracy/            # Task completion metrics
│   │   ├── experience/          # Responsiveness, progression, turn-taking
│   │   ├── diagnostic/          # Diagnostic metrics (not in final scores)
│   │   └── validation/          # Quality control metrics
│   └── utils/                   # Utilities (LLM client, log processing)
├── scripts/                     # Utility scripts
│   ├── run_text_only.py         # Text-only evaluation runner
│   ├── docker_entrypoint.py     # Docker entry point
│   └── check_version_bump.py    # Version checking
├── configs/                     # Configuration files
│   ├── prompts/                 # Judge and simulation prompts
│   │   ├── judge.yaml           # Judge metric prompts
│   │   └── simulation.yaml      # User simulator prompts
│   └── agents/                  # Agent configurations
│       └── airline_agent.yaml
├── docs/                        # Documentation
│   ├── metrics/                 # Per-metric documentation
│   ├── data.md                  # Data documentation
│   ├── experiment_setup.md      # Experiment setup guide
│   ├── llm_configuration.md     # LLM provider setup guide
│   ├── metric_context.md        # Metric context documentation
│   ├── limitations.md           # Known limitations
│   └── demo/                    # Demo audio files
├── data/                        # Data files
│   ├── airline_dataset.jsonl    # Evaluation dataset
│   └── airline_scenarios/       # Per-record scenario databases
├── tests/                       # Test suite
│   ├── unit/                    # Unit tests
│   ├── integration/             # Integration tests
│   ├── artifacts/               # Test artifacts and fixtures
│   └── fixtures/                # Shared test fixtures
└── website/                     # Project website (React/TypeScript)
```
We welcome contributions! Please read our Contributing Guidelines before submitting a pull request. For larger features, we recommend reaching out first to ensure alignment with our roadmap.