feat: Add QA evaluation structured outputs for Starlight (Brent Council) by roshan-vapi · Pull Request #5 · VapiAI/gitops

roshan-vapi · 2026-02-24T21:34:57Z

Summary

Adds 5 structured output YAML files for automated post-call QA evaluation of Brent Council Housing Benefits calls (Starlight project).

4 QA category structured outputs that evaluate call transcripts against Brent Council's manual QA criteria
1 wrap-up code structured output that classifies calls into 19 predefined categories

Linear Issue

PRO-846

Files Created

| File | Category | Questions | Auto-Fail |
|---|---|---|---|
| `resources/structuredOutputs/starlight-qa-engagement.yml` | Engagement | 7 (1.1-1.7) | 1.3, 1.4, 1.5 |
| `resources/structuredOutputs/starlight-qa-right-first-time.yml` | Right First Time | 8 (2.1-2.8) | 2.3, 2.4, 2.5 |
| `resources/structuredOutputs/starlight-qa-signposting.yml` | Signposting | 2 (3.1-3.2) | None |
| `resources/structuredOutputs/starlight-qa-explaining.yml` | Explaining | 2 (4.1-4.2) | None |
| `resources/structuredOutputs/starlight-wrap-up-code.yml` | Call Classification | N/A | N/A |

Schema Design

Each QA structured output produces per-question evaluations with:

result: yes / no / not_applicable
reasoning: explanation referencing the conversation
evidence: array of { message_text, timestamp } excerpts

Top-level fields:

auto_fail: true if ANY auto-fail question received no
overall_pass: true only if auto_fail is false
category_score: fraction string e.g. "5/7"
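
The field relationships above can be restated as a small TypeScript sketch. In this PR the fields are produced by the model from the YAML schema, so nothing below is the actual implementation; the type names, the `summarizeCategory` helper, the string `timestamp` type, and the choice to count `not_applicable` questions in the score denominator are all assumptions made for illustration. Treating `overall_pass` as simply the negation of `auto_fail` is also an assumption, since the description only says it is true "only if" `auto_fail` is false.

```typescript
// Hypothetical types mirroring the fields listed above; names are illustrative.
type QaResult = "yes" | "no" | "not_applicable";

interface Evidence {
  message_text: string;
  timestamp: string; // assumption: the timestamp format is not specified in the PR
}

interface QuestionEvaluation {
  result: QaResult;
  reasoning: string;
  evidence: Evidence[];
}

interface CategorySummary {
  auto_fail: boolean;
  overall_pass: boolean;
  category_score: string;
}

// Derive the top-level fields for one category from its per-question results.
function summarizeCategory(
  questions: Record<string, QuestionEvaluation>,
  autoFailIds: string[],
): CategorySummary {
  const all = Object.values(questions);
  // auto_fail: true if ANY auto-fail question received "no".
  const auto_fail = autoFailIds.some((id) => questions[id]?.result === "no");
  const yesCount = all.filter((q) => q.result === "yes").length;
  return {
    auto_fail,
    // Assumption: no pass condition other than the absence of an auto-fail.
    overall_pass: !auto_fail,
    // Assumption: the denominator counts every question, including not_applicable.
    category_score: `${yesCount}/${all.length}`,
  };
}
```

For example, an Engagement evaluation where question 1.3 is answered `no` would come back with `auto_fail: true` regardless of how the other questions scored.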

Auto-fail logic: If any auto-fail question in ANY of the 4 categories receives no, the ENTIRE call evaluation fails. Each structured output sets its own auto_fail flag; the consuming application must check across all 4.
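
Because each structured output only knows about its own category, the cross-category check has to live in the consuming application. A minimal sketch of that check follows. The `callPasses` helper and the category keys are hypothetical (the real `name` values in the YAML files may differ), and treating a missing output as a failure is an assumption rather than something this PR specifies.

```typescript
// Hypothetical shape of one category's structured output, as seen by the consumer.
interface CategoryOutput {
  auto_fail: boolean;
  overall_pass: boolean;
}

// Hypothetical keys for the 4 QA categories; the real output names may differ.
const QA_CATEGORIES = ["engagement", "right_first_time", "signposting", "explaining"];

// The call passes only if every required category is present and none auto-failed.
function callPasses(
  outputs: Record<string, CategoryOutput | undefined>,
  required: string[] = QA_CATEGORIES,
): boolean {
  return required.every((name) => {
    const output = outputs[name];
    // Assumption: a missing output fails closed instead of being skipped.
    return output !== undefined && !output.auto_fail;
  });
}
```

Failing closed on a missing category is the conservative choice here: a call is never reported as passed when one of the four evaluations did not run.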

Key Design Decisions

Model: gpt-4.1 at temperature: 0 for deterministic, accurate QA evaluation
Multilingual support: All outputs include explicit instructions to evaluate in transcript language
AI agent adaptation: Questions that don't apply to AI agents (ACW, system logging, hold time) have not_applicable guidance
Glossary: Full Brent Council Housing Benefits terminology embedded in each output's description
assistant_ids: []: Empty because Starlight assistant configs are not yet in the gitops repo; will be populated when they are added
Wrap-up code second-tier: Placeholder secondary_classification_notes field for pending tier definitions

Line Count Note

This PR is 778 lines, which exceeds the 500-line guideline. However, all additions are declarative YAML data files with repetitive per-question schema structure. The 5 files are logically atomic units that cannot be meaningfully split -- each represents a single structured output definition. No code was modified.

How to Test

Verify YAML validity: each file parses correctly with the yaml npm package
Verify schema.type is always a simple string (not an array) per AGENTS.md warning
After push to Vapi (npm run push:dev), verify structured outputs appear in the dashboard
Run a test call and verify the structured outputs produce expected evaluation results
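
The `schema.type` check in the steps above can be automated once a file has been parsed. The sketch below walks an already-parsed schema object and reports every `type` keyword that is not a plain string. The `findNonStringTypes` helper is hypothetical and not part of this repo, and it only descends into `properties` and `items`, which is assumed to be enough for the schemas in this PR.

```typescript
// Narrow an unknown value to a plain object (not null, not an array).
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Return the path of every `type` keyword whose value is not a simple string.
// Only `properties` and `items` are traversed (assumed sufficient for these schemas).
function findNonStringTypes(schema: unknown, path = "schema"): string[] {
  if (!isRecord(schema)) return [];
  const problems: string[] = [];
  if ("type" in schema && typeof schema["type"] !== "string") {
    problems.push(`${path}.type`);
  }
  const properties = schema["properties"];
  if (isRecord(properties)) {
    // Keys under `properties` are field names, so a field called "type" is not flagged.
    for (const [name, child] of Object.entries(properties)) {
      problems.push(...findNonStringTypes(child, `${path}.properties.${name}`));
    }
  }
  problems.push(...findNonStringTypes(schema["items"], `${path}.items`));
  return problems;
}
```

The input would typically be the `schema` field of a document returned by `parse` from the `yaml` package; an empty result means every `type` in the walked portion of the schema is a simple string.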

Validation

All 5 files validated as correct YAML with required fields (name, type, target, description, model, schema, assistant_ids, workflow_ids)
schema.type confirmed as simple string "object" in all files (avoids .toLowerCase() crash)
All question properties validated to have result, reasoning, and evidence sub-properties
name fields follow snake_case convention per AGENTS.md
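
The required-field and naming checks listed above could be scripted along these lines. This is a sketch only: `validateTopLevel` is a hypothetical helper, the required field list is copied from the validation notes above, and the snake_case pattern is one reasonable reading of the AGENTS.md convention, not its definition.

```typescript
// Required top-level fields, as listed in the validation notes above.
const REQUIRED_FIELDS = [
  "name",
  "type",
  "target",
  "description",
  "model",
  "schema",
  "assistant_ids",
  "workflow_ids",
] as const;

// Assumption: snake_case means lowercase alphanumeric words joined by single underscores.
const SNAKE_CASE = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/;

// Return a list of human-readable problems; an empty list means the document passed.
function validateTopLevel(doc: Record<string, unknown>): string[] {
  const errors: string[] = REQUIRED_FIELDS
    .filter((field) => !(field in doc))
    .map((field) => `missing field: ${field}`);
  const name = doc["name"];
  if (typeof name === "string" && !SNAKE_CASE.test(name)) {
    errors.push(`name is not snake_case: ${name}`);
  }
  return errors;
}
```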

Add 5 structured output YAML files for automated post-call QA evaluation of Brent Council Housing Benefits calls:

- starlight-qa-engagement.yml: 7 questions (3 auto-fail: 1.3, 1.4, 1.5)
- starlight-qa-right-first-time.yml: 8 questions (3 auto-fail: 2.3, 2.4, 2.5)
- starlight-qa-signposting.yml: 2 questions (no auto-fail)
- starlight-qa-explaining.yml: 2 questions (no auto-fail)
- starlight-wrap-up-code.yml: call classification into 19 wrap-up codes

Each QA structured output evaluates per-question with result (yes/no/not_applicable), reasoning, and transcript evidence. Auto-fail logic: if ANY auto-fail question receives "no", the entire evaluation fails across all categories. All outputs include multilingual transcript support, AI agent adaptation notes, and the full Brent Council Housing Benefits glossary.

Closes PRO-846

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

## ELI5

**Problem.** `npm run push -- <env>` immediately starts hitting the live dashboard. There was no way to ask "what would this push do?" before firing it. So a fat-fingered command (wrong org, missing file path, wide-scope push when you meant scoped) hit production immediately, and recovery meant `pull` + manual revert. The only existing dry-run concept gated *deletions*, not creates or updates.

**What this fix does.** Adds a `--dry-run` flag to `push`. Instead of firing POST/PATCH/DELETE, the engine counts the intent and prints `[dry-run] would <METHOD> <endpoint> <body-preview>` per resource. The state file is never written (so synthetic IDs don't pollute it), and the end-of-run summary shows `Would create N, would update M, would delete K`. GETs still run because drift detection (Stack G) and operator preview both need to see current platform state.

**Outcome you'll notice.** Run `npm run push -- <env> --dry-run` to preview any push. Especially useful for "did I scope this right?" and "is the pre-push lint reporting drift I should address first?" before the real push. Cheapest individual operator-safety win in the stack: no schema changes, no engine architecture moves.

---

Operators today can't validate "is this push doing what I think it's doing" before it lands on prod. push.ts has a dry-run concept only for deletions; updates and creates fire immediately. Cheapest individual operator-safety win (improvements.md #5).

- src/config.ts: parseFlags now accepts --dry-run alongside --force / --bootstrap. Exports DRY_RUN.
- src/api.ts: vapiRequest gates POST/PATCH on DRY_RUN: it counts the intent, prints `[dry-run] would <METHOD> <endpoint>` with a 120-char body preview, and returns a synthetic id so caller code threads through. vapiDelete gets the same treatment. GETs always run (drift preview needs them).
- src/push.ts: banner ("🧪 DRY-RUN") at start, summary at end ("Would create N, would update M, would delete K"), saveState entirely skipped in dry-run so synthetic ids never leak into the state file.
- AGENTS.md: document --dry-run in Available Commands.
- tests/push-dry-run.test.ts: --dry-run is parse-accepted, banner prints, state file is NEVER created (verified end-to-end via spawn).
- improvements.md: #5 → RESOLVED.

Closes improvements.md #5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
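
The gating behaviour described for `vapiRequest` and `vapiDelete` can be pictured with a small stand-alone sketch. This is not the code in src/api.ts: the `createDryRunGate` factory, the `dry-run-N` synthetic id format, and the `/example` endpoint used in the usage note are all hypothetical. Only the log line format, the 120-character body preview, the GET pass-through, and the summary wording come from the description above.

```typescript
type Method = "GET" | "POST" | "PATCH" | "DELETE";

// Build a gate that intercepts mutating requests when dry-run is enabled.
function createDryRunGate(
  dryRun: boolean,
  log: (line: string) => void = (line) => console.log(line),
) {
  const counts = { create: 0, update: 0, delete: 0 };
  let nextId = 0;

  // Returns a synthetic result when the request was intercepted, or null when
  // the caller should perform the real request (not dry-run, or a GET).
  function intercept(method: Method, endpoint: string, body?: unknown): { id: string } | null {
    if (!dryRun || method === "GET") return null;
    if (method === "POST") counts.create += 1;
    else if (method === "PATCH") counts.update += 1;
    else counts.delete += 1;
    // Body preview is truncated to 120 characters, as described above.
    const preview = body === undefined ? "" : " " + JSON.stringify(body).slice(0, 120);
    log(`[dry-run] would ${method} ${endpoint}${preview}`);
    nextId += 1;
    // Assumption: the synthetic id format is arbitrary; it only needs to be non-empty.
    return { id: `dry-run-${nextId}` };
  }

  const summary = (): string =>
    `Would create ${counts.create}, would update ${counts.update}, would delete ${counts.delete}`;

  return { intercept, summary };
}
```

A caller would check `intercept("POST", "/example", payload)` first and only issue the real request when it returns `null`. Skipping the state-file write in dry-run mode remains the caller's responsibility, since the synthetic ids must never be persisted.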

vapi-tasker Bot added the tasked-to-tasker label Feb 24, 2026

roshan-vapi added the merge-queue label Feb 24, 2026

dhruva-reddy mentioned this pull request May 1, 2026

feat: push --dry-run preview mode #16

Merged

roshan-vapi wants to merge 1 commit into main from tasker/PRO-846-qa-structured-outputs
