From b12899448c2c18d4457a392e503b3239903a8eab Mon Sep 17 00:00:00 2001
From: lukeocodes
Date: Mon, 27 Apr 2026 15:12:54 +0100
Subject: [PATCH 1/2] docs: target Context7 benchmark gaps in Python skills
 [no-ci]

The Context7 benchmark for /deepgram/deepgram-python-sdk scores the SDK
against 10 standardized prompts (rubric: implementation 40 + accuracy 25 +
relevance 20 + completeness 10 + clarity 5 = 100). Current score: 88.8.
Four prompts had the largest gaps:

Prompt #1 (66/100) - Voice agent dynamic adjustment + stream restart

Eval said the skill 'lacks specific guidance or API methods for dynamically
adjusting transcription parameters during an active connection or for
intelligently managing stream restarts and pauses beyond basic error
events'.

deepgram-python-voice-agent/SKILL.md:

- New 'Dynamic mid-session adjustment' section with runnable code for
  send_update_prompt, send_update_speak, send_update_think,
  send_inject_agent_message, send_inject_user_message, send_keep_alive
  (sync + async equivalents).
- New 'Stream lifecycle & recovery' section covering KeepAlive on idle,
  pause/resume audio, reconnect after disconnect with conversation history
  replay via AgentV1SettingsAgentContext, and EventType.CLOSE /
  EventType.ERROR handling guidance.

Prompt #2 (71/100) - Live streaming with interim results display

Eval said 'all examples show interim_results=False, which is the opposite
of what's needed, and none demonstrate how to differentiate between interim
and final results or how to handle the display logic'.

deepgram-python-speech-to-text/SKILL.md:

- Rewrote the WebSocket quick-start to pass interim_results=True,
  utterance_end_ms=1000, vad_events=True, with a real overwrite-line
  pattern that shows interim results live and commits the line on final.
- Added an 'Interim vs. final flag semantics' subsection explaining
  is_final, speech_final, and from_finalize distinctions and when each
  fires.

Prompt #5 (83/100) - Diarization + word-level timings combined

Eval said the skill 'lacks a specific, complete code example showing how to
enable both diarization and word-level timings together in a single
request'.

deepgram-python-audio-intelligence/SKILL.md:

- New 'Quick start - diarization with word-level timings' section: one
  focused snippet enabling diarize=True with per-word iteration showing
  speaker, start, end, confidence, punctuated_word.
- Added a per-word fields table (word, punctuated_word, start, end,
  confidence, speaker, speaker_confidence) plus a groupby-by-speaker
  pattern and pointers to utterances=True / paragraphs=True for
  pre-grouped views.

Prompt #8 (83/100) - Async URL transcription + retrieve final result

Eval said the skill 'lacks critical information about handling asynchronous
results - while it mentions callback functionality, it doesn't explain how
to retrieve the final transcription when using async methods or how to
poll for results'.

deepgram-python-speech-to-text/SKILL.md:

- New 'Async / deferred result patterns' section explicitly distinguishing
  Python async/await (sync-style, immediate result via AsyncDeepgramClient)
  from deferred via callback URL (returns request_id immediately, results
  POST'd to webhook later, no polling).
- Decision table mapping each pattern to when to use it, with pointer to
  examples/12-transcription-prerecorded-callback.py.

Net: +276 lines targeting ~97 missing benchmark points (potential lift
88.8 -> ~98 once Context7 reindexes).
---
 .../SKILL.md                                  |  41 ++++++
 .../deepgram-python-speech-to-text/SKILL.md   | 103 ++++++++++++-
 .../deepgram-python-voice-agent/SKILL.md      | 139 ++++++++++++++++++
 3 files changed, 276 insertions(+), 7 deletions(-)

diff --git a/.agents/skills/deepgram-python-audio-intelligence/SKILL.md b/.agents/skills/deepgram-python-audio-intelligence/SKILL.md
index 3dff437d..655ae50f 100644
--- a/.agents/skills/deepgram-python-audio-intelligence/SKILL.md
+++ b/.agents/skills/deepgram-python-audio-intelligence/SKILL.md
@@ -96,6 +96,47 @@ response = client.listen.v1.media.transcribe_file(
 )
 ```
 
+## Quick start — diarization with word-level timings
+
+Enable speaker separation and word-level timestamps in a single request, then iterate the per-word objects to build a speaker-labelled transcript with timing.
+
+```python
+response = client.listen.v1.media.transcribe_url(
+    url="https://dpgr.am/spacewalk.wav",
+    model="nova-3",
+    diarize=True,       # tag each word with a speaker id
+    smart_format=True,  # punctuated_word for cleaner output
+    punctuate=True,
+)
+
+words = response.results.channels[0].alternatives[0].words or []
+
+# Per-word: speaker, timestamps, confidence
+for w in words:
+    speaker = getattr(w, "speaker", None)
+    text = w.punctuated_word or w.word
+    print(f"[speaker {speaker}] {text} ({w.start:.2f}s–{w.end:.2f}s, conf={w.confidence:.2f})")
+
+# Group consecutive words by speaker into utterances
+from itertools import groupby
+for speaker, group in groupby(words, key=lambda w: getattr(w, "speaker", None)):
+    text = " ".join((w.punctuated_word or w.word) for w in group)
+    print(f"Speaker {speaker}: {text}")
+```
+
+Per-word fields available on each entry:
+
+| Field | Type | Description |
+|---|---|---|
+| `word` | `str` | Lowercase token |
+| `punctuated_word` | `str \| None` | Token with smart-formatted casing/punctuation (when `smart_format=True`) |
+| `start`, `end` | `float` | Audio timestamps in seconds |
+| `confidence` | `float` | 0.0–1.0 confidence |
+| `speaker` | `int \| None` | Speaker id (when `diarize=True`); `None` if diarization disabled |
+| `speaker_confidence` | `float \| None` | Speaker-id confidence |
+
+For a higher-level breakdown, set `utterances=True` to get pre-grouped speaker turns at `response.results.utterances`. Set `paragraphs=True` for a `paragraphs` view organised by speaker turn boundaries.
+
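+A minimal sketch of consuming that pre-grouped view (assuming `utterances=True` was also passed on the request above; field names follow the per-word table):
+
+```python
+for u in (response.results.utterances or []):
+    # One entry per speaker turn, with its own timing and transcript
+    print(f"[{u.start:.2f}s–{u.end:.2f}s] speaker {u.speaker}: {u.transcript}")
+```
+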
 ## Quick start — WSS subset (diarize / redact / entities only)
 
 ```python

diff --git a/.agents/skills/deepgram-python-speech-to-text/SKILL.md b/.agents/skills/deepgram-python-speech-to-text/SKILL.md
index 4332a9d3..0314c134 100644
--- a/.agents/skills/deepgram-python-speech-to-text/SKILL.md
+++ b/.agents/skills/deepgram-python-speech-to-text/SKILL.md
@@ -58,7 +58,9 @@ response = client.listen.v1.media.transcribe_file(
 )
 ```
 `request=` accepts raw `bytes` or an iterator of `bytes` (stream large files chunk-by-chunk). Do NOT pass a file handle.
 
-## Quick start — WebSocket (live streaming)
+## Quick start — WebSocket (live streaming with interim results)
+
+Live transcription emits **interim** (partial) and **final** results. Pass `interim_results=True` and switch on `is_final` to display partial text in real time, then overwrite it with the final transcript when the speaker pauses.
 
 ```python
 import threading
@@ -68,15 +70,40 @@ from deepgram.listen.v1.types import (
     ListenV1SpeechStarted,
     ListenV1UtteranceEnd,
 )
-with client.listen.v1.connect(model="nova-3") as conn:
+last_interim_len = 0  # length of the last interim line (for clean overwrite)
+
+with client.listen.v1.connect(
+    model="nova-3",
+    interim_results=True,   # ← emit partial results while user is still speaking
+    utterance_end_ms=1000,  # silence (ms) before server emits UtteranceEnd
+    vad_events=True,        # SpeechStarted events
+    smart_format=True,
+) as conn:
     def on_message(m):
+        global last_interim_len
         if isinstance(m, ListenV1Results) and m.channel and m.channel.alternatives:
-            print(m.channel.alternatives[0].transcript)
-
-    conn.on(EventType.OPEN, lambda _: print("open"))
+            transcript = m.channel.alternatives[0].transcript
+            if not transcript:
+                return
+            if m.is_final:
+                # Final segment: overwrite the running interim line, newline if utterance ended
+                pad = " " * max(0, last_interim_len - len(transcript))
+                end = "\n" if m.speech_final else ""
+                print(f"\r{transcript}{pad}", end=end, flush=True)
+                last_interim_len = 0
+            else:
+                # Interim: keep overwriting the same console line as the user speaks
+                print(f"\r{transcript}", end="", flush=True)
+                last_interim_len = len(transcript)
+        elif isinstance(m, ListenV1UtteranceEnd):
+            print()  # newline; UtteranceEnd fires after final results when audio goes silent
+        elif isinstance(m, ListenV1SpeechStarted):
+            pass  # optional: reset UI when a new utterance begins
+
+    conn.on(EventType.OPEN, lambda _: print("connected"))
     conn.on(EventType.MESSAGE, on_message)
-    conn.on(EventType.CLOSE, lambda _: print("close"))
-    conn.on(EventType.ERROR, lambda e: print(f"err: {e}"))
+    conn.on(EventType.CLOSE, lambda _: print("\nclosed"))
+    conn.on(EventType.ERROR, lambda e: print(f"\nerr: {e}"))
 
     # Start receive loop in background so we can send concurrently
     threading.Thread(target=conn.start_listening, daemon=True).start()
@@ -87,6 +114,15 @@ with client.listen.v1.connect(model="nova-3") as conn:
     conn.send_finalize()  # flush final partial before closing
 ```
 
+### Interim vs. final flag semantics
+
+- **`is_final = False`** — interim hypothesis. Will be revised. Display in a non-committal style (lighter colour, italic) and overwrite when the next message arrives.
+- **`is_final = True`, `speech_final = False`** — confirmed segment, but the speaker is still talking. Append to the transcript; another final will follow.
+- **`is_final = True`, `speech_final = True`** — confirmed segment AND the utterance ended (silence detected). Commit the line and start a new one.
+- **`from_finalize = True`** — this final was triggered by your explicit `send_finalize()` call (vs natural endpointing). Useful to distinguish "I asked for a flush" from "the speaker paused".
+
+Send `send_finalize()` to force the server to emit final results immediately (e.g. user clicks "stop"). Send `send_close_stream()` after `send_finalize` to terminate cleanly.
+
 WSS message types live under `deepgram.listen.v1.types`.
 
 ## Async equivalents
@@ -102,6 +138,59 @@ async with client.listen.v1.connect(model="nova-3") as conn:
     await conn.start_listening()
 ```
 
+## Async / deferred result patterns
+
+There are **two distinct** notions of "async" — don't confuse them.
+
+### 1. Python `async/await` (sync-style, immediate result)
+
+`AsyncDeepgramClient` methods return `Awaitable[...]`. The result is delivered when you `await`, not later. Use this when integrating with FastAPI, aiohttp, or any asyncio app.
+
+```python
+import asyncio
+from deepgram import AsyncDeepgramClient
+
+client = AsyncDeepgramClient()
+
+async def transcribe(url: str) -> str:
+    response = await client.listen.v1.media.transcribe_url(
+        url=url,
+        model="nova-3",
+        smart_format=True,
+    )
+    # `response` is the FULL transcription — no polling, no callback, just await.
+    return response.results.channels[0].alternatives[0].transcript
+
+text = asyncio.run(transcribe("https://dpgr.am/spacewalk.wav"))
+```
+
+### 2. Deferred via callback URL (webhook, results posted later)
+
+Pass `callback="https://your.app/webhook"` and the request **returns immediately** with a `request_id`. Deepgram processes the audio in the background and POSTs the final result to your webhook URL. There is **no polling endpoint** — your server must be reachable to receive the result.
+
+```python
+response = client.listen.v1.media.transcribe_url(
+    url="https://dpgr.am/spacewalk.wav",
+    callback="https://your.app/deepgram-webhook",
+    callback_method="POST",  # or "PUT"
+    model="nova-3",
+    smart_format=True,
+)
+print(f"Accepted; tracking id: {response.request_id}")
+# response is a "listen accepted" — NOT the transcript. Wait for your webhook.
+```
+
+The webhook receives the same JSON body you would have received from a synchronous `transcribe_url` call. Use this for very long files or when you don't want the request hanging open.
+
+| Pattern | Returns | When to use |
+|---|---|---|
+| `client.listen.v1.media.transcribe_url(...)` | full transcription synchronously | files up to ~10 min; HTTP timeout-bound |
+| `await AsyncDeepgramClient().listen.v1.media.transcribe_url(...)` | full transcription, non-blocking | inside asyncio apps |
+| `transcribe_url(..., callback="https://...")` | `{request_id}` immediately, transcription POSTs to webhook later | very long files; no long-lived HTTP connection |
+| `client.listen.v1.connect(...)` (WebSocket) | streaming events as audio is sent | live audio (mic, telephony) |
+
+See `examples/12-transcription-prerecorded-callback.py` for a working callback example.
+
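+For illustration only: a minimal receiver for that webhook (FastAPI shown here; the framework, route name, and handling are assumptions, not SDK requirements) might look like:
+
+```python
+from fastapi import FastAPI, Request
+
+app = FastAPI()
+
+@app.post("/deepgram-webhook")  # must match the `callback=` URL you registered
+async def deepgram_webhook(request: Request):
+    body = await request.json()  # same JSON shape as a synchronous transcribe_url response
+    transcript = body["results"]["channels"][0]["alternatives"][0]["transcript"]
+    print(transcript)  # persist or queue the result in a real app
+    return {"ok": True}
+```
+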
 ## Key parameters
 
 `model`, `language`, `encoding`, `sample_rate`, `channels`, `multichannel`, `punctuate`, `smart_format`, `diarize`, `endpointing`, `interim_results`, `utterance_end_ms`, `vad_events`, `keywords`, `search`, `redact`, `numerals`, `paragraphs`, `utterances`.

diff --git a/.agents/skills/deepgram-python-voice-agent/SKILL.md b/.agents/skills/deepgram-python-voice-agent/SKILL.md
index e121fa9d..6c4326c2 100644
--- a/.agents/skills/deepgram-python-voice-agent/SKILL.md
+++ b/.agents/skills/deepgram-python-voice-agent/SKILL.md
@@ -124,6 +124,145 @@ with client.agent.v1.connect() as agent:
 
 You can persist the **`agent` block** of a Settings message server-side and reuse it by `agent_id`. `client.voice_agent.configurations.create` stores a JSON string representing the `agent` object only (listen / think / speak providers + prompt) — NOT the full `AgentV1Settings` payload. Do not send top-level Settings fields like `audio` to that API; those still go in the live Settings message at connect time. The returned `agent_id` replaces the inline `agent` object in future Settings messages. Managed via `client.voice_agent.configurations.*` — see `deepgram-python-management-api`.
 
+## Dynamic mid-session adjustment
+
+You can change agent behavior **without disconnecting** by sending control messages on the live socket. Each method is available on the agent connection object (`agent` in the quick-start) for both sync and async clients.
+
+```python
+from deepgram.agent.v1.types import (
+    AgentV1UpdatePrompt,
+    AgentV1UpdateSpeak,
+    AgentV1UpdateSpeakSpeak,  # type alias accepting SpeakSettingsV1 or list
+    AgentV1UpdateThink,
+    AgentV1UpdateThinkThink,  # type alias accepting ThinkSettingsV1 or list
+    AgentV1InjectAgentMessage,
+    AgentV1InjectUserMessage,
+    AgentV1KeepAlive,
+)
+from deepgram.types.speak_settings_v1 import SpeakSettingsV1
+from deepgram.types.speak_settings_v1provider import SpeakSettingsV1Provider_Deepgram
+from deepgram.types.think_settings_v1 import ThinkSettingsV1
+from deepgram.types.think_settings_v1provider import ThinkSettingsV1Provider_OpenAi
+
+# 1. Swap the LLM system prompt mid-conversation (e.g. escalate to a different persona)
+agent.send_update_prompt(
+    AgentV1UpdatePrompt(prompt="You are now in expert escalation mode. Be precise and concise.")
+)
+# Server replies with a `PromptUpdated` event when the new prompt is in effect.
+
+# 2. Swap the TTS voice without reconnecting (e.g. switch language or persona)
+agent.send_update_speak(
+    AgentV1UpdateSpeak(
+        speak=SpeakSettingsV1(
+            provider=SpeakSettingsV1Provider_Deepgram(
+                type="deepgram", model="aura-2-luna-en",
+            ),
+        ),
+    )
+)
+# Server replies with a `SpeakUpdated` event.
+
+# 3. Swap the LLM provider/model (e.g. cheaper model for follow-ups)
+agent.send_update_think(
+    AgentV1UpdateThink(
+        think=ThinkSettingsV1(
+            provider=ThinkSettingsV1Provider_OpenAi(
+                type="open_ai", model="gpt-4o-mini", temperature=0.3,
+            ),
+            prompt="You are a helpful assistant. Keep replies brief.",
+        ),
+    )
+)
+# Server replies with a `ThinkUpdated` event.
+
+# 4. Force the agent to say something specific (without waiting for user audio)
+agent.send_inject_agent_message(
+    AgentV1InjectAgentMessage(message="Quick reminder: your call is being recorded.")
+)
+# Useful for proactive prompts, status updates, or scripted segues.
+
+# 5. Inject a user message (e.g. text input from a chat sidebar alongside voice)
+agent.send_inject_user_message(
+    AgentV1InjectUserMessage(content="Schedule a follow-up for next Tuesday at 2pm.")
+)
+# Server may reply with `InjectionRefused` if the agent is mid-utterance — retry after `AgentAudioDone`.
+
+# 6. Idle-period keep-alive (no payload required; the SDK fills in the type literal)
+agent.send_keep_alive(AgentV1KeepAlive())
+# Or simply: agent.send_keep_alive() — the message arg is optional.
+```
+
+Async client equivalents are identical but `await`-prefixed:
+
+```python
+await agent.send_update_prompt(AgentV1UpdatePrompt(prompt="..."))
+await agent.send_inject_agent_message(AgentV1InjectAgentMessage(message="..."))
+```
+
+## Stream lifecycle & recovery
+
+Continuous voice agents need explicit handling for idle periods, stream pauses, and reconnects.
+
+**Pause / idle (no audio for several seconds):** stop calling `send_media`, but emit a `KeepAlive` every ~5 seconds. Without it, the server closes the socket at ~10 seconds of idle.
+
+```python
+import threading
+
+stop = threading.Event()
+
+def keepalive_loop():
+    while not stop.is_set():
+        if stop.wait(5):
+            return
+        try:
+            agent.send_keep_alive()
+        except Exception:
+            return  # socket closed; outer loop will reconnect
+
+threading.Thread(target=keepalive_loop, daemon=True).start()
+```
+
+**Resume after pause:** just call `send_media` again. No control message is required — the agent picks up VAD on the next chunk.
+
+**Reconnect after disconnect (preserve conversation context):** `Settings` cannot be re-sent on the same closed socket; open a new connection and resend the same `Settings`. To carry conversation history forward, include it in the new `Settings.agent.context.messages` so the LLM resumes with prior turns:
+
+```python
+from deepgram.agent.v1.types import (
+    AgentV1SettingsAgentContext,
+    AgentV1SettingsAgentContextMessagesItem,
+    AgentV1SettingsAgentContextMessagesItemContent,
+    AgentV1SettingsAgentContextMessagesItemContentRole,
+)
+
+# Build the new Settings with the captured prior turns
+context = AgentV1SettingsAgentContext(
+    messages=[
+        AgentV1SettingsAgentContextMessagesItem(
+            content=AgentV1SettingsAgentContextMessagesItemContent(
+                role=AgentV1SettingsAgentContextMessagesItemContentRole.USER,
+                content="Hi, I'd like to schedule a meeting.",
+            ),
+        ),
+        AgentV1SettingsAgentContextMessagesItem(
+            content=AgentV1SettingsAgentContextMessagesItemContent(
+                role=AgentV1SettingsAgentContextMessagesItemContentRole.ASSISTANT,
+                content="Sure — what day works best?",
+            ),
+        ),
+    ],
+)
+new_settings = settings.model_copy(update={"agent": settings.agent.model_copy(update={"context": context})})
+
+# Open a fresh connection and replay
+with client.agent.v1.connect() as agent2:
+    agent2.send_settings(new_settings)
+    # ... same handlers + audio loop as before
+```
+
+The server emits a `History` event (type `agent_v1history`) on connect when the SDK has captured prior turns; persist these in your application so a reconnect can rebuild `context.messages`.
+
+**Detect disconnects:** the `EventType.CLOSE` handler fires before the `with` block exits. Catch it and trigger your reconnect logic from there. Check `EventType.ERROR` payloads for cause (network drop vs server-initiated close vs warning).
+
 ## API reference (layered)
 
 1. **In-repo reference**: `reference.md` — "Agent V1 Connect", "Voice Agent Configurations".

From b83e4736d3da951d4285e57db79ba3615acd22cc Mon Sep 17 00:00:00 2001
From: lukeocodes
Date: Mon, 27 Apr 2026 16:16:07 +0100
Subject: [PATCH 2/2] docs: fix Python closure scope and clarify History wire
 type (Copilot feedback)

Both Copilot threads on PR #699:

- deepgram-python-speech-to-text/SKILL.md interim-results snippet used
  `global last_interim_len`. That only works while the snippet runs at
  module scope; pasted inside any function (the common case), `global`
  points at a module-level name that is never defined, so the first read
  raises NameError. Replaced with a mutable closure (`state = {...}`
  dict), which is the idiomatic pattern when a callback needs to mutate
  state inside a `with` block.

- deepgram-python-voice-agent/SKILL.md said the server emits a 'History
  event (type agent_v1history)'. `agent_v1history` is the internal Python
  module/file name, not the wire `type` literal. The wire `type` is
  `"History"` and the Python class is `AgentV1History`. Reworded so
  readers don't pattern-match on the wrong identifier.
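For reference, a minimal standalone illustration of the mutable-closure
pattern (plain Python, no SDK involved; names are illustrative):

```python
def run() -> None:
    state = {"last_interim_len": 0}  # mutable container captured by the closure

    def on_message(n: int) -> None:
        # Reads and writes go through the dict, so no `global`/`nonlocal`
        # declaration is needed, and the same code works at module scope
        # or inside a function.
        state["last_interim_len"] = max(state["last_interim_len"], n)

    on_message(3)
    on_message(2)
    assert state["last_interim_len"] == 3  # closure mutated the shared state

run()
```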
---
 .../skills/deepgram-python-speech-to-text/SKILL.md | 12 ++++++------
 .agents/skills/deepgram-python-voice-agent/SKILL.md |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/.agents/skills/deepgram-python-speech-to-text/SKILL.md b/.agents/skills/deepgram-python-speech-to-text/SKILL.md
index 0314c134..17e25853 100644
--- a/.agents/skills/deepgram-python-speech-to-text/SKILL.md
+++ b/.agents/skills/deepgram-python-speech-to-text/SKILL.md
@@ -70,8 +70,6 @@ from deepgram.listen.v1.types import (
     ListenV1SpeechStarted,
     ListenV1UtteranceEnd,
 )
-last_interim_len = 0  # length of the last interim line (for clean overwrite)
-
 with client.listen.v1.connect(
     model="nova-3",
     interim_results=True,   # ← emit partial results while user is still speaking
@@ -79,22 +77,24 @@
     vad_events=True,        # SpeechStarted events
     smart_format=True,
 ) as conn:
+    # Mutable container so the on_message closure can update state without `global`
+    state = {"last_interim_len": 0}
+
     def on_message(m):
-        global last_interim_len
         if isinstance(m, ListenV1Results) and m.channel and m.channel.alternatives:
             transcript = m.channel.alternatives[0].transcript
             if not transcript:
                 return
             if m.is_final:
                 # Final segment: overwrite the running interim line, newline if utterance ended
-                pad = " " * max(0, last_interim_len - len(transcript))
+                pad = " " * max(0, state["last_interim_len"] - len(transcript))
                 end = "\n" if m.speech_final else ""
                 print(f"\r{transcript}{pad}", end=end, flush=True)
-                last_interim_len = 0
+                state["last_interim_len"] = 0
             else:
                 # Interim: keep overwriting the same console line as the user speaks
                 print(f"\r{transcript}", end="", flush=True)
-                last_interim_len = len(transcript)
+                state["last_interim_len"] = len(transcript)
         elif isinstance(m, ListenV1UtteranceEnd):
             print()  # newline; UtteranceEnd fires after final results when audio goes silent
         elif isinstance(m, ListenV1SpeechStarted):

diff --git a/.agents/skills/deepgram-python-voice-agent/SKILL.md b/.agents/skills/deepgram-python-voice-agent/SKILL.md
index 6c4326c2..bdcf64cb 100644
--- a/.agents/skills/deepgram-python-voice-agent/SKILL.md
+++ b/.agents/skills/deepgram-python-voice-agent/SKILL.md
@@ -259,7 +259,7 @@ with client.agent.v1.connect() as agent2:
     # ... same handlers + audio loop as before
 ```
 
-The server emits a `History` event (type `agent_v1history`) on connect when the SDK has captured prior turns; persist these in your application so a reconnect can rebuild `context.messages`.
+The server emits a `History` message on connect when the SDK has captured prior turns; in Python you receive this as an `AgentV1History` object (wire `type` literal: `"History"`). Persist these turns in your application so a reconnect can rebuild `context.messages`.
 
 **Detect disconnects:** the `EventType.CLOSE` handler fires before the `with` block exits. Catch it and trigger your reconnect logic from there. Check `EventType.ERROR` payloads for cause (network drop vs server-initiated close vs warning).