feat: add Pulse STT support for smallest.ai pulse (streaming + pre-recorded) #4858
mahimairaja wants to merge 24 commits into livekit:main from

Conversation
Tested pre-recorded:

```python
import asyncio
from pathlib import Path

import aiohttp
from dotenv import load_dotenv

from livekit.agents import utils
from livekit.plugins import smallestai

load_dotenv()


async def main():
    wav = Path(__file__).resolve().parent / "sample.wav"
    async with aiohttp.ClientSession() as session:
        stt = smallestai.STT(language="en", http_session=session)
        frames = [
            f
            async for f in utils.audio.audio_frames_from_file(
                str(wav), sample_rate=16000, num_channels=1
            )
        ]
        event = await stt.recognize(frames)
        print(event.alternatives[0].text if event.alternatives else "")


if __name__ == "__main__":
    asyncio.run(main())
```
Testing streaming:

```python
from dotenv import load_dotenv

from livekit import agents
from livekit.agents import Agent, AgentServer, AgentSession, room_io
from livekit.plugins import silero
from livekit.plugins.openai.llm import LLM
from livekit.plugins.smallestai.stt import STT
from livekit.plugins.smallestai.tts import TTS
from livekit.plugins.turn_detector.english import EnglishModel

load_dotenv()


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="""You are a helpful voice AI assistant.""",
        )


server = AgentServer()


@server.rtc_session(agent_name="my-agent")
async def my_agent(ctx: agents.JobContext):
    session = AgentSession(
        stt=STT(),
        llm=LLM(model="gpt-4.1-mini"),
        tts=TTS(),
        vad=silero.VAD.load(),
        turn_detection=EnglishModel(),
    )

    await session.start(
        room=ctx.room,
        agent=Assistant(),
        room_options=room_io.RoomOptions(),
    )

    await session.generate_reply(instructions="Greet the user and offer your assistance.")


if __name__ == "__main__":
    agents.cli.run_app(server)
```
After conversations with @harshitajain165 from smallest.ai, I learned that a few more steps are needed on the smallest.ai server side for streaming support. For now, I am moving this PR to draft.
```python
if is_given(encoding):
    self._opts.encoding = encoding
if is_given(word_timestamps):
    self._opts.word_timestamps = word_timestamps
if is_given(diarize):
    self._opts.diarize = diarize

for stream in self._streams:
    stream.update_options(
        language=language,
        sample_rate=sample_rate,
        encoding=encoding,
        word_timestamps=word_timestamps,
        diarize=diarize,
    )
```
🟡 STT.update_options modifies own state before stream validation, corrupting state on failure
STT.update_options at line 249 sets self._opts.encoding = encoding before propagating to streams at line 256. When the stream's update_options calls _validate_stream_encoding(encoding) at stt.py:315, any non-"linear16" encoding (e.g., "mulaw", "opus") raises ValueError. At that point, the STT's _opts.encoding is already corrupted to the invalid value. This has two consequences: (1) subsequent calls to self.stream() (stt.py:224) will also fail because _validate_stream_encoding(config.encoding) uses the now-invalid STT encoding, and (2) if multiple streams exist, only the first stream's update_options is attempted — later streams are never updated. The STT is left in a broken state where streaming is permanently disabled until the user manually resets with update_options(encoding="linear16").
Prompt for agents

In livekit-plugins/livekit-plugins-smallestai/livekit/plugins/smallestai/stt.py, the `STT.update_options` method (lines 235-262) should validate the encoding before modifying `self._opts`. Add a call to `_validate_stream_encoding(encoding)` at the top of the method (or at least before line 249 where `self._opts.encoding` is set) when encoding is given and there are active streams. For example, after line 243 add:

```python
if is_given(encoding) and self._streams:
    _validate_stream_encoding(encoding)
```

This ensures the STT state is not modified if the validation will fail when propagated to streams.
What does this PR do?

Adds Speech-to-Text (STT) support to the `livekit-plugins-smallestai` plugin using Smallest AI's Pulse STT API. The existing plugin only supported TTS; this PR brings it to parity with plugins like Deepgram, ElevenLabs, and Soniox that offer both TTS and STT.

Closes #4856
Summary of Changes

New: `STT` class (`stt.py`)
- Pre-recorded transcription via REST (`/api/v1/pulse/get_text`)
- Streaming via WebSocket (`wss://waves-api.smallest.ai/api/v1/pulse/get_text`)
- Multi-language support (`language="multi"`)
- `streaming=True`, `interim_results=True`

New: `SpeechStream` class (`stt.py`)
- Buffers audio via `AudioByteStream` (~4096-byte chunks per Smallest AI docs)
- Emits `START_OF_SPEECH` → `INTERIM_TRANSCRIPT`/`FINAL_TRANSCRIPT` → `END_OF_SPEECH`
- End-of-audio signaling with `{"type": "end"}`

Usage
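As an aside, the ~4096-byte buffering that `SpeechStream` relies on can be illustrated with a standalone re-chunking sketch. The chunk size and helper below are illustrative only, not the plugin's actual code:

```python
# Standalone illustration of fixed-size audio re-chunking, similar in
# spirit to AudioByteStream's ~4096-byte buffering. Not the plugin's code.

CHUNK_SIZE = 4096  # bytes, per the Smallest AI docs referenced above


def rechunk(buffer: bytearray, data: bytes, chunk_size: int = CHUNK_SIZE):
    """Append incoming bytes and yield full chunks; leftovers stay buffered."""
    buffer.extend(data)
    while len(buffer) >= chunk_size:
        yield bytes(buffer[:chunk_size])
        del buffer[:chunk_size]


buf = bytearray()
chunks = []
# Feed irregular frame sizes, as an audio source would deliver them.
for frame in (b"\x00" * 3000, b"\x00" * 3000, b"\x00" * 3000):
    chunks.extend(rechunk(buf, frame))

# 9000 input bytes -> two full 4096-byte chunks, 808 bytes left buffered
print(len(chunks), len(buf))
```

Keeping the remainder in the buffer (rather than padding or dropping it) means the next incoming frame completes the partial chunk, which matters for continuous streaming.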
Configuration via the `SMALLEST_API_KEY` environment variable (same key used for TTS).

Testing
- `language="en"` and `language="multi"` (auto-detection)

API Reference