A multimodal voice assistant with vision capabilities that can see and discuss what users show through their camera using LiveKit's voice agents.
VisionAgent - A voice-enabled AI assistant that combines speech interaction with computer vision, allowing users to show objects, documents, or scenes through their camera and have natural conversations about what the agent sees.
- Computer Vision Integration: Processes video frames from user's camera in real-time
- Multimodal Conversation: Combines visual context with voice interaction
- Automatic Frame Capture: Continuously buffers the latest video frame and captures it when the user speaks
- Multi-Track Support: Handles video streams from remote participants
- Voice-Enabled: Built using LiveKit's voice capabilities with support for:
  - Speech-to-Text (STT) using Deepgram
  - Large Language Model (LLM) using X.AI's Grok-2-Vision model
  - Text-to-Speech (TTS) using Rime
  - Voice Activity Detection (VAD) using Silero
- Modern Web Interface: Next.js frontend with video sharing capabilities
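The STT → LLM → TTS flow listed above can be sketched as a plain-Python stub. The class and stage names here are illustrative stand-ins for the Deepgram, Grok, Rime, and Silero integrations, not LiveKit's actual API:

```python
class VoicePipeline:
    """Illustrative STT -> LLM -> TTS loop; each stage is any callable."""

    def __init__(self, stt, llm, tts, vad=None):
        self.stt = stt  # speech-to-text, e.g. Deepgram in the real agent
        self.llm = llm  # multimodal LLM, e.g. Grok-2-Vision
        self.tts = tts  # text-to-speech, e.g. Rime
        self.vad = vad  # voice activity detection, e.g. Silero

    def handle_utterance(self, audio):
        """Run one user utterance through the full pipeline."""
        if self.vad is not None and not self.vad(audio):
            return None  # no speech detected, nothing to do
        transcript = self.stt(audio)
        reply_text = self.llm(transcript)
        return self.tts(reply_text)
```

In the real agent each stage is streaming and asynchronous; the stub only shows how the components compose.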
- User connects to the LiveKit room through the web interface
- User enables their camera to share video with the agent
- The agent subscribes to the user's video track automatically
- When the user finishes speaking, the agent captures the most recently buffered video frame
- The captured frame is added to the conversation context along with the transcribed speech
- Grok-2-Vision processes both the visual and textual input
- The agent responds with voice, able to describe and discuss what it sees
- Users can show different objects or scenes and ask questions about them
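The capture-on-speech steps above can be illustrated with a small stand-alone sketch. The class and method names are hypothetical, not LiveKit's API:

```python
from dataclasses import dataclass, field


@dataclass
class ChatTurn:
    role: str
    text: str
    images: list = field(default_factory=list)


class VisionContext:
    """Keep only the newest frame; attach it when the user finishes a turn."""

    def __init__(self):
        self._latest_frame = None
        self.history = []

    def on_video_frame(self, frame):
        # newer frames overwrite older ones; only the latest is kept
        self._latest_frame = frame

    def on_user_turn(self, transcript):
        turn = ChatTurn(role="user", text=transcript)
        if self._latest_frame is not None:
            turn.images.append(self._latest_frame)
        self.history.append(turn)
        return turn
```

Because only the newest frame is kept, each user utterance is paired with whatever the camera saw at that moment.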
- Python 3.10+
- livekit-agents>=1.0
- A LiveKit account and credentials
- API keys for:
- X.AI (for Grok-2-Vision model access)
- Deepgram (for speech-to-text)
- Rime (for text-to-speech)
- Node.js and pnpm (for the frontend)
1. Clone the repository.

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Create a `.env` file in the parent directory with your API credentials:

   ```
   LIVEKIT_URL=your_livekit_url
   LIVEKIT_API_KEY=your_api_key
   LIVEKIT_API_SECRET=your_api_secret
   XAI_API_KEY=your_xai_key
   DEEPGRAM_API_KEY=your_deepgram_key
   RIME_API_KEY=your_rime_key
   ```

4. Start the agent:

   ```bash
   python agent.py dev
   ```

5. In a separate terminal, navigate to the frontend directory and start the Next.js app:

   ```bash
   cd agent-vision-frontend
   pnpm install
   pnpm dev
   ```
The application will be available at http://localhost:3000. Enable your camera when prompted to start showing things to the agent.
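Before starting the agent, a quick sanity check can confirm that all required credentials are set. This helper is hypothetical, not part of the repo:

```python
import os

REQUIRED_VARS = [
    "LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET",
    "XAI_API_KEY", "DEEPGRAM_API_KEY", "RIME_API_KEY",
]


def missing_vars(env=None):
    """Return the names of required credentials that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Running `missing_vars()` after loading your `.env` file should return an empty list; anything it reports needs to be added before the agent can connect.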
- VisionAgent: Core agent class that handles both voice and vision inputs
- Video Stream Management: Automatically subscribes to video tracks from participants
- Frame Buffering: Stores the latest video frame for processing when user speaks
- User's video track is detected when they join or publish video
- Agent creates a VideoStream to receive frames
- The latest frame is continuously buffered as new frames arrive
- When user completes their turn (stops speaking), the current frame is captured
- Frame is added as ImageContent to the chat message
- Grok-2-Vision processes the multimodal input (text + image)
- Agent generates a response based on both visual and conversational context
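The stream-consumption steps above, where a one-slot buffer is continuously overwritten, can be sketched with asyncio. The stream here is a stand-in for LiveKit's VideoStream, not the real type:

```python
import asyncio


class LatestFrame:
    """One-slot buffer: always holds the most recently seen frame."""

    def __init__(self):
        self.frame = None


async def buffer_frames(stream, buf):
    # overwrite on every frame so a capture always gets the newest one
    async for frame in stream:
        buf.frame = frame


async def fake_stream(n):
    # stand-in for an incoming video track
    for i in range(n):
        yield f"frame-{i}"
        await asyncio.sleep(0)
```

A one-slot buffer avoids unbounded memory growth: frames arrive far faster than user turns, and only the newest frame is ever needed.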
- Video input support with camera selection
- Screen sharing capabilities
- Chat interface for text input (optional)
- Real-time transcription display
- Modern, responsive UI with dark mode support
The agent maintains conversation context that includes:
- User's spoken/typed messages
- Captured video frames at the moment of each user utterance
- Agent's responses
- Full conversation history with visual context
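One plausible way to serialize that context for a multimodal LLM request is OpenAI-style content parts. This is a sketch; the actual wire format depends on the provider and the LiveKit plugin in use:

```python
def to_llm_messages(history):
    """Convert (role, text, images) history entries into content-part messages."""
    messages = []
    for role, text, images in history:
        parts = [{"type": "text", "text": text}]
        for url in images:
            parts.append({"type": "image_url", "image_url": {"url": url}})
        messages.append({"role": role, "content": parts})
    return messages
```

Each user turn thus becomes a single message carrying both the transcript and the frame captured at that moment.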
- Change Vision Model: Replace Grok-2-Vision with other multimodal LLMs like GPT-4o or Claude 3
- Modify Frame Capture Logic: Adjust when frames are captured (e.g., continuous vs. on-demand)
- Add Visual Analysis Tools: Integrate specialized vision APIs for OCR, object detection, etc.
- Enhance Agent Instructions: Update the prompt to specialize in specific visual tasks
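For the frame-capture customization above, the capture decision could be factored out into a small predicate. The policy names and parameters here are hypothetical:

```python
def should_capture(policy, *, turn_ended=False, frame_index=0, every_n=30):
    """Decide whether the current video frame should be captured."""
    if policy == "on_turn_end":
        # default behavior: grab one frame when the user stops speaking
        return turn_ended
    if policy == "continuous":
        # sample every Nth frame regardless of speech
        return frame_index % every_n == 0
    raise ValueError(f"unknown capture policy: {policy}")
```

Isolating the decision like this lets you switch between on-demand and continuous capture without touching the stream-handling code.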