Remote Voice Conversations with Your Coding Agent

Picture this: Claude is mid-refactor, you step away to make coffee, and your phone buzzes. You ask "Are we done?" and hear it read back the task status. You say "run the tests" and a minute later it tells you three passed, one failed. You never touched your laptop.

The co-author of pipecat, Aleix Conchillo, built a Pipecat MCP Server over the weekend that makes this possible. It bridges any MCP-compatible coding agent (Claude Code, Cursor, Codex, and so on) to a pipecat voice pipeline over WebRTC. Your agent gets ears and a mouth, and it can share the screen too, so you can review file diffs, confirm changes, and see what is on your display. An idle agent feels like a waste, and now it doesn't have to sit idle.

The MCP server exposes listen, speak, stop, list_windows, screen_capture, and capture_screenshot. That last pair is worth dwelling on: the agent can see your screen. You can say "show me the terminal" and it will start capturing that window, run the frames through the vision pipeline, and stream the result to your WebRTC session. Voice and vision together make the session feel fly-by-wire, as if you were at your desk.

The Pipecat SKILL adds guardrails on top. It asks for verbal confirmation before making changes to files — an extra layer of safety when running a coding agent with enhanced privileges (think Claude with --dangerously-skip-permissions). You hear "I'm about to modify server.ts, shall I proceed?" before anything changes.
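
The confirmation step can be pictured as a small gate in front of any file write. The sketch below is an illustration only, not the skill's actual implementation: confirm_by_voice, the AFFIRMATIVE word list, and the stand-in speak/listen callables are all hypothetical names. It denies by default, so anything other than a clear "yes" blocks the change.

```python
from typing import Callable

# Hypothetical set of replies treated as approval; everything else is a "no".
AFFIRMATIVE = {"yes", "yeah", "yep", "proceed", "go ahead", "do it"}


def confirm_by_voice(
    path: str,
    speak: Callable[[str], None],
    listen: Callable[[], str],
) -> bool:
    """Ask for spoken confirmation before modifying `path`.

    `speak` and `listen` stand in for the MCP tools of the same name.
    Returns True only when the reply is clearly affirmative.
    """
    speak(f"I'm about to modify {path}, shall I proceed?")
    # Normalize the transcription: trim, lowercase, drop trailing punctuation.
    reply = listen().strip().lower().rstrip(".!?")
    return reply in AFFIRMATIVE


# Usage with stand-in callables (no audio involved).
spoken = []
approved = confirm_by_voice("server.ts", spoken.append, lambda: "Yes.")
denied = confirm_by_voice("server.ts", spoken.append, lambda: "No, wait.")
```

Here approved is True and denied is False. An ambiguous or misheard reply falls on the safe side.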

How It Works

The MCP server spawns a child process running the pipecat pipeline. Everything runs locally on your machine: RNNoiseFilter for background noise suppression, SileroVAD for voice activity detection, SmartTurnAnalyzerV3 for turn-taking, MLX/Fast Whisper for speech-to-text, and MLX Kokoro TTS for speech synthesis. All components are open source with open weights.

MCP Client (Claude Code, Cursor, etc.)
    │
    ▼
MCP Server (parent process) ◄──► Pipecat Agent (child process)
    │                                  │
    ▼                                  ▼
Handles tool calls              Voice + vision pipeline:
via HTTP at :9090/mcp           Audio → STT → TTS → Audio
                                Screen → Vision → Image files
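
To make the left-hand side of the diagram concrete, here is a minimal sketch of what a tool call could look like on the wire. MCP messages are JSON-RPC 2.0, and tools are invoked with the tools/call method. The port and path come from the diagram; the localhost host, the tool_call_request helper, and the "text" argument name (inferred from speak(text)) are assumptions, so check the server's published tool schema before relying on them.

```python
import json

# Port and path are from the diagram above; the host is an assumption.
MCP_URL = "http://localhost:9090/mcp"


def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": tool, "arguments": arguments},
        }
    )


# Hypothetical call: ask the agent to say a sentence out loud.
body = tool_call_request(1, "speak", {"text": "Three passed, one failed."})
```

A real client performs the MCP initialize handshake before sending tool calls. In practice Claude Code or Cursor handles all of this for you, so the sketch only shows the shape of the message.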

Two calls do the heavy lifting. listen() blocks until you finish speaking: Silero VAD detects 0.2s of silence, SmartTurn confirms the utterance is complete, and the transcription returns to the MCP client. speak(text) queues text for TTS and returns immediately. VAD keeps running during playback, so you can interrupt the agent mid-sentence. That detail matters: without it, you'd have to wait for the agent to finish talking before you could correct it. If you work with pipecat, you will recognize these as its basic interruption and mute strategies.
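
The silence threshold is easy to reason about with a toy model. The sketch below is not Silero VAD or SmartTurn: it assumes some upstream detector has already labeled each audio frame as speech or not, and it only shows how a 0.2s silence window turns those labels into an end-of-speech candidate. The 20 ms frame size and the end_of_speech_index name are assumptions for illustration.

```python
from typing import Iterable, Optional


def end_of_speech_index(
    frames: Iterable[bool],
    frame_ms: int = 20,   # assumed frame duration
    stop_ms: int = 200,   # the 0.2s of silence mentioned above
) -> Optional[int]:
    """Return the index of the frame where trailing silence first reaches
    `stop_ms` after speech has started, or None if that never happens."""
    needed = stop_ms // frame_ms  # consecutive silent frames required
    started = False
    silent_run = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            started = True
            silent_run = 0        # any speech resets the silence counter
        elif started:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


# Leading silence, speech, a short pause (60 ms), more speech, then silence.
frames = [False] * 5 + [True] * 10 + [False] * 3 + [True] * 4 + [False] * 12
idx = end_of_speech_index(frames)
```

The short 60 ms pause does not end the turn, and idx lands on frame 31, the tenth silent frame after the last speech frame. In the real pipeline, this is only the point at which SmartTurn gets asked whether the utterance is actually complete.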

// Pipecat Pipeline
                    ┌─── Main branch ───────────────┐
Transport (In)      │ Whisper → User Agg. → Kokoro  │
│                   │                               │
│                   │                               │
├─► ScreenCap ──► ParallelPipeline                  ├─► Assist. Agg. → Transport (Out)
                    │                               │
                    └─── Vision branch ─────────────┘
                VisionProcessor (saves frames on demand)

It's early days, but the project has evolved rapidly. Aleix quickly added support for local models alongside the cloud-hosted ones. You can also swap the SimpleWebRTC for DailyWebRTC if you run into restrictive firewalls. Fast Whisper's accuracy can be hit or miss depending on your accent, but you can probably swap in Voxtral soon. Running everything locally means you can replace models as better ones appear.

Today, coding agents keep you tethered to your terminal. You sit, you type, you watch. In some cases, you can teleport to a cloud sandbox. Pipecat MCP Server breaks those constraints. The agent keeps working while you're away, and you stay in the loop.

The full source is at pipecat-mcp-server.