Voxtral Realtime STT: segmented vs. streaming

Mistral released Voxtral Realtime Mini in February 2026 — a 4B-parameter streaming STT model with a causal encoder. The benchmarks and early demos looked encouraging, but I was waiting for an MLX port before I could test it on-device.

Awni Hannun built exactly that with voxmlx. Meanwhile, Aleix had built the pipecat-mcp-server, which already uses Whisper MLX and Kokoro for on-device voice conversations (I've written about both in earlier TILs). Marrying Voxtral with the MCP server was the obvious next step.

Architecture

MLX Whisper (distilled whisper-large-v3-turbo) uses a bidirectional encoder, so it needs the full utterance before it can transcribe. The encoder sees all audio frames at once, which gives it maximum context but makes it inherently batch/segmented: VAD (Voice Activity Detection) detects silence, the complete audio chunk gets encoded, then decoded. Voilà, the transcribed sentence. In my sample of conversations, it takes ~300 ms from end-of-speech to final transcription (measured in pipecat as the gap between the UserStoppedSpeaking and TranscriptionFrame timestamps).
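A minimal sketch of how that end-of-speech-to-transcript number can be derived from a log of timestamped events. The event names mirror the pipecat frames mentioned above, but the `(timestamp_ms, name)` tuple format, the function name, and the sample values are assumptions made up for illustration, not pipecat's logging format or my measured data.

```python
from statistics import median

def end_of_turn_latencies_ms(events):
    """Pair each UserStoppedSpeaking with the next TranscriptionFrame.

    `events` is a hypothetical list of (timestamp_ms, event_name) tuples,
    sorted by time.
    """
    latencies = []
    stopped_at = None
    for ts_ms, name in events:
        if name == "UserStoppedSpeaking":
            stopped_at = ts_ms
        elif name == "TranscriptionFrame" and stopped_at is not None:
            latencies.append(ts_ms - stopped_at)
            stopped_at = None  # wait for the next end-of-speech
    return latencies

# Made-up sample log, only to show the shape of the input.
events = [
    (10_000, "UserStoppedSpeaking"),
    (10_300, "TranscriptionFrame"),
    (21_500, "UserStoppedSpeaking"),
    (21_790, "TranscriptionFrame"),
]

latencies = end_of_turn_latencies_ms(events)
print(latencies, "median:", median(latencies))
```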

Voxtral Realtime uses a causal encoder: the convolution and transformer layers only attend to past frames. In streaming mode, this means you can feed audio incrementally via encode_step() and get encoder embeddings out without waiting for the utterance to end.
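The difference between the two encoders can be pictured as attention masks. This is a toy illustration in plain Python, not either model's actual implementation; the function names are hypothetical, and real encoders may add details (such as windowing) that are ignored here.

```python
def attention_mask(n_frames: int, causal: bool) -> list[list[int]]:
    """Build a toy mask where mask[i][j] == 1 means frame i may attend to frame j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(n_frames)]
        for i in range(n_frames)
    ]

def visible_frames(mask: list[list[int]]) -> list[int]:
    """Number of frames each position can see."""
    return [sum(row) for row in mask]

bidirectional = attention_mask(4, causal=False)  # Whisper-style: everything sees everything
causal = attention_mask(4, causal=True)          # Voxtral-style: only the past is visible

for row in causal:
    print(row)
print("bidirectional:", visible_frames(bidirectional))
print("causal:       ", visible_frames(causal))
```

In the causal mask the first frame only ever sees itself, no matter how much audio follows, which is why the encoder output for a frame can be computed as soon as that frame arrives.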

The key parameter is delay_ms (multiples of 80 ms, since each encoder token covers 80 ms of audio). It controls how far behind the decoder runs relative to the encoder. At 480 ms, the decoder lags 6 tokens behind, giving the encoder time to process more frames before decoding begins. At 160 ms, the lag is just 2 tokens. This is the fundamental latency/accuracy knob: more lag means the encoder has built up more context by the time the decoder needs it. Calling this a delay is perhaps a misnomer; it is more like a context buffer. While the user is still speaking, the lag is not a user-visible delay, because partial text is not useful here: we do not push the text to the LLM until the utterance is complete.
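The arithmetic behind the token lag, as a small sketch. The helper name `delay_to_token_lag` is hypothetical and not part of voxmlx; only the 80 ms token size and the 160/480 ms examples come from the text above.

```python
TOKEN_MS = 80  # each encoder token covers 80 ms of audio

def delay_to_token_lag(delay_ms: int) -> int:
    """Convert a delay in milliseconds to the number of tokens the decoder lags."""
    if delay_ms <= 0 or delay_ms % TOKEN_MS != 0:
        raise ValueError(f"delay_ms must be a positive multiple of {TOKEN_MS}")
    return delay_ms // TOKEN_MS

for delay_ms in (160, 480):
    print(f"{delay_ms} ms -> {delay_to_token_lag(delay_ms)} tokens")
```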

"Full context" in Whisper means bidirectional attention over all frames. "Full utterance" in Voxtral means all audio is present, but attention is still one-directional. The distinction matters because even when Voxtral segmented sees the whole utterance, early frames do not benefit from later frames the way they do in Whisper.

Segmented vs. Streaming with the same model

Even with Voxtral's causal encoder, you can run it in two modes:

Segmented buffers the full utterance, then runs the complete encode-then-decode pass. The model still only uses causal attention (no bidirectional context), but it processes all frames in one shot. We measured ~300 ms from end-of-speech to final transcription at 480 ms delay.

Streaming feeds audio to encode_step() as transport packets arrive. The packet time (ptime) can be 10 ms or 20 ms, so it takes 4–8 packets to make up one 80 ms audio token. The prefill happens once enough audio covers the prompt prefix, then incremental decoding emits tokens during speech. We measured ~160 ms from end-of-speech to final transcription because most encoding and decoding has already happened by the time the user stops talking.
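A rough sketch of the buffering step: collecting transport packets until one 80 ms token's worth of audio is available. The `TokenChunker` class is hypothetical, and the 16 kHz, 16-bit mono PCM format is an assumption for the example. The actual encode_step() call is left out because its signature is not shown here.

```python
TOKEN_MS = 80  # one encoder token covers 80 ms of audio

class TokenChunker:
    """Buffers incoming packets and emits fixed 80 ms chunks of PCM bytes."""

    def __init__(self, sample_rate: int = 16000, token_ms: int = TOKEN_MS):
        # Assumption: 16-bit mono PCM, so 2 bytes per sample.
        self.bytes_per_token = sample_rate * token_ms // 1000 * 2
        self._buf = bytearray()

    def push(self, packet: bytes) -> list[bytes]:
        """Add one packet; return zero or more complete 80 ms chunks."""
        self._buf.extend(packet)
        chunks = []
        while len(self._buf) >= self.bytes_per_token:
            chunks.append(bytes(self._buf[: self.bytes_per_token]))
            del self._buf[: self.bytes_per_token]
        return chunks

chunker = TokenChunker()
packet_20ms = bytes(16000 * 20 // 1000 * 2)  # 20 ms of silence = 640 bytes
ready_counts = [len(chunker.push(packet_20ms)) for _ in range(4)]
print(ready_counts)  # a chunk becomes ready on the 4th 20 ms packet
```

With 10 ms packets the same chunker would emit a chunk on every 8th packet instead.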

The latency win comes from overlapping compute with speech. In segmented mode, all compute happens after silence is detected. In streaming mode, only the right-pad flush and final decode steps remain. This difference alone accounts for the ~140 ms latency win between streaming and segmented modes.
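To make the overlap concrete, here is a deliberately simplified counting model. It assumes the streaming pipeline keeps up with real time, so that only roughly the delay's worth of tokens is still pending when speech ends; the right-pad flush is ignored. The function name and this model are my own simplification, not something taken from voxmlx.

```python
import math

TOKEN_MS = 80  # one encoder token covers 80 ms of audio

def tokens_left_after_speech(utterance_ms: int, delay_ms: int, streaming: bool) -> int:
    """Toy count of tokens still to be processed once the user stops speaking."""
    n_tokens = math.ceil(utterance_ms / TOKEN_MS)
    if not streaming:
        # Segmented: nothing was processed during speech.
        return n_tokens
    # Streaming: the decoder trails the encoder by the configured lag.
    return min(delay_ms // TOKEN_MS, n_tokens)

for utterance_ms in (2000, 4000):
    seg = tokens_left_after_speech(utterance_ms, 480, streaming=False)
    stream = tokens_left_after_speech(utterance_ms, 480, streaming=True)
    print(f"{utterance_ms} ms utterance: segmented={seg} tokens, streaming={stream} tokens")
```

In this model the segmented backlog grows with utterance length, while the streaming backlog stays bounded by the delay.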

To summarise, it is not "streaming is better" but a three-way trade-off:

| | Whisper (MLX) | Voxtral segmented | Voxtral streaming |
|---|---|---|---|
| Encoder | Bidirectional | Causal | Causal (incremental) |
| Transcription starts | After speech ends | After speech ends | During speech |
| End-of-turn to transcript | ~300 ms | ~300 ms | ~160 ms |
| Accuracy | Highest (full context) | Good (causal, full utterance) | Delay-dependent (480 ms good, 160 ms noisy) |
| Compute pattern | Burst after silence | Burst after silence | Continuous during speech |
| Memory | Temp WAV file | Temp WAV file | KV caches for encoder + decoder (needs `mx.clear_cache()`) |

Whisper MLX does zero work during speech, then a short compute burst when the user stops speaking. The full transcription typically completes in ~300 ms. Whisper feels fast despite being batch-only because it is a distilled model optimised for MLX. Voxtral streaming takes the opposite approach: it spreads compute across the entire speech duration, so there is less left to do when the user stops. Both land in the 160–300 ms range from end-of-turn to transcript, but for different reasons.

Next I want to try antirez's voxtral.c, a pure-C implementation that avoids the Python/MLX overhead entirely. If the latency numbers hold up, swapping the backend in the MCP server could shave off more time and make it viable on lower-end hardware too.

Updated (2026-02-15): I opened a PR adding both segmented and streaming Voxtral STT. More testing is needed. The whole PR was built while pair-programming via voice with Claude Code: initially with Whisper as STT, then segmented Voxtral, and finally streaming Voxtral once the latency trade-off became apparent. It took about 10–12 hours over 3 days. Still early days, but the results are promising.