Introducing Smart Turn Multimodal
2026-Jan-14
Deciding when to talk is something humans do subconsciously every day. Historically, conversational AI has focused on phone calls, limiting input to voice. Now that we speak to devices with cameras and screens, we can take advantage of multiple modalities. Voice AI has benefited from open source frameworks like Pipecat and LiveKit. Inspired by their work, we are releasing an audio-visual fork of Pipecat's Smart Turn that incorporates visual cues into real-time end-of-turn detection. This release is experimental; expect rough edges and ongoing changes.
The Problem: Silence is Ambiguous
Pure audio-based turn detection relies on prosody, grammar, and silence to decide when a user has finished speaking. But even good audio models struggle with ambiguous pauses: a 500 ms silence could mean the user is still mid-thought, or that they have finished and are waiting for a response. Without visual context, systems must fall back on conservative timeout thresholds (often >1000 ms), which hurts conversational fluidity.
Visual cues work differently. A person can hold the floor through body language — an open mouth, averted gaze, or raised hand — even in complete silence. This is the gap Smart Turn Multimodal aims to fill.
Example reel of cases where multimodal endpointing proved beneficial
Late Fusion Architecture
Smart Turn Multimodal uses a late fusion approach: audio and video are processed in separate encoder streams and merged only at the final bottleneck layer. This preserves the pre-trained Whisper encoder while letting a video encoder modulate the final decision.
The two branches are:
- Audio branch: The Whisper Tiny encoder from Smart Turn v3.2, processing 8 seconds of audio
- Video branch: An R3D-18 (3D ResNet) encoder pretrained on Kinetics-400, processing the last 32 frames (~1 second)
The video branch learns to modulate the audio prediction. If Whisper says "Ambiguous (0.5)" but the video shows a closed mouth, the fused output pushes toward "Complete (0.9)". If your camera is off or unavailable, the model gracefully falls back to audio-only inference via zero-padding — no code changes required.
(8 seconds)"] -->|Log-Mel Spec| B("Whisper Encoder") C["Video Frames
(Last 32 frames)"] -->|"Resize 112x112"| D("Video Encoder
R3D-18") end subgraph FeatureExtraction ["Feature Extraction"] B -->|Context Pooling| E["Audio Embedding
384-dim"] D -->|Linear Projection| F["Video Embedding
256-dim"] end subgraph LateFusion ["Late Fusion"] E --> G{"Concat"} F --> G G -->|"Fused Vector
640-dim"| H["Fusion Layer
(Linear + GELU)"] end subgraph Output H -->|Project back to 384| I["Classifier"] I -->|Sigmoid| J(("Turn End
Probability")) end style B fill:#e1f5fe,stroke:#01579b,stroke-width:2px style D fill:#fff3e0,stroke:#ff6f00,stroke-width:2px style H fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px style J fill:#fce4ec,stroke:#880e4f,stroke-width:2px
Smart Turn Multimodal architecture: Audio (Whisper) and Video (R3D-18) branches converge at the fusion layer.
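As a concrete illustration, here is a minimal PyTorch sketch of the late-fusion forward pass described above. It assumes hypothetical `audio_encoder` and `video_encoder` modules that produce 384-dim and 256-dim embeddings; class and attribute names are illustrative, not the repository's actual API.

```python
import torch
import torch.nn as nn

class LateFusionTurnDetector(nn.Module):
    """Sketch of the late-fusion head: audio (384-dim) and video (256-dim)
    embeddings are concatenated and projected back to 384-dim before the
    classifier. Names and layer details are illustrative."""

    def __init__(self, audio_encoder: nn.Module, video_encoder: nn.Module):
        super().__init__()
        self.audio_encoder = audio_encoder   # Whisper Tiny encoder + pooling -> (B, 384)
        self.video_encoder = video_encoder   # R3D-18 + projection            -> (B, 256)
        self.fusion = nn.Sequential(
            nn.Linear(384 + 256, 384),
            nn.LayerNorm(384),
            nn.GELU(),
        )
        self.classifier = nn.Linear(384, 1)  # logit for "turn is complete"

    def forward(self, audio, video=None):
        e_a = self.audio_encoder(audio)                      # (B, 384)
        if video is None:
            # Camera off: substitute a zero tensor so the fusion layer
            # effectively sees only the audio embedding.
            e_v = torch.zeros(e_a.size(0), 256,
                              device=e_a.device, dtype=e_a.dtype)
        else:
            e_v = self.video_encoder(video)                  # (B, 256)
        fused = self.fusion(torch.cat([e_a, e_v], dim=-1))   # (B, 384)
        return torch.sigmoid(self.classifier(fused))         # (B, 1) probability
```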
Component Details
Audio Encoder (Whisper Tiny)
The audio branch uses the encoder portion of Whisper Tiny. It converts 8 seconds of audio into an 80-band log-mel spectrogram (~400 time steps). The backbone is a Transformer encoder pre-trained on 680k hours of audio. Instead of mean pooling, we use cross-attention pooling (from Smart Turn v3.2) to extract a single 384-dim context vector representing the utterance.
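One common way to implement cross-attention pooling is a single learnable query attending over the encoder's hidden states. The sketch below shows that general technique; head count and module names are assumptions, not the exact Smart Turn code.

```python
import torch
import torch.nn as nn

class CrossAttentionPooling(nn.Module):
    """Pools (B, T, 384) Whisper encoder outputs into one (B, 384) context
    vector via a learnable query. Illustrative sketch only."""

    def __init__(self, dim: int = 384, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, encoder_out: torch.Tensor) -> torch.Tensor:
        # encoder_out: (B, T, 384) hidden states from the Whisper encoder
        q = self.query.expand(encoder_out.size(0), -1, -1)   # (B, 1, 384)
        pooled, _ = self.attn(q, encoder_out, encoder_out)   # (B, 1, 384)
        return pooled.squeeze(1)                             # (B, 384)
```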
Video Encoder (R3D-18)
The video branch uses a 3D ResNet-18 (R3D-18) pre-trained on the Kinetics-400 dataset. Unlike 2D CNNs, 3D convolutions capture spatiotemporal features, letting the model distinguish a "static open mouth" from "mouth closing."
- Input: 32 frames at 112×112 resolution (~1 second of video)
- Output: 512-dim features projected down to 256-dim for fusion
(Batch, 3, 32, 112, 112)"] -->|"3D Conv"| L1["Layer 1
Spatiotemporal Features"] L1 -->|"ResNet Blocks"| L3["R3D-18 Backbone"] L3 -->|"AvgPool3D"| L4["Raw Features
(Batch, 512)"] L4 -->|"Linear Projection"| Output["Video Embedding
(Batch, 256)"] style Input fill:#fff3e0,stroke:#ff6f00,stroke-width:2px style Output fill:#fff3e0,stroke:#ff6f00,stroke-width:2px
R3D-18 video encoder: processes spatiotemporal features (motion over time).
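A sketch of how such a video branch can be built with torchvision's pre-trained R3D-18, swapping the Kinetics classification head for a 512→256 projection. The wrapper class is illustrative; the repository's own module may differ.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

class VideoEncoder(nn.Module):
    """R3D-18 backbone (Kinetics-400 weights) with its classifier replaced
    by a linear projection into the 256-dim fusion space."""

    def __init__(self, out_dim: int = 256):
        super().__init__()
        backbone = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
        backbone.fc = nn.Identity()        # keep the 512-dim pooled features
        self.backbone = backbone
        self.proj = nn.Linear(512, out_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, 32, 112, 112) -- 32 RGB frames at 112x112
        return self.proj(self.backbone(frames))  # (B, 256)


# Example: one clip of the last 32 frames.
clip = torch.randn(1, 3, 32, 112, 112)
embedding = VideoEncoder()(clip)               # torch.Size([1, 256])
```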
The Fusion Layer
We use concatenation-based late fusion. The audio encoder produces a 384-dim embedding; the video encoder produces a 256-dim embedding. These are concatenated into a 640-dim vector and projected back to 384-dim through a linear layer, layer norm, and GELU activation. The result feeds into the classifier, which outputs a turn-end probability via sigmoid.
If video is missing (camera off), the video embedding is replaced with a zero tensor. This "modality dropout" makes the model robust to camera failures without code changes.
e_a: 384-dim"] B["Video Embedding
e_v: 256-dim"] M{"Missing Video?"} M -- Yes --> Z["Zero Tensor
0: 256-dim"] M -- No --> B A --> C{"Concatenate"} Z -.-> C B --> C C -->|"Combined: 640-dim"| D["Linear Layer
640 -> 384"] D --> E["Layer Norm"] E --> F["GELU Activation"] F -->|"h_fused: 384-dim"| G["To Classifier"] style C fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px style G fill:#f3e5f5,stroke:#4a148c,stroke-width:2px style Z stroke-dasharray: 5 5
Fusion mechanism: concatenation lets video features modulate audio confidence.
Training
We use a two-stage training process to avoid losing the audio model's learned features:
- Stage 1 (Alignment): Freeze the audio encoder and classifier. Only the video encoder and fusion layer are trainable. The video branch learns to output embeddings that modulate the audio prediction in the right direction.
- Stage 2 (Joint fine-tuning): Unfreeze the full network and train with a low learning rate. This lets the audio branch adapt slightly to visual context without losing its core capabilities.
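A hedged sketch of how the two stages can be configured in PyTorch, assuming the module layout of the detector sketched earlier; attribute names and learning rates are placeholders, not the values used in training.

```python
import torch
from torch import nn

def configure_stage(model: nn.Module, stage: int) -> torch.optim.Optimizer:
    """Stage 1: train only the video encoder and fusion layer.
    Stage 2: unfreeze everything and fine-tune at a low learning rate."""
    if stage == 1:
        for p in model.audio_encoder.parameters():
            p.requires_grad = False
        for p in model.classifier.parameters():
            p.requires_grad = False
        trainable = list(model.video_encoder.parameters()) + \
                    list(model.fusion.parameters())
        return torch.optim.AdamW(trainable, lr=1e-4)      # placeholder LR
    for p in model.parameters():
        p.requires_grad = True
    return torch.optim.AdamW(model.parameters(), lr=1e-5)  # low LR for joint tuning

criterion = nn.BCELoss()  # the model already outputs a sigmoid probability
```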
Training Data
A subset of Meta's Casual Conversations dataset is used for training. The data is cleaned to remove non-conversational entries (e.g., videos where the person is directed to move or express emotions).
Usage
The multimodal model requires both audio and video tensors at inference time. When video is unavailable, pass None and the model automatically substitutes a zero tensor, falling back to audio-only behavior internally.
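For example, inference with and without video might look like the following, assuming `model` is an instance of the detector sketched earlier; tensor shapes and preprocessing are illustrative, and the real entry points live in the repository.

```python
import torch

# Shapes follow the description above but are illustrative only.
audio = torch.randn(1, 80, 400)            # 8 s as an 80-band log-mel spectrogram
video = torch.randn(1, 3, 32, 112, 112)    # last 32 frames, resized to 112x112

p_both  = model(audio, video)              # multimodal turn-end probability
p_audio = model(audio, None)               # camera off: zero video embedding used internally
```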
To run or train the model, see the Smart Turn Multimodal GitHub repository.
Current Limitations
This is an experimental release. Known limitations:
- Dataset variety: Currently trained on one dataset of mostly unscripted monologues. Generalization to diverse conversation styles is still being validated.
- VAD-triggered: The model is only invoked after a voice activity detector (VAD) flags silence. In reality, humans often predict turn endings before silence occurs, which is a direction for future work.
- Late fusion: Fusion happens at the bottleneck only. This keeps the audio backbone intact but limits early cross-modal interactions.
Thanks
We'd like to thank:
- Pipecat/Daily for leading the open-source voice AI ecosystem
- Meta for the Casual Conversations dataset