Introducing Smart Turn Multimodal
2026-Jan-14
Deciding when to talk is something humans do subconsciously every day. Historically, conversational AI has focused on phone calls, limiting input to voice. Now that we speak to devices with cameras and screens, we can take advantage of multiple modalities. Voice AI has benefited from open source frameworks like Pipecat and LiveKit. Inspired by their work, we are releasing an audio-visual fork of Pipecat's Smart Turn that incorporates visual cues into real-time end-of-turn detection. This release is experimental; expect rough edges and ongoing changes.
The Problem: Silence is Ambiguous
Pure audio-based turn detection relies on prosody, grammar, and silence to decide when a user has finished speaking. But even good audio models struggle with ambiguous pauses: a 500 ms silence could mean the user is still mid-thought, or that they have finished and are waiting for a response. Without visual context, systems must fall back on conservative timeout thresholds (often >1000 ms), which hurts conversational fluidity.
Visual cues work differently. A person can hold the floor through body language — an open mouth, averted gaze, or raised hand — even in complete silence. This is the gap Smart Turn Multimodal aims to fill.
Example reel of cases where multimodal endpointing proved beneficial
Late Fusion Architecture
Smart Turn Multimodal uses a late fusion approach: audio and video are processed in separate encoder streams and merged only at the final bottleneck layer. This preserves the pre-trained Whisper encoder while letting a video encoder modulate the final decision.
The two branches are:
- Audio branch: The Whisper Tiny encoder from Smart Turn v3.2, processing 8 seconds of audio
- Video branch: An R3D-18 (3D ResNet) encoder pretrained on Kinetics-400, processing the last 32 frames (~1 second)
The video branch learns to modulate the audio prediction. If Whisper says "Ambiguous (0.5)" but the video shows a closed mouth, the fused output pushes toward "Complete (0.9)". If your camera is off or unavailable, the model gracefully falls back to audio-only inference via zero-padding — no code changes required.
(8 seconds)"] -->|Log-Mel Spec| B("Whisper Encoder") C["Video Frames
(Last 32 frames)"] -->|"Resize 112x112"| D("Video Encoder
R3D-18") end subgraph FeatureExtraction ["Feature Extraction"] B -->|Context Pooling| E["Audio Embedding
384-dim"] D -->|Linear Projection| F["Video Embedding
256-dim"] end subgraph LateFusion ["Late Fusion"] E --> G{"Concat"} F --> G G -->|"Fused Vector
640-dim"| H["Fusion Layer
(Linear + GELU)"] end subgraph Output H -->|Project back to 384| I["Classifier"] I -->|Sigmoid| J(("Turn End
Probability")) end style B fill:#e1f5fe,stroke:#01579b,stroke-width:2px style D fill:#fff3e0,stroke:#ff6f00,stroke-width:2px style H fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px style J fill:#fce4ec,stroke:#880e4f,stroke-width:2px
Smart Turn Multimodal architecture: Audio (Whisper) and Video (R3D-18) branches converge at the fusion layer.
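As a concrete illustration, here is a minimal PyTorch sketch of the late-fusion forward pass described above. It assumes hypothetical `audio_encoder` and `video_encoder` modules that produce 384-dim and 256-dim embeddings; class and attribute names are illustrative, not the repository's actual API.

```python
import torch
import torch.nn as nn

class LateFusionTurnDetector(nn.Module):
    """Sketch of the late-fusion head: audio (384-dim) and video (256-dim)
    embeddings are concatenated and projected back to 384-dim before the
    classifier. Names and layer details are illustrative."""

    def __init__(self, audio_encoder: nn.Module, video_encoder: nn.Module):
        super().__init__()
        self.audio_encoder = audio_encoder   # Whisper Tiny encoder + pooling -> (B, 384)
        self.video_encoder = video_encoder   # R3D-18 + projection            -> (B, 256)
        self.fusion = nn.Sequential(
            nn.Linear(384 + 256, 384),
            nn.LayerNorm(384),
            nn.GELU(),
        )
        self.classifier = nn.Linear(384, 1)  # logit for "turn is complete"

    def forward(self, audio, video=None):
        e_a = self.audio_encoder(audio)                      # (B, 384)
        if video is None:
            # Camera off: substitute a zero tensor so the fusion layer
            # effectively sees only the audio embedding.
            e_v = torch.zeros(e_a.size(0), 256,
                              device=e_a.device, dtype=e_a.dtype)
        else:
            e_v = self.video_encoder(video)                  # (B, 256)
        fused = self.fusion(torch.cat([e_a, e_v], dim=-1))   # (B, 384)
        return torch.sigmoid(self.classifier(fused))         # (B, 1) probability
```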
Component Details
Audio Encoder (Whisper Tiny)
The audio branch uses the encoder portion of Whisper Tiny. It converts 8 seconds of audio into an 80-band log-mel spectrogram (~400 time steps). The backbone is a Transformer encoder pre-trained on 680k hours of audio. Instead of mean pooling, we use cross-attention pooling (from Smart Turn v3.2) to extract a single 384-dim context vector representing the utterance.
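One common way to implement cross-attention pooling is a single learnable query attending over the encoder's hidden states. The sketch below shows that general technique; head count and module names are assumptions, not the exact Smart Turn code.

```python
import torch
import torch.nn as nn

class CrossAttentionPooling(nn.Module):
    """Pools (B, T, 384) Whisper encoder outputs into one (B, 384) context
    vector via a learnable query. Illustrative sketch only."""

    def __init__(self, dim: int = 384, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, encoder_out: torch.Tensor) -> torch.Tensor:
        # encoder_out: (B, T, 384) hidden states from the Whisper encoder
        q = self.query.expand(encoder_out.size(0), -1, -1)   # (B, 1, 384)
        pooled, _ = self.attn(q, encoder_out, encoder_out)   # (B, 1, 384)
        return pooled.squeeze(1)                             # (B, 384)
```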
Video Encoder (R3D-18)
The video branch uses a 3D ResNet-18 (R3D-18) pre-trained on the Kinetics-400 dataset. Unlike 2D CNNs, 3D convolutions capture spatiotemporal features, letting the model distinguish a "static open mouth" from "mouth closing."
- Input: 32 frames at 112×112 resolution (~1 second of video)
- Output: 512-dim features projected down to 256-dim for fusion
(Batch, 3, 32, 112, 112)"] -->|"3D Conv"| L1["Layer 1
Spatiotemporal Features"] L1 -->|"ResNet Blocks"| L3["R3D-18 Backbone"] L3 -->|"AvgPool3D"| L4["Raw Features
(Batch, 512)"] L4 -->|"Linear Projection"| Output["Video Embedding
(Batch, 256)"] style Input fill:#fff3e0,stroke:#ff6f00,stroke-width:2px style Output fill:#fff3e0,stroke:#ff6f00,stroke-width:2px
R3D-18 video encoder: processes spatiotemporal features (motion over time).
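A sketch of how such a video branch can be built with torchvision's pre-trained R3D-18, swapping the Kinetics classification head for a 512→256 projection. The wrapper class is illustrative; the repository's own module may differ.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

class VideoEncoder(nn.Module):
    """R3D-18 backbone (Kinetics-400 weights) with its classifier replaced
    by a linear projection into the 256-dim fusion space."""

    def __init__(self, out_dim: int = 256):
        super().__init__()
        backbone = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
        backbone.fc = nn.Identity()        # keep the 512-dim pooled features
        self.backbone = backbone
        self.proj = nn.Linear(512, out_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, 32, 112, 112) -- 32 RGB frames at 112x112
        return self.proj(self.backbone(frames))  # (B, 256)


# Example: one clip of the last 32 frames.
clip = torch.randn(1, 3, 32, 112, 112)
embedding = VideoEncoder()(clip)               # torch.Size([1, 256])
```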
The Fusion Layer
We use concatenation-based late fusion. The audio encoder produces a 384-dim embedding; the video encoder produces a 256-dim embedding. These are concatenated into a 640-dim vector and projected back to 384-dim through a linear layer, layer norm, and GELU activation. The result feeds into the classifier, which outputs a turn-end probability via sigmoid.
If video is missing (camera off), the video embedding is replaced with a zero tensor. This "modality dropout" makes the model robust to camera failures without code changes.
e_a: 384-dim"] B["Video Embedding
e_v: 256-dim"] M{"Missing Video?"} M -- Yes --> Z["Zero Tensor
0: 256-dim"] M -- No --> B A --> C{"Concatenate"} Z -.-> C B --> C C -->|"Combined: 640-dim"| D["Linear Layer
640 -> 384"] D --> E["Layer Norm"] E --> F["GELU Activation"] F -->|"h_fused: 384-dim"| G["To Classifier"] style C fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px style G fill:#f3e5f5,stroke:#4a148c,stroke-width:2px style Z stroke-dasharray: 5 5
Fusion mechanism: concatenation lets video features modulate audio confidence.
Training
We use a two-stage training process to avoid losing the audio model's learned features:
- Stage 1 (Alignment): Freeze the audio encoder and classifier. Only the video encoder and fusion layer are trainable. The video branch learns to output embeddings that modulate the audio prediction in the right direction.
- Stage 2 (Joint fine-tuning): Unfreeze the full network and train with a low learning rate. This lets the audio branch adapt slightly to visual context without losing its core capabilities.
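A hedged sketch of how the two stages can be configured in PyTorch, assuming the module layout of the detector sketched earlier; attribute names and learning rates are placeholders, not the values used in training.

```python
import torch
from torch import nn

def configure_stage(model: nn.Module, stage: int) -> torch.optim.Optimizer:
    """Stage 1: train only the video encoder and fusion layer.
    Stage 2: unfreeze everything and fine-tune at a low learning rate."""
    if stage == 1:
        for p in model.audio_encoder.parameters():
            p.requires_grad = False
        for p in model.classifier.parameters():
            p.requires_grad = False
        trainable = list(model.video_encoder.parameters()) + \
                    list(model.fusion.parameters())
        return torch.optim.AdamW(trainable, lr=1e-4)      # placeholder LR
    for p in model.parameters():
        p.requires_grad = True
    return torch.optim.AdamW(model.parameters(), lr=1e-5)  # low LR for joint tuning

criterion = nn.BCELoss()  # the model already outputs a sigmoid probability
```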
Training Data
A subset of Meta's Casual Conversations dataset is used for training. The data is cleaned to remove non-conversational entries (e.g., videos where the person is directed to move or express emotions).
Usage
The multimodal model requires both audio and video tensors at inference time. When video is unavailable, pass None and the model automatically substitutes a zero tensor, falling back to audio-only behavior internally.
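For example, inference with and without video might look like the following, assuming `model` is an instance of the detector sketched earlier; tensor shapes and preprocessing are illustrative, and the real entry points live in the repository.

```python
import torch

# Shapes follow the description above but are illustrative only.
audio = torch.randn(1, 80, 400)            # 8 s as an 80-band log-mel spectrogram
video = torch.randn(1, 3, 32, 112, 112)    # last 32 frames, resized to 112x112

p_both  = model(audio, video)              # multimodal turn-end probability
p_audio = model(audio, None)               # camera off: zero video embedding used internally
```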
To run or train the model, see the Smart Turn Multimodal GitHub repository.
Current Limitations
This is an experimental release. Known limitations:
- Dataset variety: Currently trained on one dataset of mostly unscripted monologues. Generalization to diverse conversation styles is still being validated.
- VAD-triggered: The model is only invoked after a voice activity detector (VAD) flags silence. In reality, humans often predict turn endings before silence occurs, which is a direction for future work.
- Late fusion: Fusion happens at the bottleneck only. This keeps the audio backbone intact but limits early cross-modal interactions.
Thanks
We'd like to thank:
- Pipecat/Daily for leading the open-source voice AI ecosystem
- Meta for the Casual Conversations dataset