OpenAI's Realtime API

2024-Oct-5

Blog writing these days has to be almost real-time. Sitting on content even for a few days risks making it obsolete. In the previous post, I discussed some of the obstacles to the usability of voice interfaces, such as interruptions and latency. Not even two weeks have passed, and OpenAI has released the Realtime API, which directly addresses these issues. Here are my first thoughts.

The Beginning of the End of End-of-Speech

I imagine that many in conversational AI, like me, have been trying to wrap their minds around the recent release of OpenAI's Realtime API. Inflexible turn-taking has long been a major thorn in the usability of spoken dialog systems, and the announcement of a production-quality real-time audio-to-audio API (aka full-duplex, aka continuous recognition and generation) is somewhat world-shaking.
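To make the full-duplex shape of this concrete, here is a minimal Python sketch of the two halves running at once: microphone audio streamed up and model audio streamed down over a single WebSocket connection. The endpoint, model name, and event names follow the launch documentation (check the current docs before relying on them), and `mic_chunks` and `play` are hypothetical stand-ins for your audio capture and playback.

```python
# A minimal full-duplex sketch, assuming the WebSocket endpoint and event
# names documented at launch; `mic_chunks` (an async iterator of raw 16-bit
# PCM frames) and `play` (a playback callback) are hypothetical stand-ins.
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}


async def talk(mic_chunks, play):
    # Newer versions of the websockets library name this parameter
    # `additional_headers` rather than `extra_headers`.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:

        async def send_audio():
            # Stream microphone audio continuously; no end-of-speech marker.
            async for chunk in mic_chunks:
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode(),
                }))

        async def receive_audio():
            # Play model audio as it arrives, concurrently with sending.
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "response.audio.delta":
                    play(base64.b64decode(event["delta"]))

        await asyncio.gather(send_audio(), receive_audio())
```

The point is that uplink and downlink run concurrently; the client never waits for a turn boundary before streaming.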

Now that I have somewhat recomposed myself, here are some thoughts on how the Realtime API affects the design of conversational AI systems:

  1. External decision-making processes, such as contextual updates to the prompt before passing user input to the LLM, now come at the cost of significant latency compared with the LLM's real-time response. This is likely to become an anti-pattern.
  2. A hybrid "talk before thinking" flow, in which the LLM handles immediate responses while external control is regained only occasionally (for example, when the LLM initiates function calls), is likely to become the dominant pattern.

In other words, the Realtime API elevates the LLM to the role of orchestrating the time-sensitive flow of the conversation, while external routers are called only occasionally, preferably in an asynchronous, non-blocking fashion, to steer the conversation by injecting function call responses or prompt updates.
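As a sketch of that non-blocking steering, assuming the `response.output_item.done`, `conversation.item.create`, and `response.create` events from the launch documentation, and a hypothetical `tools` registry of async callables, a function call can be resolved in the background and its result injected when ready:

```python
# A sketch of the non-blocking "talk before thinking" flow: when a completed
# function call item arrives, run the tool in the background and inject the
# result as a new conversation item, without stalling the audio stream.
import asyncio
import json


async def handle_events(ws, tools):
    # `tools` is a hypothetical dict mapping function names to async callables.
    async for message in ws:
        event = json.loads(message)
        if (event["type"] == "response.output_item.done"
                and event["item"]["type"] == "function_call"):
            # Do not block the event loop; resolve the call concurrently.
            asyncio.create_task(run_tool_and_inject(ws, tools, event["item"]))


async def run_tool_and_inject(ws, tools, item):
    result = await tools[item["name"]](**json.loads(item["arguments"]))
    # Inject the tool output as a conversation item...
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": item["call_id"],
            "output": json.dumps(result),
        },
    }))
    # ...and ask the model to fold it into its next spoken response.
    await ws.send(json.dumps({"type": "response.create"}))
```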

For those who want to continue orchestrating their own turn-taking boundaries, the Realtime API allows the client to use its own Voice Activity Detection (VAD). This respectful nod to the legacy of modular design will incur increased latency and will likely also become an anti-pattern.
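For completeness, here is a sketch of what that looks like, assuming the `session.update` and `input_audio_buffer.commit` events from the launch documentation: server-side turn detection is switched off, and the client's own VAD decides when to commit the buffer and request a response.

```python
# A sketch of client-managed turn-taking: server VAD is disabled, and the
# client signals end of speech itself. Event names follow the launch docs.
import json


async def use_client_vad(ws):
    # Disable server-side turn detection for this session.
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"turn_detection": None},
    }))


async def on_end_of_speech(ws):
    # Called by the client's own VAD once it decides the user stopped talking.
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
    await ws.send(json.dumps({"type": "response.create"}))
```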