How to build a robotic dialogue system with an iPhone, the OpenAI Realtime API and DockKit

Introduction

The field of LLM-based dialogue systems has evolved rapidly over the past year. Whereas early efforts focused on voice-dialogue fundamentals with generative AI, today we see a shift toward practical, production-ready applications, particularly in enterprise contexts.

In this article we explore building a “dialogue robot” that responds in real time and has a physical presence, using:

iPhone 12 as the main device
Swift as the programming language
OpenAI Realtime API as the conversational engine
Apple DockKit for face tracking

The goal is to show that a functional interactive robotic system can be built with minimal investment.

System architecture and Realtime API capabilities

The Realtime API supports:

Multimodal conversation (text and audio)
Low-latency interaction
Automatic interruption handling
Function-call support
JSON-formatted WebSocket communication
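
As an illustration of the JSON-over-WebSocket exchange, a client typically starts by sending a session.update event. The sketch below builds one with Swift's Encodable. The type names (SessionUpdateEvent, SessionConfig, TurnDetection) are hypothetical, and the field names and values are assumptions based on the published Realtime API event reference at the time of writing; they may differ between API versions, so check them against the current documentation.

```swift
import Foundation

// Hypothetical types mirroring the assumed JSON shape of a `session.update` event.
struct TurnDetection: Encodable {
    var type = "server_vad"          // server-side voice activity detection
    var threshold = 0.5              // assumed sensitivity value, tune per environment
    var prefixPaddingMs = 300        // audio kept before detected speech
    var silenceDurationMs = 500      // silence that ends the user's turn
}

struct SessionConfig: Encodable {
    var modalities = ["text", "audio"]
    var inputAudioFormat = "pcm16"
    var outputAudioFormat = "pcm16"
    var turnDetection = TurnDetection()
}

struct SessionUpdateEvent: Encodable {
    var type = "session.update"
    var session = SessionConfig()
}

/// Serializes the event to a JSON string with snake_case keys.
func makeSessionUpdateJSON(_ event: SessionUpdateEvent = SessionUpdateEvent()) throws -> String {
    let encoder = JSONEncoder()
    encoder.keyEncodingStrategy = .convertToSnakeCase
    encoder.outputFormatting = [.sortedKeys]
    let data = try encoder.encode(event)
    return String(decoding: data, as: UTF8.self)
}
```

The resulting string would be sent as a text frame over the WebSocket, for example through URLSessionWebSocketTask.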

Audio processing: resampling and sending

Captured microphone audio must be adapted before transmission. API requirements:

Sample rate: 24,000 Hz
Approx. chunk size: 2400 samples (~0.1 seconds)
PCM 16-bit format (mono, little-endian)
Base64 encoding

Therefore you must:

Resample audio to 24 kHz
Adjust chunk size to the expected frames
Convert format when necessary
Encode chunks in Base64

This step is critical to ensure stability and to avoid erratic behaviors such as incorrect responses or unexpected silences.
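
The four steps above can be sketched in plain Swift as follows. This is a minimal illustration, not the prototype's actual code: all function and type names are hypothetical, and the resampler uses simple linear interpolation without an anti-aliasing filter (on a device, AVAudioConverter would normally be the better choice). The input_audio_buffer.append event and its audio field are assumed from the Realtime API event reference.

```swift
import Foundation

/// Linear-interpolation resampler (sketch quality, no anti-aliasing filter).
func resample(_ input: [Float], from sourceRate: Double, to targetRate: Double) -> [Float] {
    guard !input.isEmpty, sourceRate > 0, targetRate > 0 else { return [] }
    if sourceRate == targetRate { return input }
    let outputCount = Int(Double(input.count) * targetRate / sourceRate)
    let step = sourceRate / targetRate
    return (0..<outputCount).map { index in
        let position = Double(index) * step
        let lower = min(Int(position), input.count - 1)
        let upper = min(lower + 1, input.count - 1)
        let fraction = Float(position - Double(lower))
        return input[lower] + (input[upper] - input[lower]) * fraction
    }
}

/// Collects resampled audio and emits fixed-size chunks (2400 samples = 0.1 s at 24 kHz).
struct ChunkAccumulator {
    let chunkSize: Int
    private var pending: [Float] = []

    init(chunkSize: Int = 2400) { self.chunkSize = chunkSize }

    /// Appends samples and returns every complete chunk; the remainder waits for the next call.
    mutating func append(_ samples: [Float]) -> [[Float]] {
        pending += samples
        var chunks: [[Float]] = []
        while pending.count >= chunkSize {
            chunks.append(Array(pending.prefix(chunkSize)))
            pending.removeFirst(chunkSize)
        }
        return chunks
    }
}

/// Converts Float samples in -1...1 to 16-bit little-endian PCM bytes.
func pcm16Data(from samples: [Float]) -> Data {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        let clamped = max(-1.0, min(1.0, sample))
        let value = Int16((clamped * Float(Int16.max)).rounded())
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    return data
}

/// Wraps one chunk in the assumed `input_audio_buffer.append` event, Base64-encoded.
func appendEventJSON(for chunk: [Float]) throws -> String {
    let payload = [
        "type": "input_audio_buffer.append",
        "audio": pcm16Data(from: chunk).base64EncodedString()
    ]
    let data = try JSONEncoder().encode(payload)
    return String(decoding: data, as: UTF8.self)
}
```

For example, a 0.1-second microphone buffer captured at 48 kHz (4800 samples) resamples to 2400 samples, which yields exactly one chunk of 4800 bytes before Base64 encoding.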

Audio playback, animation and echo handling

To play back audio received from the API you must:

Decode Base64
Convert audio format to the device playback format
Manage a playback queue
Monitor speaker state
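
A minimal sketch of the decoding and queueing steps, assuming the API delivers Base64-encoded 16-bit little-endian PCM. The names floatSamples(fromBase64PCM16:) and PlaybackQueue are hypothetical. Handing buffers to an audio engine, and converting them to the hardware sample rate, is left out here.

```swift
import Foundation

/// Decodes a Base64 string of 16-bit little-endian PCM into Float samples in -1...1.
/// Returns nil if the string is not valid Base64 or has an odd byte count.
func floatSamples(fromBase64PCM16 base64: String) -> [Float]? {
    guard let data = Data(base64Encoded: base64), data.count % 2 == 0 else { return nil }
    let bytes = [UInt8](data)
    var samples: [Float] = []
    samples.reserveCapacity(bytes.count / 2)
    for index in stride(from: 0, to: bytes.count, by: 2) {
        let raw = UInt16(bytes[index]) | (UInt16(bytes[index + 1]) << 8)
        samples.append(Float(Int16(bitPattern: raw)) / 32768.0)
    }
    return samples
}

/// FIFO queue that also tracks whether the speaker is (or is about to be) active.
struct PlaybackQueue {
    private var pending: [[Float]] = []
    private var inFlight = 0   // buffers handed to the player but not yet finished

    /// True while audio is waiting to be played or is currently playing.
    var isSpeaking: Bool { !pending.isEmpty || inFlight > 0 }

    mutating func enqueue(_ samples: [Float]) {
        pending.append(samples)
    }

    /// Returns the next buffer to schedule on the player, if any.
    mutating func nextBuffer() -> [Float]? {
        guard !pending.isEmpty else { return nil }
        inFlight += 1
        return pending.removeFirst()
    }

    /// Call from the player's completion callback when a buffer has finished playing.
    mutating func bufferDidFinish() {
        inFlight = max(0, inFlight - 1)
    }
}
```

The isSpeaking flag is the piece of state that the echo handling and the facial animation can both observe.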

Echo problem

A common issue in robotic systems is the microphone capturing speaker output. In this prototype the problem was addressed by:

Temporarily disabling the microphone during playback
Sending zeroed audio samples while the microphone is disabled

While not an advanced echo cancellation solution, this approach prevents the robot from starting conversations with itself.
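
The gating logic can be sketched as below. MicrophoneGate is a hypothetical name, and the sketch assumes microphone audio has already been converted to 16-bit samples. Sending zeros instead of sending nothing keeps the audio stream continuous, so the server presumably sees silence rather than a gap.

```swift
import Foundation

/// Replaces microphone audio with silence while the speaker is playing.
struct MicrophoneGate {
    /// Set to true when playback starts and back to false when it ends.
    var isSpeakerActive = false

    /// Returns the chunk unchanged, or an equally long run of zeros during playback.
    func process(_ chunk: [Int16]) -> [Int16] {
        isSpeakerActive ? [Int16](repeating: 0, count: chunk.count) : chunk
    }
}
```

In practice the flag would be driven by the playback state, so every outgoing chunk passes through process(_:) before encoding.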

Face tracking with DockKit

DockKit is Apple's framework for controlling compatible motorized stands, which rotate the iPhone to keep a tracked subject in frame. Interestingly, calling the DockKit API directly was unnecessary. The approach that worked was:

Activate the front camera
Show a transparent preview layer
Let the native camera system handle tracking
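
A minimal sketch of these three steps with AVFoundation and UIKit. TransparentPreviewView is a hypothetical name, and whether the preview layer is strictly required for tracking is not verified here; the sketch simply mirrors the steps listed above. It assumes camera permission has already been granted and that the app declares NSCameraUsageDescription.

```swift
import AVFoundation
import UIKit

/// A view whose backing layer is a camera preview that is kept fully transparent.
final class TransparentPreviewView: UIView {
    override class var layerClass: AnyClass { AVCaptureVideoPreviewLayer.self }

    private var previewLayer: AVCaptureVideoPreviewLayer {
        layer as! AVCaptureVideoPreviewLayer
    }

    private let session = AVCaptureSession()
    private let sessionQueue = DispatchQueue(label: "example.camera.session")

    /// Call on the main thread once camera access has been authorized.
    func startFrontCamera() {
        previewLayer.session = session
        previewLayer.opacity = 0   // the preview exists but is not visible

        // Session configuration and startRunning() block, so keep them off the main thread.
        sessionQueue.async { [session] in
            guard session.inputs.isEmpty,
                  let camera = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                       for: .video,
                                                       position: .front),
                  let input = try? AVCaptureDeviceInput(device: camera),
                  session.canAddInput(input) else { return }
            session.beginConfiguration()
            session.addInput(input)
            session.commitConfiguration()
            session.startRunning()
        }
    }
}
```

The view can be added behind the robot's face UI; no DockKit symbols appear in the code.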

This suggests that tracking is handled at the system level whenever an app uses the camera on a paired stand, without any app-side DockKit code. The resulting robot:

Maintains eye contact
Reacts with facial animation
Responds in real time

Recommendations

Carefully tune turn_detection to avoid both missed end-of-turn silences and false triggers.
Implement precise 24 kHz resampling to guarantee stability.
Consider ML-based echo cancellation for commercial systems.
Use real-time RMS calculations if advanced lip-sync is desired.
Evaluate splitting STT, LLM and TTS to gain finer production control.
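
For the RMS-based lip-sync recommendation, a minimal sketch could look like this. The function names and the gain parameter are hypothetical; gain is an illustrative tuning value, not a measured one.

```swift
import Foundation

/// Root mean square of a buffer of Float samples; 0 for an empty buffer.
func rms(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Maps the loudness of a playback buffer to a mouth-openness value in 0...1.
func mouthOpenness(for samples: [Float], gain: Float = 4.0) -> Float {
    min(1.0, rms(samples) * gain)
}
```

Computing this per playback buffer (for example every 0.1 seconds) gives a value that can drive the mouth animation directly.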

Conclusions

Development of robotic dialogue systems has reached a point where practical implementations are viable, especially for enterprise use. Using an iPhone, Swift and the OpenAI Realtime API, you can build a functional interactive robot at a fraction of the cost of traditional robotic platforms. Technical challenges remain, including echo cancellation, robust speech recognition and stability in noisy environments, but the cost-benefit ratio of this approach is highly attractive. Combining consumer devices with advanced generative AI may accelerate wider adoption of service robots in coming years.

Glossary

Realtime API: Interface enabling low-latency multimodal interaction via WebSocket.
Resampling: Process of adapting an audio signal’s sample rate.
RMS (Root Mean Square): The square root of the mean of the squared samples, used as a measure of an audio signal's average level.
Turn Detection: System that detects when a speaker has finished speaking.
Echo Cancellation: Technique that removes the loudspeaker's output from the signal captured by the microphone.
