How to build a robotic dialogue system with an iPhone, the OpenAI Realtime API and DockKit
This article explores how to build a minimal robotic dialogue system using an iPhone 12, Swift, the OpenAI Realtime API and DockKit. We analyze system architecture, audio processing, WebSocket session management, voice playback, facial animation, and practical mitigations for echo. We also reflect on the practical potential of low-cost robots for enterprise and home environments.

Introduction
The field of LLM-based dialogue systems has evolved rapidly over the past year. Whereas early efforts focused on the fundamentals of voice dialogue with generative AI, today we see a shift toward practical, production-ready applications, particularly in enterprise contexts.
In this article we explore building a “dialogue robot” with concurrency and physical co-presence using:
- iPhone 12 as the main device
- Swift as the programming language
- OpenAI Realtime API as the conversational engine
- Apple DockKit for face tracking
The goal is to show that a functional interactive robotic system can be built with minimal investment.

System architecture and Realtime API capabilities
The Realtime API supports:
- Multimodal conversation (text and audio)
- Low latency interaction
- Automatic interruption handling
- Function-call support
- JSON-formatted WebSocket communication
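As an illustration of the JSON event format, the following minimal sketch builds a `session.update` event that selects 16-bit PCM audio and server-side turn detection. The helper name `makeSessionUpdate` is hypothetical, the field names follow the Realtime API event schema as published at the time of writing (they may change between API versions), and the numeric values are placeholders rather than tuned settings.

```swift
import Foundation

// Hypothetical helper: builds a `session.update` event as a JSON string.
// Field names are assumed from the published Realtime API event schema;
// the threshold and silence values are illustrative, not recommendations.
func makeSessionUpdate(silenceMs: Int, threshold: Double) throws -> String {
    let turnDetection: [String: Any] = [
        "type": "server_vad",
        "threshold": threshold,
        "silence_duration_ms": silenceMs
    ]
    let session: [String: Any] = [
        "modalities": ["text", "audio"],
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "turn_detection": turnDetection
    ]
    let event: [String: Any] = ["type": "session.update", "session": session]
    // Sorted keys make the output deterministic, which helps when logging.
    let data = try JSONSerialization.data(withJSONObject: event, options: [.sortedKeys])
    return String(decoding: data, as: UTF8.self)
}
```

On iOS, the resulting string could be sent as a text frame over a `URLSessionWebSocketTask`; connection setup and authentication are omitted here.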
Audio processing: resampling and sending
Captured microphone audio must be adapted before transmission. API requirements:
- Sample rate: 24,000 Hz
- Approx. chunk size: 2400 samples (~0.1 seconds)
- PCM 16-bit format
- Base64 encoding
Therefore you must:
- Resample audio to 24 kHz
- Adjust chunk size to the expected frames
- Convert format when necessary
- Encode chunks in Base64
This step is critical to ensure stability and to avoid erratic behaviors such as incorrect responses or unexpected silences.
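The four steps above can be sketched in plain Swift as follows. This is a minimal illustration, not the prototype's actual code: it assumes mono input already available as normalized `Float` samples, uses simple linear interpolation in place of a proper resampling filter, and the function names (`resampleLinear`, `pcm16Data`, `makeAppendEvents`) are hypothetical. The `input_audio_buffer.append` event name and its `audio` field are taken from the Realtime API event schema as published at the time of writing.

```swift
import Foundation

// Linear-interpolation resampler: a simple stand-in with no anti-aliasing
// filter, so it is not suitable as-is for production-quality audio.
func resampleLinear(_ input: [Float], from sourceRate: Double, to targetRate: Double) -> [Float] {
    guard input.count > 1, sourceRate > 0, targetRate > 0 else { return input }
    let outputCount = Int(Double(input.count) * targetRate / sourceRate)
    let step = sourceRate / targetRate
    return (0..<outputCount).map { (i: Int) -> Float in
        let position = Double(i) * step
        let lower = min(Int(position), input.count - 1)
        let upper = min(lower + 1, input.count - 1)
        let fraction = Float(position - Double(lower))
        return input[lower] * (1 - fraction) + input[upper] * fraction
    }
}

// Convert normalized floats (-1...1) to 16-bit little-endian PCM bytes.
func pcm16Data(from samples: [Float]) -> Data {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        let clamped = max(-1, min(1, sample))
        let value = Int16((clamped * Float(Int16.max)).rounded())
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    return data
}

// Split samples into chunks of `chunkSize` (2400 samples is 0.1 s at 24 kHz)
// and wrap each chunk as a JSON event carrying a Base64 payload.
func makeAppendEvents(samples: [Float], chunkSize: Int = 2400) throws -> [String] {
    var events: [String] = []
    var start = 0
    while start < samples.count {
        let end = min(start + chunkSize, samples.count)
        let payload = pcm16Data(from: Array(samples[start..<end])).base64EncodedString()
        let event: [String: Any] = ["type": "input_audio_buffer.append", "audio": payload]
        let json = try JSONSerialization.data(withJSONObject: event, options: [.sortedKeys])
        events.append(String(decoding: json, as: UTF8.self))
        start = end
    }
    return events
}
```

For example, one second of 48 kHz microphone audio resamples to 24,000 samples and yields ten events. On a real device, `AVAudioConverter` is a more robust choice for the rate and format conversion than hand-written interpolation.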
Audio playback, animation and echo handling
To play back audio received from the API you must:
- Decode Base64
- Convert audio format to the device playback format
- Manage a playback queue
- Monitor speaker state
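The decoding and queueing steps above might look like the following sketch. It assumes the server sends 16-bit little-endian PCM as a Base64 string (for example, the payload of an audio delta event) and leaves the actual audio output out of scope. `decodePCM16` and `PlaybackQueue` are hypothetical names introduced for illustration.

```swift
import Foundation

// Decode a Base64 payload of 16-bit little-endian PCM into normalized
// Float samples. Returns nil if the payload is not valid Base64 or has
// an odd number of bytes.
func decodePCM16(base64 payload: String) -> [Float]? {
    guard let data = Data(base64Encoded: payload), data.count % 2 == 0 else { return nil }
    let bytes = [UInt8](data)
    return stride(from: 0, to: bytes.count, by: 2).map { (i: Int) -> Float in
        let raw = Int16(bitPattern: UInt16(bytes[i]) | (UInt16(bytes[i + 1]) << 8))
        return Float(raw) / 32768.0
    }
}

// Hypothetical FIFO playback queue. `isSpeaking` is the speaker state
// that other parts of the app (such as the microphone path) can poll.
struct PlaybackQueue {
    private var chunks: [[Float]] = []

    init() {}

    var isSpeaking: Bool { !chunks.isEmpty }

    mutating func enqueue(_ chunk: [Float]) { chunks.append(chunk) }

    // Returns the next chunk to hand to the audio output, if any.
    mutating func dequeue() -> [Float]? {
        chunks.isEmpty ? nil : chunks.removeFirst()
    }
}
```

In this simplified model the speaker counts as active while chunks remain queued. A real player would also need to account for the chunk currently being rendered by the audio hardware.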

Echo problem
A common issue in robotic systems is the microphone capturing speaker output. In this prototype the problem was addressed by:
- Temporarily disabling the microphone during playback
- Sending zeroed audio samples while the microphone is disabled
While not an advanced echo cancellation solution, this approach prevents the robot from starting conversations with itself.
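A minimal sketch of this gating idea, assuming the capture path delivers `Int16` chunks and that some other component reports whether playback is active. The function name `samplesToSend` is hypothetical.

```swift
import Foundation

// Returns the samples to transmit for one capture chunk. While the robot
// is speaking, the captured audio is replaced with silence of the same
// length, so the stream stays continuous but carries no speaker echo.
func samplesToSend(captured: [Int16], isSpeaking: Bool) -> [Int16] {
    isSpeaking ? [Int16](repeating: 0, count: captured.count) : captured
}
```

The trade-off of this design is that the user cannot interrupt the robot by voice while it is speaking, because their speech is zeroed out along with the echo.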
Face tracking with DockKit
DockKit is Apple's framework for controlling compatible motorized iPhone stands, which rotate the phone to follow a tracked subject. Interestingly, direct use of the DockKit library was unnecessary. The approach that worked was:
- Activate the front camera
- Show a transparent preview layer
- Let the native camera system handle tracking
This implies tracking may be managed at the OS level. The resulting robot:
- Maintains eye contact
- Reacts with facial animation
- Responds in real time

Recommendations
- Carefully tune turn_detection to avoid missed silence or false positives.
- Implement precise 24 kHz resampling to guarantee stability.
- Consider ML-based echo cancellation for commercial systems.
- Use real-time RMS calculations if advanced lip-sync is desired.
- Evaluate splitting STT, LLM and TTS to gain finer production control.
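To illustrate the RMS recommendation, the sketch below computes the RMS level of a chunk of normalized samples and maps it to a mouth-opening value between 0 and 1. The mapping function `mouthOpenness` and its `fullOpenLevel` parameter are hypothetical, and the default of 0.3 is an arbitrary placeholder that would need tuning against real playback audio.

```swift
import Foundation

// Root mean square of a chunk of normalized samples (-1...1).
func rms(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

// Hypothetical mapping from signal level to mouth opening in 0...1.
// Levels at or above `fullOpenLevel` are treated as fully open.
func mouthOpenness(for samples: [Float], fullOpenLevel: Float = 0.3) -> Float {
    min(1, rms(samples) / fullOpenLevel)
}
```

Calling `mouthOpenness(for:)` once per playback chunk gives an animation value that rises and falls with the loudness of the robot's voice.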
Conclusions
Development of robotic dialogue systems has reached a point where practical implementations are viable, especially for enterprise use. Using an iPhone, Swift and the OpenAI Realtime API, you can build a functional interactive robot at a fraction of the cost of traditional robotic platforms. Technical challenges remain, including echo cancellation, robust speech recognition and stability in noisy environments, but the cost-benefit ratio of this approach is highly attractive. Combining consumer devices with advanced generative AI may accelerate wider adoption of service robots in the coming years.
Glossary
- Realtime API: Interface enabling low-latency multimodal interaction via WebSocket.
- Resampling: Process of adapting an audio signal’s sample rate.
- RMS (Root Mean Square): The square root of the mean of the squared sample values, used as a measure of an audio signal's average level.
- Turn Detection: System that detects when a speaker has finished speaking.
- Echo Cancellation: Technique that removes the device's own loudspeaker output from the microphone signal.

