Technology
02/13/2026

How to build a robotic dialogue system with iPhone, OpenAI Realtime API and DockKit

This article explores how to build a minimal robotic dialogue system using an iPhone 12, Swift, the OpenAI Realtime API and DockKit. We analyze system architecture, audio processing, WebSocket session management, voice playback, facial animation, and practical mitigations for echo. We also reflect on the practical potential of low-cost robots for enterprise and home environments.

Introduction

The field of LLM-based dialogue systems has evolved rapidly over the past year. Whereas early efforts focused on the fundamentals of voice dialogue with generative AI, the focus is now shifting toward practical, production-ready applications, particularly in enterprise contexts.

In this article we explore building a “dialogue robot” that offers real-time interaction and physical co-presence, using:

  • iPhone 12 as the main device
  • Swift as the programming language
  • OpenAI Realtime API as the conversational engine
  • Apple DockKit for face tracking

The goal is to show that a functional interactive robotic system can be built with minimal investment.

System architecture and Realtime API capabilities

The Realtime API supports:

  • Multimodal conversation (text and audio)
  • Low latency interaction
  • Automatic interruption handling
  • Function-call support
  • JSON-formatted WebSocket communication
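
As an illustration of the JSON-formatted communication, the following minimal sketch builds a session configuration event in Swift. The event type and field names (session.update, input_audio_format, turn_detection, server_vad, threshold, silence_duration_ms) follow the Realtime API event schema as publicly documented at the time of writing and should be checked against the current reference. The function name and the numeric values are assumptions made for this example, not the prototype's actual code.

```swift
import Foundation

// Builds a "session.update" event as a JSON string.
// makeSessionUpdate is a hypothetical helper; the threshold and silence
// values passed in are illustrative and need tuning per environment.
func makeSessionUpdate(threshold: Double, silenceMs: Int) throws -> String {
    let turnDetection: [String: Any] = [
        "type": "server_vad",
        "threshold": threshold,
        "silence_duration_ms": silenceMs
    ]
    let session: [String: Any] = [
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "turn_detection": turnDetection
    ]
    let event: [String: Any] = [
        "type": "session.update",
        "session": session
    ]
    let data = try JSONSerialization.data(withJSONObject: event, options: [.sortedKeys])
    return String(decoding: data, as: UTF8.self)
}
```

On iOS the resulting string could be sent as a text frame with URLSessionWebSocketTask, for example task.send(.string(json)) with a completion handler. Authentication headers and the endpoint URL are omitted here on purpose.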

Audio processing: resampling and sending

Captured microphone audio must be converted before it is sent. The API, as used in this prototype, expects:

  • Sample rate: 24,000 Hz
  • Approx. chunk size: 2400 samples (~0.1 seconds)
  • PCM 16-bit format
  • Base64 encoding

Therefore you must:

  • Resample audio to 24 kHz
  • Split the stream into chunks of the expected size (about 2400 frames)
  • Convert format when necessary
  • Encode chunks in Base64
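
The first two steps can be sketched as follows. This is a minimal illustration, not the prototype's code: it uses plain linear interpolation, which is simple to read but applies no anti-aliasing filter, so a production app would more likely rely on a proper converter such as AVAudioConverter. The names resample and Chunker are hypothetical.

```swift
import Foundation

// Linear-interpolation resampler for mono Float samples.
// Assumption: a deliberately simple stand-in for a production-quality converter.
func resample(_ input: [Float], from sourceRate: Double, to targetRate: Double) -> [Float] {
    guard !input.isEmpty, sourceRate > 0, targetRate > 0 else { return [] }
    let outputCount = Int((Double(input.count) * targetRate / sourceRate).rounded(.down))
    let step = sourceRate / targetRate
    var output = [Float]()
    output.reserveCapacity(outputCount)
    for i in 0..<outputCount {
        let position = Double(i) * step
        let index = Int(position)
        let next = min(index + 1, input.count - 1)
        let fraction = Float(position - Double(index))
        // Blend the two neighbouring input samples.
        output.append(input[index] * (1 - fraction) + input[next] * fraction)
    }
    return output
}

// Accumulates samples and emits fixed-size chunks
// (2400 samples is about 0.1 s at 24 kHz).
struct Chunker {
    let chunkSize: Int
    private var pending: [Float] = []

    init(chunkSize: Int = 2400) {
        precondition(chunkSize > 0, "chunkSize must be positive")
        self.chunkSize = chunkSize
    }

    // Returns every complete chunk available; leftovers wait for the next call.
    mutating func push(_ samples: [Float]) -> [[Float]] {
        pending.append(contentsOf: samples)
        var chunks: [[Float]] = []
        while pending.count >= chunkSize {
            chunks.append(Array(pending.prefix(chunkSize)))
            pending.removeFirst(chunkSize)
        }
        return chunks
    }
}
```

With a 48 kHz capture, for instance, each 4800-sample microphone buffer would resample to 2400 samples and leave the chunker as exactly one chunk.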

These steps are critical for stability: audio in the wrong format can cause erratic behavior such as incorrect responses or unexpected silences.
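
The format conversion and Base64 encoding steps can be sketched like this. The event type input_audio_buffer.append and its audio field follow the Realtime API schema as documented at the time of writing; the helper names pcm16Data and makeAppendEvent are assumptions for this example, and the input is assumed to be mono Float samples in the range -1 to 1.

```swift
import Foundation

// Converts Float samples in [-1, 1] to little-endian 16-bit PCM bytes.
// Non-finite samples are replaced with silence; out-of-range values are clamped.
func pcm16Data(from samples: [Float]) -> Data {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        let safe = sample.isFinite ? sample : 0
        let clamped = max(-1, min(1, safe))
        let value = Int16((clamped * Float(Int16.max)).rounded())
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    return data
}

// Wraps one chunk in an "input_audio_buffer.append" event (hypothetical helper).
func makeAppendEvent(samples: [Float]) throws -> String {
    let event: [String: String] = [
        "type": "input_audio_buffer.append",
        "audio": pcm16Data(from: samples).base64EncodedString()
    ]
    let data = try JSONSerialization.data(withJSONObject: event)
    return String(decoding: data, as: UTF8.self)
}
```

Clamping before the integer conversion matters: a sample slightly above 1.0 would otherwise overflow Int16 and crash at runtime.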

Audio playback, animation and echo handling

To play back audio received from the API you must:

  • Decode Base64
  • Convert audio format to the device playback format
  • Manage a playback queue
  • Monitor speaker state
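
A minimal sketch of the decoding and queueing steps is shown below. It assumes the server delivers 16-bit little-endian PCM as a Base64 string; the names floatSamples and PlaybackQueue are hypothetical, and the actual audio output (for example through AVAudioEngine) is left out.

```swift
import Foundation

// Decodes a Base64 payload of little-endian PCM16 into Float samples in [-1, 1).
// Returns nil if the payload is not valid Base64 or has an odd byte count.
func floatSamples(fromBase64 payload: String) -> [Float]? {
    guard let data = Data(base64Encoded: payload), data.count % 2 == 0 else { return nil }
    let bytes = [UInt8](data)
    var samples = [Float]()
    samples.reserveCapacity(bytes.count / 2)
    for i in stride(from: 0, to: bytes.count, by: 2) {
        let raw = UInt16(bytes[i]) | (UInt16(bytes[i + 1]) << 8)
        samples.append(Float(Int16(bitPattern: raw)) / 32768)
    }
    return samples
}

// FIFO queue of decoded buffers that also exposes the speaker state.
struct PlaybackQueue {
    private var pending: [[Float]] = []
    private var playing = false

    // True while a buffer is playing or more buffers are waiting.
    var isSpeaking: Bool { playing || !pending.isEmpty }

    mutating func enqueue(_ buffer: [Float]) { pending.append(buffer) }

    // Call when the player is ready for the next buffer; returns nil when idle.
    mutating func next() -> [Float]? {
        if pending.isEmpty {
            playing = false
            return nil
        }
        playing = true
        return pending.removeFirst()
    }
}
```

The isSpeaking flag is the piece the rest of the app can observe, for example to drive facial animation or to decide whether the microphone should be live.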

Echo problem

A common issue in robotic systems is the microphone capturing the speaker output. In this prototype the problem was addressed by:

  • Temporarily disabling the microphone during playback
  • Sending zeroed audio samples while the microphone is disabled

While not an advanced echo cancellation solution, this approach prevents the robot from starting conversations with itself.
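
The muting strategy described above can be sketched in a few lines. The function name gateMicrophone is hypothetical, and the speakerIsActive flag is assumed to come from whatever component tracks playback state.

```swift
// Returns the chunk unchanged while the robot is silent, and a chunk of
// zeroed samples of the same length while the speaker is active, so the
// audio stream to the API keeps its timing but carries no echo.
func gateMicrophone(_ chunk: [Float], speakerIsActive: Bool) -> [Float] {
    speakerIsActive ? [Float](repeating: 0, count: chunk.count) : chunk
}
```

Keeping the chunk length constant means the downstream resampling, chunking and encoding steps do not need to know whether the microphone was muted.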

Face tracking with DockKit

DockKit is Apple's framework for motorized iPhone stands that can follow a subject seen by the camera. Interestingly, direct use of the DockKit library was unnecessary. The approach that worked was:

  • Activate the front camera
  • Show a transparent preview layer
  • Let the native camera system handle tracking

This suggests that tracking is handled at the OS level whenever an app uses the camera. The resulting robot:

  • Maintains eye contact
  • Reacts with facial animation
  • Responds in real time
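
The camera-only approach described above might look like the following sketch. It uses standard AVFoundation calls; the function name makeTrackingPreview and the error type are assumptions for this example, and whether a fully transparent layer is enough to trigger tracking on a given stand is taken from the article's observation rather than verified here. The conditional import only exists so that the snippet compiles on platforms without AVFoundation.

```swift
#if canImport(AVFoundation)
import AVFoundation

enum CameraSetupError: Error {
    case noFrontCamera
    case cannotAddInput
}

// Configures a capture session on the front camera and returns it together
// with a fully transparent preview layer (hypothetical helper).
func makeTrackingPreview() throws -> (session: AVCaptureSession, layer: AVCaptureVideoPreviewLayer) {
    guard let camera = AVCaptureDevice.default(.builtInWideAngleCamera,
                                               for: .video,
                                               position: .front) else {
        throw CameraSetupError.noFrontCamera
    }
    let input = try AVCaptureDeviceInput(device: camera)
    let session = AVCaptureSession()
    session.beginConfiguration()
    guard session.canAddInput(input) else {
        session.commitConfiguration()
        throw CameraSetupError.cannotAddInput
    }
    session.addInput(input)
    session.commitConfiguration()

    let layer = AVCaptureVideoPreviewLayer(session: session)
    layer.videoGravity = .resizeAspectFill
    layer.opacity = 0 // invisible to the user; the camera still runs
    return (session, layer)
}
#endif
```

In an app, the returned layer would be added to a view's layer hierarchy and session.startRunning() called on a background queue, since that call blocks. The app also needs an NSCameraUsageDescription entry in Info.plist before the camera can be used.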

Recommendations

  • Carefully tune turn_detection so that the end of a turn is neither missed nor triggered too early.
  • Implement precise 24 kHz resampling to keep the session stable.
  • Consider ML-based echo cancellation for commercial systems.
  • Use real-time RMS calculations if advanced lip-sync is desired.
  • Evaluate splitting STT, LLM and TTS to gain finer production control.
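
For the RMS recommendation, a minimal sketch of a per-buffer level computation is shown below. The mapping from level to mouth opening, including the gain value and the name mouthOpenness, is an assumption for illustration and would need tuning against real speech.

```swift
// Root mean square of a buffer: the square root of the mean of squared samples.
func rms(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

// Maps the level of the buffer currently being played to a mouth opening
// in 0...1. The default gain is an arbitrary illustrative value.
func mouthOpenness(for samples: [Float], gain: Float = 4) -> Float {
    min(1, rms(samples) * gain)
}
```

Calling mouthOpenness on each playback buffer, roughly ten times per second with 0.1-second chunks, gives a value that can drive a simple mouth animation.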

Conclusions

Development of robotic dialogue systems has reached a point where practical implementations are viable, especially for enterprise use. Using an iPhone, Swift and the OpenAI Realtime API, you can build a functional interactive robot at a fraction of the cost of traditional robotic platforms. Technical challenges remain, including echo cancellation, robust speech recognition and stability in noisy environments, but the cost-benefit ratio of this approach is highly attractive. Combining consumer devices with advanced generative AI may accelerate wider adoption of service robots in the coming years.

Glossary

  • Realtime API: Interface enabling low-latency multimodal interaction via WebSocket.
  • Resampling: Process of adapting an audio signal’s sample rate.
  • RMS (Root Mean Square): Square root of the mean of the squared samples, used as a measure of an audio signal's average intensity.
  • Turn Detection: System that detects when a speaker has finished speaking.
  • Echo Cancellation: Technique that removes the device's own loudspeaker output from the microphone signal.
