
I Spent 3 Months Building a Voice Agent with OpenAI's Realtime API — Here's Everything the Docs Don'

By Cael Lee | 9 min read

I threw away two months of work last week. Two. Whole. Months. Just... gone.

The reason? I fundamentally misunderstood how voice interruption works in OpenAI's Realtime API. My agent was either an impenetrable wall that steamrolled over people mid-sentence, or it cut them off like a caffeinated intern who'd mainlined four Red Bulls. If you're building voice agents right now, learn from my suffering — I've already made every mistake so you don't have to.

I've been in the conversational AI trenches since the GPT-3 days. Remember when we thought that was mind-blowing? When OpenAI dropped the Realtime API last October, I genuinely thought "finally, no more cobbling together these Frankenstein STT→LLM→TTS pipelines that shatter if you breathe on them wrong."

Oh, sweet summer child. I was so naive.

The demo videos? They look like actual magic. Smooth, natural, like chatting with a real person. What they conveniently leave out: what happens when a user sneezes mid-sentence, changes their mind halfway through, or — god forbid — actually tries to interrupt the AI like a normal human conversation. You know, the thing humans do constantly. Every single conversation. All the time.

The architecture that broke my brain

Here's the thing about the Realtime API that the docs sort of... gesture at vaguely... but don't scream from the rooftops: interruption isn't one thing. It's actually three completely separate mechanisms that all need to dance together in perfect sync:

  1. Server-side VAD (Voice Activity Detection) — the API's built-in turn detection
  2. Client-side audio buffer management — what YOU do with the audio stream on your end
  3. Response cancellation — the response.cancel event

I assumed #1 would handle everything. Just set it and forget it, right?

That's like assuming your car's cruise control will parallel park for you. Technically related to driving, sure. Not the same thing at all. Not even close.

My biggest facepalm moment (and there were many)

Built this whole customer service agent for a client. Tested it internally with my team for two weeks — worked beautifully. We were high-fiving. Deployed to beta testers on a Thursday afternoon.

Friday morning: Slack is on fire. "It won't let me talk." "It keeps talking over me." "I said 'wait' three times and it just kept going." Someone sent a recording and I could literally hear their frustration building with each exchange. It was painful to listen to.

The issue? I had turn_detection set to server_vad with the default thresholds. The default silence_duration_ms is 500ms. That's half a second. Sounds reasonable on paper, right?

Wrong. So incredibly wrong.

In real conversations, people pause mid-sentence ALL THE TIME. They say "I want to... um... book a flight to..." and the API goes "GREAT, LET ME HELP YOU BOOK THAT FLIGHT" because it detected 500ms of silence during their "um." The AI is enthusiastically responding to half a thought while the user is still formulating the rest. It's like having a conversation with someone who finishes your sentences — but gets them wrong 70% of the time.

But here's the counterintuitive part that drove me absolutely insane — if you make the silence threshold too long (say 2000ms), users feel like they're talking to a brick wall. They finish their sentence and just sit there in awkward silence while the AI... waits. And waits. And the user goes "hello??" which then interrupts the AI that was finally about to respond. Now you've got two interruptions stacked on top of each other and everything's a mess.

There's this uncanny valley of silence timing. Too short and you're interrupting. Too long and you seem broken. The sweet spot is somewhere in between, and finding it took me 47 iterations. I wish I was exaggerating.
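To make the trade-off concrete, here's a toy model of what a purely silence-based detector does with one hesitant utterance. The helper name and the pause lengths are made up for illustration (they are not measurements from my logs), and the real server VAD is more involved than a single comparison.

```javascript
// Hypothetical helper: returns the index of the first pause long enough for a
// silence-based detector to declare "end of turn", or -1 if the turn survives.
function firstPrematureCutoff(pausesMs, silenceDurationMs) {
  return pausesMs.findIndex((pause) => pause >= silenceDurationMs);
}

// "I want to... um... book a flight to..." with illustrative pause lengths in ms.
const pauses = [600, 450, 700];

console.log(firstPrematureCutoff(pauses, 500)); // 0  -> cut off at the very first pause
console.log(firstPrematureCutoff(pauses, 800)); // -1 -> no pause is long enough
```

The flip side is that the same number is also roughly how long the user waits after they really are done, which is why cranking it up to 2000ms feels broken.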

What actually worked (after 47 iterations, no joke)

Here's my current setup. It's not perfect but it doesn't make me want to throw my laptop out the window anymore:


turn_detection: {
 type: "server_vad",
 threshold: 0.5, // the documented default; lower values make the VAD more sensitive to speech
 prefix_padding_ms: 300, // this is CRUCIAL - captures speech BEFORE the trigger point
 silence_duration_ms: 800, // longer than default, trust me on this
 create_response: true
}
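For context, this is roughly how that object gets to the server: a sketch that assumes the config is sent inside a session.update client event over the WebSocket. `buildSessionUpdate` is my own hypothetical helper, and the exact field layout may differ between API versions, so check it against the reference for the version you're on.

```javascript
// Hypothetical helper: wraps the turn_detection settings in a session.update event.
function buildSessionUpdate(overrides = {}) {
  return {
    type: "session.update",
    session: {
      turn_detection: {
        type: "server_vad",
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 800,
        create_response: true,
        ...overrides, // lets you tweak one value per experiment
      },
    },
  };
}

const message = JSON.stringify(buildSessionUpdate());
console.log(message);
// Once the socket is open: ws.send(message);
```

Having the overrides parameter made the 47 iterations a lot less painful, since each experiment is a one-line change like `buildSessionUpdate({ silence_duration_ms: 1000 })`.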

But the real magic? It's in the client-side handling. The stuff the docs barely mention in passing, like it's an afterthought.

I maintain a circular buffer of the last 2 seconds of audio. Always recording, always buffering. When the user interrupts (detected via the input_audio_buffer.speech_started event), here's what happens:

  1. Immediately send response.cancel
  2. Flush my buffer to capture what they said DURING the AI's response
  3. Feed that audio back as the new input

This catches those "wait, no, I meant..." moments that happen while the AI is still yapping away. You know, like how actual humans interrupt each other. Revolutionary concept, I know.

Actually, wait — I should clarify something. The buffer doesn't "capture" audio in the sense of recording from the mic during playback. It's more like... you're always streaming audio to the API, right? So the buffer is just holding onto the last 2 seconds of that stream. When an interruption happens, those 2 seconds contain the beginning of the user's interruption. Does that make sense? I probably explained that badly. It's one of those things that's obvious once you understand it but impossible to describe clearly.
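Since words are failing me, here's a stripped-down sketch of the idea. This is not my production code: `RollingAudioBuffer` is a hypothetical class, it assumes each chunk arrives with a known duration, and how you re-send the flushed chunks is left out.

```javascript
// Keeps roughly the last `maxDurationMs` of outgoing audio chunks.
class RollingAudioBuffer {
  constructor(maxDurationMs = 2000) {
    this.maxDurationMs = maxDurationMs;
    this.chunks = []; // entries of { data, durationMs }, oldest first
    this.totalMs = 0;
  }

  // Call this with every chunk you stream to the API.
  push(data, durationMs) {
    this.chunks.push({ data, durationMs });
    this.totalMs += durationMs;
    // Drop the oldest chunks once we hold more than the window (always keep one).
    while (this.chunks.length > 1 && this.totalMs > this.maxDurationMs) {
      this.totalMs -= this.chunks.shift().durationMs;
    }
  }

  // Call this on interruption: returns the buffered audio and resets the window.
  flush() {
    const recent = this.chunks.map((chunk) => chunk.data);
    this.chunks = [];
    this.totalMs = 0;
    return recent;
  }
}

const buffer = new RollingAudioBuffer(2000);
for (let i = 0; i < 25; i++) buffer.push(`chunk-${i}`, 100); // 2.5s of 100ms chunks
const recent = buffer.flush();
console.log(recent.length, recent[0]); // 20 chunk-5
```

The strings stand in for real audio data. The point is only that the oldest half second has already been evicted by the time you flush.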

The "barge-in" nightmare

Here's something I learned the hard way at 2am on a Tuesday: response.cancel doesn't actually stop audio playback immediately. There's a race condition. The API sends audio chunks, you're receiving them over your WebSocket and queueing them for playback, user starts talking, you cancel, but there's still 200-300ms of audio in the pipeline.

So the user hears the AI keep talking for a split second after they've started speaking. It's jarring. It feels broken. It is broken.

My solution? I mute the audio output the MOMENT I detect speech, before even sending the cancel event. Yes, it's a hack. Yes, it works. No, I'm not proud of it. My codebase judges me silently every time I open that file.


// Dirty but effective
// I literally wrote a comment above this that says "sorry future me"
// gainNode is a GainNode from audioContext.createGain(), wired between the player and audioContext.destination
gainNode.gain.setValueAtTime(0, audioContext.currentTime);
// Then send cancel
ws.send(JSON.stringify({ type: "response.cancel" }));

The muting happens in like 2-3ms. The cancel takes 100-300ms. That gap matters. That gap is the difference between "smooth conversation" and "why is this thing broken."
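Here's the ordering as a small, testable sketch. `createInterruptionHandler`, `muteOutput`, `unmuteOutput` and `sendEvent` are hypothetical names I'm using for illustration. In a browser, `muteOutput` would be the gain trick above (ideally also dropping any queued audio) and `sendEvent` would wrap `ws.send`. It assumes you track whether a response is in flight via the response.created and response.done server events.

```javascript
function createInterruptionHandler({ muteOutput, unmuteOutput, sendEvent }) {
  let responseActive = false;

  return {
    onServerEvent(event) {
      switch (event.type) {
        case "response.created":
          responseActive = true;
          unmuteOutput(); // a fresh response should be audible again
          break;
        case "response.done":
          responseActive = false;
          break;
        case "input_audio_buffer.speech_started":
          if (responseActive) {
            muteOutput(); // local and near-instant, so do it first
            sendEvent({ type: "response.cancel" }); // needs a network round trip
            responseActive = false;
          }
          break;
      }
    },
  };
}

// Usage with stand-in functions that just log the order of operations.
const handler = createInterruptionHandler({
  muteOutput: () => console.log("mute"),
  unmuteOutput: () => console.log("unmute"),
  sendEvent: (event) => console.log("send", event.type),
});
handler.onServerEvent({ type: "response.created" });
handler.onServerEvent({ type: "input_audio_buffer.speech_started" });
```

Note the guard: if the user starts talking while the AI is silent, nothing gets muted or cancelled, so you don't fire off cancels for responses that don't exist.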

The thing that still keeps me up at night

Okay so here's a fun edge case that I still don't have a perfect solution for. Actually, "fun" is the wrong word. "Hair-pulling" is more accurate. "Soul-crushing" might be even closer.

What happens when: the user interrupts, the AI starts responding to the interruption, but the user was ACTUALLY interrupting to correct themselves, and now the AI is responding to the wrong thing entirely?

Like: User says "I need a flight to Denver—wait no, Chicago." The AI hears "Denver" and starts responding about Denver flights, but the user was correcting to Chicago. Now you've got the AI confidently talking about the wrong city while the user is getting increasingly frustrated. And the more the user tries to correct it, the more the AI doubles down on Denver because it keeps hearing fragments of its own response mixed with the user's corrections.

It's a cascading failure mode. It's beautiful in a horrifying way.

I don't have a perfect solution. I've tried a few things. What I'm doing now is using a "cool-down" period after interruptions where I buffer everything and only commit after 1.5 seconds of actual silence. Not just VAD silence, but confirmed end-of-utterance. It's... fine. It reduced my "wrong context" errors by about 60% but that remaining 40% still stings every time I see it in the logs.
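A simplified sketch of that cool-down, with the caveat that my real version is messier. `CooldownCommitter` is a hypothetical class, timestamps are passed in explicitly so it can be tested without real timers, and "confirmed end-of-utterance" is reduced here to "no new speech for 1.5 seconds", which is an assumption on my part about the simplest workable definition.

```javascript
class CooldownCommitter {
  constructor({ cooldownMs = 1500, onCommit }) {
    this.cooldownMs = cooldownMs;
    this.onCommit = onCommit;
    this.fragments = [];
    this.lastSpeechAt = null;
  }

  // Call for every piece of user speech heard after an interruption.
  addFragment(fragment, nowMs) {
    this.fragments.push(fragment);
    this.lastSpeechAt = nowMs; // any new speech restarts the cool-down
  }

  // Call periodically (for example from a timer). Returns true if it committed.
  tick(nowMs) {
    if (this.fragments.length === 0) return false;
    if (nowMs - this.lastSpeechAt < this.cooldownMs) return false;
    const pending = this.fragments;
    this.fragments = [];
    this.lastSpeechAt = null;
    this.onCommit(pending); // hand the whole utterance over in one piece
    return true;
  }
}

const committer = new CooldownCommitter({
  onCommit: (fragments) => console.log("commit:", fragments.join(" ")),
});
committer.addFragment("I need a flight to Denver", 0);
committer.addFragment("wait no, Chicago", 900);
console.log(committer.tick(1500)); // false, only 600ms since the correction
console.log(committer.tick(2400)); // true, after logging the full utterance
```

The Denver fragment never gets committed on its own, which is the whole point. The cost is an extra 1.5 seconds of latency after every interruption.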

I think the real solution involves some kind of semantic buffering where you're constantly re-evaluating whether the latest utterance supersedes the previous one. But that's a whole other rabbit hole I haven't gone down yet. Maybe next quarter. Maybe never. We'll see how my sanity holds up.

Real numbers from production (as of last week)

After 3 months and roughly 50,000 conversations:

6% doesn't sound great but honestly? It's manageable. The key is having a graceful fallback. "Sorry, I missed that — could you repeat?" goes a long way. Users are surprisingly forgiving when the AI admits it messed up. They're much less forgiving when it confidently plows ahead with the wrong thing.

What I wish the docs said

Instead of the happy-path demo that works in a quiet room with one person speaking perfect English at a measured pace, I wish the docs just came out and said:

"The Realtime API handles basic turn-taking. For anything resembling natural conversation, you need to implement interruption handling yourself. Here's a reference implementation, here are the edge cases, here's what will go wrong. Good luck. You're going to need it."

But no. We get the demo. The demo is a lie. The demo was recorded in ideal conditions that will never, ever exist in production.

TL;DR (because I wrote a novel)

Anyone else wrestling with this? I've seen some threads about using WebRTC instead of WebSockets for lower latency — curious if anyone's actually gotten that working in production. I tried it for like a weekend and the tooling was a nightmare, but maybe I was doing it wrong. Probably was. That whole weekend is a blur of STUN/TURN server configs and regret.

Also, if anyone from OpenAI is reading this (lol, as if), please add a response.interrupt event that actually works synchronously. Please. I'm begging you. I'll buy you coffee. I'll name my firstborn "Sam."

Edit: Thanks for the gold! Since people are asking — yes, I'll share my buffer management code in a gist. Give me a day to clean it up. It's currently held together with console.logs and shame. There are comments like "// idk why this works but don't touch it" and "// if you remove this everything breaks, I'm serious"

Edit 2: Several people asked about WebRTC. I did try it briefly in January — latency is definitely better, like noticeably better, but the tooling is a nightmare and debugging WebRTC issues made me want to switch careers. Stuck with WebSockets for now. If someone has a good WebRTC + Realtime API setup, please DM me. I'll buy you a beer.

Edit 3: Someone asked what TTS voice I'm using. Shimmer, obviously. Is there even another choice? If you're using anything else we can't be friends.

What's your experience with voice agent interruptions? Have you found a better approach? Drop a comment — I'm genuinely desperate for solutions to that last 6%.

#voiceai #openai #realtimeapi #conversationalai #webdev #speechrecognition

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
