Why We Burned a Sprint on Silence: Tuning OpenAI’s Realtime VAD (And Saved a £200K Deal)

By Cael Lee | 7 min read

Last month, our voice agent interrupted a key client’s CFO mid-sentence. Twice. During a live demo.

You know that sickening moment when you're watching a demo implode and there's absolutely nothing you can do? I was sat there, coffee going cold, watching our CRO's face drain of colour. Yeah. That.

That technical embarrassment nearly cost us a £200K contract. Actually—let me be precise. It didn't kill the deal outright. But it pushed the close date back by at least two quarters. The CFO literally said, "Call me when it works." Then he hung up. On his own bloody demo.

I've been turning this over in my head for weeks. As an engineering lead, I usually write about team topology or scaling Node.js services. But today, I'm getting my hands dirty with a specific technical challenge that nearly derailed our latest product launch: configuring Voice Activity Detection in OpenAI's Realtime API. Specifically, the gpt-4o-realtime-preview-2024-10-01 snapshot.

Here's the thing. If you're building conversational AI, you know the standard models are good. But "good" isn't good enough for enterprise. Not even close. You need precision, and you need it at 3pm on a Thursday when someone's running a leaf blower outside the prospect's window.

TL;DR

The Default Trap

The default VAD in the Realtime API is aggressive. Like, really aggressive. It's optimised for latency—makes sense, nobody wants a sluggish bot—but it interprets a brief pause for thought as the end of a turn. In a quiet office, that's fine. Whatever.

But in a car? A trade show floor? A room with that one vent that rattles every time the HVAC kicks on? It's a disaster.

I sat with my audio engineers last sprint to analyse 50 hours of failed calls. We pulled every single one from the week of 4 November. The data was... well, it was humbling.

We weren't just losing words. We were losing context. And in business, lost context equals lost revenue. Not exactly a hot take, I know. But seeing it in the numbers hits different.

We needed to shift from detecting sound to detecting intent. That's the real challenge, isn't it? The space between the words.

The Configuration Playbook

OpenAI gives you parameters, not magic wands. I think that's the part a lot of teams miss. They flip a few switches, see the same results, and then complain on Twitter about how the API isn't production-ready.

The art is in tuning. Against your environment. Not the demo environment, not the quiet conference room where everyone's being polite.

Here are the three levers we pulled.

1. Silencing the "Phantom Interrupt"

The turn_detection.threshold parameter. This one is straightforward on paper—controls sensitivity to audio energy, scale of 0 to 1—but the defaults are way too jumpy.

We lowered it from 0.5 to 0.3 for our call-centre clients. Took three days of A/B testing to settle on that number. Not glamorous work. Just grinding through audio files and marking timestamps.

Sarah did this thing where she played a recording of a truck backing up over our test rig. At 0.5, the agent stopped mid-sentence and said, "I'm sorry, I didn't catch that." At 0.3, it ignored the beeping entirely. I still have that clip saved on my desktop. It's called beep-beep-victory.wav.

The metric: 40% reduction in false-positive interruptions. That's in environments with steady-state background noise at 45-60 dB. Open-plan offices, basically. The places most of our users actually work.
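For anyone who hasn't touched this setting yet, here is a minimal sketch of how a threshold change could be sent to the Realtime API as a session.update event. The helper name buildVadSessionUpdate and its camelCase option names are made up for illustration; the event shape and the turn_detection fields are the documented ones. One caveat worth testing on your own recordings: OpenAI's docs describe a higher threshold as requiring louder audio to activate, so tune in both directions rather than assuming our number transfers.

```javascript
// Hypothetical helper (not part of any SDK): builds a session.update event
// that carries server-side VAD settings for the Realtime API.
function buildVadSessionUpdate({
  threshold = 0.5,          // documented default
  prefixPaddingMs = 300,    // documented default
  silenceDurationMs = 500,  // documented default
} = {}) {
  if (threshold < 0 || threshold > 1) {
    throw new RangeError('threshold must be between 0 and 1');
  }
  return {
    type: 'session.update',
    session: {
      turn_detection: {
        type: 'server_vad',
        threshold,
        prefix_padding_ms: prefixPaddingMs,
        silence_duration_ms: silenceDurationMs,
      },
    },
  };
}

// Usage: the value we settled on for call-centre clients.
const thresholdUpdate = buildVadSessionUpdate({ threshold: 0.3 });
// ws.send(JSON.stringify(thresholdUpdate)); // ws is assumed to be an open Realtime WebSocket
```

Keeping the builder separate from the socket makes it easy to A/B different values against the same recordings.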

2. Honouring the Pause

This was our "CFO killer" fix. No question.

turn_detection.prefix_padding_ms. Default is 300ms. For most people, that's fine. For executives who speak with... deliberate... thoughtful... pauses? It's a nightmare. You sound disrespectful. You sound rushed. You sound like you weren't listening.

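We extended it to 800ms. Strictly speaking, prefix_padding_ms sets how much audio before the detected start of speech is kept. The setting that decides how long a pause must last before the turn is considered over is turn_detection.silence_duration_ms (default 500ms), so tune the two together if your speakers pause a lot.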
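Rather than guessing a window, one option is to measure how long people actually pause mid-turn and pick a percentile. The sketch below assumes you have already labelled mid-turn pause lengths (in milliseconds) from your own recordings; percentile and suggestSilenceWindowMs are hypothetical helper names, and the nearest-rank method is just one reasonable choice.

```javascript
// Nearest-rank percentile over a list of numbers.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no values');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, Math.min(sorted.length - 1, rank))];
}

// Suggest a pause window that covers p% of observed mid-turn pauses,
// never going below a floor (500ms here, the documented silence_duration_ms default).
function suggestSilenceWindowMs(pausesMs, p = 95, floorMs = 500) {
  return Math.max(floorMs, percentile(pausesMs, p));
}

// Usage with made-up pause lengths, for illustration only.
const samplePausesMs = [200, 350, 400, 650, 780, 300, 500, 720, 250, 610];
const suggestedWindowMs = suggestSilenceWindowMs(samplePausesMs);
```

The trade-off is the one described above: a higher percentile interrupts fewer people and makes the bot feel slower to respond.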

And yeah, it adds a tiny bit of perceived latency. My CEO asked about this. I told him: I'd rather explain 200ms of processing time than apologise for an interruptive bot. He nodded. That was the end of the meeting.

We tracked sentiment post-deployment—just a basic VADER analysis on the transcript—and "User Frustration Score" dropped 22%. That's not a vanity metric. That's people not yelling at our product.

Don't optimise for the demo. Optimise for the human. Took me 15 years in this industry to really internalise that.

3. Going Custom with `create_response`

Okay, this one gets a bit wild.

For our legal tech tool—used in these echo-heavy conference rooms with terrible acoustics—we bypassed server-side VAD entirely. We switched to client-side audio chunking, sending conversation.item.create events manually. Only when our custom model detected a semantic break.
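To make the manual flow concrete, here is a rough sketch of the events involved once server-side VAD is switched off. It assumes mono Float32 samples already resampled to the rate the session expects (the Realtime API's pcm16 format is 24 kHz), and it relies on the global btoa available in browsers and recent Node versions. floatToPcm16Base64 and buildManualTurnEvents are hypothetical names, not SDK functions.

```javascript
// Convert Float32 samples in [-1, 1] to base64-encoded 16-bit PCM.
// Byte order follows the platform, which is little-endian on mainstream hardware.
function floatToPcm16Base64(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  const bytes = new Uint8Array(pcm.buffer);
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Build the events for one user turn when turn detection is disabled:
// add the audio as a user message, then explicitly ask for a response.
function buildManualTurnEvents(samples) {
  return [
    {
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{ type: 'input_audio', audio: floatToPcm16Base64(samples) }],
      },
    },
    { type: 'response.create' },
  ];
}

// Sent once at session start so the server stops deciding turns for you.
const disableServerVad = { type: 'session.update', session: { turn_detection: null } };
```

With turn detection set to null, nothing happens until the client sends response.create, which is what puts the "is this person done?" decision entirely in your hands.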

The stack: a lightweight ONNX model running directly in the browser. Nothing fancy. Think we used the Silero VAD model, exported it to ONNX, and ran it on the audio track before it ever touched OpenAI's servers.


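// Simplified example of our client-side VAD setup.
// Tensor names (input, state, sr) and the [2, 1, 128] state shape assume the
// Silero VAD v5 ONNX export; other exports use different inputs, so check yours.
import * as ort from 'onnxruntime-web';

const vadSession = await ort.InferenceSession.create('./silero_vad.onnx');
// extractAudioBuffer is our own helper; it must return a Float32Array of mono samples
const audioChunk = extractAudioBuffer(mediaStream);

const result = await vadSession.run({
  input: new ort.Tensor('float32', audioChunk, [1, audioChunk.length]),
  // State is zeroed here; in a streaming loop, feed the returned state back in
  state: new ort.Tensor('float32', new Float32Array(2 * 128), [2, 1, 128]),
  sr: new ort.Tensor('int64', BigInt64Array.from([16000n]), [1])
});

if (result.output.data[0] > 0.7) {
  // Send to OpenAI only when speech probability is high
  ws.send(JSON.stringify({
    type: 'conversation.item.create',
    item: { /* audio data */ }
  }));
}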
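A single frame's probability is not a turn boundary on its own, so something has to turn a stream of per-frame scores into "speech started" and "speech ended" decisions. The sketch below is one simple way to do that with a hangover timer. It is an illustration, not our production segmenter (which also looked at pitch), and the class name, option names, and default values are assumptions.

```javascript
// Turns per-frame speech probabilities into start/end events.
// A turn ends only after hangoverMs of continuous non-speech.
class TurnSegmenter {
  constructor({ speechThreshold = 0.7, frameMs = 32, hangoverMs = 800 } = {}) {
    this.speechThreshold = speechThreshold;
    this.frameMs = frameMs;
    this.hangoverMs = hangoverMs;
    this.inSpeech = false;
    this.silenceMs = 0;
  }

  // Feed one frame's probability; returns 'speech_start', 'speech_end', or null.
  push(probability) {
    if (probability >= this.speechThreshold) {
      this.silenceMs = 0; // any speech frame resets the pause timer
      if (!this.inSpeech) {
        this.inSpeech = true;
        return 'speech_start';
      }
      return null;
    }
    if (this.inSpeech) {
      this.silenceMs += this.frameMs;
      if (this.silenceMs >= this.hangoverMs) {
        this.inSpeech = false;
        this.silenceMs = 0;
        return 'speech_end';
      }
    }
    return null;
  }
}

// Usage: buffer audio between 'speech_start' and 'speech_end',
// then send the buffered turn to the API in one go.
const segmenter = new TurnSegmenter({ frameMs: 32, hangoverMs: 800 });
```

A short pause that resumes before the hangover expires never produces a speech_end, which is the behaviour you want for slow, deliberate speakers.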

The fun part? We could factor in pitch contour. Not just "is there sound?" but "is this person trailing off like they're done, or holding the floor like they're thinking?" It's subtle. Most people don't notice they do it. But the model caught it.

This cost us two full sprints. 3 February through 28 February. Not cheap. But it reduced our server-side token consumption by 15%. We weren't sending dead air to the API anymore. Just actual speech.

ROI-wise, I think it paid for itself in two months. Probably less.

Is It Worth the Investment?

I ask myself this constantly. Should we just wait for the model to get better? OpenAI ships fast. Maybe the next release fixes all of this out of the box.

Maybe.

But here's where I land: we're a Series B startup. Controlling the user experience is our only moat. We can't out-scale Google. We can't out-spend Microsoft. But we can out-design them. We can care about the details they're too big to bother with.

The Realtime API is a raw material. It's lumber, not a finished house. The value we bring—the value any product team brings—is in refining that material. Sanding the edges. Making it feel right.

What I'd Tell Other Engineering Leaders

A few things I've learned, some of them the hard way:

We turned a £200K loss into a £500K upsell by learning to listen better. Both to our customers, and to the silence between their words.

Honestly, that's the whole job.

I'm curious—how are you handling the last mile of voice UX? Server-side VAD? Custom client models? Something I haven't seen yet? Drop it in the comments. I'm always looking to steal good ideas.

Last Tuesday I tested our latest VAD config while my neighbour was mowing his lawn. The bot didn't flinch. I nearly cried.

#OpenAI #VoiceAI #EngineeringLeadership #ConversationalAI #RealtimeAPI


Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
