I Replaced My $3,200/Month VA With an AI Agent — Here's What Actually Happened

Last Tuesday at 3 AM, I woke up in a cold sweat because my customer support agent had been hallucinating refund policies to 47 customers. Forty-seven.

I know the exact number because the OpenAI Agents SDK logs every single interaction with terrifying precision. Down to the millisecond. It's both beautiful and horrifying.

That's the thing about building production AI agents nobody talks about on Twitter — when they break, they break at scale. Not one customer getting dodgy info. Forty-seven. All at once. While you're sleeping.

I've spent the last four months rebuilding ReplyPilot's core automation engine using the new OpenAI Agents SDK (v0.1.6, if you're curious). Before this, I was burning £2,500/month on a virtual assistant team in the Philippines to handle customer onboarding and basic support tickets. They were great people — Maria especially, she'd been with me since the $2k MRR days. But scaling humans linearly with revenue isn't exactly the indie hacker dream Pieter Levels preaches.

Here's the unfiltered journey of shipping an agent that actually works in production. Not in a demo video. Not in a Twitter thread with 10k likes. Actually works.

The "Before" State (October 2024)

My stack was a Frankenstein monster. I'm not being dramatic:

Custom Python scripts calling the OpenAI API directly (written at 2 AM over three different weekends)
A janky state machine I built myself that I was weirdly proud of but also terrified to touch
Redis for conversation memory that kept filling up because I forgot to set TTLs. Twice.
47% of conversations needed human handoff anyway

Metrics that hurt:

Average response time: 14 minutes
Customer satisfaction: 3.8/5
Monthly cost: £2,500 fixed + £320 API credits
My sanity: negative

I knew I needed to rebuild. Kept putting it off because "it works, mostly." Classic indie hacker trap. You know the one.

Why I Bet on the Agents SDK (Instead of LangChain)

I tried LangChain last year. Actually, wait—I should clarify that I tried LangChain and LangGraph. Two different attempts. Harrison Chase is brilliant, genuinely, but the abstraction layers made me feel like I was debugging someone else's magic tricks. When something broke, I had to understand LangChain's internals, not just my own logic. That's... not great at 11 PM on a Saturday.

The OpenAI Agents SDK clicked for me because it's thin. Almost too thin, honestly. It gives you:

Agents with defined instructions and tools
Handoffs between specialised agents
Guardrails that run on every input/output
Tracing that actually makes sense

No black box orchestration layers. Just primitives I can compose. Or break. Probably both.

Here's what my agent architecture looks like now:


Triage Agent (classifies intent)
 ├── Onboarding Agent (product setup, tutorials)
 ├── Support Agent (troubleshooting, bugs)
 │    └── Refund Agent (only triggered for cancellation requests)
 └── Sales Agent (pricing questions, upgrades)

Each agent has its own system prompt, tool set, and guardrails. The handoff mechanism means the Onboarding Agent never accidentally promises a refund, and the Refund Agent never tries to upsell someone who's angry. That last one was a real problem with the old system, by the way. Nothing like getting a "Have you considered our annual plan?" email right after you've demanded your money back.
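To make the boundary claim concrete, here's a minimal plain-Python model of the tree above. It is not the SDK's API and not the actual ReplyPilot code. The `HANDOFFS` dict and `can_reach` helper are hypothetical names, used only to show that an agent can route only along the edges you declare.

```python
# Hypothetical model of the handoff tree; not the Agents SDK API.
HANDOFFS = {
    "triage": ["onboarding", "support", "sales"],
    "onboarding": [],
    "support": ["refund"],
    "refund": [],
    "sales": [],
}


def can_reach(source: str, target: str) -> bool:
    """Return True if `target` is reachable from `source` via declared handoffs."""
    stack = list(HANDOFFS[source])
    seen = set()
    while stack:
        agent = stack.pop()
        if agent == target:
            return True
        if agent not in seen:
            seen.add(agent)
            stack.extend(HANDOFFS[agent])
    return False


print(can_reach("triage", "refund"))      # True, via the Support Agent
print(can_reach("onboarding", "refund"))  # False, no declared path
print(can_reach("refund", "sales"))       # False, so no upsell after a refund
```

The point of the design is that "who can talk about refunds" becomes a property of the graph rather than something each prompt has to remember.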

The Build Timeline (With Real Stumbles)

Week 1-2: Proof of Concept

I migrated the simplest flow first: onboarding new users. The SDK's @function_tool decorator made it stupid easy to connect my existing database functions. Almost too easy. I kept waiting for the catch.

First win: The agent correctly walked a user through connecting their Gmail account in 12 steps without getting lost. My old system would've given up at step 4. I literally screenshotted the conversation and sent it to my founder friends at 1 AM.

First failure: I didn't set max_turns and one agent got stuck in a loop asking the customer "Can you clarify that?" for 23 minutes. Twenty-three. The customer thought it was performance art. I wish I was joking.
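The idea behind a turn cap can be sketched in plain Python. This is an illustration of the technique, not the SDK's implementation. `run_with_turn_cap`, `TurnLimitExceeded`, and `stuck_agent` are hypothetical names, and `agent_step` stands in for one agent turn.

```python
class TurnLimitExceeded(RuntimeError):
    """Raised when a conversation runs past its turn budget (hypothetical name)."""


def run_with_turn_cap(agent_step, user_message: str, max_turns: int = 10) -> str:
    """Call `agent_step` until it returns a final answer or the cap is hit.

    `agent_step` is any callable taking the latest message and returning
    (reply, is_final). It stands in for one agent turn.
    """
    message = user_message
    for _ in range(max_turns):
        reply, is_final = agent_step(message)
        if is_final:
            return reply
        message = reply
    raise TurnLimitExceeded(f"no final answer after {max_turns} turns")


def stuck_agent(message: str):
    # Simulates the failure mode: always asks for clarification.
    return "Can you clarify that?", False


try:
    run_with_turn_cap(stuck_agent, "My Gmail won't connect", max_turns=5)
except TurnLimitExceeded as exc:
    print(f"Escalating to a human: {exc}")
```

Catching the limit and handing off to a human is one reasonable choice; the important part is that the loop has a hard ceiling at all.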

Week 3-4: Guardrails Save My Arse

This is where the SDK earned its keep. I added output guardrails that check every agent response before it reaches the customer:

No promising features that don't exist (checks against a features.json file I update manually)
No hallucinated pricing (validates against Stripe API in real-time)
No toxic language (standard content filter, nothing fancy)
No PII leakage (regex + entity recognition, pretty basic stuff)

The first day I turned these on, the guardrails blocked 12 responses. Twelve. That's twelve times my agent would've said something wrong. One of them was about to tell a customer we have an "enterprise plan with SSO". We don't. We're a $10k MRR bootstrapped SaaS. We have Google login and prayers.
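A rough sketch of what two of those checks might look like, in plain Python rather than the SDK's guardrail interface. The feature sets below are made-up stand-ins for features.json, the email regex is deliberately basic, and `check_output` is a hypothetical helper name.

```python
import re

# Hypothetical stand-ins: in the post these would come from features.json.
SUPPORTED_FEATURES = {"google login", "gmail integration"}
WATCHED_FEATURE_TERMS = {"google login", "gmail integration", "sso", "enterprise plan"}

# Very basic email detector; real PII checks would need more than this.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_output(reply: str) -> list[str]:
    """Return a list of guardrail violations; an empty list means the reply may be sent."""
    violations = []
    lowered = reply.lower()
    # Flag any watched feature term that is not on the supported list.
    for term in sorted(WATCHED_FEATURE_TERMS - SUPPORTED_FEATURES):
        if re.search(rf"\b{re.escape(term)}\b", lowered):
            violations.append(f"unsupported feature mentioned: {term}")
    if EMAIL_PATTERN.search(reply):
        violations.append("possible PII: email address")
    return violations


print(check_output("Yes, we have an enterprise plan with SSO."))
print(check_output("You can sign in with Google login."))
```

The first call reports both "enterprise plan" and "sso"; the second returns an empty list. A keyword check like this only catches features you thought to watch for, which is why it pairs well with reviewing blocked and handed-off conversations.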

Week 5-6: The Tracing Revelation

The SDK's tracing dashboard exposed something I never would've found otherwise. It's basically a tree of agent calls with latency for each node. Beautiful. Depressing.

My Support Agent was calling the knowledge base search tool 3-4 times per query because my initial prompt said "search thoroughly." I thought I was being helpful. I was being expensive. It was spending $0.08 per conversation just on redundant searches. Across 2,000 conversations/month, that's $160 in wasted API calls.

Changed the prompt to "search once with the most specific query possible." Cost dropped to $0.02 per conversation.

Small prompt tweaks compound hard at scale. I think about this constantly now.
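If you export trace data, spotting this kind of redundancy is a few lines of counting. The sketch below assumes a flattened list of (conversation_id, tool_name) records, which is an invented shape and not the SDK's trace format. `redundant_calls` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical flattened trace records: (conversation_id, tool_name).
# Real tracing data would have more fields; this shape is an assumption.
tool_calls = [
    ("c1", "kb_search"), ("c1", "kb_search"), ("c1", "kb_search"),
    ("c2", "kb_search"), ("c2", "kb_search"), ("c2", "kb_search"), ("c2", "kb_search"),
    ("c3", "kb_search"),
]


def redundant_calls(records, tool: str, expected_per_conversation: int = 1) -> dict:
    """Count calls to `tool` beyond the expected number, per conversation."""
    counts = Counter(conv for conv, name in records if name == tool)
    return {
        conv: n - expected_per_conversation
        for conv, n in counts.items()
        if n > expected_per_conversation
    }


print(redundant_calls(tool_calls, "kb_search"))  # {'c1': 2, 'c2': 3}
```

Conversations that search once (like c3 here) drop out of the result, so whatever is left is the list worth reading in the trace viewer.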

The Numbers After 60 Days in Production

Current metrics (December 2024):

Metric	Before	After	Change
Monthly cost	£2,820	£535	-81%
Response time	14 min	22 sec	-97%
CSAT score	3.8/5	4.4/5	+16%
Human handoff rate	47%	12%	-74%
Refund requests	23/mo	18/mo	-22%

The refund drop surprised me. Turns out when an agent responds in 22 seconds instead of 14 minutes, angry customers don't have time to stew in their frustration and escalate to "just give me my money back." Who knew speed was a feature?

Cost breakdown:

OpenAI API (GPT-4o for triage, GPT-4o-mini for everything else): £330/mo
Tracing/storage: £47/mo
Server (Railway): £158/mo
Total: £535/mo vs £2,820 before

I'm saving roughly £2,285/month. That's just over £27,400/year in extra profit. As a solo founder, that's... well, it's everything.
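For anyone who wants to check the arithmetic, here it is as a short script. All inputs are the figures quoted in this post; `pct_change` is just a helper name chosen for the example.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`, rounded to one decimal place."""
    return round((after - before) / before * 100, 1)


before_cost = 2500 + 320      # VA team + API credits, in GBP per month
after_cost = 330 + 47 + 158   # API + tracing/storage + server, in GBP per month

print(before_cost, after_cost)               # 2820 535
print(pct_change(before_cost, after_cost))   # -81.0
print(pct_change(14 * 60, 22))               # response time in seconds: -97.4
print((before_cost - after_cost) * 12)       # yearly saving in GBP: 27420
```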

What I'd Do Differently

1. Start with the evaluation framework, not the agent.

I built the agent first, then scrambled to create test cases. The SDK has an eval module I ignored until week 4. Don't be me. Write 50 test conversations before you write a single line of agent code. I know everyone says this. I ignored it. You probably will too. But at least feel bad about it.
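A test-first setup doesn't need much machinery. The sketch below is a plain-Python harness, not the SDK's tooling. `EvalCase`, `run_evals`, and `fake_agent` are hypothetical names, and the "90-day refund" reply is an invented example of a hallucinated policy.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One scripted customer message plus what the reply must and must not contain."""
    prompt: str
    must_include: list = field(default_factory=list)
    must_not_include: list = field(default_factory=list)


def run_evals(agent, cases):
    """Run every case through `agent` (any callable str -> str) and collect failures."""
    failures = []
    for case in cases:
        reply = agent(case.prompt).lower()
        missing = [p for p in case.must_include if p.lower() not in reply]
        forbidden = [p for p in case.must_not_include if p.lower() in reply]
        if missing or forbidden:
            failures.append((case.prompt, missing, forbidden))
    return failures


def fake_agent(prompt: str) -> str:
    # Placeholder for the real agent call; deliberately wrong on refunds.
    if "refund" in prompt.lower():
        return "Sure, we offer a 90-day refund on all plans."
    return "You can connect Gmail from Settings."


cases = [
    EvalCase("How do I connect Gmail?", must_include=["settings"]),
    EvalCase("Can I get a refund?", must_not_include=["90-day"]),
]
print(run_evals(fake_agent, cases))  # [('Can I get a refund?', [], ['90-day'])]
```

Substring checks are crude, but fifty of them written up front would catch a made-up refund policy before a customer ever sees it.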

2. Set tighter guardrails from day one.

I was so excited about the agent "working" that I shipped with minimal guardrails. Those 47 hallucinated refund policies? That happened in the first 48 hours because I didn't have output validation on the Support Agent. The fix took 20 minutes. The customer trust erosion took weeks. I'm still recovering, honestly.

3. Monitor cost per conversation obsessively.

I set up a daily Slack alert now: total conversations, average cost per conversation, number of guardrail triggers. If cost per conversation jumps 20% day-over-day, something's wrong. Usually it's a prompt that got too verbose or a tool that's being called redundantly. I catch it fast now.
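The day-over-day check itself is tiny. Here's a sketch with the Slack posting left out, since that part depends on your webhook setup. `cost_alert` and `average_cost` are hypothetical helper names and the sample totals are made up.

```python
def cost_alert(yesterday_avg: float, today_avg: float, threshold: float = 0.20):
    """Return an alert message if average cost per conversation rose more than `threshold`."""
    if yesterday_avg <= 0:
        return None  # no baseline to compare against
    change = (today_avg - yesterday_avg) / yesterday_avg
    if change > threshold:
        return f"Cost per conversation up {change:.0%} day-over-day"
    return None


def average_cost(total_cost: float, conversations: int) -> float:
    """Average cost per conversation; 0.0 when there were no conversations."""
    return total_cost / conversations if conversations else 0.0


# Made-up daily totals for illustration.
yesterday = average_cost(total_cost=9.00, conversations=60)
today = average_cost(total_cost=13.50, conversations=66)
print(cost_alert(yesterday, today))  # Cost per conversation up 36% day-over-day
```

Comparing averages rather than totals matters here: a busy day raises total spend without meaning anything is wrong, while a jump in the per-conversation figure usually points at a prompt or tool change.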

4. Keep a human in the loop for edge cases.

My handoff rate is 12% now, and I review every single one of those conversations personally. It takes 15 minutes/day and I've caught three systemic issues I could fix at the prompt level. At $10k MRR, I can afford 15 minutes. At $50k MRR, I'll hire someone. But I'll never go full-auto. Probably.

The Part Nobody Talks About

Building AI agents is emotionally weird.

When my VA team handled support, a customer complaint felt like a process issue. When my agent messes up, it feels personal — like my code let someone down. I've had to consciously separate "the agent made a mistake" from "I'm a failure." Still working on that, tbh.

I also feel guilty about the cost savings. Those VAs were real people who did good work. Maria had been with me since the $2k MRR days. I gave them two months' notice and helped one find another client, but the whole "AI replacing jobs" thing hits different when you're the one doing it.

Pieter Levels talks about this in his book — the indie hacker path isn't always morally clean. I don't have a tidy answer. I just try to be honest about it.

Well... that's complicated.

Should You Use the Agents SDK?

Yes, if:

You're already in the OpenAI ecosystem
You need composable agents with clear boundaries
You value debuggability over abstraction
You're shipping to production, not just prototyping

No, if:

You need multi-model support (Claude, Gemini, etc.)
You want a visual workflow builder
You're doing research/experimentation, not production

For me, it was the right call. The thin abstraction layer means I understand exactly what's happening, and when things break, I can fix them without reading framework source code. Most of the time.

TL;DR

Replaced a £2,500/month VA team with an AI agent built on OpenAI's Agents SDK
Monthly costs dropped 81% (£2,820 → £535)
Response times went from 14 minutes to 22 seconds
Customer satisfaction improved from 3.8 to 4.4/5
Human handoff rate fell from 47% to 12%
The SDK's guardrails and tracing are genuinely useful — not just marketing fluff
Start with evaluation tests, not the agent itself (I learned this the hard way)
Keep humans in the loop for edge cases — 12% handoff rate feels about right

Current focus: Building an agent that proactively reaches out to users who haven't completed onboarding. The SDK's handoff mechanism makes this surprisingly clean — it's just another specialised agent that the Triage Agent can route to. I think it'll work. We'll see.

If you're building agents in production, I'd love to hear what's working for you. Especially if you've solved the "agent confidently lies about your product" problem better than I have. Because honestly, I'm still figuring that one out.

What's your handoff rate between agents and humans? Drop your numbers below — I'm genuinely curious if my 12% is good or if I'm still over-engineering this. Probably the latter.

Tags: #buildinpublic #ai #openai #agents #bootstrapping #saas #indiehacker

Emma builds ReplyPilot AI, an automated customer engagement tool for SaaS founders. She's been bootstrapping since 2022 and shares unfiltered revenue reports every month. Follow the journey at replypilot.ai



Cael Lee
