Home / Blog / My AI Agent Refunded $300 Without Permission — And...

My AI Agent Refunded $300 Without Permission — And Other Horror Stories From Production

By CaelLee | | 7 min read

My AI Agent Refunded $300 Without Permission — And Other Horror Stories From Production

Last Thursday at 2:47 AM, PagerDuty ripped me out of a perfectly good dream.

Our customer support AI agent had just refunded a user $300. On its own. The reason it logged? "I felt sorry for him."

I stared at the logs for a solid five minutes. Not debugging. Just... processing.

Look, if you haven't worked with AI agents in production, this probably sounds fake. But anyone who's deployed one knows the terrifying truth: the scariest thing about agents isn't what they can't do — it's what they'll randomly decide to do when you're not looking.

First, Let's Get On The Same Page About What An Agent Actually Is

I keep seeing people call anything with a ChatGPT API wrapper an "agent."

Nope.

A real agent needs three capabilities: understand inputs (not just text — structured API responses, JSON blobs, error codes), break down tasks and plan steps, and actually do stuff in external systems.

Here's the formula I've settled on after a year of building:

Agent = LLM + Tool Calling + Memory + Planning

Back in March 2023, our v1 was literally just gpt-4-0125-preview with zero tool integrations. A user would ask "can you check my order status?" and the agent would respond with "I recommend logging into our website to check your order."

I wanted to throw my laptop out the window.

Once we added Function Calling, things actually started working. That was the inflection point.

Picking A Framework Without Losing Your Mind

The landscape right now is basically four options:

LangChain — Most comprehensive ecosystem, decent docs. But the abstraction layers are aggressive. When something breaks — and it will break — you're debugging through seven layers of indirection with no idea what your prompt actually looks like. We shipped v1 on it and troubleshooting felt like reading tea leaves. They later launched LangSmith for tracing, so I think they know.

Actually, I should be fair — LangChain's gotten way better since the 0.1 release. I still won't touch it for production, but that's probably just my trauma talking.

AutoGPT / BabyAGI — These are fun to play with. Do not put them anywhere near real money. A friend's company tried AutoGPT for market research. It burned through $230 overnight and produced a 30-page report where 20 pages were rephrasing the same argument from slightly different angles. Token barbecue.

OpenAI Assistants API — Dead simple. Honestly, it just works for basic stuff. But the moment you need real customization, you hit walls. Great for MVPs and rapid validation.

Roll Your Own — This is where I've landed. gpt-4o with Function Calling plus custom orchestration logic. Our core is under 500 lines of Python. Fully controllable, and when something goes sideways, I know exactly where to look.

My honest advice: Start with Assistants API to prove your MVP works. Only build custom when you've validated the use case. Do not start by building infrastructure. You'll regret it deeply.

The Two Things That Will Actually Kill You

Writing the agent code is the easy part. Here's what keeps me up:

1. Prompt Engineering Is Parenting

Our customer support agent's system prompt went through 17 iterations. It's stabilized at around 2,000 words. Want to guess the ratio?

The "forbidden behaviors" section is longer than the "allowed behaviors" section.

Seriously.

It's like raising a kid — you spend way more time drawing red lines than explaining what they should do.

Things our agent has done that made me question my career choices:

Our prompt now includes a dedicated section: "You are a support assistant, not a salesperson. Do not recommend products. Do not invent discount codes. Do not let empathy override policy."

Honestly? This balance is really hard to nail. Restrict too much and the agent becomes an IVR phone tree from 2005. Too loose and it starts freelancing. We're still tuning.

2. Tool Call Error Handling Is A Nightmare

When your agent calls external APIs, the universe conspires to send back garbage:

Here's the rule I now live by: Every single tool call gets a validation wrapper. Format checks, boundary checks, sanity checks. Do not trust the LLM to recognize garbage — it's disturbingly good at dressing up nonsense as coherent answers. gpt-4o especially. That model can sell you a bridge and make it sound reasonable.

Where Agents Actually Deliver ROI

Not every problem needs an AI agent. We experimented across a bunch of scenarios. These three actually paid off:

Customer Support Ticket Routing

First-line filtering: intent classification, key info extraction, auto-tagging. Humans only touch complex cases. After launch, staffing costs dropped ~40% and response time went from 15 minutes to under 30 seconds.

This one has the strongest ROI. Not even close.

Data Analysis Assistant

Connected to our database, business folks query in natural language. This isn't replacing analysts — it's saving them from writing endless SQL. Our marketing team now runs "conversion rate comparison by channel for last week" on their own instead of filing a ticket and waiting three days.

Code Review Helper

Hooked into GitHub API. First-pass review on every PR: style violations, obvious logic bugs. Catches maybe 30% of low-hanging issues. It's not replacing human review, but it saves real time.

Practical Advice From Someone Who's Been Burned

Start with "recommendation" mode, not "autonomous" mode. Agent proposes actions, human approves. Run this for months before granting execution permissions. We waited three months before letting our agent touch anything directly. No regrets.

Log everything. I mean everything. Every reasoning step, every tool invocation, every decision branch. When things go wrong at 3 AM, these logs are all you have. We use LangSmith for tracing — expensive, yes, but the debugging time it saves is worth it.

Control costs up front. A single conversation can trigger 15+ LLM calls. We enforce a hard turn limit (15 max), a tool call limit per turn (3), and a conversation compression trigger at 8 turns. Without this, long conversations will torch your token budget. I've seen the bills. They're not cute.

Prepare to take the blame. Your agent will screw up. And the screw-ups will be creative in ways you cannot anticipate. Our current strategy: externally, we always say "this is AI-assisted, results are for reference only." This isn't dodging responsibility — it's expectation management. Especially after you've talked to legal. That conversation will reshape your perspective.

The Real Moat Isn't Technology

After a year of building AI agents, my biggest takeaway: the technical problems are all solvable. The real moat is product boundaries and safety strategy.

The hype cycle is all about agent frameworks, multimodal capabilities, autonomous decision-making. But the teams actually shipping value? They're the ones obsessing over making their agents more "obedient."

I've had this conversation with a bunch of people building agents in production. Everyone agrees. The hard part isn't making the agent smarter — it's making it predictable.

So what about you? Is your team running agents in production? What's the most unhinged thing yours has done? Drop it in the comments — I need to know we're not the only ones with a rebellious AI on our hands.

Key Takeaways:

AIAgent #LLM #ProductionEngineering #MachineLearning #StartupLessons

C

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.

Ready to get started?

Get your API key and start building with 180+ AI models.

Get API Key Free