We Let Our AI Process a $4.70 Refund and It Tried to Send $47,000: A Function Calling Post-Mortem
Last month, our AI agent autonomously processed a $47,000 refund.
It should've been $4.70.
I still remember the exact Slack message from our payments lead. "Mike, we have a problem. A big one." The root cause wasn't a bug in the traditional sense—it was a hallucination in our LLM's function-calling layer, specifically with GPT-4-turbo (we were on version gpt-4-0125-preview at the time). As engineering leaders, we're all racing to integrate generative AI into production workflows, but this incident was a brutal wake-up call. Deterministic safeguards aren't optional. They're the only thing standing between you and a very uncomfortable board meeting.
I'm sharing this technical post-mortem not to scare anyone away from LLMs—honestly, I think they're incredible—but to push for what I've started calling "defensive architecture" around them. When you give an LLM the power to execute functions, especially ones touching money or PII, you're not managing a chatbot anymore. You're managing a junior engineer with infinite confidence and zero real-world accountability. And that engineer doesn't care about your SOC 2 compliance.
The Incident: A 10,000x Error
It happened on a Tuesday. March 12th, around 2:30 PM ET. We use an LLM agent to parse customer service emails and trigger actions via function calling, which is pretty standard stuff for a SaaS company in 2025. A customer wrote in saying, "I was charged $4.70 for a subscription I canceled. Please refund this immediately."
The LLM correctly identified the intent: process_refund. No issues there. But during the function-calling step, it needed to extract the amount parameter. And this is where things went sideways. Instead of parsing "$4.70", the model hallucinated the value 4700. Just... dropped the decimal and added zeros. I still can't fully explain why. Our downstream function, which expected the amount in cents, multiplied it by 100 again. So now we're looking at a $47,000 attempted refund through our payment processor.
Yeah.
Thankfully, a daily velocity limit caught the transaction before it fully settled. Shoutout to Stripe's fraud detection, honestly. But the internal accounting nightmare and the 48-hour freeze on our payment gateway cost us significant engineering time. Also, I had to explain to our CFO why our AI decided a $4.70 refund was actually a down payment on a Tesla. That was a fun call.
The Technical Autopsy: 3 Layers of Failure
We didn't just have one bug. We had a systemic failure in how we trusted non-deterministic output. Actually, wait—I should clarify that it wasn't really "trust" in the philosophical sense. It was more like... we were moving so fast we didn't stop to think about what could go wrong. Classic startup mistake.
Here's how it broke down:
1. Implicit Type Coercion in the Schema
Our function definition for process_refund defined the amount parameter as an integer (representing cents). We assumed—and this is embarrassing to admit—that the LLM would mathematically convert the string "$4.70" to 470. Like, do the actual math. Instead, it saw a numeric string, stripped the non-digit characters, and hallucinated 4700 to fit the integer format. We failed to provide explicit constraints in the prompt or the schema. I think the model was trying to "help" by padding the number to something that looked more like a complete integer. But that's just my theory.
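A minimal sketch of what explicit constraints could look like, assuming a JSON-Schema-style function definition. The `amount_cents` and `source_text` field names and the $500 cap are hypothetical, not the schema from the incident. Schema bounds are only hints to the model, so the sketch re-checks them in plain Python:

```python
# Hypothetical schema: field names and the cap are illustrative assumptions.
MAX_REFUND_CENTS = 50_000  # assumed cap of $500.00

PROCESS_REFUND_SCHEMA = {
    "name": "process_refund",
    "description": "Refund a customer. Amounts are integer cents.",
    "parameters": {
        "type": "object",
        "properties": {
            "amount_cents": {
                "type": "integer",
                "minimum": 1,
                "maximum": MAX_REFUND_CENTS,
                "description": "Refund in cents. '$4.70' is 470, never 4700.",
            },
            "source_text": {
                "type": "string",
                "description": "Exact amount as the customer wrote it, e.g. '$4.70'.",
            },
        },
        "required": ["amount_cents", "source_text"],
        "additionalProperties": False,
    },
}


def validate_args(args, schema=PROCESS_REFUND_SCHEMA):
    """Re-check the model's arguments in code; return a list of problems."""
    params = schema["parameters"]
    problems = [f"missing field: {name}" for name in params["required"] if name not in args]
    problems += [f"unexpected field: {name}" for name in args if name not in params["properties"]]
    spec = params["properties"]["amount_cents"]
    amount = args.get("amount_cents")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool):
        problems.append("amount_cents must be an integer")
    elif not spec["minimum"] <= amount <= spec["maximum"]:
        problems.append("amount_cents out of range")
    return problems
```

A range check like this would not have caught 4700 for a $4.70 refund, since $47.00 is a plausible amount. It only bounds the damage; catching the wrong value needs a cross-check against the source text or the original charge.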
2. Missing Guardrails in the Execution Layer
The function itself had no sanity checks. Zero. It was a pure pass-through to the Stripe API. I'd read Building Microservices by Sam Newman years ago, and he talks about how a service should never blindly trust its upstream. We violated this principle completely. There was no check like:
    if refund_amount > original_transaction_amount * 1.5:
        require_manual_approval()
Nothing. Just "here's a number, go execute it."
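The missing check can be sketched as a small pure function that runs before anything reaches the payment API. This is an illustration only: `check_refund` and `RefundDecision` are hypothetical names, and it assumes the original charge (in cents) can be looked up from the payment record rather than from the LLM's output. The 1.5 ratio is the one from the snippet above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundDecision:
    approved: bool
    reason: str


def check_refund(refund_cents: int, original_charge_cents: int,
                 max_ratio: float = 1.5) -> RefundDecision:
    """Decide whether a refund may run without a human looking at it.

    original_charge_cents must come from the system of record,
    never from the model's own arguments.
    """
    if refund_cents <= 0:
        return RefundDecision(False, "refund must be positive")
    if original_charge_cents <= 0:
        return RefundDecision(False, "no original charge on record")
    if refund_cents > original_charge_cents * max_ratio:
        return RefundDecision(False, "refund exceeds original charge; needs manual approval")
    return RefundDecision(True, "within limits")
```

With a $4.70 charge on record, `check_refund(4_700_000, 470)` is rejected while `check_refund(470, 470)` passes.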
3. Over-Reliance on AI "Reasoning"
We fell into the trap of treating the LLM as a deterministic parser. In our sprint planning—and I have to own this—I prioritized speed-to-market over a human-in-the-loop (HITL) strategy for high-stakes transactions. We learned the hard way that for any function call involving monetary value, a human (or at minimum a deterministic rule-based system) must verify the payload before execution. No exceptions.
The Fix: A Defensive Architecture for Function Calling
We spent the next sprint—actually, it bled into two sprints if I'm being honest—implementing what I'm calling a "Trust but Verify" architecture. If you're deploying function calling in production, here are the non-negotiable layers we added. I'm not saying this is perfect, but it's working so far.
Deterministic Post-Processing
Before any function executes, we run a Python validation script. For the refund example, we now use regex to extract the dollar amount deterministically from the original user text and cross-reference it with the LLM's extracted parameter:
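    import re
    from decimal import Decimal

    def extract_amount(text):
        """Return the first dollar amount in the text as a Decimal, or None."""
        match = re.search(r'\$(\d[\d,]*(?:\.\d{1,2})?)', text)
        return Decimal(match.group(1).replace(',', '')) if match else None

    def needs_human_review(function_args, original_email):
        """Cross-reference the LLM's extraction (in cents) with the email text."""
        llm_cents = function_args.get("amount")
        actual_amount = extract_amount(original_email)
        if llm_cents is None or not actual_amount:
            return True  # nothing to compare against, or a zero amount
        llm_amount = Decimal(llm_cents) / 100  # convert cents to dollars
        return abs(llm_amount - actual_amount) / actual_amount > Decimal("0.01")  # 1% tolerance

    # Decimal avoids float rounding on money; missing values fail closed.
    if needs_human_review(function_args, original_email):
        route_to_human_queue()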
If the values don't match within a 1% tolerance, the task gets routed to a human queue. We're using Retool for that queue, by the way. Works pretty well.
Strict JSON Schema with "Why" Fields
We modified our function schemas to force the LLM to explain its extraction. Now, the model must output:
    {
      "amount": 470,
      "reasoning": "Parsed $4.70 from the sentence and converted to 470 cents."
    }
This slows down the response by about 200ms, but it provides an audit trail. And honestly? The 200ms doesn't matter. What matters is being able to debug these things at 3 AM when something breaks.
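The reasoning field is also machine-checkable, not just something to read during an incident. As a rough sketch (an assumption about how one might use it, not a description of the production system), the hypothetical `reasoning_is_consistent` below checks that the dollar figure quoted in the reasoning really appears in the customer's text and converts to the stated number of cents:

```python
import re
from decimal import Decimal


def reasoning_is_consistent(payload: dict, source_text: str) -> bool:
    """True if the dollar figure quoted in "reasoning" appears in the
    source text and equals the "amount" field (in cents)."""
    amount = payload.get("amount")
    reasoning = payload.get("reasoning")
    if not isinstance(amount, int) or isinstance(amount, bool):
        return False
    if not isinstance(reasoning, str):
        return False
    quoted = re.search(r"\$\d+(?:\.\d{1,2})?", reasoning)
    if quoted is None:
        return False  # the model did not cite a figure at all
    figure = quoted.group(0)  # e.g. "$4.70"
    if figure not in source_text:
        return False  # the cited figure is not in the customer's message
    return Decimal(figure[1:]) * 100 == amount
```

A payload that cites "$4.70" but sets `amount` to 4700 fails this check, as does one whose reasoning cites no figure. It is a substring check, so it stays a coarse filter rather than a proof of correctness.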
The "Blast Radius" Principle
We categorized all function-calling endpoints into Risk Levels (1-5). Level 4 (Financial Write) and Level 5 (Infrastructure Delete) now require a synchronous human approval step via a Slack bot before the API call is actually made. We built it using Slack's Block Kit and a simple Lambda function.
This single change has prevented three potential hallucinations from becoming incidents in the last two weeks alone. Three! In two weeks!
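In code, the risk-level gate can be quite small. The sketch below is a simplified stand-in, not the Slack and Lambda implementation: `request_approval` is a hypothetical blocking callback (in production it would be whatever posts the approval message and waits for a click), and the registry entries other than the two named levels are made up for illustration.

```python
APPROVAL_THRESHOLD = 4  # levels 4 and 5 require a human, as described above

# Hypothetical registry; only levels 4 and 5 are named in this post.
RISK_LEVELS = {
    "lookup_order": 1,    # assumed example of a low-risk read
    "process_refund": 4,  # Financial Write
    "delete_cluster": 5,  # Infrastructure Delete
}


class ApprovalDenied(Exception):
    """Raised when a gated call is not approved."""


def execute_tool(name, args, handlers, request_approval):
    """Run a tool call, asking a human first when the risk level is high.

    Unknown tools default to the highest level, so forgetting to
    register a function fails closed.
    """
    level = RISK_LEVELS.get(name, 5)
    if level >= APPROVAL_THRESHOLD and not request_approval(name, args, level):
        raise ApprovalDenied(f"{name} (level {level}) was not approved")
    return handlers[name](**args)
```

Defaulting unregistered functions to level 5 is a deliberate choice in this sketch: a new endpoint added in a hurry gets a human in front of it until someone classifies it.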
Leadership Lesson: Speed vs. Solvency
As a VP of Engineering at a startup, I push for velocity. That's literally my job. But I had to stand in front of my team at the retro and own the fact that my push to ship the AI feature led to a financial risk. I referenced The Phoenix Project by Gene Kim—which, if you haven't read it, you should—because we optimized for flow over feedback. We had removed the constraint (manual review) without understanding the consequences. It's exactly what they warn about.
The ROI of AI isn't just about reducing headcount or speeding up tickets. It's about scaling safely. Our KPIs have shifted. We no longer just track "Response Time" for the AI agent; we track what I'm calling "Hallucination-Induced Risk Exposure" (HIRE) as a primary metric. Last month, our HIRE score was $47,000. This month, it's $0.
I'm thinking about open-sourcing the dashboard we built for this. Would anyone be interested in that?
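For anyone who wants to track something similar, here is one possible way to compute a HIRE-style number. The definition is an assumption made for this sketch (the post does not spell out a formula): sum the dollar amounts the model proposed for calls where its value disagreed with the verified value and no guardrail intercepted the call. `ToolCallRecord` and its fields are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ToolCallRecord:
    proposed_cents: int  # amount the model asked for
    verified_cents: int  # amount established by a deterministic check or a human
    intercepted: bool    # True if a guardrail stopped the call before the processor


def hire_dollars(records):
    """Assumed definition: total proposed dollars of wrong calls that
    were not intercepted by a guardrail."""
    exposed_cents = sum(
        r.proposed_cents
        for r in records
        if r.proposed_cents != r.verified_cents and not r.intercepted
    )
    return Decimal(exposed_cents) / 100
```

Under this definition a hallucination that is caught by validation contributes nothing, so the metric measures what slipped past the guardrails rather than how often the model was wrong. Counting intercepted mismatches separately would give the second number.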
TL;DR / Key Takeaways
- LLMs are not deterministic parsers. Treat them like enthusiastic junior devs who need validation, not oracles.
- Add post-processing validation using regex or rule-based systems before executing any financial operations.
- Force the LLM to explain its reasoning in the output schema—it creates an audit trail and seems to improve accuracy.
- Categorize function calls by risk level. Anything touching money or infrastructure should require human approval.
- Track "Hallucination-Induced Risk Exposure" as a metric. If you can't measure it, you can't improve it.
I'm curious to hear from other leaders pushing into this space. How are you handling the non-deterministic nature of LLMs in your critical path? Are you using a pure-play HITL approach, or have you found a reliable way to use embeddings for validation? We've been experimenting with cosine similarity checks between the original text and the extracted parameters, but the results are... mixed. Would love to compare notes in the comments.
Anyway, that's my story. Learn from my mistakes. Please.
What's your experience with LLM function calling in production? Drop a comment below—especially if you've got war stories. I can't be the only one who's had to explain AI mistakes to a CFO.
#AILiability #EngineeringLeadership #LLMOps #PostMortem #StartupLife #GenerativeAI #MachineLearning #SoftwareEngineering
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.