How I Stopped LLM Hallucinations from Bleeding $2K/Month Out of My SaaS

Last month, I stared at my Stripe dashboard and felt physically ill. 12 refund requests in 48 hours. All traced back to the same catastrophic bug—my AI chatbot was generating JSON that looked correct but was quietly destroying customer data. Empty strings where phone numbers belonged. The literal string "null" instead of actual null values. And my personal favorite: "status": "I'll check that for you!" because GPT-4 decided to go rogue mid-JSON generation. A customer sent me that screenshot. I still have it.

I was hemorrhaging trust and roughly $2,100 a month in refunds. That week fundamentally changed how I think about Function Calling. Here's the unfiltered, slightly embarrassing journey from naive JSON parsing to implementing schema-constrained decoding, and why this might save your indie SaaS from the exact same nightmare.

TL;DR: I replaced GPT-4's "hope and pray" JSON generation with schema-constrained decoding. My function call success rate jumped from 78% to 99.3%. Refunds dropped from 4.2% to 0.1%. Saved about $2,060/month. The implementation took 3 weeks and increased API costs by 19%—worth every penny.

The "Just Add Function Calling" Trap (Months 1-3)

When OpenAI launched Function Calling in June 2023, I shipped it over a single weekend. Felt like an absolute genius. My product, TalkFlow AI, lets non-technical teams build voice agents that trigger backend actions—booking appointments, updating CRMs, sending invoices. The pitch was beautifully simple: speak naturally, get structured output.

Under the hood? GPT-4 with function_call parameters and a basic try/catch block for JSON validation. That's literally it.

At $3K MRR, this approach worked beautifully. My 47 customers were happy. Then I onboarded a real estate agency with genuinely complex scheduling logic—multiple agents, timezone offsets, property IDs that looked like PROP-2024-XJ9-001. And suddenly, GPT-4 started generating:


{
 "agent_id": "Sarah (the one in Austin office)",
 "time_slot": "tomorrow morning",
 "property_ref": "that blue house on Oak Street"
}

I wish I was joking. That's an actual response from my production logs on March 12th, 2024, 3:47 PM UTC. My validation layer tried to coerce "tomorrow morning" into an ISO datetime and silently failed.

The result? Bookings created for January 1st, 1970. Unix epoch zero.
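
For anyone wondering how a bad string turns into 1970: a minimal sketch of the failure shape, assuming a coercion helper that swallows the parse error and falls back to timestamp 0. The helper names and the fallback are illustrative assumptions, not my production code.

```python
from datetime import datetime, timezone


def coerce_time_slot(raw: str) -> datetime:
    """Hypothetical lenient coercion: parse ISO 8601, fall back to timestamp 0."""
    try:
        return datetime.fromisoformat(raw)
    except ValueError:
        # The silent failure: no error surfaces, the caller just gets epoch zero.
        return datetime.fromtimestamp(0, tz=timezone.utc)


def coerce_time_slot_strict(raw: str) -> datetime:
    """Safer variant: let the parse error propagate so the booking is rejected."""
    return datetime.fromisoformat(raw)


booking_time = coerce_time_slot("tomorrow morning")
print(booking_time.isoformat())  # 1970-01-01T00:00:00+00:00
```

The strict variant raises ValueError on the same input, which is the behavior you want for anything that writes to a calendar.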

I discovered this when the agency owner called me at 8 PM. Furious. His agents were getting calendar notifications for events 54 years in the past. Try explaining that one to a paying customer. I basically just stammered and promised to fix it immediately.

The Pivot: Schema-Constrained Decoding (Not Just Better Validation)

I did what every indie hacker does first: frantically Googled at 2 AM. Found Pieter Levels tweeting about how he "just uses better prompts."

Look, I respect Pieter. But respectfully—no. That approach completely falls apart when you're dealing with 47 fields across 12 function schemas. Prompt engineering is a band-aid. The wound? It's architectural.

Here's what I eventually understood after way too many late nights: LLMs are autoregressive samplers, not structured data generators. Each token gets predicted based on probability distributions, not logical constraints. Even with perfect prompts and response_format: { type: "json_object" }, the model can still generate:

Type violations: Integer fields getting floats, strings getting arrays
Missing required fields: The model decides phone_number is optional because the user didn't mention it
Hallucinated fields: Adding "customer_mood": "angry" when my schema only allows status: "escalated"
Invalid enum values: "payment_method": "credit_card" when my system expects "cc", "ach", or "wire"
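
All four failure modes can be reproduced against a strict model. A minimal sketch, assuming Pydantic v2; the PaymentUpdate model and its fields are invented for illustration. Note that this only detects the violations after generation, it does not prevent them.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, ValidationError


class PaymentUpdate(BaseModel):
    # strict=True disables type coercion; extra="forbid" rejects unknown keys.
    model_config = ConfigDict(strict=True, extra="forbid")

    amount_cents: int
    phone_number: str
    status: Literal["open", "escalated", "closed"]
    payment_method: Literal["cc", "ach", "wire"]


good = {"amount_cents": 1999, "phone_number": "555-0100",
        "status": "escalated", "payment_method": "cc"}

# One deliberately broken payload per failure mode from the list above.
bad_outputs = {
    "type violation": {**good, "amount_cents": 19.99},
    "missing required field": {k: v for k, v in good.items() if k != "phone_number"},
    "hallucinated field": {**good, "customer_mood": "angry"},
    "invalid enum value": {**good, "payment_method": "credit_card"},
}

error_types = {}
for label, payload in bad_outputs.items():
    try:
        PaymentUpdate.model_validate(payload)
    except ValidationError as exc:
        error_types[label] = exc.errors()[0]["type"]

for label, err in error_types.items():
    print(f"{label}: {err}")
```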

The solution isn't better validation—though you should have that too. The actual fix is schema-constrained decoding (sometimes called constrained sampling or grammar-based generation). Fancy terms for a surprisingly simple idea: modify the token selection process itself so the LLM literally cannot generate tokens that violate your schema.

Actually, I should clarify something important. OpenAI's Structured Outputs feature now does a version of this for you on their side, and their implementation is great. But when I built this in March 2024, it wasn't available yet. I was on my own, scrambling for alternatives.

Implementation: How I Built It (Without a PhD)

Here's the approachable version that took me from zero to production in 3 weeks. And when I say 3 weeks, I mean 3 very, very long weeks. I think I slept through two weekends.

Week 1: Understanding the Token-Level Problem

When GPT generates "phone": ", the next token is sampled from roughly 100,000 possibilities (the size of GPT-4's vocabulary). Without constraints, it might pick "555-", "null", or "I'll ask them". With schema constraints, we mask all tokens except those that could start a valid phone number string.

This happens during generation. Not after. That's the entire ballgame.
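
Here's a toy sketch of that masking step. It is not how outlines works internally: the five-token vocabulary, the scores, and the is_allowed rule are all invented stand-ins for a real tokenizer and a real schema state machine.

```python
import math
import random

# Toy vocabulary and raw model scores (logits) for the token after `"phone": "`.
logits = {"555-": 2.1, "null": 1.7, "I'll ask them": 2.4, "+1": 1.2, "(": 0.3}


def is_allowed(token: str) -> bool:
    """Stand-in for the schema's state machine: a phone number may start
    with a digit, '+', or '('."""
    return token[0].isdigit() or token[0] in "+("


def masked_probabilities(scores: dict) -> dict:
    """Softmax over allowed tokens only; disallowed tokens get probability 0."""
    masked = {t: (s if is_allowed(t) else -math.inf) for t, s in scores.items()}
    top = max(masked.values())
    exps = {t: math.exp(s - top) for t, s in masked.items()}  # exp(-inf) == 0.0
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


probs = masked_probabilities(logits)
# Sampling can now only ever return a token the schema permits.
next_token = random.choices(list(probs), weights=list(probs.values()))[0]
print(probs)
print("sampled:", next_token)
```

Notice that "I'll ask them" had the highest raw score. After masking, it cannot be sampled at all.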

I used the outlines library by Rémi Louf. Shoutout to open-source—this probably saved me months of work. It compiles your JSON Schema into a finite-state machine that guides token selection at every step. Here's my stripped-down implementation:


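from typing import Optional

from outlines import models, generate
from pydantic import BaseModel, Field


class BookingRequest(BaseModel):
    # Agent IDs look like "ABC-1234"
    agent_id: str = Field(pattern=r'^[A-Z]{3}-\d{4}$')
    # ISO 8601 UTC timestamp, e.g. "2024-03-13T14:00:00Z"
    time_slot: str = Field(pattern=r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$')
    # Optional; {3,4} so that IDs like "PROP-2024-XJ9-001" actually match
    property_ref: Optional[str] = Field(default=None, pattern=r'^PROP-\d{4}-[A-Z0-9]{3,4}-\d{3}$')
    priority: int = Field(ge=1, le=5)


# Assumption: this backend supports JSON-constrained generation in your outlines
# version. Token masking needs logit access, so a local model loaded with
# models.transformers(...) is the usual route.
model = models.openai("gpt-4")
generator = generate.json(model, BookingRequest)
result = generator("Book Sarah for tomorrow at 2pm, property PROP-2024-XJ9-001")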
# Guaranteed valid or throws a clear error

The key insight: generate.json() doesn't just validate output after the fact. It constrains the model's token vocabulary at each step. If the next required token must be a digit, the model can only choose from tokens 0-9. No more "tomorrow morning" nonsense—those tokens are literally inaccessible. Probability gets set to absolute zero.

Week 2: Productionizing with Fallbacks

Constrained decoding is heavier than standard generation. 15-20% slower in my tests on an M2 MacBook Pro. I couldn't just blindly swap it in everywhere. Here's the tiered approach I landed on:

Critical functions (payments, calendar writes, CRM updates): Full schema-constrained generation with outlines
Semi-structured data (notes, summaries): Standard function calling with strict Pydantic validation post-hoc
Free text (chat responses): No constraints, but never used for structured data—period
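
The tiering itself can be a plain lookup. A minimal sketch: the function names below are hypothetical, and sending unknown functions to the strictest tier is a design choice I'd suggest, not something the libraries require.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    CONSTRAINED = "schema-constrained generation"
    VALIDATED = "function calling + post-hoc validation"
    FREE_TEXT = "unconstrained chat"


# Hypothetical function names; the mapping mirrors the three tiers above.
FUNCTION_TIERS = {
    "create_payment": Tier.CONSTRAINED,
    "write_calendar_event": Tier.CONSTRAINED,
    "update_crm_record": Tier.CONSTRAINED,
    "save_call_notes": Tier.VALIDATED,
    "summarize_call": Tier.VALIDATED,
}


def tier_for(function_name: Optional[str]) -> Tier:
    """Pick a generation strategy. Unknown functions get the strictest tier."""
    if function_name is None:
        return Tier.FREE_TEXT  # plain chat reply, never parsed as data
    return FUNCTION_TIERS.get(function_name, Tier.CONSTRAINED)


print(tier_for("create_payment").value)
print(tier_for(None).value)
```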

My cost per 1,000 API calls went from $12.40 to $14.80. But my refund rate dropped from 4.2% to 0.1%. The math is embarrassingly clear: I was losing $2,100/month in refunds to save about $240/month in compute costs.
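
The arithmetic, spelled out. The inputs are the figures quoted above. The call volume is only implied, assuming the extra $240 comes entirely from the per-call price difference.

```python
cost_before = 12.40               # USD per 1,000 API calls
cost_after = 14.80                # USD per 1,000 API calls
refunds_lost_per_month = 2100.0   # USD
extra_compute_per_month = 240.0   # USD

# Relative increase in per-call cost.
increase_pct = (cost_after - cost_before) / cost_before * 100

# Monthly call volume implied by the extra compute bill (an inference, see above).
implied_calls_per_month = extra_compute_per_month / (cost_after - cost_before) * 1000

# Refund dollars that were at stake for every extra compute dollar.
refunds_per_compute_dollar = refunds_lost_per_month / extra_compute_per_month

print(f"cost increase: {increase_pct:.1f}%")
print(f"implied volume: {implied_calls_per_month:,.0f} calls/month")
print(f"refunds per extra compute dollar: ${refunds_per_compute_dollar:.2f}")
```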

Bootstrapper lesson I learned the hard way: optimize for customer trust, not server costs. I think a lot of us get this exactly backwards.

Week 3: The Edge Cases That Still Haunt Me

Even with constrained decoding, I hit three problems that nearly made me quit:

Regex isn't enough for cross-field validation: My schema required end_time > start_time. That's semantic validation, not syntactic. I added a second validation pass using Pydantic's validator decorators (@model_validator in Pydantic v2) that runs immediately after generation. Catches about 93% of remaining issues. Not perfect. But close enough for production.
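
A minimal sketch of that second pass, assuming Pydantic v2. The TimeWindow model is a cut-down illustration, not my full booking schema.

```python
from datetime import datetime

from pydantic import BaseModel, ValidationError, model_validator


class TimeWindow(BaseModel):
    start_time: datetime
    end_time: datetime

    @model_validator(mode="after")
    def end_after_start(self) -> "TimeWindow":
        # Semantic rule that a per-field regex cannot express.
        if self.end_time <= self.start_time:
            raise ValueError("end_time must be after start_time")
        return self


try:
    # Each field is well-formed on its own; only the pair is wrong.
    TimeWindow(start_time="2024-03-13T15:00:00Z", end_time="2024-03-13T14:00:00Z")
except ValidationError as exc:
    print(exc.errors()[0]["msg"])
```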

The "silent null" problem: When a user says "I don't have a property reference," the constrained model would sometimes generate "property_ref": "" instead of omitting the field entirely. This passed schema checks but broke my database constraints. Solution: explicit min_length=1 on optional string fields, and treating every field as non-nullable unless I explicitly opt in. Took me 4 hours of debugging to figure that one out. I almost threw my laptop.
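
Roughly what that fix looks like, assuming Pydantic v2. BookingDraft is a hypothetical one-field model for illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class BookingDraft(BaseModel):
    # None (or omitting the field) is fine; an empty string is not.
    property_ref: Optional[str] = Field(default=None, min_length=1)


print(BookingDraft().property_ref)                    # field omitted
print(BookingDraft(property_ref=None).property_ref)  # explicit null

try:
    BookingDraft(property_ref="")
except ValidationError as exc:
    print(exc.errors()[0]["type"])
```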

Streaming breaks constraints—badly: My product streams responses for UX smoothness. But token-by-token streaming means the first few tokens might be valid while later ones violate the schema. I had to buffer the entire function call response before displaying it to users. Added about 800ms latency. Users noticed the pause. I got two support tickets about it within the first week. Worth it for correctness, but still annoying.
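
The buffering is about as simple as it sounds. A minimal sketch: fake_stream is a stand-in for whatever streaming client you use, and the fragments are made up.

```python
import json
from typing import Iterable, Iterator


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming API: yields a function call's arguments in fragments."""
    yield from ['{"agent_id": "AUS-', '0042", "prio', 'rity": 2}']


def buffered_function_call(chunks: Iterable[str]) -> dict:
    """Collect every fragment, then parse once. Nothing is shown until this returns."""
    buffer = "".join(chunks)
    return json.loads(buffer)  # raises json.JSONDecodeError if the call is malformed


args = buffered_function_call(fake_stream())
print(args)  # {'agent_id': 'AUS-0042', 'priority': 2}
```

Plain chat text can still stream token by token. Only the function call arguments need to wait for the full buffer.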

The Results (With Real Numbers)

After 6 weeks of running schema-constrained decoding in production:

Function call success rate: 78% → 99.3%
Customer-reported bugs: 23/month → 2/month
Refund rate: 4.2% → 0.1% (saved ~$2,060/month)
MRR impact: Lost 3 customers during the buggy period, gained 11 after. Net: +$1,840 MRR
Time spent on LLM debugging: 15 hours/week → 2 hours/week

I shipped this as a "Reliability Update" to all customers on April 3rd, 2024. Three of them emailed me within 24 hours saying they noticed the improvement. One enterprise customer upgraded from $199/month to $499/month specifically because "the system finally does what we expect."

That felt good. Really, genuinely good. Like all those late nights actually meant something.

What I'd Do Differently

If I could go back and talk to my $3K MRR self, here's what I'd change:

Start with constrained decoding from day one: I wasted 4 months building validation layers that were fundamentally flawed. The cost difference is negligible at small scale. The engineering debt of retrofitting constraints? Brutal and completely avoidable.

Use JSON Schema as the single source of truth: I had three different representations of my data shape—OpenAI function definitions, Pydantic models, and database schemas. Now I generate all three from a single JSON Schema file using datamodel-code-generator. Life is dramatically simpler when you're not keeping schemas in sync manually.

Test with adversarial inputs weekly: I now have a script that runs 200 intentionally confusing prompts every Monday. Stuff like "Book me when the sun is high but not too high, for the house with the red door." Ensures zero schema violations. Catches regressions before customers do.
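
A stripped-down sketch of that kind of harness, assuming Pydantic v2. The two prompts, the trimmed BookingRequest model, and stub_generate are placeholders; swap the stub for your real model call.

```python
from typing import Callable

from pydantic import BaseModel, Field, ValidationError


class BookingRequest(BaseModel):
    agent_id: str = Field(pattern=r"^[A-Z]{3}-\d{4}$")
    priority: int = Field(ge=1, le=5)


ADVERSARIAL_PROMPTS = [
    "Book me when the sun is high but not too high, for the house with the red door.",
    "Sarah, the one in the Austin office, whenever she's free-ish.",
]


def run_suite(generate: Callable[[str], str]) -> list:
    """Return (prompt, error) pairs for every output that violates the schema."""
    failures = []
    for prompt in ADVERSARIAL_PROMPTS:
        raw = generate(prompt)
        try:
            BookingRequest.model_validate_json(raw)
        except ValidationError as exc:
            failures.append((prompt, str(exc)))
    return failures


def stub_generate(prompt: str) -> str:
    """Placeholder for the real model call; always returns a valid payload."""
    return '{"agent_id": "AUS-0042", "priority": 3}'


failures = run_suite(stub_generate)
assert not failures, failures  # fail the weekly run on any schema violation
```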

Build a "constraint explainer" for debugging: When the model can't generate valid output, it often produces nothing or cryptic errors. I added logging that shows which specific constraint failed and what the model was trying to generate. This turned 2-hour debugging sessions into 5-minute fixes. Absolute game changer.
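
A minimal sketch of the validation-side half of that idea, assuming Pydantic v2. It reports which field broke which rule and what value the model produced. The model is trimmed down and the logger name is arbitrary.

```python
import logging

from pydantic import BaseModel, Field, ValidationError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("constraint_explainer")


class BookingRequest(BaseModel):
    agent_id: str = Field(pattern=r"^[A-Z]{3}-\d{4}$")
    priority: int = Field(ge=1, le=5)


def explain_failures(raw_output: dict) -> list:
    """Return one line per violated constraint: field, rule, and offending value."""
    try:
        BookingRequest.model_validate(raw_output)
        return []
    except ValidationError as exc:
        lines = [
            f"{'.'.join(str(p) for p in err['loc'])}: {err['type']} (got {err['input']!r})"
            for err in exc.errors()
        ]
        for line in lines:
            log.warning(line)
        return lines


report = explain_failures({"agent_id": "Sarah (the one in Austin office)", "priority": 9})
```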

The Bigger Picture for Indie Hackers

Here's what I keep thinking about: we're in this gold rush of AI wrappers right now. But the products that survive won't be the ones with the cleverest prompts. They'll be the ones that solve the boring, unsexy reliability problems.

Schema-constrained decoding isn't exciting. It doesn't make for viral Twitter threads. But it's the difference between a product that works 78% of the time and one that works 99.3% of the time. And customers—especially enterprise customers—will pay a premium for that difference.

When I see indie hackers like Danny Postma and Marc Lou raising prices on their AI products, I suspect they've quietly solved these same reliability issues under the hood. Customers pay for trustworthy AI. Not just AI.

I'm now at $10,230 MRR with 84 customers and a 1.8% monthly churn. My CAC is around $67—mostly content marketing and a few targeted ads on X. I'm not Pieter Levels. Not even in the same universe. But I'm building something sustainable. And honestly? It's because I stopped treating LLMs as magic and started treating them as probabilistic systems that need real, architectural guardrails.

Anyway. That's my story. What a ride this has been.

What's your experience with Function Calling reliability? Have you tried constrained decoding, or are you still in the "better prompts will fix it" phase? Drop your horror stories below—I'll share the worst one in my next update. Seriously, I want to hear them. Misery absolutely loves company.

Product: TalkFlow AI - Voice agents with structured actions

Revenue: $10,230 MRR | Customers: 84 | Churn: 1.8%

Previous milestone: $7K MRR after pivoting from chatbot templates to voice-first

#buildinpublic #aiengineering #functioncalling #indiehackers #bootstrap #llmreliability


Cael Lee
