We Found a $45,000/Month AI Black Hole and Plugged It. Here's Your Diagnostic Toolkit
We Found a $45,000/Month AI Black Hole and Plugged It. Here's Your Diagnostic Toolkit
Last quarter, I sat in a board meeting where our CEO proudly announced a 40% increase in AI feature adoption. The room applauded.
I didn't.
I was staring at a different number on my laptop: our cloud infrastructure bill, specifically the line item for AI API calls, had increased by 340%. We were celebrating a growth metric while a cost black hole was silently consuming our runway. Nobody else seemed to connect those dots. I remember sitting there thinking, "Am I the only one seeing this?"
As VP of Engineering at a Series B startup, I've learned that "scaling" and "burning cash" can look terrifyingly similar. The promise of AI is intoxicating—I get it, I really do. But the pay-per-token model? It's a trap for the unprepared. We aren't paying for servers that hum along at a predictable rate. We're paying for every. single. word. our code generates.
Actually, wait—I should clarify something. This isn't a "don't use AI" post. We're all in on AI. But after a gruelling two-week audit (which I've started calling the "AI Black Hole Diagnosis" around the office—yeah, I know, very dramatic), we reclaimed $45,000 in monthly recurring costs without sacrificing a single product feature. Here's the diagnostic framework I wish I had six months ago, before we burned through, well... a lot.
TL;DR / Key Takeaways
- Verbose prompts are bleeding you dry. We cut input tokens by 18% just by trimming system prompts. That's instant savings.
- You probably don't need GPT-4 for everything. We downgraded 60% of non-critical tasks to cheaper models and saved $22,000 in month one. User experience? Unchanged.
- Your logging is a silent killer. Storing full API responses in logs cost us $3,200/month. Log metadata, not payloads.
- Give teams budgets, not mandates. When engineers own the P&L, they find solutions you'd never think of. One squad saved 73% by switching to a fine-tuned Llama 3 8B on our own infra.
The Three Hidden Sinks of AI Spend
Most engineering leaders I talk to are looking at the wrong metrics. They monitor total monthly cost. That's fine. But they don't dissect the unit economics of intelligence. You need to move from a macro view to a micro view. Here are the three specific leaks we found.
1. The "Verbose Professor" Prompt Tax
We discovered that our "summarisation" feature was costing us 12x more than necessary. Twelve. Times.
Why? The system prompt was a 400-word paragraph explaining how to be a helpful assistant, followed by the actual document. We were literally paying for the AI to read a novel before it wrote a haiku. I remember staring at the prompt template and just... laughing. It was absurd. Here's a sanitised version of what we had:
# BEFORE: The professor who won't stop lecturing
system_prompt = """
You are a helpful, thoughtful, and comprehensive assistant designed to
analyse documents with care and precision. You consider multiple perspectives,
weigh evidence carefully, and always structure your responses in a way that
is easy for users to understand. You avoid bias, you remain objective, and
you always prioritise accuracy over speed. When you summarise a document,
you ensure that all key points are captured without losing nuance...
[another 300 words of this]
"""
# AFTER: Just the facts
system_prompt = "Summarise the following document in under 80 words. Preserve key facts only."
The fix was embarrassingly simple. We stripped system prompts to the bare minimum. Moved from verbose prose to structured, terse instructions. This single change reduced our input token consumption by 18% immediately. Not gradually. Immediately.
The KPI Shift: We stopped tracking "successful completions" and started tracking "tokens per task." If a 50-word summary costs 2,000 tokens, you have a logic leak, not a cost problem. Period.
2. The "Gold-Plated" Model Fallacy
This one caused a fight. I mean, a real debate with one of my senior engineers, Sarah. She insisted we needed GPT-4 for a classification task because "it just felt smarter." I pushed back. When we ran a blind A/B test against GPT-3.5 Turbo—this was in late 2023, using the gpt-3.5-turbo-1106 snapshot—the accuracy difference was 0.7%.
Seven tenths of one percent.
The cost difference? 20x. We were using a supercomputer to do arithmetic.
I won't lie—this conversation was uncomfortable. Sarah's brilliant, and I value her judgement. But "it feels smarter" isn't a cost justification. We ended up implementing what I now call a "model router" architecture.
def route_task(task_type: str, complexity: str) -> str:
if task_type in ["classification", "sentiment", "extraction"]:
return "gpt-3.5-turbo" # $0.50/1M tokens
elif complexity == "high" or task_type == "reasoning":
return "gpt-4" # $10/1M tokens
# VP sign-off required for anything else
raise ValueError("Unauthorised model selection")
Simple tasks—classification, sentiment analysis, basic extraction—are now locked to cheaper, faster models. Complex reasoning tasks get the premium model. No exceptions without a VP-level sign-off. I've only signed off twice in four months.
The Data Point: By downgrading 60% of our non-critical pipeline calls, we saved $22,000 in the first month. Zero impact on user experience KPIs. I still bring this up with Sarah sometimes. She hates it. (Love you, Sarah.)
3. The Debugging Echo Chamber
This was the most embarrassing discovery. Not even close.
During a production incident back in March, a developer had added a console.log(fullapiresponse) line for debugging. It was never removed. For three months, we were not only paying to generate the output but also paying to store massive, unstructured JSON blobs in Datadog. Inflating both our AI bill and our observability bill.
The log retention alone was costing us $3,200/month. For debugging code that wasn't debugging anything anymore.
I felt physically ill when I found this. I'd walked past that developer's desk a dozen times since March. I'd even reviewed a PR for that service. Never noticed the log line. It's always the small things, isn't it?
The Fix: We enforced a strict "log the metadata, not the payload" policy. Built a lightweight proxy using FastAPI that strips response bodies before they hit the logger, keeping only token counts, latency, and model IDs.
@app.middleware("http")
async def sanitise_logs(request: Request, call_next):
response = await call_next(request)
# Strip payload, keep diagnostics
log_entry = {
"model": response.headers.get("x-model-used"),
"tokens": response.headers.get("x-tokens-consumed"),
"latency_ms": response.headers.get("x-response-time"),
"status": response.status_code,
# Deliberately NOT logging response.body
}
logger.info("ai_call", **log_entry)
return response
Took a weekend to build. Should have done it a year ago.
The KPI Shift: We now track "logging cost per 1,000 API calls." It should be flat. If it's not flat, someone left a payload log on again. And yes, it's happened twice since. Both times caught within 24 hours.
Building a Cost-Aware Culture, Not a Police State
You can't fix this with a dashboard alone. I've tried.
As leaders, we have to shift the engineering mindset from "ship it fast" to "ship it efficiently." And I'll be honest—this is harder than the technical stuff. I borrowed a concept from Marty Cagan's Empowered (if you haven't read it, I think it's probably the best product leadership book out there right now) and applied it to our AI costs: give the teams the problem, not the solution.
I didn't mandate which models to use. I gave every squad a weekly budget for "intelligence spend" relative to their feature's revenue or engagement. When the "Smart Reply" team saw they were spending $0.04 per user interaction while only generating $0.01 in ad revenue... well, they didn't need me to tell them to optimise. They owned the P&L and found a compression solution I hadn't even considered.
They ended up using a combination of prompt caching and switching to a fine-tuned Llama 3 8B model running on our own infrastructure. Saved 73%.
That's the thing about engineers. Tell them what to do, and you'll get compliance. Show them why it matters, and you'll get creativity.
Your Diagnostic Checklist
If you're feeling that sinking feeling in your stomach right now—and I've felt it, trust me—start here. This is the exact triage list I use:
- Pull a "Cost per Session" report. Don't look at aggregate cost. Divide total AI spend by active user sessions. Is this number trending up? If yes, your product is getting less efficient as it scales. Ours was up 22% month-over-month. That was the canary.
- Audit the top 10 prompts by volume. Are they concise? Do they include unnecessary few-shot examples that have become dead weight? We found one prompt that included five examples from 2022. The model didn't need them anymore. The examples were literally training on outdated data patterns.
- Check your retry logic. A failed API call that retries three times with exponential backoff is a 3x cost multiplier on a failure. Are you failing fast, or failing expensively? We had a bug where a malformed request was retrying five times before giving up. Five. That's just burning money. We added circuit breakers and now fail after two attempts.
The Bigger Picture
We are in an era where code writes prose, and that prose costs money. It's a paradigm shift that requires us to be as fluent in financial engineering as we are in software engineering. The black hole is real. But it's not a mystery. It's just a sum of small, unexamined decisions.
And honestly? That's good news. Because small decisions are fixable.
What's the most surprising cost leak you've found in your AI infrastructure? I'm collecting war stories for a future post—drop your experience in the comments. Bonus points if it's more embarrassing than my logging fiasco. (Please let it be more embarrassing. I need the validation.)
AIEngineering #EngineeringLeadership #CostOptimisation #StartupLife #TechStrategy
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.