Stop Treating Your LLM Costs Like a Black Box: A FinOps Blueprint That Actually Works
Last quarter, I watched a team burn through 40% of their monthly cloud budget in two weeks.
Two. Weeks.
The culprit wasn't an infinite loop or a rogue Kubernetes cluster. It was a single feature release that quietly switched a GPT-3.5 call to GPT-4o without updating the cost attribution model. The CFO called me at 8:47 PM on a Thursday. I still remember the exact time because I was mid-bite into cold pad thai.
Not a fun conversation.
We've all gotten good at tracking compute and storage. But generative AI introduces a kind of variability that makes traditional cloud cost management look... honestly, kind of primitive. We're not dealing with predictable, time-based resources anymore. We're dealing with token consumption. And token consumption is messy: it's directly tied to user behaviour, prompt engineering, model selection, even the time of day (peak traffic means longer context windows, which means more input tokens per call).
As we scaled our platform from a few hundred internal beta users to about 40,000 in 14 months, I had to completely rethink our FinOps strategy. Actually, "rethink" is generous. I had to build one from scratch. Here's the framework I've implemented to move from reactive bill-shock to proactive cost attribution for our Generative AI APIs.
The Core Problem: Why Traditional Tagging Fails
Standard cloud tagging—you know, Project, Environment, Owner—completely collapses when applied to an LLM call. A single API endpoint might serve ten different product features, each with wildly different prompt lengths and model requirements. A "Summarise Document" feature is fundamentally more expensive than a "Suggest Title" feature, even if they hit the exact same /completions route.
I learned this the hard way. We initially just tagged the API gateway. Told us where the cost was. Never told us why.
We were flying blind.
The first step in a mature FinOps model—and I'm convinced of this now—is moving from resource-level attribution to business-logic attribution. Not the infrastructure. The intent.
Designing the Cost Attribution Schema
You cannot negotiate what you cannot measure. I probably say this three times a week now. You need to shift your observability from infrastructure metrics to business metrics. Here's the granular schema I mandated across all our AI services, after way too many meetings about it:
- Feature ID: The specific product capability (e.g., chatbot_v2, doc_analysis, code_gen). Not the endpoint. The capability.
- Tenant Tier: Free, Pro, or Enterprise. This is crucial for calculating unit economics and CAC payback periods. We actually discovered our Enterprise tier was subsidising Free users at a 4:1 ratio, which... well, that's a different post.
- Model Fingerprint: Not just the model name, but the exact version and modality (e.g., gpt-4o-2024-05-13_text_only). I'm obsessive about this now. A minor version bump can change tokenisation behaviour.
- Token Vector: A breakdown of input tokens, output tokens, and search/grounding tokens. I also push for a "wasted tokens" metric: input context that the model didn't actually need. This one's controversial with the ML team. They think I'm being reductive. They're probably right, but the numbers don't lie.
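To make the schema concrete, here is a minimal sketch of what one attribution record could look like in Python. The class and field names (AttributionRecord, TokenVector, grounding_tokens, wasted_tokens) are illustrative assumptions, not the names we use in production, and how you would actually measure "wasted" tokens is left open.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class TenantTier(Enum):
    FREE = "free"
    PRO = "pro"
    ENTERPRISE = "enterprise"


@dataclass(frozen=True)
class TokenVector:
    input_tokens: int
    output_tokens: int
    grounding_tokens: int = 0  # search/grounding tokens, if the provider reports them
    wasted_tokens: int = 0     # hypothetical metric: input context judged unnecessary

    @property
    def total(self) -> int:
        # Wasted tokens are a subset of input tokens, so they are not added again.
        return self.input_tokens + self.output_tokens + self.grounding_tokens


@dataclass(frozen=True)
class AttributionRecord:
    feature_id: str          # product capability, not the endpoint
    tenant_tier: TenantTier
    model_fingerprint: str   # exact model version and modality
    tokens: TokenVector

    def to_json(self) -> str:
        record = asdict(self)
        record["tenant_tier"] = self.tenant_tier.value  # store the plain string
        return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    rec = AttributionRecord(
        feature_id="doc_analysis",
        tenant_tier=TenantTier.PRO,
        model_fingerprint="gpt-4o-2024-05-13",
        tokens=TokenVector(input_tokens=1200, output_tokens=300, grounding_tokens=50),
    )
    print(rec.to_json())
```

Emitting one such record per model call is enough to group spend by feature, tier, and model version later on.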
The $0.04 vs. $0.80 Lesson
Okay, story time.
During our Q2 hackathon last year, one of our senior engineers—brilliant bloke, 15 years of experience—built this incredible RAG pipeline for legal documents. It was accurate, fast, beautiful. The kind of thing you demo to the board. But when we ran the cost attribution report (which, thankfully, we had just implemented), we saw that a single query was costing $0.80.
Eighty cents. Per query.
A similar feature built by a junior team—two engineers fresh out of bootcamp—used a more aggressive summarisation step before the final prompt. Their cost? $0.04.
The difference? The senior engineer was passing the entire raw document context (10,000+ tokens) into the prompt. The junior team was passing a structured JSON summary (500 tokens). Both outputs were factually correct. Both passed our eval suite. But that $0.76 delta, multiplied by 100,000 daily queries...
That's $76,000 a day. I'll let you do the annual maths.
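For anyone who wants to check the arithmetic, here is a quick sketch. It works in integer cents to avoid floating-point rounding, and the 365-day year is my assumption for the annual figure.

```python
def daily_delta_cents(expensive_cents: int, cheap_cents: int, daily_queries: int) -> int:
    """Extra spend per day, in cents, from using the more expensive pipeline."""
    return (expensive_cents - cheap_cents) * daily_queries


# Figures from the story: $0.80 vs $0.04 per query, 100,000 queries a day.
delta_cents = daily_delta_cents(expensive_cents=80, cheap_cents=4, daily_queries=100_000)
daily_dollars = delta_cents // 100
annual_dollars = daily_dollars * 365  # assumes constant volume, 365 days

print(f"${daily_dollars:,} per day")    # $76,000 per day
print(f"${annual_dollars:,} per year")  # $27,740,000 per year
```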
We don't optimise for cost, by the way. That's the wrong framing. We optimise for cost-per-correct-output. Subtle distinction, but it matters.
Building the FinOps Financial Model
Once the telemetry is in place, you can build a dynamic financial model. I don't use static spreadsheets for this—tried that, it's a nightmare to maintain. I use a Metabase dashboard that feeds directly from our BigQuery data warehouse. Here's the structure I present to the board every month:
- The "Cost Per Interaction" Curve: Plot Feature ID against Cost Per User Session. You will immediately see outliers. I set a threshold: any feature averaging above $0.10 per session triggers an automatic architecture review. No exceptions. Well... almost no exceptions.
- Token Efficiency Ratio (TER): Output Tokens / Input Tokens. A low ratio often signals "lazy prompting" where the system is over-fetching context. We gamified this—the team with the highest TER while maintaining accuracy scores gets a budget bonus for their next sprint. It's silly, but it works.
- The Freemium Trap Analysis: We track GenAI cost as a line item against free-tier users. If the cumulative infrastructure cost of a free user exceeds their predicted LTV within the first 30 days, we throttle them to a cheaper, fine-tuned lightweight model automatically. Usually Gemini Flash or Claude Haiku, depending on the task. You can't let a free user's "summarise the entire internet" habit bankrupt you. And trust me, they'll try.
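The three checks above are simple enough to sketch in plain Python. The function names and the example figures below are hypothetical; the $0.10 threshold and the 30-day window come from the list above, and the real versions live in dashboard queries rather than application code.

```python
REVIEW_THRESHOLD_USD = 0.10  # average cost per session that triggers a review
FREE_TIER_WINDOW_DAYS = 30   # window for the freemium check


def token_efficiency_ratio(input_tokens: int, output_tokens: int) -> float:
    """TER = output tokens / input tokens. Low values hint at over-fetched context."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    return output_tokens / input_tokens


def features_needing_review(
    avg_cost_per_session: dict[str, float],
    threshold: float = REVIEW_THRESHOLD_USD,
) -> list[str]:
    """Feature IDs whose average session cost is above the review threshold."""
    return sorted(f for f, cost in avg_cost_per_session.items() if cost > threshold)


def should_downgrade_free_user(
    cumulative_cost_usd: float,
    predicted_ltv_usd: float,
    days_since_signup: int,
) -> bool:
    """True when a free user has already cost more than their predicted LTV
    within the first 30 days."""
    within_window = days_since_signup <= FREE_TIER_WINDOW_DAYS
    return within_window and cumulative_cost_usd > predicted_ltv_usd


if __name__ == "__main__":
    print(token_efficiency_ratio(input_tokens=10_000, output_tokens=500))  # 0.05
    print(features_needing_review({"doc_analysis": 0.80, "suggest_title": 0.04}))
    print(should_downgrade_free_user(1.50, 1.00, days_since_signup=12))
```

A TER on its own says nothing about quality, so it only makes sense read next to the accuracy scores for the same feature.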
What I'd Actually Do This Week
If you're feeling the heat from your finance team right now—and I know some of you are, I've gotten the DMs—here's what I'd do:
- Log the token vector. Today. Not tomorrow, not after the next sprint planning. Add a structured log line that captures {model, input_tokens, output_tokens, feature_flag}. You can build the analytics later. You can't recreate lost data. I'm speaking from pain here.
- Implement a kill switch. Every AI feature needs a circuit breaker. If cost-per-second spikes 300% above baseline, the system should automatically fall back to a cached response or a simpler model. Revenue preservation is a reliability metric. I think this is going to be standard practice by 2026, but right now it's still surprisingly rare.
- Read "Cloud FinOps" by J.R. Storment and Mike Fuller. It's the bible for this stuff, even though it predates the LLM explosion. The principles of unit economics are universal. Actually, wait—I should clarify that the second edition is the one you want. The first edition is fine but missing some key chapters on variable cost models.
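The first two items fit in a few dozen lines. What follows is a rough sketch, not our production code: the logger name, the CostCircuitBreaker class, and the model names in the usage example are all made up for illustration, the log keys are spelled with underscores, and I'm reading "300% above baseline" as four times the baseline rate.

```python
import json
import logging

logger = logging.getLogger("llm_cost")  # hypothetical logger name


def token_log_line(model: str, input_tokens: int, output_tokens: int, feature_flag: str) -> str:
    """Build the structured log line and emit it at INFO level."""
    line = json.dumps(
        {
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "feature_flag": feature_flag,
        },
        sort_keys=True,
    )
    logger.info(line)
    return line


class CostCircuitBreaker:
    """Trips when the observed cost rate exceeds baseline by spike_pct percent."""

    def __init__(self, baseline_usd_per_sec: float, spike_pct: float = 300.0):
        if baseline_usd_per_sec <= 0:
            raise ValueError("baseline_usd_per_sec must be positive")
        # Assumption: 300% above baseline means 4x the baseline rate.
        self.trip_at = baseline_usd_per_sec * (1 + spike_pct / 100)
        self.tripped = False

    def record(self, cost_usd: float, window_sec: float) -> bool:
        """Record spend over a time window; returns True once the breaker is tripped."""
        if window_sec <= 0:
            raise ValueError("window_sec must be positive")
        if cost_usd / window_sec > self.trip_at:
            self.tripped = True
        return self.tripped

    def reset(self) -> None:
        """Manually re-arm the breaker once costs are back under control."""
        self.tripped = False

    def choose_model(self, primary: str, fallback: str) -> str:
        return fallback if self.tripped else primary


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    token_log_line("gpt-4o", 1200, 300, "doc_analysis")

    breaker = CostCircuitBreaker(baseline_usd_per_sec=0.01)
    breaker.record(cost_usd=0.5, window_sec=10)  # 0.05 USD/s, above the 0.04 trip point
    print(breaker.choose_model(primary="big-model", fallback="small-model"))
```

The breaker here stays tripped until someone calls reset(), which is a deliberate choice: I would rather a human confirm the spike is over than have the system flap between models.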
TL;DR
- Traditional cloud tagging fails for LLMs—you need business-logic attribution, not resource-level tagging
- Track Feature ID, Tenant Tier, Model Fingerprint, and Token Vector for every AI call
- Optimise for cost-per-correct-output, not raw cost
- Build a kill switch that falls back to cheaper models when costs spike
- Log your token vectors today—you can't analyse what you haven't captured
We're entering this weird era where an engineer's prompt design is a direct P&L activity. My role as a VP of Engineering isn't just about uptime and velocity anymore. It's about enabling a cost-conscious culture without stifling innovation. And honestly? That balance is harder than any technical problem I've faced.
How are you currently attributing your LLM costs? Are you still just looking at the AWS bill, or have you drilled down to the feature level? I'm genuinely curious about the hacks you've built—drop them in the comments. I read every single one, even if I don't always respond.
#AIFinOps #EngineeringLeadership #CostOptimisation #GenerativeAI #SaaS
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.