Stop Treating Your LLM Costs Like a Black Box: A FinOps Blueprint That Actually Works

By Cael Lee | 6 min read

Last quarter, I watched a team burn through 40% of their monthly cloud budget in two weeks.

Two. Weeks.

The culprit wasn't an infinite loop or a rogue Kubernetes cluster. It was a single feature release that quietly switched a GPT-3.5 call to GPT-4o without updating the cost attribution model. The CFO called me at 8:47 PM on a Thursday. I still remember the exact time because I was mid-bite into cold pad thai.

Not a fun conversation.

We've all gotten good at tracking compute and storage. But generative AI introduces this weird variability that makes traditional cloud cost management look... honestly, kind of primitive. We're not dealing with predictable, time-based resources anymore. We're dealing with token consumption. And token consumption is messy—it's directly tied to user behaviour, prompt engineering, model selection, even the time of day (peak traffic means longer context windows, which means... you get it).

As we scaled our platform from a few hundred internal beta users to about 40,000 in 14 months, I had to completely rethink our FinOps strategy. Actually, "rethink" is generous. I had to build one from scratch. Here's the framework I've implemented to move from reactive bill-shock to proactive cost attribution for our Generative AI APIs.

The Core Problem: Why Traditional Tagging Fails

Standard cloud tagging—you know, Project, Environment, Owner—completely collapses when applied to an LLM call. A single API endpoint might serve ten different product features, each with wildly different prompt lengths and model requirements. A "Summarise Document" feature is fundamentally more expensive than a "Suggest Title" feature, even if they hit the exact same /completions route.

I learned this the hard way. We initially just tagged the API gateway. Told us where the cost was. Never told us why.

We were flying blind.

The first step in a mature FinOps model—and I'm convinced of this now—is moving from resource-level attribution to business-logic attribution. Not the infrastructure. The intent.

Designing the Cost Attribution Schema

You cannot negotiate what you cannot measure. I probably say this three times a week now. You need to shift your observability from infrastructure metrics to business metrics. Here's the granular schema I mandated across all our AI services, after way too many meetings about it:
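A minimal sketch of what one record in such a schema could look like, written as a structured log entry. Only the model, input tokens, output tokens, and feature flag fields come from this post; the class name `LLMCallRecord`, the `team` and `request_id` fields, and the sample values are hypothetical, added purely for illustration.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from uuid import uuid4


@dataclass(frozen=True)
class LLMCallRecord:
    """One row per LLM call, keyed by business intent rather than infrastructure."""

    model: str             # model name as sent to the provider
    input_tokens: int      # prompt tokens billed for this call
    output_tokens: int     # completion tokens billed for this call
    feature_flag: str      # product feature that triggered the call
    team: str = "unknown"  # hypothetical field: owning team, for chargeback
    request_id: str = field(default_factory=lambda: uuid4().hex)  # hypothetical field
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_log_line(self) -> str:
        """Serialise to one JSON line that a warehouse loader can ingest."""
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
record = LLMCallRecord(
    model="gpt-4o",
    input_tokens=10_000,
    output_tokens=400,
    feature_flag="summarise_document",
    team="legal-ai",
)
print(record.to_log_line())
```

The point of the design is that `feature_flag` travels with every call, so two features sharing the same route can still be separated in the warehouse.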

The $0.04 vs. $0.80 Lesson

Okay, story time.

During our Q2 hackathon last year, one of our senior engineers—brilliant bloke, 15 years of experience—built this incredible RAG pipeline for legal documents. It was accurate, fast, beautiful. The kind of thing you demo to the board. But when we ran the cost attribution report (which, thankfully, we had just implemented), we saw that a single query was costing $0.80.

Eighty cents. Per query.

A similar feature built by a junior team—two engineers fresh out of bootcamp—used a more aggressive summarisation step before the final prompt. Their cost? $0.04.

The difference? The senior engineer was passing the entire raw document context (10,000+ tokens) into the prompt. The junior team was passing a structured JSON summary (500 tokens). Both outputs were factually correct. Both passed our eval suite. But that $0.76 delta, multiplied by 100,000 daily queries...

That's $76,000 a day. I'll let you do the annual maths.
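Spelled out, using only the figures above and assuming query volume stays flat every day of the year:

```python
COST_RAW_CONTEXT = 0.80  # USD per query, full raw document in the prompt
COST_SUMMARISED = 0.04   # USD per query, structured JSON summary in the prompt
DAILY_QUERIES = 100_000

delta_per_query = COST_RAW_CONTEXT - COST_SUMMARISED
daily_delta = delta_per_query * DAILY_QUERIES
annual_delta = daily_delta * 365  # assumption: same volume all 365 days

print(f"Per query: ${delta_per_query:.2f}")  # Per query: $0.76
print(f"Per day:   ${daily_delta:,.0f}")     # Per day:   $76,000
print(f"Per year:  ${annual_delta:,.0f}")    # Per year:  $27,740,000
```

The roughly 20x cost gap lines up with the roughly 20x gap in prompt size (10,000+ tokens versus 500).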

We don't optimise for cost, by the way. That's the wrong framing. We optimise for cost-per-correct-output. Subtle distinction, but it matters.
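One way to make that metric concrete, as a rough sketch. The function name and the pass counts below are hypothetical; in the story above both pipelines passed the eval suite, so the differing pass rates here exist only to show how the metric behaves.

```python
def cost_per_correct_output(total_cost_usd: float, correct_outputs: int) -> float:
    """Total spend divided by the number of outputs that passed evaluation."""
    if correct_outputs <= 0:
        raise ValueError("need at least one correct output to compute the metric")
    return total_cost_usd / correct_outputs


# Hypothetical batch: 1,000 queries per pipeline, made-up eval pass counts.
raw_context = cost_per_correct_output(total_cost_usd=1_000 * 0.80, correct_outputs=950)
summarised = cost_per_correct_output(total_cost_usd=1_000 * 0.04, correct_outputs=900)

print(f"raw context: ${raw_context:.4f} per correct output")
print(f"summarised:  ${summarised:.4f} per correct output")
```

A cheaper pipeline that fails more often is penalised automatically, which is why this framing is safer than minimising raw cost.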

Building the FinOps Financial Model

Once the telemetry is in place, you can build a dynamic financial model. I don't use static spreadsheets for this—tried that, it's a nightmare to maintain. I use a Metabase dashboard that feeds directly from our BigQuery data warehouse. Here's the structure I present to the board every month:
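The kind of per-feature rollup such a dashboard might sit on can be sketched in plain Python. This is not the actual Metabase or BigQuery setup: the model names, the prices per million tokens, and the function names are all placeholder assumptions, so substitute your provider's current rates.

```python
from collections import defaultdict

# Hypothetical USD prices per 1M tokens. Placeholders, not real provider rates.
PRICES_PER_MILLION = {
    "model-large": {"input": 5.00, "output": 15.00},
    "model-small": {"input": 0.50, "output": 1.50},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call in USD, from token counts and the price table."""
    price = PRICES_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def rollup_by_feature(records: list[dict]) -> dict[str, dict]:
    """Aggregate call count, total cost, and cost per call for each feature."""
    totals = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0})
    for r in records:
        bucket = totals[r["feature_flag"]]
        bucket["calls"] += 1
        bucket["cost_usd"] += call_cost(r["model"], r["input_tokens"], r["output_tokens"])
    return {
        feature: {**t, "cost_per_call_usd": t["cost_usd"] / t["calls"]}
        for feature, t in totals.items()
    }


# Illustrative log records in the {model, input_tokens, output_tokens, feature_flag} shape.
records = [
    {"model": "model-large", "input_tokens": 10_000, "output_tokens": 400,
     "feature_flag": "summarise_document"},
    {"model": "model-small", "input_tokens": 500, "output_tokens": 20,
     "feature_flag": "suggest_title"},
]
report = rollup_by_feature(records)
for feature, row in report.items():
    print(feature, row)
```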

What I'd Actually Do This Week

If you're feeling the heat from your finance team right now—and I know some of you are, I've gotten the DMs—here's what I'd do:

  1. Log the token vector. Today. Not tomorrow, not after the next sprint planning. Add a structured log line that captures {model, input_tokens, output_tokens, feature_flag}. You can build the analytics later. You can't recreate lost data. I'm speaking from pain here.
  2. Implement a kill switch. Every AI feature needs a circuit breaker. If cost-per-second spikes 300% above baseline, the system should automatically fall back to a cached response or a simpler model. Revenue preservation is a reliability metric. I think this is going to be standard practice by 2026, but right now it's still surprisingly rare.
  3. Read "Cloud FinOps" by J.R. Storment and Mike Fuller. It's the bible for this stuff, even though it predates the LLM explosion. The principles of unit economics are universal. Actually, wait—I should clarify that the second edition is the one you want. The first edition is fine but missing some key chapters on variable cost models.
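The kill switch in step 2 can be sketched as a sliding-window circuit breaker. This is a minimal illustration under assumptions, not a production design: the class and function names are hypothetical, "300% above baseline" is read as four times the baseline rate, and `primary` is assumed to return the response text together with its cost in USD.

```python
import time
from collections import deque


class CostCircuitBreaker:
    """Opens when spend per second over a sliding window exceeds the threshold."""

    def __init__(self, baseline_usd_per_sec, spike_ratio=3.0, window_sec=60.0,
                 clock=time.monotonic):
        # 300% above baseline means baseline * (1 + 3.0), i.e. 4x baseline.
        self.threshold = baseline_usd_per_sec * (1.0 + spike_ratio)
        self.window_sec = window_sec
        self.clock = clock
        self.events = deque()  # (timestamp, cost_usd) pairs, oldest first

    def record(self, cost_usd):
        """Register the cost of a completed call."""
        self.events.append((self.clock(), cost_usd))

    def current_rate(self):
        """Average USD per second over the window, after dropping stale events."""
        cutoff = self.clock() - self.window_sec
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        return sum(cost for _, cost in self.events) / self.window_sec

    def is_open(self):
        return self.current_rate() > self.threshold


def answer(prompt, breaker, primary, fallback):
    """Use the primary model unless the breaker is open, then fall back."""
    if breaker.is_open():
        return fallback(prompt)  # e.g. cached response or a simpler model
    text, cost_usd = primary(prompt)
    breaker.record(cost_usd)
    return text
```

Because old events age out of the window, the breaker closes again on its own once spend drops, so the expensive path comes back without manual intervention.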

TL;DR

We're entering this weird era where an engineer's prompt design is a direct P&L activity. My role as a VP of Engineering isn't just about uptime and velocity anymore. It's about enabling a cost-conscious culture without stifling innovation. And honestly? That balance is harder than any technical problem I've faced.

How are you currently attributing your LLM costs? Are you still just looking at the AWS bill, or have you drilled down to the feature level? I'm genuinely curious about the hacks you've built—drop them in the comments. I read every single one, even if I don't always respond.

#AIFinOps #EngineeringLeadership #CostOptimisation #GenerativeAI #SaaS

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
