GPT-5.6 Model Selection: How I Blew £11K Picking the Wrong AI, and the Framework That Saved Us

Last month, our inference costs spiked 340% in four days because I picked the wrong model variant. It wasn't a technical failure—it was a decision-making failure. And I own that.

Actually, "four days" makes it sound measured. It was Tuesday through Friday. But when you're watching your AWS bill tick up in real time from a Slack incident channel at 11pm, it sure feels like overnight. I remember refreshing the Cost Explorer dashboard at 2am, hoping the numbers would somehow be different. They weren't.

Here's the thing: as engineering leaders, we're increasingly asked to make architectural decisions about AI models we didn't train, don't fully control, and can't always benchmark in advance. The new GPT-5.6 family—Sol, Terra, and Luna—represents exactly this kind of challenge. Three variants. Three different optimisation targets. No universal "best" choice.

Here's the framework I wish I'd had three months ago. Well, I sort of had pieces of it. Nothing coherent. I was basically pattern-matching from the GPT-4 Turbo days and hoping for the best. That didn't work.

TL;DR for the Impatient

Sol = Deep reasoning, expensive, slow. Use for complex analysis where accuracy trumps everything.
Terra = Balanced workhorse. Start here unless you've got a compelling reason not to.
Luna = Fast, cheap, less clever. Perfect for real-time chat and classification.
The real mistake isn't picking wrong—it's paralysing yourself with benchmarks that don't reflect your actual workload.
Ship Terra in two days, measure for two to four weeks, then optimise. (Unless you're in healthcare or something regulated, in which case add appropriate guardrails first.)

The Three Variants, Minus the Marketing Fluff

I've read the model cards twice. Honestly, half of it's vibes. Here's what actually matters:

GPT-5.6 Sol — Optimised for reasoning depth and complex multi-step problem solving. Think legal document analysis, financial modelling, architecture review. Highest per-token cost, but often requires fewer tokens to reach correct conclusions. Emphasis on often. Not always.

GPT-5.6 Terra — The generalist workhorse. Balanced across reasoning, creativity, and instruction-following. This is the safe default for most business applications. Good at everything, best at nothing. I keep calling it "the golden retriever of models" and my team is genuinely sick of hearing it.

GPT-5.6 Luna — Latency-optimised and cost-efficient. Designed for high-throughput, real-time applications like chat, content moderation, or classification tasks. Sacrifices reasoning depth for speed. We're talking ~400ms p50 latency in our us-east-1 cluster. That's honestly wild for a model this capable.

The naming is cute. The tradeoffs are real.

The Mistake I Made (And How You Can Avoid It)

When we first evaluated these models, I did what most engineering leaders do: I looked at benchmark scores and picked the highest performer. Sol crushed MMLU, HumanEval, and every reasoning benchmark we threw at it. Easy choice, right?

Nope.

Our primary use case was a customer-facing support agent that needed sub-800ms response times. Sol averaged 2.3 seconds per response. Our CSAT dropped 12 points in the first week because users were staring at "typing…" indicators. I watched a session replay where someone typed "hello??" three times before the response came through. That one stung.

The lesson: benchmarks measure capability, not suitability.

Here's what I should have evaluated instead:

1. Latency Budget Per Interaction

What's the maximum acceptable response time for your use case? Luna delivers ~400ms p50 latency. Sol runs closer to 1.8-2.5s. Terra sits around 900ms. These numbers matter more than any accuracy metric if your users won't wait. And they won't.
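A minimal sketch of the latency-budget check, assuming the rough figures quoted above. The names `TYPICAL_LATENCY_MS`, `variants_within_budget`, and `measured_p50_ms` are hypothetical helpers for illustration, not part of any SDK, and Sol's value is simply the midpoint of the 1.8-2.5s range. Replace the table with latencies measured on your own traffic before relying on the answer.

```python
from statistics import median

# Approximate latencies quoted in this post, in milliseconds.
# ASSUMPTION: Sol's figure is the midpoint of the 1.8-2.5s range.
TYPICAL_LATENCY_MS = {"luna": 400, "terra": 900, "sol": 2150}


def variants_within_budget(budget_ms: float) -> list[str]:
    """Return the variants whose typical latency fits the budget, fastest first."""
    fits = [name for name, ms in TYPICAL_LATENCY_MS.items() if ms <= budget_ms]
    return sorted(fits, key=lambda name: TYPICAL_LATENCY_MS[name])


def measured_p50_ms(samples_ms: list[float]) -> float:
    """Median of latencies you measured yourself (e.g. from request logs)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    return median(samples_ms)


if __name__ == "__main__":
    print(variants_within_budget(800))   # ['luna']
    print(variants_within_budget(1000))  # ['luna', 'terra']
```

By these numbers a strict 800ms budget rules out Sol before accuracy even enters the conversation, which is the point of checking the budget first.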

2. Reasoning Depth Required

Are you summarising text or solving multi-step logic problems? We found that for 80% of our customer queries, Terra matched Sol's accuracy at 60% of the cost. Sol only pulled ahead on queries requiring 3+ reasoning steps. I think the threshold is actually closer to 4 steps based on our latest eval runs from January, but I'm still slicing the data.

3. Cost Per Meaningful Interaction

Don't optimise for cost-per-token. Optimise for cost-per-successful-outcome. Sol costs more per token but sometimes needs fewer tokens. Luna is cheap per token but might require retries. Measure end-to-end.

That last point took me embarrassingly long to internalise. I was staring at token counts in Datadog for two weeks before it clicked. My partner asked why I was muttering about "completion tokens" in my sleep.
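The cost-per-successful-outcome idea can be sketched in a few lines of Python. Everything named here (`WorkloadSample`, `cost_per_successful_outcome`, the prices and token counts in the usage example) is invented for illustration; the only thing taken from the text is the principle of dividing total spend, retries included, by the number of interactions that actually succeeded.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSample:
    """Totals for one model variant over one measurement window."""
    total_tokens: int            # prompt + completion tokens, retries included
    price_per_1k_tokens: float   # HYPOTHETICAL price, in your billing currency
    successful_outcomes: int     # interactions that met your own success bar


def cost_per_successful_outcome(sample: WorkloadSample) -> float:
    """Total spend divided by successes, rather than spend divided by tokens."""
    if sample.successful_outcomes <= 0:
        raise ValueError("no successful outcomes in this window")
    total_cost = sample.total_tokens / 1000 * sample.price_per_1k_tokens
    return total_cost / sample.successful_outcomes


if __name__ == "__main__":
    # Invented numbers: a cheap-per-token model that burns tokens on retries
    # versus a pricier model that succeeds more often with fewer tokens.
    cheap = WorkloadSample(3_000_000, 0.5, 800)
    pricey = WorkloadSample(1_000_000, 1.2, 950)
    print(round(cost_per_successful_outcome(cheap), 2))
    print(round(cost_per_successful_outcome(pricey), 2))
```

With these made-up inputs the cheaper-per-token option works out at roughly 1.88 per success against roughly 1.26 for the pricier one. Real workloads may well go the other way, which is why it has to be measured end-to-end.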

A Decision Matrix That Actually Works

I've since developed a simple scoring framework for model selection. Rate your use case on three dimensions (1-5 scale):

Reasoning Complexity (1 = simple classification, 5 = multi-step legal analysis)
Latency Sensitivity (1 = batch processing okay, 5 = real-time user-facing)
Cost Sensitivity (1 = unlimited budget, 5 = extremely cost-constrained)

Then map your scores:

High Reasoning (4-5) + Low Latency Sensitivity (1-2) → Sol. This is your model for back-office intelligence, document review, or anything where accuracy trumps speed. We use it for contract analysis and it's genuinely excellent there.

Balanced scores (3s across the board) → Terra. When in doubt, start here. You can always optimise later, but Terra rarely fails catastrophically. I've seen it produce some mediocre summaries, but never anything that made me panic at 2am.

Low Reasoning (1-2) + High Latency Sensitivity (4-5) → Luna. Chat, real-time moderation, intent classification. Speed wins here.
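The mapping above is small enough to write down as code. This is a sketch under stated assumptions: the function name `recommend_variant` and the `"needs-judgement"` label are hypothetical, the three rules as written only reference reasoning and latency scores (so the cost score is validated but not used), and any combination the rules do not name falls back to Terra as the "when in doubt" default.

```python
def recommend_variant(reasoning: int, latency: int, cost: int) -> str:
    """Map 1-5 scores to a suggested variant, following the three rules above."""
    for label, score in (("reasoning", reasoning), ("latency", latency), ("cost", cost)):
        if not 1 <= score <= 5:
            raise ValueError(f"{label} score must be between 1 and 5, got {score}")

    # High reasoning, low latency sensitivity: back-office analysis.
    if reasoning >= 4 and latency <= 2:
        return "sol"
    # Low reasoning, high latency sensitivity: chat, moderation, classification.
    if reasoning <= 2 and latency >= 4:
        return "luna"
    # ASSUMPTION: high on both axes is not covered by the three rules,
    # so flag it for a human decision rather than guessing.
    if reasoning >= 4 and latency >= 4:
        return "needs-judgement"
    # Balanced scores, and anything else not named above: the safe default.
    return "terra"


if __name__ == "__main__":
    print(recommend_variant(reasoning=5, latency=1, cost=3))  # sol
    print(recommend_variant(reasoning=3, latency=3, cost=3))  # terra
    print(recommend_variant(reasoning=1, latency=5, cost=5))  # luna
```

A natural extension would be to let a high cost score nudge borderline cases toward Luna, but that rule is not in the framework as described, so it is left out here.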

The edge cases are where leadership judgement matters most. High reasoning AND high latency sensitivity? That's when you start exploring hybrid architectures—Luna for triage, Sol for deep analysis on flagged items. We're experimenting with this now and... it's complicated. The routing logic itself adds ~150ms, which partially defeats the purpose. Still iterating.
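To see why the routing overhead bites, here is a rough latency model for the Luna-triage, Sol-escalation setup. It assumes the steps run one after another (routing, then Luna, then Sol only for flagged items), reuses the approximate figures from this post with Sol at the midpoint of its range, and uses hypothetical names throughout (`choose_model`, `expected_latency_ms`, `max_escalation_rate`). It is a back-of-envelope sketch, not a description of the production router.

```python
from typing import Callable

ROUTING_OVERHEAD_MS = 150   # routing logic cost quoted above
LUNA_MS = 400               # approximate Luna p50 quoted in this post
SOL_MS = 2150               # ASSUMPTION: midpoint of the 1.8-2.5s range


def choose_model(query: str, needs_deep_analysis: Callable[[str], bool]) -> str:
    """Send flagged queries to Sol; everything else stays on Luna.

    `needs_deep_analysis` is a hypothetical stand-in for the triage step.
    """
    return "sol" if needs_deep_analysis(query) else "luna"


def expected_latency_ms(escalation_rate: float) -> float:
    """Mean latency when every request is triaged and a fraction escalates."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return ROUTING_OVERHEAD_MS + LUNA_MS + escalation_rate * SOL_MS


def max_escalation_rate(budget_ms: float) -> float:
    """Largest escalation fraction that keeps the mean latency within budget."""
    headroom = budget_ms - ROUTING_OVERHEAD_MS - LUNA_MS
    return min(1.0, max(0.0, headroom / SOL_MS))


if __name__ == "__main__":
    print(expected_latency_ms(0.0))              # 550.0
    print(round(max_escalation_rate(800), 3))
```

Under these assumptions, an 800ms mean budget leaves room for only about 12% of requests to escalate, and each escalated request still takes well over two seconds on its own. Without the 150ms overhead the same budget would allow noticeably more, which is the sense in which the router partially defeats its own purpose.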

What This Means for Your Team

Model selection isn't just a technical decision. It's a business decision with engineering implications.

When I finally switched our customer agent from Sol to Terra (with Luna as a pre-filter), here's what happened:

Response times dropped 62% (from 2.1s to 800ms average)
Inference costs decreased 47% (saving roughly £11K/month at our scale—that's about $14K USD)
CSAT recovered to baseline within 10 days
Engineering morale improved because we stopped fighting fires caused by latency complaints

The ROI wasn't in the model. It was in the decision process.

I keep coming back to this. We spent £11K on a model we didn't need because I trusted a benchmark leaderboard over our own latency requirements. That's on me. I'm writing this partially so I don't do it again.

My Recommendation for Most Teams

If you're reading this and feeling overwhelmed, here's my simple heuristic:

Start with Terra. Ship it. Measure real-world performance for 2-4 weeks. Then decide if you need to optimise toward Sol (for accuracy) or Luna (for speed/cost).

The biggest mistake isn't picking the wrong model—it's delaying your decision by over-analysing benchmarks that don't reflect your actual workload. I've watched teams spend six weeks on evaluations when they could have shipped Terra in two days and gathered real data.

As Andy Grove wrote in High Output Management, "A common rule we should always try to heed is to detect and fix any problem in a production process at the lowest-value stage possible." In AI terms: ship fast with a safe default, then optimise based on production data, not benchmark anxiety.

Actually, I should probably caveat this. If you're in healthcare or legal or anything regulated, "ship fast" is terrible advice. You know your compliance requirements better than I do. The principle still holds though—just with appropriate guardrails.

I'm curious: how is your team approaching model selection in the GPT-5.6 era? Are you running formal evaluations, going with gut feel, or waiting for someone else to figure it out first?

Drop your approach in the comments—I'm genuinely collecting data points for a follow-up piece. Someone on my team suggested we just A/B test all three variants simultaneously and I'm still thinking about whether that's genius or chaos. Probably both.

#AIEngineering #EngineeringLeadership #ModelSelection #GPT56 #TechStrategy


Cael Lee
