GPT-5.6 Model Selection: How I Blew £11K Picking the Wrong AI, and the Framework That Saved Us

By Cael Lee | 6 min read

Last month, our inference costs spiked 340% in four days because I picked the wrong model variant. It wasn't a technical failure—it was a decision-making failure. And I own that.

Actually, "four days" makes it sound measured. It was Tuesday through Friday. But when you're watching your AWS bill tick up in real time from a Slack incident channel at 11pm, it sure feels like overnight. I remember refreshing the Cost Explorer dashboard at 2am, hoping the numbers would somehow be different. They weren't.

Here's the thing: as engineering leaders, we're increasingly asked to make architectural decisions about AI models we didn't train, don't fully control, and can't always benchmark in advance. The new GPT-5.6 family—Sol, Terra, and Luna—represents exactly this kind of challenge. Three variants. Three different optimisation targets. No universal "best" choice.

Here's the framework I wish I'd had three months ago. Well, I sort of had pieces of it. Nothing coherent. I was basically pattern-matching from the GPT-4 Turbo days and hoping for the best. That didn't work.

TL;DR for the Impatient

The Three Variants, Minus the Marketing Fluff

I've read the model cards twice. Honestly, half of it's vibes. Here's what actually matters:

The naming is cute. The tradeoffs are real.

The Mistake I Made (And How You Can Avoid It)

When we first evaluated these models, I did what most engineering leaders do: I looked at benchmark scores and picked the highest performer. Sol crushed MMLU, HumanEval, and every reasoning benchmark we threw at it. Easy choice, right?

Nope.

Our primary use case was a customer-facing support agent that needed sub-800ms response times. Sol averaged 2.3 seconds per response. Our CSAT dropped 12 points in the first week because users were staring at "typing…" indicators. I watched a session replay where someone typed "hello??" three times before the response came through. That one stung.

The lesson: benchmarks measure capability, not suitability.

Here's what I should have evaluated instead:

1. Latency Budget Per Interaction

What's the maximum acceptable response time for your use case? Luna delivers ~400ms p50 latency. Sol runs closer to 1.8-2.5s. Terra sits around 900ms. These numbers matter more than any accuracy metric if your users won't wait. And they won't.
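A minimal sketch of checking a latency budget against your own measurements, assuming you already log per-request latencies in milliseconds. The function names and the sample values are hypothetical, chosen only to illustrate why the p95 can blow a budget that the p50 comfortably meets.

```python
import statistics

def latency_percentiles(samples_ms):
    """Return (p50, p95) for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]

def fits_budget(samples_ms, budget_ms, percentile="p95"):
    """Check whether the chosen percentile stays within the latency budget."""
    p50, p95 = latency_percentiles(samples_ms)
    observed = p95 if percentile == "p95" else p50
    return observed <= budget_ms

# Hypothetical samples for illustration only, not production data.
samples = [380, 410, 395, 450, 720, 405, 390, 1200, 415, 400]
p50, p95 = latency_percentiles(samples)
print(f"p50={p50}ms p95={p95}ms")
print("within 800ms budget at p95:", fits_budget(samples, budget_ms=800))
```

Which percentile you hold to the budget is a product decision. Users in the slow tail are the ones typing "hello??", so a p50-only check can look healthy while the experience is not.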

2. Reasoning Depth Required

Are you summarising text or solving multi-step logic problems? We found that for 80% of our customer queries, Terra matched Sol's accuracy at 60% of the cost. Sol only pulled ahead on queries requiring 3+ reasoning steps. I think the threshold is actually closer to 4 steps based on our latest eval runs from January, but I'm still slicing the data.
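One way to slice eval data by reasoning depth is sketched below. It assumes each eval record has been annotated with a step count, which is the hard part in practice. The record keys ('model', 'steps', 'correct'), the model labels, and the 0.05 gap threshold are all hypothetical.

```python
from collections import defaultdict

def accuracy_by_steps(records):
    """Return accuracy keyed by (model, reasoning steps).

    Each record is a dict with hypothetical keys: 'model', 'steps', 'correct'.
    """
    totals = defaultdict(lambda: [0, 0])  # (model, steps) -> [correct, total]
    for r in records:
        key = (r["model"], r["steps"])
        totals[key][0] += int(r["correct"])
        totals[key][1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}

def crossover_step(acc, strong, baseline, min_gap=0.05):
    """Smallest step count where `strong` beats `baseline` by min_gap, else None."""
    for steps in sorted({s for (_, s) in acc}):
        if (strong, steps) in acc and (baseline, steps) in acc:
            if acc[(strong, steps)] - acc[(baseline, steps)] >= min_gap:
                return steps
    return None

# Tiny made-up dataset, for illustration only.
records = [
    {"model": "sol", "steps": 1, "correct": True},
    {"model": "terra", "steps": 1, "correct": True},
    {"model": "sol", "steps": 3, "correct": True},
    {"model": "sol", "steps": 3, "correct": True},
    {"model": "terra", "steps": 3, "correct": True},
    {"model": "terra", "steps": 3, "correct": False},
]
acc = accuracy_by_steps(records)
print(crossover_step(acc, strong="sol", baseline="terra"))
```

With only a handful of queries per bucket the crossover point will move around from run to run, so check the bucket sizes before trusting a specific threshold.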

3. Cost Per Meaningful Interaction

Don't optimise for cost-per-token. Optimise for cost-per-successful-outcome. Sol costs more per token but sometimes needs fewer tokens. Luna is cheap per token but might require retries. Measure end-to-end.

That last point took me embarrassingly long to internalise. I was staring at token counts in Datadog for two weeks before it clicked. My partner asked why I was muttering about "completion tokens" in my sleep.
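The cost-per-successful-outcome idea can be written down in a few lines. This is a sketch under assumptions: the attempt records, their keys, and the prices are hypothetical, and "success" is whatever your team defines it to be (a resolved ticket, an answer that passed review).

```python
def cost_per_successful_outcome(attempts, price_in_per_1k, price_out_per_1k):
    """Total spend across all attempts, retries included, divided by successes.

    `attempts` is a list of dicts with hypothetical keys:
    'input_tokens', 'output_tokens', 'success'.
    Prices are per 1,000 tokens in whatever currency you bill in.
    """
    total_cost = sum(
        a["input_tokens"] / 1000 * price_in_per_1k
        + a["output_tokens"] / 1000 * price_out_per_1k
        for a in attempts
    )
    successes = sum(1 for a in attempts if a["success"])
    if successes == 0:
        return float("inf")  # money spent, nothing resolved
    return total_cost / successes

# Made-up example: one failed attempt followed by a successful retry.
attempts = [
    {"input_tokens": 1000, "output_tokens": 1000, "success": False},
    {"input_tokens": 1000, "output_tokens": 1000, "success": True},
]
print(cost_per_successful_outcome(attempts, price_in_per_1k=1.0, price_out_per_1k=2.0))
```

The failed attempt still counts toward the numerator, which is exactly what a per-token view hides.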

A Decision Matrix That Actually Works

I've since developed a simple scoring framework for model selection. Rate your use case on three dimensions (1-5 scale):

Then map your scores:

The edge cases are where leadership judgement matters most. High reasoning AND high latency sensitivity? That's when you start exploring hybrid architectures—Luna for triage, Sol for deep analysis on flagged items. We're experimenting with this now and... it's complicated. The routing logic itself adds ~150ms, which partially defeats the purpose. Still iterating.
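To make the hybrid idea concrete, here is a rough sketch of a triage router that also measures its own overhead. It is not our production router. The `triage`, `fast_model`, and `deep_model` callables are hypothetical stand-ins you would replace with real API wrappers, and the keyword heuristic in the example is a placeholder.

```python
import time

def route(query, triage, fast_model, deep_model):
    """Send a query to the fast model unless triage flags it for deep analysis.

    All three callables are supplied by the caller. In a Luna/Sol setup,
    `fast_model` would wrap Luna and `deep_model` would wrap Sol.
    Returns (answer, tier, routing_overhead_ms).
    """
    start = time.perf_counter()
    needs_deep = triage(query)
    overhead_ms = (time.perf_counter() - start) * 1000
    if needs_deep:
        return deep_model(query), "deep", overhead_ms
    return fast_model(query), "fast", overhead_ms

# Placeholder stubs so the sketch runs without any API calls.
def keyword_triage(query):
    return "why" in query.lower() or len(query.split()) > 20

answer, tier, overhead = route(
    "Reset my password",
    triage=keyword_triage,
    fast_model=lambda q: f"fast: {q}",
    deep_model=lambda q: f"deep: {q}",
)
print(tier, f"{overhead:.3f}ms")
```

Returning the overhead alongside the answer makes it easy to log the routing cost per request and see whether it eats the latency you were trying to save.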

What This Means for Your Team

Model selection isn't just a technical decision. It's a business decision with engineering implications.

When I finally switched our customer agent from Sol to Terra (with Luna as a pre-filter), here's what happened:

The ROI wasn't in the model. It was in the decision process.

I keep coming back to this. We spent £11K on a model we didn't need because I trusted a benchmark leaderboard over our own latency requirements. That's on me. I'm writing this partially so I don't do it again.

My Recommendation for Most Teams

If you're reading this and feeling overwhelmed, here's my simple heuristic:

Start with Terra. Ship it. Measure real-world performance for 2-4 weeks. Then decide if you need to optimise toward Sol (for accuracy) or Luna (for speed/cost).

The biggest mistake isn't picking the wrong model—it's delaying your decision by over-analysing benchmarks that don't reflect your actual workload. I've watched teams spend six weeks on evaluations when they could have shipped Terra in two days and gathered real data.

As Andy Grove wrote in High Output Management, "A common rule we should always try to heed is to detect and fix any problem in a production process at the lowest-value stage possible." In AI terms: ship fast with a safe default, then optimise based on production data, not benchmark anxiety.

Actually, I should probably caveat this. If you're in healthcare or legal or anything regulated, "ship fast" is terrible advice. You know your compliance requirements better than I do. The principle still holds though—just with appropriate guardrails.

I'm curious: how is your team approaching model selection in the GPT-5.6 era? Are you running formal evaluations, going with gut feel, or waiting for someone else to figure it out first?

Drop your approach in the comments—I'm genuinely collecting data points for a follow-up piece. Someone on my team suggested we just A/B test all three variants simultaneously and I'm still thinking about whether that's genius or chaos. Probably both.

#AIEngineering #EngineeringLeadership #ModelSelection #GPT56 #TechStrategy

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
