I Tested GPT-4o, Claude 3.5, and Gemini Pro Vision for a Week — Here's What Actually Works in Produc

Last Tuesday, around 11 PM, I was about to shut my laptop when a client rang. His team had spent the entire day arguing about which multimodal model to standardise on. Three factions, each dug in: the algorithms team swore GPT-4o had the best accuracy, backend insisted Gemini was stupidly cheap, and the product managers all backed Claude because "its outputs are the most predictable."

"Just tell us which one to use," he said.

I said, "Fine. Let's not argue. Let's test."

Five days of building environments, running data, and collating results. Here's what I found: pretty much every comparison article I'd read was already out of date. These three models have been quietly updated multiple times in early 2025. GPT-4o got a new version on 15 March with noticeably better Chinese OCR. Gemini Pro Vision slashed costs by 40% in late February. Claude 3.5 Sonnet — released back in October — got a silent API speed boost from Anthropic in January.

So here are my actual results. Fair warning: this isn't some academic paper. It's real business data, and the conclusions might only apply to my specific use cases.

TL;DR (For the Skimmers)

GPT-4o: Best accuracy, especially OCR. Nightmare for JSON parsing. Fast-ish but inconsistent.
Claude 3.5 Sonnet: Slightly lower accuracy, but rock-solid JSON output and the fastest response times. Great for production.
Gemini Pro Vision: Cheapest on paper. Hallucinates more than I'm comfortable with. Ended up being the most expensive when you factor in human review.
My setup now: I route tasks between all three based on image quality and task type. Cut costs by 35% and actually improved accuracy.

Test Environment

If you want to reproduce this, here's my setup. Go for it:


python==3.12.3
openai==1.55.0
google-generativeai==0.8.3
anthropic==0.39.0

I ran everything on an AWS EC2 c7a.xlarge in us-west-2 — close enough to all three API endpoints. Testing happened between 17-21 March, mostly in the evenings because, you know, actual work during the day.

Datasets:

MMBench Chinese edition (20 sub-tasks covering general multimodal understanding)
200 real scanned invoices and contracts from a previous logistics project (yes, some had coffee stains — the client's finance team are absolute savages with paperwork)
100 complex charts: half pulled from ArXiv papers, half from financial analyst reports

Four dimensions: accuracy, speed, cost, and stability. I'll share specific numbers and, more importantly, the gotchas.

Round 1: General Multimodal Understanding

First up, MMBench. Basic capabilities.

GPT-4o leads on accuracy:

Model	Overall	Text Recognition	Visual Reasoning	Spatial Understanding
GPT-4o	89.7%	94.2%	87.1%	85.3%
Claude 3.5	87.3%	91.8%	85.6%	82.9%
Gemini Pro	84.1%	88.5%	81.2%	79.7%

That 94.2% text recognition? It's the real deal. I tested it with a deliberately blurry delivery note photo — complete with water stains. Don't ask how that happened; this is just what the client's raw data looks like. GPT-4o nailed the recipient's phone number. Claude turned a 6 into an 8. Gemini? Just dropped the last two digits entirely.

But.

There's a catch.

GPT-4o's stability is... not great. I tested the same image five times. Two results were different. One run missed text in an annotation box. Another flipped the percentages in a chart. Claude and Gemini scored lower, sure, but at least their results were consistent.

Actually, wait — I should clarify. When I say "five times," I mean five separate API calls on different days, not back-to-back. Could be server load. Could be model randomness. I didn't run an ablation study, so take this with a pinch of salt.
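
A minimal sketch of how you could put a number on this kind of instability, assuming each run's output has already been parsed into a dict. The helper name `agreement_rate` and the sample values are hypothetical, for illustration only; they are not taken from the test code behind this article.

```python
import json
from collections import Counter


def agreement_rate(outputs):
    """Return the share of runs that match the most common output.

    `outputs` is a list of dicts, one per API call on the same image.
    """
    if not outputs:
        raise ValueError("need at least one output")
    # Serialise with sorted keys so key order does not count as a difference.
    canonical = [json.dumps(o, sort_keys=True, ensure_ascii=False) for o in outputs]
    most_common_count = Counter(canonical).most_common(1)[0][1]
    return most_common_count / len(canonical)


# Five hypothetical runs on one image: two of them disagree with the rest.
runs = [
    {"phone": "13800001234"},
    {"phone": "13800001234"},
    {"phone": "13800001234"},
    {"phone": "13800001284"},
    {"phone": "138000012"},
]
print(agreement_rate(runs))
```

A rate of 1.0 means every run agreed. Anything lower tells you how often the majority answer showed up, which is easier to track over time than "two results were different."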

For speed, I threw together a quick script with 50 requests:


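import statistics
import time

import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

latencies = []
for i in range(50):
    start = time.time()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,  # required by the Messages API; the value here is arbitrary
        messages=[{"role": "user", "content": [...]}],  # image and prompt blocks elided
    )
    latencies.append(time.time() - start)

# quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
print(f"P50: {statistics.median(latencies):.2f}s")
print(f"P95: {statistics.quantiles(latencies, n=20)[18]:.2f}s")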

Results:

Claude 3.5 Sonnet: P50 1.8s, P95 3.2s
GPT-4o: P50 2.4s, P95 5.7s
Gemini Pro Vision: P50 3.1s, P95 6.8s

Claude is the fastest and most consistent. GPT-4o occasionally just... stalls. That P95 of 5.7 seconds is noticeable. I suspect Anthropic's done something clever with their inference architecture, but they haven't shared details — pure speculation on my part.
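
To compare providers like for like, the timing loop can be wrapped around any callable. The following is a sketch under that assumption: `measure_latency` and `fake_call` are hypothetical names, and the stand-in call just sleeps so the example runs without API keys.

```python
import statistics
import time


def measure_latency(call, n=50):
    """Time `call()` n times and return P50/P95 latency in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()  # monotonic clock, safer for intervals than time.time()
        call()
        latencies.append(time.perf_counter() - start)
    return {
        "p50": statistics.median(latencies),
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        "p95": statistics.quantiles(latencies, n=20)[18],
    }


def fake_call():
    """Stand-in for a real API request so the sketch runs offline."""
    time.sleep(0.001)


stats = measure_latency(fake_call, n=20)
print(f"P50: {stats['p50']:.3f}s  P95: {stats['p95']:.3f}s")
```

In practice you would pass one small function per provider, each sending the same image and prompt, so the only thing that varies between runs is the model.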

Round 2: Invoice and Contract Processing

Here's where things get interesting.

Two hundred real VAT invoices. Scanned at various angles and lighting conditions. Some had actual coffee stains. The client's finance department truly fears nothing when it comes to document abuse.

The task: extract invoice number, date, buyer name, total amount, tax amount, seller name. All six fields correct, or the invoice counts as failed.
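
The two metrics that follow from this rule can be sketched as below. This is an illustration, not the scoring script used for the test: the field keys other than `invoice_no` and `date` are assumed names, and exact string equality is an assumed matching rule (real scoring may want to normalise dates and amounts first).

```python
FIELDS = ["invoice_no", "date", "buyer_name", "total_amount", "tax_amount", "seller_name"]


def score_invoices(predictions, ground_truth):
    """Return (field_accuracy, complete_invoice_accuracy).

    Both arguments are lists of dicts in the same order, one dict per invoice.
    A missing field counts as wrong.
    """
    correct_fields = 0
    complete_invoices = 0
    for pred, truth in zip(predictions, ground_truth):
        hits = sum(1 for field in FIELDS if pred.get(field) == truth[field])
        correct_fields += hits
        if hits == len(FIELDS):
            complete_invoices += 1
    total = len(ground_truth)
    return correct_fields / (total * len(FIELDS)), complete_invoices / total


# Two hypothetical invoices: the second prediction gets one field wrong.
truth = [
    {"invoice_no": "1", "date": "2025-03-18", "buyer_name": "A",
     "total_amount": "113.00", "tax_amount": "13.00", "seller_name": "B"},
    {"invoice_no": "2", "date": "2025-03-19", "buyer_name": "C",
     "total_amount": "226.00", "tax_amount": "26.00", "seller_name": "D"},
]
preds = [dict(truth[0]), {**truth[1], "tax_amount": "28.00"}]

field_acc, complete_acc = score_invoices(preds, truth)
print(f"field-level: {field_acc:.1%}, complete: {complete_acc:.1%}")
```

One wrong field out of twelve barely dents field-level accuracy but halves complete-invoice accuracy here, which is why the two numbers diverge so much on real data.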

Results:


Field-level accuracy:
GPT-4o: 91.3% (1096/1200)
Claude 3.5: 88.7% (1064/1200)
Gemini Pro: 84.2% (1010/1200)

Complete invoice accuracy (all 6 fields):
GPT-4o: 72.5% (145/200)
Claude 3.5: 68.0% (136/200)
Gemini Pro: 59.5% (119/200)

On paper, GPT-4o wins.

But let me tell you what nearly made me throw my laptop out the window — output formatting.

GPT-4o's response format is an absolute lottery. Sometimes JSON. Sometimes a Markdown table. Sometimes JSON wrapped in ```json fence markers. Sometimes, and I'm not making this up, plain text: "The invoice number is 12345678, and the date is 2025..."

I ended up writing this monstrosity:


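import json
import re


def parse_gpt4o_invoice_output(raw_text):
    # I have lost the will to live
    # Strip a Markdown code fence if the model wrapped its answer in one.
    if "```json" in raw_text:
        raw_text = raw_text.split("```json")[1].split("```")[0]
    elif "```" in raw_text:
        raw_text = raw_text.split("```")[1].split("```")[0]

    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        # Fallback to regex. I hate everything.
        result = {}
        patterns = {
            "invoice_no": r"invoice\s*(?:number|no)[：:]\s*(\d+)",
            "date": r"date[：:]\s*(\d{4}[-/]\d{1,2}[-/]\d{1,2})",
            # There's more of this nightmare (remaining fields elided)
        }
        # Keep whichever fields the patterns manage to find.
        for field, pattern in patterns.items():
            match = re.search(pattern, raw_text, re.IGNORECASE)
            if match:
                result[field] = match.group(1)
        return result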

Once, GPT-4o decided to rename the "total_amount" field to "total_amount_yuan." Autonomously. It just... added a currency suffix. My parsing script exploded. I spent half an hour combing through logs before I spotted it. This kind of "helpfulness" is a ticking time bomb in production.

Claude, by contrast, is a dream. You ask for JSON, you get JSON. Field names stay exactly as specified. Two hundred requests, two format anomalies. Gemini's not bad either, but it occasionally invents fields you didn't ask for — like suddenly adding "invoice_type" out of nowhere.
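
Renamed and invented fields are cheap to catch before they reach downstream code. A minimal sketch, assuming the six fields map to the keys below (only `invoice_no` and `date` appear in the parser above; the rest are assumed names) and using a hypothetical helper `check_fields`:

```python
import json

EXPECTED_FIELDS = {
    "invoice_no", "date", "buyer_name",
    "total_amount", "tax_amount", "seller_name",
}


def check_fields(raw_json):
    """Parse a JSON object and report missing and unexpected keys."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    keys = set(data)
    return {
        "missing": sorted(EXPECTED_FIELDS - keys),
        "unexpected": sorted(keys - EXPECTED_FIELDS),
    }


# Hypothetical response with one renamed field and one invented field.
sample = json.dumps({
    "invoice_no": "12345678",
    "date": "2025-03-18",
    "buyer_name": "A",
    "seller_name": "B",
    "tax_amount": "13.00",
    "total_amount_yuan": "113.00",
    "invoice_type": "VAT",
})
report = check_fields(sample)
print(report)
```

Logging this report per request turns a half-hour log hunt into a one-line alert. Whether to reject, retry, or repair a response with a non-empty report is a design choice that depends on your pipeline.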

Round 3: Complex Chart Understanding

One hundred charts from papers and analyst reports. Line graphs, bar charts, scatter plots, heatmaps, Sankey diagrams — the works.

Basic value extraction was close: GPT-4o 86.2%, Claude 85.8%, Gemini 83.1%.

Deeper analysis is where they diverge.

GPT-4o gives you the cleanest analytical framework. Stuff like: "The chart shows NEV penetration rising from 35% to 52% between Q3 2024 and Q1 2025, with growth primarily driven by the £10,000-£20,000 price segment." Structured. Ready to use.

Claude is more cautious. It loves qualifiers: "Based on the chart, there appears to be... however, sample size limitations should be considered." In finance or healthcare, this is actually a feature — it won't over-interpret your data.

Gemini has a peculiar problem.

It "sees" things that aren't there. I tested it with a dual-axis line chart, and it insisted there was an annotation arrow pointing to a June 2024 data point. I checked. There's no arrow. I've stared at that chart more times than I'd like to admit. This hallucination appeared in roughly 8% of samples, especially with complex visualisations.

I didn't dig into whether this is a data issue or a model issue, but practically speaking — if you use Gemini for chart analysis, budget for human verification.

Cost: Cheap on Paper Doesn't Mean Cheap

First, unit prices (March 2025):

Model	Image Input	Text Output
GPT-4o	$0.00213/image	$15/1M tokens
Claude 3.5	$0.0048/image	$15/1M tokens
Gemini Pro	$0.00125/image	$10.5/1M tokens

Gemini's the cheapest. Obvious.

But.

I modelled a scenario: 100,000 invoices per month:


API costs:
GPT-4o: ~$1,011
Claude: ~$1,242
Gemini: ~$670

Adding human review (every invoice that fails the all-six-fields check gets a two-minute manual review, at $25/hour):
GPT-4o: +$22,917 = $23,928
Claude: +$26,667 = $27,909
Gemini: +$33,750 = $34,420

Gemini goes from cheapest to most expensive.

Because it's not accurate enough. Missed fields need humans to fill them in. At 59.5% complete-invoice accuracy, roughly 40,500 of the 100,000 invoices fail; at two minutes each, that is about 1,350 hours of review a month. This is why I say unit pricing is a trap: the API savings get eaten by labour costs, plus you lose time.
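
The arithmetic behind those totals can be sketched in a few lines. The API costs and accuracy rates are the figures reported above; the assumption, which happens to reproduce the totals, is that only invoices failing the all-six-fields check are reviewed, at two minutes each. The function and constant names are illustrative.

```python
INVOICES_PER_MONTH = 100_000
REVIEW_MINUTES = 2      # manual review time per failed invoice
HOURLY_RATE = 25.0      # USD per reviewer hour

# model: (monthly API cost in USD, complete-invoice accuracy)
MODELS = {
    "GPT-4o": (1011, 0.725),
    "Claude": (1242, 0.680),
    "Gemini": (670, 0.595),
}


def monthly_cost(api_cost, complete_accuracy):
    """API cost plus the labour cost of reviewing every failed invoice."""
    failed = INVOICES_PER_MONTH * (1 - complete_accuracy)
    review_hours = failed * REVIEW_MINUTES / 60
    return api_cost + review_hours * HOURLY_RATE


totals = {name: round(monthly_cost(*values)) for name, values in MODELS.items()}
for name, total in totals.items():
    print(f"{name}: ${total:,}")
```

Swap in your own volume, review time, and hourly rate before drawing conclusions. The ranking is sensitive to all three, and with cheap enough review labour the cheapest API can come out ahead again.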

My actual test bills:

OpenAI: $187.32
Anthropic: $203.45
Google: $89.67 (covered by the $300 in free credits)

Google's generous with new users — $300 in credits gets you quite far in testing. But you can't bank on that in production.

My Recommendations

After a week of testing, here's my take:

Use GPT-4o if you need maximum accuracy, especially for OCR. And you're willing to write a bunch of format-handling code. And your budget isn't too tight.

Use Claude 3.5 Sonnet if you want predictable JSON output and don't fancy debugging format issues at 2 AM. Its speed advantage is real for interactive applications. In finance, healthcare, or legal contexts, Claude's conservative analysis style is genuinely a feature.

Use Gemini Pro Vision if your budget is really constrained, or you're still in the validation phase. Clean scans and standard charts are fine. Complex stuff? Proceed with caution. The hallucination rate is a concern.

For most of my projects now, I use a mix. Here's what the logistics invoice system ended up looking like:


Image preprocessing → Quality scoring → Routing
                                          │
                                          ├─ Score > 0.8 & invoice → Gemini (cheap)
                                          ├─ Chart analysis → Claude (stable)
                                          └─ Everything else → GPT-4o (strongest)

The routing logic is dead simple:


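def route_to_model(image_quality_score, task_type):
    if image_quality_score > 0.8 and task_type == "invoice":
        # Clean invoice scans go to the cheapest model.
        return "gemini"
    elif task_type == "chart_analysis":
        # Charts go to the model with the most consistent output.
        return "claude"
    else:
        # Everything else, including low-quality invoice scans.
        return "gpt4o"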

This cut total costs by about 35% compared to using GPT-4o for everything. Accuracy actually went up by two points because each model handles what it's best at.

Final Thoughts

Multimodal models have improved terrifyingly fast. A year ago, we were still debating whether they could reliably read invoice numbers. Now we're optimising routing strategies.

But one thing hasn't changed — there's no silver bullet.

The pricing strategies are also a bit maddening. OpenAI and Anthropic both complicate things with image size calculations, so end-of-month bills are often head-scratchers. Google's straightforward flat pricing is nice, but it doesn't matter much when the model hallucinates.

If you're doing your own evaluation, here's my actual advice: ignore benchmarks. Ignore official pricing pages. Run your own data. Your data distribution is unique, and those few percentage points on a public benchmark might mean something completely different in your context.

What multimodal model is your team using? What nightmares have you encountered? Drop a comment — I'm actually compiling a "Multimodal Model Production Horror Stories" collection and would love your contributions. Full test code and anonymised datasets are on GitHub, link pinned in the comments.

#GPT4o #Claude3.5 #GeminiProVision #MultimodalAI #AIROI #LLMEvaluation #AITrends2025

Cael Lee
