I Spent $1,247 Testing Multimodal AI APIs in 2025 — Here's What Actually Shipped

Last Tuesday, I stared at my Stripe dashboard with that familiar pit in my stomach. Churn had ticked up to 3.8%, and I knew exactly why: our AI-generated product descriptions sucked at actually seeing the products.

Merchants uploaded gorgeous lifestyle photos, and our AI spat back "blue shirt on white background." The real magic? The texture of the linen fabric, the way it drapes on a human body. We were leaving money on the table.

So I did what any slightly-obsessive bootstrapper does: two weeks and $1,247 later, I'd benchmarked every multimodal AI API worth testing. Pieter Levels once tweeted "the best API is the one you actually ship with," but I needed to know which one wouldn't bankrupt me at scale.

Actually — wait. I should clarify. $1,247 wasn't all API costs. About $400 went to freelancers evaluating outputs, and $150 was Amazon gift cards to bribe merchants into letting me record them. The actual API burn? Closer to $700. Still stings.

Here's the raw data, the surprises, and the one API I'm now betting my entire product on.

TL;DR

Claude 3.5 Sonnet crushed visual understanding — 4.7/5 vs GPT-4o's 4.4/5
Gemini 1.5 Pro dominated audio transcription (2.1s vs GPT-4o's 4.7s latency)
Video understanding is shelved entirely (14% hallucination rate is a dealbreaker)
My hybrid approach dropped churn from 3.8% to 2.1% in two weeks
Claude-generated descriptions drew 62% fewer "this description is wrong" tickets than Gemini's

The Setup: What I Actually Needed

My use case sounds simple: take a product image, return a rich description covering material, style, use-case, and emotional vibe. But I also tested audio transcription (for a "talk through your product" feature) and short video clips (those 5-second product demos everyone's adding now).

Four contenders:

OpenAI GPT-4o: the obvious default
Google Gemini 1.5 Pro: the "we have more context" play
Anthropic Claude 3.5 Sonnet: the "we care about nuance" option
Qwen-VL-Max: the dark horse from Alibaba Cloud that Marc Lou mentioned in his newsletter

Visual Understanding: Where Claude Surprised Everyone

I ran 500 product images through each API. Same prompt, same temperature settings. Then I paid a freelance copywriter $200 to blind-rate outputs on "would this description help you buy this product?" (1-5 scale).

Results:

Claude 3.5 Sonnet: 4.7/5 average
GPT-4o: 4.4/5
Gemini 1.5 Pro: 4.1/5
Qwen-VL-Max: 3.8/5
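
For readers who want to reproduce a blind rating like this, here is a minimal sketch of one way to set it up. It is an illustration, not the script behind the numbers above: the function names, the record layout, and the seed are all hypothetical. The idea is to shuffle the outputs, hide the model name behind a row ID, and only join scores back to models once the rater is done.

```python
import random
from collections import defaultdict
from statistics import mean

def make_blind_sheet(outputs, seed=0):
    """outputs: list of (model, image_id, description) tuples.
    Returns (sheet, key). Sheet rows carry no model name; key maps row_id -> model."""
    rng = random.Random(seed)      # fixed seed so the shuffle is reproducible
    shuffled = list(outputs)
    rng.shuffle(shuffled)
    sheet, key = [], {}
    for row_id, (model, image_id, description) in enumerate(shuffled):
        sheet.append({"row_id": row_id, "image_id": image_id, "description": description})
        key[row_id] = model        # kept away from the rater
    return sheet, key

def average_by_model(ratings, key):
    """ratings: dict of row_id -> score on the 1-5 scale. Returns model -> mean score."""
    by_model = defaultdict(list)
    for row_id, score in ratings.items():
        by_model[key[row_id]].append(score)
    return {model: round(mean(scores), 2) for model, scores in by_model.items()}
```

The rater only ever sees `sheet`; `key` stays with whoever runs the test until all scores are in.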

Here's the thing — Claude noticed stuff the others completely whiffed on. For a linen shirt, Claude wrote: "The fabric shows intentional micro-wrinkling characteristic of pre-washed linen, suggesting it won't shrink dramatically after purchase."

That's the kind of detail that reduces returns.

GPT-4o gave me "relaxed-fit linen shirt in natural beige." Accurate. But not selling.

The real shocker? Gemini was consistently terrible at texture. Like, weirdly bad. It kept defaulting to "smooth fabric" even on obviously textured tweed. I don't know what's going on there — probably training data bias — but it's unusable for fashion.

Cost breakdown (per 1,000 images):

Claude 3.5 Sonnet: $3.75 (those image tokens add up)
GPT-4o: $2.88
Gemini 1.5 Pro: $1.50 (they're clearly subsidizing adoption)
Qwen-VL-Max: $0.96

I almost went with Gemini purely on economics. Then I calculated something I'd overlooked: support tickets per 100 descriptions. Merchants using Claude-generated descriptions filed 62% fewer "this description is wrong" tickets versus Gemini.

For a solo founder? That's roughly 14 hours/month I get back. That's basically an extra week of building.
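
To make that trade-off concrete, here is a rough sketch of folding support time into the per-1,000-image price. The API prices and the 62% figure come from the text above. Everything else (the ticket rate, the minutes per ticket, the hourly rate) is a placeholder assumption, not a number from the test, so swap in your own.

```python
# API price per 1,000 images, taken from the cost breakdown above.
API_COST_PER_1K = {"claude-3.5-sonnet": 3.75, "gemini-1.5-pro": 1.50}

def effective_cost_per_1k(api_cost, tickets_per_100, minutes_per_ticket, hourly_rate):
    """API cost plus the value of time spent on 'this description is wrong' tickets."""
    tickets = tickets_per_100 * 10                  # tickets per 1,000 descriptions
    support_hours = tickets * minutes_per_ticket / 60
    return api_cost + support_hours * hourly_rate

# ASSUMPTIONS (hypothetical, for illustration only):
GEMINI_TICKETS_PER_100 = 5.0    # made-up baseline ticket rate
MINUTES_PER_TICKET = 10         # made-up handling time
HOURLY_RATE = 50                # made-up value of founder time, in dollars

# 62% fewer tickets for Claude-generated descriptions, per the comparison above.
CLAUDE_TICKETS_PER_100 = GEMINI_TICKETS_PER_100 * (1 - 0.62)

gemini_total = effective_cost_per_1k(
    API_COST_PER_1K["gemini-1.5-pro"], GEMINI_TICKETS_PER_100, MINUTES_PER_TICKET, HOURLY_RATE)
claude_total = effective_cost_per_1k(
    API_COST_PER_1K["claude-3.5-sonnet"], CLAUDE_TICKETS_PER_100, MINUTES_PER_TICKET, HOURLY_RATE)
print(f"Gemini: ${gemini_total:.2f}  Claude: ${claude_total:.2f}")
```

With those made-up inputs the support time dwarfs the API price gap. The exact totals mean nothing, but the shape of the result is the point: the per-image price is the small number.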

Audio: The Feature I Didn't Know I Needed

This started as a curiosity test. Turned into a potential new feature.

I recorded 30 Shopify merchants describing their products naturally (with permission — yes, I bribed them with $25 Amazon gift cards). The task: transcribe rambling, accented, tangent-filled audio into clean product descriptions.

Winner by a mile: Gemini 1.5 Pro.

Its audio understanding is native — it doesn't just transcribe, it interprets tone and emphasis. One merchant said "this jacket is, like, REALLY warm, like you-won't-believe-it warm," and Gemini output: "Exceptionally warm jacket designed for extreme cold conditions." GPT-4o transcribed it literally and kept the "like" fillers.

I did not expect that gap. At all.

Latency test (60-second audio clip):

Gemini 1.5 Pro: 2.1 seconds
Qwen-VL-Max: 3.3 seconds
GPT-4o: 4.7 seconds
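
A timing harness along these lines is enough to run the same kind of comparison. This is a sketch under assumptions, not the benchmark behind the numbers above: `fake_transcribe` is a hypothetical stand-in for whatever function wraps a provider's API call, and the run count and the use of the median are arbitrary choices.

```python
import time
from statistics import median

def time_call(fn, *args, runs=5, **kwargs):
    """Call fn(*args, **kwargs) `runs` times and return the median wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args, **kwargs)
        samples.append(time.perf_counter() - start)
    return median(samples)

def fake_transcribe(clip_path):
    """Hypothetical stand-in for a real provider call; sleeps to simulate network time."""
    time.sleep(0.01)
    return "transcript of " + clip_path

if __name__ == "__main__":
    seconds = time_call(fake_transcribe, "clip_60s.wav", runs=3)
    print(f"median latency: {seconds:.3f}s")
```

The median is used rather than the mean so a single slow request does not skew the comparison.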

I'm now building a "speak your product story" feature based entirely on this. Estimated build time: 3 weeks. Potential ARPU increase: $8-12/month for premium tier. I'll report back if that actually pans out — probably late February if my roadmap doesn't explode.

Video: The Feature I'm Killing Before Launch

I'd planned to add automatic product demo descriptions from short video clips. After testing? Shelved entirely.

The problem isn't quality — it's consistency. Every API sometimes nailed a 5-second video and other times hallucinated objects that didn't exist. Claude described a rotating watch video as "a wristwatch with leather strap being turned to show the clasp mechanism" (correct!) and then, in the very next test, "a compass embedded in a leather bracelet."

A compass. Where did it even get that?

Error rate: roughly 1 in 7 videos had significant hallucinations. That means I'd need human-in-the-loop review, which kills unit economics for a bootstrapped product.

Danny Postma told me once: "the best AI feature is the one that doesn't require a 'report incorrect result' button." He's right. I'm not building a feature that needs a built-in apology mechanism.

Cost reality check (per 1,000 five-second videos):

GPT-4o: $18.20
Gemini 1.5 Pro: $11.40
Claude: Not fully supported (image frames only)
Qwen-VL-Max: $7.80

At 340 merchants, if even 20% used video weekly, that's $300-400/month in API costs. For a feature with a 14% hallucination rate.
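
The arithmetic behind an estimate like that can be sketched as follows. The prices, the 340 merchants, the 20% weekly usage, and the 1-in-7 error rate come from the text above. The number of clips each active merchant uploads per week is not stated anywhere in this post, so the 75 used here is purely an assumed figure for illustration.

```python
# Price per 1,000 five-second videos, from the list above.
VIDEO_COST_PER_1K = {"gpt-4o": 18.20, "gemini-1.5-pro": 11.40, "qwen-vl-max": 7.80}
WEEKS_PER_MONTH = 52 / 12

def clips_per_month(merchants, weekly_active_share, clips_per_merchant_week):
    """Total clips processed in an average month."""
    return merchants * weekly_active_share * clips_per_merchant_week * WEEKS_PER_MONTH

def monthly_cost(clips, cost_per_1k):
    """API cost in dollars for `clips` videos at a per-1,000 price."""
    return clips / 1000 * cost_per_1k

# ASSUMPTION (hypothetical): 75 clips per active merchant per week.
ASSUMED_CLIPS_PER_MERCHANT_WEEK = 75

clips = clips_per_month(340, 0.20, ASSUMED_CLIPS_PER_MERCHANT_WEEK)
for model, price in VIDEO_COST_PER_1K.items():
    print(f"{model}: ${monthly_cost(clips, price):,.0f}/month")

# At roughly 1 bad result in 7, this many clips a month would need a human look.
print(f"clips needing review: {clips / 7:,.0f}")
```

At that assumed volume the three priced APIs land between roughly $170 and $400 a month, and around 3,150 clips a month would need manual review. The review load hurts the unit economics more than the API bill does.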

Nope.

What I Actually Shipped

Hybrid approach:

Primary visual model: Claude 3.5 Sonnet (initial description generation)
Fallback/cost-saver: Gemini 1.5 Pro (bulk processing, or when Claude's latency spikes — which happens more than Anthropic's status page admits)
Audio feature: Gemini 1.5 Pro exclusively
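
A primary/fallback setup like this could be wired roughly as below. It is a minimal sketch that assumes each provider is wrapped in a plain Python function taking image bytes and returning a description. `primary` and `fallback` are hypothetical callables, not real SDK calls, and the 8-second timeout is an arbitrary placeholder.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool so a slow primary call can be abandoned after a timeout.
_pool = ThreadPoolExecutor(max_workers=4)

def describe_with_fallback(image_bytes, primary, fallback, timeout_s=8.0):
    """Try `primary`; if it raises or takes longer than timeout_s, use `fallback`.
    Returns (description, source) where source is "primary" or "fallback"."""
    future = _pool.submit(primary, image_bytes)
    try:
        return future.result(timeout=timeout_s), "primary"
    except FutureTimeout:
        future.cancel()   # best effort; a call that is already running is not interrupted
    except Exception:
        pass              # primary failed outright (for example, an outage)
    return fallback(image_bytes), "fallback"
```

Returning the source alongside the description makes it easy to log how often the fallback fires, which is the number to watch when the primary's latency spikes.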

My monthly API spend jumped from $420 to $580. But churn dropped from 3.8% to 2.1% in two weeks. That's roughly $1,200/month in retained revenue against a $160 cost increase.

I'll take that trade. Almost every day.

Claude had a weird outage last Wednesday that cost me three hours of scrambling. But still.

What I'd Do Differently

Start with hallucination tests. I spent days on latency benchmarks when accuracy in edge cases was the real killer. One bad description erodes trust faster than ten good ones build it. Learned that lesson when a merchant got "silk blend" for a 100% cotton sweater. She was not happy.

Factor in support costs earlier. My initial comparison was purely API pricing. The real cost includes time spent apologizing to customers. That completely changes the math.

Test Qwen sooner. It lost on quality, but for simple use cases (solid colors, standard angles), it's perfectly adequate at about a third of GPT-4o's per-image cost and about a quarter of Claude's. I now use it for the free tier, reserving Claude for paying customers. Tiered model quality is an underrated pricing strategy. Shoutout to @levelsio for that, though I'm probably butchering his original point.

Don't believe the benchmarks. Every provider publishes impressive accuracy numbers. Real-world performance with quirky merchant photos — bad lighting, messy backgrounds, cats walking through shots — is completely different. My test set had exactly one cat photobomb. Claude handled it beautifully. Gemini called it "faux fur trim."

I mean... technically not wrong?

I'm curious: have any of you tested these APIs for your own use cases? Particularly interested if anyone's found a reliable video understanding model that doesn't hallucinate constantly. Drop your experiences in the comments — I read every single one. I'll share my full testing dataset with anyone actively building in this space. Just don't ask me to clean the cat photos out of it.

Building in public at myproduct.com — currently at $10,240 MRR with 340 customers and 2.1% churn. Hit $8k MRR in December after adding bulk generation. Next goal: $15k by March, which feels slightly insane but we'll see.

#buildinpublic #multimodalai #saas #bootstrapping #indiehackers

Cael Lee
