How I Slashed My AI Video Processing Costs 91% While Making Users Happier

Last month, I almost killed my own product.

My AWS bill hit $4,200 for a SaaS doing $8,100 MRR. On my heaviest users, I was literally paying people to use my app. The math was so bad that I'd lie awake at night doing the calculations in my head, hoping I'd somehow made a mistake.

Spoiler: I hadn't.

The culprit? Processing 10,000+ video minutes daily through GPT-4V and Gemini Pro Vision. Every frame, every second, burning tokens like there was no tomorrow. I'd built a money-incineration machine and called it a startup.

Here's the wild part: I fixed it in two weeks. Not by switching to cheaper models. Not by raising prices. But by finding what I now call the "golden ratio" of video analysis—exactly how many frames you actually need, and how little you can describe each one.

Product: ClipSense AI

Current MRR: $10,243

Cost before: $0.47/minute

Cost after: $0.042/minute (for the tier 70% of users choose)

The Problem Nobody Talks About

When I launched ClipSense in January 2024, I priced it at $29/month for 100 video minutes. My cost per minute was roughly $0.18. Margins looked fine on paper. I felt clever.

Then users started uploading hour-long webinar recordings. Security footage. 4K product demos. My "average 5-minute video" assumption went out the window—along with my margins.

By March, my per-minute cost had ballooned to $0.47. Users were processing 3x more video than I'd modeled. I was losing $0.18 on every single minute processed. That's not a business—that's a charity with extra steps.

Pieter Levels once said "charge more" is always the answer. And I tried. I tested $49/month. Conversion dropped 40%. The market wouldn't bear it. My users—mostly indie creators and small marketing teams—simply couldn't justify that price point.

I had to fix the cost side. And fast.

The Two Levers: Frames × Description Length

After two sleepless weeks of A/B testing (shoutout to @marc_louvion for the spreadsheet template that saved my sanity), I realized video analysis cost boils down to something stupidly simple:

Total Cost = (Frames extracted per minute) × (Tokens per frame description) × (Cost per token)

Most developers obsess over the third term, cost per token. "Just use Gemini Flash!" or "Wait for GPT-4o-mini!" But that's a 30-50% reduction at best. I needed 70%+. I needed a fundamental rethink.

The real leverage was in the first two terms, frames per minute and tokens per frame, and weirdly, nobody was talking about the tradeoff curve between them. Everyone was so focused on model selection that they'd forgotten basic arithmetic.

Here's what I mean: if you can cut frames by 80% AND shorten descriptions by 60%, you're looking at a 92% cost reduction. No model swap needed.
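The arithmetic is easy to check yourself. Here's a minimal sketch of the formula; the function names are mine, and the per-token price is a made-up placeholder that cancels out of the ratio anyway:

```python
def cost_per_minute(frames_per_min: float, tokens_per_frame: float,
                    cost_per_token: float) -> float:
    """Total cost = frames per minute x tokens per frame x cost per token."""
    return frames_per_min * tokens_per_frame * cost_per_token


def combined_reduction(frame_cut: float, token_cut: float) -> float:
    """Overall saving when both levers are pulled; the cuts multiply, not add."""
    return 1 - (1 - frame_cut) * (1 - token_cut)


# Placeholder price, only here to show that the ratio does not depend on it.
HYPOTHETICAL_COST_PER_TOKEN = 0.00001

before = cost_per_minute(60, 200, HYPOTHETICAL_COST_PER_TOKEN)
after = cost_per_minute(60 * 0.2, 200 * 0.4, HYPOTHETICAL_COST_PER_TOKEN)

print(f"combined reduction: {combined_reduction(0.80, 0.60):.0%}")
print(f"after / before:     {after / before:.2f}")
```

Cutting 80% of frames leaves 20%, cutting 60% of tokens leaves 40%, and 0.20 × 0.40 = 0.08 of the original bill.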

Experiment 1: How Many Frames Do You Actually Need?

I took 500 videos from my user base and ran them through different frame extraction rates. The results surprised me:

Frames/min	Avg Accuracy	Cost/min
60 (1 per sec)	94.2%	$0.47
30 (1 per 2 sec)	93.8%	$0.24
10 (1 per 6 sec)	91.1%	$0.09
5 (1 per 12 sec)	84.3%	$0.05
1 (1 per min)	71.6%	$0.01

The cliff was at 10 frames/minute. Below that, accuracy tanked. Above that, I was paying for diminishing returns that made my accountant weep.
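One way to see the cliff is to ask what each extra accuracy point costs as you climb the table. This is a rough sketch using the rows from 5 to 60 frames/minute; the "dollars per accuracy point" framing is my own shorthand, not a standard metric:

```python
# (frames per minute, average accuracy in %, cost per minute in USD)
ROWS = [
    (5, 84.3, 0.05),
    (10, 91.1, 0.09),
    (30, 93.8, 0.24),
    (60, 94.2, 0.47),
]


def marginal_costs(rows):
    """Extra cost per extra accuracy point between neighbouring sampling rates."""
    steps = []
    for (f0, a0, c0), (f1, a1, c1) in zip(rows, rows[1:]):
        steps.append((f0, f1, (c1 - c0) / (a1 - a0)))
    return steps


for low, high, dollars in marginal_costs(ROWS):
    print(f"{low:>2} -> {high:>2} frames/min: ${dollars:.3f} per accuracy point")
```

Going from 5 to 10 frames costs well under a cent per point. Going from 30 to 60 costs more than half a dollar per point. That's the diminishing-returns zone.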

But here's where it gets interesting: accuracy wasn't just about frame count. It was about which frames. A uniform 10 frames/minute extraction is basically dumb sampling: you might grab 3 identical frames from a static shot while missing the one frame where something actually happened.

Actually, wait—I should clarify that "accuracy" here is kind of a weird metric. I measured it by having the system tag 50 known attributes per video and checking against human labels. It's not perfect. But it's directionally useful, I think. At the very least, it gave me something to optimize against that wasn't just vibes.
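If you want to build the same kind of metric, it's roughly this. Treat it as a sketch: the attribute names are invented for the example, and I'm showing 4 attributes instead of 50 to keep it short:

```python
def attribute_accuracy(predicted: dict[str, str], human: dict[str, str]) -> float:
    """Share of human-labelled attributes that the system tagged identically."""
    if not human:
        raise ValueError("need at least one human-labelled attribute")
    correct = sum(1 for name, label in human.items() if predicted.get(name) == label)
    return correct / len(human)


def average_accuracy(videos: list[tuple[dict[str, str], dict[str, str]]]) -> float:
    """Mean per-video accuracy over (predicted, human) pairs."""
    return sum(attribute_accuracy(p, h) for p, h in videos) / len(videos)


# Hypothetical attributes for one video.
human_labels = {"setting": "indoor", "speaker_count": "1",
                "on_screen_text": "yes", "camera": "handheld"}
system_tags = {"setting": "indoor", "speaker_count": "1",
               "on_screen_text": "yes", "camera": "static"}

print(f"{attribute_accuracy(system_tags, human_labels):.0%}")
```

A missing tag counts as wrong here, which is a choice. You could also score missing and incorrect tags separately.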

Experiment 2: Smart Keyframe Selection

Instead of uniform sampling, I built a simple scene detection algorithm. Nothing fancy—just some ffmpeg wizardry (thanks @levelsio for the tip) and basic histogram comparison:


# Not the full code, but the logic:
# 1. Extract all I-frames (natural keyframes in encoding)
# 2. Calculate histogram difference between consecutive I-frames 
# 3. Only keep frames where difference > threshold
# 4. If too few frames, lower threshold; if too many, raise it

This gave me 8-12 frames per minute on average, but concentrated around actual content changes. A talking head video might only need 3 frames/minute. A sports highlight reel might need 25. The algorithm adapts based on what's actually in the video.
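For anyone who wants to try the same idea, here's a minimal pure-Python sketch of steps 2 through 4. It is not the ClipSense code. It assumes you've already extracted the I-frames and turned each one into a normalized color histogram (a list of floats summing to 1), and the default threshold, step size, and function names are all placeholders I picked for the example:

```python
from typing import Sequence

Histogram = Sequence[float]


def histogram_distance(a: Histogram, b: Histogram) -> float:
    """Half the L1 distance between normalized histograms: 0 = same, 1 = disjoint."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def select_keyframes(histograms: Sequence[Histogram], min_frames: int,
                     max_frames: int, threshold: float = 0.3,
                     step: float = 0.05, max_rounds: int = 20) -> list[int]:
    """Return indices of frames that differ enough from the frame before them.

    The first frame is always kept. The threshold is nudged down when too few
    frames survive and up when too many do, for at most max_rounds attempts.
    """
    if not histograms:
        return []
    diffs = [histogram_distance(p, c) for p, c in zip(histograms, histograms[1:])]
    kept = [0]
    for _ in range(max_rounds):
        kept = [0] + [i + 1 for i, d in enumerate(diffs) if d > threshold]
        if len(kept) < min_frames and threshold > 0.0:
            threshold = max(0.0, threshold - step)
        elif len(kept) > max_frames and threshold < 1.0:
            threshold = min(1.0, threshold + step)
        else:
            break
    return kept


# Toy two-bin histograms: a static shot, a hard cut, then a partial change.
frames = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0.5, 0.5]]
print(select_keyframes(frames, min_frames=2, max_frames=4))
```

On the toy input, the three identical frames at the start collapse into one, and the two content changes each get a frame. That's the behavior you want from a talking-head video versus a highlight reel.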

Cost dropped to $0.07/min. Accuracy held at 92.8%.

Game changer.

Well... it's complicated. Going from $0.47 to $0.07 is an 85% cost reduction, but only compared to the 60 frames/minute baseline. Compared to the 10 frames/minute uniform sampling I'd already been testing, it was more like 22%. Still meaningful. But I should be honest about the numbers. Startup Twitter has enough exaggerated metrics floating around.

Experiment 3: Description Compression (The Scary Part)

Now I had fewer frames, but each one was still generating a 200-token description. You know the kind: "A man in a blue suit standing behind a wooden podium with a microphone, indoor lighting, brick wall background, slight shadow on left side..."

Do you need all that? For most use cases, no. My users weren't writing novels—they were trying to find "the part where someone mentions Q4 revenue" in a 45-minute all-hands meeting.

I tested four description formats:

Full natural language (200 tokens avg) - $0.07/min
Structured JSON (120 tokens) - $0.042/min
Keyword extraction (40 tokens) - $0.014/min
Embedding-only (no text) - $0.003/min
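The first three prices in that list are just the $0.07 baseline scaled by token count. Here's a quick sketch of that scaling; it assumes cost is linear in description tokens, which matches the listed numbers but ignores the embedding-only option because it produces no text at all:

```python
BASELINE_TOKENS = 200   # full natural-language description
BASELINE_COST = 0.07    # USD per video minute at 200 tokens per frame

FORMATS = {
    "full natural language": 200,
    "structured JSON": 120,
    "keyword extraction": 40,
}


def format_cost(tokens_per_frame: float) -> float:
    """Per-minute cost if spend scales linearly with description length."""
    return BASELINE_COST * tokens_per_frame / BASELINE_TOKENS


for name, tokens in FORMATS.items():
    print(f"{name:<22} {tokens:>3} tokens  ${format_cost(tokens):.3f}/min")
```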

The JSON format was the sweet spot. It forced the model to be concise while preserving searchability. Here's what it looks like in practice:


{
 "scene": "conference_presentation",
 "objects": ["person", "podium", "screen"],
 "action": "speaking",
 "text_visible": "Q4 Revenue Growth",
 "sentiment": "neutral_professional"
}
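Why does this stay searchable? Because every field is a short, predictable string. Here's a rough sketch of a "find the part where..." lookup over records like the one above. The timestamps and the `find_frames` helper are hypothetical, and a real search would probably add embeddings or fuzzy matching on top:

```python
import json


def find_frames(descriptions: list[tuple[float, str]], query: str) -> list[float]:
    """Return timestamps (seconds) of frames whose JSON description mentions query."""
    needle = query.lower()
    hits = []
    for timestamp, raw in descriptions:
        record = json.loads(raw)
        haystack = " ".join([
            record.get("scene", ""),
            record.get("action", ""),
            record.get("text_visible", ""),
            *record.get("objects", []),
        ]).lower()
        if needle in haystack:
            hits.append(timestamp)
    return hits


# Hypothetical timestamps; the second record is the example from this post.
descriptions = [
    (12.0, json.dumps({"scene": "office_intro", "objects": ["person", "desk"],
                       "action": "walking", "text_visible": "",
                       "sentiment": "neutral_professional"})),
    (1830.5, json.dumps({"scene": "conference_presentation",
                         "objects": ["person", "podium", "screen"],
                         "action": "speaking",
                         "text_visible": "Q4 Revenue Growth",
                         "sentiment": "neutral_professional"})),
]

print(find_frames(descriptions, "q4 revenue"))
```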

Combined with smart keyframing, my cost dropped to $0.042/minute. Total reduction: 91% from the original $0.47.

But accuracy fell to 87.3%. Users noticed. Churn ticked up from 3.2% to 4.1%.

I'd cut too deep. That churn spike scared the hell out of me. I remember staring at my Stripe dashboard at 2am, watching three cancellation emails come in within an hour. One user wrote "the summaries feel generic now."

Ouch.

Finding the Golden Ratio

Here's what I discovered after 47 iterations (yes, I counted—I'm that person now):

The optimal point isn't fixed. It's use-case dependent. A lawyer reviewing deposition footage needs way more detail than a social media manager skimming UGC. One size fits none.

For my users, three tiers naturally emerged:

Tier	Frames/min	Description	Cost/min	Accuracy
Quick Scan	5 (smart)	Keywords	$0.012	84%
Standard	10 (smart)	JSON	$0.042	91%
Deep Analysis	20 (smart)	Full + JSON	$0.11	95%

I launched tiered pricing in April based on these tiers:

Basic ($19/mo): Quick Scan, 200 min
Pro ($39/mo): Standard, 500 min
Enterprise ($99/mo): Deep Analysis, 1,000 min

Funny thing: I almost didn't launch the Basic tier. Thought it would cannibalize Pro. Instead, 40% of Basic users upgraded within 60 days. They needed to see the value first before committing. Classic foot-in-the-door psychology that I should've seen coming.

The Results (Real Numbers)

April 2024 (before):

MRR: $8,100
AI costs: $4,200 (51.9% of revenue)
Gross margin: 48.1%
Churn: 3.2%
My stress level: through the roof

June 2024 (after changes):

MRR: $10,243
AI costs: $1,870 (18.3% of revenue)
Gross margin: 81.7%
Churn: 2.8%
My stress level: manageable, finally
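The margin percentages above fall straight out of the MRR and AI cost lines. Here's a tiny sketch, assuming (as this post does) that AI spend is the only cost of goods sold:

```python
def gross_margin(mrr: float, ai_costs: float) -> float:
    """Gross margin as a fraction of revenue, counting only AI costs as COGS."""
    return 1 - ai_costs / mrr


print(f"April 2024: {gross_margin(8_100, 4_200):.1%}")
print(f"June 2024:  {gross_margin(10_243, 1_870):.1%}")
```

If you also pay for storage, bandwidth, or payment fees, subtract those too and the number will be lower.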

Revenue went up because the Basic tier captured price-sensitive users who'd previously churned. Costs plummeted. And churn actually improved because the tiered system let users self-select into the accuracy they needed.

That last part still surprises me. I assumed cheaper tiers = worse experience = more churn. But giving users control over the quality/cost tradeoff actually increased satisfaction. People hate overpaying more than they hate imperfect AI.

What I'd Do Differently

1. Build cost monitoring from day one. I didn't add per-user cost tracking until month three. That's two months of flying blind. I literally had a user processing 47 hours of dashcam footage who cost me $127 in a single day. Had no idea until the AWS bill came. Don't be me.

2. Test description compression before launch. I optimized for "wow, look how detailed the AI is!" when users actually wanted "just tell me what's in the damn video." Classic founder mistake—building for the demo, not the daily use case. Your users are busy. They want answers, not poetry.

3. Don't copy enterprise pricing models. I initially looked at how Google and AWS price video AI ($0.10-$0.50/min). But indie hackers can't compete on enterprise features. We compete on good enough, way cheaper. Took me way too long to internalize that. We're not selling to Fortune 500 procurement departments—we're selling to people who feel AWS bills in their personal bank accounts.

4. The golden ratio is a spectrum, not a point. I wasted two weeks trying to find one perfect setting. The tiered approach solved both the cost and accuracy problem simultaneously. Probably should've seen that coming, since my users were never one homogeneous group. Lesson: if you're optimizing for "the best" parameter, you're probably asking the wrong question.
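On point 1: per-user cost tracking doesn't need to be fancy to catch a dashcam-footage surprise. Here's a minimal in-memory sketch. The class name, the $20 alert threshold, and the user ID are all made up for illustration, and in production you'd back this with a database and an actual notification:

```python
from collections import defaultdict
from datetime import date


class CostTracker:
    """Accumulates AI spend per user per day and flags users over a daily limit."""

    def __init__(self, daily_alert_usd: float) -> None:
        self.daily_alert_usd = daily_alert_usd
        self._spend: dict[tuple[str, date], float] = defaultdict(float)

    def record(self, user_id: str, day: date, minutes: float,
               cost_per_minute: float) -> bool:
        """Add one processing job; return True once the user's day hits the limit."""
        key = (user_id, day)
        self._spend[key] += minutes * cost_per_minute
        return self._spend[key] >= self.daily_alert_usd

    def spend(self, user_id: str, day: date) -> float:
        """Total spend for a user on a given day (0.0 if nothing recorded)."""
        return self._spend.get((user_id, day), 0.0)


tracker = CostTracker(daily_alert_usd=20.0)   # hypothetical threshold
today = date(2024, 3, 1)                      # hypothetical date

first = tracker.record("user_123", today, minutes=30, cost_per_minute=0.47)
second = tracker.record("user_123", today, minutes=30, cost_per_minute=0.47)
print(first, second, round(tracker.spend("user_123", today), 2))
```

The point is to find out the same day, not when the bill arrives.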

The Bigger Lesson

Every AI SaaS I see on Indie Hackers is racing to add more features, more models, more "intelligence." But the winners in the next 12 months won't be the ones with the best AI—they'll be the ones who figured out how to deliver just enough AI at a sustainable margin.

@dagorenouf said it best in his last post: "The AI gold rush is over. Now it's about who can run the most efficient mine."

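My costs are now $0.042/minute for the tier that 70% of users choose. At $39/month for 500 minutes, that's at most $21 in AI costs, and only if a user burns through every included minute. Even that worst case leaves a 46% gross margin on the plan, and across all plans at real usage my blended gross margin is 81.7%, on a product people actually use. That's a real business. Not a demo.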
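It's worth stress-testing a plan against the subscriber who uses every included minute. Here's a quick sketch for the $39 plan; the helper name is mine, and full utilization is a deliberately pessimistic assumption rather than what users actually do:

```python
def worst_case_margin(price: float, included_minutes: float,
                      cost_per_minute: float) -> tuple[float, float]:
    """AI cost and gross margin if a subscriber uses the whole allowance."""
    max_cost = included_minutes * cost_per_minute
    return max_cost, (price - max_cost) / price


cost, margin = worst_case_margin(price=39, included_minutes=500,
                                 cost_per_minute=0.042)
print(f"max AI cost: ${cost:.2f}, worst-case margin: {margin:.0%}")
```

At full usage the plan costs $21 to serve and keeps about 46% of the price. Anyone using less than their allowance pushes the real margin above that floor.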

I think about this a lot now—how many AI wrappers are burning VC cash subsidizing inference costs, and what happens when that music stops. Probably nothing good. But for bootstrappers? This is actually a huge advantage. We have to be efficient. It's not optional. And that constraint might just save us.

TL;DR

AI video analysis costs come down to three levers: frames extracted, description length, and model cost
Most people only optimize model cost—that's leaving 60-70% savings on the table
Smart keyframe selection (scene detection) cut my frames by 80% without hurting accuracy
Structured JSON descriptions instead of natural language saved another 40%
Tiered pricing lets users choose their accuracy/cost tradeoff—and they actually love it
Result: 91% cost reduction, MRR up 26%, churn down, margins up
The future of AI SaaS isn't "better AI"—it's "efficient enough AI"

What's your AI cost per user? I'm collecting benchmarks for a follow-up post. Drop your numbers in the comments—I'll share the anonymized dataset with everyone who contributes. No gatekeeping here.

Follow my build-in-public journey: ClipSense AI

#buildinpublic #ai #saas #bootstrapping #videotech #costoptimization #indiehacker



Cael Lee
