I Processed 1.2M Records with OpenAI's Batch API for $3.70 — and I'm Never Going Back

Last month, I ran 1.2 million records through OpenAI's Batch API.

When the bill came, I nearly spat coffee all over my monitor. Three dollars and seventy cents. Not a typo. $3.70.

Three months earlier, the same workload with the real-time API? Ninety bucks. That's not a small difference — it's a 24x multiplier. I stared at that number for a solid ten seconds, then started cursing myself. What the hell was I doing?

So today, let me tell you about Batch API. Not the sanitized version from the docs, but the real stuff I've learned from months of production use — the scenarios where it absolutely shines, and the pitfalls that made me want to delete everything and reconsider my life choices.

What is this thing, actually?

OpenAI rolled out Batch API in April 2024. The concept is dead simple: you pack a bunch of requests into a JSONL file, toss it over, and OpenAI promises to process everything within 24 hours. Results get written back to you. Because they can run these tasks when GPU clusters have spare capacity, you get a 50% discount.

That applies across the models I use: GPT-4o, GPT-4o-mini, GPT-4-turbo, all half price.

Take GPT-4o. Real-time API costs $2.50 per million input tokens and $10.00 per million output tokens. Batch slashes that to $1.25 and $5.00 respectively. Looks amazing, right?

Here's what the docs mention but don't emphasize enough: rate limits are tied to your usage tier. I was Tier 2 at the time, naively thinking I could submit whatever I wanted. My first batch of 50,000 requests sat in the queue for 18 hours. EIGHTEEN. I submitted in the morning and it finished after midnight.

Wait, let me correct myself — it's not that you can't submit. You absolutely can. Your batch just sits there... waiting... while you refresh the status page like an idiot. That distinction matters.

When should you actually use it?

I started with a dead-simple rule: if you don't need results within 30 seconds, use Batch. Pitfall 3 below pushed me to a stricter version: if it can wait until tomorrow, use Batch.

Here are three real scenarios from my work.

Large-Scale Text Classification

We had a project with 800,000 user feedback entries needing sentiment classification and intent recognition. With the real-time API, I was dealing with rate limits, writing retry logic, implementing checkpointing — the boilerplate alone was 200+ lines. And when things broke at 2 AM, my wife genuinely asked if I was having an affair.

Switching to Batch API? Cut the code to 60 lines. Cost dropped from an estimated $400 to $180. And I finally got a full night's sleep.

Here's what the submission code roughly looks like:


import json

# feedback_list is assumed to be a list of feedback strings loaded elsewhere.
tasks = []
for idx, feedback in enumerate(feedback_list):
    # One request per feedback entry; custom_id ties the result back to it.
    task = {
        "custom_id": f"task-{idx}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify sentiment as: positive, negative, neutral"},
                {"role": "user", "content": feedback},
            ],
            "max_tokens": 50,
        },
    }
    tasks.append(task)

# Write one JSON object per line (JSONL), keeping non-ASCII text readable.
with open("batch_input.jsonl", "w", encoding="utf-8") as f:
    for task in tasks:
        f.write(json.dumps(task, ensure_ascii=False) + "\n")

Notice that custom_id field. It's your only link between requests and responses — don't half-ass it. I use task-{index} or {project}-{timestamp}-{sequence} so when something breaks, I can trace back to the exact data point immediately.
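
A minimal sketch of the {project}-{timestamp}-{sequence} scheme might look like this. The helper names make_custom_id and parse_custom_id, the timestamp layout, and the six-digit padding are all illustrative assumptions; the only real requirement is that each custom_id is unique within a batch file.

```python
from datetime import datetime, timezone

def make_custom_id(project, sequence, when=None):
    """Build an ID like 'feedback-20240501T120000Z-000042' (hypothetical layout)."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y%m%dT%H%M%SZ")  # contains no hyphens
    return f"{project}-{stamp}-{sequence:06d}"

def parse_custom_id(custom_id):
    """Split an ID produced by make_custom_id back into (project, stamp, sequence)."""
    # rsplit from the right so project names containing hyphens still work.
    project, stamp, sequence = custom_id.rsplit("-", 2)
    return project, stamp, int(sequence)
```

Passing one fixed timestamp for the whole run keeps every ID in a file on the same prefix, which makes a failed record easy to find with a plain text search.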

Data Cleaning and Structured Extraction

This is where Batch API really flexes.

We extracted 500,000 contract clauses from PDFs — parties, amounts, dates. Dumped everything into Batch API, went to sleep, and reviewed results over coffee the next morning. Since a single batch file handles up to 50,000 requests, I split it across 10 batches and called it a day.

For extraction tasks, I can't stress this enough: use response_format to enforce JSON output.


{
  "body": {
    "model": "gpt-4o",
    "messages": [...],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "contract_extraction",
        "strict": true,
        "schema": {
          "type": "object",
          "properties": {
            "party_a": {"type": "string"},
            "party_b": {"type": "string"},
            "amount": {"type": "number"},
            "date": {"type": "string"}
          },
          "required": ["party_a", "party_b", "amount", "date"],
          "additionalProperties": false
        }
      }
    }
  }
}

Strict JSON comes back. No regex gymnastics. Before I added this parameter, my parsing code was nearly 100 lines handling edge cases. After? Four lines. Four.
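
For illustration, parsing one such response can be about this short. This is a sketch that assumes the four-field schema above; parse_contract is a hypothetical helper name, the example payload is made up, and the key check is a belt-and-braces step rather than something the API requires.

```python
import json

REQUIRED_FIELDS = ("party_a", "party_b", "amount", "date")

def parse_contract(content):
    """Parse the message content of one response into a dict with the four fields."""
    record = json.loads(content)
    missing = [key for key in REQUIRED_FIELDS if key not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record

# Made-up payload in the shape the schema describes.
example = '{"party_a": "Acme Ltd", "party_b": "Globex Inc", "amount": 12500.0, "date": "2024-03-01"}'
print(parse_contract(example)["party_a"])
```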

Model Evaluation and Comparison

This use case doesn't get enough attention.

When I was doing model selection late last year — right after the November 2024 GPT-4o updates — I needed to compare GPT-4o, GPT-4o-mini, and GPT-4-turbo across 1,000 test cases. Submitted three batches simultaneously, each model running the same prompts, and calculated accuracy the next day.

The costs surprised me:

GPT-4o Batch: $6.20
GPT-4o-mini Batch: $0.45
GPT-4-turbo Batch: $8.10

GPT-4o-mini was only 2.3 percentage points behind GPT-4o on our classification task. Two point three. But nearly 14x cheaper ($0.45 vs. $6.20).
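
The accuracy comparison itself is simple once each model's answers are keyed by custom_id. A rough sketch, assuming exact-match scoring on a label string; the function name and the toy data are invented purely for illustration and are not the actual test set.

```python
def accuracy(predictions, labels):
    """Fraction of labelled test cases whose prediction matches the label."""
    if not labels:
        raise ValueError("labels is empty")
    correct = sum(
        1
        for cid, expected in labels.items()
        # A missing prediction counts as wrong; compare case-insensitively.
        if predictions.get(cid, "").strip().lower() == expected.lower()
    )
    return correct / len(labels)

# Toy data: custom_id -> label, and one prediction dict per model.
labels = {"task-0": "positive", "task-1": "negative", "task-2": "neutral", "task-3": "positive"}
runs = {
    "model-a": {"task-0": "positive", "task-1": "negative", "task-2": "neutral", "task-3": "positive"},
    "model-b": {"task-0": "Positive", "task-1": "negative", "task-2": "positive"},
}
for name, preds in runs.items():
    print(f"{name}: {accuracy(preds, labels):.2%}")
```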

I threw that data in front of my manager. He stared at it for three seconds. Said "switch to mini." That was it. The savings covered team dinner for a month — nothing fancy, just the Thai place downstairs, but hey, free food.

The stuff that made me want to scream

Success stories aside, let's talk about the pitfalls that had me working late.

Pitfall 1: Rate limits are trickier than you think

OpenAI's limits have multiple dimensions. Batch API is independent from the real-time API, but it has its own ceilings.

When I was Tier 2, I thought I could submit unlimited batches. First 50k went through fine. I'm sitting there feeling smug. Second batch? Immediate error. After digging through docs — which took way too long — I discovered there's a limit on concurrent processing batches. Tier 2 caps at something around 10, if I remember correctly. I'm Tier 3 now, but that caught me off guard.

The fix: a queue management script.


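import time
from openai import OpenAI

client = OpenAI()

# Statuses after which a batch will make no further progress.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def submit_batches_with_retry(file_paths, max_concurrent=5, poll_seconds=30):
    active_batches = []
    completed_batches = []

    def reap_finished():
        # Move batches that reached a terminal status out of the active list.
        for batch in active_batches[:]:
            status = client.batches.retrieve(batch.id)
            if status.status in TERMINAL_STATUSES:
                active_batches.remove(batch)
                completed_batches.append(status)

    for file_path in file_paths:
        # Wait for a free slot before submitting the next file.
        while len(active_batches) >= max_concurrent:
            reap_finished()
            if len(active_batches) >= max_concurrent:
                time.sleep(poll_seconds)

        # Upload the JSONL file; the context manager closes the handle.
        with open(file_path, "rb") as f:
            batch_input_file = client.files.create(file=f, purpose="batch")

        batch = client.batches.create(
            input_file_id=batch_input_file.id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
        )
        active_batches.append(batch)
        print(f"Submitted batch {batch.id} for {file_path}")

    # Keep polling until every submitted batch has finished.
    while active_batches:
        reap_finished()
        if active_batches:
            time.sleep(poll_seconds)

    return completed_batches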

It's a basic throttle-and-poll loop, but it works. Haven't hit a limit since.

Pitfall 2: JSONL formatting — one mistake and everything burns

This one's nasty.

JSONL requires exactly one complete JSON object per line. No empty lines, no formatting errors. I once left an extra blank line at the end of a file (that's it, just one stray newline) and the entire batch failed at validation.

Worse, when one record in the middle had bad JSON, the error I got back didn't make it obvious which line was at fault. You get a vague message and get to play detective.

I eventually wrote a validation script:


import json

def validate_jsonl(file_path):
    errors = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                errors.append(f"Line {line_num}: Empty line")
                continue
            try:
                data = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"Line {line_num}: JSON decode error - {e}")
                continue
            # Each line must be a JSON object, not a bare list or scalar.
            if not isinstance(data, dict):
                errors.append(f"Line {line_num}: Not a JSON object")
                continue
            if "custom_id" not in data:
                errors.append(f"Line {line_num}: Missing custom_id")
            if "body" not in data:
                errors.append(f"Line {line_num}: Missing body")

    if errors:
        # Print only the first 10 so a badly broken file doesn't flood the terminal.
        print(f"Found {len(errors)} errors:")
        for error in errors[:10]:
            print(f"  {error}")
        return False
    return True

I run this before every upload now. Saved me countless retries. This should honestly be built into the platform, but since it's not, build your own.

Pitfall 3: 24 hours is a wish, not a promise

The docs say "within 24 hours." I believed them.

Then GPT-4o launched, and my tasks queued for 30+ hours. Everyone was testing the new model, the queue was slammed. Another time, I had a 180MB file that took significantly longer to process — file size matters more than you'd expect.

It's... complicated. Queue times depend on model popularity, file size, current demand. I now have a hard rule: if I need results in under 4 hours, I use the real-time API with concurrency control. If it can wait until tomorrow, Batch it is.

Don't gamble on this. I did. Lost.

Pitfall 4: Results are NOT in order

This was entirely my fault for skimming the docs.

The output file is JSONL with custom_id, response, and error fields. But the order? Not necessarily matching your input. I assumed sequential ordering, built index-based matching logic, and everything was... wrong. Completely wrong.

The correct approach: build a dictionary keyed by custom_id.


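import json

def parse_batch_results(output_file_path):
    results = {}
    with open(output_file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # tolerate a stray blank line
            data = json.loads(line)
            custom_id = data["custom_id"]
            if data.get("error"):
                results[custom_id] = {"error": data["error"]}
                continue
            response = data["response"]
            response_body = response["body"]
            if response.get("status_code", 200) != 200:
                # Non-200 responses carry an error payload instead of choices.
                results[custom_id] = {"error": response_body.get("error", response_body)}
                continue
            content = response_body["choices"][0]["message"]["content"]
            results[custom_id] = {"content": content, "usage": response_body.get("usage", {})}
    return results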

Now I build a lookup map first thing. Lesson learned.
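
Getting the output onto disk is a separate step. Here is a minimal sketch, assuming client is an openai.OpenAI instance and the batch has already reached a terminal status. The function name and file paths are illustrative. Failed requests are reported in a separate error file, so the sketch fetches both when they exist.

```python
def download_batch_files(client, batch_id, output_path, error_path):
    """Save a finished batch's output and error files; return the paths written."""
    batch = client.batches.retrieve(batch_id)
    written = []
    # Either ID can be None, for example when no request failed.
    for file_id, path in ((batch.output_file_id, output_path),
                          (batch.error_file_id, error_path)):
        if file_id:
            content = client.files.content(file_id)  # binary response wrapper
            with open(path, "wb") as f:
                f.write(content.read())
            written.append(path)
    return written
```

Both files use the same JSONL layout, so the same custom_id lookup can be applied to each of them.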

A real case study: 1M customer service conversations

Here's an actual project. One million customer service transcripts — extract key info and sentiment labels.

Option A: Real-time API + GPT-4o

Total tokens: 850M input + 120M output
Estimated cost: $2,125 + $1,200 = $3,325
Custom concurrency control needed, estimated 3-4 days runtime
Codebase full of retry logic and checkpointing — painful to maintain

Option B: Batch API + GPT-4o

Actual cost: $1,062.50 + $600 = $1,662.50
20 batches, everything done in 30 hours
Maybe 200 lines of clean code

Option C: Batch API + GPT-4o-mini

Actual cost: $127.50 + $72 = $199.50
Accuracy 1.8% lower than GPT-4o
But one-eighth the cost

We went with Option C. That 1.8% accuracy loss is totally acceptable for customer service transcripts — from what I've seen, human annotation error rates hover around 3-5% anyway. 1.8% barely registers.

This decision saved 90% on costs. My manager was thrilled. Actually took me out for a meal — nothing fancy, just that Hunan place downstairs, but it wasn't on my dime.
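
The GPT-4o figures for Options A and B follow directly from the per-million-token prices quoted earlier ($2.50 / $10.00 real-time, $1.25 / $5.00 on Batch). A small sketch of that arithmetic, with a hypothetical helper name:

```python
def cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in dollars, given prices per million tokens."""
    return ((input_tokens / 1_000_000) * input_price_per_m
            + (output_tokens / 1_000_000) * output_price_per_m)

INPUT_TOKENS = 850_000_000
OUTPUT_TOKENS = 120_000_000

realtime = cost_usd(INPUT_TOKENS, OUTPUT_TOKENS, 2.50, 10.00)  # GPT-4o real-time
batch = cost_usd(INPUT_TOKENS, OUTPUT_TOKENS, 1.25, 5.00)      # GPT-4o Batch

print(f"Real-time: ${realtime:,.2f}")
print(f"Batch:     ${batch:,.2f}")
```

Plugging in your own token counts before a big run is a cheap way to sanity-check an estimate against the invoice later.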

My battle-tested workflow

After months of doing this, here's what I've settled on:

File organization:

Keep JSONL files between 10,000-30,000 records each — don't get greedy
Cap file size at 100MB to leave breathing room
Use meaningful custom_id names for traceability
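
One way to enforce both caps when writing files, as a sketch only: write_chunks and the batch_0000.jsonl naming are hypothetical, and the defaults simply mirror the 30,000-record and 100MB guidelines above.

```python
import json
import os

def write_chunks(tasks, out_dir, max_records=30_000, max_bytes=100 * 1024 * 1024):
    """Split request dicts into JSONL files capped by record count and byte size."""
    os.makedirs(out_dir, exist_ok=True)
    paths, handle, count, size = [], None, 0, 0
    for task in tasks:
        line = (json.dumps(task, ensure_ascii=False) + "\n").encode("utf-8")
        # Start a new file when the current one is full or the next line
        # would push it over the size cap.
        too_big = count > 0 and size + len(line) > max_bytes
        if handle is None or count >= max_records or too_big:
            if handle:
                handle.close()
            path = os.path.join(out_dir, f"batch_{len(paths):04d}.jsonl")
            handle, count, size = open(path, "wb"), 0, 0
            paths.append(path)
        handle.write(line)
        count += 1
        size += len(line)
    if handle:
        handle.close()
    return paths
```

Because it takes any iterable, the function can be fed a generator, so the full task list never has to sit in memory at once.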

Error handling:

Validate every file before uploading — no exceptions
Parse results using custom_id as a key, never trust ordering
If your failure rate exceeds 5%, investigate immediately — don't ignore it

Cost control:

Default to GPT-4o-mini — it handles most classification and extraction tasks fine
Set max_tokens aggressively to cap output costs
Review usage data regularly — I check on the 5th of every month

Monitoring:

Log submission time and completion time for every batch
Track average processing duration for future estimates
Separate out failed tasks and analyze patterns
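
Duration tracking can lean on the timestamps the API already returns. A small sketch, assuming batch objects as returned by client.batches.retrieve, which expose created_at and completed_at as Unix timestamps in seconds; the helper names are illustrative.

```python
from statistics import mean

def batch_duration_hours(batch):
    """Hours from creation to completion, or None if the batch has not completed."""
    if batch.completed_at is None:
        return None
    return (batch.completed_at - batch.created_at) / 3600

def average_duration_hours(batches):
    """Mean duration over the completed batches, or None if there are none."""
    durations = [d for d in map(batch_duration_hours, batches) if d is not None]
    return mean(durations) if durations else None
```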

I wrote a little notification script that pings Slack when batches finish:


def notify_batch_completion(batch_id):
    # Uses the `client` created earlier; send_slack_notification is a
    # separate helper that posts the text to Slack.
    batch = client.batches.retrieve(batch_id)
    status = batch.status
    request_counts = batch.request_counts

    # completed_at is a Unix timestamp (seconds).
    message = f"""
    📊 Batch {batch_id} complete
    Status: {status}
    Total requests: {request_counts.total}
    Succeeded: {request_counts.completed}
    Failed: {request_counts.failed}
    Completed at: {batch.completed_at}
    """

    send_slack_notification(message)

No more staring at the status page. Go live your life.
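
The send_slack_notification helper isn't shown above. One plausible way to write it, using a Slack incoming webhook and only the standard library, is sketched below. The SLACK_WEBHOOK_URL environment variable and the build_slack_request helper are assumptions made for this example, not part of the original script.

```python
import json
import os
import urllib.request

def build_slack_request(message, webhook_url):
    """Build the POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_slack_notification(message, webhook_url=None):
    """Send the message and return the HTTP status code."""
    # Hypothetical convention: read the webhook URL from the environment.
    webhook_url = webhook_url or os.environ["SLACK_WEBHOOK_URL"]
    request = build_slack_request(message, webhook_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the request construction separate from the network call makes the payload easy to check without hitting Slack.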

Don't blindly use Batch for everything

I've been singing Batch API's praises, but here's when you shouldn't use it:

Real-time conversations. Users won't wait 24 hours for a response. Non-negotiable.
Small, high-frequency batches. If you're submitting dozens of small batches daily, the management overhead might cost more than you save.
Streaming responses. Batch API doesn't support streaming — no typewriter effects.
Sequential, dependent tasks. If each request depends on the previous one's output, Batch can't handle that workflow.

Don't sacrifice user experience just to save a few bucks. I learned this the hard way — forced Batch onto a feature that needed sub-second responses. Users complained for an entire week.

Key Takeaways

Batch API = 50% cost reduction on the models it supports. The catch is the wait.
Use it for anything that can wait 4-24 hours — classification, extraction, evaluation, data processing at scale.
Validate your JSONL files before uploading, or you'll hate yourself.
Build results lookup by custom_id, never assume sequential ordering.
GPT-4o-mini on Batch is absurdly cheap — $0.45 for 1,000 evaluation runs. Test your assumptions about which model you actually need.

The bottom line

Batch API is seriously underrated.

I still see developers running massive workloads through the real-time API. When I ask why, they either don't know Batch exists or assume it's complicated. Truth is, switching from real-time to Batch typically requires less than 20% code changes — and cuts your bill in half.

If you're processing text at scale, try it this week. Seriously.

What's your approach to optimizing LLM API costs? Hit any Batch API pitfalls I missed? Drop a comment — I read and respond to everything.

#OpenAI #BatchAPI #CostOptimization #GPT4o #DevTools

Cael Lee
