I Almost Burned $4,000 in Tokens Last Week — Lessons from OpenAI's Agents SDK

By Cael Lee | 6 min read

Last Thursday, I watched my Datadog latency graph spike like a heart monitor during a panic attack. Eight seconds. That's how long a customer waited for "Where's my order?" to return a response. Behind the scenes, my shiny new AI agent had called 17 — yes, seventeen — different tools for a single conversation turn.

Seventeen.

I was using OpenAI's Agents SDK (v0.3.7, pulled the Docker image on March 12th), rebuilding our customer service system. The token budget was literally going up in flames. Today I want to share what I learned about designing dynamic tool calling that doesn't implode, and how to actually implement human-in-the-loop workflows that work in production.

What Dynamic Tool Calling Actually Means

Here's the thing about traditional agent design: you hardcode a list of tools, and the agent follows a predefined flow. Works fine for simple stuff. But real conversations don't follow scripts.

When someone says "Where are the shoes I bought last month?" versus "I want to return something," the tool chains are completely different. The first needs order lookup → shipment tracking → maybe a notification service. The second needs order verification → return eligibility check → label generation.

OpenAI's Agents SDK handles this with two key mechanisms:

I first played with this back in December using Swarm (the experimental project that the Agents SDK later replaced as its production-ready successor). My initial reaction was "Oh, it's just fancy function calling." But after a few test runs, I realized the real magic is automatic context propagation: tool results get injected into the conversation history without manually stitching prompts together.

Actually, wait. Let me correct myself.

Saying "without manual stitching" isn't quite accurate. What I mean is you don't have to explicitly manage message history like you do with LangChain. But you absolutely still need to control the format of what your tools return. I learned this the hard way — one of my tools returned a full 3KB API response JSON, and the agent got confused and started calling the same tool repeatedly. It thought it hadn't gotten results yet. Adding response_format constraints to return just {"status": "ok", "summary": "..."} fixed it instantly.
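Here's a minimal sketch of that trimming step in plain Python, with no SDK calls. The helper name `compact_tool_result`, the example field names, and the 200-character cap are my assumptions for illustration; the only part taken from the story above is the status/summary shape.

```python
import json
from typing import Any


def compact_tool_result(
    raw: dict[str, Any], summary_fields: list[str], max_chars: int = 200
) -> str:
    """Reduce a large API payload to a small status/summary JSON string.

    Hypothetical helper: the field names and length cap are illustrative.
    """
    # Keep only the fields the model needs to answer the user.
    parts = [f"{name}={raw[name]}" for name in summary_fields if name in raw]
    summary = "; ".join(parts)[:max_chars]
    # An explicit "empty" status tells the model the call did finish.
    status = "ok" if parts else "empty"
    return json.dumps({"status": status, "summary": summary})
```

The idea is to call something like this at the end of each tool function, so the model always sees the same small shape no matter how noisy the upstream API is.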

Three War Stories from Production

Case 1: The E-Commerce "Tool Explosion"

We hooked up 47 tool functions — order queries, shipment tracking, refunds, coupon validation, the works.

First week in production: average 6.3 tool calls per conversation, P99 latency of 11 seconds. The problem? Our agent was too eager. A customer would say "Let me check my delivery status," and the agent would: query the order → grab the tracking number → call the shipping API → notice no update → trigger a notification service → then fire off a satisfaction survey.

Ridiculous.

The fix: a tool call budget. Maximum 3 calls per turn. If it needs more, it has to output intermediate results and ask the user for confirmation. Latency dropped to 2.1 seconds, and — here's the counterintuitive part — user satisfaction went up. Speed beats completeness. Every time.
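A tool call budget like this can be sketched in a few lines of plain Python. This is not the SDK's API: `ToolBudget`, `ToolBudgetExceeded`, and `run_turn` are hypothetical names, and the tool names in the usage note are made up. The only number taken from our setup is the cap of 3.

```python
from dataclasses import dataclass


class ToolBudgetExceeded(Exception):
    """Raised when a turn tries to exceed its tool call allowance."""


@dataclass
class ToolBudget:
    max_calls: int = 3  # per-turn cap
    used: int = 0

    def spend(self, tool_name: str) -> None:
        """Record one tool call, or refuse it if the budget is spent."""
        if self.used >= self.max_calls:
            raise ToolBudgetExceeded(
                f"budget of {self.max_calls} calls reached before {tool_name!r}"
            )
        self.used += 1

    def reset(self) -> None:
        """Call at the start of each conversation turn."""
        self.used = 0


def run_turn(planned_calls: list[str], budget: ToolBudget) -> tuple[list[str], list[str]]:
    """Split planned calls into (executed, deferred) for one turn.

    Deferred calls are the ones to confirm with the user before continuing.
    """
    budget.reset()
    executed, deferred = [], []
    for name in planned_calls:
        try:
            budget.spend(name)
            executed.append(name)
        except ToolBudgetExceeded:
            deferred.append(name)
    return executed, deferred
```

With the over-eager chain from above, `run_turn(["query_order", "get_tracking", "shipping_api", "notify", "survey"], ToolBudget())` executes the first three calls and defers the notification and the survey.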

Case 2: Medical Triage with Human-in-the-Loop

Another project involved healthcare — 23 knowledge base tools and 5 appointment scheduling APIs. But anything resembling a diagnosis needed human review. We designed a three-tier loop:

  1. Full auto: Drug information lookups, appointment times — agent handles it directly
  2. Semi-auto: Symptom analysis generates a draft, human clicks "approve" before it goes to the patient
  3. Human-only: Anything involving prescriptions or urgent symptoms — agent just collects info and hands off
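The three tiers above map naturally onto a small routing function. The sketch below is illustrative only: the intent labels are hypothetical (a real system would get them from a classifier), and sending unknown intents to the human-only tier is my assumption about a safe default, not something the SDK does for you.

```python
from enum import Enum


class Tier(Enum):
    FULL_AUTO = "full_auto"    # agent answers directly
    SEMI_AUTO = "semi_auto"    # agent drafts, human approves
    HUMAN_ONLY = "human_only"  # agent collects info and hands off


# Hypothetical intent labels, grouped by the tier that handles them.
HUMAN_ONLY_INTENTS = {"prescription", "urgent_symptom"}
SEMI_AUTO_INTENTS = {"symptom_analysis"}
FULL_AUTO_INTENTS = {"drug_info", "appointment_time"}


def route(intent: str) -> Tier:
    """Map an intent to a review tier, checking the strictest tier first."""
    if intent in HUMAN_ONLY_INTENTS:
        return Tier.HUMAN_ONLY
    if intent in SEMI_AUTO_INTENTS:
        return Tier.SEMI_AUTO
    if intent in FULL_AUTO_INTENTS:
        return Tier.FULL_AUTO
    # Unknown intents are never auto-answered.
    return Tier.HUMAN_ONLY
```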

This is where the SDK's handoff mechanism shines. When the agent decides it needs a human, it packages the complete context — all previous tool call results, structured data, everything — and transfers it to a human agent. Not just a "please help this customer" flag. The human doesn't have to re-ask a single question.

Though... there's a gotcha here. The SDK serializes all tool call history by default. If any of your tools returned sensitive data (like a patient's ID number), you need to filter that before handoff. We shipped a version in early February that missed this. Thank god QA caught it in staging — that would've been a P0 incident.
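A filter for this can be sketched without touching the SDK at all. The key names in `SENSITIVE_KEYS` and the shape of the history entries are assumptions for illustration; where exactly you hook the filter into your handoff path depends on your setup.

```python
import copy
from typing import Any

# Hypothetical field names; list whatever your tools can actually return.
SENSITIVE_KEYS = frozenset({"patient_id", "ssn", "date_of_birth"})


def redact(value: Any, sensitive_keys: frozenset[str] = SENSITIVE_KEYS) -> Any:
    """Return a copy with sensitive dict keys masked at any nesting depth."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k in sensitive_keys else redact(v, sensitive_keys)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive_keys) for item in value]
    # Scalars and other leaf values are copied as they are.
    return copy.deepcopy(value)
```

Running `redact(history)` on the serialized tool history right before handoff leaves the original untouched, so the automated side can keep using the real values. Note that this only catches sensitive values stored under known keys; values embedded in free text need a separate pass.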

Case 3: When Your Tool Descriptions Suck

This one's embarrassing. We had an "inventory check" tool with the description: "Query product inventory."

The result? When customers asked "Is this in stock?", the agent would first call the product details tool, then the pricing tool, and finally the inventory tool — because it wasn't sure if "inventory" meant the same thing as "in stock."

We rewrote the description to: "Call this tool when users ask if a product is available, check stock levels, or inquire about size/color availability. Requires skuid and warehousecode parameters."

Accuracy jumped from 67% to 94%.
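To make the rewrite concrete, here is roughly what such a tool definition could look like as a JSON-Schema-style dict, plus a tiny check that every required parameter is named in the description. The tool name `check_inventory`, the string parameter types, and the `undocumented_params` helper are assumptions; only the description text and the two parameter names come from our actual tool.

```python
INVENTORY_TOOL = {
    "name": "check_inventory",  # hypothetical tool name
    "description": (
        "Call this tool when users ask if a product is available, check stock "
        "levels, or inquire about size/color availability. "
        "Requires skuid and warehousecode parameters."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            # Types are assumed; the post only gives the parameter names.
            "skuid": {"type": "string", "description": "Product SKU identifier"},
            "warehousecode": {"type": "string", "description": "Warehouse to check"},
        },
        "required": ["skuid", "warehousecode"],
    },
}


def undocumented_params(tool: dict) -> list[str]:
    """Return required parameters that the description never mentions."""
    description = tool["description"].lower()
    return [p for p in tool["parameters"]["required"] if p.lower() not in description]
```

Run against the old "Query product inventory." description, the check flags both parameters. It's a cheap lint to put in CI next to your tool definitions.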

Seriously — tool descriptions are for LLMs, not developers. This mental shift alone saved me at least two weeks of debugging. Anthropic published a tool use best practices doc in November 2024 that's actually more practical than OpenAI's official docs on this topic. Go read it.

Human-in-the-Loop Design Principles

After these projects, I've settled on a few principles:

1. Define Clear "Trust Boundaries"

We drew three lines:

2. Lossless Context Transfer

Nothing's worse than losing information during handoff. The SDK's handoff packages tool call history, intermediate results, and the original user input. We added a "summary layer" on top — gpt-4o-mini generates a 200-character summary before transfer, so human agents can grasp the full picture in about 5 seconds.
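A summary layer like this can be sketched as a small wrapper around whatever the handoff carries. Everything named here (`HandoffPacket`, `build_packet`, the `tool`/`output` keys) is hypothetical, and `summarize` is an injected callable so the example runs without a model. In production it would wrap the gpt-4o-mini call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class HandoffPacket:
    user_input: str
    tool_history: list[dict] = field(default_factory=list)
    summary: str = ""


def build_packet(
    user_input: str,
    tool_history: list[dict],
    summarize: Callable[[str], str],
    max_chars: int = 200,
) -> HandoffPacket:
    """Bundle the full context plus a short summary for the human agent."""
    transcript = "\n".join(
        [f"user: {user_input}"]
        + [f"{call['tool']}: {call['output']}" for call in tool_history]
    )
    # Hard cap in case the summarizer overshoots the length it was asked for.
    summary = summarize(transcript)[:max_chars]
    return HandoffPacket(user_input, list(tool_history), summary)
```

The full history still travels with the packet. The summary is only there so the human can orient quickly before digging in.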

3. Set a "Loop Timeout"

Humans get busy. We set a 30-minute timeout — if no human responds, the system auto-notifies the customer that "we're expediting your case" and escalates to a supervisor queue. This dropped our 48-hour repeat complaint rate by roughly 40%.
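The timeout check itself is simple. Here's a minimal sketch, assuming a periodic job that looks at every case waiting for review; the function name, the `claimed` flag, and the returned action strings are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

REVIEW_TIMEOUT = timedelta(minutes=30)  # the 30-minute timeout described above


def next_action(handed_off_at: datetime, claimed: bool, now: datetime) -> str:
    """Decide what to do with a case that is waiting for human review."""
    if claimed:
        return "wait"  # a human already picked it up
    if now - handed_off_at >= REVIEW_TIMEOUT:
        # Notify the customer and move the case to the supervisor queue.
        return "escalate"
    return "wait"
```

Passing `now` in explicitly, for example `datetime.now(timezone.utc)`, keeps the function easy to test with fixed timestamps.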

A Debugging Trick That Saved Me

The SDK's trace feature lets you replay every tool call decision. I keep traces wide open in dev and watch for two patterns:

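Tracing is on by default in the SDK. If it has been switched off somewhere in your setup, turn it back on with:

from agents import set_tracing_disabled

set_tracing_disabled(False)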

Then check the OpenAI Dashboard for the full call graph. This has helped me catch at least 5 subtle bugs. A friend building a fintech customer service bot had a similar issue — turns out Tool A's API response format changed (third-party API, undocumented change), so the agent couldn't find expected fields and started frantically calling Tool B instead. The trace graph made it obvious.
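One cheap guard against that kind of silent schema drift is to check each tool response for the fields the agent relies on before handing it back. This is a sketch under assumptions: `check_tool_response` and the field names in the usage note are made up, and the status/summary shape is simply a compact format that's easy for the model to read.

```python
from typing import Any, Iterable


def check_tool_response(
    tool_name: str, response: dict[str, Any], expected_fields: Iterable[str]
) -> dict[str, str]:
    """Flag a response whose shape drifted from what the agent expects."""
    missing = sorted(set(expected_fields) - set(response))
    if missing:
        # An explicit error is easier for the model (and the trace) to act on
        # than a payload that quietly lacks the fields it is looking for.
        return {
            "status": "error",
            "summary": f"{tool_name} response is missing: {', '.join(missing)}",
        }
    return {"status": "ok", "summary": f"{tool_name} returned all expected fields"}
```

For example, `check_tool_response("tool_a", {"balance": 10}, ["balance", "currency"])` reports `currency` as missing, which shows up in the trace as an explicit error instead of a burst of calls to another tool.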

TL;DR / Key Takeaways

Here's what I've landed on after all this: let agents do what they're good at — fast retrieval and batch processing. Leave judgment and edge cases to humans.

The OpenAI Agents SDK provides solid infrastructure, but the real challenge is design — how you describe tools, where you draw trust boundaries, how you architect the feedback loops. There's no silver bullet. You just have to grind through it, scenario by scenario.

What agent framework is your team using? Ever had tool calling spiral out of control? Drop a comment — I'm writing about cost control strategies for agent tool calls next week, and I'd love to include real-world examples from the community. Funny enough, a friend told me last week they're seeing the same token explosion problem with CrewAI, so this isn't a framework-specific issue. It's a design pattern problem.

#OpenAI #AgentsSDK #AIEngineering #ProductionAI #HumanInTheLoop

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
