We Deployed 3,000 AI Agents on Black Friday—Here's What Broke and How We Fixed It
Last Black Friday, our intelligent customer service system collapsed at peak traffic. 3,000 AI agents went offline simultaneously. I was staring at a screen full of 429 error logs when our operations director called, and honestly—I had nothing useful to say.
That outage cost us roughly £160,000 in lost GMV.
It also taught me something brutal: deploying agents in production is nothing like running a demo.
Today I want to walk you through the nightmare of deploying OpenAI's Agents SDK (we're on v0.12.3) at scale—the traps you absolutely must avoid, and the architecture we eventually built that actually survives traffic surges. It took about six months of trial and error to stabilise.
Demos Lie: The Real Challenges of Production Agents
Let me burst your bubble right now.
I'd estimate 90% of OpenAI's official Agent SDK examples are designed for single-user scenarios. Those "Build Your First Agent in 5 Minutes" tutorials? They're all synchronous calls and in-memory sessions underneath—a ticking time bomb in production.
Last September, our team ran a stress test. We used Locust to simulate 100 concurrent users against a customer service agent built from the official example. The results were... educational. Average response time shot from 1.2 seconds to 47 seconds. Token consumption exploded 8x. And 23% of requests simply timed out.
Root cause analysis pointed to three problems.
Memory leaks in session management. The SDK's default ConversationHistory stores everything in memory. Each session gobbles up 2-5MB. Hit 5,000 concurrent sessions, and our 8GB instances went straight to OOM hell. You'd never catch this in development—it's just you testing, after all. We later discovered an AgentSession object that wasn't being properly released, so even the garbage collector couldn't reclaim it. GitHub issue #342 mentions this, but we hadn't spotted it at the time.
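To make the failure mode concrete, here is a minimal sketch of one way to keep in-memory sessions bounded. It is an illustration, not the fix described above: `BoundedSessionStore`, its parameters, and the defaults are hypothetical names and values, and nothing here is part of the Agents SDK.

```python
import time
from collections import OrderedDict


class BoundedSessionStore:
    """In-memory session store with an entry cap and an idle timeout (illustrative)."""

    def __init__(self, max_sessions=1000, ttl_seconds=1800, clock=time.monotonic):
        self.max_sessions = max_sessions
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._sessions = OrderedDict()  # session_id -> (last_access, history)

    def get(self, session_id):
        """Return the history list for a session, creating or refreshing it."""
        now = self._clock()
        self._evict_expired(now)
        _, history = self._sessions.pop(session_id, (now, []))
        self._sessions[session_id] = (now, history)  # re-insert as most recent
        while len(self._sessions) > self.max_sessions:
            self._sessions.popitem(last=False)  # drop the least recently used
        return history

    def _evict_expired(self, now):
        # Entries are ordered by last access, so stop at the first live one.
        while self._sessions:
            _, (last_access, _) = next(iter(self._sessions.items()))
            if now - last_access < self.ttl_seconds:
                break
            self._sessions.popitem(last=False)

    def __len__(self):
        return len(self._sessions)
```

With a cap and a timeout, memory use is bounded by `max_sessions` times the per-session size, whatever the traffic does. The trade-off is that an evicted user starts a fresh conversation unless the history is also persisted somewhere external.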
Unbounded tool-calling chains. When agents tackle complex tasks, they sometimes—wait, I should correct myself. Not "sometimes." Frequently. They frequently get stuck in a "call tool → analyse result → call another tool" loop. The most ridiculous case we saw: a user asked "When will my order arrive?" and the agent made 17 tool calls, including 3 weather queries and 2 logistics API retries, before returning a result worse than a direct database lookup. We later analysed 100,000 production logs and found roughly 7% of requests showed some degree of tool abuse.
The economics of model selection. Lots of teams default to GPT-4o for everything, thinking "it's only a few pence more." But when you're handling 500,000 conversations a day, those pennies become roughly £95,000 per month. We eventually did the maths: shifting 70% of simple intent recognition to GPT-4o-mini cut costs by 62% while actually improving response speed by 40%. I didn't expect that number myself.
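The split itself can be as simple as a lookup on the classified intent. A minimal sketch, assuming an upstream classifier already produces an intent label; the labels in `SIMPLE_INTENTS` and both function names are hypothetical.

```python
# Hypothetical intent labels that a cheap model handles well.
SIMPLE_INTENTS = {"order_status", "greeting", "store_hours"}


def pick_model(intent, simple_model="gpt-4o-mini", complex_model="gpt-4o"):
    """Send simple intents to the small model and everything else to the large one."""
    return simple_model if intent in SIMPLE_INTENTS else complex_model


def model_mix(intents):
    """Return the share of requests routed to each model."""
    counts = {}
    for intent in intents:
        model = pick_model(intent)
        counts[model] = counts.get(model, 0) + 1
    total = len(intents) or 1
    return {model: count / total for model, count in counts.items()}
```

Running `model_mix` over a day of logged intents is a cheap way to check what share of traffic would actually move before changing anything in production.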
Agents aren't about being as clever as possible—they're about being as appropriate as possible. Use the right model in the right place. That's the first principle of production deployment. My team lead said that, and I think she's spot on.
Architecture: From Single-Node to Distributed
After that Black Friday disaster, we completely rebuilt the agent service. Our current architecture—"three layers plus two queues"—has been running stably for eight months now, surviving both the summer sales and Boxing Day rushes.
Layer 1: Agent Gateway
This is the entry point for all requests. But its job goes way beyond simple routing.
Intelligent rate limiting. Not just a basic token bucket—we do dynamic tiering based on user profiles. VIP customers get higher priority for their agent requests. Free-tier users get downgraded to lighter models during peak hours. This strategy kept core user availability at 99.7% during major sales events while only increasing overall resource consumption by 15%. We built it on Nginx Plus, and it took about four configuration iterations to get right.
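The tiering idea can be sketched at application level in a few lines. This is not the Nginx Plus configuration mentioned above; it is an illustrative Python version, and the tier names, rates, and the `admit` helper are all assumptions.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at rate_per_sec up to capacity."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self._clock = clock
        self._last = clock()

    def allow(self):
        now = self._clock()
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical tiers: (refill rate per second, burst capacity).
TIER_LIMITS = {"vip": (50, 100), "free": (5, 10)}


def admit(tier, buckets, peak):
    """Return the model to use, or None if the request is rate limited."""
    if not buckets[tier].allow():
        return None
    if tier == "free" and peak:
        return "gpt-4o-mini"  # downgrade free-tier traffic during peak hours
    return "gpt-4o"
```

A real deployment would keep one bucket per user or per API key rather than one per tier, and would store the counters somewhere shared so every gateway instance sees the same state.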
Request normalisation. External requests arrive in all sorts of weird formats—some with message history, some without, some carrying bizarre parameters. The Gateway standardises everything into a clean AgentRequest object, fills in defaults, and strips out dangerous parameters. This step alone blocks at least 30% of anomalous requests. I remember one time, an outdated client SDK version sent parameters that actually crashed our JSON parser. After we fixed that, the problem never came back.
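A minimal sketch of that normalisation step, assuming requests arrive as parsed JSON. The fields on `AgentRequest`, the default locale, and the `normalise` function are illustrative guesses; the post only names the object, not its shape.

```python
from dataclasses import dataclass, field

ALLOWED_ROLES = {"user", "assistant"}  # clients may not inject system messages


@dataclass
class AgentRequest:
    user_id: str
    message: str
    history: list = field(default_factory=list)
    locale: str = "en-GB"  # hypothetical default


def normalise(raw):
    """Build an AgentRequest from an untrusted dict, or raise ValueError."""
    if not isinstance(raw, dict):
        raise ValueError("request body must be a JSON object")
    user_id = raw.get("user_id")
    message = raw.get("message")
    if not isinstance(user_id, str) or not user_id:
        raise ValueError("user_id is required")
    if not isinstance(message, str) or not message.strip():
        raise ValueError("message is required")
    raw_history = raw.get("history")
    if not isinstance(raw_history, list):
        raw_history = []  # missing or malformed history becomes empty
    history = [
        {"role": m["role"], "content": m["content"]}
        for m in raw_history
        if isinstance(m, dict)
        and m.get("role") in ALLOWED_ROLES
        and isinstance(m.get("content"), str)
    ]
    locale = raw.get("locale")
    if not isinstance(locale, str):
        locale = "en-GB"
    return AgentRequest(user_id, message.strip(), history, locale)
```

Building the object from a whitelist of known fields means any unexpected parameter is dropped by construction, which is usually safer than maintaining a blocklist of dangerous ones.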
Session-sticky routing. Agent context is stateful. If two requests from the same conversation get load-balanced to different instances, the experience is a disaster. We implemented hash-based routing on user IDs to ensure continuous conversations always land on the same agent instance. It's fairly basic stuff, honestly, but the official docs don't mention it at all.
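For reference, hash-based routing fits in a few lines. A minimal sketch, assuming a static list of instance names; `pick_instance` and the instance names are hypothetical.

```python
import hashlib


def pick_instance(user_id, instances):
    """Map a user ID to one instance deterministically."""
    if not instances:
        raise ValueError("no instances available")
    # Use a stable hash: Python's built-in hash() is salted per process.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(instances)]


instances = ["agent-0", "agent-1", "agent-2"]
first = pick_instance("user-42", instances)
second = pick_instance("user-42", instances)  # same instance as the first call
```

One caveat with plain modulo hashing: adding or removing an instance remaps most users at once. Consistent hashing limits that churn if the pool resizes often.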
Layer 2: Agent Execution Engine
This is the core layer—and where we stepped on the most landmines. The OpenAI Agents SDK provides great abstractions, but at scale you need to bolt on a lot of "industrial-grade" capabilities yourself.
Context window management is a classic example. The SDK dumps the entire conversation history into the context by default. Long conversations rapidly blow through token limits. Our approach implements a sliding window plus summarisation mechanism: keep the last 10 turns in full, auto-generate summaries of earlier conversations using GPT-4o-mini, and cap those summaries at 500 tokens. This dropped average context length from 8,500 tokens to 3,200, improving response speed by 35%.
# Core context management logic
class ContextManager:
    def __init__(self, max_recent_turns=10, summary_model="gpt-4o-mini"):
        self.max_recent_turns = max_recent_turns
        self.summary_model = summary_model

    def optimize_context(self, conversation_history):
        # One turn is a user message plus an assistant message.
        keep = self.max_recent_turns * 2
        if len(conversation_history) <= keep:
            return conversation_history
        recent = conversation_history[-keep:]
        older = conversation_history[:-keep]
        summary = self._generate_summary(older)
        return [{"role": "system", "content": f"Conversation summary: {summary}"}] + recent

    def _generate_summary(self, messages):
        # Placeholder: call self.summary_model here and cap the result at 500 tokens.
        raise NotImplementedError
But honestly? Summaries sometimes lose critical details—like when the model "summarises away" an order number the user mentioned earlier. We haven't fully solved this yet. The current workaround is forcing retention of numeric entities, which is... fine, I suppose. Not elegant.
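One way to force retention of numeric entities is to pull identifiers out of the older messages with a regular expression and append any that the summary dropped. A minimal sketch under a loud assumption: an "entity" here is any token containing four or more consecutive digits, optionally with a short letter prefix. The pattern and both function names are illustrative.

```python
import re

# Assumption: identifiers look like "12345678" or "ORD-20231".
ENTITY_PATTERN = re.compile(r"\b[A-Za-z]{0,4}-?\d{4,}\b")


def extract_entities(messages):
    """Collect identifier-like tokens from messages, in order, without duplicates."""
    seen = []
    for message in messages:
        for match in ENTITY_PATTERN.findall(message.get("content", "")):
            if match not in seen:
                seen.append(match)
    return seen


def summary_with_entities(summary, older_messages):
    """Append any identifiers that the summary failed to carry over."""
    missing = [e for e in extract_entities(older_messages) if e not in summary]
    if not missing:
        return summary
    return f"{summary} Referenced identifiers: {', '.join(missing)}."
```

This keeps order numbers alive but says nothing about what they refer to, which is why it is a workaround rather than a fix.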
Circuit breakers for tool calls came from painful experience. We now enforce strict tool call limits per agent instance—maximum 8 calls per request. Exceed that, and we force-return partial results with an alert. We arrived at this number after analysing 100,000 real conversations: 95% of user intents resolve within 6 tool calls, and 8 covers 99% of scenarios.
This is a bit complex, but the short version is: don't let your agent become a perpetual tool-calling machine.
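The cap can be sketched as a small budget object that wraps every tool invocation. `ToolCallBudget`, `ToolBudgetExceeded`, and `run_with_budget` are hypothetical names, not SDK classes, and the loop over pre-planned steps stands in for whatever the agent loop really does.

```python
class ToolBudgetExceeded(Exception):
    """Raised when a request tries to exceed its tool call limit."""


class ToolCallBudget:
    def __init__(self, max_calls=8):
        self.max_calls = max_calls
        self.calls = 0

    def call(self, tool, *args, **kwargs):
        """Invoke the tool if budget remains, otherwise trip the breaker."""
        if self.calls >= self.max_calls:
            raise ToolBudgetExceeded(f"limit of {self.max_calls} tool calls reached")
        self.calls += 1
        return tool(*args, **kwargs)


def run_with_budget(steps, max_calls=8):
    """Run (tool, args) steps in order; return (results, truncated)."""
    budget = ToolCallBudget(max_calls)
    results = []
    for tool, args in steps:
        try:
            results.append(budget.call(tool, *args))
        except ToolBudgetExceeded:
            return results, True  # caller should alert and return partial results
    return results, False
```

Creating one budget per request, rather than per agent instance, keeps a runaway conversation from eating the allowance of the next one.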
Layer 3: Model Routing Layer
This layer distributes agent requests to specific model endpoints. Sounds simple, but doing it well requires real finesse.
We maintain a real-time model performance table tracking latency, error rates, and available capacity for each endpoint—Azure East US, OpenAI direct connection, and a fine-tuned Qwen2.5-72B we deployed ourselves using vLLM. When an endpoint gets jittery, the router switches within 50 milliseconds. Users don't notice a thing.
Last month this saved us. OpenAI's API in the US East region experienced a latency spike lasting 12 minutes. Our router automatically shifted traffic to our Azure deployment. Looking at the monitoring afterwards, we handled roughly 80,000 requests during those 12 minutes. Without automatic failover, I reckon at least 30% would have timed out. From what I understand, OpenAI had a similar outage this January that lasted about 3 hours—that one had a much bigger blast radius.
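The selection logic behind that kind of failover can be sketched as a filter plus a sort over the performance table. The field names, thresholds, and endpoint labels below are assumptions for illustration; the post does not give the real scoring rules.

```python
from dataclasses import dataclass


@dataclass
class EndpointStats:
    name: str
    p95_latency_ms: float
    error_rate: float     # fraction of failed requests, 0.0 to 1.0
    free_capacity: float  # fraction of capacity still available, 0.0 to 1.0


def pick_endpoint(table, max_latency_ms=2000, max_error_rate=0.05):
    """Return the name of the best endpoint, preferring healthy ones."""
    if not table:
        raise ValueError("no endpoints configured")
    healthy = [
        e for e in table
        if e.p95_latency_ms <= max_latency_ms
        and e.error_rate <= max_error_rate
        and e.free_capacity > 0
    ]
    # If nothing passes the health check, fall back to the least-bad endpoint.
    candidates = healthy or table
    best = min(candidates, key=lambda e: (e.error_rate, e.p95_latency_ms))
    return best.name
```

Sorting on error rate before latency is a design choice: a slow answer is usually recoverable, a failed one is not.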
Monitoring: You Can't Improve What You Can't See
The scariest thing about large-scale agent deployment isn't problems—it's problems you don't know about. Our monitoring now covers three dimensions.
Business metrics: Task completion rate, user satisfaction scores, human handoff rate. These ultimately measure whether your agent is any good. We discovered something fascinating—when agent response time exceeds 3.5 seconds, user satisfaction falls off a cliff, even if the final answer is correct. So we now set our P99 latency alert at exactly 3.5 seconds.
Technical metrics: Token consumption, tool call counts, model latency, error rates. These are your diagnostic handles. We built a Grafana dashboard showing a "health score" for each agent instance. Drop below 60, and the instance gets automatically removed while a fresh one spins up. The auto-removal logic had a bug initially: it once removed all healthy instances, causing a cascade failure. We added a minimum survival ratio after that to stabilise things.
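A minimum survival ratio can be expressed as a cap on how many instances a single sweep may remove. A minimal sketch, assuming health scores arrive as a dict keyed by instance ID; the 0.5 ratio and the function name are hypothetical, and only the threshold of 60 comes from the text above.

```python
import math


def instances_to_remove(scores, threshold=60, min_survival_ratio=0.5):
    """Return instance IDs to remove, worst first, without breaching the survival floor."""
    must_keep = math.ceil(len(scores) * min_survival_ratio)
    removable = max(0, len(scores) - must_keep)
    # Sort unhealthy instances by score so the worst ones go first.
    unhealthy = sorted((score, instance) for instance, score in scores.items() if score < threshold)
    return [instance for _, instance in unhealthy[:removable]]
```

Even if a bad metric makes every instance look unhealthy, at most half the pool is removed in one pass, which leaves time for replacements to come up.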
Cost metrics: Broken down by agent type, by customer, by time period. This data directly drove our model selection strategy. For example, we discovered that 80% of customer service requests between 3-6 AM are simple "check my order status" queries—perfectly handled by GPT-4o-mini. After switching, costs during those hours dropped 73%. Our finance team was genuinely shocked.
Monitoring isn't about assigning blame after things break—it's about catching problems before they explode. A good monitoring system should be like your car's dashboard, not the black box from a crash investigation.
Problems We're Still Wrestling With
Honestly, even though our current setup is reasonably stable, plenty of issues remain unsolved.
State synchronisation in multi-agent collaboration is a massive headache. We now have customer service agents, recommendation agents, and after-sales agents that need to share user context, but each manages its own state independently. We're currently using Redis as a shared state store, but we haven't found the sweet spot between consistency and latency. I read Meta's late-2024 paper on multi-agent collaboration—their approach is far too idealised to work in real business scenarios.
Automated agent evaluation is another rabbit hole. Most testing still relies on manual test case execution, which is painfully slow. We're experimenting with using one agent to evaluate another agent's production performance, but then you've got the problem of evaluator bias. It's turtles all the way down.
Security granularity needs a complete rethink. Traditional web security approaches are woefully inadequate for agent scenarios. Prompt injection, indirect prompt attacks—these new attack vectors have pretty primitive defences right now. We've already seen cases where users derailed agents with "ignore previous instructions" prompts. We added a Prompt Filter at the Gateway layer, which mostly blocks it, but there are definitely ways around it.
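For a sense of how thin this kind of defence is, here is a minimal sketch of a pattern-based filter. The patterns and the `is_suspicious` name are illustrative assumptions, not the Prompt Filter described above.

```python
import re

# Illustrative patterns only; real attacks are far more varied.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(the\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(your|the)\s+(system\s+)?prompt", re.IGNORECASE),
]


def is_suspicious(text):
    """Return True if the text matches a known injection phrase."""
    return any(pattern.search(text) for pattern in SUSPICIOUS_PATTERNS)
```

A filter like this catches the copy-pasted attacks and nothing else. Paraphrases, other languages, and instructions hidden in tool output all pass straight through, so it belongs in front of other controls rather than in place of them.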
Key Takeaways
TL;DR for the skimmers:
- Demos are not production. Synchronous calls and in-memory sessions will destroy you at scale.
- Context windows need active management. Sliding windows + summarisation cut our average context length by about 62% (from 8,500 to 3,200 tokens).
- Circuit-break your tool calls. Cap them at 8 per request—99% of intents resolve within that.
- Model routing isn't optional. Real-time failover saved us during a 12-minute OpenAI latency spike.
- Monitor like your job depends on it. Because it probably does.
- Have a degradation path. When your agent fails (not if—when), what happens? Rule engine? Human handoff? Figure this out before launch.
Final Thoughts
If you're thinking about deploying agents to production, here's the one piece of advice I'd hammer home: figure out your degradation path first. When your agent crashes, slows down, or just gets things wrong—what does your system do? Fall back to a rules engine? Escalate to a human? Return a friendly error message?
If you can't answer that question clearly, don't rush to launch.
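A degradation path can be written down as an ordered list of handlers, tried until one produces an answer. A minimal sketch; the handler names, the sample rule, and the default message are all hypothetical.

```python
def answer_with_fallback(request, handlers, default="Sorry, we can't help right now. A colleague will follow up."):
    """Try each (name, handler) in order; return (source, reply)."""
    for name, handler in handlers:
        try:
            reply = handler(request)
        except Exception:
            continue  # in production, log the failure and count it as a metric
        if reply:
            return name, reply
    return "static", default


def broken_agent(request):
    raise TimeoutError("model timeout")  # simulates the agent being down


def rules_engine(request):
    # Toy rule: only knows how to answer order questions.
    return "Your order is in transit." if "order" in request else None


chain = [("agent", broken_agent), ("rules", rules_engine)]
source, reply = answer_with_fallback("where is my order", chain)
# source is "rules": the agent failed, so the rules engine answered.
```

Returning the source alongside the reply makes it easy to track how often each fallback level fires, which is exactly the number to watch during chaos testing.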
Our team now has an unwritten rule: every agent must pass "chaos testing" before going live. We randomly inject network latency, model errors, and tool timeouts using Chaos Mesh, then watch whether the system degrades gracefully. This testing has saved us more than once.
What bizarre problems have you hit deploying agents? Got any clever approaches to multi-agent state management? Drop a comment—I read every single one.
#OpenAI #AgentsSDK #ProductionEngineering #AIDeployment #DistributedSystems #SiteReliability
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.