The 3-Hour MCP Outage That Taught Me Monitoring Is About Knowing What to Look For

Last November 10th, 11:47 PM. I remember the exact time because I'd just poured myself a coffee, ready for the usual pre-shopping-festival all-nighter. Sat down, took one sip—and my phone started buzzing. Payment success rate had dropped from 99.7% to 98.2%.

That's 1.5 percentage points. Sounds tiny, right? At our transaction volume, that meant losing 340 orders every minute.

My team stared at Grafana for half an hour. Every single panel was green. CPU? Fine. Memory? Fine. P99 latency was actually lower than usual by 12ms. I was losing my mind—all the metrics were screaming "everything's fine" while the business was screaming "everything's on fire."

Here's what actually happened: a fraud detection model inside our MCP toolchain had a timeout bug. Requests were getting stuck for 8 seconds before timing out, but that node wasn't instrumented with Prometheus. Our monitoring saw "no slow requests" because the slow requests were literally invisible to the system.

At 3 AM, sitting at my desk doing the post-mortem, I had a painfully obvious realization: observability isn't about collecting metrics. It's about knowing what to look at before things break. We had 200+ dashboards. Not a single one helped when it mattered.

Why MCP Toolchains Are a Monitoring Nightmare

Quick background. MCP—Model Context Protocol—is what Anthropic released in late 2024 to standardize tool calling. When an LLM needs to query a database, call an API, or read a file, those operations get wrapped as MCP Servers, with an MCP Client handling orchestration. Think of it as a function-calling protocol for AI applications.

So what's the problem? The traditional monitoring holy trinity—Metrics, Logs, Traces—is full of holes when it comes to MCP toolchains.

A typical MCP call chain looks like this: User asks a question → LLM reasons → decides to call a tool → MCP Client sends request → MCP Server executes → returns result → LLM reasons again → outputs answer. Looks a lot like a regular RPC call, doesn't it?

Actually, let me correct myself. I said "looks a lot like," but they're completely different. Regular microservice calls are deterministic. Service A calls Service B—the code path is fixed. In an MCP toolchain, the model decides which tool to call, when to call it, and how many times. None of that is predetermined. You can't know the call graph in advance.

There are at least four monitoring blind spots in this chain:

The LLM reasoning phase: How did the model decide to call a tool? Why tool A instead of tool B? This decision process is a black box. You dig through logs and find one line: tool_called: search_database. But you have no idea why. I once added a line to a prompt telling the model to "prefer the cache tool." It started calling the cache in completely inappropriate scenarios, and P99 latency tripled. Classic butterfly effect from a tiny prompt tweak. Traditional monitoring won't catch this.

Tool selection bias: An MCP Server exposes 8 tools. The model might pick different tools based on subtle wording changes in the user's input. Someone says "look up" versus "find"—the model picks search versus query. These two tools return different data structures, and downstream processing falls apart. Can infrastructure monitoring detect this? Nope.

Chain-call explosion: One user request might trigger 5 to 10 tool calls. Each call, individually, looks fine—80ms, 120ms, 90ms. But chained together, the P99 hits 40 seconds. Look at each span separately, all green. Put them together, and the user has already closed the tab.
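To make the "green spans, red request" problem concrete, here is a minimal sketch that judges the request as a whole instead of each span alone. The `Span` record, its field names, and the 2-second budget are illustrative assumptions, not our production schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    request_id: str     # hypothetical field: groups the spans of one user request
    tool: str
    duration_ms: float


def over_budget(spans, budget_ms=2000.0):
    """Return {request_id: total_ms} for requests whose sequential tool calls
    add up to more than budget_ms, even if every single span looks healthy."""
    totals = defaultdict(float)
    for span in spans:
        totals[span.request_id] += span.duration_ms
    return {rid: total for rid, total in totals.items() if total > budget_ms}


# Thirty "fine" 90 ms calls in one request, three 120 ms calls in another.
spans = [Span("req-1", "search", 90.0) for _ in range(30)]
spans += [Span("req-2", "search", 120.0) for _ in range(3)]
slow = over_budget(spans)
```

Only the first request is flagged, although no individual span in it is slower than the spans in the second one.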

Partial failure mode: This one's tricky. Out of 10 tool calls, 9 succeed and 1 times out. The LLM might use those 9 successful results to generate an answer that looks correct. No error is thrown. The user doesn't complain. But the data is wrong. And good luck reproducing it—next time, all 10 calls might succeed, and the answer will be right.

At Stripe, I worked on an internal LLM tooling system where we hit that fourth blind spot hard. A financial query tool occasionally returned an empty array (timeout was silently swallowed), and the model generated "You have no transactions this month"—for a customer with over 3,000 transactions. By the time we caught it, 72 hours had passed and customer complaints had piled up. Direct loss? Around $470,000. Indirect trust damage? Impossible to calculate.
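One way to keep a swallowed timeout from masquerading as "no data" is to carry an explicit status with every tool result. This is a sketch under my own naming (`ToolResult`, `ToolStatus`, `call_tool`, `render_for_model` are all hypothetical), not the code from that system.

```python
from dataclasses import dataclass, field
from enum import Enum


class ToolStatus(Enum):
    OK = "ok"
    TIMEOUT = "timeout"
    ERROR = "error"


@dataclass
class ToolResult:
    tool: str
    status: ToolStatus
    data: list = field(default_factory=list)


def call_tool(tool, fn):
    """Run fn() and keep the failure mode visible instead of returning []."""
    try:
        return ToolResult(tool, ToolStatus.OK, fn())
    except TimeoutError:
        return ToolResult(tool, ToolStatus.TIMEOUT)
    except Exception:
        return ToolResult(tool, ToolStatus.ERROR)


def render_for_model(result):
    """Text handed back to the LLM; a failure must never look like 'no data'."""
    if result.status is not ToolStatus.OK:
        return (f"[{result.tool}] FAILED ({result.status.value}): "
                "result unknown, do not infer absence of data.")
    return f"[{result.tool}] {len(result.data)} rows"


def timed_out():
    raise TimeoutError("upstream took too long")


failed = call_tool("transactions", timed_out)
empty = call_tool("transactions", lambda: [])
```

A genuinely empty result and a timed-out call now produce different text for the model and different status values for your metrics.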

Case 1: The Ghost Latency in a Tool Node

Back to the story I started with.

The payment gateway's MCP architecture worked like this: incoming request hits a fraud detection MCP Server with three tools—risk assessment, historical behavior lookup, and rule engine matching. Under normal conditions, all three run in parallel, keeping total latency under 200ms.

On the night of November 10th, Jaeger showed tons of requests spending 8 to 12 seconds on the "risk assessment" node. But here's the weird part: that node's P99 metric showed only 180ms.

What was going on?

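Root cause: The risk assessment tool had retry logic: a 500ms timeout and 3 retries. That night, an upstream PostgreSQL database was running slow queries (autovacuum, we later found out), and some requests were spending 3 to 5 seconds at the database. The 500ms timeout should have triggered a fast failure and retry, but here's the bug: we were using Python's requests library, and our timeout=0.5 was not bounding the call the way we assumed. For the record, the requests documentation says a single value applies to both the connect and the read timeout, and that the read timeout limits the wait between bytes from the server rather than the total response time. A bare timeout is never a hard deadline on the whole call.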

So the request went out, the connection was established, and then it just sat there waiting for the database to return results. For 8 seconds.

I can write this bug from memory:


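```python
import requests

# `url` and `payload` are supplied by the surrounding tool code.

# What we had: a single float
response = requests.post(url, json=payload, timeout=0.5)

# What we changed it to: an explicit (connect, read) tuple
response = requests.post(url, json=payload, timeout=(0.5, 0.5))
# connect timeout = 0.5 s, read timeout = 0.5 s
```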

One tuple. That's the difference between a working payment gateway and a broken one.
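A caveat on the snippet above: as I read the requests documentation, its timeout bounds connection setup and the gap between received bytes, not the total duration of a call. If you need a hard wall-clock cap, one option is to enforce it outside the HTTP client. A minimal sketch using only the standard library (the helper name `call_with_deadline` and the stand-in `slow_call` are mine, for illustration):

```python
import concurrent.futures
import time


def call_with_deadline(fn, deadline_s):
    """Run fn() in a worker thread and stop waiting after deadline_s seconds.

    The worker thread is not killed; it keeps running in the background.
    This only bounds how long the caller waits.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=deadline_s)
    finally:
        # Do not block on a stuck worker when leaving.
        pool.shutdown(wait=False)


def slow_call():
    time.sleep(0.3)  # stand-in for a request stuck on a slow upstream
    return "done"


try:
    call_with_deadline(slow_call, deadline_s=0.05)
    timed_out = False
except concurrent.futures.TimeoutError:
    timed_out = True
```

The trade-off is that the abandoned call still consumes a thread and a connection until it finishes on its own, so this complements client-level timeouts rather than replacing them.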

What made this fatal: the metrics collection point was after the function returned. During those 8 seconds of waiting, Prometheus had no idea the request existed. We were monitoring successfully returned requests, not in-flight requests.

It's like monitoring a restaurant by counting dishes that reach the table. Fifty orders are piling up in the kitchen, and you have no clue.

The fix was simple: change timeout to the (connect_timeout, read_timeout) tuple. But more importantly, we added new instrumentation: entry and exit points on every tool call, with an in_flight_requests gauge exposing the current count of executing requests. Now, even if a request gets stuck, we can see "this tool's in-flight count just spiked" in our alerts.
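The entry/exit instrumentation is easy to sketch. The class below is a stdlib stand-in for a real metrics gauge, and the names (`InFlightGauge`, `track`, the `risk_assessment` label) are illustrative assumptions rather than our actual code.

```python
import threading
from contextlib import contextmanager


class InFlightGauge:
    """Stand-in for a metrics gauge: counts requests that have started
    but not yet finished, per tool."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    @contextmanager
    def track(self, tool):
        with self._lock:
            self._counts[tool] = self._counts.get(tool, 0) + 1  # entry point
        try:
            yield
        finally:
            with self._lock:
                self._counts[tool] -= 1  # exit point, reached even on error

    def value(self, tool):
        with self._lock:
            return self._counts.get(tool, 0)


gauge = InFlightGauge()
with gauge.track("risk_assessment"):
    during = gauge.value("risk_assessment")  # visible while the call is running
after = gauge.value("risk_assessment")
```

The important property is that the count goes up before the work starts, so a request that never returns still shows up. If you use the prometheus_client package, I believe its Gauge offers a `track_inprogress()` helper for the same pattern; check the docs for your version.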

**First principle of observability: monitor the full lifecycle of requests, not just the successful endings.**

Case 2: Token Consumption Spiked 400% Because Someone Changed 3 Words

This one hit us directly in the wallet.

Last March, finance messaged me: "LLM API costs are up 410% month-over-month. Are we under attack?"

First thing I checked: call volume.

Not up. Actually down 8%.

So the problem had to be token consumption per call. Digging through logs, I found that starting March 12th, average token consumption per customer service conversation jumped from 1,200 to 4,800.

Drilling deeper: the knowledge base search tool was returning 15 results instead of 3. I found the engineer responsible, and he said, "Oh, I optimized the search strategy last week—changed top_k from 3 to 15. Better recall, right?"

I nearly spit out my coffee.

Recall was better, alright. And the token bill exploded. Even more ironic: customer satisfaction scores dropped 2 percentage points. Why? With 15 results, the LLM needed more time to read and summarize, slowing down responses. And with too much information, the model sometimes mixed up content from different documents, giving contradictory answers. A customer would ask "When will my order arrive?" and the model would first quote the shipping doc saying "expected tomorrow," then quote the policy doc saying "possible 3-5 day delay." The customer was left completely confused.

This case exposed a real problem: most teams only monitor tool call success/failure. They never look at what gets returned. The size of returned data directly impacts LLM cost and answer quality, but both dimensions are blind spots in traditional monitoring.

We added three things:

1. Return data size per tool call (token count via tiktoken, rounded to the nearest hundred)
2. Cumulative token consumption across all tool calls in a single conversation
3. Correlation between tool return data size and user satisfaction scores
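The first two measurements can be sketched roughly like this. It assumes the tiktoken package with the `cl100k_base` encoding and falls back to a crude characters-divided-by-four estimate when tiktoken is unavailable; the class and function names are hypothetical.

```python
def count_tokens(text):
    """Token count via tiktoken when available; otherwise a rough
    characters/4 estimate (an assumption, good enough for trends only)."""
    try:
        import tiktoken
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except Exception:
        # tiktoken missing, or its encoding file could not be loaded
        return max(1, len(text) // 4)


def bucket_to_hundreds(tokens):
    """Round to the nearest hundred to keep metric cardinality low."""
    return int(round(tokens / 100.0)) * 100


class ConversationTokens:
    """Accumulates tool-return tokens across one conversation."""

    def __init__(self):
        self.per_tool = {}

    def record(self, tool, returned_text):
        tokens = count_tokens(returned_text)
        self.per_tool[tool] = self.per_tool.get(tool, 0) + tokens
        return tokens

    @property
    def total(self):
        return sum(self.per_tool.values())


doc = "Orders ship within two business days of payment confirmation. "
conv = ConversationTokens()
small = conv.record("kb_search", doc * 3)    # roughly what top_k = 3 returns
large = conv.record("kb_search", doc * 15)   # roughly what top_k = 15 returns
```

Recording the count at the point where the tool result is handed to the model is what makes a change like top_k 3 to 15 visible on the day it ships instead of on the invoice.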

That third one was fascinating. After a month of data, a simple regression analysis showed that returning 5-7 results from knowledge base searches gave the highest satisfaction. Above 10, satisfaction actually dropped. This data directly drove a prompt change—we added one line to the system prompt: "If search results exceed 7 items, use only the 5 most relevant."

That one sentence cut our monthly bill by 37%.
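We did this in the prompt. The same rule can also be enforced in the tool wrapper, so the extra tokens never reach the model at all. A sketch of that alternative, assuming each search hit carries a numeric `score` field (a hypothetical shape, not our actual schema):

```python
def cap_results(results, threshold=7, keep=5):
    """If more than `threshold` results come back, keep only the `keep`
    highest-scoring ones; otherwise pass them through unchanged."""
    if len(results) <= threshold:
        return list(results)
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return ranked[:keep]


hits = [{"doc": f"doc-{i}", "score": i / 15} for i in range(15)]
trimmed = cap_results(hits)
```

The prompt version leaves the choice of "most relevant" to the model; the wrapper version is cheaper and deterministic but only as good as the retrieval score.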

**Monitoring what tools return matters more than monitoring whether they return. The former determines cost and quality; the latter is just an ops metric.**

Case 3: Trace ID Collisions—6 Hours of Debugging Hell

This is the one I most want to tell you about.

It was so subtle.

January this year. A user reported that their order status showed "shipped" but the tracking number was empty. We looked up the trace by user ID—the call chain was perfectly normal. Check order, check logistics, generate response. Every span returned correct data.

But the user's screenshot clearly showed no tracking number.

Six hours of investigation later, we found something that made me want to throw my laptop out the window: the trace we were looking at wasn't this user's request at all. Two completely different requests had the exact same trace ID.

Root cause: a concurrency bug in the MCP Client. We were using Go 1.21.3. Under high concurrency, context propagation in goroutines was broken—new requests were reusing the context from previous requests, overwriting trace IDs. It happened roughly once every 300 requests.

The impact was way bigger than we initially thought. Backtracking a week of trace data, we found about 0.3% of requests had trace ID collisions. That meant, in all the requests we'd been calling "normal," about 3 in 1,000 were actually masking various weird errors.

Fixing the bug took 2 days. Building the prevention mechanisms took 2 weeks. We did three things:

1. Trace ID uniqueness validation at the MCP Client level: after generating a trace ID, check a Bloom filter of the last 10,000 IDs. If there's a hit, regenerate and alert. False positive rate set at 0.01%, memory footprint around 200KB.
2. Log both trace ID and business ID (user ID + request timestamp): even if trace IDs collide, we can reconstruct the correct call chain using the business ID.
3. A nightly offline check: at 3 AM every day, randomly sample 1,000 traces and check if span start/end times, parent-child relationships, and node counts make sense. If a user request trace contains two different user_id values, trigger an alert immediately.
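The uniqueness check can be sketched with a textbook Bloom filter. Our client is written in Go; this Python version is only an illustration, and every name in it is hypothetical. Two caveats: a plain Bloom filter cannot forget old IDs, so a true "last 10,000" window needs something extra (for example rotating two filters), which is omitted here; and the raw bit array this sizing produces is a few tens of kilobytes, so the footprint of a real implementation will differ.

```python
import hashlib
import math


def bloom_size(n_items, false_positive_rate):
    """Textbook sizing: bit count m and hash count k for n items at rate p."""
    m = math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    k = max(1, round(m / n_items * math.log(2)))
    return m, k


class BloomFilter:
    def __init__(self, n_items, false_positive_rate):
        self.m, self.k = bloom_size(n_items, false_positive_rate)
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def seen_before(self, item):
        """Return True if item was probably seen; record it either way."""
        positions = self._positions(item)
        hit = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return hit


recent = BloomFilter(n_items=10_000, false_positive_rate=0.0001)
first = recent.seen_before("4bf92f3577b34da6a3ce929d0e0e4736")
second = recent.seen_before("4bf92f3577b34da6a3ce929d0e0e4736")
size_kib = len(recent.bits) / 1024
```

On a hit, the client would regenerate the ID and raise an alert; a false positive costs one wasted regeneration, which is why a small error rate is acceptable here.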

**Distributed tracing only works if IDs are actually unique. If you trust trace IDs 100%, sooner or later you'll pay for that trust.**
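The nightly sanity check boils down to a few invariants per sampled trace. A minimal sketch, assuming spans are exported as flat records with the fields below (the `SpanRecord` shape is my assumption, not a real exporter format):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanRecord:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    user_id: str
    start_ms: int
    end_ms: int


def trace_problems(spans):
    """Return a list of human-readable problems found in one trace."""
    problems = []
    user_ids = {s.user_id for s in spans}
    if len(user_ids) > 1:
        problems.append(f"multiple user_ids in one trace: {sorted(user_ids)}")
    span_ids = {s.span_id for s in spans}
    for s in spans:
        if s.end_ms < s.start_ms:
            problems.append(f"span {s.span_id} ends before it starts")
        if s.parent_id is not None and s.parent_id not in span_ids:
            problems.append(f"span {s.span_id} has unknown parent {s.parent_id}")
    return problems


collided = [
    SpanRecord("t1", "a", None, "user-17", 0, 120),
    SpanRecord("t1", "b", "a", "user-17", 10, 90),
    SpanRecord("t1", "c", "a", "user-42", 15, 60),  # belongs to another request
]
problems = trace_problems(collided)
```

Any non-empty result for a sampled trace is worth an alert, because it means the trace you would have trusted during an incident is stitched together from more than one request.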

My MCP Observability Checklist (What I Actually Use Now)

After all these scars, I've put together a checklist. It's mandatory for my team now. Sharing it here for reference:

1. Tool-Level Base Metrics

Call count, success rate, latency percentiles (P50/P95/P99)
In-flight request count—this matters more than success rate. Catches backlogs early
Return data size distribution (token count via tiktoken, tracked at P50/P95/P99)
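For the percentile columns, the standard library is enough to prototype against raw samples before wiring up real histograms. The sample data below is synthetic and only illustrates the call.

```python
import statistics


def latency_percentiles(samples_ms):
    """P50/P95/P99 from raw latency samples (needs at least two samples)."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


samples = [float(ms) for ms in range(1, 101)]  # 1..100 ms, synthetic data
summary = latency_percentiles(samples)
```

The same function works for return-size distributions if you pass token counts instead of milliseconds.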

2. Trace-Level Tracking

End-to-end traces from user request to final response, covering LLM reasoning and all tool calls
Tool call sequence visualization—which tool the model called first, which came next. The order itself has diagnostic value
Retry and fallback paths get their own spans. Don't overwrite the original
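"Don't overwrite the original" is the part people skip, so here is a sketch of a retry wrapper that appends one record per attempt. The record shape and the function name are illustrative; in a real tracer each record would be its own span.

```python
import time


def call_with_retries(fn, attempts=3, record=None):
    """Call fn up to `attempts` times, appending one record per attempt so a
    retry never overwrites the attempt that failed before it."""
    record = [] if record is None else record
    last_error = None
    for attempt in range(1, attempts + 1):
        started = time.monotonic()
        try:
            result = fn()
            record.append({"attempt": attempt, "ok": True,
                           "duration_s": time.monotonic() - started})
            return result, record
        except Exception as exc:
            last_error = exc
            record.append({"attempt": attempt, "ok": False,
                           "error": type(exc).__name__,
                           "duration_s": time.monotonic() - started})
    raise RuntimeError(f"all {attempts} attempts failed") from last_error


calls = {"n": 0}


def flaky():
    # Fails twice, then succeeds: a stand-in for a slow upstream recovering.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("upstream slow")
    return "ok"


result, attempts_log = call_with_retries(flaky)
```

The caller sees a success, but the log still shows two timeouts before it, which is exactly the signal that a single overwritten span would have hidden.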

3. Business-Level Quality Metrics

Tool selection accuracy (evaluated via user feedback or human annotation)
Tool call "idle ratio"—how often the model calls a tool but doesn't use the result
Partial failure impact—when some tools fail, how much does final answer accuracy drop?
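The idle ratio is simple arithmetic once you have a per-call "was this result used" flag. Where that flag comes from (human annotation, a downstream heuristic) is the hard part and is assumed here; the dictionary shape is hypothetical.

```python
def idle_ratio(tool_calls):
    """Share of tool calls whose result was not used in the final answer.
    Each call is a dict with a boolean 'used' flag supplied by annotation."""
    if not tool_calls:
        return 0.0
    idle = sum(1 for call in tool_calls if not call["used"])
    return idle / len(tool_calls)


calls_in_conversation = [
    {"tool": "search", "used": True},
    {"tool": "cache_lookup", "used": False},
    {"tool": "order_status", "used": True},
    {"tool": "logistics", "used": True},
]
ratio = idle_ratio(calls_in_conversation)
```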

4. Cost Metrics

Tool call count distribution per conversation
Token consumption per conversation (input/output tracked separately)
Tool return data "waste ratio"—tokens returned but not used in the final response
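A per-conversation cost record can hold all of these in one place. This is a sketch with made-up field names; in particular, `used_tokens` assumes you have some way to estimate how much of the returned data ended up in the answer.

```python
from dataclasses import dataclass


@dataclass
class ConversationCost:
    input_tokens: int = 0      # prompt plus tool returns fed to the model
    output_tokens: int = 0     # tokens the model generated
    tool_calls: int = 0
    returned_tokens: int = 0   # tokens returned by tools
    used_tokens: int = 0       # of those, tokens judged to be used (assumed input)

    def waste_ratio(self):
        """Fraction of tool-returned tokens that never made it into the answer."""
        if self.returned_tokens == 0:
            return 0.0
        return 1.0 - self.used_tokens / self.returned_tokens


cost = ConversationCost(input_tokens=6000, output_tokens=400, tool_calls=4,
                        returned_tokens=4800, used_tokens=1200)
waste = cost.waste_ratio()
```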

This list might look long, but it boils down to one thing: treat MCP toolchains as semi-autonomous systems, not regular microservices. Model choices, tool combinations, return data quality—these are what determine user experience. CPU and memory?

Honestly, those barely matter.

Key Takeaways

1. Monitor in-flight, not just completed. The biggest lie your dashboard tells you is that everything's fine because successful requests look good. What about the ones still waiting?

2. Watch what tools return, not just whether they return. Data size drives cost and quality. A "successful" tool call returning 15KB instead of 3KB can blow up your bill and confuse your model.

3. Never trust trace IDs blindly. Validate uniqueness. Log business IDs alongside trace IDs. Random-sample your traces for sanity. The moment you assume IDs are always unique is the moment they won't be.

4. Start with business and cost metrics, not infrastructure. CPU and memory are the least interesting signals in an MCP system. Model behavior, tool selection patterns, and return data quality are where the real problems hide.

Lately I've been doing architecture consulting for a few companies building MCP toolchains, and there's a pattern I keep seeing: everyone's focused on writing MCP Servers and making tool calling smoother, but almost nobody plans observability upfront. They start adding monitoring after incidents, and even then, they only add infrastructure metrics—completely ignoring model behavior and tool interaction quality.

If you're building an MCP toolchain, I'd suggest figuring out items 3 and 4 from the checklist above before you write a single line of code. Business metrics and cost metrics are the hardest to retrofit. They need to accumulate data from real user interactions, and the earlier you start, the more patterns you'll uncover.

How far along is your MCP monitoring? Have you run into "looks normal but the answer is wrong" situations? Drop a comment—I'm genuinely curious how other teams handle the partial failure problem.

#MCP #observability #LLMOps #distributedtracing #AIOps #engineering

Cael Lee
