We Took an MCP Agent to Production—Here's What the Hype Doesn't Tell You
I've been lurking here for years but finally feel like I have something worth sharing that isn't just another "look at this cool thing I built in a weekend" post. Because honestly? The weekend projects are easy. Making MCP work when your job depends on it is a whole different beast.
About six months ago—actually, wait, I just checked my commit history and it was late October 2024, so closer to eight months now—my team got tapped to build an internal agent that could handle cross-system workflows. Think "update the CRM, file a Jira ticket, and ping the right Slack channel when a customer escalates." Classic enterprise glue work. We'd been watching the MCP hype with interest because the alternative was maintaining fifteen different API integrations with bespoke auth logic and praying nothing broke when a vendor updated their SDK.
Spoiler: MCP helped. But not in the way the shiny demos suggest.
The Gap Nobody Talks About
Every MCP tutorial follows the same arc: install the SDK, write a server that exposes getweather() or searchdocs(), connect Claude Desktop, and bask in the glory of an LLM that can now tell you it's raining. This is the "toy demo" phase, and it's genuinely useful for understanding the protocol shape. But it teaches you approximately nothing about running this in production.
The moment you cross into "this needs to work when I'm on vacation" territory, you hit a wall. Questions the quickstart guides conveniently ignore:
- What happens when an MCP server times out mid-tool-call and the agent has already mutated state in another system?
- How do you version MCP servers when the agent's prompt expects a specific tool signature?
- Who audits what the agent actually did across five different APIs at 3 AM?
We learned these the hard way. Usually via PagerDuty at terrible hours. I think it was the third 2 AM alert that made our CTO ask if maybe we should slow down and think through the failure modes properly. He wasn't wrong.
Thing We Learned #1: The Transport Layer Is Everything (and stdio Is a Trap)
The quickstart examples love stdio transport. It's dead simple to demo. One process, one connection, no network headaches.
We made the mistake of building our first "real" version the same way, with the MCP client spawning server subprocesses inside our agent runtime.
This falls apart spectacularly when:
- Your agent process crashes and orphans half a dozen MCP servers still holding database connections. We found 14 orphaned Postgres connections one morning. Fun times.
- You need to scale agent instances horizontally and suddenly have 47 copies of the same MCP server fighting over rate limits. Salesforce was not amused.
- You want to update an MCP server without restarting every agent that depends on it. Which is... always.
We switched everything to HTTP+SSE transport with standalone MCP server deployments behind a lightweight gateway. Each server gets its own health checks, its own scaling policy, and crucially, its own deploy pipeline decoupled from the agent. It's more infrastructure to manage, but the alternative was restarting agents at 2 AM because some intern pushed a bad update to the Salesforce connector. That actually happened. Intern's name was Kevin. Kevin learned a lot that night.
I've seen people on r/selfhosted arguing about stdio vs HTTP for MCP and honestly, if you're building anything that matters, just go HTTP from day one. The overhead is minimal. The operational benefits are massive. I'll die on this hill.
Thing We Learned #2: Tool Contracts Need Versioning (and Your Agent Prompt Will Break If They Don't)
This one bit us hard.
MCP servers advertise their tools with JSON schemas. Your agent prompt describes what those tools do and when to use them. If you update a server and change a parameter name from customeremail to emailaddress, the schema changes, but your prompt still references the old name. The LLM gets confused. Hallucinates parameters. Silently fails in ways that are maddening to debug.
We spent three days chasing a bug where the agent kept sending customeremail to a server that expected emailaddress. The LLM would just... make up a value? Sometimes it'd grab something from context that looked email-adjacent. Absolutely cursed.
We ended up treating MCP tool schemas like API contracts. Every server has a /tools endpoint that our CI pipeline fetches during agent builds. We diff the schemas against the previous version, and if there's a breaking change, the build fails until someone updates the corresponding prompt template. It's basically protobuf-style schema evolution but for LLM tool descriptions.
Not glamorous. Absolutely necessary.
Well... actually, I should clarify that we're not doing full semantic versioning on the schemas yet. We just hash the JSON and compare. Crude but effective. If anyone has a better approach I'm all ears.
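For the curious, here is a minimal sketch of the hash-and-compare idea. It is not our actual CI code. It assumes each tool is a dict with a "name" and an "inputSchema" (the shape MCP tool listings use), and the function names and the lookup_customer tool are made up for illustration.

```python
import hashlib
import json


def schema_fingerprint(tools: list[dict]) -> str:
    """Hash a tool list so that key order and tool order do not matter."""
    ordered = sorted(tools, key=lambda t: t["name"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_tools(old: list[dict], new: list[dict]) -> list[str]:
    """Return the names of tools that were added, removed, or modified."""
    old_by_name = {t["name"]: schema_fingerprint([t]) for t in old}
    new_by_name = {t["name"]: schema_fingerprint([t]) for t in new}
    names = old_by_name.keys() | new_by_name.keys()
    return sorted(n for n in names if old_by_name.get(n) != new_by_name.get(n))


def _tool(param: str) -> dict:
    # Hypothetical tool definition, only here to exercise the functions.
    return {
        "name": "lookup_customer",
        "inputSchema": {"type": "object", "properties": {param: {"type": "string"}}},
    }


previous = [_tool("customeremail")]
current = [_tool("emailaddress")]
drift = changed_tools(previous, current)
if drift:
    print(f"Tool schema changed, update the prompt templates for: {drift}")
```

In a CI job you would exit non-zero instead of printing. Hashing per tool rather than hashing the whole file at least tells you which tool drifted, though it still cannot tell a breaking change from a harmless one.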
Thing We Learned #3: Observability Is Embarrassingly Bad Out of the Box
MCP gives you request/response logging if you squint at the debug output. That's it.
In production you need to answer questions like:
- Which tool call in a 12-step agent workflow caused the final failure?
- How long did each tool call take, and is the CRM connector getting slower over time?
- What was the actual payload sent to the billing system when the agent "helpfully" applied a discount?
We ended up wrapping every MCP client call with OpenTelemetry spans and shipping the tool call inputs/outputs to a dedicated audit store. Each agent run gets a trace_id that ties together every MCP interaction. This saved us when accounting asked for proof that the agent hadn't done something insane during a billing reconciliation—we could actually show them the exact tool calls with timestamps. The auditor's face when we pulled up a trace showing exactly which discount got applied and when was... something. I think she was impressed? Hard to tell with auditors.
If you're building MCP agents right now, instrument this stuff before you need it.
You will need it.
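To make the shape of this concrete, here is a rough sketch of the wrapper pattern using only the standard library. It is a stand-in, not OpenTelemetry: a real setup would open a span where this builds a dict. The AuditedRun class, the sink callback, and the apply_discount tool are all hypothetical names for illustration.

```python
import json
import time
import uuid
from typing import Any, Callable


class AuditedRun:
    """Records every tool call in one agent run under a shared trace_id."""

    def __init__(self, sink: Callable[[dict], None]):
        self.trace_id = uuid.uuid4().hex
        self._sink = sink  # e.g. a function that writes to the audit store
        self._step = 0

    def call(self, tool: str, arguments: dict,
             invoke: Callable[[str, dict], Any]) -> Any:
        """Invoke a tool and emit one audit record, whether it succeeds or not."""
        self._step += 1
        record = {
            "trace_id": self.trace_id,
            "step": self._step,
            "tool": tool,
            "arguments": arguments,
            "started_at": time.time(),
        }
        start = time.perf_counter()
        try:
            result = invoke(tool, arguments)
            record["status"] = "ok"
            record["result"] = result
            return result
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on both paths, so failed calls are timed and logged too.
            record["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
            self._sink(record)


records: list[dict] = []
run = AuditedRun(records.append)
# The lambda stands in for whatever actually sends the call to an MCP server.
run.call("apply_discount", {"invoice_id": "inv-1", "percent": 10},
         lambda tool, args: {"applied": True})
print(json.dumps(records, indent=2))
```

The step counter is what answers "which call in a 12-step workflow failed", and the per-call duration is what lets you chart a connector getting slower over time.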
The Architecture We Landed On (So You Can Tell Me What We Did Wrong)
After much trial and error, our setup looks like this:
- MCP servers are standalone FastAPI services (Python 3.12, because our team is lazy and it works) deployed as containers on our k8s cluster. Nothing exotic.
- An MCP gateway—basically Envoy 1.29 with some custom Lua—handles routing, auth token injection, and rate limiting. We wrote maybe 200 lines of Lua total.
- The agent runtime is a Python process that loads prompt templates from a git repo, connects to the gateway, and executes tool calls within traced spans. We're using the official mcp Python package, version 0.9.1 as of last week.
- Tool schemas are versioned in that same git repo, with CI checks for breaking changes. GitHub Actions, nothing fancy.
- Everything ships structured logs and traces to Grafana + Tempo. We were already using those for other stuff so it wasn't a lift.
Is this over-engineered? Probably. Some of you are going to tell me we should have just used LangChain or whatever. We tried. God, we tried. LangChain v0.3 specifically. The abstractions leaked constantly and debugging felt like archaeology. Rolling our own thin orchestration layer on top of raw MCP clients was less code and infinitely more debuggable. Like, 400 lines of Python vs fighting LangChain's callback system for a week.
The Part Where I Admit We're Still Figuring This Out
We haven't solved auth elegantly. Right now MCP servers accept opaque bearer tokens that the gateway injects, but the token represents the user the agent is acting on behalf of, and some downstream APIs need that context for audit trails. Passing user identity through MCP tool calls without it becoming a security nightmare is still an open question for us.
We also haven't found a good pattern for long-running tool calls. If the agent kicks off a 20-minute report generation, the MCP connection needs to stay alive or support async callbacks, and the protocol spec is vague on this. We've hacked around it with polling endpoints but it feels gross. I keep meaning to write up a proposal for the spec repo but haven't had time. Maybe after this sprint.
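The polling hack looks roughly like the sketch below. This is an illustration under assumptions, not our real code: get_status is a hypothetical callable that hits a status endpoint and returns a dict with a "state" key, and the state names are invented.

```python
import time
from typing import Callable


def wait_for_job(get_status: Callable[[], dict], timeout_s: float,
                 interval_s: float = 1.0, max_interval_s: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> dict:
    """Poll until the job reaches a terminal state or the deadline passes."""
    deadline = clock() + timeout_s
    delay = interval_s
    while True:
        status = get_status()
        if status.get("state") in ("succeeded", "failed"):
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"job still {status.get('state')!r} after {timeout_s}s")
        # Never sleep past the deadline; back off so slow jobs are polled less often.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_interval_s)


# Simulated job that finishes on the third poll (no real waiting in this demo).
_states = iter([{"state": "running"}, {"state": "running"},
                {"state": "succeeded", "report": "q3.pdf"}])
print(wait_for_job(lambda: next(_states), timeout_s=60, sleep=lambda s: None))
```

The sleep and clock parameters exist only so the loop can be tested without waiting. The agent-side tool call returns a job id immediately, and a second tool does the waiting.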
Oh, and one more thing that's been bugging me: the MCP spec says servers "SHOULD" support cancellation but doesn't really define what that looks like in practice. We had an agent spin up a BigQuery job that kept running for 45 minutes even though the user had already gotten their answer and moved on, because nothing ever told the job to stop. $14 of compute down the drain. Not a huge deal once, but multiply that by hundreds of agent runs and suddenly finance is asking pointed questions.
TL;DR
- Demos make MCP look trivial. Production makes you deal with transport reliability, tool schema versioning, and observability that the protocol doesn't give you for free.
- Use HTTP transport, not stdio. Unless you enjoy restarting agents constantly.
- Treat tool schemas like API contracts with versioning and CI checks.
- Instrument everything with tracing before you need to debug a production failure.
- The MCP spec is solid but incomplete for real-world patterns like async operations and user identity propagation.
What's your team doing for MCP auth? I've seen the OAuth draft in the spec repo but it feels heavy for internal agents. Genuinely curious if anyone has a simpler pattern that doesn't involve service account sprawl. Roast our architecture in the comments—I'm here for it.
Edit: A few people DMed asking about the gateway setup. It's literally just Envoy with a Lua filter that reads tool schemas from a configmap and enforces rate limits per tool per client. Nothing fancy. Happy to share the config if there's interest.
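This is not the Envoy config, but the "rate limit per tool per client" logic is easier to show in Python than in Lua. Below is a token-bucket sketch of the idea; the class name, parameters, and the example client and tool names are all made up for illustration, and the real filter may well work differently.

```python
import time
from typing import Callable


class PerToolRateLimiter:
    """Token bucket keyed by (client_id, tool): each pair gets its own budget."""

    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_s
        self.burst = burst
        self._clock = clock
        # (client_id, tool) -> (tokens left, time of last update)
        self._buckets: dict[tuple[str, str], tuple[float, float]] = {}

    def allow(self, client_id: str, tool: str) -> bool:
        """Return True and spend a token if this call is within budget."""
        now = self._clock()
        tokens, last = self._buckets.get((client_id, tool), (float(self.burst), now))
        # Refill based on elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[(client_id, tool)] = (tokens, now)
        return allowed


limiter = PerToolRateLimiter(rate_per_s=5.0, burst=10)
if not limiter.allow("agent-7", "crm.update_contact"):
    print("429: slow down")
```

Keying on the pair rather than on the client alone means one chatty tool cannot starve the others, which matters when a single vendor API has a much tighter limit than the rest.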
Edit 2: Thanks for the gold, kind stranger. First time getting one of those. Glad this resonated.
Edit 3: Since people keep asking—yes, we looked at Anthropic's reference implementation. It's fine for getting started but doesn't handle any of the operational stuff I mentioned. Don't build your production infra on it unless you hate sleep.
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.