The Night I Killed Our API Gateway With 47 Duplicate Cache Keys (And What I Learned)

By Cael Lee | 12 min read

It was 3 AM. My phone wouldn't stop buzzing. The API gateway was down, the database connection pool was a smoking crater, and our customer support team had stopped answering their phones entirely—which, honestly, I don't blame them for. The culprit? Cache keys. Those innocent-looking strings we'd been slapping together with zero thought for three years.

That night, I made myself a promise: never again would I treat cache key design as an afterthought. It's not just string concatenation—it's one of those sneaky architectural decisions that looks trivial until it explodes in your face at the worst possible moment.

Here's what I've learned about cache key design and invalidation in large-scale API systems. This isn't textbook stuff—it's the hard-won knowledge that comes from restarting services at 3 AM while your boss is texting you "status update?" every four minutes.

## Cache Keys Aren't Strings—They're Architecture Diagrams

Most developers treat cache keys as unique identifiers. Just mash together the user ID, endpoint name, and parameters, right?

Wrong.

In interviews, I'd say 8 out of 10 candidates take this approach. It's the default mindset. And in a low-traffic app, it works fine. But when you're dealing with tens of thousands of requests per second across hundreds of microservices all reading and writing to the same cache, "just mash it together" turns into a disaster.

Here's a real data point from our team's pre-Black Friday optimization in 2024. We were tuning a product detail API and discovered that the same SPU data was stored 47 different times in Redis. Forty-seven. The culprit? Clients sending parameters in different orders, with different casing, and sometimes with stray whitespace characters.

Cache hit rate: 31%. Redis memory consumption: nearly 40 GB. We were running Redis 7.2.4 in cluster mode with 8 nodes at 16 GB each, and this single API was eating up roughly two and a half nodes' worth of capacity.

The moment it clicked: I had the team dig through logs to figure out why cached data was "there but not usable." What we found was almost comical. Frontend developers were sending color=red&size=XL, while backend developers testing the same endpoint were using size=XL&color=red. Different cache keys, identical data.

And that wasn't even the worst of it. One version of the client was passing boolean values as true and True interchangeably, effectively doubling our cache keys. I remember staring at these two log lines at 2:47 AM:


2024-11-03 02:47:23.421 ERROR CacheKeyMismatch: key=product:spu:882391:color=red&size=XL
2024-11-03 02:47:23.438 ERROR CacheKeyMismatch: key=product:spu:882391:size=XL&color=red

Our DevOps engineer, looking over my shoulder, let out a very creative string of profanity. I couldn't agree more.

What we did about it: We established an ironclad rule—every cache key must pass through a normalization layer. No exceptions. Here's the approach:
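A minimal sketch of what such a normalization layer can look like, in Python. The function name `normalize_cache_key`, the `prefix:id:digest` layout, and the exact folding rules (lowercased parameter names, trimmed whitespace, lowercased booleans, sorted pairs) are illustrative assumptions rather than the exact production rules:

```python
import hashlib
from urllib.parse import parse_qsl


def normalize_cache_key(prefix: str, resource_id: int, query: str) -> str:
    """Build one canonical cache key for every equivalent query string.

    Hypothetical helper: the folding rules below are assumptions chosen to
    cover the three problems described above (order, casing, whitespace).
    """
    pairs = []
    for name, value in parse_qsl(query, keep_blank_values=True):
        name = name.strip().lower()   # parameter names are case-insensitive here
        value = value.strip()         # drop stray whitespace around values
        # Fold boolean spellings such as "True" / "TRUE" into one form.
        if value.lower() in ("true", "false"):
            value = value.lower()
        pairs.append((name, value))

    pairs.sort()  # parameter order must not change the key
    canonical = "&".join(f"{name}={value}" for name, value in pairs)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}:{resource_id}:{digest}"
```

With this, `color=red&size=XL` and `size=XL&color=red` map to the same key. Hashing the canonical string keeps keys a fixed length no matter how many parameters a client sends.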

Wait—I need to correct myself. We initially used MD5, but switched to SHA-256 after discovering that with our traffic volume, MD5 collision probability was higher than we'd assumed. In production, we were averaging 300 million cache writes per day, and MD5 produced two collisions. Two. That's a statistically tiny probability, but debugging those collisions was an absolute nightmare. Days of "why is this returning the wrong data?" hair-pulling.

After the normalization overhaul: cache hit rate jumped from 31% to 89%. Redis memory usage dropped by 60%. The ROI on this refactor was orders of magnitude better than throwing more hardware at the problem. I still think about what would've happened if we'd discovered this on Black Friday itself.

## Cache Key Granularity: The Eternal Debate

I've had more arguments about cache granularity than I care to count. The "coarse-grained" camp says "one cache entry per endpoint—simple and done." The "fine-grained" camp says "cache per parameter combination—higher hit rates." Both are right. Both are wrong. It depends on your data dependencies.

This gets complicated, but I'll try to make it concrete.

**Case 1: Coarse-Grained Caching by User**

We had a user info API that was called constantly—pretty much every other endpoint needed basic user data. Our initial approach was userinfo:{userid} with a 30-minute TTL. It looked sensible on paper.

Then we noticed something weird. When users updated their nicknames, some pages showed the new name immediately, while others took 30 minutes to update.

The problem? Cache invalidation wasn't aligned with data dependencies. When a user updated their profile, we only invalidated userinfo:{userid}. But user info was also embedded inside other cached objects—like articledetail:{articleid}, which included the author's cached nickname. Those composite caches weren't getting invalidated.

This is one of those things that seems obvious in retrospect but is surprisingly easy to miss during design.

The fix: We introduced the concept of a cache dependency graph. Every cache key, at creation time, registers what base data it depends on. When that base data changes, a message goes out via message queue to all dependent services telling them to invalidate their caches.

We used RocketMQ 5.1.3 with broadcast messages, tagging by data dimension. For example, a user info change sends a message with tag userupdate:{userid}. Every service that has cached anything containing that user's data receives the message and clears both its local cache and Redis entries.

It sounds complex, but it was only a few hundred lines of code. Worth every line.
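To make the idea concrete, here is a rough in-process sketch of a dependency registry. The class name `CacheDependencyGraph`, the `user:42` style base-data identifiers, and the plain dict standing in for the cache are all assumptions for illustration; the message queue hop is left out entirely:

```python
from collections import defaultdict


class CacheDependencyGraph:
    """Maps a piece of base data to every cache key built from it."""

    def __init__(self):
        self._dependents = defaultdict(set)

    def register(self, cache_key, depends_on):
        # Called when a cache entry is created.
        for base in depends_on:
            self._dependents[base].add(cache_key)

    def keys_to_invalidate(self, base):
        # Entries re-register themselves when they are rebuilt, so the
        # registration is dropped together with the cached value.
        return self._dependents.pop(base, set())


def on_base_data_changed(graph, cache, base):
    """Handler for a change notification (hypothetical; `cache` is a dict)."""
    for key in graph.keys_to_invalidate(base):
        cache.pop(key, None)  # deleting a missing key is harmless


graph = CacheDependencyGraph()
cache = {"userinfo:42": "...", "articledetail:7": "...", "articledetail:8": "..."}
graph.register("userinfo:42", ["user:42"])
graph.register("articledetail:7", ["user:42", "article:7"])
graph.register("articledetail:8", ["user:99", "article:8"])

on_base_data_changed(graph, cache, "user:42")
# Only the entry that does not depend on user 42 is left.
```

The point of the sketch is the registration step: the article detail entry declares its dependency on the author at write time, so a nickname change can find it later.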

**Case 2: Fine-Grained Caching for Product Search**

The other extreme was our product search API, which had dozens of parameter combinations: keywords, categories, price ranges, sort orders, pagination offsets... If we cached every unique combination, the key space would explode. We estimated 2 million daily active users at 10 searches each, with 20% of those searches producing a unique parameter combination. That is 4 million new keys a day, or roughly 120 million within a month. Our Redis cluster would have drowned in keys.

Our solution was a two-tier caching strategy:

For hot pattern analysis, we used Elasticsearch log aggregation with a nightly offline job to calculate the previous day's top query patterns. The results after the first week? Cache hit rate went from 41% to 78%. Total key count stayed under 18,000 (compared to the millions we'd projected). P99 latency dropped from 200ms to 12ms.
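The hot-pattern idea can be sketched as follows. This assumes, as one plausible reading rather than a description of the actual system, that only query patterns on the nightly "hot" list get cached and everything else goes straight to the search backend. `top_patterns`, `search`, and `run_query` are hypothetical names, and a plain dict stands in for Redis:

```python
from collections import Counter


def top_patterns(query_log, limit):
    """Offline step: pick the most frequent normalized query patterns."""
    counts = Counter(query_log)
    return {pattern for pattern, _ in counts.most_common(limit)}


def search(pattern, hot_patterns, cache, run_query):
    """Online step: cache hot patterns only (assumed policy)."""
    if pattern not in hot_patterns:
        return run_query(pattern)  # cold query: never creates a cache key
    if pattern not in cache:
        cache[pattern] = run_query(pattern)
    return cache[pattern]


# Yesterday's (already normalized) query patterns, as a toy log.
log = ["shoes|price_asc"] * 5 + ["phone|newest"] * 3 + ["rare thing|page=97"]
hot = top_patterns(log, limit=2)

backend_calls = []


def run_query(pattern):
    backend_calls.append(pattern)
    return f"results for {pattern}"


cache = {}
search("shoes|price_asc", hot, cache, run_query)
search("shoes|price_asc", hot, cache, run_query)        # served from cache
search("rare thing|page=97", hot, cache, run_query)     # bypasses the cache
search("rare thing|page=97", hot, cache, run_query)     # bypasses it again
```

Because the key set is bounded by `limit`, the total key count stays predictable no matter how many one-off queries arrive.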

I remember that night. After we deployed, I sat staring at the Grafana dashboard for a solid 30 minutes, making sure the curves had stabilized before I let myself sleep. They did. I slept.

## Cache Invalidation: Don't Let Your Cache Become a Museum of Stale Data

Cache invalidation is where things really go sideways. It's harder to get right than cache population, and the failure modes are nastier. I've cataloged three classic failure patterns. See how many you recognize.

**Failure Pattern 1: Update vs. Delete**

A lot of teams, when data changes, choose to update the cache rather than delete it. It feels more "efficient"—the cache is always warm, no need to wait for the next query to rebuild.

The problem? Concurrency. When two update operations race, they can execute in the wrong order, and your cache ends up with stale data permanently.

I learned this the hard way. During the 2025 Lunar New Year period (for readers outside Asia: a massive e-commerce event), two services updated the cached status of the same order almost simultaneously. Result: the cache showed "shipped" while the actual order was already "delivered." Customer support melted down. We got pulled into an emergency fix on, I kid you not, the third day of the new year.

The logs told the story:


2025-01-31 14:23:11.156 ServiceA: update order_status=shipped for order_id=2938471
2025-01-31 14:23:11.189 ServiceB: update order_status=delivered for order_id=2938471
2025-01-31 14:23:11.201 Redis: set order:2938471 status=shipped (ServiceA's older write landed last and overwrote ServiceB's newer one)

Classic write-after-write race condition.

The lesson: Unless you can guarantee strict ordering of update operations (with distributed locks or version numbers), deleting the cache is always safer than updating it. Let the next read rebuild the cache. Yes, it costs an extra database query, but data consistency is non-negotiable. We switched to a "delete cache → update DB → delete cache again" pattern, which is a variant of the Cache Aside pattern. From what I've seen, this is what most major tech companies do in practice.
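A minimal sketch of that delete, update, delete sequence. Dicts stand in for Redis and the database, the function names are hypothetical, and the short delay before the second delete is my assumption (it gives a racing reader time to finish writing its stale value back before we evict it):

```python
import threading


def update_with_double_delete(cache, db, key, value, delay_seconds=0.5):
    """Write path: delete cache -> update DB -> delete cache again."""
    cache.pop(key, None)   # 1. stop serving the old value
    db[key] = value        # 2. update the source of truth
    # 3. Delete once more, slightly later, in case a concurrent reader
    #    re-populated the cache with the old value between steps 1 and 2.
    timer = threading.Timer(delay_seconds, cache.pop, args=(key, None))
    timer.daemon = True
    timer.start()
    return timer


def read_through(cache, db, key):
    """Read path: a miss rebuilds the entry from the database."""
    if key not in cache:
        cache[key] = db[key]
    return cache[key]


cache = {"order:2938471": "shipped"}
db = {"order:2938471": "shipped"}

timer = update_with_double_delete(cache, db, "order:2938471", "delivered", delay_seconds=0.2)
cache["order:2938471"] = "shipped"   # simulate a racing reader writing stale data back
timer.join()                         # wait for the second delete

status = read_through(cache, db, "order:2938471")   # rebuilt from the database
```

The write path never puts a value into the cache, so two writers racing each other can only ever delete; the worst case is one extra database read.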

**Failure Pattern 2: The Cache Avalanche**

This one's a classic. In September 2024, we ran a Mid-Autumn promotion. All the caches for a hot event expired simultaneously, and within seconds, tens of thousands of requests per second slammed directly into our PostgreSQL database. The connection pool (PostgreSQL max_connections=500, PgBouncer pool_size=200) filled up instantly. Then every service depending on that database started timing out. Cascade failure.

That outage lasted 23 minutes. I will never forget that number.

The fix: Add random jitter to your TTLs. If your base TTL is 10 minutes, actually set it to somewhere between 10:00 and 12:00. This simple change spreads out the expiration times so they don't all land at once.
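In code, the jitter is a one-liner. The helper name `jittered_ttl` and its parameters are hypothetical; the numbers mirror the 10-to-12-minute example above:

```python
import random


def jittered_ttl(base_seconds: int, max_jitter_seconds: int, rng=random) -> int:
    """Return the base TTL plus a random number of extra seconds."""
    return base_seconds + rng.randint(0, max_jitter_seconds)


# Base TTL of 10 minutes, spread over the following 2 minutes.
rng = random.Random(2024)  # seeded only to make the example repeatable
ttls = [jittered_ttl(600, 120, rng) for _ in range(1000)]
```

Every value in `ttls` falls between 600 and 720 seconds, so keys written in the same instant expire across a two-minute window instead of all at once.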

For critical data, we went further with a "never-expire + async refresh" strategy. The cache has no TTL at all. Instead, a background job checks the data source every minute and refreshes the cache only if something changed. We scheduled this with XXL-Job 2.4.1 and monitored data changes using Canal 1.1.7 listening to MySQL binlog.

**Failure Pattern 3: Cache Penetration as an Attack Vector**

If someone deliberately queries for data that doesn't exist, there's nothing to cache, so every request punches through to the database. We caught a competitor doing exactly this—their script was iterating through product IDs, specifically targeting delisted products. From our Nginx logs, we found an IP block that requested roughly 800,000 non-existent product IDs between 3 AM and 5 AM.

Countermeasure: Cache null values for non-existent data with a short TTL (like 1 minute). Going further, use a Bloom filter as a fast "does this data even exist?" check in front of your cache. We used Redisson 3.23.5's built-in Bloom filter implementation, sized for 100 million entries with a 0.01% false positive rate. Cache penetration dropped by 99.7% after deployment. I remember that number specifically because I screenshotted it for my boss's presentation.
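For a sense of what that sizing implies, here is the textbook Bloom filter math applied to those two numbers. This is the standard formula, not a statement about how Redisson sizes its filter internally, and `bloom_parameters` is a hypothetical helper:

```python
import math


def bloom_parameters(expected_items: int, false_positive_rate: float):
    """Optimal bit count and hash count for a classic Bloom filter.

    bits   = -n * ln(p) / (ln 2)^2
    hashes = (bits / n) * ln 2
    """
    bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    hashes = max(1, round(bits / expected_items * math.log(2)))
    return bits, hashes


# 100 million entries at a 0.01% false positive rate.
bits, hashes = bloom_parameters(100_000_000, 0.0001)
mebibytes = bits / 8 / 1024 / 1024
```

Under these assumptions the filter needs about 1.9 billion bits (roughly 229 MiB) and 13 hash functions: a couple of hundred megabytes to rule out almost every lookup for an ID that was never there.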

## Advanced Tactics for Massive-Scale APIs

When your API call volume goes from thousands to millions per second, basic caching strategies need an upgrade. Here are two approaches we've battle-tested.

**Multi-Level Cache Architecture**

We added a local cache layer (Caffeine 3.1.8, configured with maximumSize=10000 and expireAfterWrite=15s) in front of our API gateway. Redis became the second-level cache, and the database was the last resort.

Local cache TTLs are deliberately short, but the hit rate still exceeded 70% because hot data gets hammered repeatedly and never leaves this layer.

The numbers: after adding local caching, API P99 latency dropped from 45ms to 8ms. Redis QPS fell from 120,000 to 30,000. But here's the catch—local cache invalidation consistency is tricky. We used Redis Pub/Sub to broadcast invalidation messages so all instances could clear their local caches.

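And there's a subtle gotcha here. Redis Pub/Sub is fire-and-forget, so an instance can miss an invalidation message. Your local cache TTL must therefore always be shorter than your Redis TTL: the short local TTL caps how long a missed invalidation can keep stale data alive.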

Actually—I said we used 15-second TTLs for Caffeine, but in production we later changed it to a random value between 10 and 20 seconds. Same reason as before: prevent thundering herds.
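The read path through the two levels can be sketched like this. It is a simplified stand-in, not Caffeine or a Redis client: `TwoLevelCache` is a hypothetical class, a dict plays the role of Redis, and `loader` plays the role of the database. The 10 to 20 second local TTL matches the randomized range mentioned above:

```python
import random
import time


class TwoLevelCache:
    """Local in-process cache in front of a shared remote cache."""

    def __init__(self, remote, loader, min_ttl=10.0, max_ttl=20.0, clock=time.monotonic):
        self._local = {}          # key -> (value, expires_at)
        self._remote = remote     # dict standing in for Redis
        self._loader = loader     # function standing in for the database
        self._min_ttl = min_ttl
        self._max_ttl = max_ttl
        self._clock = clock

    def get(self, key):
        now = self._clock()
        entry = self._local.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]                      # level 1 hit
        if key in self._remote:
            value = self._remote[key]            # level 2 hit
        else:
            value = self._loader(key)            # last resort
            self._remote[key] = value
        # Randomized local TTL so instances do not all expire together.
        ttl = random.uniform(self._min_ttl, self._max_ttl)
        self._local[key] = (value, now + ttl)
        return value

    def invalidate_local(self, key):
        """Call this from the Pub/Sub invalidation handler."""
        self._local.pop(key, None)
```

If an invalidation message is missed, the stale local entry still disappears once its 10 to 20 second TTL runs out, which is the safety net the short TTL provides.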

**Cache Preheating and Degradation**

For major sales events, we preload hot data into the cache before the event starts. We analyze 7 days of user behavior logs, predict which data will be hammered during the event, and run a preheating script 30 minutes before go-time to pump that data into Redis.

We also built a degradation switch using Sentinel 1.8.6 to monitor cache hit rates. If the hit rate drops below 80%, it automatically triggers rate limiting. Non-critical endpoints start returning fallback data—for example, the product listing API defaults to showing the top 100 bestsellers—keeping the core purchase flow alive.
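The trigger logic behind such a switch is simple enough to sketch. To be clear, this is not Sentinel's API: `HitRateMonitor` is a hypothetical class, and the sliding window of the last N lookups is my assumption about how the hit rate might be measured. Only the 80% threshold comes from the setup described above:

```python
from collections import deque


class HitRateMonitor:
    """Tracks cache hits over a sliding window of recent lookups."""

    def __init__(self, window=1000, threshold=0.8):
        self._samples = deque(maxlen=window)  # True = hit, False = miss
        self._threshold = threshold

    def record(self, hit: bool):
        self._samples.append(hit)

    def hit_rate(self) -> float:
        if not self._samples:
            return 1.0  # no data yet: assume healthy
        return sum(self._samples) / len(self._samples)

    def degraded(self) -> bool:
        return self.hit_rate() < self._threshold


def list_products(monitor, fetch_personalized, fetch_bestsellers):
    """Non-critical endpoint: fall back to static data while degraded."""
    if monitor.degraded():
        return fetch_bestsellers()
    return fetch_personalized()


monitor = HitRateMonitor(window=10, threshold=0.8)
for hit in [True] * 9 + [False]:
    monitor.record(hit)
healthy = list_products(monitor, lambda: "personalized", lambda: "top 100 bestsellers")

for _ in range(3):
    monitor.record(False)   # window is now 6 hits, 4 misses
fallback = list_products(monitor, lambda: "personalized", lambda: "top 100 bestsellers")
```

A real deployment would also want a minimum sample count and some hysteresis so the switch does not flap around the threshold.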

This setup held during Black Friday 2024's midnight traffic spike. Peak QPS was 17 times our daily average, but the system stayed stable. P99 never exceeded 200ms. My boss sent a $20 red packet to the group chat the next day. Generous, I know.

## Let's Talk War Stories

Cache key design and invalidation look like implementation details, but they're really a test of how deeply you understand your data flows and business logic. When I interview senior engineers, one of my go-to questions is: "If you had to design a caching system from scratch, how would you balance consistency and performance?" The people who answer well aren't just coders—they're systems thinkers.

Honestly, I arrived at most of these lessons the hard way. Like I said at the start: the feeling of restarting services at 3 AM is one I don't ever want to repeat.

What cache disasters have you survived? Key collisions? Invalidation storms? Weirder edge cases? Drop a comment—I'm genuinely curious if anyone has a more absurd story than 47 duplicate caches for the same data. And what's your stack? Pure Redis, Memcached, or mostly local caching? Share your setup.

Also, if you're prepping for system design interviews or planning a cache architecture upgrade, and you want me to share our internal technical design doc, let me know in the comments. I'll clean it up and post it.

#systemdesign #caching #backend #redis #performanceoptimization #apigateway

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
