How Roblox’s Cache Sustained 1.38B QPS Beyond Redis Limits

Navigating a 10x Traffic Surge With Reliability and Efficiency in Mind

In Roblox’s massive-scale multiplayer gaming environment, users read and write data at a volume and velocity that would crush standard caching patterns. A user might not notice a 200ms delay on a standard webpage that caches static data, but a 200ms delay in retrieving interaction data or server metadata on Roblox could stall gameplay and disrupt the simulated game world. Roblox operates on an ecosystem of unpredictable, high-velocity, ephemeral data that must be retrieved within milliseconds to preserve immersion.

Our caching layer is designed to act as a massive shock absorber between millions of concurrent users and our persistent storage. It handles a peak load of 1.38 billion queries per second (QPS), with single logical clusters topping 100 million QPS. In this post, we’ll pull back the curtain on how we architected a federated caching service to scale beyond Redis® node limits for massive throughput, all while navigating an unprecedented 10x traffic surge on our largest cluster.

Scaling Beyond the Industry Limit to Meet Ever-Growing Demands

At Roblox, our caching service is built on top of Redis®, and a single Redis® cluster can theoretically support up to 1,000 nodes. However, the empirical, industry-accepted limit is much lower—around 400–500 nodes.1 This is because of Redis®’ decentralized Gossip protocol, which uses internode communication to share health, status, and shard mapping. In clusters larger than a few hundred nodes, the chatter from the Gossip protocol begins to consume substantial CPU and network resources: each node must process status updates from an ever-growing number of peers, stealing resources from the primary job of serving data at low latency. This control-plane overhead can lead to instability and unpredictable performance, which is why managed cloud providers enforce node limits. It’s a practical ceiling that cannot be breached without compromising reliability.

This practical single-cluster limit is not enough to support a platform running at the scale of Roblox, where our largest backend service demanded over 100 million QPS from one of our logical Redis® clusters at peak, up from 10 million just three months earlier.

To overcome this limitation, we adopted a federated architecture, positioning a client reverse proxy in front of multiple smaller, independent Redis® clusters. We cap each cluster at 400 nodes to ensure predictable performance.

From the perspective of our thousands of applications and services, the client reverse proxy presents these disparate clusters as a singular, unified, and scalable caching service. This federation allows us to scale horizontally almost without limit by simply adding more clusters to the federation, bypassing the Gossip protocol’s constraints entirely. Today, our largest single caching deployment consists of over 6,000 Redis® nodes, which are backed by over 15 independent Redis® clusters—a scale made possible only by this federated model.
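The routing idea behind this federation can be sketched in a few lines. This is an illustrative example, not Roblox’s actual proxy code: it assumes the client reverse proxy maps each key to one of several independent clusters with rendezvous (highest-random-weight) hashing, so thousands of services see a single logical keyspace.

```python
import hashlib

# Hypothetical sketch of a federation router: it presents several independent
# Redis clusters as one logical caching service. Cluster names and the hashing
# scheme are illustrative assumptions.
class FederatedRouter:
    def __init__(self, clusters):
        self.clusters = list(clusters)

    def pick_cluster(self, key):
        # Rendezvous hashing: every cluster gets a deterministic score for the
        # key, and the highest score wins. Adding a cluster to the federation
        # only moves the keys that now score highest on it, so rebalancing is
        # minimal and no gossip between clusters is required.
        def score(cluster):
            digest = hashlib.sha256(f"{cluster}:{key}".encode()).digest()
            return int.from_bytes(digest[:8], "big")
        return max(self.clusters, key=score)

router = FederatedRouter(["cache-cluster-a", "cache-cluster-b", "cache-cluster-c"])
target = router.pick_cluster("player:12345:session")
```

Because each cluster stays below the node limit, the federation scales horizontally by adding clusters to the router’s list rather than nodes to one cluster.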

The Three-Stage Migration Process

Moving from a traditional setup to a federated architecture while managing hundreds of active production clusters was a significant undertaking. At the end of 2024, many of these clusters were already nearing our 400-node limit, making a transition to a higher-capacity model urgent. The primary challenge was that standard Redis® instances operate in isolation and have no inherent awareness of other clusters, necessitating a custom method to replicate data during the move.

To solve this, we developed a transparent, three-stage migration pipeline designed to transition services without any customer involvement or application-side changes.

The Migration Workflow

  1. Dual-writing: We configured the client reverse proxy to perform dual writes, sending incoming data to both the original cluster and the new federated destination simultaneously.

  2. Parity and read switch: Because all cached data is bounded by a time to live (TTL), the dual-write process eventually populated the new cluster with data identical to the original. Once parity was reached, we transitioned read traffic from the legacy cluster to the new federated architecture.

  3. Decommissioning: After confirming the stability of the new cluster, we ceased all write operations to the original cluster and retired it.
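The three stages above can be sketched as a proxy-side state machine. This is a minimal illustration under assumptions of our own: plain dicts stand in for the real clusters, and the stage names and method signatures are invented for the sketch.

```python
# Minimal sketch of the three-stage migration as a state machine inside the
# client reverse proxy. In-memory dicts stand in for real Redis clusters;
# names and signatures are illustrative, not a real client API.
DUAL_WRITE, READ_SWITCHED, DECOMMISSIONED = "dual_write", "read_switched", "decommissioned"

class MigratingProxy:
    def __init__(self, legacy, federated):
        self.legacy = legacy          # original single cluster
        self.federated = federated    # new federated destination
        self.stage = DUAL_WRITE

    def set(self, key, value):
        # Stage 1: write to both sides so TTL turnover converges the data.
        if self.stage != DECOMMISSIONED:
            self.legacy[key] = value
        self.federated[key] = value

    def get(self, key):
        # Stage 2: once parity is reached, reads flip to the federated side.
        source = self.legacy if self.stage == DUAL_WRITE else self.federated
        return source.get(key)

    def advance(self, stage):
        self.stage = stage

legacy, federated = {}, {}
proxy = MigratingProxy(legacy, federated)
proxy.set("session:1", "alive")   # dual-written to both clusters
proxy.advance(READ_SWITCHED)      # reads now served by the federation
proxy.advance(DECOMMISSIONED)     # stage 3: legacy cluster no longer written
proxy.set("session:2", "alive")
```

Because the proxy owns all reads and writes, applications never see the stage transitions, which is what made the migration transparent to customers.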

We successfully moved our most critical services into a federated model that supports over 6,000 nodes, far exceeding the limitations of a single Redis® cluster.

Handling Grow a Garden’s Explosive Growth

Starting in April 2025, we faced a capacity challenge of a magnitude few services ever encounter. The catalyst was the phenomenal success of Grow a Garden. Within just a few weeks of launch, Grow a Garden began to break not only Roblox records but all-time gaming industry records for concurrent users (CCU). In just three months, the game’s audience surged past our previous single-game CCU record of 2.8 million players to a peak of over 21 million. This explosive growth sent 10x more traffic to our largest Redis® cluster and 3x more traffic to overall caching.

Roblox runs our caching systems on physical machines in our own data centers, and this unprecedented growth was not part of our long-term capacity plan. To survive this tidal wave and keep the platform stable, we had to radically improve the efficiency of our existing fleet while prioritizing reliability.

Improving Efficiency During the Surge

First, we needed to identify the most impactful bottlenecks in our system so we could prioritize optimization work. Roblox Caching runs on two different hardware pools: one for Redis® backend nodes and one for our Envoy-based client reverse proxy. Our Redis® fleet is primarily memory-bound, while the client proxy is compute-bound.

For our memory-bound Redis® fleet, we implemented a new scheduling policy to better balance the memory utilization across our physical machine groups. We also fine-tuned the container size to reduce memory fragmentation.

To improve CPU utilization, we implemented two key optimizations to our cache system:

  1. Client proxy autoscaling: We adjusted the autoscaling policy for our compute-bound client proxy fleet. By lowering the allowed deviation from the target CPU, we achieved a higher average CPU utilization for jobs without changing the scale-up threshold.

  2. Internal cluster health checks: We optimized internal health checks to eliminate unnecessary heartbeats, preserving the original detection interval while saving compute resources.
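The autoscaling tuning in step 1 can be sketched numerically. This is an illustrative model, not our actual autoscaler: the target utilization, thresholds, and scaling formula are assumed values chosen to show the effect of narrowing the downward deviation band.

```python
# Illustrative sketch of the autoscaling tuning: the scale-up threshold stays
# fixed, but narrowing the allowed downward deviation from the CPU target lets
# over-provisioned jobs shrink, raising average utilization. All numbers here
# are assumptions for the sketch, not Roblox's real policy.
def desired_replicas(current_replicas, cpu_util, target=0.60, down_deviation=0.05):
    scale_up_threshold = target + 0.10          # unchanged by the tuning
    scale_down_threshold = target - down_deviation
    if cpu_util > scale_up_threshold:
        # Proportional scale-up toward the target utilization.
        return max(current_replicas + 1,
                   round(current_replicas * cpu_util / target))
    if cpu_util < scale_down_threshold:
        # Proportional scale-down; the narrower the band, the sooner this fires.
        return max(1, round(current_replicas * cpu_util / target))
    return current_replicas

# With a wide band (down_deviation=0.20), a fleet idling at 45% CPU is left
# alone; with a narrow band (down_deviation=0.05), it scales down and the
# remaining replicas run closer to the 60% target.
wide = desired_replicas(100, 0.45, down_deviation=0.20)
narrow = desired_replicas(100, 0.45, down_deviation=0.05)
```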

These changes resulted in a nearly 10% improvement in CPU utilization. Finally, we colocated these two vastly different workloads into the same machine pool with cgroup container-level isolation, which reduced our overall capacity needs by 25%.

Engineering for Unwavering Reliability

Reliability remained our top priority, even as we raced to meet record-breaking demand.

End-to-End Observability

We first focused on understanding potential failures by heavily investing in our observability stack for comprehensive, end-to-end health monitoring of our caching system. By tracking server success and client-side failures across the entire request lifecycle, we quickly detected and diagnosed issues, often preempting user impact. For example, we added hot key detection, which samples live traffic to identify data that is excessively accessed and strains the caching layer. When the system flags a hot key, we can proactively throttle traffic to these keys on both the server and client side to prevent system overload.
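Hot-key detection of the kind described above can be approximated by counting a sample of live traffic and flagging keys whose sampled share crosses a threshold. The sketch below is a simplified stand-in; the sampling rate, window size, and hot-fraction threshold are assumed values, not our production settings.

```python
import random
from collections import Counter

# Illustrative sketch of hot-key detection by traffic sampling: count a small
# random fraction of requests, and when a detection window closes, flag any
# key whose sampled share exceeds a threshold. All parameters are assumptions.
class HotKeyDetector:
    def __init__(self, sample_rate=0.01, window=10_000, hot_fraction=0.10):
        self.sample_rate = sample_rate
        self.window = window            # sampled observations per window
        self.hot_fraction = hot_fraction
        self.counts = Counter()
        self.sampled = 0

    def observe(self, key):
        """Sample one request; return the set of hot keys when a window closes."""
        if random.random() >= self.sample_rate:
            return None                 # not sampled: near-zero overhead
        self.counts[key] += 1
        self.sampled += 1
        if self.sampled < self.window:
            return None
        hot = {k for k, c in self.counts.items()
               if c / self.sampled >= self.hot_fraction}
        self.counts.clear()
        self.sampled = 0
        return hot
```

Once a key is flagged, throttling can be applied on both the server and client side, which is the mitigation described above.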

Failure Handling

In any distributed system of this scale, partial failures sometimes happen. We’ve engineered our system to handle these failures gracefully and minimize their impact.

  • Partial failures in batched requests: We rearchitected the system to handle partial failures in batched requests. Previously, a single failed request in a sharded batch would fail the entire batch. Now the batch can succeed even if some keys fail retrieval.

  • Mitigating head-of-line blocking: Caching systems are expected to be fast. Even a minor network slowdown can cause requests to pile up and time out, a phenomenon known as head-of-line blocking. We’ve implemented intelligent retry mechanisms and connection pooling strategies in both the client and proxy layer to prevent a single slow request from impacting others.

  • Fault and latency injection: To test resilience, we regularly inject controlled failures (node outages, rack outages, network latency, API errors) into production using our fault injection framework. By analyzing system behavior under these conditions, we proactively identify and fix weaknesses, ensuring continuous reliability.
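The partial-failure behavior in the first bullet can be sketched as follows. This is an illustrative stand-in, not our client library: the shard objects, error type, and `mget` signature are invented for the example.

```python
# Illustrative sketch of partial-failure handling for a sharded batch read:
# instead of failing the whole batch when one shard errors, return per-key
# results alongside the list of keys that failed. The shard objects and the
# error type are stand-ins, not a real client API.
class ShardUnavailable(Exception):
    pass

def batched_get(keys, pick_shard):
    """Fan a batch of keys out to shards, tolerating individual shard failures."""
    results, failed = {}, []
    by_shard = {}
    for key in keys:
        by_shard.setdefault(pick_shard(key), []).append(key)
    for shard, shard_keys in by_shard.items():
        try:
            results.update(shard.mget(shard_keys))
        except ShardUnavailable:
            failed.extend(shard_keys)   # only these keys fail, not the batch
    return results, failed

class FakeShard:
    def __init__(self, data, healthy=True):
        self.data, self.healthy = data, healthy
    def mget(self, keys):
        if not self.healthy:
            raise ShardUnavailable()
        return {k: self.data.get(k) for k in keys}
```

The caller then decides per key whether to fall back to persistent storage, rather than treating one unavailable shard as a total cache miss.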

Client-Side Resiliency

To prevent our caching system from being overwhelmed by sudden surges in traffic, we implemented robust client-side failure resiliency. This mechanism operates at the application level, ensuring that no single service can monopolize cache resources.

  • Rate limiting for flow control: Our client has a limit on total outbound traffic, ensuring that it does not send more requests than the backend service can handle. Requests are throttled when this limit is exceeded, which keeps the service from being overwhelmed.

  • Retry and timeout mechanisms: Our client libraries use carefully tuned retry mechanisms, including timeouts and a retry budget, to prevent thundering herd issues. This limits the blast radius, frees up resources, and improves overall system responsiveness by preventing indefinite hangs and simultaneous retries from overwhelming the system.
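The two client-side guards above can be sketched together: a token bucket caps total outbound traffic, and a retry budget caps retries as a fraction of recent requests so a struggling backend is not hammered. All rates and ratios below are assumed values for the sketch, not our production configuration.

```python
import time

# Illustrative sketch of client-side resiliency: a token bucket for flow
# control plus a retry budget to prevent thundering herds. Limits here are
# assumptions, not Roblox's real settings.
class TokenBucket:
    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst = rate, burst   # tokens/sec and max burst size
        self.tokens, self.clock = burst, clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False        # throttle: the client is over its outbound limit

class RetryBudget:
    """Allow retries only while they stay under a ratio of total requests."""
    def __init__(self, ratio=0.1):
        self.ratio, self.requests, self.retries = ratio, 0, 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        if self.retries < self.requests * self.ratio:
            self.retries += 1
            return True
        return False        # budget exhausted: fail fast instead of piling on
```

A request first passes the token bucket; on failure, it retries only if the budget allows, which bounds the blast radius of a backend slowdown.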

The Future of Roblox Caching

Roblox’s caching journey started with bare-metal Redis® clusters managed directly by each application owner, moved to Memcached clusters managed by our caching team, and finally transitioned to a Redis®-as-a-service model, where a Redis® cluster is provided as a managed service. While this model has proved highly effective at our current scale, we recognize its limitations in service efficiency and resource isolation for a platform as dynamic as Roblox.

As the platform continues to grow, we plan to make a significant investment in a next-generation multitenant caching service built upon Valkey. This architecture is designed to drastically reduce costs by sharing common resources through multitenancy while ensuring data, resource, and failure isolation for each tenant. Our goal is to greatly enhance the performance, scalability, reliability, and cost-efficiency of our caching infrastructure to meet the demands of increasingly massive experiences on Roblox.

Acknowledgement:

This achievement was a collaborative effort. Thank you to other team members who contributed significantly: Utkarsh Singh, Karthick Ramachandran, Vineesha Kasireddy, Weiji Hu, and Leon Gao.

1AWS limits ElastiCache to 500 nodes.

*Redis is a trademark of Redis Labs Ltd. Any rights therein are reserved to Redis Labs Ltd.