Introducing the Roblox Hybrid Architecture: Democratizing Photorealistic, Multiplayer Gaming

Our Vision: Roblox Reality

Today we are sharing technical insight into an internal project called Roblox Reality, which combines hyperscale multiplayer gaming with photorealism. We believe this represents a fundamental shift in how multiplayer immersive worlds will be created and experienced. Available in an early version later this year or early next, Roblox Reality is a hybrid architecture combining our distributed Game Engine's structured simulation with edge-based Video World Models for supersampling. This architecture will empower creators of all sizes to author and maintain interactive worlds that layer unprecedented visual fidelity and motion on top of traditional persistence and structure, without increasing development costs.

Roblox Reality is a hybrid architecture blending the capabilities of the Roblox Cloud and Game Engine with the photorealism of Video World Models. Core world state is stored durably on the server to ensure consistency across clients and persistence over time, sessions, and days, using cost- and space-efficient storage. Multiplayer gameplay is supported via strong server authority for fairness and consistency, alongside speculative client-side simulation for low latency. For rendering, cloud-based level of detail (LOD) and compositing systems generate high-fidelity assets delivered via a content delivery network (CDN). The Roblox Video Model (Super Sampler) leverages rendered video and rich data-model context to produce stochastic visuals and striking realism, operating on the edge for every player with optimal performance powered by cloud-edge GPU infrastructure. The Roblox client then renders this video feed and, in the future, will optionally overlay a locally rendered, upsampled avatar to maintain very low latency for foreground actions.
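The interplay between strong server authority and speculative client-side simulation can be sketched in a few lines of code. This is a minimal illustration under our own assumptions, not Roblox engine code; the names (`AuthoritativeServer`, `SpeculativeClient`, `CarState`) are hypothetical, and the movement model is deliberately trivial:

```python
from dataclasses import dataclass

DT = 1 / 60  # fixed simulation timestep (60 Hz tick)

@dataclass
class CarState:
    position: float = 0.0
    velocity: float = 0.0

def step(state: CarState, throttle: float) -> None:
    """One tick of the shared, deterministic movement model."""
    state.velocity += throttle * DT
    state.position += state.velocity * DT

class AuthoritativeServer:
    """Single source of truth for shared world state (fairness, consistency)."""
    def __init__(self):
        self.state = CarState()
        self.acked = 0  # number of client inputs applied so far

    def apply_input(self, throttle: float):
        step(self.state, throttle)
        self.acked += 1
        return self.acked, self.state

class SpeculativeClient:
    """Simulates locally for low latency, then reconciles with the server."""
    def __init__(self):
        self.predicted = CarState()
        self.pending = []  # inputs the server has not yet acknowledged

    def predict(self, throttle: float) -> None:
        self.pending.append(throttle)
        step(self.predicted, throttle)

    def reconcile(self, server_state: CarState, acked: int) -> None:
        # Rebase on the authoritative state, then replay unacked inputs.
        self.predicted = CarState(server_state.position, server_state.velocity)
        self.pending = self.pending[acked:]
        for throttle in self.pending:
            step(self.predicted, throttle)
```

Because the client replays its unacknowledged inputs after every server update, player input is reflected immediately on screen while the server remains the arbiter of the shared state.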

In the demos below, we show four videos of different games. The video on the top left is Roblox content recorded using the Roblox rendering engine today; the video on the top right is a representation of the 3D data we can use to condition the video generation. The video on the bottom left shows the current Roblox upsample video model running in our lab, which does not yet run in real time; the video on the bottom right shows a mockup of our product vision and what is possible in the future with this technology.

Video World Models: Strengths and Constraints

Video World Models excel at generating plausible, high-dimensional behaviors without the need to explicitly simulate every individual interaction.

Operating Video World Models within the video latent space faces specific technical limitations. The process is currently cost-intensive, and achieving high-fidelity, real-time performance, such as 2K resolution at 60 Hz, remains a development challenge. Crucially, with the world state represented in video space, these models are not currently multiplayer. A key distinction is simulation fidelity versus visual plausibility: merely seeing 500 people moving in a video does not imply they are individualized agents or "avatars with brains." We do not anticipate that video models at their current scale will inherently support the complex, individualized agent simulation required for a true multiplayer experience.

This generative capability is crucial when managing a living crowd of 20,000 people reacting in real time. But a Video World Model alone cannot reliably manage the interactions between multiple players over a two-hour session. A world model struggles with strict rule enforcement and persistent state due to a lack of long-term memory and consistent logic. Video World Models also lack user input control data, which is why playing one is not fun. Because they struggle with persistent state, consistent logic, user input control, and true multiplayer agent simulation, current models are more like guided dreams.

The interactive video models we’re seeing today are impressive, but basically vivid dreams—spectacular to look at, but fleeting and incredibly lonely. They lack interactivity, challenge, reward, and persistence—anything that makes a game a game.

Pure neural world models alone cannot deliver on the promise of an expansive, persistent multiplayer experience. While impressive in many ways, they fall short in several critical areas, including coherence over time in a single session, long-term memory across sessions, latency, and fine-grained creator control. Less obvious gaps appear when you consider consistent multiplayer simulation, exacting competitive gameplay, highly intelligent NPCs, testing, and incremental refinement.

We shouldn't ask a neural engine to become a game engine.

Game Engines: Strengths and Constraints

The Roblox Cloud and Engine are strongly complementary to Video World Models, offering replayable precision, consistent state across sessions, and persistence over time. Take, for instance, a creator building a Formula 1 Monaco Grand Prix game. They are modeling exacting scoring and penalty systems, roads, crowds, nature, and instant synchronization across multiple drivers. However, this precision comes at an implementation and runtime cost: increasing visual fidelity requires heavy assets, complex lighting, and simulation.

Over the next decade, high-end game engine outputs will continue to advance in realism, but so will the requirements for developer sophistication and consumer hardware.

The challenge the industry has not been able to address to date is how to deliver hyperrealism at scale, while making it accessible to developers large and small, and on broadly available consumer hardware.

This is because the real world has exquisite detail. Surrounding the core game is everything else: unscripted, naturalistic elements like blades of grass, leaves, and branches blowing gently in the wind, clouds of dust billowing and swirling behind the cars, glowing embers and sparks shooting from a fire, and raindrops quietly splashing in an oily, iridescent puddle. This content is very difficult to author and to render. Traditional game engines struggle with this visual complexity, looking for shortcuts that capture a simpler realism, as the memory overhead of high-resolution textures and geometry strains available resources. Simulation costs also spiral with the volumetric lighting, binaural audio, physics, and character simulation that together constitute photorealism.

We believe the best way for creators to build, and for engines to render, this complexity is a hybrid architecture in which a post-trained Video World Model generates textures, lighting, and fine-scale dynamics on top of the engine's underlying camera motion, geometry, and contextual state.

The Architecture: Syncing Game Logic and Video Pixels

We believe a hybrid approach is needed to allow creators to provide high-fidelity multiplayer interaction with photorealistic output. We call this approach Roblox Reality, which combines the Roblox Game Engine, Roblox Cloud, and a Super Sampler Roblox Video World Model.

The Roblox Reality hybrid architecture divides responsibilities between the Roblox Game Engine and the Roblox Video World Model.

The Roblox Game Engine handles the structured and logical aspects of the world, providing stable long-term memory, symbolic logic, and repeatable simulation. It is also responsible for fundamental physical operations like collision detection and object behaviors. Primary movement of objects is managed in the engine, for example the location and velocity of a car, its wheels, shocks, and steering. Building on this, the Video World Model layers on additional visual and generative components, like the beads of water streaming along the windshield and the fluttering of leaves as the car zooms by, delivering breathtaking visuals. This approach allows the Game Engine to maintain the data model (the shared and consistent state) while the Video World Model generates the pixels (the visual dream).
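One frame of this division of labor might look like the loop below. This is a hedged sketch, not the actual pipeline: `engine_step`, `render_base`, and `super_sample` are hypothetical stand-ins, and the Super Sampler, a neural video model in the real system, is represented here by a stub:

```python
def engine_step(world: dict) -> dict:
    """Authoritative simulation: the engine owns primary movement,
    e.g. the car's location and velocity, plus collisions and logic."""
    world["car_x"] += world["car_v"] * (1 / 60)
    return world

def render_base(world: dict):
    """Conventional render, plus the structured 3D context used to
    condition the video model (camera, geometry, object state)."""
    frame = {"pixels": "low-fidelity raster", "car_x": world["car_x"]}
    context = {"camera": world["camera"], "geometry": ["car", "track"]}
    return frame, context

def super_sample(frame: dict, context: dict) -> dict:
    """Stub for the Video World Model: conditioned on the rendered frame
    and scene context, it would add fine-scale detail (water beads,
    fluttering leaves) without ever touching authoritative state."""
    return {**frame, "detail": "generated", "camera": context["camera"]}

def hybrid_frame(world: dict):
    """One tick: engine updates the data model, the renderer produces
    conditioning data, and the video model produces the final pixels."""
    world = engine_step(world)
    frame, context = render_base(world)
    return world, super_sample(frame, context)
```

Note the one-way data flow: the video model reads engine output but never writes back to it, so the shared state stays consistent for every player regardless of what pixels each client sees.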

| Capability | Game Engine (Roblox Cloud) | Super Sampler (Roblox Video Model) |
| --- | --- | --- |
| Primary Function | Handles all state synchronization to keep the world consistent (the data model, the shared and consistent state). | Manages the visual and generative components (the pixels, the visual dream). |
| Core Responsibilities | Provides stable long-term memory, symbolic logic, and repeatable simulation. Responsible for fundamental physical properties (materials and locations) and operations (collision and ray tracing). | Delivers stochastic visuals and breathtaking realism: secondary motion, natural dynamic environments, and fluid physics. Generates higher-fidelity textures, more realistic lighting, and fine-scale dynamics. |
| World Consistency | Provides precision, consistent state, and guaranteed consistency. Centralizes state into one source of truth. | Excels at generating plausible, high-dimensional behaviors without explicit simulation (e.g., managing a living crowd). Operates on the edge for every player. |
| Data Handled | Everything that must be consistent among all players (players, positions, cars, birds, buildings, the 3D scene). | Ephemeral things that players do not need to see exactly the same (rusty cans, flocks of birds, cloud shapes, sand grains, grass). |
| Memory Storage | Data model | Video latents |
| Standalone Constraint | Struggles with visual complexity and high computational demands for photorealism. | Struggles with strict rule enforcement, long-term memory, consistent logic, and user input control. |
| Runtime Infrastructure | 26+ edge data centers worldwide, running millions of game instances close to users for low latency, peaking at 45+ million concurrents. | Runs in adjacent edge data centers for optimal performance, powered by H200/B200-class GPUs (or equivalent accelerators). |

Together, this platform supports infinite content creation with deep creator control.

Our development goals for Roblox Reality involve creating a Roblox Video Model capable of delivering 2K resolution at 60 Hz by pulling source-of-truth context from the Roblox Game Engine: both rendered video and 3D spatial data. Roblox Reality will be optimized to run on cloud-edge GPU infrastructure coupled with video streaming, and will eventually integrate with the Roblox client to support local avatar control and simulation.
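The 2K-at-60-Hz target implies a concrete per-frame budget. As a back-of-envelope calculation (assuming "2K" here means 2560×1440, which is our reading rather than a stated spec):

```python
WIDTH, HEIGHT, FPS = 2560, 1440, 60  # assuming "2K" = 2560x1440 (QHD)

frame_budget_ms = 1000 / FPS               # wall-clock time available per frame
pixels_per_second = WIDTH * HEIGHT * FPS   # sustained generation throughput

print(f"Per-frame budget: {frame_budget_ms:.1f} ms")          # ~16.7 ms
print(f"Throughput: {pixels_per_second / 1e6:.0f} M pixels/s")  # ~221 M pixels/s
```

Inference, video encoding, and the network hop to the player must all fit inside that roughly 16.7 ms window, which is why the Super Sampler runs on GPUs in edge data centers close to users rather than in a distant central cloud.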

Summary

Roblox Reality represents a major step in democratizing creation, allowing any creator to build photorealistic games by leveraging the Roblox Game Engine and Video Model, significantly reducing the development time, cost, and compute traditionally required for high-fidelity graphics. Given the high compute cost, we realize there are challenges we need to solve before we can scale the Roblox Reality architecture. We are already working on solutions to optimize and increase the efficiency of this architecture so that we can affordably scale it to millions of concurrent players.

Most of all, we are excited to build a platform that lets our creators build amazing multiplayer photorealistic experiences!