Google Cloud 23.06.2026

Gemini Omni: The World Model Arrives

The Foundation of World Models and the Quest for AGI
Multimodality and Scientific Reasoning
The “Nano Banana for Video” Era and Conversational Editing
Real-World Creative Workflows in the Gemini App
Advanced Video Manipulation in Google Flow
Genie 3: Mastering Interactive Environments and Fluid Dynamics
Availability, Ecosystem Integration, and the Omni Mercial
What Gemini Omni Means for Business Teams
The Path to Artificial General Intelligence

At Google I/O ’26, Google DeepMind CEO Demis Hassabis took the stage to introduce Gemini Omni, and the announcement landed differently from the usual product launches. This wasn’t a faster model or a new interface. It was a different kind of AI altogether.

Omni is what researchers call a “world model”: a system that doesn’t just predict text or generate static images, but actively understands and simulates how reality works. Hassabis framed it as a critical stepping stone toward Artificial General Intelligence (AGI), which he believes is now only a few years away.

The groundwork being laid here—AI that can reason about physics, spatial relationships, and the dynamics of the real world—is what makes everything from advanced robotics to genuinely proactive AI assistants possible.

As a global, Premier-tier Google Cloud Partner, we’re watching this one closely. You should be too.

The Foundation of World Models and the Quest for AGI

To understand what makes Omni different, you need to understand what previous AI models could and couldn’t do.

LLMs are genuinely good at what they do. Feed them text, and they’ll process and generate it at a level that still catches people off guard. But text is only part of the picture. A true world model needs to understand physics, spatial relationships, and how objects actually behave in an environment. That’s a different problem entirely, and predicting the next token doesn’t get you there.

Gemini Omni gets there by combining the native multimodal intelligence of the Gemini architecture with Google’s best generative media models. Before Omni, Google had already built some capable specialized tools like Veo for video generation, Nano Banana for image generation and editing, and Genie for interactive simulations.

Each of these showed glimpses of physics awareness and world understanding. But Omni can simulate complex physical concepts like kinetic energy, gravity, and fluid dynamics with an accuracy that previous generative systems couldn’t match.

And because it was designed from the start to be natively multimodal, the goal was always ambitious: generate any output from any input. That was a harder path to take. But according to the DeepMind team, the architectural investment is paying off.

Multimodality and Scientific Reasoning

One of the most striking things Omni can do is blend rigorous scientific accuracy with visual creativity. Because the model draws on Gemini’s deep knowledge base and reasoning capabilities, it can take abstract or complex scientific concepts and turn them into accurate, stylized videos.

The keynote demo made this concrete with a deceptively simple prompt: “Make a claymation explainer of protein folding.” That’s a request that exposes the limits of most AI systems pretty quickly. A standard video generation model would struggle with the scientific precision required. A text model can’t generate the visuals at all.

Gemini Omni handled both. It produced an accurate educational video walking through how proteins start as chains of amino acids and fold into structural patterns, such as alpha helices and flat sections called beta sheets, until they form a functional three-dimensional shape.

And it did all of that in claymation style, without losing any of the scientific substance. For educators and science communicators, that combination of accuracy and creative execution is genuinely new territory.

The “Nano Banana for Video” Era and Conversational Editing

Google’s Nano Banana model changed what image editing looked like. Omni is set to do the same thing for video, and Google product leads said as much during the keynote, explicitly calling Omni the “Nano Banana for video” moment.

In development terms, think of it as Veo++: the raw video generation capabilities of Veo, combined with deep cognitive reasoning and natural-language editing in one system.

The editing experience is where this gets interesting for most users. Instead of working through a node-based timeline with a steep learning curve, you talk to the model. You bring in your own footage and describe what you want changed. Demis Hassabis demonstrated this with a selfie video where a circle drawn by the user became a physics-accurate black hole on screen.

In another example, a straightforward evening walk video was transformed with entirely new environmental elements that changed the mood of the scene completely.

The developer panel demo pushed this further. It started from a podcast intro clip of a group of people talking, then added a cat and a plant flying around the scene. Absurd? Completely. But that was the point: Omni was compositing wildly different elements into a single, coherent, realistic video stream. One of the developers said that was the moment the model’s full capability finally clicked for them. Any video becomes a starting point for something entirely new.

Real-World Creative Workflows in the Gemini App

Gemini Omni is now built directly into the redesigned Gemini application, available today to Google AI Plus, Pro, and Ultra subscribers worldwide. You can bring any combination of text, images, and video into the app and use them together in the same workflow.

The keynote showed what this looks like in practice through the story of a musician named Sashu. She was working on a new song and wanted to put together a quick video teaser to share with fans. She uploaded raw footage of herself walking, added a few reference visuals to indicate the visual style she was going for, and used Omni within the app to transform the whole thing through a few conversational prompts.

The changes went well beyond a filter. Omni restyled the footage, and Sashu was even able to ask it to switch the camera angle to a full 360-degree panning shot, something that would normally require specialized camera equipment, a dedicated crew, and extensive post-production work.

Throughout all of it, Omni kept the physics of her movement accurate and preserved the pacing and feel of the original performance. The visual layer changed. The human element underneath it didn’t.

Soon, Omni will also come to Gemini Enterprise as a tool you can call via API.

Advanced Video Manipulation in Google Flow

Professional creators working inside Google Flow—Google’s dedicated platform for artists making images, films, and music—now have access to the same Omni capabilities with more granular control.

Raw footage in, finished scene out, without losing what made the original worth keeping. That’s what the Flow demo showed. A person walking, a performance the creator didn’t want to touch. A prompt and a style reference told Omni to rework the surrounding environment and layer in complex visual effects. It separated the subject from everything else accurately, and the core of the shot came through intact.

Flow users can also add entirely new AI-generated characters into existing scenes, with the model maintaining consistency with everything else in the environment. And the level of situational awareness Omni brings to large-scale edits is worth noting specifically.

Ask it to turn a scene from early morning to late at night, and it doesn’t just darken the sky. It turns on the headlights of a vehicle in the shot and simulates the way those headlights realistically illuminate dust particles in the air. That’s lighting physics, not basic color grading.

Flow users can also use Omni-powered models to build custom creative tools tailored to their specific workflows, essentially vibe-coding their own production utilities on top of the model.

Genie 3: Mastering Interactive Environments and Fluid Dynamics

Behind Omni’s world-understanding capabilities is Genie 3, a specialized world model that gives the AI its grounding in how objects and environments actually behave. This is what makes the video coherence in Omni noticeably better than in previous generations.

Gemini Omni isn’t generating pixels that statistically look correct. It’s generating scenes based on an underlying model of how the physical world operates—gravity, momentum, light transport, the behavior of fluids.

The Genie 3 demo put this on full display. A user prompted the model to generate a “tranquil waterfall cliff area featuring dynamic water physics” and introduced a high-speed paper airplane as the main playable character.

The generated environment was then navigable in real time using keyboard arrow keys. As the user flew the airplane through the scene and it interacted with the waterfall splashes, the fluid dynamics responded accurately. The light bouncing off the river’s surface changed dynamically as the airplane passed overhead. This was a real-time physics simulation inside a generated world, not a pre-rendered clip.
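To make the difference between a pre-rendered clip and a simulation concrete, here is a minimal sketch of a hand-written physics loop in Python. It is not how Genie 3 or Omni work internally (Google has not published that, and these models learn their dynamics rather than having them coded by hand). The `Plane` class, the `KEY_TO_ACCEL` mapping, and every number other than Earth’s gravity are hypothetical, chosen only for illustration. The point is the shape of the idea: the world is a state, and each frame advances that state from the current input and the laws acting on it.

```python
from dataclasses import dataclass

GRAVITY = -9.81  # m/s^2 on the vertical axis (standard Earth gravity)

# Hypothetical mapping from arrow keys to (horizontal push, upward lift) in m/s^2.
KEY_TO_ACCEL = {
    "left": (-2.0, 0.0),
    "right": (2.0, 0.0),
    "up": (0.0, 12.0),
    "none": (0.0, 0.0),
}


@dataclass(frozen=True)
class Plane:
    """State of the toy paper airplane: position (m) and velocity (m/s)."""
    x: float
    y: float
    vx: float
    vy: float


def step(plane: Plane, key: str, dt: float) -> Plane:
    """Advance the state by one frame using semi-implicit Euler integration."""
    push, lift = KEY_TO_ACCEL[key]
    # Update velocity from acceleration first, then position from the new velocity.
    vx = plane.vx + push * dt
    vy = plane.vy + (GRAVITY + lift) * dt
    return Plane(plane.x + vx * dt, plane.y + vy * dt, vx, vy)


def simulate(plane: Plane, keys: list[str], dt: float = 1 / 60) -> Plane:
    """Replay a sequence of per-frame key presses at 60 frames per second."""
    for key in keys:
        plane = step(plane, key, dt)
    return plane


if __name__ == "__main__":
    start = Plane(x=0.0, y=10.0, vx=3.0, vy=0.0)
    coasting = simulate(start, ["none"] * 60)  # one second with no input
    climbing = simulate(start, ["up"] * 60)    # one second holding the up arrow
    print(f"coasting: x={coasting.x:.2f} m, y={coasting.y:.2f} m")
    print(f"climbing: x={climbing.x:.2f} m, y={climbing.y:.2f} m")
```

Because each frame is computed from the previous state plus the player’s input, the same starting scene produces a different trajectory for every sequence of key presses. A pre-rendered clip cannot do that.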

Genie 3 is available today for Google AI Ultra subscribers.

Availability, Ecosystem Integration, and the Omni Mercial

Given the computational demands of running world models, Google is rolling Omni out carefully. Gemini Omni Flash is available today across Google’s product suite, bringing the world-understanding capabilities, multimodality, and conversational video editing to users now.

Starting with video—historically the hardest modality to get right—sets the direction clearly: the goal is a model that can generate any output from any input without degradation in quality.

Google also confirmed that Gemini Omni Pro is in active development. More details on its professional-grade capabilities are coming soon.

To mark the launch and get developers hands-on immediately, Google set up an “Omni Mercial” booth at the I/O ’26 demo garden, where attendees could use Omni to star in and generate their own high-quality commercials.

More than a fun activation, it made the point that professional-quality video generation is no longer something reserved for high-end studios.

What Gemini Omni Means for Business Teams

Most of the world model conversation focuses on filmmakers and digital artists, and that’s fine. But the people with the most to gain right now are on business teams:

The marketing manager who needs the same campaign to land in six different markets, each with its own language, tone, and context.
The sales rep who’s walking into a call with a demo that wasn’t built for this prospect, and everyone in the room knows it.
The customer success lead whose onboarding videos are so out of date they’re probably doing more harm than good.

So what does the current production cycle look like? You brief an agency or an internal team, you wait, you go through feedback rounds, and you ship something that’s already a little stale by the time it’s live. For teams running across multiple markets and product lines, that’s not a minor headache.

That’s where Omni comes in: it delivers polished, stylized assets from a few conversational prompts. You don’t need any editing expertise, and there’s no queue to join. What’s more, it’s already built into the Gemini app, part of the same Google Workspace your teams are in every day, so there’s nothing new to roll out and no training program to kick off. The capability is just there, waiting.

The creative call stays with you. The execution doesn’t have to.

Now think about what that actually changes. When a camera angle or a visual style costs one sentence to try instead of an hour of work, you run more experiments. You find what works faster. That’s the real advantage.

The teams that move first will produce more content, get it out faster, and do it with a fraction of the overhead they’re carrying today. And that compounds quickly.

The Path to Artificial General Intelligence

Gemini Omni is a technical product launch, but Demis Hassabis was direct about the bigger picture it fits into.

The development of models that can truly understand and simulate the physical world is, in his view, a non-negotiable prerequisite for AGI. As AI systems take on more autonomous roles—managing schedules, operating as physical robots in real environments, making decisions in the real world—they need a working model of how that world operates. A system that can’t reason about gravity, momentum, or the behavior of objects in space isn’t ready for that.

Omni’s ability to simulate reality is what makes that possible. The AI-native era isn’t a concept on a roadmap anymore. It’s already in the app.
