Hello, World Model
World Models are emerging AI systems that aim to understand and simulate the real world in 3D and in motion. While still mostly in the experimental phase, big tech players like OpenAI, Google, and Meta are investing heavily in this domain. Beyond the obvious industries like gaming, robotics and XR, these models could have a massive impact on every sector by enabling AI to grasp spatial dynamics, physical laws, and human context.
Some call it Geospatial AI or Spatial and Physical Intelligence. Others refer to it as (Large or General) World Models or World simulators. But they all want to build the very same thing: AI that understands and simulates the physical 3D world in motion. And this might very well be one of the most underestimated domains of the moment.
Easy to overlook
The reason for this underestimation is threefold. The first is simply semantics: there isn’t one Big (or should I say Large) common name for it like we have for the Large Language Models or LLMs. So, if you aren’t paying much attention, it’s easy to overlook that players like Google, OpenAI, Niantic, Meta, Amazon and several lesser-known names like Archetype AI and 1X all made announcements in this very domain in 2024. The second is the lack of real breakthroughs: most are still in the experimentation phase and have very few functional applications to show.
The third is that these World Models are moving at the edges. The domains where the first real applications are envisioned - like gaming and the creative industry - could seem “frivolous” at first sight, and too far removed from other B2C and B2B environments. But, if anything, these are always the industries that other companies should pay attention to, because they are the most fertile playground for all things AI. Games and imitation are how children learn, with minimal risk and “investment” involved, and computers learn the same way. Just think of Google DeepMind’s breakthroughs with Atari games and AlphaGo. Or, further back in time, IBM’s Deep Blue beating chess grandmaster Garry Kasparov and IBM Watson defeating Jeopardy's two foremost all-time champions.
The big kids are doing it
So why should you be paying attention (if you weren’t already)? Well, the number of really big tech players that have made announcements this year is staggering. It all started when OpenAI launched a preview of Sora, which had many people looking in the wrong direction. Most of them thought, “Hey, look, AI can make freakishly realistic movies now, that’s kinda cool”, and that was all. Our thoughts and prayers went out to the creative industries that would be disrupted, but that was pretty much it.
On OpenAI’s Sora page, however, it was not so much the tagline “Creating video from text” that caught my eye, but the sentence a bit further down: “We’re teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems that require real-world interaction.”

That was February of this year, after which things stayed calm for a few months until the “Godmother of AI” and Stanford Professor Fei-Fei Li took World Labs out of stealth mode in September to announce they would create AI that can interact with and understand the 3D world, building “Large World Models”. Then we saw Amazon licensing Covariant’s robotic foundation models, which advance state-of-the-art intelligent and safe robots. Meta’s Yann LeCun also started to get really chatty about Meta’s Fundamental AI Research lab (FAIR) investing in world modelling and “creating AI that can develop common sense and learn how the world works in similar ways to humans”. In November, Niantic - best known for its AR mobile game Pokémon Go, which obviously amassed an enormous amount of real-world data in the process - announced a Large Geospatial Model, clearly describing applications beyond its own gaming core, like AR and robotics. In December, Google launched world model Genie 2 “for training and evaluating embodied agents”, and Elon Musk announced he was planning to launch an "AI-driven game studio" under xAI. In the final stages of writing this, OpenAI finally released Sora to the public as part of its 12-day “ship-mas” product release series, while stating that it could “be a while” before a launch in “most of Europe and the UK.”
Different stories, same ballpark. And these were just the flashier announcements.
We’ve reached the limits
The reason we’ve seen all of these announcements, especially in the second half of 2024, is that everyone wants a piece of the (investment) pie and wants to show that they are taking the next step in AI. LLMs might have reached a point of diminishing returns, as newer models - even those trained on larger datasets - show only incremental improvements over their predecessors. As mentioned in the article “The agents are here. (Don’t) Panic.” in this report, that is a solvable problem.
What is not solvable, however, is the fact that computing systems cannot reach a full understanding of the world in all its 3D and physics-driven glory through language, text and even 2D video alone. They will never bring us AGI that way, or deliver general-purpose robots that are safe companions for us very fragile humans. This quote from Archetype AI's website sums this up perfectly:
"AI today has been reduced to a chatbot. The biggest problems in the world are physical, not digital."
It’s the “You had to be there” syndrome. Words cannot convey how truly awesome that party you went to was or how deeply humbling it is to look down at the earth from a spacecraft. Context is everything. That’s just as relevant for computers as it is for humans. Just to give an example, ChatGPT can provide turn-by-turn driving directions in New York City with near-perfect accuracy, even without having formed an accurate internal map of the city. But MIT researchers found that when they closed some streets and added detours, the model’s performance plummeted.
Another example: it’s fairly easy for humans to envision what structures like a church look like from angles we cannot see, because we have “spatial understanding”, while this is extraordinarily difficult for machines. But it’s also about understanding the “invisible” physical, social, animal and other laws that rule our world, which are part of our collective subconscious. Fei-Fei Li often gives the example of a cat pushing a glass to the edge of the table. All humans know what will happen and will try to prevent it; machines can’t (yet).
The good news is that the amount of available “physical” real-world training data is pretty huge, coming from (semi)autonomous vehicles, drones, AR glasses, AR games, manufacturing robots, the AI devices mentioned in the article “Hey ChatGPT, invent a better smartphone” further on, even aviation (sensors on an aircraft collect over 300,000 parameters) and many more environments from the “trillion sensor economy”. In fact, seeing that Google has so much street view 3D data available - and video footage from YouTube - it’s almost strange how few announcements they’ve made in this domain.

Depth is the new gold
LLMs work by predicting the next token, usually a letter or a short word. They are one-dimensional predictors. Similarly, 2D image/video models predict the next pixel. They are two-dimensional predictors. Both have become pretty great at predicting in their respective dimensions, but they don’t really understand the three-dimensional world.
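To make the “one-dimensional predictor” idea concrete, here is a deliberately toy sketch: a simple bigram frequency model (nothing like a real LLM, which uses neural networks over vast corpora) whose only trick is guessing the next token in a sequence. It illustrates the point that such a system models token order, not the world the tokens describe.

```python
from collections import Counter, defaultdict

# Toy corpus; a real model would train on trillions of tokens.
corpus = "the cat sat on the mat the cat saw the dog".split()

# Count which token follows which - a bigram table.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def predict_next(token):
    """Return the token most often seen after `token`, or None if unseen."""
    counts = follows.get(token)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" - it follows "the" most often here
```

However sophisticated the statistics become, the predictor only knows that “cat” tends to follow “the”; it has no notion of what a cat is, where it sits in space, or what happens when it pushes a glass off a table.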
The dream is that World Models will be able to “understand the physical structure, the physical dynamics of the physical world”, as Fei-Fei Li describes. And that they will “remember things, have intuition, have common sense, reason and plan to the same level as humans”, to put it in the words of Meta’s LeCun.
It’s actually very similar to kuuki o yomu, which translates to something like “reading the air” or situational awareness. This concept is crucial in high-context countries like Japan, where it originated. When you step on an escalator, do you stand to one side to let others pass? When someone in the room says it’s hot, do you open a window? If you ask someone on a date and they stare at you blankly, do you withdraw the invitation? That’s kuuki o yomu. And it’s also what World Models need to learn.
Yes, but what can it do?
For now, World Models are mostly experimental. There’s still a lot that they can’t do. That does not mean we should set this development aside, though. Don’t forget, OpenAI started with its GPT model in 2018 and went through several iterations before finally releasing version 3.5 to the public in 2022.
We’ll see the first applications emerge in gaming and video production for the entertainment and marketing industry. These are low-risk test environments with a high potential for cost savings (manually creating virtual, interactive worlds costs hundreds of millions of dollars and a ton of development time) and revenue creation. Manufacturing, too, is slightly ahead of the game, because there are so many sensors available that need to understand the world around them for safety reasons, especially when something unexpected happens.
Beyond that, all the tech companies involved agree that this will have far-reaching real-life implications for rapid prototyping, robotics, manufacturing, autonomous systems and Augmented Intelligence (read “XR & AI: The power couple of our times” in this report, too). Imagine what a fully functional World Model would mean for Tesla Optimus (which is why it’s not surprising that xAI would move into gaming).
Niantic, which adds a “geospatial” flavour to the game (thanks, Pokémon Go), is looking at “spatial planning and design, logistics, audience engagement, and remote collaboration”. AI decision-making will also be a really big one, because we’ll never reach advanced or trustworthy Large Action Models or AI Agents (read the article by Peter and Nina about that, too) if they don’t have a 100% understanding of the rules and laws of physics and society. And you’ll never have fully functional and advanced robots without AI agents - that can make their own decisions - either. Imagine sending a household robot to the supermarket for groceries if it cannot make its own decisions when certain items are missing.
What about your sector?
As easy as it is to understand the implications for gaming, robotics, AR, VR, advertising, etc., it is just as difficult to envisage what this will mean for companies operating in other industries over the long term. One way or another, it will affect you when it matures, if only because it might completely change consumer behaviour.
Imagine a household robot running errands. No matter how your store is designed, this purely rational system can never be persuaded to buy on impulse like an impressionable human. Or, if you’re an insurance company, what could it mean if your World Model “apps” (in robots, AR, etc.) could understand their context and feed that directly to you? Construction companies, too, will be able to design buildings much faster with systems that understand human taste, norms and physics.
And what - as we learned in the “Digital strategy in the age of acceleration” article, which taught us to look beyond the individual at the collective - will it mean if all World Model-driven robots are connected into one giant network? The knowledge they amass will be gigantic. Not just in scale, but in type, seeing that AI systems are learning to understand emotions. Let’s say that you’re a hospital. You might design a robot doctor app or a geriatric nurse app for humanoid robots. That could be useful, right? But imagine what the connected intelligence of this type of healthcare robot could mean for cancer or Alzheimer’s research.
Like a kid in a candy store
Last, but certainly not least, we need to talk about ethics. “Don’t be evil” used to be Google’s motto, but that will become increasingly hard for all companies in a world where our AI systems can understand their context, reason across domains, and autonomously act on that. (Wait, what, isn’t that AGI?). It’s like sending a kid to a candy store with an unlimited budget while telling them that their health is important.
This goes not just for privacy and safety, but also for industries like defence tech, where companies like Anduril and Palantir (what is it with military tech companies and Lord of the Rings?) are operating. Oh, neat, a drone that understands the difference between an enemy, a neutral party and an ally. Or is it? Also, energy consumption and CO2 emissions will only rise exponentially compared to LLMs, which is already problematic.
More importantly, we should never lose track of the human aspect of all of this, as you will (or have) read in David De Cremer’s piece. If technology takes over an increasingly big space in our world and all of it is centered upon speed and efficiency, the human space for emotion, wellbeing, creativity and value will shrink. And that's something we want to avoid at all costs.
All challenges aside, World Models have the potential for massive impact. Maybe they could even lead us to AGI, who knows. For many techies, “Hello, World” - the first program in almost every programming language tutorial - represents the first step toward a new adventure of possibilities. World Models embody these very same possibilities, but on a much greater scale. (Insert video of a Frankensteinian laboratory where a robot - instead of the usual patchwork being - lies on its back, suddenly opens its eyes and thinks “Hello, World”.)