Jun 19, 2026 6 min read

World models and interpretability are two sides of the same coin

There has been a lot of confusion around world models. Depending on who you ask, a world model is a generative model, a 3D reconstruction model, or a latent-space prediction model. I would like to propose a fourth definition– one that goes back to the original- the Internal World Model.

The claim is simple: world models and interpretability are two sides of the same coin.

Under this framing, the world model is not a separate module sitting alongside a foundation model, nor is it the model’s generated output. The world model is the internal state variables and dynamics learned by the foundation model itself. The Internal World Model is a foundation model's internal causal structure.

This reframes the central question of both the fields of interpretability and world models. Instead of asking whether a model can generate realistic videos or predict future latent states, we ask whether it has learned variables that correspond to meaningful structure in the world. Has it learned representations of velocity, object permanence, causality, social interaction, or physical dynamics? How do those variables interact to produce predictions and actions? The object of study is no longer generation or latent prediction. Instead, the object of study is the model’s internal mechanics.

To get there, I think we should recast interpretability in two more precise ways.

The first is causal discovery. Instead of treating interpretability as a vague effort to “understand models,” we can view it as the process of identifying the latent variables that causally generate a prediction. This framing is immediately familiar to scientists of the physical world. Climate researchers search for the variables generating typhoons. Cosmologists search for the variables behind star formation. These communities have been asking causal-discovery-esque questions for decades long before deep learning became a thing.

My own understanding of this connection emerged through conversations with Christina Last (@last_christina), the causality researcher working on climate systems. While sitting in a Montreal cafe, we realized that we were asking the same question using different scientific languages. I was coming from mechanistic interpretability and physical reasoning through JEPA models on Newtonian dynamics. She was coming from pre-deep-learning climate science. Both of us were interested in discovering the latent variables generating a physical system. It took us a surprisingly long time to realize we were talking about the same thing.

The second reformulation is white-box evaluation. Rather than evaluating a model solely through its outputs, we inspect the internal representations that produced those outputs. In language models, this might mean identifying internal signatures associated with deception, hallucination, or planning. In embodied systems, the equivalent dangerous features might correspond to a child interacting with an electrical outlet, or a manufacturing failure like a robot arm dropping a container of liquid.

The key shift is that we are no longer asking only what the model predicted. We are asking how it arrived there. We want access to the model’s internal reasoning trace– the chain of latent variables that combined to produce the next action, prediction, or frame.

Viewed this way, causal discovery and white-box evaluation become closely related disciplines. Both are fundamentally concerned with identifying the variables that causally generate a prediction.

This leads to a more provocative possibility: what if foundation models are not merely predictors, but simulators?

Importantly, I do not mean simulators in the generative sense (e.g. Sora, Genie, or World Labs lineages), or even in the Yann LeCun JEPA latent space sense. These methods apply to both pixel reconstruction and latent prediction. Latent prediction may produce more factorized variables, because a model not forced to reconstruct irrelevant detail can spend its capacity on the relevant causal variables instead. But that's an efficiency advantage, not a different object of study.

Instead, I mean that the internal dynamics of the model begin to resemble a simulator. Suppose a video model develops an internal representation of velocity. If we can steer that velocity representation and observe corresponding counterfactual changes in future predictions, then the model starts looking less like a statistical predictor and more like a simulator whose latent variables correspond to physical quantities in a factorized way. Think about the “dials” of a physics simulator, but the “dials” were learned in an unsupervised way, from raw observation.

Here is where cognitive maps become relevant.

Neuroscientists remain divided on whether modern foundation models possess cognitive maps– what cognitive science would call an internal model of the environment. On one end of the spectrum is a jellyfish. A jellyfish does not appear to need a detailed cognitive map of the ocean. It reacts to stimuli. It is largely a stimulus-response machine.

On the other end is a person navigating a familiar bedroom. You can probably close your eyes and mentally simulate walking through the room. You can imagine moving around the bed, knocking over a laundry basket, or opening a drawer. You can run counterfactuals entirely in your head. That ability suggests the existence of an internal world model acting as a simulator.

Something similar appears in the famous transformer trained on New York taxi trajectories. The model develops a distorted but recognizable cognitive map of the city. Taxi drivers develop one too. Experts often seem to carry internal simulations of the environments they know best.

A transformer learns a legible but distorted internal world model of New York's taxi cab routes. From Teoh et al (2026).

The open question is whether large foundation models develop analogous structures, and under what conditions.

The answer is not obvious. In fact, some of our findings in JEPA point against the strongest versions of this hypothesis. The jury is still out. Roughly half the experts I talk to lean one way and half the other, a divide reflected in discussions such as this summer’s GAC workshop at Columbia on cognitive maps in world models, organized by Eivinas Butkus (@eivinasbutkus). We are very much operating at the frontier.

But if these internal cognitive maps do emerge, the implications are enormous.

The first application is physical verification. In a conversation with Dileep George (@dileeplearning), he gave the example of a flight simulator. Imagine a pilot whose brain contains all the right representations of altitude, velocity, and instrument readings. Traditional interpretability might tell us those representations exist. Yet the pilot can still crash the plane if those representations fail to combine correctly in a particular situation.

The same problem exists in AI systems. A model may contain the right knowledge while failing to recruit it correctly when making a decision. Understanding the causal interactions between internal variables in specific contexts is far beyond simply locating a representation with a linear probe.

The second application is scientific simulation.

Climate science provides some of the strongest early evidence. Several groups, including work from Theodore MacMillan (@thzro) at Stanford, researchers at Polymathic (@PolymathicAI), Google DeepMind, and others, have found that unsupervised foundation models like Walrus and GraphCast trained on climate data can begin to recover interpretable variables corresponding to climate dynamics like typhoons. One can imagine steering vectors corresponding to phenomena such as typhoons. In another domain, work supervised by @wendlerch and @davidbau has shown interpretable features corresponding to protein-folding.

The obvious question is how far this idea scales. If cosmologists train models on the evolution of baryonic matter (as proposed by astrophysics PhD student Linda Jin), can they recover the latent variables that govern downstream processes such as star formation? Could a foundation model trained directly on physical observations discover causal variables that current theories miss? In the most extreme case, could it force revisions to our understanding of cosmology or even general relativity?

The underlying scientific technique is surprisingly similar whether we are verifying a robot arm, modeling a supply chain, simulating an energy grid, understanding climate systems, or studying the early universe. The same framework applies from flight simulators to the Big Bang.

There are still many open questions. Do foundation models actually develop cognitive maps / Internal World Models, or can they succeed through more stimulus-response strategies, like a jellyfish? What are the scaling laws governing these Internal World Models? How do compute, data, architecture, and domain expertise affect their emergence? Which domains lend themselves to easy testing? What distinguishes an Einstein-level Internal World Model from a model trained on much narrower distributions?

I do not think we know the answers yet. But if interpretability can be recast as causal discovery, and if causal discovery reveals the internal variables hidden inside foundation models, then world models and interpretability may ultimately turn out to be a very similar research program.

When the goal is physical verification, we trace how sensor observations combine into internal variables, and how those variables interact to produce the robot's next action. When the goal is scientific discovery, we trace how physical observations combine into latent variables governing typhoons, protein folding, star formation, or the evolution of galaxies. In both, the underlying question is the same: what are the network's internal variables, and how do they combine through the network's internal circuitry to generate behavior?

We would not simply generate worlds, or even just predict them. We would write down their mechanics.

Acknowledgements

Thank you to the various discussions behind this post, including but not limited to Christina Last, Dileep George, Thomas Fel, Matthew Kowal, Nicolas Ballas, Blake Richards, Linda Jin, Manjari Narayan, Rico Rodriguez, Eli Bessert, Calvin Li, and Delia McGrath.