
Over the past two years, large language models (LLMs) have triggered a familiar mix of reactions: enthusiasm, alarm, and a growing chorus of criticism. Among the sceptics, Yann LeCun occupies a distinctive place. As one of the architects of deep learning, he speaks not from the outside but from the very centre of the field, arguing that LLMs “don’t understand the world” and that real advances will come from systems that build “world models” from perception and action rather than from text alone.
This article does not try to adjudicate between architectures. It takes LeCun’s remarks as an opportunity to ask a more basic question: what do we presuppose when we oppose “language” and “world” in this way? Drawing on a little history of philosophy and of the sciences, I sketch how modern knowledge can be pictured as a triangle whose three vertices are instruments, symbolic languages, and internal models, and how both human beings and today’s AI systems approach that triangle by different, messy paths. We will see that one can agree with LeCun on the importance of perception, memory, and action, while still questioning the tendency to treat language – and with it, LLMs – as if they stood outside the world they speak about.
LeCun, LLMs, and the question of “knowing the world”
Yann LeCun has become one of the loudest critics of LLMs as a path to “real” intelligence. His core claim, repeated in various forms,1 is simple enough: LLMs learn only from language, and so they model correlations between words rather than the structure of the world. To reach autonomous intelligence, he argues, we need systems that build an internal “world model”2 from perception and action, and only then perhaps use language as an interface.
This picture is both insightful and misleading.
LeCun’s insistence on perception, memory, and action is justified: an AI that never interacts with its environment, never tests its expectations, and never accumulates its own experiences will remain dependent on the partial, biased traces that language offers. But it is misleading to oppose “words” and “world” as if they belonged to separate universes, and as if language were an essentially superficial layer.
In modern societies, language is the main medium through which experience is abstracted, accumulated, and transmitted. Much of what we know about the world does not come from our own senses, but from stories, explanations, proofs, tables of results, diagrams, code. If you train a system on that symbolic layer, you do not feed it with empty chatter: you expose it to millennia of organised experience.
The question, then, is not whether LLMs “see the world” directly. It is how language, instruments, perception, and internal organisation combine to produce something that deserves to be called a world model at all.
In what follows, I will argue three things:
- Modern science already shows that knowledge can be pictured as a triangle whose three vertices are instruments, formal languages, and internal models.
- Language is not a meta-sensor but an abstraction device and a cumulative memory for human experience.
- Human beings and current AI systems are approaching the same triangular structure from opposite directions: humans from bodies to language to models; LLM-based systems from language to models to sensors.
Seen from this angle, LeCun’s world-model programme looks less like a refutation of LLMs and more like a complementary trajectory that underestimates the epistemic weight of the linguistic layer.
Language as abstraction and cumulative memory
Language does not replace experience; it reorganises it.
Without speech, writing, and formal symbolism, each human generation would have to rediscover almost everything. You would learn to walk, to manipulate objects, perhaps to hunt, but not celestial mechanics, microbiology, or democratic constitutions. At best, small islands of craft knowledge could be transmitted by imitation.
As soon as language appears, experiences can be narrated, detached from the situation in which they occurred; regularities can be named and stabilised: “iron rusts”, “fevers return”, “the river floods in spring”; procedures can be described step by step: how to sail, how to irrigate, how to build a vault; explanations can be proposed, criticised, and replaced.
Over time, this abstraction and codification of experience gives rise to layered symbolic systems: myths, laws, recipes, proverbs, treatises, theorems, algorithms. Language, in the broad sense (natural language, mathematical notation, diagrams, code), becomes an external social memory in which past experiences are not only recorded but reorganised into concepts, rules, and models.
Crucially, this memory is both practical and theoretical:
- Practical, because it stores know-how: how to cultivate wheat, how to navigate by the stars, how to organise a workshop.
- Theoretical, because it preserves attempts at “seeing the whole”: cosmogonies, philosophies, scientific frameworks.
When an LLM is trained on text, code, and formal materials, it is not simply exposed to arbitrary word sequences. It is fed with this second-order layer in which human societies have deposited abstracted and cumulative experience. Of course, this layer is uneven, full of errors and biases. But it is not empty. It is already saturated with relations to the world.
That does not mean language is a “meta-sensor” in any literal sense: it does not receive signals directly from the environment. It is an experience abstractor. It operates on top of perception and practice, compressing and recombining them into reusable form. In human history, it is through this symbolic medium that robust world pictures have emerged.
Instruments and the decline of sensory scepticism
Classical scepticism took aim at the senses. The favourite examples are well known: the straight stick that looks bent in water, the mirage of the oasis in the desert, the distant tower that appears round but is square. From these illusions, sceptics inferred that the senses are unreliable and that certain knowledge of the physical world is impossible.
Modern science did not respond by rehabilitating the naked senses. It largely bypassed them. From the 17th century onwards, observation becomes increasingly instrumented:
- Telescopes and microscopes extend our range of vision far beyond what the eye can see.
- Clocks, balances, thermometers, barometers, and later spectrometers, detectors, sensors of all kinds, replace vague impressions with calibrated readings.
- Experimental protocols are designed to isolate effects, minimise interference, and make procedures repeatable.
Today, much of what we call “observation” consists in reading outputs from devices that no unaided sense can access directly. Nobody has a sense for radio waves, gravitational waves, or neutrinos. Yet we confidently talk about their detection because instruments transform them into signals we can record, count, and interpret.
The self-driving car has become a familiar illustration of this shift. It does not rely on something like human sight plus a human “feel” for distance. It fuses multiple channels:
- Video streams from cameras;
- Radar, lidar, GPS, inertial sensors, and wheel encoders, modalities that humans do not possess at all.
For that system, “sensing the environment” is already a matter of combining non-human modalities through algorithms, then using symbolic structures (maps, object categories, rules) to interpret those signals.
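As a rough illustration of what “fusing multiple channels” can mean in practice, here is a minimal sketch in Python. It combines noisy position estimates by inverse-variance weighting, a standard textbook approach; the channel names, the numbers, and the scenario are invented for exposition and are not taken from any real autonomous-driving stack.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    """One channel's position estimate, with its variance (uncertainty)."""
    x: float
    y: float
    variance: float  # lower variance = more trusted channel

def fuse(estimates):
    """Inverse-variance weighting: each channel counts in proportion to its certainty."""
    weights = [1.0 / e.variance for e in estimates]
    total = sum(weights)
    fused_x = sum(w * e.x for w, e in zip(weights, estimates)) / total
    fused_y = sum(w * e.y for w, e in zip(weights, estimates)) / total
    return fused_x, fused_y

# Invented readings: camera odometry (noisy), GPS (coarse), lidar localisation (precise).
readings = [
    Estimate(x=12.1, y=4.9, variance=0.50),
    Estimate(x=12.6, y=5.3, variance=2.00),
    Estimate(x=12.2, y=5.0, variance=0.10),
]
x, y = fuse(readings)
print(f"fused position: ({x:.2f}, {y:.2f})")
```

The symbolic structures mentioned above enter afterwards, when the fused estimate is matched against maps, object categories, and traffic rules; the fusion itself already combines modalities that no unaided human possesses.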
In this instrumented world, scepticism about the senses loses much of its bite. The epistemic work is done by:
- Devices and protocols that produce stable, sharable data;
- Symbolic frameworks that interpret and compress that data (tables, graphs, equations, models);
- Institutions that criticise, reproduce, and refine both.
Language plays a central role at each step. Measurements become meaningful only when inscribed in logs, compared, plotted, and discussed. The experimental side and the linguistic side intertwine from the start.
From this perspective, setting “experiment” against “language” in a simple opposition does not do justice to how modern science operates. Experiment, in practice, means instrument + protocol + inscription. Language and formalisms are not the negation of contact with the world; they are how contact is stabilised and accumulated.
Two developmental paths toward the same three-layer architecture
To clarify the relation between perception, language, and “world models”, it helps to distinguish three layers (a toy sketch in code follows the list):
- A sensory / instrumental layer: bodies, senses, and devices that receive raw or preprocessed signals from the environment. For humans: eyes, ears, skin, vestibular system, plus telescopes, microscopes, sensors. For machines: cameras, microphones, radar, logs, databases.
- A linguistic / formal layer: the abstraction and codification of experience in shared symbols such as natural language, mathematical notation, diagrams, programming languages, and data schemas.
- An internal world-model layer (in a strong sense): a structured picture of how the world “hangs together”, articulated in concepts and relations, that supports explanation and deliberate planning. Examples range from cosmologies, philosophies, and scientific theories to more modest conceptual schemes (naive physics, naive psychology).
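To make the three layers slightly more tangible, the toy sketch below represents them as data structures: raw channel readings, symbolic statements about them, and a crude “model” that counts recurring regularities. Every name here is invented for exposition; it pictures the taxonomy, not any real system.

```python
from dataclasses import dataclass, field

@dataclass
class Reading:
    """Sensory / instrumental layer: a raw signal from some channel."""
    channel: str   # e.g. "thermometer", "camera", "server log"
    value: float

@dataclass
class Statement:
    """Linguistic / formal layer: a shared symbolic record of experience."""
    text: str      # e.g. "iron rusts", "the river floods in spring"

@dataclass
class WorldModel:
    """Internal world-model layer: regularities abstracted from many statements."""
    regularities: dict = field(default_factory=dict)

    def update(self, statement: Statement) -> None:
        # Crude abstraction: a claim repeated often enough hardens into a regularity.
        self.regularities[statement.text] = self.regularities.get(statement.text, 0) + 1

def inscribe(reading: Reading) -> Statement:
    """The linguistic layer abstracts a reading into a shareable statement."""
    return Statement(f"{reading.channel} read {reading.value}")

model = WorldModel()
model.update(inscribe(Reading("thermometer", 100.0)))
model.update(Statement("iron rusts"))
model.update(Statement("iron rusts"))
print(model.regularities)
```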
With these layers in mind, human development and current AI development can be compared more precisely. For humans:
- We start from the first layer. A newborn has a body, senses, some genetically specified reflexes, and enormous plasticity. But it does not have a world model in a strong sense. It cannot project trajectories, reason about causes, or even clearly separate objects. Coordination of limbs, control of gaze, and basic expectations about solidity and continuity take months and years to emerge, and require constant trial and error.
- Very early, the second layer comes into play. Long before understanding words, the child is immersed in a linguistic environment. Others speak around and to her, name things, comment on actions, scold and encourage. Gestures, routines, and proto-narratives surround perception. Later, she acquires words, simple constructions, more complex sentences, then stories, explanations, rules. Language and practice co-shape what is noticed, remembered, and expected.
- Only on top of this entanglement between embodied life and language does something like a third layer slowly form. Children develop a “naive physics” and “naive psychology”: expectations about how objects and people behave. Historically, human societies turn stories into more systematic pictures of the world: religious cosmologies, philosophical systems, and eventually scientific theories. These world models are not inscribed in genes. They are late products of linguistic labour, debate, and abstraction.
Babies, therefore, do not carry ready-made world models. They have predispositions and learning mechanisms that, together with a linguistic and social environment, can eventually support the construction of such models.
When engineers speak of a “world model” inside an animal or a robot, they often mean something weaker: an internal organisation of perception and action that supports short-term prediction and control. Such structures are real and important. But they are not yet world models in the strong, conceptual sense that matters for science and philosophy.
For current AI systems built around LLMs, the path is almost inverted (a toy sketch of the resulting tool-using loop follows this list):
- They begin at the second layer. An LLM is trained on a vast corpus of language, code, and formalisms. It directly ingests the symbolic layer in which human societies have condensed experience. It does not have its own senses, but it sees how humans talk about senses, instruments, and experiments.
- From this, it reconstructs a kind of third layer. It learns to reproduce and recombine patterns that embody regularities about the world: what usually happens when a glass falls, how debts accumulate, how elections work, which arguments tend to go together. This is not a clean, explicit theory, but it is more than a bare table of word frequencies: a distributed, implicit model encoded in parameters and activations.
- Only now is the first layer being added, along with richer interaction. Vision, audio, tools, code execution, web access, robotic control: all of these give the system ways to acquire and manipulate signals it did not see during training. They are not biological senses, but they play the role of channels through which information circulates. Beyond that, long-term user-specific memories and environmental traces give the system a history that is no longer only what is found in books.
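The last step, in which tools and sensors are bolted onto a language-first system, can be sketched as a simple loop. Everything below is hypothetical: language_model stands in for any text-in, text-out model, and the single tool is a toy “first-layer” channel; no real API is being described.

```python
from typing import Callable, Dict, List

def language_model(prompt: str) -> str:
    """Placeholder for a trained LLM: maps a prompt to a textual continuation."""
    return "TOOL: read_thermometer"          # canned reply, for the sketch only

TOOLS: Dict[str, Callable[[], str]] = {
    "read_thermometer": lambda: "21.3 C",    # a sensory channel, added after training
}

def agent_step(goal: str, memory: List[str]) -> None:
    """One loop iteration: text in, optional tool call, observation back into memory."""
    prompt = f"Goal: {goal}\nMemory: {memory}\nWhat next?"
    reply = language_model(prompt)
    if reply.startswith("TOOL: "):
        tool_name = reply.removeprefix("TOOL: ")
        observation = TOOLS[tool_name]()
        memory.append(f"{tool_name} -> {observation}")   # traces not found in any book
    else:
        memory.append(reply)

memory: List[str] = []
agent_step("report the room temperature", memory)
print(memory)   # ['read_thermometer -> 21.3 C']
```

The point of the sketch is only the ordering: the textual layer comes first, and the observation channel arrives later, as an appendage whose traces accumulate in a memory that is no longer drawn from the training corpus alone.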
In short: humans move from senses to language to models; LLM-centric AI starts from language, then builds models, and is now acquiring sensors and actions. Neither path is clean or purely sequential: children are surrounded by speech from the first days of life; LLMs are being encased in tool-using shells almost as soon as they exist. But the contrast of starting points is real.
This does not vindicate the idea that “LLMs already understand the world exactly as we do”. It does challenge the intuition that “language = no world, perception = real world”. The two development paths share the same triangular architecture, and both are historically messy.
An idealistic “world model”
LeCun’s critique seems to rely on a relatively neat picture: first a system learns to predict its sensory inputs and the consequences of its actions; from this it derives an internal world model; then, optionally, it acquires language as a convenient way of communicating and querying that model.
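To fix ideas, here is a deliberately tiny caricature of that pipeline, set in a one-dimensional toy world: an agent learns a single parameter of its “world model” purely by predicting the consequences of its own actions, with no language anywhere. It illustrates the ordering the paragraph describes, not LeCun’s actual proposals, which predict in learned latent spaces over rich sensory streams.

```python
import random

def environment_step(position: float, action: float) -> float:
    """The world: an action shifts the position, plus a little sensor noise."""
    return position + action + random.gauss(0.0, 0.05)

# The entire "world model" is one learned parameter: how much an action moves us.
learned_gain = 0.0
learning_rate = 0.1

position = 0.0
for _ in range(1000):
    action = random.choice([-1.0, 1.0])
    predicted = position + learned_gain * action        # model's prediction
    new_position = environment_step(position, action)   # what actually happens
    error = new_position - predicted                     # prediction error
    learned_gain += learning_rate * error * action       # gradient step on squared error
    position = new_position

print(f"learned action gain ~ {learned_gain:.2f}")       # approaches 1.0
```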
Such a pipeline is conceptually attractive. It resonates with empiricist narratives: perception first, then concepts, then words. But the history of science and philosophy suggests that this order is more idealisation than description.
When philosophers first attempted systematic “world models” – in Greece, in India, in early modern Europe – they did not start from a neutral, pre-theoretical perceptual space. They started from mythologies, from ordinary language, from inherited distinctions and stories, and they tried to re-order them into more coherent conceptual structures. They worked in and on language.
With Bacon and his successors, experimental practices begin to play a more central role. But again, these practices arrive bundled with words, tables, diagrams, and social norms. One does not first collect sense impressions, then later invent symbols. One designs experiments by articulating hypotheses that are already expressed in some language, builds instruments to test them, and expresses the results symbolically. Theory and observation evolve together.
The “world models” that mark turning points – Newtonian mechanics, Maxwell’s electromagnetism, Darwinian evolution, quantum mechanics, general relativity – are not the output of a single, uniform optimisation process running on raw perception. They are retrospective unifications of multiple, historically contingent lines: experimental results, mathematical techniques, metaphors, and conceptual tensions. They emerge from decades of tinkering, revision, and controversy.
From this standpoint, the project of a neatly learned world model constructed directly from sensorimotor prediction looks somewhat idealistic. It abstracts away from the messy interplay of instruments, conceptual frameworks, and language that actually produced our best theories.
This is not to say that LeCun’s technical programme is misguided. Richer predictive architectures that operate on raw or minimally processed sensory streams are almost certainly necessary for robust action in the physical world. The point is rather that:
- Such architectures will themselves rely on internal representations that are partly shaped by the linguistic and formal scaffolds in which they are embedded.
- The high-level “world model” that matters for explanation will still be articulated and criticised in language and formalisms, not directly in the opaque geometry of a representation space.
If we reserve the term “world model” for these high-level conceptual pictures, then, in the human case, language is not an optional interface added after we have understood the world. It is the main medium in which understanding takes shape.
Language as spine, not ornament
Once we stop categorically opposing “words” and “world”, the landscape looks different.
On the one hand, LeCun’s insistence on perception, memory, and action remains important. A system that learns only from archived language will inherit our blind spots and will struggle to cope with domains where the textual record is thin, biased, or systematically silent. Connecting AI systems to sensors, instruments, and environments is not an optional luxury; it is part of giving them a life of their own.
On the other hand, language is not a decorative layer. It is how human societies abstract, accumulate, and transmit experience. It is the symbolic medium in which philosophies and scientific theories – our clearest world models – take shape. An AI that has no access to this layer would have to rediscover everything from scratch, and it would still need some internal symbolic or quasi-symbolic medium to stabilise what it learns.
LLMs are one way of entering this layer. They start from the linguistic and formal exoskeleton that humanity has built around its dealings with the world. From there, they begin to reconstruct implicit models and, increasingly, to act through tools and sensors. Systems built in the image of LeCun’s world-model programme start from the other side: they try to learn predictive structure from perception and action, and will then have to interface with our languages if they are to participate in human knowledge.
In both cases, the endgame is not pure embodiment without symbols, nor pure language without contact: it is a triangle built from sensory and instrumental channels, internal organisation, and shared symbolic media. The paths differ, but the architecture converges.
Rather than choosing between LLMs and “world models”, it may be more fruitful to see them as emphasising different edges of this triangle. The real challenge is to understand how perception-driven and language-driven learning can be made to correct and enrich one another, much as experiments and theories have done throughout the history of science.
Notes
1 See for example Yann LeCun, “Meta’s AI Chief Yann LeCun on AGI, Open-Source, and AI Risk,” TIME, 14 June 2023, where he argues that current LLMs “don’t really understand the real world” and are “not a road towards what people call ‘AGI’,” and contrasts them with future systems that would learn “world models” from perception and interaction; and “Meta’s Yann LeCun Asks How AIs will Match — and Exceed — Human-level Intelligence,” Columbia Engineering Lecture Series in AI, 18 October 2024, which describes language models as next-word predictors that “don’t understand the world as well as a housecat” and calls for architectures built around predictive “world models” rather than text alone.
2 https://www.linkedin.com/posts/yann-lecun_lots-of-confusion-about-what-a-world-model-activity-7165738293223931904-vdgR