With the advent of LLMs, machines have gained impressive capabilities. What's more, the pace of development keeps accelerating, with new models appearing almost daily that make machines more efficient and more capable. On closer inspection, however, this technology has only enabled machines to think in one dimension, processing information as a sequence. The world we live in, by contrast, is three-dimensional, at least as humans perceive it. It is not difficult for a human to determine that something is under or behind a chair, or where a ball flying towards us will land. According to many AI researchers, for artificial general intelligence (AGI) to emerge, machines must be able to think in three dimensions, and that requires developing spatial intelligence.
What does spatial intelligence mean?
Spatial intelligence essentially means that an artificial system is able to perceive, understand, and manipulate three-dimensional data, as well as navigate in a 3D environment. This is much more than mere object recognition, which today's AIs are already excellent at. It is about machines recognizing depth, volume, relationships between objects, and spatial context—similar to how we humans interpret the space around us. Dr. Fei-Fei Li, a pioneer in the field of AI and an expert referred to as the “godmother of artificial intelligence,” emphasizes that this ability is just as fundamental to the future of AI as language processing. Just as language laid the foundation for communication, understanding 3D space will enable AI to truly interact meaningfully with our physical environment.
However, achieving this is a serious challenge, and it does not follow straightforwardly from existing LLM technology. One part of the problem is that language is fundamentally one-dimensional (1D): linguistic information arrives sequentially, in order; in speech or writing, words and syllables follow one another. This is why models built for sequence-based learning, such as LLMs (e.g., sequence-to-sequence models), handle language so well. The other part is that language is a purely generative phenomenon: it is not tangible, we cannot see or touch it, and it originates in the human mind. It is a completely internal construct that we only record afterwards (e.g., in writing).
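To make this one-dimensionality tangible: a language model only ever has to answer the question "given the tokens so far, what comes next?". The toy sketch below (a bigram counter, nothing like a real LLM, and purely my own illustration) shows that both learning and generation happen along a single sequential axis.

```python
# A minimal sketch of why language suits sequence models: text is a 1D
# stream of tokens. Toy bigram counts stand in here for the billions of
# parameters of a real LLM.
from collections import Counter, defaultdict

corpus = "the ball rolls under the chair and the ball stops".split()

# Count how often each token follows each other token (a bigram model).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequent continuation seen in the corpus."""
    return following[token].most_common(1)[0][0]

# Generation is strictly one-dimensional: one token after another.
token = "the"
sequence = [token]
for _ in range(4):
    token = predict_next(token)
    sequence.append(token)
print(" ".join(sequence))  # -> "the ball rolls under the"
```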
In contrast, the visual world is three-dimensional (3D), and four-dimensional (4D) if we include time. During visual perception, the 3D world is reduced to a two-dimensional projection (e.g., on our retina or in a camera image). Recovering the 3D world from that projection is a mathematically ill-posed problem: many different 3D scenes can produce the same 2D image, so there is no unique solution. In addition, the visual world is not only generative but also reconstructive, bound by real physical laws, and its uses are more diverse, ranging from metaverse generation to robotics. For these reasons, Fei-Fei Li argues that modeling spatial intelligence (e.g., with 3D world models) is a far more complex and difficult challenge than developing LLMs.
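The ill-posedness is easy to demonstrate. The short sketch below assumes nothing beyond the standard pinhole camera model: projection divides depth away, so two different 3D points on the same ray produce an identical 2D observation.

```python
# A minimal sketch of why recovering 3D from 2D is ill-posed. Under the
# pinhole camera model, a 3D point (X, Y, Z) projects to the image point
# (f*X/Z, f*Y/Z): depth Z is divided away, so infinitely many 3D points
# land on the same pixel.

def project(point_3d, focal_length=1.0):
    """Pinhole projection of a 3D point onto the 2D image plane."""
    x, y, z = point_3d
    return (focal_length * x / z, focal_length * y / z)

# Two different points on the same ray through the camera center...
near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)   # twice as far away, twice as large

# ...produce the identical 2D observation:
print(project(near))  # (0.25, 0.5)
print(project(far))   # (0.25, 0.5)
```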
Google Geospatial Reasoning Framework: is this spatial intelligence?
There are several ways to approach building spatial intelligence today. Computer vision and 3D processing play a key role: Lidar, stereo cameras, and structured-light sensors collect depth information, which neural networks then process. These technologies are already in use in autonomous systems, robotics, and geospatial applications.
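To give one concrete example of how such sensors recover the depth that a single image loses: for a calibrated, rectified stereo pair, depth follows from the textbook relation Z = f·B/d. The sketch below uses made-up calibration values purely for illustration; Lidar and structured light arrive at depth differently but feed the same kind of 3D pipeline.

```python
# Stereo depth from disparity: Z = f * B / d, where f is the focal
# length (pixels), B the baseline between the two cameras (meters), and
# d the disparity, i.e. how far a point shifts between the left and
# right image (pixels). Calibration values below are invented.

def depth_from_disparity(disparity_px: float,
                         focal_px: float = 700.0,
                         baseline_m: float = 0.12) -> float:
    """Depth in meters from stereo disparity in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# A nearby object shifts a lot between the two views, a distant one barely:
print(depth_from_disparity(84.0))  # 1.0 m
print(depth_from_disparity(8.4))   # 10.0 m
```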
The Geospatial Reasoning Framework developed by Google is a significant technological step towards applied spatial intelligence, building on the company's global geodata infrastructure and advanced generative AI capabilities (for more information, see my previous article Google Geospatial Reasoning: A New AI Tool for Solving Geospatial Problems). The system aims to uncover and interpret complex spatial relationships from diverse data sources such as satellite images, maps, and mobility patterns. At its core are foundation models such as the Population Dynamics Foundation Model, which models population change, and trajectory-based mobility models, which analyze how people move across large areas. These models are tightly integrated with Google's existing systems (Google Maps, Earth Engine, Street View), giving them access to hundreds of millions of locations and extensive geographic data.
This framework enables, for example, the modeling of urban planning scenarios, spatial analysis of disaster situations, mapping of climate vulnerabilities, and tracking of public health trends. The system uses AI—specifically Gemini capabilities—to automatically perform GIS operations from natural language queries, generate new spatial data content, or present complex geographic relationships.
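To illustrate the pattern rather than Google's actual API (which I am not reproducing here), such a pipeline plausibly decomposes a natural-language question into a plan of GIS operations and executes them against geodata backends. Every name in the sketch below is invented purely for illustration.

```python
# A purely hypothetical sketch of the orchestration pattern described
# above: a language model translates a natural-language question into
# concrete GIS operations. None of these names come from Google's
# actual API; "geocode", "load_raster" and "zonal_stats" are invented
# placeholders for whatever a real geospatial backend would expose.

from dataclasses import dataclass

@dataclass
class GISStep:
    operation: str   # e.g. "geocode", "load_raster", "zonal_stats"
    arguments: dict

def plan_from_question(question: str) -> list[GISStep]:
    """Stand-in for the LLM planning step (Gemini, in Google's framework).
    A real system would prompt the model; here the plan is hard-coded."""
    return [
        GISStep("geocode", {"place": "riverside districts"}),
        GISStep("load_raster", {"layer": "flood_risk"}),
        GISStep("zonal_stats", {"metric": "population_exposed"}),
    ]

def execute(plan: list[GISStep]) -> None:
    """Stand-in for dispatching each step to a GIS backend."""
    for step in plan:
        print(f"running {step.operation} with {step.arguments}")

execute(plan_from_question("How many people live in flood-prone areas?"))
```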
At the same time, it is important to note that this approach does not cover the entire spectrum of spatial intelligence, and especially not the kind of 3D world understanding that Fei-Fei Li refers to. Google's system is fundamentally built on 2D maps and planar geographic models, which are excellent for large-scale, aggregated spatial analysis but ill-suited to fine-grained, object-level 3D relationships, physical laws, or embodied-AI tasks. True spatial intelligence, such as a robot navigating a room, identifying objects, or manipulating them, requires much more than processing location data: it requires dynamic world modeling, handling of perceptual uncertainty, and an understanding of time-varying physical interactions.
According to Dr. Fei-Fei Li, vision took 540 million years of evolution to develop, while language emerged in just half a million years, which shows how much more fundamental and complex a capability visual understanding is.
The Paths of the Future
Although remarkable results are already visible in specialized applications, achieving human-level spatial intelligence remains an ambitious goal. Initiatives such as World Labs, which attract huge investments, show that the industry sees great potential in this area. In the future, the effective integration of different types of spatial intelligence, from fine-grained 3D object manipulation to large-scale geographic reasoning, will be key. Standardized measurement and evaluation frameworks are also needed to track progress accurately. Collaboration between experts in computer vision, robotics, cognitive science, and geography is essential, not least because training models with spatial intelligence is extremely difficult: while the web offers a wealth of text and images for training LLMs, acquiring a comparably large amount of data about the 3D world is not only a major challenge but also demands entirely new approaches.
But how long will all this take? Obviously, no one knows, as the task is so complex that even the researchers themselves are reluctant to make predictions. There is, however, one story worth mentioning here. In an interview, Dr. Fei-Fei Li said that when she graduated from university, her dream was that her life's work might produce software capable of describing, in words, what is in a picture. In 2015, she, her colleagues, and her students (Andrej Karpathy, Justin Johnson, and others) suddenly found themselves with a working solution. Dr. Li was a little disappointed and wondered what on earth she would do with the rest of her life. She jokingly remarked to Andrej Karpathy that they should now build the reverse: generating an image from text. Andrej laughed at the absurdity of the idea, and Dr. Li probably chuckled to herself, but those of us who haven't spent the last few years living under a rock know how the story ended.