Spatial intelligence is the next hurdle for AGI to overcome

With the advent of LLMs, machines have gained impressive capabilities, and the pace of development keeps accelerating, with new models appearing almost daily that make machines more efficient and more capable still. Upon closer inspection, however, this technology has only enabled machines to think in one dimension. The world we live in is three-dimensional, as humans perceive it. It is not difficult for a human to determine that something is under or behind a chair, or where a ball flying towards us will land. According to many artificial intelligence researchers, for AGI, or artificial general intelligence, to emerge, machines must be able to think in three dimensions, and for that, spatial intelligence must be developed.

What does spatial intelligence mean?

Spatial intelligence essentially means that an artificial system is able to perceive, understand, and manipulate three-dimensional data, as well as navigate in a 3D environment. This is much more than mere object recognition, which today's AIs are already excellent at. It is about machines recognizing depth, volume, relationships between objects, and spatial context, much as we humans interpret the space around us. Dr. Fei-Fei Li, a pioneer in the field of AI often referred to as the “godmother of artificial intelligence,” emphasizes that this ability is just as fundamental to the future of AI as language processing. Just as language laid the foundation for communication, understanding 3D space will enable AI to interact meaningfully with our physical environment.

However, achieving this is a serious challenge and does not follow straightforwardly from existing LLM technology. One part of the problem is that language is fundamentally one-dimensional (1D): linguistic information arrives sequentially, in order, with words and syllables following one another in speech and writing. For this reason, models built for language processing, such as LLMs, work well with sequence-based learning (e.g., sequence-to-sequence models). The other part is that language is a purely generative phenomenon: it is not tangible, we cannot see or touch it; it originates in the human mind as a completely internal construct that we only record afterwards (e.g., in writing).
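
To make the 1D point concrete, here is a minimal sketch of sequence-based learning: a toy bigram model that predicts the next token purely from the one before it. The corpus and the prediction rule are illustrative assumptions, nothing like a real LLM, but the shape of the problem is the same: a sequence of tokens in, the next token out.

```python
# Minimal sketch: language as a 1D sequence. A toy bigram model predicts the
# next token purely from the previous one; the corpus is an illustrative
# assumption, not how any real LLM is trained.
from collections import Counter, defaultdict

corpus = "the ball is under the chair and the cat is behind the chair".split()

# Count bigram transitions: token -> Counter of the tokens that follow it.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequent successor seen in the 1D sequence."""
    followers = transitions.get(token)
    return followers.most_common(1)[0][0] if followers else "<unk>"

print(predict_next("the"))  # 'chair', the most frequent successor in this toy corpus
```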

In contrast, the visual world is three-dimensional (3D), and if we include time, four-dimensional (4D). During visual perception, the 3D world is reduced to a two-dimensional projection (e.g., on our retina or in a camera image), which is a mathematically ill-posed problem: it has no unique solution. In addition, the visual world is not only generative but also reconstructive, bound by real physical laws, and its uses are more diverse, ranging from metaverse generation to robotics. According to Fei-Fei Li, modeling spatial intelligence (e.g., with 3D world models) is therefore a far more complex and difficult challenge than developing LLMs.
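
A tiny numerical example shows why inverting this projection is ill-posed. In an idealized pinhole camera (the focal length below is an arbitrary assumption), every 3D point along the same viewing ray lands on the same pixel, so a single image cannot tell them apart:

```python
# Sketch of why inverting a camera projection is ill-posed: distinct 3D points
# on one viewing ray collapse to the same 2D pixel, so depth cannot be
# recovered from a single image alone.
import numpy as np

f = 1.0  # focal length of an idealized pinhole camera (arbitrary assumption)

def project(point_3d: np.ndarray) -> np.ndarray:
    """Perspective projection: (X, Y, Z) -> (f*X/Z, f*Y/Z)."""
    X, Y, Z = point_3d
    return np.array([f * X / Z, f * Y / Z])

near = np.array([1.0, 2.0, 4.0])
far = near * 2.5  # a different 3D point on the same ray through the camera center

print(project(near))  # [0.25 0.5 ]
print(project(far))   # [0.25 0.5 ]  -> identical pixel, the depth information is lost
```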

Google Geospatial Reasoning Framework: is this spatial intelligence?

There are several approaches to building spatial intelligence today, and computer vision and 3D processing play a key role in most of them. Lidar, stereo cameras, and structured-light sensors are used to collect depth information, which neural networks then process. These technologies are already being used in autonomous systems, robotics, and geospatial applications.
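
As a simple illustration of how stereo cameras yield depth, here is a sketch of the classic disparity-to-depth relation, depth = focal length × baseline / disparity, for a rectified stereo pair. The camera parameters are made-up values, not those of any real rig:

```python
# Minimal sketch of stereo depth: for a rectified stereo pair,
# depth = focal_length * baseline / disparity. Parameters are illustrative.
import numpy as np

focal_length_px = 700.0  # focal length in pixels (assumed)
baseline_m = 0.12        # distance between the two cameras, in meters (assumed)

def depth_from_disparity(disparity_px) -> np.ndarray:
    """Convert per-pixel disparity (in pixels) to metric depth (in meters)."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    # Zero disparity corresponds to a point at infinity; guard against it.
    with np.errstate(divide="ignore"):
        return np.where(disparity_px > 0,
                        focal_length_px * baseline_m / disparity_px,
                        np.inf)

print(depth_from_disparity([42.0, 10.0, 0.0]))  # [2.0, 8.4, inf] meters
```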

The Geospatial Reasoning Framework developed by Google is a significant technological step towards applied spatial intelligence, building on the company's global geodata infrastructure and advanced generative AI capabilities (for more information, see my previous article Google Geospatial Reasoning: A New AI Tool for Solving Geospatial Problems). The system aims to uncover and interpret complex spatial relationships from diverse data such as satellite images, maps, and mobility patterns. At its core are foundation models such as the Population Dynamics Foundation Model, which models population changes, and trajectory-based mobility models, which analyze the movement of people over large areas. These models are closely integrated with Google's existing systems (Google Maps, Earth Engine, Street View), giving them access to hundreds of millions of locations and extensive geographic data.

This framework enables, for example, the modeling of urban planning scenarios, spatial analysis of disaster situations, mapping of climate vulnerabilities, and tracking of public health trends. The system uses AI, specifically Gemini's capabilities, to perform GIS operations automatically from natural-language queries, generate new spatial data, or visualize complex geographic relationships.
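
Google has not published its internals in this form, but the general pattern behind "natural-language query to GIS operation" can be sketched. In the hypothetical example below, parse_query stands in for the model call (e.g., to Gemini) that turns a free-text question into a structured operation, and execute stands in for the GIS backend; every name and the schema are my own illustrative assumptions, not Google's API:

```python
# Hypothetical sketch of the NL-to-GIS pattern: an LLM maps a question to a
# structured operation, which a conventional GIS layer then executes.
# parse_query is a stand-in for a real model call; the schema is invented.

def parse_query(question: str) -> dict:
    """Stand-in for an LLM call that returns a structured GIS operation."""
    # A real system would prompt a model; here we hardcode one plausible result.
    return {"op": "buffer", "layer": "flood_zones", "distance_km": 5}

def execute(op: dict) -> str:
    """Dispatch the structured operation to a (stubbed) GIS backend."""
    if op["op"] == "buffer":
        return f"buffered layer '{op['layer']}' by {op['distance_km']} km"
    raise ValueError(f"unsupported operation: {op['op']}")

print(execute(parse_query("Which areas lie within 5 km of known flood zones?")))
```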

At the same time, it is important to note that this approach does not cover the entire spectrum of spatial intelligence, especially not the kind of 3D world understanding that Fei-Fei Li refers to. Google's system is fundamentally built on 2D maps and geographic plane models, which are excellent for large-scale, aggregated spatial analysis, but not suited to dealing with fine-grained, object-level 3D relationships, physical laws, or embodied AI tasks. True spatial intelligence—such as when a robot needs to navigate a room, identify objects, or manipulate them—requires much more than on-site data processing: it requires dynamic world modeling, handling of perceptual uncertainty, and understanding of time-varying physical interactions.
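
One of these ingredients, handling perceptual uncertainty, can at least be hinted at in a few lines. The sketch below is a one-dimensional Kalman filter fusing noisy depth readings of an object into a running estimate; all the noise parameters and the scenario are illustrative assumptions:

```python
# Minimal sketch of handling perceptual uncertainty: a 1D Kalman filter fusing
# noisy depth measurements into a running estimate. Parameters are assumed.
import random

process_var = 0.01  # how much the true distance may drift between steps
sensor_var = 0.25   # variance of each noisy depth reading

estimate, estimate_var = 2.0, 1.0  # initial guess: ~2 m away, low confidence

random.seed(0)
true_distance = 2.5
for _ in range(10):
    measurement = true_distance + random.gauss(0.0, sensor_var ** 0.5)
    # Predict: uncertainty grows between measurements.
    estimate_var += process_var
    # Update: blend prediction and measurement, weighted by their variances.
    gain = estimate_var / (estimate_var + sensor_var)
    estimate += gain * (measurement - estimate)
    estimate_var *= (1.0 - gain)

print(f"estimate: {estimate:.2f} m (true: {true_distance} m)")
```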

According to Dr. Fei-Fei Li, vision took 540 million years of evolution to develop, while language emerged in just half a million years, which shows just how fundamental and complex a task building machine vision is.

The Paths of the Future

Although remarkable results are already visible in specialized applications, achieving human-level spatial intelligence remains an ambitious goal. Initiatives such as World Labs, which attract huge investments, show that the industry sees great potential in this area. In the future, the effective integration of different kinds of spatial intelligence, from fine-grained 3D object manipulation to large-scale geographic reasoning, will be key. Standardized measurement and evaluation frameworks also need to be developed to track progress accurately. Collaboration between experts in computer vision, robotics, cognitive science, and geography is essential, because training models with spatial intelligence is extremely difficult: while the web offers a wealth of text and images for training LLMs, acquiring a comparable amount of data about the 3D world is not only a major challenge but also requires completely new approaches.

But how long will all this take? Obviously, no one knows, as the task is so complex that even the researchers themselves are reluctant to make predictions. There is, however, one story worth mentioning here. In an interview, Dr. Fei-Fei Li said that when she graduated from university, her dream was that, with her life's work, she might be able to create software that could describe in words what is in a picture. In 2015, she, her colleagues, and her students (Andrej Karpathy, Justin Johnson, and others) suddenly found themselves with a working solution. Dr. Li was a little disappointed and wondered what the hell she was going to do with the rest of her life. She jokingly remarked to Andrej Karpathy that they should now build the reverse: software that generates an image from text. Andrej laughed at the absurdity of it, and Dr. Li probably chuckled to herself, but those of us who haven't spent the last few years living in a cave or under a rock know how that story ended.
