All it takes is a photo and a voice recording – Alibaba's new artificial intelligence creates a full-body avatar from them

A single voice recording and a photo are enough to create lifelike, full-body virtual characters with facial expressions and emotions – no studio, actors, or green screen required. Alibaba's latest development, an open-source artificial intelligence model called OmniAvatar, promises to do just that. The technology is still evolving, but what it already enables – and the new questions it raises – deserves attention.

OmniAvatar is based on a multimodal learning approach: the model processes voice, image, and text prompts simultaneously. It breaks the speech down into smaller units and infers from them the emotional charge, emphasis, and rhythm of each moment. Conditioned on the supplied image and text prompt, it then generates a moving, talking character whose video reflects those emotions. The system not only synchronizes mouth movements but also harmonizes body language and facial expressions with what is being said; the character can even interact with objects, for example pointing, lifting something, or gesturing.
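To make that flow concrete, here is a minimal Python sketch of such a pipeline. Every name in it (load_pcm, chunk_and_score, generate_video) is hypothetical and the generative stage is a stub: the real OmniAvatar renders video with a generative model, while this toy version only chunks a WAV file and scores each speech unit's loudness as a crude stand-in for the emphasis and emotion features described above.

```python
"""Minimal, illustrative sketch of an audio-driven avatar pipeline.

All names are hypothetical; this is not OmniAvatar's actual API.
"""
import struct
import wave


def load_pcm(path: str) -> tuple[list[int], int]:
    """Read 16-bit mono PCM samples and the sample rate from a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    return list(struct.unpack(f"<{len(raw) // 2}h", raw)), rate


def chunk_and_score(samples: list[int], rate: int, ms: int = 40) -> list[float]:
    """Split speech into short units and score each by RMS energy –
    a crude stand-in for the emphasis/emotion features the model infers."""
    step = rate * ms // 1000
    return [
        (sum(s * s for s in samples[i:i + step]) / len(samples[i:i + step])) ** 0.5
        for i in range(0, len(samples), step)
    ]


def generate_video(photo: str, prompt: str, emphasis: list[float]) -> None:
    """Stub for the generative stage: the real system renders frames
    conditioned on the photo, the text prompt, and the audio features."""
    for t, e in enumerate(emphasis[:5]):
        print(f"frame {t}: photo={photo}, prompt={prompt!r}, emphasis={e:.0f}")


if __name__ == "__main__":
    samples, rate = load_pcm("speech.wav")
    generate_video("portrait.jpg", "smiling, standing in an office",
                   chunk_and_score(samples, rate))
```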

An important innovation is that the user can control all of this with simple text prompts: we can specify, for example, that the character should smile, be angry, or look surprised, or that the scene should take place in an office or even under a lemon tree. This opens up new possibilities in content creation: educational videos, virtual tours, customer-service role-play, and even singing avatars become easier to produce – without motion capture or actors.
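As an illustration of this kind of prompt-driven control, the snippet below assembles a text command from an emotion, a scene, and an optional action. The prompt grammar the real model expects is not documented in the article, so build_prompt and these strings only mirror the examples given above.

```python
# Hypothetical prompt builder; the strings mirror the article's examples,
# not OmniAvatar's actual prompt format.
def build_prompt(emotion: str, scene: str, action: str | None = None) -> str:
    parts = [f"The character is {emotion}", f"the scene is set {scene}"]
    if action:
        parts.append(f"they {action}")
    return ", ".join(parts) + "."


print(build_prompt("smiling", "in an office"))
print(build_prompt("surprised", "under a lemon tree", "point at a lemon"))
```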

The model's uniqueness, however, lies not only in its technical flexibility but also in the fact that it has been released as open source – a rare step for cutting-edge technology developed inside a large corporation. With this decision, Alibaba and Zhejiang University, which collaborated on the development, give researchers, developers, and creative professionals around the world the chance to experiment with the model, customize it, and even integrate it into their own applications.

It is important to note, however, that the characters in the current demonstration videos are not yet entirely free of artificial artifacts: some observers describe a “plastic” look that falls short of realism. This is not necessarily a disadvantage – the characters may still serve informational, educational, or promotional purposes well, especially where the goal is effective content delivery rather than photorealism. And as the technology matures, this visual limitation may gradually fade.

The research team has published only partial technical documentation of the underlying system, but according to the accompanying scientific communication, the model relies on so-called cross-modal (multimodal) learning: it achieves the rich movement and emotional output seen in the demonstration videos by interpreting audio and visual signals jointly.
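The toy example below shows the core idea behind one common form of such cross-modal conditioning: visual tokens attend to per-chunk audio features, so the generated motion can follow the speech. The shapes, values, and single attention step are deliberately simplified assumptions; a real model uses learned projections and many layers.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(visual: list[list[float]],
                 audio: list[list[float]]) -> list[list[float]]:
    """Replace each visual token with an audio-weighted mixture, where the
    weights come from dot-product similarity between the two modalities."""
    out = []
    for v in visual:
        weights = softmax([sum(a * b for a, b in zip(v, au)) for au in audio])
        out.append([sum(w * au[d] for w, au in zip(weights, audio))
                    for d in range(len(v))])
    return out


visual_tokens = [[1.0, 0.0], [0.0, 1.0]]               # e.g. face/body latents
audio_features = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # per-chunk speech features
print(cross_attend(visual_tokens, audio_features))
```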

The technology's future depends on several factors, above all on how natural the avatars can be made to look and how smoothly the model can be integrated into industry workflows. The direction, however, is already clear: digital communication is becoming increasingly automated yet personal, complete with body language and emotion.

Thanks to its accessibility and versatility, the tool offers exciting opportunities for research and practical applications alike. The key question for the coming years is how we use this opportunity: will we manage to integrate it into everyday digital communication in a thoughtful, value-creating way, or will it remain just another spectacular technological promise? The answer is still open – but the tool is already in our hands, and anyone can download it from the official GitHub repository.
