All it takes is a photo and a voice recording – Alibaba's new artificial intelligence creates a full-body avatar from them

A single voice recording and a photo are enough to create a lifelike, full-body virtual character with facial expressions and emotions – no studio, actor, or green screen required. Alibaba's latest development, an open-source artificial intelligence model called OmniAvatar, promises to do just that. The technology is still evolving, but what it already enables is worth paying attention to – as are the new questions it raises.

OmniAvatar is built on a multi-channel learning approach: the model processes voice, image, and text prompts simultaneously. It breaks the speech down into smaller units and uses them to infer the emotional charge, emphasis, and rhythm of each moment, then combines these cues with the reference image and the text prompt to generate a moving, talking character whose expressions reflect the emotion of the speech. The system not only synchronizes mouth movements but also harmonizes body language and facial expressions with what is being said; the character can even interact with objects, for example pointing, lifting something, or gesturing.
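
To make the idea of "breaking speech down into smaller units" concrete, here is a minimal sketch of one common way audio-driven video models align a recording with the frames they generate: the waveform is cut into equal windows, one per video frame, so each frame can be conditioned on the corresponding slice of speech. The sample rate, frame rate, and function below are illustrative assumptions, not values or code taken from OmniAvatar.

```python
# Hypothetical sketch: splitting a speech waveform into per-video-frame windows,
# a common alignment strategy in audio-driven avatar models. The constants are
# assumptions for illustration, not OmniAvatar's actual configuration.
import numpy as np

SAMPLE_RATE = 16_000   # audio samples per second (assumed)
VIDEO_FPS = 25         # generated video frames per second (assumed)

def audio_to_frame_windows(waveform: np.ndarray) -> np.ndarray:
    """Split a mono waveform into equal windows, one per video frame."""
    samples_per_frame = SAMPLE_RATE // VIDEO_FPS
    n_frames = len(waveform) // samples_per_frame
    trimmed = waveform[: n_frames * samples_per_frame]
    return trimmed.reshape(n_frames, samples_per_frame)

# Example: 3 seconds of audio -> 75 windows of 640 samples each
windows = audio_to_frame_windows(np.zeros(3 * SAMPLE_RATE))
print(windows.shape)  # (75, 640)
```

Each window can then be summarized into features (emphasis, rhythm, emotional tone) that steer the corresponding frame of the generated video.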

One of the development's important innovations is that the user can control all of this with simple text commands. For example, we can specify that the character should smile, be angry, or look surprised, or that the scene should take place in an office or even under a lemon tree. This opens up new possibilities in content creation: educational videos, virtual tours, customer-service role-playing, and even singing avatars become easier to produce – without motion capture or actors.
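
As a rough illustration of what such text-based control might look like in practice, the sketch below assembles a plain-text instruction from an emotion, a scene, and an optional action. The class, function, and parameter names are hypothetical and do not reflect the actual OmniAvatar interface; the project's GitHub repository documents the real usage.

```python
# Hypothetical sketch of composing a text prompt to steer an avatar generator.
# Names and structure are illustrative assumptions, not the OmniAvatar API.
from dataclasses import dataclass

@dataclass
class AvatarRequest:
    image_path: str   # single reference photo of the person
    audio_path: str   # voice recording to be lip-synced
    emotion: str      # e.g. "smiling", "angry", "surprised"
    scene: str        # e.g. "in an office", "under a lemon tree"
    action: str = ""  # optional, e.g. "pointing at a whiteboard"

def build_prompt(req: AvatarRequest) -> str:
    """Assemble the kind of plain-text instruction described in the article."""
    parts = [f"A person {req.emotion}", f"speaking {req.scene}"]
    if req.action:
        parts.append(req.action)
    return ", ".join(parts) + "."

req = AvatarRequest("portrait.jpg", "speech.wav", "smiling", "in an office",
                    "pointing at a whiteboard")
print(build_prompt(req))
# -> "A person smiling, speaking in an office, pointing at a whiteboard."
```

The point is simply that the creative controls are ordinary sentences rather than motion-capture data or keyframes.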

What makes the model stand out, however, is not only its technical flexibility but also the fact that it has been released as open source – a rare step for cutting-edge technology developed at the corporate level. With this decision, Alibaba and Zhejiang University, which collaborated on the development, are giving researchers, developers, and creative professionals around the world the chance to experiment with the model, customize it, and integrate it into their own applications.

It is important to note, however, that the characters in the current demonstration videos still look somewhat artificial. Some observers describe a "plastic" visual quality that falls short of photorealism. This is not necessarily a disadvantage: the characters may still be suitable for informational, educational, or promotional purposes, especially where the goal is effective content delivery rather than realism. Moreover, as the technology matures, this visual limitation may gradually disappear.

The research team has published only partial technical documentation of the underlying system, but according to the accompanying research communication, the model relies on so-called cross-modal (multisensory) learning: it achieves the rich movement and emotional expression seen in the demonstration videos by jointly interpreting audio and visual signals.

The future of the technology depends on several factors, above all on how natural the avatars can be made to look and how well the model can be integrated into various industry workflows. The direction, however, is already clear: we are moving towards automated yet personal digital communication that carries body language and emotion.

Because the tool is both accessible and versatile, it offers exciting opportunities for research and practical applications alike. The key question of the coming years is how we use that opportunity: will we manage to integrate it into everyday digital communication in a thoughtful, value-creating way, or will it remain just another spectacular technological promise? The answer is still open – but the tool is already in our hands, and anyone can download it from the official GitHub repository.
