A single voice recording and a photo are enough to create lifelike, full-body virtual characters with facial expressions and emotions – without a studio, actor, or green screen. Alibaba's latest development, an open-source artificial intelligence model called OmniAvatar, promises to do just that. Although the technology is still evolving, it is already worth paying attention to what it enables – and what new questions it raises.
OmniAvatar is based on a multi-channel learning approach: the model processes voice, image, and text prompts simultaneously. It breaks speech down into small units and uses them to infer the emotional charge, emphasis, and rhythm of each moment. From this, combined with the reference image and the text prompt, it generates a moving, talking character video that reflects those emotions. The system not only synchronizes mouth movements but also harmonizes body language and facial expressions with what is being said; the character can even interact with objects, for example pointing, lifting something, or gesturing.
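Public descriptions stop at this level of detail, so the sketch below is only a rough illustration of how such an audio-, image-, and text-conditioned pipeline could be wired together; the function names (segment_audio, generate_avatar_video), the fixed window length, and the placeholder frame output are assumptions made for illustration, not OmniAvatar's actual interface.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch only: these structures approximate the kind of
# audio/image/text conditioning described above; they are NOT OmniAvatar's real API.

@dataclass
class AudioWindow:
    start_s: float          # window start time in seconds
    end_s: float            # window end time in seconds
    features: List[float]   # placeholder for acoustic features (energy, pitch, ...)

def segment_audio(duration_s: float, window_s: float = 0.2) -> List[AudioWindow]:
    """Split a recording into short windows, the granularity at which
    prosody (emphasis, rhythm, emotional charge) would be estimated."""
    windows = []
    t = 0.0
    while t < duration_s:
        end = min(t + window_s, duration_s)
        windows.append(AudioWindow(start_s=t, end_s=end, features=[]))
        t = end
    return windows

def generate_avatar_video(image_path: str, audio_duration_s: float, prompt: str) -> List[str]:
    """Hypothetical driver: plan one output frame description per audio window,
    conditioned on the reference image and the text prompt."""
    frames = []
    for w in segment_audio(audio_duration_s):
        frames.append(
            f"frame {w.start_s:.1f}-{w.end_s:.1f}s | ref={image_path} | prompt='{prompt}'"
        )
    return frames

if __name__ == "__main__":
    clip = generate_avatar_video("reference.jpg", 1.0, "smiling presenter in an office")
    print(f"{len(clip)} frames planned, e.g.: {clip[0]}")
```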
One of the development's key innovations is that the user can control all of this with simple text commands. For example, we can specify that the character should smile, look angry or surprised, or that the scene should take place in an office or even under a lemon tree. This opens up new possibilities in content creation: educational videos, virtual tours, customer service role-playing, and even singing avatars become easier to produce, without motion capture or actors.
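The prompt grammar itself has not been spelled out in the public materials, so the snippet below is just a minimal sketch of how emotion, action, and scene instructions could be composed into one control prompt; build_prompt is an invented helper, not part of any released API.

```python
from typing import Optional

def build_prompt(subject: str,
                 emotion: Optional[str] = None,
                 action: Optional[str] = None,
                 scene: Optional[str] = None) -> str:
    """Compose a plain-text control prompt from optional emotion, action,
    and scene instructions. Purely illustrative of prompt-based control."""
    parts = [subject]
    if emotion:
        parts.append(f"looking {emotion}")
    if action:
        parts.append(action)
    if scene:
        parts.append(f"in {scene}")
    return ", ".join(parts)

# Example: a surprised presenter pointing, with the scene set under a lemon tree.
print(build_prompt("a presenter", emotion="surprised",
                   action="pointing at a whiteboard",
                   scene="the shade of a lemon tree"))
```

The resulting text would then be handed, together with the voice recording and the reference photo, to whatever generation entry point the released code exposes.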
However, the model's uniqueness lies not only in its technological flexibility, but also in the fact that it has been made available as open source. This is a rare step in the world of cutting-edge technologies developed at the corporate level. With this decision, Alibaba and Zhejiang University, which collaborated on the development, are giving researchers, developers, and creative professionals around the world the opportunity to experiment with it, customize it, and even integrate it into their own applications.
Emotion Control
"OmniAvatar can control the emotions through prompts, like happy, angry, surprise and sad." — Angry Tom (@AngryTomtweets), July 1, 2025, pic.twitter.com/fcJQ4ZmSVV
It is important to note, however, that the characters in the current demonstration videos are not yet entirely free of artificial effects. Some observers describe a somewhat "plastic" look that falls short of realism. This is not necessarily a disadvantage: the characters may still be suitable for informational, educational, or promotional purposes, especially where the goal is effective content delivery rather than photorealism. Moreover, as the technical details are refined, this visual limitation may gradually disappear.
The research team has published only partial technical documentation on how the underlying system is built, but based on the accompanying scientific publication, the model relies on so-called cross-modal learning, that is, learning across modalities. This means that it achieves the rich movement and emotion output seen in the demonstration videos by jointly interpreting audio and visual signals.
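As a rough illustration of what jointly interpreting the two signal streams can mean in practice, the toy example below fuses per-frame audio embeddings with image-patch embeddings via scaled dot-product cross-attention; the dimensions, token counts, and the fusion scheme are assumptions chosen for clarity, not the model's documented architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: 50 audio frames x 64 dims, and 16 visual patch tokens x 64 dims.
audio_tokens = rng.normal(size=(50, 64))   # stands in for per-frame speech features
visual_tokens = rng.normal(size=(16, 64))  # stands in for reference-image patches

def cross_attention(queries: np.ndarray, keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: each audio frame attends over visual tokens."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)            # (50, 16) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over visual tokens
    return weights @ values                           # (50, 64) fused features

fused = cross_attention(audio_tokens, visual_tokens, visual_tokens)
print(fused.shape)  # (50, 64): one audio-visually fused vector per speech frame
```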
The future of the technology depends on several factors, above all on whether the avatars can be made to look more natural and how well the tool can be integrated into various industry workflows. At the same time, the direction is already clear: we are moving towards increasingly automated yet personal digital communication, complete with body language and emotion.
Due to the accessibility and versatility of the tool, it offers exciting opportunities for both research and practical applications. The key question for the coming years will be how we exploit this opportunity: will we be able to integrate it into everyday digital communication in a value-creating, thoughtful way, or will it remain just another spectacular technological promise? The answer is still open – but the tool is already in our hands, and anyone can download it from the official GitHub repository.