As artificial intelligence systems advance, there is growing demand for models that can not only interpret language but also carry out complex, multi-step reasoning. Such models are crucial not only for theoretical tasks but also for practical ones such as software development and real-time decision-making. These applications, however, are particularly sensitive to computational cost, which is often difficult to control with traditional approaches.
The computational load of today's widely used transformer-based models grows rapidly with input length, because the standard softmax attention mechanism scales quadratically with sequence length: every token attends to every other token. Working with longer texts therefore drives up resource requirements dramatically, which is unsustainable in many applications. Several research directions have attempted to address this, such as sparse or linear attention mechanisms and recurrent architectures, but these approaches have typically not proven stable or scalable enough at the level of the largest systems.
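To make the scaling concrete, here is a minimal NumPy sketch of plain softmax attention. It is illustrative only, not MiniMax's implementation; the n x n score matrix it builds is the source of the quadratic cost.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Plain softmax attention. The (n x n) score matrix is what makes
    cost and memory grow quadratically with sequence length n."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # shape (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # shape (n, d)

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape)           # (8, 4)

# Doubling the sequence length quadruples the number of attention scores:
for n in (1_000, 2_000, 4_000):
    print(f"{n} tokens -> {n * n:,} scores")
```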
In this challenging environment, the MiniMax AI research group presented its new model, MiniMax-M1, which aims at both computational efficiency and practical applicability to real-world problems. Notably, the model is open-source: it is not restricted to corporate use but is also freely available for research. MiniMax-M1 is built on a mixture-of-experts (MoE) architecture and handles long text contexts through a hybrid attention system. It comprises 456 billion parameters in total, of which roughly 45.9 billion are activated per token.
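The mixture-of-experts idea behind that parameter ratio (roughly 10% of the weights active per token) can be illustrated with a short, self-contained sketch. The sizes below are arbitrary toy values, and the router is a plain top-k softmax, which only approximates how production MoE models route tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; MiniMax-M1's real expert count and dimensions differ.
n_experts, top_k, d = 32, 2, 8
router = rng.normal(size=(d, n_experts))             # router projection
experts = rng.normal(size=(n_experts, d, d)) * 0.1   # one weight matrix per expert

def moe_forward(x):
    """Send a token only to its top-k experts; the rest stay idle, which is
    why only a fraction of the total parameters is active per token."""
    logits = x @ router                              # (n_experts,)
    chosen = np.argsort(logits)[-top_k:]             # indices of the top-k experts
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                             # softmax over the chosen experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

token = rng.normal(size=d)
print(moe_forward(token).shape)                      # (8,), using only 2 of 32 experts
```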
The system can handle inputs of up to one million tokens, roughly eight times the 128,000-token context window of earlier reasoning models such as DeepSeek-R1. To optimize the attention mechanism, the researchers introduced a so-called “lightning attention” procedure, which is more efficient than the traditional softmax approach. In MiniMax-M1 the two are interleaved: seven consecutive blocks use the new linear lightning attention, followed by one block that retains classic softmax attention. This hybrid structure makes very large inputs tractable while keeping computational requirements at an acceptable level.
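The efficiency gain of linear attention comes from reordering the matrix product, sketched below. The feature map `phi` and the 7:1 layer schedule are illustrative assumptions based on the description above, not MiniMax's actual lightning-attention kernel.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized (linear) attention. Reassociating (Q K^T) V as Q (K^T V)
    avoids the n x n matrix, so cost grows linearly in sequence length.
    A generic non-causal sketch, not the exact lightning-attention kernel."""
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                      # (d, d): size independent of n
    Z = Qp @ Kp.sum(axis=0)            # per-token normalizer, shape (n,)
    return (Qp @ KV) / Z[:, None]

# Hypothetical layer schedule: seven lightning blocks, then one softmax block.
def block_type(layer_idx):
    return "softmax" if (layer_idx + 1) % 8 == 0 else "lightning"

print([block_type(i) for i in range(16)])
```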
A new reinforcement learning algorithm called CISPO was also developed to train the model. Rather than clipping the updates of individual generated tokens, as PPO-style methods do, CISPO clips the so-called importance sampling weights; this preserves a gradient signal for every token and results in a more stable learning process. Training ran for three weeks on 512 H800 graphics processors, at a rental cost of approximately $534,000.
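Based on that description, the core of the idea can be sketched in a few lines of PyTorch. The clipping bounds and the exact shape of the objective here are placeholder assumptions, not the paper's full formulation; the point is that the clipped ratio is detached, so it only scales updates while gradients still flow through every token.

```python
import torch

def cispo_style_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """Sketch of a CISPO-style objective (hyperparameters are placeholders).
    The importance sampling ratio is clipped and detached, so it only scales
    the update; gradients still flow through the log-probs of ALL tokens,
    unlike PPO, where clipped tokens contribute no gradient at all."""
    ratio = torch.exp(logp_new - logp_old)           # IS weight per token
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
    return -(clipped * advantages * logp_new).mean()

# Toy usage: token log-probs under the old and new policies, plus advantages.
logp_old = torch.log(torch.tensor([0.40, 0.20, 0.30]))
logp_new = torch.log(torch.tensor([0.50, 0.10, 0.30])).requires_grad_(True)
adv = torch.tensor([1.0, -0.5, 0.2])
cispo_style_loss(logp_new, logp_old, adv).backward()
print(logp_new.grad)   # every token receives a gradient signal
```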
The model's performance was evaluated on a range of benchmarks. MiniMax-M1 did particularly well on software engineering tasks and on long-context benchmarks, and it also showed strong results in so-called “agentic” tool use. Although some newer models surpassed it in mathematics and coding competitions, it outperformed several widely used systems when working with long texts.
MiniMax-M1 is therefore not just another large model in the development of artificial intelligence, but an initiative that combines practical considerations with openness to research. Although the technology is still evolving, this work is a promising step toward scalable, transparent systems capable of deep reasoning over long contexts.