Large language models doubled their performance last month

2025 February 18. 05:00 Attila Fodor

It's been about a month since Scale AI and the Center for AI Safety published the first results of "Humanity's Last Exam," a test designed to measure the expert-level knowledge and reasoning abilities of artificial intelligence across various domains. The test also evaluates how well calibrated the models are, that is, whether their stated confidence matches how often they are actually right. The exam covers both the sciences and the humanities, though for understandable reasons the sciences, particularly mathematics, dominate: their questions are the most likely to have objectively verifiable answers.

During the initial round of testing, several advanced models were evaluated, including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and DeepSeek R1. None of the models scored above 10%, though OpenAI's o1 and DeepSeek R1 came very close. As for calibration, there is still significant room for improvement: the models exhibited high calibration errors, meaning they were overly confident even when generating incorrect answers.
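To make the notion of calibration error more concrete, here is a minimal Python sketch of one common way to measure it: a binned calibration error that compares a model's average stated confidence with its actual accuracy. This is an illustration under assumptions, not necessarily the formula used by the exam itself; the function name, the number of bins, and the sample data are all hypothetical.

```python
def binned_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and actual accuracy.

    confidences: floats in [0, 1], the model's stated confidence per answer
    correct:     booleans, whether each answer was actually right
    Illustrative metric only; the benchmark may use a different formula.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Group answers by confidence; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    error = 0.0
    for group in bins:
        if not group:
            continue
        avg_conf = sum(c for c, _ in group) / len(group)
        accuracy = sum(1 for _, ok in group if ok) / len(group)
        # Each bin contributes in proportion to how many answers fall in it.
        error += len(group) / n * abs(accuracy - avg_conf)
    return error


# Hypothetical overconfident model: claims 90% confidence, right 1 time in 4.
overconfident = binned_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
# Hypothetical well-calibrated model: claims 50% confidence, right half the time.
calibrated = binned_calibration_error([0.5, 0.5], [True, False])
```

In this sketch, a model that is very sure of itself while mostly wrong gets a large error, while a model whose confidence matches its hit rate gets an error near zero.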

Following the first results, or in some cases outright failures, several vendors added Deep Research features to their models. These typically brought improvements in reasoning, data analysis, and structured information processing. As expected, the features first appeared in paid products, but free versions emerged soon afterward. Recently, Perplexity AI made Deep Research available on its free chat platform as well.

Equipped with Deep Research, the models once again attempted Humanity's Last Exam. As a result, OpenAI's Deep Research now leads with a score of 26.6%, followed by Perplexity Deep Research at 21.1%. This marks a significant leap in a short period: a month earlier, no model had passed 10%. However, not all models have progressed at the same rate, so drawing broad conclusions at this stage would be premature. The Center for AI Safety predicts that some models may surpass the 50% mark by the end of the year.

While Humanity's Last Exam is an important milestone and offers a fascinating glimpse into AI's progress, it is not the sole metric for evaluating model development. The real breakthrough will likely come in the form of creative problem-solving and handling complex, open-ended tasks.
