Large language models doubled their performance last month

It's been about a month since Scale AI and the Center for AI Safety published the first results of "Humanity's Last Exam" (HLE), a test designed to measure artificial intelligence's expert-level knowledge and reasoning abilities across a wide range of domains. The test also evaluates how well calibrated the models are, that is, whether their stated confidence matches how often they are actually right. The exam covers both the sciences and the humanities, though for understandable reasons the sciences, and mathematics in particular, dominate, since their questions most readily admit objectively verifiable answers.

In the initial round of testing, several advanced models were evaluated, including OpenAI's GPT-4o and o1, Anthropic's Claude 3.5 Sonnet, and DeepSeek R1. None of them managed to clear 10% accuracy, though OpenAI's o1 and DeepSeek R1 came very close. Calibration also leaves significant room for improvement: the models exhibited high calibration errors, meaning they were overconfident even when their answers were wrong.
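To make the idea of calibration error concrete, here is a minimal Python sketch of one common way to measure it, a binned expected calibration error. It is an illustration only, not the benchmark's own scoring code: the function name `calibration_error`, the choice of 10 bins, and the sample numbers are all assumptions made for this example.

```python
def calibration_error(confidences, correct, n_bins=10):
    """Binned expected calibration error (illustrative, not HLE's scorer).

    confidences: stated confidence per answer, each in [0, 1]
    correct:     whether each answer was actually right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of the answers.
        error += (len(members) / total) * abs(avg_conf - accuracy)
    return error


# Hypothetical model: claims 90% confidence on four answers, gets one right.
overconfident = calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
```

In this made-up case the model claims 90% confidence but is right only 25% of the time, so the gap (about 0.65) is large. A model that says 50% and is right half the time would score 0.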

Following the first results, which in some cases amounted to outright failures, several vendors added Deep Research features to their models. These typically brought improvements in reasoning, data analysis, and structured information processing. As expected, the features first appeared in paid tiers, but free versions soon followed. Recently, Perplexity AI made Deep Research available on its free chat platform as well.

HLE benchmark

Equipped with Deep Research, the models once again attempted Humanity's Last Exam. As a result, OpenAI's Deep Research now leads with a score of 26.6%, followed by Perplexity Deep Research at 21.1%. Both results are more than double the sub-10% scores of the first round, a significant leap in a short period. However, not all models have progressed at the same rate, so drawing broad conclusions at this stage would be premature. The Center for AI Safety predicts that some models may surpass the 50% mark by the end of the year.

While Humanity's Last Exam is an important milestone and offers a fascinating glimpse into AI's progress, it is not the sole metric for evaluating model development. The real breakthrough will likely come in the form of creative problem-solving and handling complex, open-ended tasks.
