Large language models doubled their performance last month

It's been about a month since Scale AI published the first results of "Humanity's Last Exam" (HLE), a test designed to measure the expert-level knowledge and reasoning abilities of artificial intelligence across various domains. The test also evaluates how well calibrated the models are. The exam covers both the sciences and the humanities, though the sciences, particularly mathematics, dominate, since their questions are the most likely to have objectively verifiable answers.

During the initial round of testing, several advanced models were evaluated, including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and DeepSeek R1. None of the models scored above 10%, though OpenAI's o1 and DeepSeek R1 came very close. Calibration also leaves significant room for improvement: the models showed high calibration errors, meaning they stated high confidence even when their answers were wrong.
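To make the idea of calibration error concrete, here is a minimal Python sketch. The article does not say which formula the exam uses, so this assumes a common binned variant (often called expected calibration error): answers are grouped by stated confidence, and each group's average confidence is compared with its actual accuracy. The function name and the sample data are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned gap between stated confidence and observed accuracy.

    Assumption: an equal-width binned metric, not necessarily the exact
    formula used by the benchmark discussed in the article.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one answer")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Group answers into equal-width confidence bins: [0, 0.1), [0.1, 0.2), ...
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the share of answers it holds.
        error += (len(members) / total) * abs(avg_conf - accuracy)
    return error


# Hypothetical data: a model that claims 90% confidence but is right 1 time in 10.
overconfident = expected_calibration_error([0.9] * 10, [True] + [False] * 9)
# Hypothetical data: a model that claims 10% confidence and is right 1 time in 10.
calibrated = expected_calibration_error([0.1] * 10, [True] + [False] * 9)
```

In this made-up example both models answer only one question in ten correctly, but the first has a calibration error of about 0.8 and the second close to 0. Accuracy and calibration are separate properties: a model can score poorly and still be honest about how unsure it is.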

After these first results, which in some cases amounted to outright failures, several AI providers introduced Deep Research features. These typically brought improvements in reasoning, data analysis, and structured information processing. As expected, the features first appeared in paid tiers, but free versions soon followed. Recently, Perplexity AI made Deep Research available on its free chat platform as well.

HLE benchmark

Equipped with Deep Research, the models once again attempted Humanity's Last Exam. OpenAI's Deep Research now leads with a score of 26.6%, followed by Perplexity Deep Research at 21.1%. This is a significant leap in a short period. However, not all models have progressed at the same rate, so drawing broad conclusions at this stage would be premature. The Center for AI Safety, which developed the exam together with Scale AI, predicts that some models may surpass the 50% mark by the end of the year.

While Humanity's Last Exam is an important milestone and offers a fascinating glimpse into AI's progress, it is not the sole metric for evaluating model development. The real breakthrough will likely come in the form of creative problem-solving and handling complex, open-ended tasks.
