Large language models doubled their performance last month

It has been about a month since Scale AI and the Center for AI Safety published the first results of "Humanity's Last Exam" (HLE), a test designed to measure expert-level knowledge and reasoning in AI models across many domains. The test also evaluates how well calibrated the models are, that is, whether the confidence they state matches how often they are actually right. The exam covers both the sciences and the humanities, though the sciences, particularly mathematics, dominate because their questions are the most likely to have objectively verifiable answers.

During the initial round of testing, several advanced models were evaluated, including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and DeepSeek R1. None of them scored above 10%, though OpenAI's o1 and DeepSeek R1 came very close. Calibration also leaves significant room for improvement: the models showed high calibration errors, meaning they were overly confident even when their answers were wrong.
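To make "calibration error" concrete, here is a minimal sketch of one common way to measure it, the binned expected calibration error. It is an illustration only and not the benchmark's own scoring code: the function name, the ten-bin default, and the sample numbers are all assumptions made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned calibration error (illustrative, not the benchmark's scorer).

    confidences: stated confidence per answer, each in [0, 1]
    correct:     True/False per answer
    Returns the weighted average gap between accuracy and mean
    confidence across equal-width confidence bins.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one answer is required")

    # Group answers into equal-width bins by stated confidence.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the share of answers it contains.
        error += (len(members) / total) * abs(accuracy - mean_conf)
    return error


# Hypothetical model: claims 90% confidence on ten answers, gets one right.
overconfident = expected_calibration_error([0.9] * 10, [True] + [False] * 9)

# Hypothetical model: claims 50% confidence and is right half the time.
calibrated = expected_calibration_error([0.5] * 4, [True, False, True, False])
```

In this made-up example the first model has a large gap between stated confidence and accuracy, while the second has none. A low accuracy score is not a calibration problem in itself; the problem is being confident while wrong.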

After these first results, which in some cases amounted to outright failures, several AI providers introduced Deep Research features. These typically brought improvements in reasoning, data analysis, and structured information processing. As expected, the features first appeared in paid versions, but free versions followed soon afterward. Recently, Perplexity AI made Deep Research available on its free chat platform as well.

HLE benchmark

Equipped with Deep Research, the models took Humanity's Last Exam again. OpenAI's Deep Research now leads with a score of 26.6%, followed by Perplexity Deep Research at 21.1%. Compared with the first round, in which no model scored above 10%, the top score has more than doubled in about a month. However, not all models have progressed at the same rate, so drawing broad conclusions at this stage would be premature. The Center for AI Safety predicts that some models may surpass the 50% mark by the end of the year.

While Humanity's Last Exam is an important milestone and offers a fascinating glimpse into AI's progress, it is not the only measure of model development. The real breakthrough will likely come in creative problem-solving and in handling complex, open-ended tasks.
