It's been about a month since Scale AI, together with the Center for AI Safety, published the first results of "Humanity's Last Exam," a test designed to measure the expert-level knowledge and reasoning abilities of artificial intelligence across a wide range of domains. The test also evaluates how well calibrated the models are, that is, whether the confidence they state matches how often they are actually right. The exam covers both the sciences and the humanities, though for understandable reasons the sciences, particularly mathematics, dominate, since their questions are the most likely to have objectively verifiable answers.
During the initial round of testing, several advanced models were evaluated, including OpenAI's GPT-4o and o1, Anthropic's Claude 3.5 Sonnet, and DeepSeek R1. None of them scored above 10%, though OpenAI's o1 and DeepSeek R1 came very close. As for calibration, there is still significant room for improvement: the models exhibited high calibration errors, meaning they reported high confidence even when their answers were wrong.
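To make the idea of calibration error more concrete, here is a minimal Python sketch of one common way to measure it: group answers by the confidence the model reported, then compare the average confidence in each group with the share of answers that were actually correct. The function name `calibration_error`, the choice of ten bins, and the sample numbers are illustrative assumptions, and this is not necessarily the exact procedure the exam's authors use.

```python
import math


def calibration_error(confidences, correct, n_bins=10):
    """Binned RMS calibration error (illustrative sketch, not the exam's official scorer).

    confidences: stated confidence per answer, each between 0.0 and 1.0
    correct:     True/False per answer, same length as confidences
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty lists of equal length")

    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    weighted_sq_gap = 0.0
    for group in bins:
        if not group:
            continue
        avg_conf = sum(c for c, _ in group) / len(group)
        accuracy = sum(o for _, o in group) / len(group)
        # Weight each bin's squared confidence/accuracy gap by its share of answers.
        weighted_sq_gap += (len(group) / total) * (avg_conf - accuracy) ** 2

    return math.sqrt(weighted_sq_gap)


# Hypothetical model: claims 90% confidence on four answers, gets one right.
overconfident = calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
```

In this made-up example the model is 90% confident but only 25% accurate, so the error comes out at about 0.65. A model whose stated confidence matched its hit rate would score close to zero.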
Following the first results, or in some cases outright failures, several AI providers introduced Deep Research features. These typically brought improvements in reasoning, data analysis, and structured information processing. As expected, the features first appeared in paid tiers, but free versions soon followed. Recently, Perplexity AI made Deep Research available on its free chat platform as well.

Equipped with Deep Research, the models once again attempted Humanity's Last Exam. As a result, OpenAI now leads with a score of 26.6%, followed by Perplexity Deep Research at 21.1%. Compared with the first round, where no model passed 10%, this is a significant leap in a short period. However, not all models have progressed at the same rate, so drawing broad conclusions at this stage would be premature. The Center for AI Safety predicts that some models may surpass the 50% mark by the end of the year.
While Humanity's Last Exam is an important milestone and offers a fascinating glimpse into AI's progress, it is not the sole metric for evaluating model development. The real breakthrough will likely come in creative problem-solving and in handling complex, open-ended tasks.