
Humanity’s Last Exam
In early 2025, a new AI benchmark quietly reshaped how we measure artificial intelligence. Known as Humanity’s Last Exam (HLE), this extensive evaluation framework was launched by the Center for AI Safety (CAIS) and Scale AI.
Unlike previous benchmarks, HLE is designed not just to assess whether an AI model is good — but whether it's truly capable of mastering the full spectrum of human knowledge.
As of July 2025, no large language model (LLM) has achieved a perfect score.
That, however, is the point.
What Is HLE, Exactly?
Humanity’s Last Exam is a multi-domain test built specifically for modern AI systems. It includes 2,500 questions across more than 100 disciplines — including mathematics, physics, biology, law, social sciences, and even visual interpretation tasks. According to Scale AI’s public materials (source: scale.com/cais/hle, June 2025), the questions range from multiple choice (24%) to open-ended factual prompts (76%).
But HLE isn't just big — it's hard. Questions are intentionally positioned at the edge of current human knowledge, beyond what’s typically covered in academic syllabi or professional certifications.
One example, cited by CAIS, comes from the Physics and Engineering section of the test:
"A quartz crystal with a natural resonant frequency of 32.768 Hz is used in a standard clock. Assuming a temperature drift of ±0.5 °C/ms, how does this affect the clock signal duration at a cycle of 1 second? Estimate the frequency offset percentage and the stable operating time without correction."
This question requires combining several kinds of knowledge:
* the physics of resonance and its temperature dependence;
* an engineering understanding of timing accuracy in real-time systems;
* mathematical calculation of the drift and its effect on the signal's phase angle.
This is far from a simple academic task: an engineer must be able to move from physical formulas (the temperature coefficient of quartz) to practical figures — how large the shift is in percent per second, and what the consequences are for a synchronized clock system.
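The kind of formula-to-figure step the question demands can be sketched in a few lines. This is a rough illustration using typical datasheet values for a 32.768 kHz tuning-fork crystal (a parabolic temperature coefficient of roughly −0.04 ppm/°C² around a 25 °C turnover point); these constants are assumptions for illustration, not the benchmark's answer key:

```python
# Illustrative drift calculation for a 32.768 kHz clock crystal.
# Constants are typical datasheet values (assumed, not from HLE):
K_PPM = -0.04  # parabolic temperature coefficient, ppm per degC^2
T_TURN = 25.0  # turnover temperature where the offset is zero, degC

def freq_offset_ppm(temp_c: float) -> float:
    """Fractional frequency offset (ppm) at a given temperature."""
    return K_PPM * (temp_c - T_TURN) ** 2

def drift_seconds_per_day(temp_c: float) -> float:
    """Accumulated clock error over 24 hours at a constant temperature."""
    return freq_offset_ppm(temp_c) * 1e-6 * 86_400

# At 35 degC the crystal runs about 4 ppm slow,
# losing roughly a third of a second per day:
print(freq_offset_ppm(35.0))
print(drift_seconds_per_day(35.0))
```

Just ten degrees away from the turnover point already costs a few ppm, which is precisely the jump from physical formula to practical consequence the question is probing.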
The question tests not just knowledge of the topic, but the ability to integrate data from different disciplines. It does not rest on familiar patterns; answering it requires a logical approach and interdisciplinary calculation.
Examples like these expose weaknesses even in advanced language models—they can't just pick an answer, they have to reason.
An example from the biology section is simpler and clearer:
"In butterflies of the family Nymphalidae, males and females have a difference in wing width of about 15%. If the average male wing is 50 mm, what is the average female wing width in millimeters?"
Questions like this test the ability to translate a biological statement into a precise calculation. They do not require deep specialized knowledge; they test whether the model can link simple biological information with arithmetic. Combining concepts with calculation is exactly the kind of task many language models struggle with.
Even advanced LLMs often make mistakes when interpreting percentages and basic calculations if they don't "get" the context.
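The butterfly question itself reduces to one line of arithmetic. A minimal sketch, assuming the 15% difference means females are the larger sex (the question as quoted leaves the direction ambiguous):

```python
# Female wing width, assuming females are 15% wider than males
# (an assumption: the quoted question does not state the direction).
male_wing_mm = 50.0
difference = 0.15  # 15% difference in wing width

female_wing_mm = male_wing_mm + male_wing_mm * difference
print(female_wing_mm)  # 57.5
```

The calculation is trivial; the test is whether the model sets up the percentage relationship correctly before computing.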
This isn’t trivia. It’s the type of question where even subject-matter experts might hesitate.
The Origin: From Benchmark Fatigue to Bold Experiment
HLE was conceptualized by Dan Hendrycks, a leading researcher in machine learning safety and director at CAIS. After co-authoring benchmarks like MMLU (Massive Multitask Language Understanding) and GSM8K, Hendrycks was struck by a growing problem: "AI was beating the tests, but still failing in the real world."
He reportedly began work on HLE after conversations with Elon Musk, who criticized existing tests as “too easy.”
In response, Hendrycks collaborated with Scale AI to crowdsource new benchmark questions.
Over 1,000 contributors from 50+ countries — professors, PhDs, domain experts — joined the effort.
Their task: find what even GPT-4 can’t solve.
The Test-Building Process
According to CAIS documentation, HLE's questions went through a rigorous pipeline:
1. Initial screening: Over 70,000 questions were filtered through top LLMs. If models failed, or performed below random-guess thresholds, the questions were kept.
2. Human refinement: ~13,000 “fail” questions were reviewed by graduate-level experts.
3. Final vetting: A two-layer quality check — blind peer review and expert approval — ensured the intellectual rigor.
To encourage quality contributions, authors of the 50 best questions were awarded $5,000 each, with $500 each for the next 500.
In early 2025, Scale AI also launched a bug bounty to catch dataset errors.
The Results (So Far)
When HLE launched in Q1 2025, leading models scored surprisingly low. For example:
* GPT-4o: 3.3%
* Grok-2: 3.8%
* Claude 3.5 Sonnet: <10%
By June 2025, results had improved. According to Scale AI’s public leaderboard (scale.com/hle/leaderboard):
* Gemini 2.5 Pro Preview: 21.64%
* o3 (high) by OpenAI: 20.32%
* Claude Opus 4 (Thinking): 10.72%
The most significant breakthrough came in July 2025, when xAI announced that Grok 4 scored 25.4%, and that its enhanced version, Grok 4 Heavy, reached 44.4% using multi-agent problem-solving.
That said, even the best models still fail more than half the questions.
What Makes HLE Different?
Here’s why this test matters more than most AI benchmarks:
1. It Exposes Model Limitations
Unlike benchmarks that test recall or math puzzles, HLE focuses on "conceptual blind spots" and edge-case reasoning — areas where real-world AI deployments often fail.
2. It’s Not Culturally Narrow
HLE is multilingual and culturally diverse, including questions on Islamic finance, Confucian ethics, and global regulatory systems.
3. It Challenges the Plateau
Models had begun scoring >90% on MMLU, making it difficult to detect incremental improvements. HLE breaks that ceiling.
4. It Encourages Multi-Agent Reasoning
Some models, like Grok 4 Heavy, use "AI agent teams" to divide tasks — a major shift toward autonomous AI collaboration.
Expert Reactions: Smart ≠ Useful
Not everyone agrees that excelling at HLE means a model is “intelligent.” Kevin Zhou, a theoretical physicist who contributed to HLE, points out: “There’s a huge gap between passing a test and doing real research. An AI that can answer these questions may still fail at unstructured discovery.”
This reflects a broader concern in AI safety: "Do benchmarks reflect real-world utility, or are they intellectual mirages?"
What’s Next?
Hendrycks believes HLE scores will continue to climb. He predicts models will surpass 50% accuracy by the end of 2025, at which point AI may achieve “world-class oracle” status — able to outperform human experts across most fields. But even then, he notes, the community may need a new challenge.
“HLE could be the last academic test for AI — but it won’t be the last test of AI.”
By Claire Whitmore
August 29, 2025