Humanity's Last Exam: Why GPT-5 Scored Only 25% on the Hardest AI Test Ever

An international group of scientists has proposed a new examination for large language models called Humanity's Last Exam (HLE). The set comprises 2,500 complex questions across mathematics, the natural sciences, and the humanities, each crafted by subject experts. At launch in early 2025, leading models scored in the single digits: GPT-4o achieved 2.7%, while the later GPT-5 reached approximately 25%. The exam includes tasks such as translating Latin tombstone inscriptions, determining hummingbird tendon anatomy, decomposing multi-step chemical reactions, and identifying Hebrew syllables ending in consonants using reconstructed Tiberian pronunciation. The benchmark addresses a critical problem: many older testing scales are exhausted, with advanced systems frequently scoring over 90%.

Why Existing AI Benchmarks No Longer Work

Latest-generation models have improved in mathematics, biology, medicine, and programming, demonstrating reasoning that resembles everyday logic. To track this growth, researchers use sets of test tasks called benchmarks.
The problem is that many older scales are exhausted: advanced systems often score above 90%, so the tests stop registering progress. When AI achieves near-perfect scores, distinguishing further improvements becomes impossible.
Humanity's Last Exam addresses this by posing questions current AI cannot solve, emphasizing tasks that require deep domain expertise, multi-step reasoning, and interdisciplinary knowledge, areas where language models still show limitations.
The benchmark excludes open-ended answers: formats such as scientific articles or legal opinions were rejected. Approximately 70,000 proposed questions were run through AI models, and only those the algorithms failed advanced. Experts then re-evaluated the material against strict criteria.
The public release included 2,500 assignments. The developers keep the remaining database closed to prevent overfitting, where AI memorizes past exams rather than understanding the underlying concepts.


How HLE Actually Tests AI: The Question Selection Process

HLE was developed by the Center for AI Safety, Scale AI, and a consortium of researchers. In 2025, the organizers collected assignments from thousands of experts across 50 countries, accepting graduate- and postgraduate-level questions on narrow topics.
Answers come in two formats: exact match against a reference solution, or multiple-choice selection. This simplifies automatic verification of results, enabling consistent, objective scoring across AI systems.
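To make the verification concrete, here is a minimal sketch of how an exact-match or multiple-choice grader could work. The Question schema, field names, and normalization rule are illustrative assumptions; HLE's actual harness has not been published in this form.

```python
# A minimal sketch of HLE-style automatic grading. The schema and
# normalization rule are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Question:
    prompt: str
    answer_type: str   # "exact_match" or "multiple_choice"
    reference: str     # reference solution, or the correct option letter

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial formatting
    # differences do not fail an otherwise correct answer.
    return " ".join(text.strip().lower().split())

def grade(question: Question, model_output: str) -> bool:
    # Return True if the model's answer matches the reference.
    if question.answer_type == "exact_match":
        return normalize(model_output) == normalize(question.reference)
    if question.answer_type == "multiple_choice":
        # Compare only the chosen option letter, e.g. "B".
        return model_output.strip().upper()[:1] == question.reference.strip().upper()
    raise ValueError(f"unknown answer type: {question.answer_type}")

# Example: a multiple-choice item is marked correct if the model picks "B".
q = Question(prompt="Which hummingbird tendon ...?",
             answer_type="multiple_choice", reference="B")
print(grade(q, "B) the deep flexor ..."))  # True
```

Restricting answers to these two machine-checkable formats is what lets thousands of questions be scored identically across competing systems, with no human judge in the loop.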

Questions passed rigorous filtering. The 70,000 initial submissions were run through existing AI models to identify those that current systems cannot solve. This ensures HLE tests genuine capability gaps rather than arbitrary difficulty.
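A sketch of that filtering stage appears below, reusing the grade function and Question type from the previous example. The query_model helper and the model names are hypothetical placeholders; the HLE team has not published its pipeline in this form.

```python
def query_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for an API call to a frontier model.
    raise NotImplementedError

def survives_filter(question: Question,
                    models=("model-a", "model-b", "model-c")) -> bool:
    # Keep a question only if every tested model answers it incorrectly.
    return all(not grade(question, query_model(m, question.prompt))
               for m in models)

# Of the ~70,000 submissions, only questions that every model failed
# advanced to expert review; 2,500 were ultimately published.
submissions = []  # would hold the submitted Question objects
hard_set = [q for q in submissions if survives_filter(q)]
```

A real pipeline would likely sample each model several times, since a single lucky guess on a multiple-choice item could otherwise eliminate a genuinely hard question.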
Expert re-evaluation focused on measurement utility. Questions needed clear correct answers, relevance to actual expert knowledge, and resistance to simple search engine lookup. This combination makes HLE fundamentally different from general knowledge tests where memorization suffices.
The closed portion strategy prevents training contamination. If all questions were public, AI companies could train models specifically on HLE content, artificially inflating scores without genuine capability improvements.

What the Test Results Actually Reveal About AI Limitations

After the launch in early 2025, leading models showed single-digit percentages, and companies began using the test to demonstrate new versions. Scores have grown since but remain far from the maximum: GPT-4o achieved 2.7%, GPT-5 approximately 25%.
The stark performance gap reveals fundamental limitations in current AI architectures. Despite impressive domain-specific capabilities, language models lack the integrated expertise human specialists develop through years of focused study. A human expert in ancient Hebrew might score 80-90% on the relevant HLE sections, while GPT-5's 25% overall suggests broad, shallow knowledge rather than deep understanding.

Score improvements from GPT-4o to GPT-5 demonstrate progress but highlight remaining gaps. The jump from 2.7% to 25% represents a nearly tenfold improvement (25 / 2.7 ≈ 9.3), yet still falls dramatically short of human expert performance, suggesting current scaling approaches deliver diminishing returns as tasks demand genuine expertise rather than pattern recognition.

Controversies immediately surrounded the exam. Some specialists dislike the name itself, as the public might perceive it as a direct comparison of AI and human capabilities. The "Last Exam" framing suggests eventual AI dominance, an interpretation the creators reject but the provocative name encourages.
Questions also arise about what exactly such testing measures. It reflects breadth of academic knowledge and the dynamics of improvement, but it inevitably simplifies real research tasks, which often require long reasoning chains and interdisciplinary work.
Critics point out that the research process does not reduce to producing answers. Just as important are evaluating whether a task is correctly formulated, finding hidden assumptions, and understanding how much confidence to place in results. These metacognitive skills (recognizing what you don't know, questioning premises, calibrating certainty) remain beyond current AI but prove crucial for actual research.
The risk of overfitting is discussed separately. Score growth may reflect not architectural improvements but training on published assignments, analogous to students preparing from previous years' exams. If AI companies train specifically on the public HLE questions, rising scores reflect test-specific optimization rather than general capability gains.
The HLE team acknowledges these limitations and continues refining the methodology. Other groups propose alternative scales that attempt to measure algorithms' scientific inventiveness and their ability to work with people on real projects.
FAQ: Humanity's Last Exam and AI Testing

Q: What is Humanity's Last Exam (HLE)?
A: HLE is a benchmark with 2,500 expert-level questions across mathematics, the sciences, and the humanities, designed to test AI where current models fail. It was created in 2025 by the Center for AI Safety, Scale AI, and international researchers.

Q: How did leading AI models perform on HLE?
A: At launch in early 2025, leading models showed single-digit percentages: GPT-4o scored 2.7%. GPT-5 later achieved approximately 25%, still demonstrating a significant capability gap versus human experts.

Q: Why do researchers need HLE when other benchmarks exist?
A: Many older benchmarks are exhausted: advanced AI frequently scores above 90%, so they can no longer demonstrate real progress. HLE targets tasks current AI cannot solve, preserving its measurement value.

Q: What makes HLE questions different from standard tests?
A: Questions come from experts in more than 50 countries, sit at graduate or postgraduate difficulty, exclude open-ended formats, and are pre-filtered to ensure current AI fails them. Examples include Latin translations, hummingbird anatomy, and Hebrew phonology reconstruction.

Q: Can AI companies train specifically on HLE to improve scores?
A: Only partially. The 2,500 public questions could be used in training, but the developers keep the remaining database closed to prevent overfitting. Critics worry that rising scores may reflect test-specific training rather than genuine capability improvements.

Humanity's Last Exam exposes fundamental gaps between current AI and human expertise: GPT-5's 25% score dramatically trails human expert performance despite representing a nearly tenfold improvement over GPT-4o's 2.7%. The benchmark addresses a critical measurement problem as older tests become saturated, but it raises questions about what constitutes AI "intelligence" and whether rising scores reflect genuine capability gains or test-specific optimization.
As AI companies compete to demonstrate progress, HLE provides a standardized comparison while highlighting that language models still lack the deep integrated expertise, metacognitive skills, and interdisciplinary reasoning that define human specialist knowledge.
By Miles Harrington
February 11, 2026
