
Humanity’s Last Exam
In early 2025, a new AI benchmark quietly reshaped how we measure artificial intelligence. Known as Humanity’s Last Exam (HLE), this extensive evaluation framework was launched by the Center for AI Safety (CAIS) and Scale AI.
Unlike previous benchmarks, HLE is designed not just to assess whether an AI model is good — but whether it's truly capable of mastering the full spectrum of human knowledge.
As of July 2025, no large language model (LLM) has achieved a perfect score.
That, however, is the point.
What Is HLE, Exactly?
Humanity’s Last Exam is a multi-domain test built specifically for modern AI systems. It includes 2,500 questions across more than 100 disciplines — including mathematics, physics, biology, law, social sciences, and even visual interpretation tasks. According to Scale AI’s public materials (source: scale.com/cais/hle, June 2025), the questions range from multiple choice (24%) to open-ended factual prompts (76%).
But HLE isn't just big — it's hard. Questions are intentionally positioned at the edge of current human knowledge, beyond what’s typically covered in academic syllabi or professional certifications.
One example, cited by CAIS, comes from the Physics and Engineering section of the test:
"A quartz crystal with a natural resonant frequency of 32.768 Hz is used in a standard clock. Assuming a temperature drift of ±0.5 °C/ms, how does this affect the clock signal duration at a cycle of 1 second? Estimate the frequency offset percentage and the stable operating time without correction."
This question requires combining several kinds of knowledge:
* the physics of resonance and its temperature dependence;
* an engineering understanding of timing accuracy in real-time systems;
* mathematical calculation of the drift and its effect on the signal's phase angle.
This is far from a simple academic task: an engineer must be able to move from physical formulas (the temperature coefficient of quartz) to practical figures — how large the shift is in percent per second, and what the consequences are for a synchronized clock system.
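The kind of formula-to-figure step the question demands can be sketched in a few lines. This is a rough illustration using typical datasheet values for a 32.768 kHz tuning-fork crystal (a parabolic temperature coefficient of roughly −0.04 ppm/°C² around a 25 °C turnover point); these constants are assumptions for illustration, not the benchmark's answer key:

```python
# Illustrative drift calculation for a 32.768 kHz clock crystal.
# Constants are typical datasheet values (assumed, not from HLE):
K_PPM = -0.04  # parabolic temperature coefficient, ppm per degC^2
T_TURN = 25.0  # turnover temperature where the offset is zero, degC

def freq_offset_ppm(temp_c: float) -> float:
    """Fractional frequency offset (ppm) at a given temperature."""
    return K_PPM * (temp_c - T_TURN) ** 2

def drift_seconds_per_day(temp_c: float) -> float:
    """Accumulated clock error over 24 hours at a constant temperature."""
    return freq_offset_ppm(temp_c) * 1e-6 * 86_400

# At 35 degC the crystal runs about 4 ppm slow,
# losing roughly a third of a second per day:
print(freq_offset_ppm(35.0))
print(drift_seconds_per_day(35.0))
```

Just ten degrees away from the turnover point already costs a few ppm, which is precisely the jump from physical formula to practical consequence the question is probing.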
The question tests not just knowledge of the topic, but the ability to integrate data from different disciplines. It does not rest on familiar patterns; answering it requires a logical approach and interdisciplinary calculation.
Examples like these expose weaknesses even in advanced language models—they can't just pick an answer, they have to reason.
An example from the biology section is simpler and clearer:
"In butterflies of the family Nymphalidae, males and females have a difference in wing width of about 15%. If the average male wing is 50 mm, what is the average female wing width in millimeters?"
Questions like this test the ability to translate a biological statement into a precise calculation. They do not require deep specialized knowledge; they test whether the model can link simple biological information with arithmetic. Combining concepts with calculation is exactly the kind of task many language models struggle with.
Even advanced LLMs often make mistakes when interpreting percentages and basic calculations if they don't "get" the context.
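The butterfly question itself reduces to one line of arithmetic. A minimal sketch, assuming the 15% difference means females are the larger sex (the question as quoted leaves the direction ambiguous):

```python
# Female wing width, assuming females are 15% wider than males
# (an assumption: the quoted question does not state the direction).
male_wing_mm = 50.0
difference = 0.15  # 15% difference in wing width

female_wing_mm = male_wing_mm + male_wing_mm * difference
print(female_wing_mm)  # 57.5
```

The calculation is trivial; the test is whether the model sets up the percentage relationship correctly before computing.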
This isn’t trivia. It’s the type of question where even subject-matter experts might hesitate.
The Origin: From Benchmark Fatigue to Bold Experiment
HLE was conceptualized by Dan Hendrycks, a leading researcher in machine learning safety and director at CAIS. After co-authoring benchmarks like MMLU (Massive Multitask Language Understanding) and GSM8K, Hendrycks was struck by a growing problem: "AI was beating the tests, but still failing in the real world."
He reportedly began work on HLE after conversations with Elon Musk, who criticized existing tests as “too easy.”
In response, Hendrycks collaborated with Scale AI to crowdsource new benchmark questions.
Over 1,000 contributors from 50+ countries — professors, PhDs, domain experts — joined the effort.
Their task: find what even GPT-4 can’t solve.
The Test-Building Process
According to CAIS documentation, HLE's questions went through a rigorous pipeline:
1. Initial screening: Over 70,000 questions were filtered through top LLMs. If models failed, or performed below random-guess thresholds, the questions were kept.
2. Human refinement: ~13,000 “fail” questions were reviewed by graduate-level experts.
3. Final vetting: A two-layer quality check — blind peer review and expert approval — ensured the intellectual rigor.
To encourage quality contributions, authors of the 50 best questions were awarded $5,000 each, with $500 each for the next 500.
In early 2025, Scale AI also launched a bug bounty to catch dataset errors.
The Results (So Far)
When HLE launched in Q1 2025, leading models scored surprisingly low. For example:
* GPT-4o: 3.3%
* Grok-2: 3.8%
* Claude 3.5 Sonnet: <10%
By June 2025, results had improved. According to Scale AI’s public leaderboard (scale.com/hle/leaderboard):
* Gemini 2.5 Pro Preview: 21.64%
* o3 (high) by OpenAI: 20.32%
* Claude Opus 4 (Thinking): 10.72%
The most significant breakthrough came in July 2025, when xAI announced that Grok 4 scored 25.4%, and that its enhanced version, Grok 4 Heavy, reached 44.4% using multi-agent problem-solving.
That said, even the best models still fail more than half the questions.
What Makes HLE Different?
Here’s why this test matters more than most AI benchmarks:
1. It Exposes Model Limitations
Unlike benchmarks that test recall or math puzzles, HLE focuses on "conceptual blind spots" and edge-case reasoning — areas where real-world AI deployments often fail.
2. It’s Not Culturally Narrow
HLE is multilingual and culturally diverse, including questions on Islamic finance, Confucian ethics, and global regulatory systems.
3. It Challenges the Plateau
Models had begun scoring >90% on MMLU, making it difficult to detect incremental improvements. HLE breaks that ceiling.
4. It Encourages Multi-Agent Reasoning
Some models, like Grok 4 Heavy, use "AI agent teams" to divide tasks — a major shift toward autonomous AI collaboration.
Expert Reactions: Smart ≠ Useful
Not everyone agrees that excelling at HLE means a model is “intelligent.” Kevin Zhou, a theoretical physicist who contributed to HLE, points out: “There’s a huge gap between passing a test and doing real research. An AI that can answer these questions may still fail at unstructured discovery.”
This reflects a broader concern in AI safety: "Do benchmarks reflect real-world utility, or are they intellectual mirages?"
What’s Next?
Hendrycks believes HLE scores will continue to climb. He predicts models will surpass 50% accuracy by the end of 2025, at which point AI may achieve “world-class oracle” status — able to outperform human experts across most fields. But even then, he notes, the community may need a new challenge.
“HLE could be the last academic test for AI — but it won’t be the last test of AI.”
By Claire Whitmore
August 29, 2025