AI Still Stumped by New Expert-Level Test, Showing Its Limits

Top AI models score just 25% on a new test designed by thousands of experts, showing these systems still have a long way to go. The result suggests human expertise remains essential for complex thinking.

The world's smartest AI systems just got humbled by a test designed to prove what they can't do yet.

An international team of thousands of experts created Humanity's Last Exam, a collection of 2,500 graduate-level questions spanning math, sciences, and humanities. Each question was carefully crafted to stump AI models and can't be answered with a simple internet search.

The results are encouraging for humans. OpenAI's latest model, GPT-5, scored only 25%, while its predecessor GPT-4o managed just 2.7%. Top models from Google and Anthropic performed similarly, showing that even the most advanced AI systems struggle with expert-level reasoning.

The test includes questions that require deep knowledge and multi-step thinking. Translating ancient Roman inscriptions, identifying hummingbird bone structures, mapping complex chemical reactions, and analyzing Hebrew pronunciation all made the cut. These aren't trick questions but real problems that experts tackle in their fields.

Researchers collected roughly 70,000 question submissions from experts in 50 countries. Only questions that stumped multiple AI models advanced to the final test, ensuring the benchmark would remain challenging as technology improves.

The team kept most questions private to prevent AI companies from training their models specifically to pass this test. This approach keeps the benchmark honest and useful for measuring genuine progress rather than memorization.

Why This Inspires

This test isn't about beating AI down. It's about understanding what these tools can and can't do as they become more common in research and decision-making. Having an objective way to measure AI capabilities helps everyone set realistic expectations.

The gaps revealed by this test also highlight areas where human expertise shines. Complex reasoning across disciplines, nuanced understanding, and creative problem-solving remain uniquely human strengths. As AI handles routine tasks, humans can focus on the sophisticated thinking that makes us irreplaceable.

Some critics argue the test's name sounds too dramatic, while others note it can't measure everything that makes human thinking special. They're right that short-answer tests miss the messy, creative work of real expertise, such as writing papers or developing new theories.

But the test serves an important purpose. As AI companies race to develop more powerful models, having a difficult benchmark prevents premature claims of superhuman intelligence. It keeps the conversation grounded in reality.

The improving scores show AI is advancing, just not as quickly as headlines might suggest. Human expertise isn't going anywhere.

Based on reporting by Singularity Hub

This story was written by BrightWire based on verified news reports.
