
AI Lab Tests If Robots Can Actually Do Real Science
The Allen Institute for AI (Allen AI) built virtual labs to test whether AI agents can actually conduct scientific experiments, not just memorize facts. The results show AI has improved dramatically but still struggles to apply knowledge the way human scientists do.
AI might ace science exams, but can it actually do science? That's the question researchers at Allen AI set out to answer, and the results are surprisingly hopeful.
Back in 2022, the institute released ScienceWorld, a virtual lab where AI agents attempt elementary school science experiments. The best AI models at the time scored below 10%, despite acing multiple choice tests on the same material. It revealed a huge gap between knowing facts and applying them.
Three years later, the newest AI models score in the low 80s on those same fourth grade experiments. That's real, measurable progress in teaching machines to think like scientists, not just memorize textbooks.
The team didn't stop there. In 2024, they released DiscoveryWorld, a much harder challenge set on a fictional space colony called Planet X. AI agents must design their own experiments from scratch, form hypotheses, test them, and analyze results across 120 different scientific scenarios.

These aren't simple tasks. Agents might investigate disease outbreaks or figure out how quantum reactors work, requiring hundreds of individual actions to solve. The benchmark tests whether AI truly understands what it discovered or just got lucky.
Human scientists with advanced degrees solve about 70% of the harder DiscoveryWorld challenges. The best AI systems? Only about 20%. That gap matters because it marks the difference between test-taking intelligence and genuine scientific reasoning.
The Ripple Effect
This research gives us something rare in the AI hype cycle: honest measurement. While companies announce flashy AI science agents on social media, these benchmarks provide the reality check we need.
Lead researcher Peter Jansen notes that if the best systems struggle with these challenges, we should be skeptical of grander claims. His work has been cited nearly 80 times and covered by New Scientist, helping the entire field understand where AI truly stands.
The benchmarks also reveal how fast AI is actually improving. ScienceWorld seemed impossibly hard when it launched, but models closed most of the gap within three years. DiscoveryWorld appears to be following the same trajectory, suggesting we're moving toward AI that can genuinely assist human scientists.
The real win isn't that AI can replace scientists. It's that we finally have reliable ways to measure progress toward AI that thinks scientifically, not just statistically. That honest assessment will help us build better tools for the researchers working to solve humanity's biggest challenges.
Based on reporting by Google: scientific discovery
This story was written by BrightWire based on verified news reports.


