Latest AI News: What Is Artificial Intelligence Really Capable Of?
Artificial intelligence has come a long way in recent years. Machines now pass tests that once seemed impossible. But how smart are these machines really? Scientists have created a new exam to find out.
What is Humanity’s Last Exam?
The test is called “Humanity’s Last Exam.” Nearly 1,000 experts from around the world worked together to create it. Their goal was both simple and ambitious: to measure the real difference between AI and human expertise.
Estimated reading time: 9 minutes
Table of contents
- Latest AI News: What Is Artificial Intelligence Really Capable Of?
- What is Humanity’s Last Exam?
- The History of Artificial Intelligence Testing
- Artificial Intelligence Research Creates an Unprecedented Challenge
- How Scientists Designed Humanity’s Last Exam, the Ultimate AI Test
- Latest AI News: Even Advanced Systems Struggle
- What Is Artificial Intelligence Good At?
- Generative AI News: Machines as Creative Collaborators
- ChatGPT News and the Future of AI Evaluation
- Artificial Intelligence News: What This Means for the Future
- Research About Artificial Intelligence
- Rethinking How We Measure Artificial Intelligence
- Understanding Artificial Intelligence’s True Potential
- Read the complete studies:
- Related News
The History of Artificial Intelligence Testing
For decades, researchers have used benchmarks to track how AI is improving. These tests check how well machines handle certain tasks. Early on, computers struggled with these challenges. Now, modern AI systems pass many of them with ease.
One well-known test is called MMLU, which stands for Massive Multitask Language Understanding. Today’s AI systems score over 90 percent on it. When machines do this well, the tests stop being useful, and scientists can’t track real progress anymore.
This problem, where tests become too easy for AI, is now common in artificial intelligence research. Challenges that once set new standards now seem outdated. The field needed a much tougher test.
Artificial Intelligence Research Creates an Unprecedented Challenge
To address this, researchers created Humanity’s Last Exam, or HLE. It has 2,500 questions covering many subjects, including math, humanities, and natural sciences. Some questions deal with ancient languages, while others need specialized medical knowledge.
Dr. Tung Nguyen from Texas A&M University helped create the test. He contributed 73 questions, focusing on mathematics and computer science.
“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human-level understanding,” Nguyen explained. “But HLE reminds us that intelligence isn’t just about pattern recognition.”
The research team shared their findings in the journal Nature. More details can be found at lastexam.ai. Their research paper explains how experts from around the world worked together to build this benchmark.
How Scientists Designed Humanity’s Last Exam, the Ultimate AI Test
Each question was carefully tested before being included. Researchers tried them out on top AI systems, and if any model got the answer right, that question was taken out. This way, the exam stayed out of reach for today’s AI.
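That screening step can be pictured as a simple loop, sketched below. This is an illustrative reconstruction, not the HLE team's actual pipeline; the `Question` class, the `answers_match` grader, and the model callables are hypothetical stand-ins.

```python
# Illustrative sketch of the screening step described above (not the HLE team's code):
# a candidate question survives only if none of the tested models answers it correctly.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Question:
    prompt: str
    reference_answer: str

def answers_match(model_answer: str, reference: str) -> bool:
    # Simplified exact-match grading; the real exam demands clear, checkable answer formats.
    return model_answer.strip().lower() == reference.strip().lower()

def screen_questions(candidates: List[Question],
                     models: List[Callable[[str], str]]) -> List[Question]:
    kept = []
    for q in candidates:
        solved = any(answers_match(m(q.prompt), q.reference_answer) for m in models)
        if not solved:                 # drop any question a tested model already solves
            kept.append(q)
    return kept
```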
Humanity’s Last Exam questions needed clear, checkable answers and couldn’t be solved with a quick internet search. Some tasks asked for translations of ancient Palmyrene inscriptions, while others involved spotting tiny anatomical features in birds. Many questions required deep knowledge in very specialized areas.
Nearly 1,000 experts from around the world took part in this project. They came from over 500 institutions in 50 countries, and most had advanced degrees such as PhDs or master’s degrees. To bring in top talent, the project offered a $500,000 prize pool.
Latest AI News: Even Advanced Systems Struggle
The results show a big gap in what AI can do. Even the most advanced systems struggled with HLE. GPT-4o scored only 2.7 percent, Claude 3.5 Sonnet got 4.1 percent, and OpenAI’s o1 model did a bit better at 8 percent.
The best systems tested, like Gemini 3.1 Pro and Claude Opus 4.6, scored between 40 and 50 percent. While that’s an improvement, it’s still much lower than what human experts can achieve.
“Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do,” Nguyen said. “Benchmarks provide the foundation for measuring progress and identifying risks.”
The research also looked at how confident AI models were in their answers. Often, they gave wrong answers but were very sure of themselves. This overconfidence is a real safety concern, as most models had calibration errors over 70 percent.
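One common way to quantify that gap is a calibration error, which compares how confident a model says it is with how often it is actually right. The sketch below computes a simple binned, root-mean-square version of that idea; the example numbers are invented, not figures from the study.

```python
# Simplified binned calibration error: compare a model's stated confidence
# with its actual accuracy inside each confidence bin. Illustrative only;
# the example numbers below are invented, not taken from the study.
import numpy as np

def calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Root-mean-square gap between confidence and accuracy, weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, squared_gap = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = confidences[in_bin].mean() - correct[in_bin].mean()
            squared_gap += in_bin.sum() / total * gap ** 2
    return float(np.sqrt(squared_gap))

# A model that reports 95% confidence but is right only 10% of the time is badly calibrated.
conf = np.array([0.95] * 100)
hits = np.array([1.0] * 10 + [0.0] * 90)
print(f"RMS calibration error: {calibration_error(conf, hits):.2f}")  # ~0.85
```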
What Is Artificial Intelligence Good At?
The exam shows where today’s AI is strong and where it falls short. Machines are great at spotting patterns and processing data, and they handle routine tasks well. But they have trouble with deep reasoning and specialized knowledge.
“This isn’t a race against AI,” Nguyen emphasized. “It’s a method for understanding where these systems are strong and where they struggle.”
That understanding helps build safer, more reliable technologies. It also reminds us why human expertise still matters.
AI Model Performance on Humanity’s Last Exam
Accuracy scores across leading AI systems tested on expert-level questions
| AI Model | Developer | Accuracy Score |
|---|---|---|
| Entry-Level Performance (0-10%) | | |
| GPT-4o | OpenAI | 2.7% |
| Claude 3.5 Sonnet | Anthropic | 4.1% |
| OpenAI o1 | OpenAI | 8.0% |
| Advanced Performance (40-50%) | | |
| Gemini 3.1 Pro | Google DeepMind | 40-50% |
| Claude Opus 4.6 | Anthropic | 40-50% |
| Human Baseline (For Comparison) | | |
| Human Experts | PhD/Master’s Level | ~89% |
Source: Nature (2026) – Texas A&M University / ScienceDaily
Test: Humanity’s Last Exam (HLE) – 2,500 expert-level questions across multiple disciplines
Generative AI News: Machines as Creative Collaborators
Recent artificial intelligence research reveals another dimension to these systems. While AI struggles with expert-level academic questions, it excels at creative collaboration. Scientists at Swansea University discovered this through a large-scale study.
More than 800 participants designed virtual cars using AI-supported tools. The system generated diverse design galleries showing many possibilities. These included effective designs, unusual ideas, and intentionally flawed options. Dr. Sean Walton led the research.
“When people were shown AI-generated design suggestions, they spent more time on the task,” he explained. “They produced better designs and felt more involved.”
The study appeared in ACM Transactions on Interactive Intelligent Systems. It challenges the common assumption that artificial intelligence is simply a replacement for human work; instead, the Swansea findings point to a different role, with AI as a creative partner that helps people explore new ideas.
Interestingly, imperfect AI suggestions proved valuable.
“Participants responded most positively to galleries that included a wide variety of ideas, including bad ones,” Walton noted.
These diverse options helped people move beyond initial assumptions. They encouraged creative risk-taking and prevented early fixation.

ChatGPT News and the Future of AI Evaluation
The research raises important questions about how we judge AI systems. Traditional measures, like counting clicks, miss the deeper ways AI shapes how people think and engage.
Researchers say we need broader ways to evaluate AI. It’s important to understand how AI affects human creativity. As AI becomes part of creative fields like engineering, architecture, music, and game design, working together becomes key.
The Swansea research examined how AI systems influence cognitive, emotional, and behavioral engagement. The study involved both a large field study with 808 participants and a controlled lab study with 12 participants. Researchers used the Genetic Car Designer tool, which employed MAP-Elites algorithms to generate diverse design suggestions.
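MAP-Elites is a quality-diversity algorithm: instead of keeping one best solution, it keeps the best solution found for every cell of a coarse "behavior" grid, which is what lets a gallery show many different kinds of designs rather than near-duplicates of one winner. The sketch below is a generic, minimal MAP-Elites loop on a toy problem, not the Genetic Car Designer's actual implementation; the objective and behavior descriptors are made up for illustration.

```python
# Minimal, generic MAP-Elites sketch on a toy problem (not the Genetic Car Designer's code).
# The archive keeps the best-scoring solution for every cell of a coarse
# "behavior" grid, so the final gallery is diverse by construction.
import random

GRID = 10                                      # behavior space split into GRID x GRID cells

def random_solution():
    return [random.uniform(0.0, 1.0) for _ in range(5)]

def mutate(sol):
    return [min(1.0, max(0.0, x + random.gauss(0, 0.1))) for x in sol]

def fitness(sol):
    return -sum((x - 0.5) ** 2 for x in sol)   # toy objective: genes close to 0.5 score best

def behavior(sol):
    # Map a solution to a grid cell, here simply using its first two genes as descriptors.
    return (min(GRID - 1, int(sol[0] * GRID)),
            min(GRID - 1, int(sol[1] * GRID)))

def map_elites(iterations=20_000):
    archive = {}                                # cell -> (fitness, solution)
    for i in range(iterations):
        if i < 500 or not archive:              # bootstrap with random solutions
            candidate = random_solution()
        else:                                   # then mutate a randomly chosen elite
            candidate = mutate(random.choice(list(archive.values()))[1])
        cell, fit = behavior(candidate), fitness(candidate)
        if cell not in archive or fit > archive[cell][0]:
            archive[cell] = (fit, candidate)    # keep only the best elite per cell
    return archive

gallery = map_elites()
print(f"{len(gallery)} diverse elites kept")    # roughly one per occupied behavior cell
```

The key design choice is that the archive is indexed by behavior rather than by score alone, so variety is preserved automatically as the search runs.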
The results showed that people who looked at AI-generated galleries spent more time on their tasks and created better designs. Still, opinions about how helpful the AI was varied a lot. Some liked the structured variety from MAP-Elites algorithms, while others preferred random suggestions for exploring new ideas.
“As the technology evolves, the question is not only what AI can do,” Walton said. “It’s how it can help us think, create and collaborate more effectively.”
Artificial Intelligence News: What This Means for the Future
Humanity’s Last Exam is set to be a lasting benchmark for future AI progress. Researchers made some of Humanity’s Last Exam questions public but kept others secret, so AI models can’t just memorize the answers.
The exam shows there are still big differences between AI and human intelligence. Even with fast advances in technology, machines don’t have real expert-level understanding. They still struggle with specialized knowledge and deep reasoning.
But the research on creative collaboration gives reason for hope. AI doesn’t have to replace human intelligence. Instead, it can boost and support what people can do. The technology is most useful when it helps human creativity and exploration.
Research About Artificial Intelligence
Taken together, these artificial intelligence studies reveal several important insights:
- Current AI systems remain far from human expert performance on specialized tasks.
- Traditional benchmarks no longer adequately measure AI capabilities.
- AI shows promise as a creative collaborator rather than just a replacement tool.
- Diverse AI outputs, including imperfect suggestions, enhance human creativity.
- Evaluation methods need to capture deeper effects on human thinking.
- Behavioral engagement with AI tools correlates with better design outcomes.
- The gap between AI confidence and actual accuracy presents safety concerns.

Rethinking How We Measure Artificial Intelligence
The Swansea research brings up important ideas for judging human-AI systems. The study looked at three types of engagement: cognitive, which is about attention and effort; emotional, which covers feelings and interest; and behavioral, which means actions and participation.
The findings showed that just looking at AI suggestions changed how people designed things. This challenges current ways of measuring AI, which often only count clicks or copies. The time people spent viewing galleries was just as important as direct actions.
The research also found that designers changed their approach as they worked. Some focused on making the best designs, while others experimented and learned from both good and bad examples. This suggests that AI systems should adjust to different user preferences and stages of the design process.
Understanding Artificial Intelligence’s True Potential
These studies give us important insights into what AI can and can’t do. Humanity’s Last Exam shows that AI still has a long way to go. Even the most advanced systems have trouble with expert-level academic tasks, and there’s still a big gap between machines and human experts.
Still, the research on creative collaboration gives an optimistic view. When designed well, AI can help people do more. It encourages exploration, keeps people engaged, and leads to better results. The real value of AI is in working together with humans, not replacing them.
As AI keeps evolving, it’s important to measure its abilities accurately. Policymakers, developers, and users all need a clear understanding of what these systems can do. This knowledge helps everyone make better choices about how to use and manage AI.
The research shows that human expertise is still essential. Machines are good at some tasks, but they don’t have the depth of understanding that people do. The future will likely see humans and AI working together, achieving more as a team than either could alone.
Read the complete studies:
- Nature: “A benchmark of expert-level academic questions to assess AI capabilities” (2026)
- ACM Transactions on Interactive Intelligent Systems: “From Metrics to Meaning: Time to Rethink Evaluation in Human–AI Collaborative Design” (2026)