A Test So Hard No AI System Can Pass It — Yet

If you’re looking for a new reason to be nervous about artificial intelligence, try this: Some of the world’s smartest humans are struggling to create tests that AI systems can’t pass.

For years, AI systems have been measured by giving new models a series of standard benchmark tests. Many of these tests consisted of challenging, SAT-caliber problems in areas such as math, science and logic. Comparing the models’ scores over time served as a rough measure of AI progress.

But eventually AI systems got too good at these tests, so new, harder tests were created — often with questions that graduate students might face on their exams.

Those tests aren’t holding up well, either. New models from companies like OpenAI, Google and Anthropic are scoring high on many of these Ph.D.-level challenges, limiting the tests’ usefulness and raising a chilling question: Are AI systems getting too smart for us to measure?

This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: a new evaluation, called “Humanity’s Last Exam,” that they claim is the toughest test ever administered to AI systems.

Humanity’s Last Exam is the brainchild of Dan Hendrycks, a well-known AI safety researcher and director of the Center for AI Safety. (An earlier name for the test, “Mankind’s Last Stand,” was rejected as overly dramatic.)

Mr. Hendrycks worked with Scale AI, an AI company where he is a consultant, to compile the test, which consists of roughly 3,000 multiple-choice and short-answer questions in fields ranging from analytical philosophy to rocket engineering, all designed to probe the capabilities of AI systems.

The questions were submitted by experts in these fields, including college professors and prize-winning mathematicians, who were asked to come up with some of the hardest questions to which they knew the answers.

Here, try your hand at a question about hummingbird anatomy from the test:

Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

Or, if physics is more your speed, try this:

A block is placed on a horizontal rail, along which it can slide without friction. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume that the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both of these quantities can be negative, which would indicate that the rod is in compression.) What is the value of (T1−T2)/W?

(I’d print the answers here, but that would spoil the test for any AI systems being trained on this column. Besides, I’m too dumb to verify the answers myself.)

Questions for Humanity’s Last Exam went through a two-step filtering process. First, submitted questions were given to leading AI models to solve.

If the models couldn’t answer them (or if, in the case of multiple-choice questions, the models did worse than random guessing), the questions were given to a set of human reviewers, who refined them and verified the correct answers. Experts who wrote top-rated questions were paid between $500 and $5,000 per question, and also received credit for contributing to the exam.

Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a handful of questions to the test. Three of them were chosen, all of which he told me were “along the upper range of what one might see on a graduate exam.”

Mr. Hendrycks, who helped create a widely used AI test known as Massive Multitask Language Understanding, or MMLU, said he was inspired to create harder AI tests by a conversation with Elon Musk. (Mr. Hendrycks is also a safety adviser to Mr. Musk’s AI company, xAI.) Mr. Musk, he said, raised concerns about the existing tests given to AI models, which he thought were too easy.

“Elon looked at the MMLU questions and said, ‘These are undergrad level. I want things that a world-class expert could do,’” Mr. Hendrycks said.

There are other tests that attempt to measure advanced AI capabilities in certain domains, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by the AI researcher François Chollet.

But Humanity’s Last Exam aims to determine how good AI systems are at answering complex questions across a wide variety of academic subjects, giving us what might be thought of as a general intelligence score.

“We’re trying to gauge the extent to which AI can automate very difficult intellectual labor,” Mr. Hendrycks said.

After compiling the list of questions, the researchers gave Humanity’s Last Exam to six leading AI models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. They all failed miserably. OpenAI’s o1 system scored the highest of the group, at 8.3 percent.

(The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement of news content related to AI systems. OpenAI and Microsoft have denied the claims.)

Mr. Hendrycks said he expected those scores to rise quickly, possibly surpassing 50 percent by the end of the year. At that point, he said, AI systems might be considered “world-class oracles,” capable of answering questions on any topic more accurately than human experts. And we might have to look for other ways to measure AI’s impact, such as examining economic data or judging whether it can make novel discoveries in fields like math and science.

“You can imagine a better version of this where we can ask questions that we don’t know the answers to yet, and we can verify whether the model is able to help us solve them,” said Summer Yue, Scale AI’s director of research and an organizer of the exam.

Part of what’s confusing about AI progress these days is how jagged it is. We have AI models capable of diagnosing diseases more effectively than human doctors, winning silver medals at the International Mathematical Olympiad and beating top human programmers in competitive coding challenges.

But these same models sometimes struggle with basic tasks, like arithmetic or writing metered poetry. That has given them a reputation for being astoundingly brilliant at some things and totally useless at others, and it has created vastly different impressions of how fast AI is improving, depending on whether you’re looking at its best or its worst outputs.

This jaggedness also makes these models hard to measure. I wrote last year that we need better evaluations for AI systems. I still believe that. But I also believe we need more creative methods of tracking AI progress that don’t rely on standardized tests, because most of what humans do, and what we fear AI will do better than us, can’t be captured on a written exam.

Mr. Zhou, the theoretical particle physics researcher who submitted questions to Humanity’s Last Exam, told me that while AI models were often impressive at answering complex questions, he did not see them as a threat to him or his colleagues, because their jobs involve much more than spitting out correct answers.

“There’s a big gulf between what it means to take a test and what it means to be a practicing physicist and researcher,” he said. “Even an AI that can answer these questions may not be ready to help with research, which is inherently less structured.”
