In 2019, an A.I. researcher, François Chollet, designed a puzzle game that was meant to be easy for humans but hard for machines.
The game, called ARC, became an important way for experts to track the progress of artificial intelligence and push back against the narrative that scientists are on the brink of building A.I. technology that will outsmart humanity.
Mr. Chollet’s colorful puzzles test the ability to quickly identify visual patterns based on just a few examples. To play the game, you look closely at the examples and try to find the pattern.
Each example uses the pattern to transform a grid of colored squares into a new grid of colored squares:
The pattern is the same for every example.
Now, fill in the new grid by applying the pattern you learned in the examples above.
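For readers curious about the underlying format: the publicly released ARC dataset stores each puzzle as a small JSON file of "train" and "test" input-output pairs, where every grid is a list of rows and every cell is an integer from 0 to 9 standing for a color. Below is a minimal Python sketch in that shape, with a deliberately simple hidden pattern; the toy task and the `solve` function are hypothetical illustrations, not material from the benchmark itself.

```python
# A toy task in the ARC JSON format: grids are lists of rows,
# and each cell is an integer color code from 0 to 9.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]]},  # expected output: [[0, 3], [0, 0]]
    ],
}

def solve(grid):
    """Hypothetical solver for this toy task only: the hidden
    pattern here is a 180-degree rotation of the grid."""
    return [list(reversed(row)) for row in reversed(grid)]

# Check that the guessed pattern reproduces every training example,
# then apply it to the unseen test input.
assert all(solve(p["input"]) == p["output"] for p in task["train"])
print(solve(task["test"][0]["input"]))  # [[0, 3], [0, 0]]
```

Real ARC tasks hide far less guessable transformations, which is what makes finding the pattern from only a few examples so hard for machines.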
For years, these puzzles proved to be nearly impossible for artificial intelligence, including chatbots like ChatGPT.
A.I. systems typically learned their skills by analyzing huge amounts of data culled from across the internet. That meant they could generate sentences by repeating concepts they had seen a thousand times before. But they couldn’t necessarily solve new logic puzzles after seeing only a few examples.
That is, until recently. In December, OpenAI said that its latest A.I. system, called OpenAI o3, had surpassed human performance on Mr. Chollet’s test. Unlike the original version of ChatGPT, o3 was able to spend time considering different possibilities before responding.
Some saw it as proof that A.I. systems were approaching artificial general intelligence, or A.G.I., which describes a machine that’s as smart as a human. Mr. Chollet had created his puzzles as a way of showing that machines were still a long way from this ambitious goal.
But the news also exposed the weaknesses in benchmark tests like ARC, short for Abstraction and Reasoning Corpus. For decades, researchers have set up milestones to track A.I.’s progress. But once these milestones were reached, they were exposed as insufficient measures of true intelligence.
Arvind Narayanan, a Princeton computer science professor and co-author of the book “AI Snake Oil,” said that any claim that the ARC test measured progress toward A.G.I. was “very much iffy.”
Still, Mr. Narayanan acknowledged that OpenAI’s technology demonstrated impressive skills in passing the ARC test. Some of the puzzles are not as easy as the one you just tried.
The one below is a little harder, and it, too, was correctly solved by OpenAI’s new A.I. system:
A puzzle like this shows that OpenAI’s technology is getting better at working through logic problems. But the average person can solve puzzles like this one in seconds. OpenAI’s technology consumed significant computing resources to pass the test.
Last June, Mr. Chollet teamed up with Mike Knoop, co-founder of the software company Zapier, to create what they called the ARC Prize. The pair financed a contest that promised $1 million to anyone who built an A.I. system that exceeded human performance on the benchmark, which they renamed “ARC-AGI.”
Companies and researchers submitted over 1,400 A.I. systems, but no one won the prize. All scored below 85 percent, the threshold meant to mark the performance of a “smart” human.
OpenAI’s o3 system correctly answered 87.5 percent of the puzzles. But the company ran afoul of competition rules because it spent nearly $1.5 million in electricity and computing costs to complete the test, according to pricing estimates.
OpenAI was also ineligible for the ARC Prize because it was not willing to publicly share the technology behind its A.I. system through a practice called open sourcing. Separately, OpenAI ran a “high-efficiency” variant of o3 that scored 75.7 percent on the test and cost less than $10,000.
“Intelligence is efficiency. And with these models, they are very far from human-level efficiency,” Mr. Chollet said.
(The New York Times sued OpenAI and its partner, Microsoft, in December for copyright infringement of news content related to A.I. systems.)
On Monday, the ARC Prize introduced a new benchmark, ARC-AGI-2, with hundreds of additional tasks. The puzzles are in the same colorful, grid-like game format as the original benchmark, but are more difficult.
“It’s going to be harder for humans, still very doable,” said Mr. Chollet. “It will be much, much harder for A.I. — o3 is not going to be solving ARC-AGI-2.”
Here is a puzzle from the new ARC-AGI-2 benchmark that OpenAI’s system tried and failed to solve. Remember, the same pattern applies to all the examples.
Now try to fill in the grid below according to the pattern you found in the examples:
This shows that although A.I. systems have become better at handling problems they have never seen before, they still struggle with puzzles that remain doable for humans.
Here are a few additional puzzles from ARC-AGI-2, which focuses on problems that require multiple steps of reasoning:
As OpenAI and other companies continue to improve their technology, they may pass the new version of ARC. But that does not mean that A.G.I. will be achieved.
Judging intelligence is subjective. There are countless intangible indicators of intelligence, from composing works of art to navigating moral dilemmas to intuiting emotions.
Companies like OpenAI have built chatbots that can answer questions, write poetry and even solve logic puzzles. In some ways, they have already exceeded the powers of the brain. OpenAI’s technology has outperformed its chief scientist, Jakub Pachocki, on a competitive programming test.
But these systems still make mistakes that the average person would never make. And they struggle to do simple things that humans can handle.
“You’re loading the dishwasher, and your dog comes over and starts licking the dishes. What do you do?” said Melanie Mitchell, a professor at the Santa Fe Institute who specializes in A.I. “We sort of know how to do that, because we know all about dogs and dishes and all that. But would a dishwashing robot know how to do that?”
To Mr. Chollet, the ability to efficiently acquire new skills is something that comes naturally to humans but is still lacking in A.I. technology. And it’s what he has been targeting with the ARC-AGI benchmarks.
In January, the ARC Prize became a nonprofit foundation that serves as a “north star for A.G.I.” The ARC Prize team expects ARC-AGI-2 to last for about two years before it is solved by A.I. technology — though they would not be surprised if it happened sooner.
They have already started work on ARC-AGI-3, which they hope to debut in 2026. An early mock-up hints at a puzzle that involves interacting with a dynamic, grid-based game.
This is a step closer to what people deal with in the real world — a place filled with movement. It does not stand still like the puzzles you tried above.
Even this, however, will go only part of the way toward showing when machines have surpassed the brain. Humans navigate the physical world — not just the digital. The goal posts will continue to shift as A.I. advances.
“If it’s no longer possible for people like me to produce benchmarks that measure things that are easy for humans but impossible for A.I.,” Mr. Chollet said, “then you have A.G.I.”