ARC-AGI-3

lairv 328 points 207 comments March 25, 2026
arcprize.org · View on Hacker News

https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf

Discussion Highlights (20 comments)

CamperBob2

Without reading the .pdf, I tried the first game it gave me, at https://arcprize.org/tasks/ls20 , and I couldn't begin to guess what I was supposed to do. Not sure what this benchmark is supposed to prove.

Edit: Having messed around with it now (and read the .pdf), it seems like they've left behind their original principle of making tests that are easy for humans and hard for machines. I'm still not convinced that a model that's good at these sorts of puzzles is necessarily better at reasoning in the real world, but am open to being convinced otherwise.

tasuki

So ARC-AGI was released in 2019. That's been solved, then there was ARC-AGI-2, and now there's ARC-AGI-3. What is even the point? Will ARC-AGI-26 hit the front page of Hacker News in 2057?

Stevvo

Maybe I'm just not intelligent, but I gave it a couple of minutes and couldn't figure out WTF the game wants from you or how to win it.

typs

My takeaway from playing a number of levels is that I am definitely not AGI

nubg

Any benchmarks?

dinkblam

what is the evidence that being able to play games equates to AGI?

semiinfinitely

i feel bad that we make the LLMs play this

chaise

The official leaderboard for ARC-AGI-3 for current LLMs: https://arcprize.org/leaderboard (you should select the third leaderboard). Crazy: 0.1% on average, lmao.

OsrsNeedsf2P

Some of these tasks are crazy. Even I can't beat them: https://arcprize.org/tasks/ar25

6thbit

Not clear to me the diff with v2?

baron816

Looks like I’m generally unintelligent

Tiberium

https://x.com/scaling01 has called out a lot of issues with ARC-AGI-3, some of them (directly copied from tweets, with minimal editing):

- Human baseline is "defined as the second-best first-run human by action count". Your "regular people" are people who signed up for puzzle solving, and you don't compare the score against a human average but against the second-best human solution.
- The scoring doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency, meaning if a human took 10 steps to solve a level and the model took 100 steps, the model gets a score of 1% ((10/100)^2).
- 100% just means that all levels are solvable. The 1% number uses completely different and extremely skewed scoring based on the 2nd-best human score on each level individually. They said that the typical level is solvable by 6 out of 10 people who took the test, so assume the median human solves about 60% of puzzles (not quite right, but close). If the median human takes 1.5x more steps than the 2nd-fastest solver, the median score is 0.6 * (1/1.5)^2 = 26.7%. Now take the bottom-10% solver, who maybe completes 30% of levels but takes 3x more steps: they would get a score of about 3%.
- The scoring is designed so that even if AI performs at a human level it will score below 100%.
- No harness at all, and a very simplistic prompt.
- Models can't use more than 5x the steps that a human used.
- Notice how they also gave higher weight to later levels? The benchmark was designed to detect the continual learning breakthrough. When it happens in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES".
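
The squared-efficiency arithmetic in that comment can be sketched as a few lines of Python. This is a reconstruction of the scoring *as the commenter describes it*, not ARC Prize's actual implementation; the function name and the example solve rates (60%, 30%) and step ratios (1.5x, 3x) are the commenter's hypotheticals.

```python
def arc3_score(solve_rate, step_ratio):
    """Per the comment above: fraction of levels solved, weighted by
    squared step-efficiency relative to the 2nd-best human run.
    step_ratio = (2nd-best human's steps) / (this solver's steps)."""
    return solve_rate * step_ratio ** 2

# A model that solves every level but takes 100 steps where the
# reference human took 10 -> the "1%" figure:
print(arc3_score(1.0, 10 / 100))   # 0.01

# Hypothetical median human: solves ~60% of levels at 1.5x the steps:
print(arc3_score(0.6, 1 / 1.5))    # ~0.267

# Hypothetical bottom-decile human: solves ~30% at 3x the steps:
print(arc3_score(0.3, 1 / 3))      # ~0.033
```

The squaring is what makes the metric so punishing: doubling the step count doesn't halve the score, it quarters it.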

BeetleB

> As long as there is a gap between AI and human learning, we do not have AGI.

Back in the 90's, Scientific American had an article on AI - I believe this was around the time Deep Blue beat Kasparov at chess. One AI researcher's quote stood out to me: "It's silly to say airplanes don't fly because they don't flap their wings the way birds do." He was saying this with regards to the Turing test, but I think the sentiment is equally valid here. Just because a human can do X and the LLM can't doesn't negate the LLM's "intelligence", any more than an LLM doing a task better than a human negates the human's intelligence.

abraxas

Even if tomorrow's models get good enough to complete these games we won't be able to proclaim AGI. In the realm of silly computer games alone I'm going on record saying that there are plenty of 8 bit games that AIs will trip on even when this benchmark is crushed. 2D platformers like Manic Miner or Mario need skills that none of these games appear to capture.

ranyume

This is an interesting update. And a big challenge for companies and labs. The new tools for measurement are indeed what I'd like out of future agents, and agents that solve the games will need to use different subsystems to do so. This is basically optimization for achieving goals (as opposed to prompt engineering / magic spells to make the LLM do what is told to do) which imo is the future we should aspire to build.

andai

In the year 2032: ARC-AGI-13: Almost definitely AGI this time!

spprashant

I played the demo, but it definitely took me a minute to grok the rules. I don't know if this is how we want to measure AGI. In general, I believe we should probably stop this pursuit of human-equivalent intelligence that encourages people to think of these models as human replacements. LLMs are clearly good at a lot of things; let's focus on how we can augment and empower the existing workforce.

cedws

It's like playing The Witness. Somebody should set LLMs loose on that.

lukev

I'm not sure how this relates to AGI. This measures the ability of an LLM to succeed in a certain class of games. Sure, that could be a valuable metric of how powerful (or even generally powerful) an LLM is. Humans may or may not be good at the same class of games. We know there exists a class of games (including most human games like checkers/chess/go) at which computers (not LLMs!) already vastly outpace humans. So the argument for whether an LLM is "AGI" should not be whether it does well on any given class of games, but whether that class of games is representative of "AGI" (however you define that). It seems unlikely that this set of games is a definition meaningful for any practical, philosophical, or business application?

WarmWash

Captcha's about to get wild. Maybe the internet will briefly go back to a place mainly populated with outliers.
