We're running out of benchmarks to upper bound AI capabilities
gmays
15 points
8 comments
April 10, 2026
Related Discussions
Found 5 related stories in 59.9ms across 4,179 title embeddings via pgvector HNSW
- What if AI just makes us work harder? paulpauper · 42 pts · March 06, 2026 · 56% similar
- Why No AI Games? pavel_lishin · 67 pts · March 03, 2026 · 56% similar
- If AI has a bright future, why does AI think it doesn't? JCW2001 · 15 pts · March 06, 2026 · 56% similar
- Is anybody else bored of talking about AI? jakelsaunders94 · 614 pts · March 24, 2026 · 55% similar
- Why I'm Not Worried About Running Out of Work in the Age of AI 0bytematt · 34 pts · March 20, 2026 · 55% similar
Discussion Highlights (3 comments)
WarmWash
Start front-loading the models with 5k, 10k, 50k, or 100k tokens of messy, quasi-related context, and then run the benchmarks. These models are ridiculously powerful with a blank slate. It's when they get loaded down with all the necessary (and inevitably unnecessary) context to complete the task that they really start to crumble and fold.
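The padding idea above can be sketched as a tiny harness that buries a benchmark task under filler before it ever reaches the model. This is a minimal sketch, not anyone's actual eval setup: `pad_with_context` and the sample filler strings are hypothetical, and token counting is a crude whitespace split rather than a real tokenizer.

```python
# Sketch of the context-padding idea: wrap a benchmark task in roughly
# target_tokens of loosely related filler before sending it to the model.
# Whitespace-split "tokens" and the filler passages are assumptions.
import itertools

def pad_with_context(task: str, distractors: list[str], target_tokens: int) -> str:
    """Prepend filler passages until the prompt reaches ~target_tokens."""
    padded: list[str] = []
    count = 0
    for passage in itertools.cycle(distractors):
        if count >= target_tokens:
            break
        padded.append(passage)
        count += len(passage.split())
    # The actual task goes last, buried under the filler.
    return "\n\n".join(padded + [task])

filler = [
    "Quarterly revenue notes from an unrelated project.",
    "A stale design doc that happens to mention similar keywords.",
]
prompt = pad_with_context("What is 17 * 23?", filler, target_tokens=50)
```

Running the same task at 5k, 50k, and 100k target sizes and plotting accuracy against prompt length would make the claimed degradation measurable rather than anecdotal.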
nikisweeting
We can definitely make harder evals; the problem is that a good eval set is indistinguishable from good training data / market edge, so no one is incentivized to share their best eval sets publicly.
UltraSane
This is the least true thing ever. All LLMs are terrible at ARC-AGI-3. Every video game can be used as a benchmark. You could rank LLMs on how long they can keep a game of Dwarf Fortress running or how fast they can beat GTA5.
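The games-as-benchmarks idea reduces to a simple metric: run an agent in a game loop and score it by how many steps it survives. Below is a toy sketch of that loop; `ToyGame` and both policies are invented stand-ins (assumptions) for a real game like Dwarf Fortress and a real LLM-driven agent, not anything from ARC-AGI-3.

```python
# Toy survival-time benchmark: score a policy by how long it keeps a
# simulated game alive. ToyGame and the policies are hypothetical.
import random

class ToyGame:
    """Game ends when health hits zero; actions heal or risk damage."""
    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.health = 10

    def step(self, action: str) -> bool:
        # "rest" always heals; anything else is a gamble.
        delta = 1 if action == "rest" else self.rng.choice([-3, -1, 2])
        self.health = min(10, self.health + delta)
        return self.health > 0  # True while the game is still running

def survival_score(policy, max_steps: int = 1000, seed: int = 0) -> int:
    """Benchmark metric: number of steps before the game ends."""
    game = ToyGame(seed)
    for step in range(max_steps):
        if not game.step(policy(game.health)):
            return step
    return max_steps

careful_policy = lambda health: "rest"  # always heals, so it never dies
reckless_policy = lambda health: "explore"  # gambles every step
```

For a real LLM agent, `policy` would serialize the game state into a prompt and parse an action out of the model's reply; the survival count itself stays the same leaderboard-friendly scalar.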