Evals will break
rajveerb
23 points
2 comments
May 20, 2026
Related Discussions
- Ask HN: How are people doing AI evals these days? yelmahallawy · 14 pts · March 10, 2026 · 44% similar
- Vibecoders Can't Build for Longevity dominicq · 18 pts · March 23, 2026 · 42% similar
- The difficulty of making sure your website is broken mcpherrinm · 61 pts · April 10, 2026 · 41% similar
- Elevated errors on Claude.ai, API, Claude Code redm · 242 pts · April 15, 2026 · 41% similar
- SWE-bench Verified no longer measures frontier coding capabilities kmdupree · 277 pts · April 26, 2026 · 41% similar
Discussion Highlights (2 comments)
rajveerb
I read through this blog post and it's timely given how close the models are to maxing out the benchmarks/evals. One thing it did not address, but which would be interesting to discuss, is benchmarks/evals that conflict. Are there desirable emergent behaviors that might not be optimized because the evals penalize them?
ppeetteerr
The argument in the article is backwards. Evals test the stability and boundaries of a concept. They are not created before the concept has been prototyped (which the author acknowledges). An eval does not somehow break silently because of new capabilities in an LLM; it wouldn't be a good eval if it did. What it does is steer the LLM towards specific goals. If anything, an argument can be made that evals restrict creativity and experimentation by narrowing goals. If the argument is that evals need to be written before some new behavior can be devised, that's incorrect. There are an infinite number of evals that test for things which cannot be done. Only when something has been demonstrated to work in a specific context can an eval be written.