SWE-bench Verified no longer measures frontier coding capabilities

kmdupree 277 points 156 comments April 26, 2026
openai.com

Discussion Highlights (20 comments)

w4yai

I don't understand websites that force translation into my native language. It's useful for many people, but where is the button for disabling it? And why is it enabled by default? "codage de pointe" sounds so weird and cringe in French.

1a527dd5

This feels very much like "we are now moving the goal posts".

neversupervised

Terminal Bench is the future

vintagedave

> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions, despite our best efforts in improving on this in the initial creation of SWE-bench Verified.

Is this saying a quarter* of the questions and answers were wrong, this whole time?! If so, how was this ever, in any way, a valid measurement? And what was the process for creating this benchmark and how did it end up with such an extraordinarily poor set of data? (There is a description later of how, which seems to be a high standard and I struggle to understand how it aligns with the other results they discuss.)

Kudos to them for highlighting the issues, but I am left with questions.

[*] Not one in four, but one in six, thanks commenters for the correction; leaving the original since, eh, my bad, and it lets replies make sense. I feel the broad point still stands!
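As a rough check of the corrected "one in six" figure, here is a minimal sketch in Python. The function and variable names are illustrative, not from the article, and the result is only a lower bound: it assumes nothing about the problems that were not audited.

```python
def flawed_share_lower_bound(audited_share: float, flawed_share_of_audited: float) -> float:
    """Lower bound on the share of the whole dataset with flawed tests.

    Only the audited subset contributes; unaudited problems are treated
    as unknown and therefore add nothing to the bound.
    """
    return audited_share * flawed_share_of_audited


# Figures quoted from the article: 27.6% audited, at least 59.4% of those flawed.
share = flawed_share_lower_bound(0.276, 0.594)
print(f"at least {share:.1%} of the dataset")  # about 16.4%, close to 1/6 (16.7%)
```

Because the audited subset was chosen from problems that models often failed, the 59.4% rate probably should not be extrapolated to the rest of the dataset.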

adityamwagh

> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.

No shit, Sherlock!

djoldman

> We have incorporated these findings into our recent evaluation efforts. In the last months we’ve chosen to report results from the public split of SWE-Bench Pro. We recommend other model developers do the same. SWE-bench Pro is not perfect, but empirically seems to suffer less from contamination issues. https://arxiv.org/pdf/2509.16941

Jcampuzano2

It's pretty clear that any benchmark that comes out will be outdated and exist within the training data in short order. There will always be an incentive to optimize specifically for these benchmarks, even if just for marketing material. Sure, there is a training cutoff, but it's usually only 3-6 months before the public release date. The problem with coding benchmarks then becomes creating novel benchmarks that are guaranteed not to be in the training data already and that don't borrow anything from previous benchmarks. In this regard I don't think any benchmark that was created before a given model is released should ever be considered valid or representative of model performance. The potential financial gain from including the data just to be able to market a minor improvement is too tempting. With that in mind they should honestly just stop including benchmarks altogether in marketing material. Let the model speak for itself and let the community decide, but of course that will never fly with corporate types with so much money on the line.

Jimmc414

Goodhart’s Law in reverse: what can’t be gamed gets rejected.

varispeed

Another issue with these benchmarks is that they measure a model you are unlikely to be routed to. My experience with Anthropic is that despite using Opus 4.6 and 4.7, most of the time the performance matches a Qwen with a low parameter count (in billions). I think there should be a way to verify which model is actually being used to process prompts, and that should be independently verified. At the moment it is so bad that you have to ask the model a verification question in the form of a non-trivial problem. If it solves it, then there is a chance you actually got Opus and not an impostor, so you can continue the session instead of restarting it and hoping you get routed correctly. But that does not help if the model is replaced with a cheaper one mid-session. I've lost so much work because of these shenanigans.

gpm

Curiously, Opus 4.7 claims to have an 87.6% pass rate and Mythos claims to have a 93.9% pass rate... leading to the conclusion that it's actually possible to "solve" the problems that OpenAI claims are incorrect.

ripvanwinkle

> In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix used as the ground-truth reference, known as the gold patch, or verbatim problem statement specifics for certain tasks, indicating that all of them have seen at least some of the problems and solutions during training

this statement alone seems to invalidate the SWE-bench tests

threepts

Why don't they ask their premier model to generate a bench for them? Jokes aside, a benchmark I look forward to is ARC-AGI-3. I tried out their human simulation, and it feels very reasoning heavy. Leaderboard: https://arcprize.org/leaderboard (Most premier models don't even pass 5 percent.)

retinaros

it never did

gertlabs

A better benchmark needs to be objectively scored, have multi-disciplinary breadth, and be scalable (no single correct answer). That's what we designed at https://gertlabs.com. We put a lot of thought into it, and kept it mostly (not fully) related to problem solving through coding.

DeathArrow

So we need to generate benchmarks after the models finish training. Or we need to keep the solutions to the benchmark problems as closed source.

kqr

It was never that great, it seems. For all of 2025 there was virtually no improvement in the rate at which models produced quality code. They only got better at passing automated tests. https://entropicthoughts.com/no-swe-bench-improvement

DeathArrow

So Opus 4.7 and Mythos are solving problems that are impossible to solve?

rustyhancock

I think an Olympiad format is better, but the financial incentive is such that it might be near impossible to stop leaks. I.e., a panel comes up with a series of problems, like Advent of Code or Project Euler but more complex and constrained. Benchmark outcomes could be performance points plus a measure of cost and time to solution (well, token count really). It's run a couple of times per year, which avoids overfitting. Over time the tasks can become more complex if needed. If they benchmax it into being able to complete full products from spec with robust implementations, amazing.

cowartc

The headline leads with contamination, but buried is that 59% of audited failures had test design defects. That's a measurement system never validated against ground truth before being adopted industry-wide as a score that mattered. They reported on it for two years but the gauge was broken the entire time.

neuroelectron

It's really naïve to think any of the big AI companies won't cheat
