Show HN: New Benchmark from SWE-bench team is 0% solved
lieret
15 points
2 comments
May 05, 2026
Related Discussions
- SWE-bench Verified no longer measures frontier coding capabilities kmdupree · 277 pts · April 26, 2026 · 59% similar
- Show HN: A new benchmark for testing LLMs for deterministic outputs khurdula · 50 pts · April 29, 2026 · 55% similar
- Many SWE-bench-Passing PRs would not be merged mustaphah · 199 pts · March 11, 2026 · 49% similar
- Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview GodelNumbering · 325 pts · April 27, 2026 · 48% similar
- Show HN: PhAIL – Real-robot benchmark for AI models vertix · 20 pts · March 31, 2026 · 48% similar
Discussion Highlights (2 comments)
ivarv
This looks pretty interesting, but I don't understand why decompilers are not allowed. If this benchmark were aimed at recreating a SaaS/server-based product, the restriction might make more sense, but given that decompilers are readily available in practice, the "no read" restriction seems to artificially increase the challenge level.
dnnehgf
Figures 10 and 11 in the paper are interesting. I suppose at a high level this works because it is much easier for the evaluator to generate tests with fuzzing than it is for the model to probe. This method somehow clarifies the way in which code generation is curve fitting, where the output curve is some linear transformation of the inputs. Kind of satisfying that when all is said and done, and we have a machine that can fit curve descriptions as well as or better than humans, we won't be any closer to explaining how anything really works.