DeepSWE: A contamination-free benchmark for long-horizon coding agents
ammar_x
40 points
11 comments
May 26, 2026
Related Discussions
- SWE-bench Verified no longer measures frontier coding capabilities kmdupree · 277 pts · April 26, 2026 · 58% similar
- DeepSeek reasonix, DeepSeek native coding agent with high caching and low cost Alifatisk · 507 pts · May 24, 2026 · 56% similar
- SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via CI mpweiher · 114 pts · March 08, 2026 · 55% similar
- DeepSeek V4: The Open-Source Model Frontier Labs Feared HelloAi · 61 pts · May 15, 2026 · 51% similar
- DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence cmrdporcupine · 146 pts · April 24, 2026 · 50% similar
Discussion Highlights (7 comments)
ammar_x
https://x.com/serenaa_ge/status/2059308400866111692
dnnssl2
70% at launch seems pretty saturated. Why ship a benchmark that frontier models are about to top out on?
charleyslee
tysm for posting this! i'm charley, cofounder of datacurve, we created this benchmark and my team and i are here to answer any q's.
toastmaster11
What happened that placed Opus 4.6 on max reasoning below Sonnet 4.6 on a lowered reasoning level?
vanuatu
This benchmark matches my experience with GPT (I occasionally go back to Claude when I run into limits and frequently run into forgotten requirements and reward hacking). I do have two questions / critiques:
- The verifier doesn't seem to check for code quality / maintainability, which I would posit is one of the major qualms with SOTA coding models, i.e. they lack code 'taste'. Ofc this is a difficult problem to solve at scale, but wanted to point that out nonetheless.
- This almost feels written like a critique on SWE Bench Pro. Hopefully they fix the issues with that benchmark!
JacobAsmuth
I wonder why they didn't test Gemini 3.5 Flash (High).
gertlabs
While this benchmark has interesting results, the "Contamination free" label only works for the initial release of the benchmark. It still has the same fundamental design issues of any other benchmark: there's a single correct answer for tasks. It looks to be largely saturated upon release.
What they did well: normalizing the harness to mini-swe-agent. Models should be able to generalize to different tools at this point. When they struggle to do that (like most Google models), they're unlikely to be useful in practice. And that kind of generalization is an inherent part of intelligence.
For a benchmark that scales, you need to remove the ceiling and provide environments with measurable goals that are NOT a single correct answer, and sufficiently complex evaluation criteria to scale well beyond the current frontier. We do this by running multi-agent simulations with large action spaces at https://gertlabs.com/rankings.
We're still relatively unknown in the benchmarking space, but by rotating the pool of environments and ensuring the optimal strategies in the environments themselves are affected by other agents participating in the space, we expect we'll be able to resist contamination as major labs start investing more effort to climb the leaderboard. We've already seen Chinese labs taking an interest.