Are LLM merge rates not getting better?

4diii 131 points 120 comments March 12, 2026
entropicthoughts.com · View on Hacker News

Related: Many SWE-bench-Passing PRs would not be merged - https://news.ycombinator.com/item?id=47341645 - March 2026 (149 comments)

Discussion Highlights (20 comments)

mike_hearn

That's an interesting claim, but I don't see it in my own work. They have got better but it's very hard to quantify. I just find myself editing their work much less these days (currently using GPT 5.4).

boonzeet

Interesting article, although with so few data points and such a specific time slice it is difficult to draw serious conclusions about the "improvement" of LLM models. It's notably lacking newer models (4.5 Opus, 4.6 Sonnet) and models from Gemini. LLMs appear to naturally progress in short leaps followed by longer plateaus, as breakthroughs such as chain-of-thought, mixture-of-experts, and sub-agents are developed.

raincole

No Gemini. No Opus 4.5. No GPT codex. As they said, ragebait used to be believable.

reedf1

Given that it is the general consensus that a step function occurred with Opus 4.5/4.6 only 3 months ago - it seems like an insane omission.

Flavius

> This means llms have not improved in their programming abilities for over a year. Isn’t that wild? Why is nobody talking about this?

Because it's not true. They have improved tremendously in the last year, but it looks like they've hit a wall in the last 3 months. Still seeing some improvements, but mostly in skills and token-use optimization.

jeffnv

I don't think it's true, but am I alone in wishing it was? My world is disrupted somewhat but so far I don't think we have a thing that upends our way of life completely yet. If it stayed exactly this good I'd be pretty content.

roxolotl

These studies are always really hard to judge the efficacy of. I would say though the most surprising thing to me about LLMs in the past year is how many people got hyped about the Opus 4.5 release. Having used Claude Code at work since it was released I haven't really noticed any step changes in improvement. Maybe that's because I've never tried to use it to one shot things? Regardless I'm more inclined to believe that 4.5 was the point that people started using it after having given up on copy/pasting output in 2024. If you're going from chat to agentic level of interaction it's going to feel like a leap.

ryanackley

I agree completely. I haven't noticed much improvement in coding ability in the last year. I'm using frontier models. What's been the game changer are tools like Claude Code. Automatic agentic tool loops purpose built for coding. This is what I have seen as the impetus for mainstream adoption rather than noticeable improvements in ability.

fluidcruft

Yeah, I'm not buying the last bit about lower MSE with one term in the model vs. two (the Brier score with one outcome category is just the MSE of the probabilities). That's the sort of thing that would make me go dig to find where I fucked up the calculation.
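The equivalence this comment leans on (for a single binary outcome category, the Brier score reduces to the mean squared error of the predicted probabilities) is easy to check in a few lines of Python; the forecasts and outcomes below are made-up numbers for illustration only:

```python
def brier_score(probs, outcomes):
    """Brier score for binary outcomes: mean squared difference
    between predicted probabilities and the 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def mse(preds, targets):
    """Plain mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Hypothetical probability forecasts and observed binary outcomes.
probs = [0.9, 0.2, 0.7, 0.5]
outcomes = [1, 0, 1, 1]

# With one outcome category, the two are the same number:
# ((-0.1)^2 + 0.2^2 + (-0.3)^2 + (-0.5)^2) / 4 = 0.0975
assert brier_score(probs, outcomes) == mse(probs, outcomes)
```

So the commenter's parenthetical is just the observation that "Brier" and "MSE of the probabilities" are two names for the same quantity in the binary case.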

davecoffin

I've been able to supercharge a hobby project of mine over the last couple of months using Opus 4.6 in Claude Code. I had to collaborate and write code still, but Claude did about 75% of the work to add meaningful new features to an iOS/Android native mobile app, including Live Activities, which are so overly complicated I would not have been able to figure them out on my own. I have it running in a folder that contains both my back-end API (Express) and my mobile app (NativeScript), so it does back-end and front-end work simultaneously to support new features. This wasn't possible 8 months ago.

curiouscube

There is a decent case for this thesis to hold true, especially if we look at the shift in training regimes and benchmarking over the last 1-2 years. Frontier labs don't seem to really push pure size/capability anymore; it's an all-in focus on agentic AI, which is mainly complex post-training regimes. There are good reasons why they don't or can't do simple param upscaling anymore, but still, it makes me bearish on AGI, since it's a slow but massive shift in goal setting. In practice this still doesn't mean 50% of white-collar work can't be automated, though.

thomascgalvin

Anecdotally, I haven't seen any real improvement from the AI tools I leverage. They're all good-ish at what they do, but all still lie occasionally, and all need babysitting. I also wonder how much of the jump in early 2025 comes from cultural acceptance by devs, rather than an improvement in the tools themselves.

ordersofmag

Even if one-shot LLM performance has plateaued (which I'm not convinced this data shows, given the omission of recent models that are widely claimed to be better), that misses the point that I see in my own work. The improved tooling and agent-based approaches that I'm using now make one-shot LLM performance only a small part of the puzzle in terms of how AI tools have accelerated the time from idea to decent code. For instance, the planning dialogs I now have with Claude are an important part of what's speeding things up for me. Also, the iterative use of AI to identify, track, and take care of small coding tasks (none of which are particularly challenging in terms of benchmarks) is simply more effective. Could this all have been done with the LLM engines of late 2024? Perhaps, but I think the fine-tuning (and conceivably the system prompts) that make current LLMs more effective at agent-centered workflows (including tool use) are a big part of it. One-shot performance at challenging tasks is an interesting, certainly foundational, metric. But I don't think it captures the important advances I see in how LLMs have gotten better over the last year in ways that actually matter to me. I rarely have a well-defined programming challenge and the obligation to solve it in a single shot.

WithinReason

If you look at a separate trend line for the smaller Sonnet models, you can see rapid improvement

antisthenes

They are getting better, but they are also hitting diminishing returns. There's only so much data to train on, and we are unlikely to see giant leaps in performance as we did in 2023/2024. 2026-27 will be the years of primarily ecosystem/agentic improvements and reducing costs.

camdenreslink

From my personal experience, they have gotten better, but they haven’t unlocked any new capabilities. They’ve just improved at what I was already using them for. At the end of the day they still produce code that I need to manually review and fully understand before merging. Usually with a session of back-and-forth prompting or manual edits by me. That was true 2 years ago, and it’s true now (except 2 years ago I was copy/pasting from the browser chat window and we have some nicer IDE integration now).

idorozin

My experience has been that raw “one-shot intelligence” hasn’t improved as dramatically in the last year, but the workflow around the models has improved massively. When you combine models with:

- tool use
- planning loops
- agents that break tasks into smaller pieces
- persistent context / repos

the practical capability jump is huge.

sunaurus

I am pretty convinced that for most types of day to day work, any perceived improvements from the latest Claude models for example were total placebo. In blind tests and with normal tasks, people would probably have no idea if they're using Opus 4.5 or 4.6.

varispeed

In my niche the Opus 4.6 has been a game changer. In comparison all other LLMs look stupid. I am considering cancelling all other subscriptions.

pu_pe

Benchmaxxing aside, if you are using those tools for programming on a regular basis it should be self-evident that they are improving. I find it very hard to believe that someone using LLMs today vs what was available one year ago (Claude Code released Feb 2025) would have any difficulty answering this question.
