Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge
bazlightyear
131 points
56 comments
May 03, 2026
Related Discussions
- Kimi K2.6: Advancing open-source coding meetpateltech · 628 pts · April 20, 2026 · 67% similar
- Kimi K2.6: Advancing Open-Source Coding nekofneko · 39 pts · April 20, 2026 · 65% similar
- Kimi K2.6-code-preview is now available jrop · 12 pts · April 13, 2026 · 60% similar
- CPUs Aren't Dead. Gemma2B Out Scored GPT-3.5 Turbo on Test That Made It Famous fredmendoza · 95 pts · April 15, 2026 · 60% similar
- Kimi K2.6 kbumsik · 11 pts · April 20, 2026 · 57% similar
Discussion Highlights (20 comments)
magicalhippo
In a single challenge, measured by how performant the solution was. Kimi K2.6 is definitely a frontier-sized model, so on the one hand it's not that surprising it's up there with the closed frontier models. Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.
PedroBatista
Great to know, but what was the cost, both in terms of $$ and tokens used? Not to invalidate these benchmark results, because they are useful, but the real usefulness is what the models are capable of doing when real people interact with them at scale. Regardless, this is good news, because now that Microsoft is basically giving up their all-in strategy with GitHub's Copilot and Anthropic is playing the "I'm too good for you" game, it's about time for them to get pressed into not making this AI world into a divide between the haves and the have-nots.
beering
I’m a little confused as to the setup. It was asking each model to one-shot a script and then the scripts faced off? Were the models given a computer environment? Or a test server to iterate against?
Frannky
I have to try Kimi. I was looking for an alternative. If you have any experience or advice, please share. I saw Kimi is at the top of the OpenRouter ranking.
elromulous
Is the site just slashdotted rn? Can anyone get to it?
jakemanger
What are the GPU VRAM requirements for this thing? Awesome to have an open model that can compete, but damn, it would be so much better if you could run it locally. Otherwise, it's so difficult to run (e.g. self-host) that it's just way more convenient to pay OpenAI, Claude, etc.
slashdave
I was surprised by the ranking, until I read what the test was. Not horribly relevant for coding. The current ranking of all tests makes more sense (well, except for how well Gemini does) https://aicc.rayonnant.ai
pbreit
All my co-workers say Claude blows away Gemini. Is it really that good? How can I try Kimi?
justech
I’ve been maining Kimi K2.6 through opencode go and openrouter for a week and I can say it’s the same experience as when I was maining Sonnet 3.5/4 late last year. Not as good or as fast as Claude Code on Opus now, but definitely enough for casual/hobby use. The best part is having multiple choices of provider: if opencode gimps their service, I’ll switch.
jrecyclebin
I absolutely love Kimi's personality - some of the things it says are so out there! And it's been great for very focused, iterative work. Its weakness is that it seems to yak on and on when it needs to plan out something big or read through and make sense of how to use a niche piece of a complex library. To the point where it can fill up its 256k window - and rack up a bill. (No cache.) I have had a better experience with GLM 5.1 in those cases. Anyone out there relate?
gertlabs
I'm glad we're seeing a shift towards objectively scored tests. We've been doing this at scale at https://gertlabs.com/rankings, and although the author looks to be running unique one-off samples, it's not surprising to see how well Kimi K2.6 performed. Based on our testing, for coding especially, Kimi is within statistical uncertainty of MiMo V2.5 Pro for top open-weights model, and performs much better with tools than DeepSeek V4 Pro. GPT 5.5 has a comfortable lead, but Kimi is on par with or better than Opus 4.6. The problem with Kimi K2.6 is that it's one of the slower models we've tested.
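One way to read "within statistical uncertainty" is sketched below, purely as an illustration: compute a 95% Wilson score interval for each model's win rate and check whether the intervals overlap. The win counts are hypothetical and are not taken from the rankings linked above; the function names are made up for this example, and overlapping intervals are only a conservative heuristic, not a formal significance test.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a win rate (z=1.96 is roughly 95%)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return center - half, center + half

def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical tallies, for illustration only.
model_a = wilson_interval(58, 100)
model_b = wilson_interval(54, 100)
print(model_a, model_b, intervals_overlap(model_a, model_b))
```

With small sample counts the intervals are wide, which is why one-off challenge results rarely separate models that are close in ability.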
SomaticPirate
This seems to be testing the models on leetcode-style prompts that also require the model to implement TCP calls to send the results. Interesting, but probably not an apples-to-apples comparison. The fact that only Grok qualified for the first one seems suspect.
aykutseker
This seems less like Kimi is better at coding than Claude and more like Kimi found the right strategy for this particular game. Still interesting though. The fact that an open weight model is close enough for that to matter is probably the real story.
rvz
So we are now at the point where open weight models are rapidly catching up to the frontier models. They are at best 30 days behind, and at worst 2 months behind. The last issue is being able to run the best one on conventional hardware without a rack of GPUs. MacBooks and Mac minis are behind on hardware, but within the next 2 years at worst they will make it possible thanks to the advancements of the M-series machines. All of this is why companies like Anthropic feel like they have to use "safety" to stop you from running local models on your machine and get you hooked on their casino, wasting tokens with a slot machine named Claude.
qakajjqj
Yes, Gemini is a programming application
0xbadcafebee
These posts are going to be a constant for the next year, because there's no objective way to compare models (past low-level numbers like token generation speed, average reasoning token amount, # of parameters, active experts, etc). They're all quite different in a lot of ways, they're used for many different things by different people, and they're not deterministic. So you're constantly gonna see benchmarks and tests and proclamations of "THIS model beat THAT model!", with people racing around trying to find the best one. But there is no best one. There's just the best one for you, based on whatever your criteria are. It's likely we'll end up in a "Windows vs macOS vs Linux" style world, where people stick to their camps that do a particular thing a particular way.
slopinthebag
Amazing. To me it feels like GLM 5.1, Kimi 2.6, DeepSeek 4 are all competitive both with each other and with the American models. Truly a great time to be alive. I would like to see more effort making the flash variants work for coding. They are super economical to use to brute force boilerplate and drudgery, and I wonder just how good they can be with the right harness, if it provides the right UX for the steering they require. As much as vibe coding has captured the zeitgeist, I think long term using them as tools to generate code at the hands of skilled developers makes more sense. Companies can only go so long spending obscene amounts of money for subpar unmaintainable code.
walrus01
People thinking to self-host Kimi K2.6 had better be prepared for how big it is. Q8 K XL quantization for instance is around 600GB on disk. I would bet about 700GB of VRAM needed. Quantizations lower than Q8 are probably worthless for quality. Or 2.05TB on disk for the full precision GGUF. https://huggingface.co/unsloth/Kimi-K2.6-GGUF If you can afford the hardware to run Kimi K2.6 at any decent speed for more than 1 simultaneous user, you probably have a whole team of people on staff who are already very familiar with how to benchmark it vs Claude, GPT-5.5, etc.
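The sizing guess above (about 600GB of weights, about 700GB of VRAM) can be sketched as simple arithmetic. This is a rough illustration only: the overhead fraction for KV cache and runtime buffers and the 80GB-per-GPU figure are assumptions made for the example, not measurements of Kimi K2.6, and the function names are hypothetical.

```python
import math

def estimate_vram_gb(weights_gb, overhead_fraction=0.15):
    """Weights plus an assumed fractional overhead for KV cache and buffers."""
    return weights_gb * (1.0 + overhead_fraction)

def gpus_needed(total_vram_gb, per_gpu_gb):
    """Smallest number of equal-sized GPUs whose combined memory covers the total."""
    return math.ceil(total_vram_gb / per_gpu_gb)

# ~600GB on disk for the Q8 quantization, per the comment above.
total = estimate_vram_gb(600)
# 80GB per GPU is an assumed card size, for illustration only.
print(round(total), "GB across", gpus_needed(total, 80), "GPUs")
```

Real requirements depend heavily on context length, batch size, and the inference runtime, so treat the overhead fraction as a knob to adjust rather than a known constant.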
sieve
Kimi is really good. I have been using Sonnet and others (DeepSeek, ChatGPT, MiniMax, Qwen) for my compiler/vm project and the Claude Pro plan is mostly unusable for any serious coding effort. So I use it in chat mode in the browser where it cannot needlessly read your entire project, and use Kimi on the OpenCode Go plan with pi. Kimi consistently exceeded Sonnet on the C+Python project. Never had to worry about it doing anything other than what I asked it to do. GLM crapped the bed once or twice. Kimi never did.
plexescor
I always thought Claude was the GOAT, but I guess it's time to change that notion and try Kimi K2.6.