Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge

bazlightyear 131 points 56 comments May 03, 2026

Discussion Highlights (20 comments)

magicalhippo

In a single challenge, measured by how performant the solution was. Kimi K2.6 is definitely a frontier-sized model, so on the one hand it's not that surprising it's up there with the closed frontier models. Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.

PedroBatista

Great to know, but what was the cost both in terms of $$ and tokens used? Not to invalidate these benchmark results because they are useful, but the real usefulness is what they are capable of doing when real people interact with them at scale. Regardless, this is good news, because now that Microsoft is basically giving up their all-in strategy with GitHub's Copilot and Anthropic is playing the "I'm too good for you" game, it's about time for them to get pressed into not making this AI world into a divide between the haves and the have-nots.

beering

I’m a little confused as to the setup. It was asking each model to one-shot a script and then the scripts faced off? Were the models given a computer environment? Or a test server to iterate against?

Frannky

I have to try Kimi. I was looking for an alternative. If you have any experience or advice, please share. I saw Kimi is at the top of the OpenRouter ranking.

elromulous

Is the site just slashdotted rn? Can anyone get to it?

jakemanger

What are the GPU VRAM requirements for this thing? Awesome to have an open model that can compete, but damn it would be so much better if you could run it locally. Otherwise, it's so difficult to run (e.g. self-host) that it's just way more convenient to pay OpenAI, Claude, etc.

slashdave

I was surprised by the ranking, until I read what the test was. Not horribly relevant for coding. The current ranking of all tests makes more sense (well, except for how well Gemini does) https://aicc.rayonnant.ai

pbreit

All my co-workers say Claude blows away Gemini. Is it really that good? How can I do Kimi?

justech

I’ve been maining Kimi K2.6 through OpenCode Go and OpenRouter for a week and I can say it’s the same experience as when I was maining Sonnet 3.5/4 late last year. Not as good or as fast as Claude Code on Opus now but definitely enough for casual/hobby use. The best part is multiple choices for providers: if OpenCode gimps their service, I’ll switch.

jrecyclebin

I absolutely love Kimi's personality - some of the things it says are so out there! And it's been great for very focused, iterative work. Its weakness is that it seems to yak on and on when it needs to plan out something big or read through and make sense of how to use a niche piece of a complex library. To the point where it can fill up its 256k window - and rack up a bill. (No cache.) I have had better experience with GLM 5.1 in those cases. Anyone out there relate?

gertlabs

I'm glad we're seeing a shift towards objectively scored tests. We've been doing this at scale at https://gertlabs.com/rankings, and although the author looks to be running unique one-off samples, it's not surprising to see how well Kimi K2.6 performed. Based on our testing, for coding especially, Kimi is within statistical uncertainty of MiMo V2.5 Pro for top open-weights model, and performs much better with tools than DeepSeek V4 Pro. GPT 5.5 has a comfortable lead, but Kimi is on par with or better than Opus 4.6. The problem with Kimi K2.6 is that it's one of the slower models we've tested.

SomaticPirate

This seems to be testing the models on LeetCode-style prompts that also require the model to implement TCP calls to send the results. Interesting, but probably not an apples-to-apples comparison. The fact that only Grok qualified for the first one seems suspect.

aykutseker

This seems less like Kimi is better at coding than Claude and more like Kimi found the right strategy for this particular game. Still interesting though. The fact that an open weight model is close enough for that to matter is probably the real story.

rvz

So we are now at the point where open weight models are rapidly catching up to the frontier models. They are at best 30 days behind, and at worst 2 months behind. The last issue is being able to run the best one on conventional hardware without a rack of GPUs. MacBooks and Mac minis are behind on hardware, but within the next 2 years at worst they will make it possible thanks to the advancements of the M-series machines. All of this is why companies like Anthropic feel like they have to use "safety" to stop you from running local models on your machine and get you hooked on their casino wasting tokens with a slot machine named Claude.

qakajjqj

Yes, Gemini is a programming application

0xbadcafebee

These posts are going to be a constant for the next year, because there's no objective way to compare models (past low-level numbers like token generation speed, average reasoning token amount, # of parameters, active experts, etc). They're all quite different in a lot of ways, they're used for many different things by different people, and they're not deterministic. So you're constantly gonna see benchmarks and tests and proclamations of "THIS model beat THAT model!", with people racing around trying to find the best one. But there is no best one. There's just the best one for you, based on whatever your criteria are. It's likely we'll end up in a "Windows vs macOS vs Linux" style world, where people stick to their camps that do a particular thing a particular way.

slopinthebag

Amazing. To me it feels like GLM 5.1, Kimi 2.6, DeepSeek 4 are all competitive both with each other and with the American models. Truly a great time to be alive. I would like to see more effort making the flash variants work for coding. They are super economical to use to brute force boilerplate and drudgery, and I wonder just how good they can be with the right harness, if it provides the right UX for the steering they require. As much as vibe coding has captured the zeitgeist, I think long term using them as tools to generate code at the hands of skilled developers makes more sense. Companies can only go so long spending obscene amounts of money for subpar unmaintainable code.

walrus01

People thinking of self-hosting Kimi K2.6 had better be prepared for how big it is. Q8 K XL quantization for instance is around 600GB on disk (2.05TB on disk for the full precision GGUF), and I would bet about 700GB of VRAM needed. Quantizations lower than Q8 are probably worthless for quality. https://huggingface.co/unsloth/Kimi-K2.6-GGUF If you can afford the hardware to run Kimi K2.6 at any decent speed for more than 1 simultaneous user, you probably have a whole team of people on staff who are already very familiar with how to benchmark it vs Claude, GPT-5.5, etc.
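
For anyone wanting to redo the back-of-the-envelope sizing above with their own numbers, here is a rough Python sketch. It takes the 600GB on-disk figure and the ~700GB VRAM guess from the comment and treats the gap as runtime overhead; that overhead ratio, the helper names, and the 80GB-per-GPU card size are all assumptions for illustration, not measured values.

```python
import math

def estimate_vram_gb(weights_on_disk_gb: float, overhead_fraction: float) -> float:
    """Weights plus an assumed fractional overhead (KV cache, activations, buffers)."""
    return weights_on_disk_gb * (1.0 + overhead_fraction)

def gpus_needed(total_vram_gb: float, per_gpu_gb: float) -> int:
    """Number of cards needed if memory could be split perfectly evenly (optimistic)."""
    return math.ceil(total_vram_gb / per_gpu_gb)

# Figures quoted in the comment: ~600GB on disk for Q8, ~700GB VRAM guessed.
q8_disk_gb = 600.0
assumed_overhead = 100.0 / 600.0  # assumption: the gap between the two quoted figures

vram_gb = estimate_vram_gb(q8_disk_gb, assumed_overhead)

# Hypothetical 80GB accelerators; swap in whatever hardware you actually have.
print(f"~{round(vram_gb)}GB VRAM, {gpus_needed(vram_gb, 80.0)} x 80GB GPUs")
```

Real deployments will differ with context length, batch size, and how the runtime shards the model, so treat the result as a lower bound on hardware, not a quote.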

sieve

Kimi is really good. I have been using Sonnet and others (DeepSeek, ChatGPT, MiniMax, Qwen) for my compiler/vm project and the Claude Pro plan is mostly unusable for any serious coding effort. So I use it in chat mode in the browser where it cannot needlessly read your entire project, and use Kimi on the OpenCode Go plan with pi. Kimi consistently exceeded Sonnet on the C+Python project. Never had to worry about it doing anything other than what I asked it to do. GLM crapped the bed once or twice. Kimi never did.

plexescor

I always thought Claude was the GOAT, but I guess it's time to change that notion and try Kimi K2.6.
