Kimi vendor verifier – verify accuracy of inference providers
Alifatisk
217 points
19 comments
April 20, 2026
Related Discussions
Found 5 related stories in 62.9ms across 5,126 title embeddings via pgvector HNSW
- Kimi K2.6-code-preview is now available jrop · 12 pts · April 13, 2026 · 49% similar
- Kimi K2.6: Advancing open-source coding meetpateltech · 628 pts · April 20, 2026 · 49% similar
- Don't Trust, Verify lwhsiao · 17 pts · March 28, 2026 · 45% similar
- Kimi K2.6: Advancing Open-Source Coding nekofneko · 39 pts · April 20, 2026 · 45% similar
- When AI writes the software, who verifies it? todsacerdoti · 192 pts · March 03, 2026 · 44% similar
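The related-story lookup above is a nearest-neighbor search over title embeddings. As a rough illustration of what that ranking computes, here is a minimal pure-Python cosine-similarity sketch with toy 3-d vectors standing in for real embeddings; pgvector's HNSW index performs the same kind of ranking approximately, at scale, without scanning every row:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k_related(query_vec, corpus, k=5):
    """Rank (title, vector) pairs by similarity to the query vector."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-d "embeddings"; real systems use hundreds of dimensions.
corpus = [
    ("Kimi K2.6: Advancing open-source coding", [0.9, 0.1, 0.2]),
    ("Don't Trust, Verify", [0.2, 0.8, 0.1]),
    ("Unrelated story", [0.0, 0.1, 0.9]),
]
query = [0.85, 0.2, 0.15]
for title, score in top_k_related(query, corpus, k=2):
    print(f"{title}: {score:.2f}")
```

An exhaustive scan like this is O(n) per query; HNSW trades exactness for a graph walk that stays fast even across millions of embeddings.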
Discussion Highlights (8 comments)
OsamaJaber
Good to see this exists. Inference providers quietly swap quant levels, and most users never check. A standard verifier from the model maker is the right move; I'd love to see other labs ship the same.
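One way a verifier in this spirit can work is to compare a provider's next-token probability distributions against a reference run of the same model and flag large divergences, since quantization shifts the output distribution. This is an illustrative sketch only, with hypothetical distributions and a made-up `flag_provider` helper; it does not claim to reproduce the actual Kimi verifier's methodology:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def flag_provider(reference_probs, provider_probs, threshold=0.05):
    """Hypothetical check: flag a provider whose average per-step KL
    divergence from the reference run exceeds a threshold."""
    divs = [kl_divergence(p, q) for p, q in zip(reference_probs, provider_probs)]
    avg = sum(divs) / len(divs)
    return avg, avg > threshold

# Hypothetical next-token distributions over a tiny 3-token vocabulary.
reference = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
faithful  = [[0.69, 0.21, 0.1], [0.5, 0.31, 0.19]]   # near-identical serving
quantized = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3]]       # distribution visibly shifted

print(flag_provider(reference, faithful))   # small divergence, passes
print(flag_provider(reference, quantized))  # large divergence, flagged
```

In practice a verifier would need many prompts and real API logprobs (where providers expose them); the point of the sketch is only that distribution drift from quantization is measurable rather than a matter of trust.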
bobbiechen
If I understand correctly, the threat model here seems to be protecting against accidental issues that would impact performance, but it doesn't cover a malicious actor. For example, Sketchy Provider tells you they're running the latest and greatest, but is actually knowingly running some cheaper (and worse) model and pocketing the difference. These tests wouldn't help, since Sketchy Provider could detect when they're being tested and do the right thing only then (like the Volkswagen emissions scandal). Right?
seism
A test that runs for 15 hours on a high-powered rig is going to be hard to reproduce or scale. But I think this addresses a widespread concern that affects all kinds of cloud services: what you ping is not necessarily what you get.
curioussquirrel
After Anthropic, Moonshot is another model provider that restricts tweaking of sampling parameters. I do like the idea of the vendor verifier, though.
foundry27
I like this idea. This might be one of the more effective social pressures available for getting inference providers to fix long-standing issues. AWS Bedrock, for example, has crippling defects in its serving stack for Kimi’s K2 and K2.5 models that cause 20%-30% of attempts to emit tool calls to instead silently end the conversation (with no token output). That makes AWS effectively irrelevant as a serious inference provider for Kimi, and conveniently pushes users onto Bedrock’s significantly more expensive Anthropic models for comparable performance on agentic tasks.
gertlabs
This is a real issue in our benchmarks. Beware of OpenRouter providers that don't specify quantizations, or that use lower ones than you might be expecting. OpenRouter does provide configuration options to filter for this, though using them often limits your provider options significantly. That being said, even with the best providers, Kimi-K2-thinking was underwhelming and slow on our benchmarks, albeit interesting and useful for temperature/variation. Kimi K2.6, however, is the new open-source leader so far. Agentic evaluations are still in progress, but one-shot coding reasoning benchmarks are ready at https://gertlabs.com/?mode=oneshot_coding
m1keil
A related article from fireworks.ai about running open-weights models and why such a verifier needs to exist in the first place: https://fireworks.ai/blog/quality-first-with-kimi-k2p5
punkpeye
Now this is brilliant. I run an AI gateway (Glama), and we had to delist all third-party providers because some of them are obviously lying about their quantization. Being able to vet providers would be a major step toward offering a more diverse set of them again.