MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

gainsurier 533 points 385 comments June 08, 2026
mimo.xiaomi.com · View on Hacker News

Discussion Highlights (19 comments)

atemerev

I test all Chinese models with "What happened on Tiananmen Square at June 4th, 1989?" prompt. MiMo-2.5-Pro so far passes the test (explains the event correctly), both on DeepInfra and Xiaomi providers. So not bad.

slopinthebag

I hope this is the next frontier AI labs push. Even the open models are smart enough and cheap enough now; if they can also be fast enough, they can make certain workflows possible and allow us to remain in a flow state while we use them.

elar_verole

Yeah, this seems to be the easiest path to overall agent efficiency in the short term.

minraws

Assuming they mean 8xA100 or similar, that's some rather insane performance, and at just 3x the cost, it's still quite cheap-ish. With some optimisations this might be quite interesting. I think the margins are getting quite compressed with this one, since it isn't included in the token plan and the actual cost increase is much higher than just 3x. But still fairly decent.

maxloh

The generation speed in the demo video is crazy, to say the least, and completely beyond my impressions of LLMs. The Xiaomi team really brought something to the table.

npn

How? Edit: now that I've read the article fully, it seems they use a very effective MTP algorithm, and somehow the quality is still decent enough. Though I doubt that the quality really only drops a bit like they claim. Maybe on the benchmarks, but in general use, heavily quantized models very often show worse results.
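For readers wondering how MTP could translate into speed, here is a minimal sketch. It assumes MTP is used as a form of speculative decoding, which the comment does not confirm: the model drafts a few tokens ahead and keeps only the prefix that verification accepts. Under the usual simplifying assumption that each drafted token is accepted independently with the same probability, the expected number of tokens per verification step has a closed form. The function name and the acceptance rates are hypothetical, not figures from Xiaomi.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step in speculative decoding.

    Assumes each of the `draft_len` drafted tokens is accepted independently
    with probability `accept_rate` (a simplification), and that one extra
    token always comes from the verifying forward pass itself.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every draft is accepted, plus the one token from verification.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    # Hypothetical acceptance rates, chosen only for illustration.
    for rate in (0.5, 0.8, 0.9):
        print(rate, round(expected_tokens_per_step(rate, draft_len=3), 3))
```

The takeaway is that the speedup depends heavily on how often drafts are accepted, so quoted speeds may vary with the kind of text being generated.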

moffkalast

42B active params, sliding window attention. There's your tradeoff.
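A back-of-the-envelope sketch of why the active parameter count matters for speed, assuming single-stream decoding is limited by memory bandwidth (every active weight is read once per step). The hardware figures are assumptions for illustration only: eight GPUs at roughly 2 TB/s each, in the spirit of the "8xA100 or similar" guess earlier in the thread, and one byte per parameter. KV-cache reads, inter-GPU communication, and batching are ignored, so this is a loose ceiling, not a measurement.

```python
def decode_ceiling_tokens_per_s(
    active_params: float,
    bytes_per_param: float,
    aggregate_bandwidth_bytes_per_s: float,
) -> float:
    """Rough upper bound on decode steps per second for one stream.

    Assumes each step must read every active weight from memory once and
    that nothing else (KV cache, communication) costs any time.
    """
    bytes_per_step = active_params * bytes_per_param
    return aggregate_bandwidth_bytes_per_s / bytes_per_step


if __name__ == "__main__":
    # 42B active parameters comes from the comment above; the rest is assumed.
    ceiling = decode_ceiling_tokens_per_s(
        active_params=42e9,
        bytes_per_param=1.0,                   # assumed 8-bit weights
        aggregate_bandwidth_bytes_per_s=8 * 2e12,  # assumed 8 GPUs x ~2 TB/s
    )
    print(f"{ceiling:.0f} decode steps/s")
```

Under these assumptions the ceiling is a few hundred steps per second, so reaching 1000 tokens per second would need more than one token per step, lower-precision weights, or faster memory.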

irthomasthomas

I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.

kingstnap

Given that MiMo is as cheap as Deepseek ( previous discussion: https://news.ycombinator.com/item?id=48282814 ) multiplying that by 3x for ultra speed is still shockingly cheap.

serpix

I may sound like a shill, but exponential growth and all. We are going to get near-instant software from a prompt, generate multiple versions, and then choose the best one. Discussions about choosing the library with the best syntactic-sugar method naming are just as crazy as suggesting we type in assembly.

amunozo

These price and speed optimizations from Chinese providers, combined with the rising prices from American ones, will change the game sooner rather than later. Many companies are already running into problems with their AI bills.

scosman

Cerebras is trialing Kimi K2.6 at 3000t/s (invite only). I'm excited for when the fast hardware gets more mainstream for frontier models. Models designed for speed on Nvidia are a nice addition that could bridge the gap.

GaggiX

If MiMo v2.5 Pro can run at >1000tk/s on GPUs then I will soon expect the same from OpenAI/Anthropic/Google.

holoduke

Speed is indeed the next big thing that should happen with frontier LLMs. Current models running 1000 times faster would be super useful. Earlier this week it took Claude at least a full week, with two Max subscriptions, to solve a complex issue where we wanted to mimic an occlusion mapping variant used in the game Crimson Desert. A pretty complex mathematical challenge. With an ultra-fast LLM and a proper self-verification process it would be awesome.

__natty__

With this at 1k tps and Kimi 2.6 1k tps by Cerebras, I believe we are entering the next stage of LLMs, where companies will also compete on throughput

qsera

Tokens per second is the "Megapixels" of AI marketing!

harel

There are a few things in life that I can't fully grasp why they are so sought after. One is the constant need to exhibit growth. As if being massive and staying massive is not good enough, one has to always and continuously grow. The other is constant speed increases. We're already operating at 50x speed. My output is much wider and so much faster that I am sometimes my own bottleneck. And now, as if that is not enough, we want more speed. "I want a full software product from scratch in 12 seconds, because 5 minutes is too long and I got things to do..." Really?

eli

Neat. The frontier models have gotten pretty impressive, but they're all a bit too slow for interactive, human-in-the-loop coding. It incentivizes vibecoding and running multiple agents in parallel. A fast agent feels more like a partner. For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.

Oras

1k TPS is great, but I’m more fascinated by the amount of AI generated comments in this thread!
