Real-time LLM Inference on Standard GPUs: 3k tokens/s per request
NicoConstant
202 points
91 comments
May 29, 2026
Related Discussions
- Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA yu3zhou4 · 122 pts · May 29, 2026 · 59% similar
- SubQ: Sub-quadratic LLM built for 12M-token context gagan2020 · 17 pts · May 05, 2026 · 58% similar
- MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU chrsw · 280 pts · April 08, 2026 · 57% similar
- Executing programs inside transformers with exponentially faster inference u1hcw9nx · 17 pts · March 12, 2026 · 56% similar
- Making LLM Training Faster with Unsloth and NVIDIA segmenta · 114 pts · May 07, 2026 · 56% similar
Discussion Highlights (20 comments)
ilaksh
Could be amazing, but it's hard to judge if it will really work with say a 27 B model or larger. We can already get pretty good speed with a 2B model.
mungoman2
This looks very interesting. It's possible to get those rates without exotic hardware. But I have to say that the comparison is not really fair: it pits a 2B model against frontier models that are likely hundreds of times larger. Also, Taalas, with their 15,000 tok/s inference, is suspiciously missing from the comparison. We need to see this framework compared on useful models, which at present seems to mean ~30B.
LoganDark
I feel the comparison to Groq is unfair. They're running much larger models (orders of magnitude) and still reaching competitive speeds.
867-5309
> Standard GPUs
> 8× NVIDIA H200
kirtivr
I can think of real-time video, shader generation, and real-time worldbuilding as the type of problems that could require such high token throughput. For instant code generation, 400-500 tok/s should be sufficient, though most frontier models give us closer to 70 tok/s.
0-bad-sectors
When I read "Standard GPUs" in the title I got excited for a second, then I read the article itself...
irishcoffee
NVIDIA H200 is not a standard GPU. Eight of them in a box with a CPU and RAM cost close to the same as a house. I am 100% all about using local models instead of sending someone else all my data and paying for the privilege of doing so, but this article is misleading. I can get a 27B model to kick out 40 tok/s on 16 GB of VRAM. This is the area ripe for development. If you can’t connect a monitor, it isn’t a standard GPU, at least not in the way people have spoken about GPUs until a few years ago.
gaeld
Follow-up reading for the most technical and research-minded people here:
- Monokernel deep dive (GPU Engineering): http://blog.kog.ai/building-a-single-kernel-latency-optimize...
- Delayed Tensor Parallelism (research): http://blog.kog.ai/delayed-tensor-parallelism-for-faster-tra...
- To try the speed on the playground: http://playground.kog.ai
CastFX
Looks super promising! A couple of questions: For new open-weights models, will you need to adapt model code and optimizations for your inference engine by hand? It's true that BS=1 is king when it comes to agentic workflows; however, these kinds of systems serve multiple requests concurrently with dynamic batching. Do you think it will scale as well? Any plans to release it as open source? Congrats again on the release.
robmccoll
Making these claims on a 2B parameter model seems a bit like seeing linear scalability from 1 to 4 cores and then assuming 256 cores will give you a 256x speedup. Or demonstrating massive improvement on datasets that fit in cache and then assuming the same improvements will be present on problem sizes that span the memory of multiple machines. Something tells me that scaling to larger models will be more difficult than assumed.
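The core-count analogy above can be made concrete with Amdahl's law. The sketch below is only an illustration of that analogy, not a model of the inference engine under discussion; the 1% serial fraction is a hypothetical value chosen for the example.

```python
def amdahl_speedup(cores: int, serial_fraction: float) -> float:
    """Ideal speedup under Amdahl's law for a workload in which
    `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Hypothetical workload: 1% of the work is serial (an assumption for illustration).
SERIAL_FRACTION = 0.01

speedup_4 = amdahl_speedup(4, SERIAL_FRACTION)
speedup_256 = amdahl_speedup(256, SERIAL_FRACTION)

# At 4 cores the scaling looks almost linear; at 256 cores it falls far short of 256x.
print(f"4 cores:   {speedup_4:.2f}x")
print(f"256 cores: {speedup_256:.2f}x")
```

With these assumed numbers, the 4-core result is close to 4x while the 256-core result stays well under 100x, which is the gap the comment warns about when extrapolating from a small measurement.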
bartkappenburg
Is this the new gateway to a "Model On a Chip"? Is it possible to etch the weights on silicon and get a very efficient way to use a LLM?
ekianjo
Title is pure bait. Where has "Datacenter GPU" gone?
Hfuffzehn
That's really nice of them. That means Jensen can add another "30 times faster" when comparing Rubin to Blackwell without having to actually do anything. Hopefully that means he won't have any problem making another 150 billion in profit in the next year. Sorry for the sarcasm. Looks like interesting work.
frankensteins
I have a naive question here. First, the token speed is very impressive, but why is this the highlight? I would prefer to see the actual performance.
bcjdjsndon
H200 isn't a standard GPU at all
paul-rohan
I had to test it myself to believe this unreal inference speed. Each time I got 3300+ tps.
cataflam
Congrats, gaeld and team. The demo is very impressive! Disclaimer: I've known the founder for a while. This is as legitimate as it gets in deep tech, with real years of research and engineering behind it, not vaporware.
rashkov
Don't miss trying their demo: https://playground.kog.ai/ Feels like a preview of the future
stymaar
This is very cool. I have been lamenting for a while that the memory-bandwidth <-> tps relationship was pretty much working for small models on consumer cards, but not at all on datacenter hardware. It's great to see that with proper care on the inference engine implementation the relationship can be restored.
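The memory-bandwidth-to-tps relationship mentioned here is usually stated as a rough ceiling: at batch size 1, each decoded token has to stream all model weights from memory once, so tokens per second cannot exceed bandwidth divided by model size. A minimal sketch of that estimate follows. The 2B model and 8 GPUs come from the thread; the 2 bytes per parameter (16-bit weights), the ~4.8 TB/s per-GPU bandwidth figure, and the even split of weights across GPUs are assumptions made for the example, not details confirmed by the article.

```python
def decode_tps_upper_bound(
    params_billion: float,
    bytes_per_param: float,
    bandwidth_tb_per_s: float,
    num_gpus: int = 1,
) -> float:
    """Rough ceiling on batch-size-1 decode speed, in tokens per second.

    Assumes every weight is read exactly once per token and that weights are
    split evenly across GPUs, so their bandwidths add up. Ignores KV-cache
    reads, activations, inter-GPU communication, and kernel launch overhead,
    all of which push real throughput below this bound.
    """
    model_bytes = params_billion * 1e9 * bytes_per_param
    aggregate_bandwidth = bandwidth_tb_per_s * 1e12 * num_gpus
    return aggregate_bandwidth / model_bytes


# Assumed inputs: 2B parameters, 16-bit weights, ~4.8 TB/s per GPU, 8 GPUs.
ceiling = decode_tps_upper_bound(2, 2, 4.8, num_gpus=8)
print(f"Bandwidth-bound ceiling: about {ceiling:,.0f} tokens/s")
```

Under these assumptions the ceiling comes out at roughly 9,600 tokens/s, so a measured 3k tokens/s per request would sit at about a third of the bandwidth bound. The same function can be rerun with a larger parameter count to see how quickly the ceiling drops for the ~30B models other commenters ask about.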
arjie
Huh, interesting. Some parts of this do generalize even to an RTX 6000 Pro Blackwell, I imagine, though we're going to be solidly bottlenecked then on inter-card throughput through the PCIe interface.