Surpassing vLLM with a Generated Inference Stack
lukebechtel
31 points
11 comments
March 10, 2026
Related Discussions
- Executing programs inside transformers with exponentially faster inference u1hcw9nx · 17 pts · March 12, 2026 · 52% similar
- How I write software with LLMs indigodaddy · 69 pts · March 16, 2026 · 48% similar
- Reliable Software in the LLM Era mempirate · 102 pts · March 12, 2026 · 48% similar
- EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages matt_d · 84 pts · March 19, 2026 · 48% similar
- Show HN: Cq – Stack Overflow for AI coding agents peteski22 · 108 pts · March 23, 2026 · 45% similar
Discussion Highlights (5 comments)
ntonozzi
Why do they need to run benchmarks to confirm performance? Can't they run example prompts and verify they get the exact same output token probabilities as vLLM for every prompt? The fact that they are not doing this makes me suspicious that they are in fact not doing the exact same thing as vLLM. It is also a bit odd that they are not incorporating speculative decoding; that seems like a critical performance optimization, especially for decode-heavy workloads.
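The equivalence check ntonozzi describes could be sketched roughly as follows. This is a minimal illustration, not anyone's actual test harness: the logprob-dump format and the `logprobs_match` helper are hypothetical, and real engines (vLLM included) expose per-token log-probabilities through their own APIs.

```python
import math

def logprobs_match(ref, cand, atol=1e-5):
    """Compare per-token log-probabilities from two inference engines.

    ref, cand: one dict per generated position, mapping token -> logprob.
    Returns the index of the first mismatching position, or -1 if every
    position agrees within the absolute tolerance.
    """
    for i, (r, c) in enumerate(zip(ref, cand)):
        if r.keys() != c.keys():          # different candidate token sets
            return i
        for tok, lp in r.items():
            if not math.isclose(lp, c[tok], abs_tol=atol):
                return i
    return -1

# Hypothetical logprob dumps from vLLM and the generated stack:
ref  = [{"The": -0.01, "A": -4.2}, {"cat": -0.3, "dog": -1.4}]
cand = [{"The": -0.01, "A": -4.2}, {"cat": -0.3, "dog": -1.4}]
assert logprobs_match(ref, cand) == -1   # identical within tolerance

bad = [{"The": -0.01, "A": -4.2}, {"cat": -0.9, "dog": -1.4}]
assert logprobs_match(ref, bad) == 1     # diverges at position 1
```

A small tolerance is needed because even a correct reimplementation can differ in the last few bits due to floating-point reduction order; exact bit equality is only expected when the kernels and reduction order match.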
rfw300
OK... we need way more information than this to validate this claim! I can run Qwen-8B at 1 billion tokens per second if you don't check the model's output quality. No information is given about the source code, correctness, batching, benchmark results, quantization, etc. etc. etc.
acuozzo
Luke: Do you have benchmarks for BF16?
storus
Does it support paged attention like vLLM though? Without that they will run into memory fragmentation quickly.
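For context on storus's point: PagedAttention avoids fragmentation by splitting the KV cache into fixed-size blocks and giving each sequence a block table instead of one contiguous slab, so waste is bounded at one partially filled block per sequence. A toy allocator in that style (class and method names are illustrative, not vLLM's actual implementation):

```python
class BlockAllocator:
    """Toy KV-cache block pool in the style of vLLM's PagedAttention.

    Each sequence holds a block table (list of block ids) rather than a
    contiguous region, so blocks freed by finished sequences are reusable
    by any other sequence regardless of length.
    """
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of unused block ids
        self.tables = {}                     # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        """Reserve a KV slot for the token at position `pos` of `seq_id`."""
        table = self.tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:       # last block is full (or none yet)
            if not self.free:
                raise MemoryError("KV cache exhausted; must preempt a sequence")
            table.append(self.free.pop())
        return table[-1], pos % self.block_size  # (block id, slot in block)

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=4, block_size=16)
for t in range(20):                     # a 20-token sequence needs 2 blocks
    alloc.append_token("seq0", t)
assert len(alloc.tables["seq0"]) == 2
alloc.release("seq0")
assert len(alloc.free) == 4             # all blocks back in the pool
```

Without this kind of indirection, a naive allocator that reserves max-length contiguous buffers per request either over-allocates badly or fragments as variable-length sequences come and go, which is the failure mode the comment is warning about.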
cermicelli
This says nothing. You're claiming X is better with no way to check or look into what it does, how it works, or whether it didn't just clone vLLM's code. At least the C compiler Claude wrote was the verifiable kind of thing. This claim is unsubstantiated.