Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA
yu3zhou4
122 points
10 comments
May 29, 2026
Discussion Highlights (8 comments)
yu3zhou4
Author here. In my opinion the README is the most interesting part: I wrote it to help others build a useful mental model so they can recreate the project themselves, without even needing to read my code.
nazgulsenpai
I love the documentation formatted in lessons. I can't wait to read through it.
juancn
Looks interesting, it reminds me of the first llama.cpp, but better documented.
dwa3592
Very nice job on the README. >> Physically, LLM is a file which contains a lot of float numbers. aka atoms of the LLM.
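To make the "file full of float numbers" idea concrete, here is a minimal C++ sketch that loads a file as a flat array of floats. It is not taken from the project: `load_floats` is a hypothetical helper, and it assumes a headerless file of raw 32-bit floats in the machine's native byte order. Real checkpoint formats such as safetensors or GGUF add a header and metadata in front of the tensor data.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from Tiny-vLLM): treat a file as a flat array
// of 32-bit floats. Assumes raw float32 values, native byte order, no header.
std::vector<float> load_floats(const std::string& path) {
    // Open in binary mode, positioned at the end so tellg() gives the size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }

    const std::streamsize bytes = in.tellg();
    const std::streamsize elem = static_cast<std::streamsize>(sizeof(float));
    if (bytes < 0 || bytes % elem != 0) {
        throw std::runtime_error("size is not a multiple of sizeof(float): " + path);
    }

    // Rewind and read the whole file straight into the vector's storage.
    in.seekg(0, std::ios::beg);
    std::vector<float> weights(static_cast<std::size_t>(bytes / elem));
    if (!in.read(reinterpret_cast<char*>(weights.data()), bytes)) {
        throw std::runtime_error("short read: " + path);
    }
    return weights;
}
```

Calling `load_floats("weights.bin")` (the file name is a placeholder) returns one `float` per four bytes of the file; an inference engine would then interpret slices of that array as weight matrices.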
einpoklum
It seems the author believes checking the return values of CUDA API calls is not "tiny" enough :-(
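For readers unfamiliar with the pattern this comment refers to, a common approach is to wrap every API call in a macro that checks the returned status and reports the failing expression with its file and line. The sketch below is an illustration, not the project's code. So that it compiles without the CUDA toolkit it uses stand-ins: `Status`, `status_string`, `CHECK`, and `fake_alloc` are all hypothetical names. With the real CUDA runtime the status type would be `cudaError_t`, the success value `cudaSuccess`, and the message would come from `cudaGetErrorString`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Stand-in for cudaError_t (assumption: lets the sketch build without CUDA).
enum class Status { Ok = 0, OutOfMemory = 2 };

// Stand-in for cudaGetErrorString.
inline const char* status_string(Status s) {
    switch (s) {
        case Status::Ok:          return "no error";
        case Status::OutOfMemory: return "out of memory";
    }
    return "unknown error";
}

// Throws with the failing expression and its source location.
inline void check(Status s, const char* expr, const char* file, int line) {
    if (s != Status::Ok) {
        throw std::runtime_error(std::string(file) + ":" + std::to_string(line) +
                                 ": " + expr + " failed: " + status_string(s));
    }
}

// Wrap each status-returning call: CHECK(some_api_call(args));
#define CHECK(expr) check((expr), #expr, __FILE__, __LINE__)

// Hypothetical API call that fails for requests above 1 MiB,
// standing in for something like a device allocation.
inline Status fake_alloc(std::size_t bytes) {
    return bytes > (1u << 20) ? Status::OutOfMemory : Status::Ok;
}
```

With this in place, `CHECK(fake_alloc(16));` passes silently, while `CHECK(fake_alloc(std::size_t{1} << 30));` throws an exception whose message names the call that failed. The cost is one macro and a wrapper per call site, which is why the pattern tends to show up even in small codebases.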
cookiengineer
Wanted to add that the author has an amazing blog with lots of interesting papers: https://jedrzej.maczan.pl/
xuanlin314
The lesson-style README is a great approach. Breaking down LLM inference into digestible steps makes the codebase approachable even for people who haven't touched CUDA before.
GoldenJade
Thanks for sharing this. As someone currently researching LLMs, I'm sure I'll be referencing this quite a bit going forward.