Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

yu3zhou4 122 points 10 comments May 29, 2026
github.com · View on Hacker News

Discussion Highlights (8 comments)

yu3zhou4

The README is, in my opinion (author here), the most interesting part. I wrote it to help others build a useful mental model, so that you can recreate the project yourself without even needing to read my code.

nazgulsenpai

I love the documentation formatted in lessons. I can't wait to read through it.

juancn

Looks interesting, it reminds me of the first llama.cpp, but better documented.

dwa3592

Very nice job on the README. "Physically, LLM is a file which contains a lot of float numbers." AKA the atoms of the LLM.
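
To make the quoted line concrete, here is a minimal sketch of treating a file as a flat array of floats. It assumes a headerless file of raw float32 values in the machine's native byte order. Real checkpoint formats usually add headers and metadata, and nothing here is taken from the Tiny-vLLM code. The helper name `load_floats` is hypothetical.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from Tiny-vLLM): reads a whole file as raw float32
// values. Assumes no header and native byte order.
std::vector<float> load_floats(const std::string& path) {
    // Open at the end so tellg() reports the file size in bytes.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    const std::streamoff bytes = in.tellg();
    if (bytes < 0 ||
        static_cast<std::size_t>(bytes) % sizeof(float) != 0) {
        throw std::runtime_error("size is not a multiple of sizeof(float)");
    }

    std::vector<float> data(static_cast<std::size_t>(bytes) / sizeof(float));
    in.seekg(0);
    in.read(reinterpret_cast<char*>(data.data()),
            static_cast<std::streamsize>(bytes));
    if (!in) {
        throw std::runtime_error("short read from " + path);
    }
    return data;
}
```

Calling `load_floats("weights.bin")` (a placeholder file name) would return one `float` per four bytes of the file. An inference engine would then interpret slices of that array as tensors.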

einpoklum

It seems the author believes checking the return values of CUDA API calls is not "tiny" enough :-(
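
For readers unfamiliar with the pattern this comment refers to, here is a minimal sketch of wrapping every status-returning call in a checking macro. So that it compiles without the CUDA toolkit, it uses stand-ins: `Status`, `status_string`, and `fake_alloc` are hypothetical. In real CUDA code the status type would be `cudaError_t`, success would be `cudaSuccess`, and the message would come from `cudaGetErrorString`. This is one common convention, not the project's own code.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Stand-ins so the sketch builds without CUDA. In real code these would be
// cudaError_t, cudaSuccess and cudaGetErrorString.
enum class Status { Ok, OutOfMemory };

inline const char* status_string(Status s) {
    return s == Status::Ok ? "ok" : "out of memory";
}

// Throws with the call text, file and line if the call did not succeed.
inline void check(Status s, const char* expr, const char* file, int line) {
    if (s != Status::Ok) {
        throw std::runtime_error(std::string(file) + ":" +
                                 std::to_string(line) + ": " + expr +
                                 " failed: " + status_string(s));
    }
}

// Wrap each API call so a failure cannot be silently ignored.
#define CHECK(call) check((call), #call, __FILE__, __LINE__)

// Hypothetical allocator that fails for requests above 1024 bytes.
inline Status fake_alloc(std::size_t bytes) {
    return bytes > 1024 ? Status::OutOfMemory : Status::Ok;
}
```

With this in place, `CHECK(fake_alloc(16));` passes silently, while `CHECK(fake_alloc(4096));` throws an exception naming the failing call and its location. The cost is one macro and one wrapper per call site.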

cookiengineer

Wanted to add that the author has an amazing blog with lots of interesting papers: https://jedrzej.maczan.pl/

xuanlin314

The lesson-style README is a great approach. Breaking down LLM inference into digestible steps makes the codebase approachable even for people who haven't touched CUDA before.

GoldenJade

Thanks for sharing this. As someone currently researching LLMs, I'm sure I'll be referencing this quite a bit going forward.
