GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz
laxmena
37 points
13 comments
June 16, 2026
Related Discussions
- NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute sdpmas · 122 pts · March 19, 2026 · 56% similar
- GPT-5.4 Thinking and GPT-5.4 Pro denysvitali · 92 pts · March 05, 2026 · 51% similar
- 768GB Intel Optane DIMMs to run 1T-parameter LLM with single GPU at 4tps walterbell · 26 pts · May 30, 2026 · 50% similar
- Real-time LLM Inference on Standard GPUs: 3k tokens/s per request NicoConstant · 202 pts · May 29, 2026 · 50% similar
- CPUs Aren't Dead. Gemma2B Out Scored GPT-3.5 Turbo on Test That Made It Famous fredmendoza · 95 pts · April 15, 2026 · 49% similar
Discussion Highlights (3 comments)
amelius
See also: https://rits.shanghai.nyu.edu/ai/karpathys-microgpt-on-fpga-... TL;DR: The CPU implementation was 71x faster than the FPGA. Note: model has only 4192 parameters.
genxy
The context window is 16 characters. Talking about tokens per second is meaningless.
cadamsdotcom
Transformers scale poorly with context window size and parameter count, which means results can look really impressive when those N’s are small! I’m but a pundit in this area, so I don’t know much. But one wonders if there’s a future in burning larger models into FPGAs - whether big enough FPGAs exist (or can be built), and whether locating specialized compute right next to the memory it needs can speed things up. It would likely need a lot of algorithm parallelism work that’d translate back to CPUs/GPUs.
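To put rough numbers on the points raised above, here is a minimal back-of-the-envelope sketch. It uses only the headline figures (80 MHz, 56k tokens per second) and the 16-position context mentioned in the thread. The function names and the cost model are assumptions for illustration: roughly one multiply-accumulate per weight, plus attention over the KV cache. The layer count, width, and parameter count passed in the usage lines are hypothetical placeholders, not GateGPT's actual configuration.

```python
def cycles_per_token(clock_hz: float, tokens_per_second: float) -> float:
    """Clock cycles available to produce one token at the stated throughput."""
    return clock_hz / tokens_per_second


def decode_macs_per_token(n_params: int, n_layer: int, d_model: int, context_len: int) -> int:
    """Rough multiply-accumulate count for one decoded token with a KV cache.

    Assumed cost model (not taken from the project):
      - every weight is used once per token: about n_params MACs
      - attention reads the cache: Q.K^T and A.V each cost about
        d_model MACs per cached position, per layer
    """
    weight_macs = n_params
    attention_macs = 2 * n_layer * d_model * context_len
    return weight_macs + attention_macs


# Headline figures from the title: 80 MHz clock, 56k tokens per second.
budget = cycles_per_token(80e6, 56e3)
print(f"cycles per token: {budget:.1f}")

# Hypothetical tiny model; only the context length of 16 comes from the thread.
for ctx in (16, 1024, 8192):
    macs = decode_macs_per_token(n_params=4000, n_layer=1, d_model=16, context_len=ctx)
    print(f"context {ctx:>5}: ~{macs} MACs per token")
```

Under this model the attention term grows linearly with the number of cached positions, so a throughput figure measured at a 16-position context says little about longer contexts.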