KV Sharing, MHC, and Compressed Attention
gmays
29 points
2 comments
May 19, 2026
Related Discussions
- KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit · EGreg · 44 pts · April 21, 2026 · 56% similar
- LLM Neuroanatomy II: Modern LLM Hacking and Hints of a Universal Language? · realberkeaslan · 120 pts · March 24, 2026 · 54% similar
- A Visual Guide to Attention Variants in Modern LLMs · Anon84 · 17 pts · March 22, 2026 · 53% similar
- Multi-Stream LLMs: new paper on parallelizing/separating prompts, thinking, I/O · atomicthumbs · 86 pts · May 21, 2026 · 52% similar
- Apply video compression on KV cache to 10,000x less error at Q4 quant · polymorph1sm · 16 pts · March 22, 2026 · 51% similar
Discussion Highlights (2 comments)
nibab
cool stuff. my comp sci major feels almost completely redundant in this new vibecoding era and i feel like the only way to stay relevant as a programmer is to learn all these compute primitives and become an LLM systems guy.
redwood
Has anyone seen a similar deep dive that looks a little more closely at the infrastructure building blocks that power each of the components? I mean something a bit more physically grounded, like how much compute would go to each portion to serve a frontier model.