KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit
EGreg
44 points
36 comments
April 21, 2026
Related Discussions
Found 5 related stories in 65.0ms across 5,126 title embeddings via pgvector HNSW
- TurboQuant: Building a Sub-Byte KV Cache Quantizer from Paper to Production wizzense · 13 pts · March 27, 2026 · 71% similar
- Apply video compression on KV cache to 10,000x less error at Q4 quant polymorph1sm · 16 pts · March 22, 2026 · 65% similar
- TurboQuant: Redefining AI efficiency with extreme compression ray__ · 509 pts · March 25, 2026 · 62% similar
- Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x gmays · 16 pts · March 27, 2026 · 57% similar
- TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and IOS aegis_camera · 76 pts · April 01, 2026 · 53% similar
Discussion Highlights (5 comments)
tomrod
Extraordinary claims! I don't follow the argument though.
ddtaylor
Very interesting. A compression strategy that uses the model itself as the dictionary.
thethirdone
> the ratio remains approximately 914x over TurboQuant, with compression improving rather than degrading as context length grows.

This line from the abstract got me really suspicious. Obviously a compression scheme that incorporates the entire sequence shouldn't get worse compared to a per-element one as the length increases.

It is important to note that this paper is PURELY theoretical. I couldn't find much meat on the bone from a quick skim. The single author, Gregory Magarshak, has only published one paper on arXiv before and appears to be a professor of business / music. I don't plan to give it more of a read hoping for something of value.
sabareesh
Sounds like speculative decoding but for KV cache
aesthesia
> The second layer, predictive delta coding, stores only the residual of each new KV vector from the model's own prediction of it

I don't understand this. The key and value vectors for any given layer + token are created by the model. By definition, they are exactly equal to the model's prediction of them!

Extreme KV cache compression is easy to get: you can get an infinite compression ratio by just regenerating the key and value vectors on every forward pass. The point of a KV cache is to reduce the amount of repeated computation during generation, though. Compression only helps if you have an efficient decompression algorithm.
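For readers unfamiliar with the term, here is a minimal sketch of what "predictive delta coding" generally means: store only the residual between each vector and a prediction of it, then reconstruct by adding the residual back. The predictor here (copy the previous vector) and the dimensions are placeholders for illustration, not anything from the paper; the compression win would come from the residuals being smaller and cheaper to quantize than the raw vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_kv(prev_kv):
    # Hypothetical predictor: just repeat the previous KV vector.
    # A real scheme would use a learned or model-derived prediction.
    return prev_kv

# Simulate a stream of KV vectors (e.g., keys for one head, dim 8).
kv_stream = [rng.standard_normal(8) for _ in range(4)]

# --- "Compression": store the first vector verbatim, then residuals ---
residuals = [kv_stream[0]]
for prev, cur in zip(kv_stream, kv_stream[1:]):
    residuals.append(cur - predict_kv(prev))

# --- "Decompression": rebuild each vector as prediction + residual ---
recovered = [residuals[0]]
for r in residuals[1:]:
    recovered.append(predict_kv(recovered[-1]) + r)

# Without quantizing the residuals this round-trip is lossless.
assert all(np.allclose(a, b) for a, b in zip(kv_stream, recovered))
```

Note the commenter's objection still stands: the decompression loop above is sequential in the token index, so whether this saves anything in practice depends entirely on the predictor being cheaper than recomputing the KV vectors outright.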