KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

EGreg 44 points 36 comments April 21, 2026
arxiv.org · View on Hacker News

Discussion Highlights (5 comments)

tomrod

Extraordinary claims! I don't follow the argument though.

ddtaylor

Very interesting. A compression strategy that uses the model itself as the dictionary.

thethirdone

> the ratio remains approximately 914x over TurboQuant, with compression improving rather than degrading as context length grows.

This line from the abstract got me really suspicious. Obviously a compression scheme that incorporates the entire sequence shouldn't get worse compared to a per-element one as the length increases.

It is important to note that this paper is PURELY theoretical. I couldn't find much meat on the bone from a quick skim. The single author, Gregory Magarshak, has only published one paper on arXiv before and appears to be a professor of business / music. I don't plan to give it more of a read hoping for something of value.

sabareesh

Sounds like speculative decoding but for KV cache

aesthesia

> The second layer, predictive delta coding, stores only the residual of each new KV vector from the model's own prediction of it

I don't understand this. The key and value vectors for any given layer + token are created by the model. By definition, they are exactly equal to the model's prediction of them!

Extreme KV cache compression is easy to get: you can get an infinite compression ratio by just regenerating the key and value vectors on every forward pass. The point of a KV cache is to reduce the amount of repeated computation during generation, though. Compression only helps if you have an efficient decompression algorithm.
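The paper's actual scheme isn't spelled out in this thread, but "predictive delta coding" in general means: both encoder and decoder share a predictor, and only the (coarsely quantized) residual between each vector and the prediction is stored. A toy sketch, with a hypothetical last-reconstructed-vector predictor standing in for whatever predictor the paper uses:

```python
import numpy as np

def predict(history):
    # Hypothetical toy predictor: the previous reconstructed vector
    # (zeros for the first step). A real scheme would use the model itself.
    return history[-1] if history else np.zeros(4)

def compress_kv(kv_vectors, predict):
    # Store only the residual between each KV vector and the predictor's
    # guess. Rounding to 1 decimal stands in for a real quantizer.
    residuals, history = [], []
    for v in kv_vectors:
        guess = predict(history)
        r = np.round(v - guess, 1)
        residuals.append(r)
        history.append(guess + r)  # track what the decoder will reconstruct
    return residuals

def decompress_kv(residuals, predict):
    # Replay the same predictor and add back the stored residuals.
    history = []
    for r in residuals:
        history.append(predict(history) + r)
    return history
```

Because the encoder predicts from the decoder's reconstructions rather than the true vectors, quantization error stays bounded per step instead of accumulating. This also makes aesthesia's point concrete: decompression requires rerunning the predictor for every cached vector, so the scheme only pays off if that predictor is much cheaper than recomputing the KV vectors outright.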
