Can I Buy Your KV Cache?
MediaSquirrel
35 points
27 comments
June 12, 2026
Related Discussions
- KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit · EGreg · 44 pts · April 21, 2026 · 58% similar
- TurboQuant: Building a Sub-Byte KV Cache Quantizer from Paper to Production · wizzense · 13 pts · March 27, 2026 · 56% similar
- KVarN: Native vLLM backend for KV-cache quantization by Huawei · theanonymousone · 127 pts · June 04, 2026 · 55% similar
- Show HN: Agent-cache – Multi-tier LLM/tool/session caching for Valkey and Redis · kaliades · 16 pts · April 16, 2026 · 48% similar
- Apply video compression on KV cache to 10,000x less error at Q4 quant · polymorph1sm · 16 pts · March 22, 2026 · 47% similar
Discussion Highlights (8 comments)
root-parent
Lambda computing for prompts?
sghiassy
A truly global singleton
lumost
The KV cache is order-dependent: each entry depends on all the tokens that come before it in the context. There are some transformation approaches for reusing the KV cache across inferences, but none is in wide use because of accuracy concerns after the transformation.
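To make the order-dependence point concrete, here is a minimal sketch in plain Python. It assumes a toy model with one attention head, 2-dimensional embeddings, and made-up weights (`W_Q`, `W_K`, `W_V`, and `EMBED` are all hypothetical values, not taken from any real model), and it leaves out positional encodings. It compares the key computed for the same final token in two contexts that differ only in their first token.

```python
import math

# Arbitrary 2x2 toy weights (hypothetical values, reused for both layers).
W_Q = [[0.5, 0.1], [-0.2, 0.7]]
W_K = [[0.3, -0.4], [0.6, 0.2]]
W_V = [[0.9, 0.1], [-0.3, 0.5]]

# Made-up 2-d token embeddings.
EMBED = {"the": [1.0, 0.0], "a": [-1.0, 0.5], "kv": [0.0, 1.0], "cache": [1.0, 1.0]}


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def layer1(tokens):
    """Causal single-head attention with a residual connection.

    Returns (layer-1 keys, layer-1 outputs), one vector per token.
    """
    xs = [EMBED[t] for t in tokens]
    qs = [matvec(W_Q, x) for x in xs]
    ks = [matvec(W_K, x) for x in xs]
    vs = [matvec(W_V, x) for x in xs]
    outs = []
    for i, x in enumerate(xs):
        # Token i attends only to tokens 0..i (causal mask).
        scores = [dot(qs[i], ks[j]) / math.sqrt(len(x)) for j in range(i + 1)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        attn = [sum(e / total * vs[j][d] for j, e in enumerate(exps))
                for d in range(len(x))]
        outs.append([xd + ad for xd, ad in zip(x, attn)])
    return ks, outs


def layer2_keys(tokens):
    """Layer-2 keys are projections of layer-1 outputs, not of raw embeddings."""
    _, hidden = layer1(tokens)
    return [matvec(W_K, h) for h in hidden]


ctx_a = ["the", "kv", "cache"]
ctx_b = ["a", "kv", "cache"]  # same suffix, different first token

k1_a, _ = layer1(ctx_a)
k1_b, _ = layer1(ctx_b)
k2_a = layer2_keys(ctx_a)
k2_b = layer2_keys(ctx_b)

# Compare the keys of the final token ("cache") across the two contexts.
layer1_same = k1_a[-1] == k1_b[-1]
layer2_gap = max(abs(p - q) for p, q in zip(k2_a[-1], k2_b[-1]))
print(layer1_same, layer2_gap)
```

In this toy setup the layer-1 key for "cache" is the same in both contexts, because it is a projection of the embedding alone, while the layer-2 key differs, because it is a projection of a hidden state that already mixed in the earlier tokens. Real models also apply positional encodings, so even first-layer keys can depend on where the token sits.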
tonetegeatinst
Does anyone have a good recommendation for a primer that explains the KV cache?
mistercow
> Then the part that matters: where the KV lives

When your abstract was clearly generated by an LLM and not curated to at least make it sound human, it does not make me want to read your paper.
TuringNYC
Seems Cloudflare is now doing this for scraping, so makes sense to continue down the pipeline!
refulgentis
This paper doesn't make any sense. For background, I've maintained an AI client since 2022 that's cross-platform, cross-provider, and integrates llama.cpp.

I don't know why they think "agents" don't share prefill work: paid providers cache on the prefill text, llama.cpp does the same, and I specifically hooked up llama.cpp so it can reuse subsets as well, i.e. all the agents would reuse the cache.

It reads like it started from an underspecification of "agents" crossed with a strain of pop-wisdom about "KV cache" that I've seen enter mainstream discourse over the past 3 months and that is Not Even Wrong; then they solved a non-existent problem.

EDIT: based on the rest of the comments either requesting a primer on terms or pointing out that it makes errors in even more obvious ways, flagging.
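As a rough illustration of the prefix reuse described in this comment, the sketch below tracks which token sequences have already been prefilled and reports how many leading tokens of a new prompt could be served from cache. `PrefixCache` and its methods are hypothetical names invented for this example; this is not how llama.cpp or any provider implements it, and it stores only token IDs rather than real KV tensors.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Toy bookkeeping for prefix reuse (hypothetical, token IDs only)."""

    def __init__(self):
        self._entries = []

    def store(self, tokens):
        # Remember that KV for this full sequence has been computed.
        self._entries.append(tuple(tokens))

    def reusable_tokens(self, tokens):
        # Longest prefix shared with any previously prefilled sequence.
        return max((common_prefix_len(e, tokens) for e in self._entries),
                   default=0)


system_prompt = [1, 2, 3, 4]          # shared by every agent
agent_a = system_prompt + [10, 11]
agent_b = system_prompt + [20, 21, 22]

cache = PrefixCache()
cold = cache.reusable_tokens(agent_a)   # nothing cached yet
cache.store(agent_a)
warm = cache.reusable_tokens(agent_b)   # shared system prompt is reusable
to_prefill = len(agent_b) - warm
print(cold, warm, to_prefill)
```

The point of the sketch is only that two agents sharing a system prompt already share that part of the prefill under ordinary prefix caching; the tokens after the first divergence still have to be computed.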
wren6991
Prefix caching is already widely deployed by all providers, right? llama.cpp does it. vLLM does it. I'm sure everyone hosting LLMs for a living does it. This paper seems to focus entirely on prefixes (i.e. the prefilled content is rooted at 0). This is... nothing.

The referenced CacheBlend paper (https://arxiv.org/pdf/2405.16444), which tries to stitch together multiple independent prefills, looks more interesting and is new to me. The problem it's trying to solve is:

* KV projections for a given token at a given layer are a function of the residual at that layer,
* which is a function of the attention contribution of the previous layer,
* which is a (nonlinear) function of all earlier tokens' KV values at the previous layer.

This is what stops you from just pasting KV blocks together. Intuitively it might feel like you could do the equivalent of an MPEG deblocking filter to fix up the edges, but there's no guarantee the tokens that need fixing up are at the beginning of the KV block, so they have to be sneaky about it.

Unfortunately, while that paper is quite verbose, it's lacking in detail in the most important part: how they perform the approximate KV recomputation. It looks like the rough idea is that they fully recompute the KV for the first layer, and use the deviation between the recompute and the original cached KV as a heuristic for how important it is to recompute the full KV values (i.e. all remaining layers) for that token. They use that to derive a mask for the tokens which most strongly attend to the earlier context, then do a sparse update of those tokens.

What's still unclear is how this actually ends up being a performance win, given that the sparse update for each token still requires the exact KV for all the prior tokens in order to actually arrive at the correct value. It just kind of recurses the problem instead of fixing it. Maybe they just use the precomputed KV for the other tokens as input, and live with the approximation?
I think this is already somewhat pragmatically solved: just don't pull huge documents into context. Give the LLM tools to search them and retrieve the fragments that are actually relevant.
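The selection heuristic as wren6991 reads it from the CacheBlend paper can be sketched roughly as follows. This follows the commenter's interpretation, not a confirmed description of the paper's method: the function names, the Euclidean distance metric, and the `ratio` parameter are all assumptions made for illustration, and the vectors are made-up stand-ins for first-layer KV values.

```python
import math


def first_layer_deviation(cached_kv, fresh_kv):
    """Per-token distance between cached and freshly recomputed layer-1 KV."""
    return [math.dist(c, f) for c, f in zip(cached_kv, fresh_kv)]


def select_tokens_to_recompute(cached_kv, fresh_kv, ratio):
    """Pick the token positions whose layer-1 KV moved the most.

    `ratio` is the assumed fraction of tokens given a full recompute;
    the remaining tokens would keep their cached (approximate) KV.
    """
    devs = first_layer_deviation(cached_kv, fresh_kv)
    budget = max(1, math.ceil(ratio * len(devs)))
    ranked = sorted(range(len(devs)), key=lambda i: devs[i], reverse=True)
    return sorted(ranked[:budget])


# Made-up layer-1 vectors for four tokens of a pasted-in KV block.
cached = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
fresh = [[1.0, 0.0], [0.3, 1.0], [0.5, 0.5], [0.0, 1.0]]

half = select_tokens_to_recompute(cached, fresh, ratio=0.5)
quarter = select_tokens_to_recompute(cached, fresh, ratio=0.25)
print(half, quarter)
```

Tokens 1 and 3 are the only ones whose first-layer values changed, so they are the ones chosen. The sketch covers only the selection step; it says nothing about the open question raised above, namely what KV the sparse update uses as input for the tokens that were not selected.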