Advanced Quantization Algorithm for LLMs
lastdong
121 points
16 comments
May 01, 2026
Related Discussions
- SubQ: Sub-quadratic LLM built for 12M-token context · gagan2020 · 17 pts · May 05, 2026 · 57% similar
- Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x · gmays · 16 pts · March 27, 2026 · 54% similar
- SubQ: a sub-quadratic LLM with 12M-token context · mitchwainer · 46 pts · May 05, 2026 · 53% similar
- Apply video compression on KV cache to 10,000x less error at Q4 quant · polymorph1sm · 16 pts · March 22, 2026 · 53% similar
- Reliable Software in the LLM Era · mempirate · 102 pts · March 12, 2026 · 52% similar
Discussion Highlights (4 comments)
netdur
Hmm... at Q4_K_M, stock-style quantization already retains ~99–99.8% of BF16 accuracy, and AutoRound pushes that to ~99.4–100.n% (??). The gap is roughly 0.1–0.7 percentage points. https://github.com/intel/auto-round/blob/main/docs/gguf_alg_...
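For reference, the arithmetic behind figures like these can be sketched as follows. The scores are made-up placeholders, not numbers from the AutoRound docs. The sketch only shows how "retention" and a percentage-point gap are computed, and why retention can land above 100% when a quantized model happens to score slightly higher than its BF16 baseline on a noisy benchmark.

```python
def retention(quant_score: float, bf16_score: float) -> float:
    """Quantized accuracy expressed as a percentage of the BF16 baseline."""
    return 100.0 * quant_score / bf16_score

# Hypothetical benchmark accuracies (placeholders, not measured results).
bf16 = 0.7000
stock_q4 = 0.6965      # hypothetical stock Q4_K_M score
autoround_q4 = 0.7007  # hypothetical AutoRound Q4_K_M score, above baseline

stock_ret = retention(stock_q4, bf16)
auto_ret = retention(autoround_q4, bf16)
gap_pp = auto_ret - stock_ret  # difference in percentage points

print(f"stock: {stock_ret:.2f}%  AutoRound: {auto_ret:.2f}%  gap: {gap_pp:.2f} pp")
```

A retention above 100% in this sense does not mean quantization added information, only that the quantized score exceeded the baseline score on that particular evaluation.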
trilogic
You can try it with this model here: https://hugston.com/models/56tps-tested-autoround-qwen35-35b... It is really well done and runs pretty fast with context up to 300k, at just 11.65 GB. Get the mmproj file as well for vision/image processing.
liuliu
I am actually getting interested in QAT these days, especially the LSQ+ type, but it doesn't seem like people have done enough of that, at least in the open-source world: basically 2-bit / 3-bit OPD with LSQ+.
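For readers unfamiliar with LSQ+, a minimal scalar sketch of its quantizer is below. It assumes the asymmetric form with a learnable scale `s` and offset `beta` on an unsigned grid. The function names, the 2-bit default, and the example values are illustrative assumptions, and the per-tensor gradient scaling used in the LSQ papers is left out.

```python
def lsq_plus_forward(x: float, s: float, beta: float, bits: int = 2) -> float:
    """Fake-quantize x: shift by beta, scale by s, round, clamp, then map back."""
    qmin, qmax = 0, 2**bits - 1
    q = round((x - beta) / s)      # note: Python rounds exact halves to even
    q = max(qmin, min(qmax, q))
    return q * s + beta

def lsq_plus_grads(x: float, s: float, beta: float, bits: int = 2):
    """Straight-through-estimator gradients of the output w.r.t. (s, beta)."""
    qmin, qmax = 0, 2**bits - 1
    v = (x - beta) / s
    if v <= qmin:                  # clipped low: output is qmin * s + beta
        return float(qmin), 1.0
    if v >= qmax:                  # clipped high: output is qmax * s + beta
        return float(qmax), 1.0
    return round(v) - v, 0.0       # in range: rounding treated as identity

# Hypothetical 2-bit grid: {-0.75, -0.25, 0.25, 0.75}
for x in (-1.0, 0.3, 2.0):
    print(x, lsq_plus_forward(x, s=0.5, beta=-0.75), lsq_plus_grads(x, 0.5, -0.75))
```

In QAT these gradients let `s` and `beta` be trained jointly with the weights, which is what makes the approach attractive at 2-bit and 3-bit widths where a fixed grid fits poorly.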
programjames
Anyone willing to dig through the code or papers for the actual algorithm? It looks like the GitHub repo and papers have not been optimized for communication.
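As far as the AutoRound README and the underlying SignRound paper describe it, the core idea is to learn a small per-weight rounding offset in [-0.5, 0.5] with signed gradient descent, minimizing the difference between full-precision and quantized outputs on calibration data. The toy below is a sketch of that idea for a single linear output, not Intel's implementation. It omits the learned clipping ranges and the block-wise transformer reconstruction, and the step size, grid, and random data are assumptions.

```python
import random

def quantize(w, scale, v, qmin=-8, qmax=7):
    """Quantize-dequantize one weight on a 4-bit grid with rounding offset v."""
    q = round(w / scale + v)
    return scale * max(qmin, min(qmax, q))

def output_errors(X, w, wq):
    """Per-sample difference between quantized and full-precision outputs."""
    return [sum(x[j] * (wq[j] - w[j]) for j in range(len(w))) for x in X]

def recon_loss(X, w, wq):
    """Mean squared output error over the calibration samples."""
    errs = output_errors(X, w, wq)
    return sum(e * e for e in errs) / len(errs)

def sign_round(X, w, scale, steps=200, lr=0.005):
    n = len(w)
    v = [0.0] * n                          # v = 0 is plain round-to-nearest
    best_v = v[:]
    best = recon_loss(X, w, [quantize(w[j], scale, 0.0) for j in range(n)])
    for _ in range(steps):
        wq = [quantize(w[j], scale, v[j]) for j in range(n)]
        loss = recon_loss(X, w, wq)
        if loss < best:                    # keep the best offsets seen so far
            best, best_v = loss, v[:]
        errs = output_errors(X, w, wq)
        for j in range(n):
            # Straight-through estimate: d(wq_j)/d(v_j) ~= scale (clipping ignored).
            g = sum(2.0 * e * x[j] * scale for e, x in zip(errs, X)) / len(X)
            step = lr * ((g > 0) - (g < 0))            # use only the sign of g
            v[j] = max(-0.5, min(0.5, v[j] - step))    # keep offset in range
    return best_v, best

random.seed(0)
w = [random.uniform(-1, 1) for _ in range(8)]                   # toy weights
X = [[random.gauss(0, 1) for _ in range(8)] for _ in range(32)]  # toy calibration set
scale = max(abs(t) for t in w) / 7
rtn = recon_loss(X, w, [quantize(t, scale, 0.0) for t in w])
v, tuned = sign_round(X, w, scale)
print(f"round-to-nearest loss: {rtn:.6f}  tuned loss: {tuned:.6f}")
```

Because the loop starts from round-to-nearest and keeps the best offsets it has seen, the tuned loss can never be worse than the round-to-nearest loss on the calibration data. How much it improves depends on the data.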