Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency
theanonymousone
318 points
97 comments
June 05, 2026
Related Discussions
- Gemma 4: Byte for byte, the most capable open models meetpateltech · 21 pts · April 02, 2026 · 75% similar
- Gemma 4 12B: A unified, encoder-free multimodal model rvz · 777 pts · June 03, 2026 · 71% similar
- Google releases Gemma 4 open models jeffmcjunkin · 1306 pts · April 02, 2026 · 68% similar
- Google's new Gemma 4 12B model is designed to run on any laptop with 16GB of RAM Bender · 12 pts · June 03, 2026 · 66% similar
- Accelerating Gemma 4: faster inference with multi-token prediction drafters amrrs · 521 pts · May 05, 2026 · 63% similar
Discussion Highlights (19 comments)
minimaxir
It's a bit awkward to release Gemma 4 12B ( https://news.ycombinator.com/item?id=48385906 ) and then a canonical Q4_0 Gemma 4 12B a couple of days later. It's good that this post lists the expected VRAM usage for the models, with Q4_0 Gemma 4 12B at 6.7GB, which does fit comfortably within Google's 16GB claim, although it confirms that only the quantized version will do so. Relatedly, in Google's newly released Edge Gallery for macOS, Gemma 4 12B is explicitly listed as unsupported due to insufficient RAM even on a 16GB machine, but given the expected VRAM usage here the Q4_0 variant definitely should fit and Google should fix that.
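As a rough sanity check on the 6.7GB figure above, here is a minimal sketch of the weight-storage arithmetic. It relies on the ggml Q4_0 layout (blocks of 32 weights, each with one 16-bit scale plus 32 four-bit values, so 18 bytes per block or 4.5 bits per weight). The parameter count of exactly 12 billion is an assumption for illustration, and the function names are hypothetical. Real files may keep some tensors at higher precision, and runtime usage adds KV cache and buffers on top.

```python
# Q4_0 (ggml/GGUF) stores weights in blocks of 32:
# one 16-bit scale (2 bytes) + 32 four-bit values (16 bytes) = 18 bytes per block.
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 2 + 32 // 2

# ASSUMPTION: a round 12 billion parameters; the real count may differ.
ASSUMED_PARAMS = 12_000_000_000


def q4_0_weight_bytes(n_params: int) -> float:
    """Approximate bytes needed to store n_params weights in Q4_0."""
    return n_params / Q4_0_BLOCK_WEIGHTS * Q4_0_BLOCK_BYTES


def bf16_weight_bytes(n_params: int) -> int:
    """Bytes needed to store n_params weights in BF16 (2 bytes each)."""
    return n_params * 2


if __name__ == "__main__":
    q4_gb = q4_0_weight_bytes(ASSUMED_PARAMS) / 1e9
    bf16_gb = bf16_weight_bytes(ASSUMED_PARAMS) / 1e9
    print(f"Q4_0 weights: ~{q4_gb:.2f} GB")
    print(f"BF16 weights: ~{bf16_gb:.2f} GB")
```

Under those assumptions the Q4_0 weights alone land close to the quoted figure, while the BF16 weights alone would already exceed 16GB, which is consistent with the point that only the quantized version fits.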
netdur
Had a good run with Gemma 4 E2B Unsloth 4Q: https://youtube.com/shorts/XLsAnz5aAAI The E4B model doesn’t fit on my phone's TPU, so it swaps to RAM. The QAT version means more accuracy, good!
refulgentis
@google.com'ers, there are no GGUFs (the blog says there are)
satvikpendem
Unsloth's collection as well [0], with their results [1]. Looks like they can get very close to 100% accuracy compared to the unquantized BF16 model, and Unsloth's quants are better than Google's original QAT as posted in the article. Personally, I'm using the 2B model for web search and structured JSON output via Unsloth Studio and its API; it works very well for that, even with the model embedded on phones. [0] https://huggingface.co/collections/unsloth/gemma-4-qat [1] https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis
cr3cr3
For a moment I got excited thinking QAT is Intel Quick Assist Technology...
somewhatrandom9
Could these quantized models make MTP (Multi-Token Prediction) significantly faster when used as drafters for larger regular Gemma 4 models?
simonw
I just ran one of these locally on a Mac like this:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu \
    --prompt="Generate an SVG of a pelican riding a bicycle"

The first time you run that it downloads 3.2GB to ~/.cache/huggingface/hub/models--litert-community--gemma-4-E2B-it-litert-lm

It can handle audio and image input too, which is pretty cool for a 3.2GB model. For images:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu --vision-backend gpu \
    --attachment image.jpg --prompt describe

And for audio:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu --audio-backend cpu \
    --attachment audio.wav --prompt transcribe

(The pelican is rubbish, but it's only a 3.2GB file so the fact it even outputs valid SVG is impressive to me: https://gist.github.com/simonw/94b318afde4b1ce5ff67d4b5d0362... )
WhiteDawn
Once someone generates an MTP layer for 26B A4B 4 QAT I'll be singing from the hills with my 5-year-old GPU.
redox99
I was just testing Gemma E2B and E4B yesterday, and they are just too dumb to be useful outside of niche use cases. Besides, there's no good agent on Android. Having a model that can't run web searches and browse websites is limited in use, particularly small models that really need to be grounded on search results to be factual, because they can't memorize enough. Edit: I'd like to know what kind of usage the people that seem to disagree and downvoted this are having.
zkmon
How can the smaller Unsloth GGUF quant beat the original Google quant? (ref: unsloth/gemma-4-31B-it-qat-GGUF)
Catloafdev
Being able to run the 12B on 8GB of VRAM is huge. It's crazy to see how fast these small local models have evolved.
steno132
I don't get this obsession with smaller models. I've been using Claude and GPT models for years and have had zero issues with them. I see absolutely no benefit to me as an end user from a local model that is going to take up more of my CPU and memory and slow down my machine. I almost always have Internet, and if I don't, then not having access to an AI model is the least of my concerns.
jbarrow
Very impressed with how much the Gemma ecosystem has advanced just this week. Gemma 12B, multitoken prediction, and official quants released. Feels like Google is putting real effort into this string of releases, and I'm very excited to see that!
jhatax
It’s the Friday before WWDC during which Apple is going to announce an “improved” Siri based on Google models (a locked partnership, for now). Maybe it’s a coincidence, but this might be Google releasing models that will be showcased next week by Apple? No knowledge, just speculation.
jack_pp
Ran hf.co/google/gemma-4-12B-it-qat-q4_0-gguf:Q4_0 with ollama on an AMD Ryzen 9 8940HX, NVIDIA GeForce RTX 5060 (8 GB), 14 GB RAM laptop and it is surprisingly fast
Kylejeong21
google pixel intelligence may beat apple intelligence
nazgul17
I don't see these QAT models on Edge Gallery; just the BF16 models are there. Is there anything I am missing?
superkuh
I wish they would release the base (non instruction tuned) models for use with pattern completion.
RandyOrion
From the perspective of a local LLM user, I think QAT doesn't solve the major problem of the Gemma models. The Gemma family (gen 1 to gen 4) consistently shows an extreme range of activations, i.e., 6e5, essentially forcing people to use a bf16 KV cache and accept a short context window, e.g., 31B, iq4_xs quantization, 100k context window on 32GB of memory. Or people use a q8 KV cache with a 200k context window and accept a large performance penalty. QAT training with a w4a16 target, while improving performance on inference with low-precision weights, doesn't solve the KV cache problem at all. In the end, a QAT is a QAT, and the effort went into the QAT checkpoints. Thank you, Gemma team, for the QAT checkpoints.
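One plausible reading of the activation-range point, sketched under assumptions: a value around 6e5 exceeds the largest finite IEEE half-precision (fp16) number, while bf16 keeps the wider float32 exponent range. The snippet below derives both limits from the bit layouts (fp16: 5 exponent bits, 10 mantissa bits; bf16: 8 exponent bits, 7 mantissa bits). The helper name is hypothetical, and the 6e5 figure is taken from the comment rather than measured here.

```python
def max_finite(exp_bits: int, mant_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float with the given layout."""
    bias = 2 ** (exp_bits - 1) - 1  # the all-ones exponent is reserved for inf/NaN
    return (2 - 2.0 ** -mant_bits) * 2.0 ** bias


FP16_MAX = max_finite(exp_bits=5, mant_bits=10)  # IEEE half precision
BF16_MAX = max_finite(exp_bits=8, mant_bits=7)   # bfloat16

# ASSUMPTION: activation magnitude quoted in the comment above, not measured here.
ACTIVATION = 6e5

if __name__ == "__main__":
    print(f"fp16 max finite: {FP16_MAX}")
    print(f"bf16 max finite: {BF16_MAX:.4e}")
    print(f"6e5 representable in fp16: {ACTIVATION <= FP16_MAX}")
    print(f"6e5 representable in bf16: {ACTIVATION <= BF16_MAX}")
```

This only illustrates dynamic range. It says nothing about the precision lost when such values go through a q8 cache, which is a separate effect.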