Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

theanonymousone 318 points 97 comments June 05, 2026
blog.google

Discussion Highlights (19 comments)

minimaxir

It's a bit awkward to release Gemma 4 12B ( https://news.ycombinator.com/item?id=48385906 ), and then a canonical Q4_0 Gemma 4 12B a couple of days later. It's good that this post lists the expected VRAM usage for the models, with Q4_0 Gemma 4 12B at 6.7GB. That does support Google's claim of fitting within 16GB comfortably, although it confirms that only the quantized version will do so. Relatedly, in Google's newly released Edge Gallery for macOS, Gemma 4 12B is explicitly listed as unsupported due to insufficient RAM even on a 16GB machine, but given the expected VRAM usage here the Q4_0 variant definitely should fit and Google should fix that.
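As a rough sanity check on the 6.7GB figure above, here is a minimal sketch of the weight-size arithmetic for GGUF's Q4_0 layout, which stores each block of 32 weights as one fp16 scale plus 32 four-bit values (18 bytes per block, or 4.5 bits per weight). The exact parameter count of 12e9 and the assumption that every tensor is stored as Q4_0 are simplifications made for illustration; real files often keep some tensors in other types, and the estimate excludes KV cache and runtime overhead.

```python
# Rough estimate of Q4_0 weight storage. Assumptions (not from the post):
# exactly 12e9 parameters, and every tensor stored in Q4_0.
Q4_0_BLOCK_WEIGHTS = 32          # weights per Q4_0 block
Q4_0_BLOCK_BYTES = 2 + 32 // 2   # one fp16 scale + 32 packed 4-bit values


def q4_0_bytes(n_params: float) -> float:
    """Bytes needed to store n_params weights in Q4_0 blocks."""
    return n_params / Q4_0_BLOCK_WEIGHTS * Q4_0_BLOCK_BYTES


bits_per_weight = Q4_0_BLOCK_BYTES * 8 / Q4_0_BLOCK_WEIGHTS
size_gb = q4_0_bytes(12e9) / 1e9

print(f"{bits_per_weight} bits/weight, about {size_gb:.2f} GB of weights")
```

Under these assumptions the weights alone come to roughly 6.75 GB, which is in the same range as the figure quoted from the post.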

netdur

Had a good run with Gemma 4 E2B Unsloth 4Q: https://youtube.com/shorts/XLsAnz5aAAI The E4B model doesn’t fit on my phone's TPU, so it swaps to RAM. The QAT version means more accuracy, good!

refulgentis

@google.com'ers, there are no GGUFs (the blog says there are)

satvikpendem

Unsloth's collection as well [0], with their results [1]. Looks like they can get very close to 100% accuracy compared to the unquantized BF16 model, and Unsloth's quants are better than Google's original QAT as posted in the article. Personally, I'm using the 2B model for web search and structured JSON output via Unsloth Studio and its API; it works very well for that even with the model embedded on phones. [0] https://huggingface.co/collections/unsloth/gemma-4-qat [1] https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis

cr3cr3

For a moment I got excited thinking QAT is Intel Quick Assist Technology...

somewhatrandom9

Could these quantized models make MTP (Multi-Token Prediction) significantly faster when used as drafters for larger regular Gemma 4 models?

simonw

I just ran one of these locally on a Mac like this:

    uvx litert-lm run \
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
      gemma-4-E2B-it.litertlm \
      --backend=gpu \
      --prompt="Generate an SVG of a pelican riding a bicycle"

The first time you run that it downloads 3.2GB to ~/.cache/huggingface/hub/models--litert-community--gemma-4-E2B-it-litert-lm

It can handle audio and image input too, which is pretty cool for a 3.2GB model. For images:

    uvx litert-lm run \
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
      gemma-4-E2B-it.litertlm \
      --backend=gpu --vision-backend gpu \
      --attachment image.jpg --prompt describe

And for audio:

    uvx litert-lm run \
      --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
      gemma-4-E2B-it.litertlm \
      --backend=gpu --audio-backend cpu \
      --attachment audio.wav --prompt transcribe

(The pelican is rubbish, but it's only a 3.2GB file so the fact it even outputs valid SVG is impressive to me: https://gist.github.com/simonw/94b318afde4b1ce5ff67d4b5d0362... )

WhiteDawn

Once someone generates an MTP layer for 26B A4B 4 QAT I'll be singing from the hills with my 5-year-old GPU.

redox99

I was just testing Gemma E2B and E4B yesterday, and they are just too dumb to be useful outside of niche use cases. Besides, there's no good agent on Android. Having a model that can't run web searches and browse websites is limited in use, particularly small models that really need to be grounded on search results to be factual, because they can't memorize enough. Edit: I'd like to know what kind of usage the people that seem to disagree and downvoted this are having.

zkmon

How can the smaller Unsloth GGUF quant beat the original Google quant? (ref: unsloth/gemma-4-31B-it-qat-GGUF)

Catloafdev

Being able to run the 12B on 8GB VRAM is huge. It's crazy to see how fast these small local models have evolved.

steno132

I don't get this obsession with smaller models. I've been using Claude and GPT models for years and have had zero issues with them. I see absolutely no benefit to me as an end user from a local model which is going to take up more of my CPU and memory and slow down my machine. I almost always have Internet, and if I don't, then not having access to an AI model is the least of my concerns.

jbarrow

Very impressed with how much the Gemma ecosystem has advanced just this week. Gemma 12B, multitoken prediction, and official quants released. Feels like Google is putting real effort into this string of releases, and I'm very excited to see that!

jhatax

It’s the Friday before WWDC during which Apple is going to announce an “improved” Siri based on Google models (a locked partnership, for now). Maybe it’s a coincidence, but this might be Google releasing models that will be showcased next week by Apple? No knowledge, just speculation.

jack_pp

Ran hf.co/google/gemma-4-12B-it-qat-q4_0-gguf:Q4_0 with ollama on an AMD Ryzen 9 8940HX, NVIDIA GeForce RTX 5060 (8 GB), 14 GB RAM laptop and it is surprisingly fast.

Kylejeong21

Google Pixel intelligence may beat Apple Intelligence

nazgul17

I don't see these QAT models on Edge Gallery; just the BF16 models are there. Is there anything I am missing?

superkuh

I wish they would release the base (non instruction tuned) models for use with pattern completion.

RandyOrion

From the perspective of a local LLM user, I think QAT doesn't solve the major problem of the Gemma models. The Gemma family (gen 1 to gen 4) consistently has an extreme range of activations, i.e., 6e5, essentially forcing people to use a bf16 KV cache and accept a short context window, e.g., 31B, iq4_xs quantization, 100k context window on 32GB of memory. Or people use a q8 KV cache with a 200k context window and accept a large performance penalty. QAT training with a w4a16 target, while improving inference performance with low-precision weights, doesn't solve the KV cache problem at all. In the end, a QAT is a QAT, and the effort went into QAT checkpoints. Thank you, Gemma team, for the QAT checkpoints.
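One plausible reading of the activation-range point is a numeric-format issue: IEEE fp16 tops out at 65504, so a value around 6e5 cannot be stored in it, while bf16 keeps float32's exponent range at the cost of mantissa precision. The sketch below illustrates only that property of the two formats using the Python standard library; it takes the commenter's 6e5 figure at face value and does not measure anything about Gemma itself. The helper names are hypothetical, and bf16 is simulated by truncating a float32 (real converters usually round to nearest).

```python
import struct

ACTIVATION = 6e5  # magnitude quoted in the comment above, not measured here


def fits_fp16(x: float) -> bool:
    """True if x can be packed as an IEEE half-precision float."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


def to_bf16(x: float) -> float:
    """Simulate bf16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


print("fits in fp16:", fits_fp16(ACTIVATION))
print("as bf16:", to_bf16(ACTIVATION))
```

The value overflows fp16 but survives in bf16 with a coarser mantissa, which is consistent with the comment's preference for a bf16 KV cache over a 16-bit format with a narrower range.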
