Accelerating Gemma 4: faster inference with multi-token prediction drafters
amrrs
521 points
233 comments
May 05, 2026
Related Discussions
- Gemma 4: Byte for byte, the most capable open models meetpateltech · 21 pts · April 02, 2026 · 66% similar
- Google releases Gemma 4 open models jeffmcjunkin · 1306 pts · April 02, 2026 · 60% similar
- Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference takumi123 · 278 pts · April 15, 2026 · 57% similar
- Gemini 3.1 Flash-Lite: Built for intelligence at scale meetpateltech · 51 pts · March 03, 2026 · 56% similar
- AlphaEvolve: Gemini-powered coding agent scaling impact across fields berlianta · 274 pts · May 07, 2026 · 55% similar
Discussion Highlights (20 comments)
mchusma
I find it puzzling Google doesn’t actively promote its own cloud for inference of Gemma 4. Open source is great, love it. But shouldn’t Google want me to be able to use and pay for it through Gemini and Vertex?
these
Has anyone managed to get this to work in LM Studio? They've got an option in the UI, but it never seems to allow me to enable it.
disiplus
nice, will run it later against qwen3.6 27b, the speed was one of the reasons why I was running qwen and not gemma. the difference was big, there is some magic that happens when you have more than 100tps.
zdw
MTP support is being added to llama.cpp, at least for the Qwen models ( https://github.com/ggml-org/llama.cpp/pull/20533 ) and I'd imagine Gemma 4 will come soon. The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
skybrian
Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.
shay_ker
curious that they are doing speculative decoding and not baking MTP into the model, like Nemotron https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
nalinidash
technical details are here: https://x.com/googlegemma/status/2051694045869879749
pu_pe
So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?
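The "no quality degradation" claim follows from how speculative decoding verifies drafts. A minimal sketch, assuming greedy decoding and two hypothetical toy next-token functions (`toy_target` and `toy_draft` are made up for illustration and stand in for the big model and the drafter): every token that ends up in the output is one the target model would have chosen itself, so the result matches plain target-only decoding.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def plain_decode(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Ordinary decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The cheap drafter proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks each proposed position. A real engine scores
        #    all positions in one batched forward pass; we loop for clarity.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: use the target's token
                break
        else:
            # All k drafts accepted: the verification pass yields one extra token.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    return out[:goal]


# Hypothetical toy models, for illustration only.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + len(ctx)) % 11


def toy_draft(ctx: List[Token]) -> Token:
    guess = toy_target(ctx)
    # Deliberately wrong whenever the context length is a multiple of 3.
    return guess if len(ctx) % 3 else (guess + 1) % 11
```

With sampling instead of greedy decoding, real implementations use a rejection-sampling rule that preserves the target distribution rather than an exact token match; that variant is not shown here.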
christina97
I recently set up the 26B A4B model on vLLM on an RTX3090 (4-bit) after a hiatus from local models. Just completely blown away by the speed and quality you can get now for sub-$1k investment. I tried first with Qwen but it was unstable and had ridiculously long thinking traces!
deskamess
Did DeepSeek come up with MTP? It was listed prominently in their recent paper as being carried forward from the previous release.
brcmthrowaway
Is Google's local model strategy tuned to taking big AI cloud labs down a peg?
m3kw9
ok so? Anyone got a verdict/review?
recsv-heredoc
Cloudflare offers excellent service for many of the open-weights models. It's fast, cheap and simple to set up. Can highly recommend as an LLM provider. They serve gemma-4-26b-a4b-it.
julianlam
Really excited to try this once it is merged into llama.cpp. Gemma 4 26B-A4B is much quicker on my setup than Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5x speedup is tantalizing. Have tried draft models with limited success (the smaller 3B draft model in addition to a dense 14B Ministral model introduced too much overhead already).
msp26
Google is singlehandedly carrying western open source models. Gemma 4 31B is fantastic. However, it is a little painful to try to fit the best possible version into 24GB vram with vision + this drafter soon. My build doesn't support any more GPUs and I believe I would want another 4090 (overpriced) for best performance or otherwise just replace it altogether.
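The 24GB squeeze can be put into rough numbers. A back-of-the-envelope sketch, assuming weight memory is simply parameters times bits per weight; the drafter size (1B at 16-bit, an upper bound given the "<1B" mentioned in this thread) and the `other_gb` allowance for KV cache, vision encoder and activations are placeholder assumptions, not measured values.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in decimal GB (ignores quantization metadata)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(budget_gb: float, main_gb: float, drafter_gb: float, other_gb: float) -> bool:
    """True if main model + drafter + everything else fits in the VRAM budget."""
    return main_gb + drafter_gb + other_gb <= budget_gb


main_4bit = weights_gb(31e9, 4)   # 31B main model at 4-bit
main_8bit = weights_gb(31e9, 8)   # same model at 8-bit
drafter = weights_gb(1e9, 16)     # assumed upper bound for the drafter
other_gb = 5.0                    # placeholder: KV cache, vision encoder, activations

print(f"4-bit main: {main_4bit:.1f} GB, fits in 24 GB: {fits(24, main_4bit, drafter, other_gb)}")
print(f"8-bit main: {main_8bit:.1f} GB, fits in 24 GB: {fits(24, main_8bit, drafter, other_gb)}")
```

Under these assumptions the 4-bit build leaves only a few GB of headroom, and anything above roughly 5-6 bits per weight no longer fits alongside the drafter.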
AbuAssar
these are the updated models: google/gemma-4-31B-it-assistant, google/gemma-4-26B-A4B-it-assistant, google/gemma-4-E4B-it-assistant, google/gemma-4-E2B-it-assistant
ActorNightly
I found that Gemma 4:26b makes way more mistakes compared to Qwen and Gemma 3. Gemma 3 27b QAT was my go-to for some time as it was quite fast. Qwen is still king for a balance of accuracy and inference speed. Gemma:31b was more accurate but speed was horrendous.
Patrick_Devine
In my testing the Gemma 4 31b model had the biggest speed boost in Ollama w/ the MLX runner for coding tasks (at about 2x). Unfortunately you'll need a pretty beefy Mac to run it because quantization really hurts the acceptance rate. The three other smaller models didn't perform as well because the validation time of the draft model ate up most of the performance gains. I'm still trying to tune things to see if I can get better performance. You can try it out with Ollama 0.23.1 by running `ollama run gemma4:31b-coding-mtp-bf16`.
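The trade-off described here (acceptance rate versus drafting and validation overhead) can be estimated with the usual geometric-series model of speculative decoding. A rough sketch, assuming each drafted token is accepted independently with probability `accept_rate`; `draft_cost` and `verify_cost` are hypothetical knobs expressed relative to one normal single-token pass of the main model, not values measured on any runner.

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per draft-and-verify step with k drafted tokens.

    Sum of accept_rate**i for i in 0..k: the accepted prefix plus the one
    token the main model always contributes.
    """
    if accept_rate >= 1.0:
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, k: int,
                      draft_cost: float, verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain decoding (which costs 1.0 per token).

    draft_cost:  cost of one drafter pass, relative to one main-model pass.
    verify_cost: cost of verifying k drafted tokens in one main-model pass,
                 relative to a normal single-token pass (1.0 = free batching).
    """
    step_cost = k * draft_cost + verify_cost
    return expected_tokens_per_step(accept_rate, k) / step_cost


# Illustrative inputs only: a well-matched drafter vs. a poorly matched one
# whose verification pass is also more expensive.
print(estimated_speedup(accept_rate=0.8, k=4, draft_cost=0.05))
print(estimated_speedup(accept_rate=0.3, k=4, draft_cost=0.2, verify_cost=1.3))
```

A result below 1.0 means the drafter makes things slower. That is consistent with smaller main models gaining less: the drafter is relatively more expensive for them, and a lower acceptance rate (for example after quantization) shrinks the numerator.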
vhiremath4
So this is like branch prediction for operating systems? Except we have probability baked into the model itself so it’s even more reliable.
WarmWash
I don't see it talked about much, but Gemma (and Gemini) use far fewer tokens per output than other models, while still staying within arm's reach of top benchmark performance. It's not uncommon to see a Gemma vs Qwen comparison where Qwen does a bit better but spent 22 minutes on the task, while Gemma aligned the buttons wrong but only spent 4 minutes on the same prompt. So taken at face value, Gemma is now underperforming leading open models by 5-10%, but doing it in 1/10th the time.