Accelerating Gemma 4: faster inference with multi-token prediction drafters

amrrs 521 points 233 comments May 05, 2026
blog.google

Discussion Highlights (20 comments)

mchusma

I find it puzzling Google doesn’t actively promote its own cloud for inference of Gemma 4. Open source is great, love it. But shouldn’t Google want me to be able to use and pay for it through Gemini and Vertex?

these

Has anyone managed to get this to work in LM Studio? They've got an option in the UI, but it never seems to allow me to enable it.

disiplus

nice, will run it later against qwen3.6 27b, the speed was one of the reasons why I was running qwen and not gemma. the difference was big, there is some magic that happens when you have more than 100tps.

zdw

MTP support is being added to llama.cpp, at least for the Qwen models ( https://github.com/ggml-org/llama.cpp/pull/20533 ) and I'd imagine Gemma 4 will come soon. The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.

skybrian

Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.

shay_ker

curious that they are doing speculative decoding and not baking MTP into the model, like Nemotron https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...

nalinidash

technical details are here: https://x.com/googlegemma/status/2051694045869879749

pu_pe

So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?
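A toy sketch of why the output need not change under greedy decoding: the drafter only proposes tokens, and the target model keeps a proposal only where it matches what the target would have produced itself. This is not Gemma's MTP drafter implementation. The functions `toy_target` and `toy_draft` are hypothetical stand-ins for real models, and the per-token verification loop stands in for what a real engine does in one batched forward pass. Sampling with temperature needs a rejection-sampling variant that is not shown here.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the large target model."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical cheap drafter: agrees with the target except at every third position."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


def target_only(target: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def speculative(target: NextToken, draft: NextToken,
                prompt: List[Token], n: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: draft k tokens, keep the prefix the target agrees with."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        # 1. The drafter proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks each proposed position.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch with the target's token
                break
        else:
            # All k drafts accepted: the target's next token comes for free.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    return out[len(prompt):len(prompt) + n]


if __name__ == "__main__":
    prompt = [1, 2, 3]
    print(speculative(toy_target, toy_draft, prompt, 20) == target_only(toy_target, prompt, 20))
```

Every token appended is one the target itself would have chosen for that context, so a poor drafter costs speed but not correctness in this greedy setting.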

christina97

I recently set up the 26B A4B model on vLLM on an RTX3090 (4-bit) after a hiatus from local models. Just completely blown away by the speed and quality you can get now for sub-$1k investment. I tried first with Qwen but it was unstable and had ridiculously long thinking traces!

deskamess

Did DeepSeek come up with MTP? It was listed prominently in their recent paper as being carried forward from the previous release.

brcmthrowaway

Is Google's local model strategy tuned to take the big AI cloud labs down a notch?

m3kw9

ok so? Anyone got a verdict/review?

recsv-heredoc

Cloudflare offers excellent service for many of the open-weights models. It's fast, cheap and simple to set up. Can highly suggest as an LLM provider. They serve gemma-4-26b-a4b-it.

julianlam

Really excited to try this once it is merged into llama.cpp. Gemma 4 26B-A4B is much quicker on my setup vs Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5x speedup is tantalizing. Have tried draft models to limited success (the smaller 3B draft model in addition to a dense 14B Ministral model introduced too much overhead already).

msp26

Google is singlehandedly carrying western open source models. Gemma 4 31B is fantastic. However, it is a little painful to try to fit the best possible version into 24GB vram with vision + this drafter soon. My build doesn't support any more GPUs and I believe I would want another 4090 (overpriced) for best performance or otherwise just replace it altogether.

AbuAssar

these are the updated models: google/gemma-4-31B-it-assistant, google/gemma-4-26B-A4B-it-assistant, google/gemma-4-E4B-it-assistant, google/gemma-4-E2B-it-assistant

ActorNightly

I found that Gemma 4:26b makes way more mistakes compared to Qwen and Gemma 3. Gemma 3 27b QAT was my go-to for some time as it was quite fast. Qwen is still king for a balance of accuracy and inference speed. Gemma:31b was more accurate but speed was horrendous.

Patrick_Devine

In my testing the Gemma 4 31b model had the biggest speed boost in Ollama w/ the MLX runner for coding tasks (at about 2x). Unfortunately you'll need a pretty beefy Mac to run it because quantization really hurts the acceptance rate. The three other smaller models didn't perform as well because the validation time of the draft model ate up most of the performance gains. I'm still trying to tune things to see if I can get better performance. You can try it out with Ollama 0.23.1 by running `ollama run gemma4:31b-coding-mtp-bf16`.
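A back-of-the-envelope sketch of the trade-off described above, using the standard simplified model of speculative decoding: each drafted token is accepted independently with probability `accept_rate`, the drafter proposes `k` tokens per step, and one draft step costs `draft_cost_ratio` of a target forward pass. The parameter names and the `verify_cost` knob are assumptions for illustration, not measurements of Gemma 4 or Ollama.

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per verification step: (1 - a^(k+1)) / (1 - a)."""
    if accept_rate >= 1.0:
        return float(k + 1)  # every draft accepted, plus the target's bonus token
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, k: int,
                      draft_cost_ratio: float, verify_cost: float = 1.0) -> float:
    """Tokens per step divided by step cost, both measured in plain target forward passes.

    verify_cost is an assumed multiplier for checking k tokens in one pass
    (1.0 means verification is as cheap as a single normal decoding step).
    """
    step_cost = k * draft_cost_ratio + verify_cost
    return expected_tokens_per_step(accept_rate, k) / step_cost


if __name__ == "__main__":
    for a in (0.3, 0.6, 0.9):
        print(a, round(estimated_speedup(a, k=4, draft_cost_ratio=0.05), 2))
```

Under this model, a lower acceptance rate (for example from quantization) or a relatively expensive drafter and verification pass can push the estimate below 1, meaning drafting is slower than decoding normally.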

vhiremath4

So this is like branch prediction for operating systems? Except we have probability baked into the model itself so it’s even more reliable.

WarmWash

I don't see it talked about much, but Gemma (and Gemini) use far fewer tokens per output than other models, while still staying within arm's reach of top benchmark performance. It's not uncommon to see a gemma vs qwen comparison, where qwen does a bit better, but spent 22 minutes on the task, while gemma aligned the buttons wrong, but only spent 4 minutes on the same prompt. So taken at face value, gemma is now underperforming leading open models by 5-10%, but doing it in 1/10th the time.
