A 10 year old Xeon is all you need

cafkafk 684 points 273 comments June 01, 2026

Discussion Highlights (20 comments)

cafkafk

Hi HN. I wrote this post after getting frustrated by how few ways there are to run the new Gemma 4 Drafter models, with mainstream tools not prioritizing them and hiding all the performance levers. I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow. I've also linked the quants at the end, but they're not going to run unless you use the ik_llama.cpp fork I mention; see the other posts for more details. I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any questions, I can try to answer on a best-effort basis.

Eonexus

I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?
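
As a rough way to put a number on "reading speed", the sketch below converts words per minute into tokens per second. Both inputs are assumptions rather than figures from the post: 200 to 300 words per minute as a typical silent-reading range, and about 1.3 tokens per English word as a rule of thumb that is not measured against Gemma's tokenizer. The function name `wpm_to_tps` is made up for illustration.

```shell
#!/bin/sh
# Convert a reading speed (words per minute) into tokens per second.
# The default of 1.3 tokens per word is an assumed rule of thumb,
# not a measured value for any particular tokenizer.
wpm_to_tps() {
    awk -v wpm="$1" -v tpw="${2:-1.3}" 'BEGIN { printf "%.1f\n", wpm * tpw / 60 }'
}

# Assumed range of silent-reading speeds.
for wpm in 200 250 300; do
    printf '%s wpm ~ %s tokens/s\n' "$wpm" "$(wpm_to_tps "$wpm")"
done
```

Under those assumptions, reading speed lands somewhere in the single digits of tokens per second, so the exact figure depends heavily on the reader and the tokenizer.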

potus_kushner

@cafkafk got a recommendation for a good model that fits into 64GB and leaves a couple of GB free for other tasks?

christkv

Makes you wonder if it's possible to squeeze more tps out of a Strix Halo system using the 16 Zen 5 cores as well as the GPU.

asimovDev

I have an ancient DDR3 Xeon that doesn't support any AVX (dual X5690 and 96GB of 1333 MHz RAM). You reckon it would even build / run at all?
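
One way to see what a given CPU offers before attempting a build is to read its feature flags. The sketch below is Linux-only and assumes `/proc/cpuinfo` exposes a `flags` line; the helper name `simd_flags` and the list of flags it looks for are illustrative choices, not something the post or the fork prescribes. Whether the fork builds or runs acceptably without AVX is not confirmed anywhere in this thread.

```shell
#!/bin/sh
# Report which SIMD-related feature flags appear in a CPU flags string.
# The flag list is an illustrative selection, not a requirement list.
simd_flags() {
    found=""
    for f in sse4_2 avx f16c fma avx2 avx512f; do
        # Pad with spaces so "avx" does not match inside "avx2".
        case " $1 " in
            *" $f "*) found="$found${found:+ }$f" ;;
        esac
    done
    printf '%s\n' "${found:-none}"
}

# On Linux, take the flags from the first "flags" line of /proc/cpuinfo.
if [ -r /proc/cpuinfo ]; then
    simd_flags "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
fi
```

A CPU without AVX would print only the older flags it does have, such as `sse4_2`, or `none`.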

vhaudiquet

The E5-2620 v4 only supports DDR4.

NSUserDefaults

How about the iMac Pro? Would that work? I was able to put 128GB in it (not as easy as the regular iMac but possible).

nurettin

I also run a Qwen 3.6 MoE A4B on old hardware. I set it up with numactl --membind=1 so it is constrained to one of the memory sticks, which speeds up token generation a little.
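
A minimal sketch of that kind of pinning follows. Note that `--membind` takes NUMA node numbers rather than individual memory sticks, so it is worth checking the topology with `numactl --hardware` first. The helper `numa_wrap`, the node number, and the `llama-server --threads 4` command are placeholders; adding `--cpunodebind` so the threads run on the same node as the memory is an assumption on top of what the comment describes.

```shell
#!/bin/sh
# Print a command line wrapped in numactl so that both the CPU threads
# and the memory allocations stay on a single NUMA node.
# The node number and the wrapped command are placeholders.
numa_wrap() {
    node="$1"
    shift
    printf 'numactl --cpunodebind=%s --membind=%s %s\n' "$node" "$node" "$*"
}

# Show the NUMA topology first, when numactl is installed.
if command -v numactl >/dev/null 2>&1; then
    numactl --hardware || true
fi

# Example: the command one might run to pin a server to node 1.
numa_wrap 1 llama-server --threads 4
```

On a single-socket machine there is usually only node 0, in which case the wrapper changes nothing useful.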

hparadiz

I'm now staring at a 10 year old 4U with 256 GB of DDR4 and thinking hmmmmm

bflesch

Might consider going for even older CPUs which don't have the Intel ME ring -3 thing which is full of backdoors

hypfer

> The argument for speculative decoding is stronger on CPU than on GPU.

Uh. Uuuh. No?

Also

> While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.

What purpose does the quoting of "caches" serve there? Is this AI writing written by that model running on that host?

gigatexal

What kind of tokens per second did the OP get? I saw nothing written about this.

phaser

What intrigues me the most about AI progress is not AGI or the model du jour by $AI_UNICORN, but rather what can be run locally. I remember having an amusing but rather useless model on a beefy gaming PC that I had 6 years ago, and now I have something a hundred times better on my M5 laptop. Should the market react to the memory shortage and the progress of Apple silicon continue at the same pace, what we'll be able to run locally in 6 years will be very exciting. Or frightening. Also, I don't know what this means for the valuation of the AI companies. I remember asking one of their employees about this very idea at an event, and instead of answering he bailed out to grab a cocktail.

car

Similar recent posting with optimizations for older Xeon: High-Performance AI on a Budget: Optimizing llama.cpp for Qwen3.5 Inference on a Dual-GPU HP Z440 https://news.ycombinator.com/item?id=47320244

cykros

Does this mean my 15 year old Phenom is too old? But it has 16 GB of DDR3 RAM! Admittedly web browsers and it don't get along that well. That's literally the only thing that drags on my Slackware 15 system, though, and even then usually only when it gets to around 15 or so open tabs.

SXX

Now we need someone to try running Kimi K2.6 on an old Xeon and DDR3. After all, these platforms do support up to 768GB of RAM.

egorfine

This and the previous one are insanely good articles. Thank you!

haunter

And this is one of those CPUs that had dual-socket motherboards, so you can have double the fun (and power bill) https://pcpartpicker.com/products/motherboard/#s=20028,20029...

anon-3988

I tried to run Gemma 4 on this CPU and it did not go well: https://www.techpowerup.com/cpu-specs/ryzen-7-4800u.c2281 It is way too slow.

throwaway2027

Glad to see other people realizing this. I've been running Gemma 26B-A4B Q4 on a 2012 Xeon with 16GB to 24GB of RAM in a container. It's getting around 8 to 12 tokens per second. Obviously it's not comparable to huge contexts and running it on a GPU, and the image decoder in llama.cpp is super slow compared to a GPU, but for some small automation tasks and general trivia questions it's decent. The speed is just enough that you can read along instead of waiting for it to finish.

Here's my setup. You may want to figure out what the best optimizations are for your specific CPU, like AVX2, because mine didn't have most of them. I did try MTP briefly but I wasn't getting performance improvements. You could play around with the batch sizes for cache or context, or go even lower to Q2, and don't overcommit on threads either, but I would suggest either the defaults or trying out llama-bench. This isn't by any means the best, I assume, but it worked decently for me and I sometimes swap out Gemma for Qwen. You could also lower q8_0 to q4_0 for more context, but some say it could hurt quality, although I have noticed it too on some models.

    # Building
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON \
      -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_OPENMP=ON
    cmake --build build --config Release

    # Running
    export OPENBLAS_NUM_THREADS=4
    export OMP_NUM_THREADS=4
    OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 \
    llama.cpp/build/bin/llama-server \
      -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
      --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.00 \
      --jinja --host 0.0.0.0 --port 8080 \
      --cache-type-k q8_0 --cache-type-v q8_0 \
      --threads 4 --threads-batch 4 \
      --ctx-size 8192 -n 8192 \
      --batch-size 2048 --ubatch-size 512 \
      --no-mmap --mlock \
      --chat-template-kwargs '{"enable_thinking":false}' \
      --no-mmproj -np 1 -fa 1
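
For the llama-bench suggestion, a thread-count sweep might look like the sketch below. The model path, the binary location, and the thread counts are placeholders to adjust for the machine at hand; `-p 512` and `-n 128` are just example prompt and generation lengths. The script only prints the command when the binary or the model file is missing, so it is safe to run as is.

```shell
#!/bin/sh
# Placeholder paths: point MODEL at a local GGUF file and BENCH at the
# llama-bench binary produced by the build.
MODEL="${MODEL:-model.gguf}"
BENCH="${BENCH:-llama.cpp/build/bin/llama-bench}"
THREADS="${THREADS:-2,4,8}"

# Print the llama-bench invocation for a comma-separated thread list.
bench_cmd() {
    printf '%s -m %s -t %s -p 512 -n 128\n' "$BENCH" "$MODEL" "$1"
}

if [ -x "$BENCH" ] && [ -f "$MODEL" ]; then
    # Measure prompt processing (-p) and generation (-n) at each thread count.
    "$BENCH" -m "$MODEL" -t "$THREADS" -p 512 -n 128
else
    echo "would run: $(bench_cmd "$THREADS")"
fi
```

Comparing the generation rows across thread counts shows where adding threads stops helping, which is the overcommit point the comment warns about.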
