GLM-5.2 – How to Run Locally

TechTechTech 294 points 138 comments June 22, 2026
unsloth.ai

Discussion Highlights (19 comments)

xrd

So close! My machine with 192GB RAM + RTX 3090 24GB can almost run this. It says it needs 24GB of VRAM and 256GB of RAM for MoE offloading. https://unsloth.ai/docs/models/glm-5.2#usage-guide In a prior thread, someone said it would take $500k in hardware: https://news.ycombinator.com/item?id=48629970

zuzululu

Wonder if AMD's new AI chip can run this with ease? I'm seriously considering buying it. GLM 5.2 is just shy of GPT 5.4, so I would welcome offloading any grunt work locally. I am very excited for local LLMs. I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR. This should put more pressure on the frontier models to avoid sitting on any fancy stuff and to lower token prices as a whole. Nothing beats a local LLM disconnected from the cloud.

pheggs

I feel like the gap is closing to be able to run good enough models locally even for coding and I would assume it could make some companies a bit nervous. Am I wrong about that?

andai

How is this model half the size of DeepSeek V4 Pro? Is it because DeepSeek did more aggressive cost cutting on the attention mechanism?

skiing_crawling

"it can fit" on 256GB of RAM, but it will be heavily quantized and still run very slowly. The headline number is not token generation, its prompt processing. So if you get 10 tok/s and an API gives you 20-30 tok/s, it doesn't seem that bad on its face, but a mac studio or any other machine that's not loading all of it into GPU will do PP 20-50X slower than a purely GPU based setup, which is what actually makes this unusable without $50k in GPUs. On top of that, you will still be heavily quantized.

nullc

Just running cpu only w/ Q6 on 9684X I get about 1tok/s ... also still get about 1tok/s/stream when running 16 in parallel.

Wowfunhappy

> The full model requires 1.51TB of disk space

...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage? I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have. But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!

CGamesPlay

Can somebody help me understand the Quantization Analysis? It says "dynamic 4-bit UD-Q4_K_XL and dynamic 5-bit UD-Q5_K_XL are generally lossless" while showing a top-1% token agreement on the chart of 97.5%. Not what I would consider "generally lossless". Is this implying that some post-processing is going to account for the 2.5% loss? Beam search?
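
For readers unfamiliar with the metric, here is a minimal sketch of how a top-1 token agreement score could be computed. It assumes the chart reports the share of positions where the quantized model's most likely next token matches the full-precision model's; the function name `top1_agreement` and the toy logits are hypothetical, and this is not taken from Unsloth's evaluation code.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def top1_agreement(ref_logits, quant_logits):
    """Fraction of positions where both models pick the same top token.

    Each argument is a list of per-position logit lists over the same
    vocabulary, produced by feeding both models the same text.
    """
    if not ref_logits or len(ref_logits) != len(quant_logits):
        raise ValueError("need two non-empty logit sequences of equal length")
    matches = sum(
        argmax(ref) == argmax(quant)
        for ref, quant in zip(ref_logits, quant_logits)
    )
    return matches / len(ref_logits)


# Toy example: 4 positions, vocabulary of 3 tokens.
reference = [[2.0, 0.1, 0.3], [0.2, 0.1, 1.5], [0.0, 3.0, 1.0], [0.4, 0.9, 0.1]]
quantized = [[1.8, 0.2, 0.3], [0.1, 0.3, 1.2], [0.1, 2.5, 1.1], [0.8, 0.7, 0.1]]

print(top1_agreement(reference, quantized))  # 3 of 4 positions agree -> 0.75
```

Under this reading, 97.5% means that roughly one position in forty gets a different greedy token, which is a measurable difference rather than a bit-exact match.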

hxii

Any time I see one of these posts about models of this size a quote comes to mind – "Your Scientists Were So Preoccupied With Whether Or Not They Could, They Didn’t Stop To Think If They Should". Only a select few have the hardware required to run this to begin with, and even then the forecasted performance makes me wonder if it’s worth it at all.

ramgine

I have up to 1tb of ddr4 in my server but it only has a 12gb vram 3060. Would getting a 24gb vram make this a viable system or am I throwing money away?

dofm

Can't run this myself. But I do like Unsloth Studio, quite a lot. It's nicely designed.

snootypoot

if sam altman didnt exist i could afford to run this

jonathanhefner

> Runing GLM-5.2 on local hardware

Do the runes make it smarter or just run faster (or both)?

segmondy

I run Q4_K_XL. All it takes to get about 6tk/sec is 512GB of RAM and 2 3090 GPUs with llama.cpp -cmoe. I also have crappy DDR4 at 2400MHz; 3200MHz would bring that speed up to about 9tk/sec. I also have an OK 32-core EPYC CPU; a better 64-core would bring it up to about 11tk/sec. I did a budget build before the crazy hardware cost and I regret it every day. Nevertheless, it's fantastic being able to run this model at home. It's great for planning, and for one-shot prompting once you have a plan or all the context you need. This entire hardware cost $2400 when it was built. If you're willing to be resourceful, you can find ways to run these models at home. I often get the silly question of why, and suggestions about how much I can save using a cloud API, but the Fable drama has opened up eyes on why it's good for us to be independent. Thanks team unsloth, Q4_K_XL is solid. If you are going to grab a quant, make sure to get the K_XL variant if it can fit.

Frannky

There is a push from multiple directions at the same time:
- new AI desktops with GB10s. They are relatively cheap and you can cluster them and load 1TB of VRAM
- Nvidia, AMD, Intel, Cerebras etc. pushing new hardware
- OSS models getting crazy good, like GLM 5.2
- flash models getting very good, like DeepSeek V4 Flash
- quantizations
- harnesses being able to use different models (big for difficult stuff, small for grunt work)

So hopefully soon, for the ones who want to break free from APIs, we will be able to host at home a cluster of AI desktops at a reasonable price with Opus-level capabilities, can't wait!!

drudolph914

GLM 5.2 is the first time I'm actually excited about AI! I'm not the most bullish on AI code for several reasons, but the biggest reason is the ownership model. We all know we're near the tail end of the "subsidized pricing" window for AI, and I've been hoping for so long to get an open weight model that is _close enough_ to the SOTA before this window closes - and we actually got it! I'm excited to be able to run GLM locally in the near future, and use these things like a tool instead of living in this for-rent model for the rest of my life. I'm excited to actually enjoy programming again.

numlock86

Is this really worth it, though? Throughout the years my experience with quantized models has been that they feel like a lobotomized version of the original. Doesn't matter if it's an LLM, dedicated diffusion model or some other dedicated task. Sure, they get the job done. But a lot worse. The only ones that can somewhat hold up are the ones provided by the vendor directly. Gemma4 comes to mind. However I suspect they have some secret sauce other than just "let's quantize this" since they have the original model and its data at hand. There should be more native 4bit, 1.25bit and likewise models. Those actually work great while making them smaller in comparison. But I guess there is some reason for them being pretty niche.

edg5000

One advantage of a local LLM: you could serialize the context yourself, without being constrained by APIs. And let's not forget, the Big 2 encrypt their thinking. If you use custom clients, which is a very grey area already, being able to produce the raw context string is a big bonus. It takes away a lot of annoying constraints and needless mystique/obfuscation. But I don't know how usable GLM 5.2 is vs the Big 2.

c7b

Can someone explain the math to me? Why is 1-bit only ten percent less memory than 2-bit?
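
One possible explanation, offered as an assumption rather than a description of how these particular files are built: if a "dynamic" quant applies the headline bit width to only part of the model, keeps the remaining tensors at a higher precision, and stores per-block scale metadata on top, then file size tracks the weighted average bits per weight and not the number in the name. Every fraction and bit width in this sketch is hypothetical.

```python
def effective_bits_per_weight(low_bits, low_fraction, high_bits, overhead_bits):
    """Weighted average bits per weight for a mixed-precision quant.

    low_bits:      nominal width of the aggressively quantized tensors
    low_fraction:  share of weights stored at low_bits (0..1)
    high_bits:     width of the tensors kept at higher precision
    overhead_bits: extra bits per low-precision weight for block scales
    """
    low_part = low_fraction * (low_bits + overhead_bits)
    high_part = (1.0 - low_fraction) * high_bits
    return low_part + high_part


# Hypothetical recipe: 40% of weights at the headline width, the rest at 4 bits,
# plus half a bit per low-precision weight of scale metadata.
one_bit = effective_bits_per_weight(1.0, low_fraction=0.4, high_bits=4.0, overhead_bits=0.5)
two_bit = effective_bits_per_weight(2.0, low_fraction=0.4, high_bits=4.0, overhead_bits=0.5)

saving = 1.0 - one_bit / two_bit
print(f"1-bit: {one_bit:.2f} bpw, 2-bit: {two_bit:.2f} bpw, saving: {saving:.1%}")
```

With these made-up numbers, the nominal halving from 2-bit to 1-bit shrinks the average from 3.4 to 3.0 bits per weight, a saving of about 12%.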
