A few words on DS4

caust1c 247 points 84 comments May 14, 2026

Discussion Highlights (16 comments)

bjconlan

This is great! I feel the same way about the deepseek v4 architecture for commodity hardware. Also have enjoyed playing with https://huggingface.co/HuggingFaceTB/nanowhale-100m-base (but early days for me understanding this space)

simonw

I got this running on a 128GB M5 the other day - pretty painless, model runs in about 80GB of RAM and it seemed to be very capable at writing code and tool execution.

kamranjon

Just want to mention that I've been pulling down and using DwarfStar locally and it's incredible. I actually have it running on my personal macbook m4 max with 128gb of ram and I am running the server to share it through tailscale with my work laptop and just have pi running there. The long context reasoning is something I haven't even seen in frontier models - I was running at 124k tokens earlier and it was still just buzzing along with no issues or fatigue. I am amazed at how well it works, I'm using it right now for some pretty complex frontend work, and it is much much faster than, for example running a dense 27b or 31b model (like qwen or gemma) for me (The benefits of MoE) - but the long context capabilities have been what have been absolutely flooring me. Super excited about this project and hope Antirez can keep himself from burning out - i've been following the repo pretty closely and there are a ton of PR's flooding in and it seems like he's had to do a lot of filtering out of slop code.

0xbadcafebee

I don't see an explanation of why they would make a model-specific inference engine vs just using llamacpp. There are already lots of people working on the llamacpp integration. This is a lot of effort spent on a single model which is likely to become obsolete when a different model comes out that does better. In some discussions, people are now making PRs against both the llamacpp branches and ds4... so it's taking a rare commodity (people investing development time in this model) and fragmenting it

minimaxir

A relevant recent tweet from antirez: https://x.com/antirez/status/2054854124848415211 > Gentle reminder on how, in the recent DS4 fiesta, not just me but every other contributor found GPT 5.5 able to help immensely and Opus completely useless. I've noticed the same for lower level squeezing-as-much-performance-as-possible code work.

codedokode

I thought DeepSeek was closed-weights and proprietary? I wonder how it compares against Western open-weight models. The hugging face page contains the comparison only with proprietary models for some reason.

sbinnee

It is a big thing for sure to have a competitive local agentic model. I've replaced gemini 3 flash preview with DeepSeek v4 flash for all of my personal use cases. Starting from chat app, language learning, and even hobby coding. For coding, I couldn't get decent results no matter which sota latest models I used before. It's not close to Opus or Codex models. It's a flash model and makes mistakes here and there (I just saw `from opentele while import trace`, new Python syntax!) But I found its tool calling is reliable than other oss models I tried. I assume that it attributes to interleaved thinking. Its reasoning effort is adjusted automatically by queries. I enjoy reading these reasoning traces from open models because you can't see them from proprietary models. I would love to try DS4 so bad. Well, I don't have a machine for it. I will just stick to openrouter. I wish I can run a competitive oss model on 32GB machine in 3 years.

brcmthrowaway

This guy is falling deep into Yegge-tier psychosis.

FuckButtons

It’s shocking how close this feels to claude, obviously it's much slower, but I don’t know that it’s significantly dumber. Interestingly the imatrix quantization seems to be better than whatever quant the zdr inference backends on open router are using. It was self aware enough yesterday to realize that it’s own server process was itself without me telling it, which is not something I’ve ever observed a local model doing before.

somewhatrandom9

With "intelligence" (or whatever you want to call it) and speed both seeming to ramp up quickly with local models I wonder what the growth rate and ceiling(?) might be in this space. Will this kind of iq and performance work with just e.g: 16GB RAM in a couple years? Is there a new kind of Moore's law to be defined here?

karmakaze

Great to find this narrow focused thing: > We support the following backends: Metal is our primary target. Starting from MacBooks with 96GB of RAM. NVIDIA CUDA with special care for the DGX Spark. AMD ROCm is only supported in the rocm branch. It is kept separate from main since I (antirez) don't have direct hardware access, so the community rebases the branch as needed. > This project would not exist without llama.cpp and GGML, make sure to read the acknowledgements section, a big thank you to Georgi Gerganov and all the other contributors. Edit: aww, doesn't seem to support offloading to system RAM[0] (yet) [0] https://github.com/antirez/ds4/issues/108 Guess I'll have to keep watching the llama.cpp issue[1] [1] https://github.com/ggml-org/llama.cpp/issues/22319

gcr

DwarfStar4 is a small LLM inference runtime that can run DeepSeek 4. The blog post implies that it currently requires 96GB of VRAM. For others who are lacking context :-)

easythrees

I thought for a moment there was a Dark Souls 4

zmmmmm

I'm very curious where we will saturate the curve on "enough" intelligence for coding. At some point, you can let a less smart model hammer at a problem for longer and get to the same result, and as long as you are not involved it comes to the same thing. I feel like DeepSeek V4 Pro is nearly there. Maybe Flash is too. Once we hit that point, I am curious how much of Anthropic's current business model falls apart? So far it's always been clear that you just pay for the most intelligent model you can get because it is worth it. It now seems clear to me that there is limited runway on that concept. It is just a question of how long that runway is. I honestly wonder how much of their frantic push to broaden out into enterprise / productivity is because they see this writing on the wall already.

kgeist

Did someone compare DeepSeek 4 Flash to Qwen3.6-27B on real tasks (quality + speed)? According to the benchmarks at artificialanalysis.ai, Qwen3.6-27B is better at agentic tasks, and DS4 is only 2 points better at coding (both with max reasoning effort, full weights). At the same time, DS4 requires 5 times more VRAM even at 2 bits. Last time I explored this topic, large MoE models at 2-3 bits usually performed worse (quality-wise) than dense ~30B models at 4-8 bits, despite being much heavier to run. Sure, MoE models have more knowledge, but extreme quantization may negate the benefits. And generally for coding tasks, you don't need a model that has memorized all the irrelevant trivia like, I don't know, the list of all villages in country X. DS4 also seems to run much slower on Mac Studio Ultra, which appears to be more or less in the same price range as RTX 5090. RTX 5090 gives me 50-60 tok/sec and 260k context with Unsloth's 5-bit quantization (only some layers are 5-bit too) and an 8-bit KV cache; prefill is instant too. It works flawlessly in OpenCode. If you already have a spare high-end Mac, I can see the benefit, but I'm not sure it's a good configuration overall. Unless Qwen3.6 is more benchmaxxed than DS4 :)

Riany

I think local models need to be good enough that privacy, latency, and control become worth the tradeoff, instead of beat the best cloud models

A few words on DS4

Discussion Highlights (16 comments)

Related Discussions