Ollama is now powered by MLX on Apple Silicon in preview
redundantly
95 points
29 comments
March 31, 2026
Related Discussions
Found 5 related stories in 54.8ms across 3,471 title embeddings via pgvector HNSW
- Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon sanchitmonga22 · 199 pts · March 10, 2026 · 55% similar
- April 2026 TLDR Setup for Ollama and Gemma 4 26B on a Mac mini greenstevester · 298 pts · April 03, 2026 · 53% similar
- Nvidia PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Swift ipotapov · 358 pts · March 05, 2026 · 50% similar
- MAUI Is Coming to Linux DeathArrow · 187 pts · March 22, 2026 · 50% similar
- iPhone 17 Pro Demonstrated Running a 400B LLM anemll · 546 pts · March 23, 2026 · 49% similar
Discussion Highlights (7 comments)
babblingfish
LLMs on device are the future. It's more secure, it solves the problem of inference demand outstripping data-center supply, and it would use less electricity. It's just a matter of getting the performance good enough. Most users don't need frontier-model performance.
codelion
How does it compare to some of the newer MLX inference engines like optiq that support turboquantization? https://mlx-optiq.pages.dev/
dial9-1
Still waiting for the day I can comfortably run Claude Code with local LLMs on macOS with only 16 GB of RAM.
LuxBennu
Already running Qwen 70B 4-bit on an M2 Max 96GB through llama.cpp and it's pretty solid for day-to-day stuff. The MLX switch is interesting because Ollama was basically shelling out to llama.cpp on Mac before, so native MLX should mean better memory handling on Apple Silicon. Curious to see how it compares on the bigger models vs the GGUF path.
AugSun
"We can run your dumbed down models faster": The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on key language-modeling tasks for some models.
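The quoted reductions are just bits-per-weight arithmetic. A quick sketch (the 70B parameter count is an arbitrary illustration, not tied to any specific model; note the implied NVFP4 rate of 16/3.5 ≈ 4.6 bits is above a flat 4.0 because block-scaling metadata is stored alongside the 4-bit values):

```python
# Back-of-the-envelope weight-memory footprints for the ratios quoted above.

def footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight memory in GB (10^9 bytes) for a dense model."""
    return n_params * bits_per_weight / 8 / 1e9

params = 70e9                     # illustrative 70B-parameter model
fp16 = footprint_gb(params, 16.0)  # 140 GB
fp8 = footprint_gb(params, 8.0)    # 70 GB
nvfp4 = fp16 / 3.5                 # 3.5x smaller than FP16, per the quote

print(f"FP16:  {fp16:.0f} GB")
print(f"FP8:   {fp8:.0f} GB")
print(f"NVFP4: {nvfp4:.0f} GB ({fp8 / nvfp4:.2f}x smaller than FP8)")
```

The computed FP8-to-NVFP4 ratio comes out to 1.75x, consistent with the "1.8x" the quote rounds to.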
brcmthrowaway
What is the difference between Ollama, llama.cpp, ggml and gguf?
mfa1999
How does this compare to llama.cpp in terms of performance?