Ollama is now powered by MLX on Apple Silicon in preview
redundantly
95 points
29 comments
March 31, 2026
Related Discussions
Found 5 related stories in 54.8ms across 3,471 title embeddings via pgvector HNSW
- Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon sanchitmonga22 · 199 pts · March 10, 2026 · 55% similar
- April 2026 TLDR Setup for Ollama and Gemma 4 26B on a Mac mini greenstevester · 298 pts · April 03, 2026 · 53% similar
- Nvidia PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Swift ipotapov · 358 pts · March 05, 2026 · 50% similar
- MAUI Is Coming to Linux DeathArrow · 187 pts · March 22, 2026 · 50% similar
- iPhone 17 Pro Demonstrated Running a 400B LLM anemll · 546 pts · March 23, 2026 · 49% similar
Discussion Highlights (7 comments)
babblingfish
LLMs on device are the future. It's more secure, it solves the problem of inference demand outstripping data-center supply, and it would use less electricity. It's just a matter of getting the performance good enough. Most users don't need frontier-model performance.
codelion
How does it compare to some of the newer MLX inference engines like optiq that support turboquantization? https://mlx-optiq.pages.dev/
dial9-1
Still waiting for the day I can comfortably run Claude Code with local LLMs on macOS with only 16 GB of RAM.
LuxBennu
Already running Qwen 70B 4-bit on an M2 Max 96GB through llama.cpp and it's pretty solid for day-to-day stuff. The MLX switch is interesting because Ollama was basically shelling out to llama.cpp on Mac before, so native MLX should mean better memory handling on Apple Silicon. Curious to see how it compares on the bigger models vs the GGUF path.
AugSun
"We can run your dumbed down models faster": The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on key language-modeling tasks for some models.
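The quoted reductions are just bits-per-weight arithmetic. A quick sketch (the 70B parameter count is an arbitrary illustration, not tied to any specific model; note the implied NVFP4 rate of 16/3.5 ≈ 4.6 bits is above a flat 4.0 because block-scaling metadata is stored alongside the 4-bit values):

```python
# Back-of-the-envelope weight-memory footprints for the ratios quoted above.

def footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight memory in GB (10^9 bytes) for a dense model."""
    return n_params * bits_per_weight / 8 / 1e9

params = 70e9                     # illustrative 70B-parameter model
fp16 = footprint_gb(params, 16.0)  # 140 GB
fp8 = footprint_gb(params, 8.0)    # 70 GB
nvfp4 = fp16 / 3.5                 # 3.5x smaller than FP16, per the quote

print(f"FP16:  {fp16:.0f} GB")
print(f"FP8:   {fp8:.0f} GB")
print(f"NVFP4: {nvfp4:.0f} GB ({fp8 / nvfp4:.2f}x smaller than FP8)")
```

The computed FP8-to-NVFP4 ratio comes out to 1.75x, consistent with the "1.8x" the quote rounds to.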
brcmthrowaway
What is the difference between Ollama, llama.cpp, ggml and gguf?
mfa1999
How does this compare to llama.cpp in terms of performance?