Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code
vbtechguy
232 points
56 comments
April 05, 2026
Related Discussions
Found 5 related stories in 53.4ms across 3,663 title embeddings via pgvector HNSW
- Show HN: Gemma Gem – AI model embedded in a browser – no API keys, no cloud ikessler · 39 pts · April 06, 2026 · 61% similar
- Google releases Gemma 4 open models jeffmcjunkin · 1306 pts · April 02, 2026 · 60% similar
- Gemma 4: Byte for byte, the most capable open models meetpateltech · 21 pts · April 02, 2026 · 59% similar
- April 2026 TLDR Setup for Ollama and Gemma 4 26B on a Mac mini greenstevester · 298 pts · April 03, 2026 · 57% similar
- Gemma 4 on iPhone janandonly · 534 pts · April 05, 2026 · 56% similar
Discussion Highlights (10 comments)
vbtechguy
Here is how I set up Gemma 4 26B for local inference on macOS that can be used with Claude Code.
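A minimal sketch of that setup, assuming LM Studio's `lms` CLI for the headless server and an Anthropic-compatible translation proxy (e.g. LiteLLM) in front of LM Studio's OpenAI-style endpoint; the model identifier and proxy port here are placeholders, so check `lms ls` and your proxy config for the real values:

```shell
# Start LM Studio's local server headless, then load the model
# ("gemma-4-26b" is a placeholder identifier; use the name `lms ls` shows)
lms server start
lms load gemma-4-26b

# Claude Code speaks the Anthropic API, while LM Studio exposes an
# OpenAI-style one, so a translation proxy is assumed to sit on :4000
export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_AUTH_TOKEN=local
claude
```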
trvz
ollama launch claude --model gemma4:26b
jonplackett
So wait, what is the interaction between Gemma and Claude?

Someone1234
Claude Code seems to be a popular frontend for this at the moment. I wonder how long until Anthropic ships an update that makes it anywhere from a little to a lot less turnkey? They've been very clear that they aren't exactly champions of it being used outside of very specific ways.
martinald
Just FYI, MoE doesn't really save (V)RAM. All the weights still need to be loaded in memory; the model just activates fewer of them per forward pass. So it improves tok/s but not VRAM usage.
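The point above can be made concrete with back-of-envelope arithmetic. The parameter counts below are illustrative placeholders, not official Gemma 4 specs:

```python
# MoE memory math: every expert's weights must be resident in (V)RAM,
# but only the active experts' weights are read per generated token.

def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

total_params_b = 26.0   # hypothetical total parameter count
active_params_b = 6.0   # hypothetical active parameters per forward pass

# Assume 4-bit quantization: ~0.5 bytes per parameter
resident = weight_memory_gb(total_params_b, 0.5)
per_token = weight_memory_gb(active_params_b, 0.5)

print(f"resident weights: {resident:.1f} GB")      # what VRAM must hold
print(f"weights read per token: {per_token:.1f} GB")  # what bounds tok/s
```

With these numbers, VRAM still has to fit all 13 GB of weights, but each token only streams 3 GB through the compute units, which is why MoE helps throughput rather than memory footprint.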
asymmetric
Is a framework desktop with >48GB of RAM a good machine to try this out?
aetherspawn
Can you use the smaller Gemma 4B model for speculative decoding with the larger 26B model? Why or why not?
NamlchakKhandro
I don't know why people bother with Claude Code. It's so janky, and there are far superior CLI coding harnesses out there.
inzlab
Awesome. The lighter the hardware that can run big software, the more impressive it is.
edinetdb
Claude Code has become my primary interface for iterating on data pipeline work — specifically, normalizing government regulatory filings (XBRL across three different accounting standards) and exposing them via REST and MCP.

The MCP piece is where the workflow gets interesting. Instead of building a client that calls endpoints, you describe tools declaratively and the model decides when to invoke them. For financial data this is surprisingly effective — a query like "compare this company's leverage trend to sector peers over 10 years" gets decomposed automatically into the right sequence of tool calls without you hardcoding that logic.

One thing I haven't seen discussed much: tool latency sensitivity is much higher in conversational MCP use than in batch pipelines. A 2s tool response feels fine in a script but breaks conversational flow. We ended up caching frequently accessed tables in-memory (~26MB) to get sub-100ms responses.

Have you noticed similar thresholds where latency starts affecting the quality of the model's reasoning chain?
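The caching approach described above can be sketched as a small TTL cache in front of the slow loader; the names here (`TableCache`, `fetch_table`) are illustrative, not from any real MCP SDK:

```python
# Minimal in-memory TTL cache: hot tables stay resident so conversational
# tool calls return in well under 100ms instead of re-querying the store.

import time

class TableCache:
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit and hit[0] > now:
            return hit[1]                       # fast path: in-memory hit
        value = loader(key)                     # slow path: backing store
        self._store[key] = (now + self.ttl, value)
        return value

def fetch_table(name):
    time.sleep(0.05)  # simulate a ~50ms database round trip
    return {"table": name, "rows": []}

cache = TableCache(ttl_seconds=60)
cache.get_or_load("balance_sheets", fetch_table)  # cold: pays the 50ms
t0 = time.monotonic()
cache.get_or_load("balance_sheets", fetch_table)  # warm: sub-millisecond
print(f"warm lookup took {(time.monotonic() - t0) * 1000:.2f} ms")
```

The TTL bounds staleness for data that changes between filing periods; for truly static reference tables it could be dropped entirely.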