Local LLM inference on an M4 Max 128GB

2026-05-10 · 2026-05-11 · Apple M4 Max · 128GB unified memory · macOS 26.3.1

Benchmarking local LLM runtimes on an Apple M4 Max with 128GB unified memory, comparing sustained decode speed and concurrency. Models tested include Qwen3.6-35B-A3B, Qwen3.6-27B, Gemma 4 26B, and DeepSeek V4 Flash across oMLX, Ollama, and ds4.

Key finding: macOS power mode dramatically affects local LLM throughput. The same machine produced 1.4× to 3.8× higher tok/s in High Power mode (powermode 2) versus Low Power (powermode 1). The dense 27B model and ds4 DeepSeek V4 Flash saw the biggest gains (2–3×). All results below show both modes side by side.

Methodology

Each model was tested with the same four prompt/completion combinations via OpenAI-compatible API:

| Case | Prompt | Max tokens | Prompt tokens |
|---|---|---|---|
| short_50 | Short (one-sentence question) | 50 | ~21 |
| short_200 | Short (one-sentence question) | 200 | ~21 |
| medium_100 | Medium (paragraph-length technical prompt) | 100 | ~85 |
| medium_200 | Medium (paragraph-length technical prompt) | 200 | ~85 |

Decode speed is reported as wall-clock tokens per second (completion tokens ÷ wall time) for oMLX and ds4; Ollama rows use the runtime's own internally reported eval tok/s. Each case was run once per model per power mode — these are practical single-run measurements, not averaged benchmarks.
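The wall-clock measurement can be sketched as a short script. This is a minimal illustration, not the exact harness used for these numbers; the endpoint URL, model name, and prompt are hypothetical placeholders for any OpenAI-compatible server (oMLX, Ollama, or ds4):

```python
import json
import time
import urllib.request

def wall_tok_per_s(completion_tokens: int, elapsed_s: float) -> float:
    """Wall-clock decode speed: completion tokens / wall time."""
    return completion_tokens / elapsed_s

def run_case(base_url: str, model: str, prompt: str, max_tokens: int) -> float:
    """Time one non-streaming completion against an OpenAI-compatible API."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.monotonic() - start
    # completion_tokens comes from the standard OpenAI-style usage object
    return wall_tok_per_s(data["usage"]["completion_tokens"], elapsed)

# Hypothetical usage (endpoint and model name are placeholders):
# run_case("http://localhost:8080", "qwen3.6-35b-a3b-4bit",
#          "Explain unified memory in one sentence.", 200)
```

Note that wall time includes prompt eval, which is why the longer-completion cases (short_200, medium_200) are better proxies for pure decode speed.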

Recommended stack on this machine

| Use case | Runtime/model | Why |
|---|---|---|
| Main local agent model | oMLX Qwen3.6-35B-A3B-4bit | Best throughput and concurrency. |
| Dense fallback | oMLX Qwen3.6-27B-4bit | Best dense model tested. |
| Ollama fallback | qwen3.5:35b-a3b-coding-nvfp4 | Best Ollama model tested. |
| Gemma 4 | gemma4:26b-a4b-it-q4_K_M | Best Gemma 4 tag tested. |
| DeepSeek V4 Flash | ds4 q2 GGUF | Only reliable local DS V4 Flash path. |

Speed comparison: Low Power vs High Power

Each model was tested with the same prompts at both AC powermode 1 (Low Power) and powermode 2 (High Power). The medium_200 case (85-token prompt, 200-token completion) is the best single representative of sustained decode speed.

| Runtime/model | Architecture | Low Power tok/s | High Power tok/s | Multiplier |
|---|---|---|---|---|
| Ollama qwen3.5:35b-a3b-coding-nvfp4 | MoE A3B, NVFP4 | ~51.9 | ~107.1 | 2.06× |
| oMLX Qwen3.6-35B-A3B-4bit | MoE A3B, 4bit | ~53.7 | ~78.1 | 1.45× |
| Ollama gemma4:26b-a4b-it-q4_K_M | MoE A4B, Q4_K_M | ~41.8 | ~75.2 | 1.80× |
| ds4 DeepSeek V4 Flash q2 | MoE 284B, q2 GGUF | ~8.9 | ~27.0 | 3.03× |
| oMLX Qwen3.6-27B-4bit | Dense 27B, 4bit | ~11.1 | ~22.3 | 2.01× |

Ollama rows use Ollama internal eval tok/s; oMLX and ds4 rows use wall-clock completion tok/s.

Full results

| Model | Case | Low Power tok/s | High Power tok/s | Multiplier |
|---|---|---|---|---|
| oMLX Qwen3.6-35B-A3B-4bit | short_50 | 44.0 | 75.9 | 1.72× |
| oMLX Qwen3.6-35B-A3B-4bit | short_200 | 55.7 | 80.5 | 1.44× |
| oMLX Qwen3.6-35B-A3B-4bit | medium_100 | 51.7 | 75.5 | 1.46× |
| oMLX Qwen3.6-35B-A3B-4bit | medium_200 | 53.7 | 78.1 | 1.45× |
| Ollama Qwen3.5-35B-A3B NVFP4 | short_50 | 28.5 | 109.6 | 3.84× |
| Ollama Qwen3.5-35B-A3B NVFP4 | short_200 | 50.9 | 107.3 | 2.11× |
| Ollama Qwen3.5-35B-A3B NVFP4 | medium_100 | 45.3 | 108.3 | 2.39× |
| Ollama Qwen3.5-35B-A3B NVFP4 | medium_200 | 51.9 | 107.1 | 2.06× |
| Ollama Gemma4 26B q4 | medium_200 | 41.8 | 75.2 | 1.80× |
| oMLX Qwen3.6-27B-4bit | short_50 | 10.9 | 22.2 | 2.03× |
| oMLX Qwen3.6-27B-4bit | short_200 | 11.6 | 23.8 | 2.05× |
| oMLX Qwen3.6-27B-4bit | medium_100 | 10.7 | 21.5 | 2.01× |
| oMLX Qwen3.6-27B-4bit | medium_200 | 11.1 | 22.3 | 2.01× |
| ds4 DeepSeek V4 Flash q2 | short_50 | ~8.0 | 26.0 | 3.25× |
| ds4 DeepSeek V4 Flash q2 | short_200 | ~8.9 | 30.6 | 3.44× |
| ds4 DeepSeek V4 Flash q2 | medium_100 | ~8.0 | 24.1 | 3.01× |
| ds4 DeepSeek V4 Flash q2 | medium_200 | ~8.9 | 27.0 | 3.03× |

oMLX batching: Low Power vs High Power

oMLX continuous batching raises aggregate throughput in both power modes, though the scaling profile differs: near-linear in Low Power, sublinear in High Power.

| Concurrent requests | Low Power aggregate tok/s | High Power aggregate tok/s | Low Power multiplier | High Power multiplier |
|---|---|---|---|---|
| 1 | ~57.0 | 75.8 | 1.00× | 1.00× |
| 2 | ~111.8 | 130.1 | 1.96× | 1.72× |
| 4 | ~227.5 | 182.9 | 3.99× | 2.41× |

In Low Power mode, oMLX achieves near-perfect linear scaling (3.99× at 4 requests). In High Power mode, the GPU is more saturated per-request, so scaling is sublinear but throughput per request is much higher. Both modes favor oMLX for concurrent agent workloads.
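The aggregate numbers come from dividing total completion tokens by the wall time of the whole concurrent batch. A minimal sketch, where `send_request` stands in for any callable that issues one request and returns its completion token count (the names here are illustrative, not from the benchmark harness):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def aggregate_tok_per_s(send_request: Callable[[], int], n_concurrent: int) -> float:
    """Fire n_concurrent identical requests at once and report aggregate
    throughput: total completion tokens / wall time of the whole batch."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        token_counts = list(pool.map(lambda _: send_request(), range(n_concurrent)))
    elapsed = time.monotonic() - start
    return sum(token_counts) / elapsed
```

With a continuous-batching server like oMLX, the batch of requests decodes together, so the shared wall window is the right denominator; timing each request independently would overstate per-request speed.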

Other models tested

These models were tested in Low Power mode only; the High Power multipliers observed above (roughly 1.4–3.8×) give a reasonable estimate of their High Power speeds.

| Rank | Runtime/model | Architecture/quant | Low Power speed | Verdict |
|---|---|---|---|---|
| 6 | Ollama Huihui Qwen3.6 27B NVFP4 | Dense 27B, NVFP4 | ~10.7-11.2 tok/s | Slower than oMLX 27B-4bit. |
| 7 | MTPLX Qwen3.6 27B MTP | Dense 27B, 4bit + MTP | ~9.2 tok/s | Short-gen benefit only. |
| 9 | Rapid-MLX Qwen3.6 27B 4bit | Dense 27B, 4bit | ~6.7 tok/s | Slower than oMLX 27B-4bit. |
| 10 | Ollama gemma4:31b-nvfp4 | Gemma 4 31B, NVFP4 | ~6.9 tok/s | Unexpectedly slow for MoE A4B. |

Ollama gemma4:26b-nvfp4 was tested alongside the q4 tag and produced essentially identical decode speeds (~41.4 tok/s Low Power), so q4 is kept as the primary Gemma 4 tag.

Gemma 4

Gemma 4 was tested after Ollama's Apple Silicon MLX speed update. The headline 2× speedup was for Qwen3.5-35B-A3B NVFP4, not Gemma 4 specifically.

| Model | Size | Prompt eval (tok/s) | LP decode (tok/s) | HP decode (tok/s) | Verdict |
|---|---|---|---|---|---|
| gemma4:31b-nvfp4 | 20GB | ~97.6 | ~6.88 | not tested | Too slow. |
| gemma4:26b-nvfp4 | 16-17GB | ~117.6 | ~41.4 | not tested | Duplicate of q4. |
| gemma4:26b-a4b-it-q4_K_M | 17GB | ~155.8 | ~41.8 | ~75.2 | Best Gemma 4 tag. |

MTPLX and Rapid-MLX

MTPLX MTP speculative decoding helped short generations but lost to plain oMLX for sustained dense 27B output (~9.2 vs ~11.6 tok/s). Rapid-MLX produced some high single-request numbers on 35B-A3B but had weak or broken concurrency; it was also slower on dense 27B.

DeepSeek V4 Flash

MLX/oMLX does not yet support deepseek_v4, and Rapid-MLX estimated a working set too large for 128GB. ds4 with the q2 GGUF is the only reliable local path:

| Case | Low Power tok/s | High Power tok/s | Multiplier |
|---|---|---|---|
| short_50 | ~8.0 | 26.0 | 3.25× |
| medium_200 | ~8.9 | 27.0 | 3.03× |

Takeaways

  1. Always verify macOS power mode before benchmarking. Low Power mode can understate throughput by 1.4–3.8× depending on model and runtime. Dense and heavy models are most affected.
  2. MoE sparse activation dominates speed. Qwen 35B-A3B is much faster than dense 27B on this machine, in both power modes.
  3. oMLX is the best runtime for local agents because continuous batching scales well in both power modes.
  4. Ollama is a strong compatibility fallback, with the fastest single-request eval speed for MoE models in High Power (~107 tok/s for 35B-A3B).
  5. Gemma 4 26B is good; Gemma 4 31B NVFP4 was not.
  6. DeepSeek V4 Flash is viable locally with ds4 in High Power (~27 tok/s), but awkward at ~9 tok/s in Low Power.

For this machine, the recommendation is oMLX + Qwen3.6-35B-A3B-4bit as the main model, Ollama for compatibility, and ds4 only when local DeepSeek V4 Flash is specifically needed.