Benchmarking local LLM runtimes on an Apple M4 Max with 128GB unified memory, comparing sustained decode speed and concurrency. Models tested include Qwen3.6-35B-A3B, Qwen3.6-27B, Gemma 4 26B, and DeepSeek V4 Flash across oMLX, Ollama, and ds4.
The headline finding: everything runs substantially faster in macOS High Power mode (powermode 2) versus Low Power (powermode 1). The dense 27B model and ds4 DeepSeek V4 Flash saw the biggest gains (2–3×). All results below show both modes side by side.
Methodology
Each model was tested with the same four prompt/completion combinations via OpenAI-compatible API:
| Case | Prompt | Max tokens | Prompt tokens |
|---|---|---|---|
| short_50 | Short (one-sentence question) | 50 | ~21 |
| short_200 | Short (one-sentence question) | 200 | ~21 |
| medium_100 | Medium (paragraph-length technical prompt) | 100 | ~85 |
| medium_200 | Medium (paragraph-length technical prompt) | 200 | ~85 |
Decode speed is reported as wall-clock tokens per second (completion tokens ÷ wall time) for oMLX and ds4; Ollama exposes its own internal eval speed, and that is the number reported for its rows. Each case was run once per model per power mode — these are practical single-run measurements, not averaged benchmarks.
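The wall-clock measurement can be sketched as below. In the real harness, `generate()` would POST to each runtime's OpenAI-compatible `/v1/chat/completions` endpoint and return `usage["completion_tokens"]`; the stub here (and its ~100 tok/s rate) is purely illustrative:

```python
import time

def wall_clock_tok_s(generate, prompt, max_tokens):
    """Time one generation and return (completion_tokens, wall-clock tok/s)."""
    start = time.perf_counter()
    completion_tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return completion_tokens, completion_tokens / elapsed

# Stub standing in for an HTTP call to a local runtime; simulates a model
# decoding at roughly 100 tok/s.
def fake_generate(prompt, max_tokens):
    time.sleep(max_tokens / 100.0)
    return max_tokens

tokens, rate = wall_clock_tok_s(fake_generate, "What is unified memory?", 50)
print(f"{tokens} tokens, {rate:.1f} tok/s")
```

Wall-clock tok/s includes prompt processing and any server overhead, which is why it runs a little below a runtime's self-reported decode speed.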
Recommended stack on this machine
| Use case | Runtime/model | Why |
|---|---|---|
| Main local agent model | oMLX Qwen3.6-35B-A3B-4bit | Best throughput and concurrency. |
| Dense fallback | oMLX Qwen3.6-27B-4bit | Best dense model tested. |
| Ollama fallback | qwen3.5:35b-a3b-coding-nvfp4 | Best Ollama model tested. |
| Gemma 4 | gemma4:26b-a4b-it-q4_K_M | Best Gemma 4 tag tested. |
| DeepSeek V4 Flash | ds4 q2 GGUF | Only reliable local DS V4 Flash path. |
Speed comparison: Low Power vs High Power
Each model was tested with the same prompts at both AC powermode 1 (Low Power) and powermode 2 (High Power). The medium_200 case (85-token prompt, 200-token completion) is the best single representative of sustained decode speed.
| Runtime/model | Architecture | Low Power tok/s | High Power tok/s | Multiplier |
|---|---|---|---|---|
| Ollama qwen3.5:35b-a3b-coding-nvfp4 | MoE A3B, NVFP4 | ~51.9 | ~107.1 | 2.06× |
| oMLX Qwen3.6-35B-A3B-4bit | MoE A3B, 4bit | ~53.7 | ~78.1 | 1.45× |
| Ollama gemma4:26b-a4b-it-q4_K_M | MoE A4B, Q4_K_M | ~41.8 | ~75.2 | 1.80× |
| ds4 DeepSeek V4 Flash q2 | MoE 284B, q2 GGUF | ~8.9 | ~27.0 | 3.03× |
| oMLX Qwen3.6-27B-4bit | Dense 27B, 4bit | ~11.1 | ~22.3 | 2.01× |
Ollama rows use Ollama internal eval tok/s; oMLX and ds4 rows use wall-clock completion tok/s.
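For the Ollama rows, the internal eval speed comes from the `eval_count` and `eval_duration` (nanoseconds) fields in Ollama's native `/api/generate` response. A minimal conversion, with a made-up sample response in place of a real API call:

```python
def ollama_eval_tok_s(response: dict) -> float:
    """Convert Ollama's eval_count / eval_duration (ns) into tokens per second."""
    return response["eval_count"] / response["eval_duration"] * 1e9

# Sample values only; a real response comes from POST /api/generate.
sample = {"eval_count": 200, "eval_duration": 1_867_414_000}  # ~1.87 s for 200 tokens
print(f"{ollama_eval_tok_s(sample):.1f} tok/s")
```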
Full results
| Model | Case | Low Power | High Power | Multiplier |
|---|---|---|---|---|
| oMLX Qwen3.6-35B-A3B-4bit | short_50 | 44.0 | 75.9 | 1.72× |
| oMLX Qwen3.6-35B-A3B-4bit | short_200 | 55.7 | 80.5 | 1.44× |
| oMLX Qwen3.6-35B-A3B-4bit | medium_100 | 51.7 | 75.5 | 1.46× |
| oMLX Qwen3.6-35B-A3B-4bit | medium_200 | 53.7 | 78.1 | 1.45× |
| Ollama Qwen3.5-35B-A3B NVFP4 | short_50 | 28.5 | 109.6 | 3.84× |
| Ollama Qwen3.5-35B-A3B NVFP4 | short_200 | 50.9 | 107.3 | 2.11× |
| Ollama Qwen3.5-35B-A3B NVFP4 | medium_100 | 45.3 | 108.3 | 2.39× |
| Ollama Qwen3.5-35B-A3B NVFP4 | medium_200 | 51.9 | 107.1 | 2.06× |
| Ollama Gemma4 26B q4 | medium_200 | 41.8 | 75.2 | 1.80× |
| oMLX Qwen3.6-27B-4bit | short_50 | 10.9 | 22.2 | 2.03× |
| oMLX Qwen3.6-27B-4bit | short_200 | 11.6 | 23.8 | 2.05× |
| oMLX Qwen3.6-27B-4bit | medium_100 | 10.7 | 21.5 | 2.01× |
| oMLX Qwen3.6-27B-4bit | medium_200 | 11.1 | 22.3 | 2.01× |
| ds4 DeepSeek V4 Flash q2 | short_50 | ~8.0 | 26.0 | 3.25× |
| ds4 DeepSeek V4 Flash q2 | short_200 | ~8.9 | 30.6 | 3.44× |
| ds4 DeepSeek V4 Flash q2 | medium_100 | ~8.0 | 24.1 | 3.01× |
| ds4 DeepSeek V4 Flash q2 | medium_200 | ~8.9 | 27.0 | 3.03× |
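The Multiplier column is simply High Power ÷ Low Power tok/s; recomputing one row as a sanity check (the printed tables appear to round or truncate slightly differently in places):

```python
def multiplier(low_tok_s: float, high_tok_s: float) -> float:
    """High Power speedup over Low Power for the same case."""
    return high_tok_s / low_tok_s

m = multiplier(53.7, 78.1)  # oMLX Qwen3.6-35B-A3B-4bit, medium_200
print(f"{m:.2f}x")
```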
oMLX batching: Low Power vs High Power
oMLX continuous batching scales nearly linearly in Low Power mode (3.99× aggregate throughput at 4 concurrent requests) but sub-linearly in High Power (2.41× at 4), where a single request already runs much closer to the hardware's limit. Notably, the 4-way Low Power aggregate (~227 tok/s) exceeds the High Power one (~183 tok/s).
| Concurrent requests | Low Power aggregate tok/s | High Power aggregate tok/s | Low Power multiplier | High Power multiplier |
|---|---|---|---|---|
| 1 | ~57.0 | 75.8 | 1.00× | 1.00× |
| 2 | ~111.8 | 130.1 | 1.96× | 1.72× |
| 4 | ~227.5 | 182.9 | 3.99× | 2.41× |
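The batching measurement can be sketched with asyncio. The coroutine below is a stub standing in for concurrent POSTs to oMLX's OpenAI-compatible endpoint; aggregate tok/s is total completion tokens divided by the wall time of the whole batch. The stub overlaps perfectly, so it shows the ideal near-linear case:

```python
import asyncio
import time

async def fake_request(max_tokens: int) -> int:
    # Stub for one chat-completion request. With continuous batching,
    # in-flight requests share decode passes, so each request slows
    # down only slightly as concurrency rises.
    await asyncio.sleep(max_tokens / 60.0)  # simulate ~60 tok/s per request
    return max_tokens

async def aggregate_tok_s(concurrency: int, max_tokens: int = 30) -> float:
    start = time.perf_counter()
    results = await asyncio.gather(
        *(fake_request(max_tokens) for _ in range(concurrency))
    )
    elapsed = time.perf_counter() - start
    return sum(results) / elapsed

rate1 = asyncio.run(aggregate_tok_s(1))
rate4 = asyncio.run(aggregate_tok_s(4))
print(f"1 req: {rate1:.0f} tok/s, 4 req: {rate4:.0f} tok/s ({rate4 / rate1:.2f}x)")
```

The real numbers above show where this idealization breaks down: in High Power mode the 4-way multiplier drops to ~2.4× because single-request decode is already near the hardware limit.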
Other models tested
These models were tested in Low Power mode only, but the High Power multiplier pattern (roughly 1.4–3.8×) from the kept models above applies as a reasonable estimate.
| Rank | Runtime/model | Architecture/quant | Low Power speed | Verdict |
|---|---|---|---|---|
| 6 | Ollama Huihui Qwen3.6 27B NVFP4 | Dense 27B, NVFP4 | ~10.7–11.2 tok/s | Slower than oMLX 27B-4bit. |
| 7 | MTPLX Qwen3.6 27B MTP | Dense 27B, 4bit + MTP | ~9.2 tok/s | Short-gen benefit only. |
| 9 | Rapid-MLX Qwen3.6 27B 4bit | Dense 27B, 4bit | ~6.7 tok/s | Slower than oMLX 27B-4bit. |
| 10 | Ollama gemma4:31b-nvfp4 | Gemma 4 31B, NVFP4 | ~6.9 tok/s | Unexpectedly slow for MoE A4B. |
Ollama gemma4:26b-nvfp4 was tested alongside the q4 tag and produced essentially identical decode speeds (~41.4 tok/s Low Power), so q4 is kept as the primary Gemma 4 tag.
Gemma 4
Gemma 4 was tested after Ollama's Apple Silicon MLX speed update. The headline 2× speedup was for Qwen3.5-35B-A3B NVFP4, not Gemma 4 specifically.
| Model | Size | Prompt eval (tok/s) | LP decode (tok/s) | HP decode (tok/s) | Verdict |
|---|---|---|---|---|---|
| gemma4:31b-nvfp4 | 20GB | ~97.6 | ~6.88 | — | Too slow. |
| gemma4:26b-nvfp4 | 16–17GB | ~117.6 | ~41.4 | — | Duplicate of q4. |
| gemma4:26b-a4b-it-q4_K_M | 17GB | ~155.8 | ~41.8 | ~75.2 | Best Gemma 4 tag. |
MTPLX and Rapid-MLX
MTPLX MTP speculative decoding helped short generations but lost to plain oMLX for sustained dense 27B output (~9.2 vs ~11.6 tok/s). Rapid-MLX produced some high single-request numbers on 35B-A3B but had weak or broken concurrency; it was also slower on dense 27B.
DeepSeek V4 Flash
MLX/oMLX does not yet support deepseek_v4, and Rapid-MLX estimated a working set too large for 128GB. ds4 with the q2 GGUF is the only reliable local path:
| Case | Low Power | High Power | Multiplier |
|---|---|---|---|
| short_50 | ~8.0 | 26.0 | 3.25× |
| medium_200 | ~8.9 | 27.0 | 3.03× |
Takeaways
- Always verify macOS power mode before benchmarking. Low Power mode can understate throughput by a factor of 1.4–3.8 depending on model and runtime; dense and heavy models are most affected.
- MoE sparse activation dominates speed. Qwen 35B-A3B is much faster than dense 27B on this machine, in both power modes.
- oMLX is the best runtime for local agents because continuous batching scales well in both power modes.
- Ollama is a strong compatibility fallback, with the fastest single-request eval speed for MoE models in High Power (~107 tok/s for 35B-A3B).
- Gemma 4 26B is good; Gemma 4 31B NVFP4 was not.
- DeepSeek V4 Flash is viable locally with ds4 in High Power (~27 tok/s), but awkward at ~9 tok/s in Low Power.
For this machine, the recommendation is oMLX + Qwen3.6-35B-A3B-4bit as the main model, Ollama for compatibility, and ds4 only when local DeepSeek V4 Flash is specifically needed.