Local LLM inference on an M4 Max 128GB

2026-05-10 · 2026-05-11 · Apple M4 Max · 128GB unified memory · macOS 26.3.1

Benchmarking local LLM runtimes on an Apple M4 Max with 128GB unified memory, comparing sustained decode speed and concurrency. Models tested include Qwen3.6-35B-A3B, Qwen3.6-27B, Gemma 4 26B, and DeepSeek V4 Flash across oMLX, Ollama, and ds4.

Key finding: macOS power mode dramatically affects local LLM throughput. The same machine produced 1.4× to 3.8× higher tok/s in High Power mode (powermode 2) versus Low Power (powermode 1). The dense 27B model and ds4 DeepSeek V4 Flash saw the biggest gains (2–3×). All results below show both modes side by side.

Methodology

Each model was tested with the same four prompt/completion combinations via OpenAI-compatible API:

| Case | Prompt | Max tokens | Prompt tokens |
|---|---|---|---|
| short_50 | Short (one-sentence question) | 50 | ~21 |
| short_200 | Short (one-sentence question) | 200 | ~21 |
| medium_100 | Medium (paragraph-length technical prompt) | 100 | ~85 |
| medium_200 | Medium (paragraph-length technical prompt) | 200 | ~85 |

Decode speed is reported as wall-clock tokens per second (completion tokens ÷ wall time) for oMLX and ds4; Ollama rows use the runtime's own internally reported eval tok/s. Each case was run once per model per power mode — these are practical single-run measurements, not averaged benchmarks.
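The wall-clock measurement can be sketched as a short script. This is a minimal illustration, not the exact harness used for these numbers; the endpoint URL, model name, and prompt are hypothetical placeholders for any OpenAI-compatible server (oMLX, Ollama, or ds4):

```python
import json
import time
import urllib.request

def wall_tok_per_s(completion_tokens: int, elapsed_s: float) -> float:
    """Wall-clock decode speed: completion tokens / wall time."""
    return completion_tokens / elapsed_s

def run_case(base_url: str, model: str, prompt: str, max_tokens: int) -> float:
    """Time one non-streaming completion against an OpenAI-compatible API."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.monotonic() - start
    # completion_tokens comes from the standard OpenAI-style usage object
    return wall_tok_per_s(data["usage"]["completion_tokens"], elapsed)

# Hypothetical usage (endpoint and model name are placeholders):
# run_case("http://localhost:8080", "qwen3.6-35b-a3b-4bit",
#          "Explain unified memory in one sentence.", 200)
```

Note that wall time includes prompt eval, which is why the longer-completion cases (short_200, medium_200) are better proxies for pure decode speed.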

Recommended stack on this machine

| Use case | Runtime/model | Why |
|---|---|---|
| Main local agent model | oMLX Qwen3.6-35B-A3B-4bit | Best throughput and concurrency. |
| Dense fallback | oMLX Qwen3.6-27B-4bit | Best dense model tested. |
| Ollama fallback | qwen3.5:35b-a3b-coding-nvfp4 | Best Ollama model tested. |
| Gemma 4 | gemma4:26b-a4b-it-q4_K_M | Best Gemma 4 tag tested. |
| DeepSeek V4 Flash | ds4 q2 GGUF | Only reliable local DS V4 Flash path. |

Speed comparison: Low Power vs High Power

Each model was tested with the same prompts at both AC powermode 1 (Low Power) and powermode 2 (High Power). The medium_200 case (85-token prompt, 200-token completion) is the best single representative of sustained decode speed.

| Runtime/model | Architecture | Low Power tok/s | High Power tok/s | Multiplier |
|---|---|---|---|---|
| Ollama qwen3.5:35b-a3b-coding-nvfp4 | MoE A3B, NVFP4 | ~51.9 | ~107.1 | 2.06× |
| oMLX Qwen3.6-35B-A3B-4bit | MoE A3B, 4bit | ~53.7 | ~78.1 | 1.45× |
| Ollama gemma4:26b-a4b-it-q4_K_M | MoE A4B, Q4_K_M | ~41.8 | ~75.2 | 1.80× |
| ds4 DeepSeek V4 Flash q2 | MoE 284B, q2 GGUF | ~8.9 | ~27.0 | 3.03× |
| oMLX Qwen3.6-27B-4bit | Dense 27B, 4bit | ~11.1 | ~22.3 | 2.01× |

Ollama rows use Ollama internal eval tok/s; oMLX and ds4 rows use wall-clock completion tok/s.

Full results

| Model | Case | Low Power tok/s | High Power tok/s | Multiplier |
|---|---|---|---|---|
| oMLX Qwen3.6-35B-A3B-4bit | short_50 | 44.0 | 75.9 | 1.72× |
| oMLX Qwen3.6-35B-A3B-4bit | short_200 | 55.7 | 80.5 | 1.44× |
| oMLX Qwen3.6-35B-A3B-4bit | medium_100 | 51.7 | 75.5 | 1.46× |
| oMLX Qwen3.6-35B-A3B-4bit | medium_200 | 53.7 | 78.1 | 1.45× |
| Ollama Qwen3.5-35B-A3B NVFP4 | short_50 | 28.5 | 109.6 | 3.84× |
| Ollama Qwen3.5-35B-A3B NVFP4 | short_200 | 50.9 | 107.3 | 2.11× |
| Ollama Qwen3.5-35B-A3B NVFP4 | medium_100 | 45.3 | 108.3 | 2.39× |
| Ollama Qwen3.5-35B-A3B NVFP4 | medium_200 | 51.9 | 107.1 | 2.06× |
| Ollama Gemma4 26B q4 | medium_200 | 41.8 | 75.2 | 1.80× |
| oMLX Qwen3.6-27B-4bit | short_50 | 10.9 | 22.2 | 2.03× |
| oMLX Qwen3.6-27B-4bit | short_200 | 11.6 | 23.8 | 2.05× |
| oMLX Qwen3.6-27B-4bit | medium_100 | 10.7 | 21.5 | 2.01× |
| oMLX Qwen3.6-27B-4bit | medium_200 | 11.1 | 22.3 | 2.01× |
| ds4 DeepSeek V4 Flash q2 | short_50 | ~8.0 | 26.0 | 3.25× |
| ds4 DeepSeek V4 Flash q2 | short_200 | ~8.9 | 30.6 | 3.44× |
| ds4 DeepSeek V4 Flash q2 | medium_100 | ~8.0 | 24.1 | 3.01× |
| ds4 DeepSeek V4 Flash q2 | medium_200 | ~8.9 | 27.0 | 3.03× |

oMLX batching: Low Power vs High Power

oMLX continuous batching raises aggregate throughput in both power modes, though the scaling profile differs: near-linear in Low Power, sublinear in High Power.

| Concurrent requests | Low Power aggregate tok/s | High Power aggregate tok/s | Low Power multiplier | High Power multiplier |
|---|---|---|---|---|
| 1 | ~57.0 | 75.8 | 1.00× | 1.00× |
| 2 | ~111.8 | 130.1 | 1.96× | 1.72× |
| 4 | ~227.5 | 182.9 | 3.99× | 2.41× |

In Low Power mode, oMLX achieves near-perfect linear scaling (3.99× at 4 requests). In High Power mode, the GPU is more saturated per-request, so scaling is sublinear but throughput per request is much higher. Both modes favor oMLX for concurrent agent workloads.
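The aggregate numbers come from dividing total completion tokens by the wall time of the whole concurrent batch. A minimal sketch, where `send_request` stands in for any callable that issues one request and returns its completion token count (the names here are illustrative, not from the benchmark harness):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def aggregate_tok_per_s(send_request: Callable[[], int], n_concurrent: int) -> float:
    """Fire n_concurrent identical requests at once and report aggregate
    throughput: total completion tokens / wall time of the whole batch."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        token_counts = list(pool.map(lambda _: send_request(), range(n_concurrent)))
    elapsed = time.monotonic() - start
    return sum(token_counts) / elapsed
```

With a continuous-batching server like oMLX, the batch of requests decodes together, so the shared wall window is the right denominator; timing each request independently would overstate per-request speed.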

Other models tested

These models were tested in Low Power mode only; the High Power multipliers observed above (roughly 1.4–3.8×) give a reasonable estimate of their High Power speeds.

| Rank | Runtime/model | Architecture/quant | Low Power speed | Verdict |
|---|---|---|---|---|
| 6 | Ollama Huihui Qwen3.6 27B NVFP4 | Dense 27B, NVFP4 | ~10.7-11.2 tok/s | Slower than oMLX 27B-4bit. |
| 7 | MTPLX Qwen3.6 27B MTP | Dense 27B, 4bit + MTP | ~9.2 tok/s | Short-gen benefit only. |
| 9 | Rapid-MLX Qwen3.6 27B 4bit | Dense 27B, 4bit | ~6.7 tok/s | Slower than oMLX 27B-4bit. |
| 10 | Ollama gemma4:31b-nvfp4 | Gemma 4 31B, NVFP4 | ~6.9 tok/s | Unexpectedly slow for MoE A4B. |

Ollama gemma4:26b-nvfp4 was tested alongside the q4 tag and produced essentially identical decode speeds (~41.4 tok/s Low Power), so q4 is kept as the primary Gemma 4 tag.

Gemma 4

Gemma 4 was tested after Ollama's Apple Silicon MLX speed update. The headline 2× speedup was for Qwen3.5-35B-A3B NVFP4, not Gemma 4 specifically.

| Model | Size | Prompt eval (tok/s) | LP decode (tok/s) | HP decode (tok/s) | Verdict |
|---|---|---|---|---|---|
| gemma4:31b-nvfp4 | 20GB | ~97.6 | ~6.88 | not tested | Too slow. |
| gemma4:26b-nvfp4 | 16-17GB | ~117.6 | ~41.4 | not tested | Duplicate of q4. |
| gemma4:26b-a4b-it-q4_K_M | 17GB | ~155.8 | ~41.8 | ~75.2 | Best Gemma 4 tag. |

MTPLX and Rapid-MLX

MTPLX MTP speculative decoding helped short generations but lost to plain oMLX for sustained dense 27B output (~9.2 vs ~11.6 tok/s). Rapid-MLX produced some high single-request numbers on 35B-A3B but had weak or broken concurrency; it was also slower on dense 27B.

DeepSeek V4 Flash

MLX/oMLX does not yet support deepseek_v4, and Rapid-MLX estimated a working set too large for 128GB. ds4 with the q2 GGUF is the only reliable local path:

| Case | Low Power tok/s | High Power tok/s | Multiplier |
|---|---|---|---|
| short_50 | ~8.0 | 26.0 | 3.25× |
| medium_200 | ~8.9 | 27.0 | 3.03× |

Takeaways

  1. Always verify macOS power mode before benchmarking. Low Power mode can understate throughput by 1.4–3.8× depending on model and runtime. Dense and heavy models are most affected.
  2. MoE sparse activation dominates speed. Qwen 35B-A3B is much faster than dense 27B on this machine, in both power modes.
  3. oMLX is the best runtime for local agents because continuous batching scales well in both power modes.
  4. Ollama is a strong compatibility fallback, with the fastest single-request eval speed for MoE models in High Power (~107 tok/s for 35B-A3B).
  5. Gemma 4 26B is good; Gemma 4 31B NVFP4 was not.
  6. DeepSeek V4 Flash is viable locally with ds4 in High Power (~27 tok/s), but awkward at ~9 tok/s in Low Power.

For this machine, the recommendation is oMLX + Qwen3.6-35B-A3B-4bit as the main model, Ollama for compatibility, and ds4 only when local DeepSeek V4 Flash is specifically needed.