Running Qwen3.6 27B dense locally on M4 Max

2026-05-11 · Apple M4 Max · 128GB · macOS 26.3.1

tl;dr: In Low Power mode, no local path reaches 15 tok/s sustained decode for Qwen3.6 27B dense. In High Power mode, OptiQ serve and Ollama coding NVFP4 both hit roughly 18–25 tok/s on short/medium turns, but long-context work remains slow. DFlash/speculative decoding is the next frontier, but the model is gated on Hugging Face.

Qwen3.6 27B dense is known as one of the strongest small coding models. On an M4 Max with 128GB unified memory, the question is whether any local runtime path can make it fast enough for practical agent use, or whether the MoE 35B-A3B (with only ~3.5B active params) is always the better pick for speed.

This benchmark tests every practical local Apple-Silicon path for Qwen3.6 27B dense (oMLX, OptiQ, and Ollama NVFP4/MXFP8/Q4_K_M), plus KV quantization, no-thinking mode, and long-context behavior. Both Low Power and High Power macOS modes were tested.

Methodology

Five prompt/completion combinations, all driven via an OpenAI-compatible API:

| Case | Prompt | Max tokens | Prompt tokens |
|---|---|---|---|
| short_50 | Short (one-sentence question) | 50 | ~21 |
| short_200 | Short (one-sentence question) | 200 | ~21 |
| medium_200 | Medium (paragraph-length technical prompt) | 200 | ~85 |
| code_200 | Code generation prompt | 200 | ~35 |
| long_11k_200 | 11K-token context (agent session simulation) | 200 | ~11,100 |

Decode speed is reported as wall-clock tokens per second (completion tokens ÷ wall time); for oMLX and OptiQ this is the only number available. Where the runtime also exposes its own internal eval speed (Ollama), that number is reported alongside. Each case was run once per config per power mode.
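
The wall-clock number is just completion tokens over elapsed request time. Here is a minimal sketch of that measurement, assuming a local OpenAI-compatible server; the URL, port, and model name are placeholders, not the actual configs tested:

```python
# Minimal wall-clock tok/s harness against an OpenAI-compatible endpoint.
import time
import requests

BASE_URL = "http://localhost:8080/v1"   # assumption: your runtime's address
MODEL = "qwen3.6-27b-dense"             # assumption: served model name

def wall_tok_per_s(prompt: str, max_tokens: int) -> float:
    t0 = time.perf_counter()
    r = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=600,
    )
    elapsed = time.perf_counter() - t0
    # Most OpenAI-compatible servers report completion token counts here.
    completion_tokens = r.json()["usage"]["completion_tokens"]
    return completion_tokens / elapsed

print(f"{wall_tok_per_s('Explain KV caching in one paragraph.', 200):.2f} tok/s")
```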

Low Power results

All Low Power numbers use AC powermode 1. These represent a conservative baseline — High Power results are dramatically faster (1.4–3.8× depending on model/runtime, see the main benchmark post).
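
The "AC powermode" values map onto macOS pmset settings. A small sketch of reading and switching them, assuming pmset's powermode key (1 = Low Power, 2 = High Power on supported Apple Silicon machines; setting it requires sudo):

```python
# Check and set the macOS power mode used for these runs.
import subprocess

def current_power_settings() -> str:
    # `pmset -g custom` prints per-power-source settings, incl. powermode.
    return subprocess.run(
        ["pmset", "-g", "custom"], capture_output=True, text=True, check=True
    ).stdout

def set_ac_powermode(mode: int) -> None:
    # -c applies to AC (charger) power only, matching "AC powermode" above.
    subprocess.run(["sudo", "pmset", "-c", "powermode", str(mode)], check=True)

print(current_power_settings())
# set_ac_powermode(2)  # switch to High Power for the re-test
```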

All values are decode tok/s.

| Config | short_50 | short_200 | medium_200 | code_200 |
|---|---|---|---|---|
| oMLX 27B-4bit (wall) | 9.11 | 8.31 | 7.12 | 7.02 |
| oMLX 27B-4bit no-thinking | 8.95 | 8.41 | 5.08 | 6.45 |
| OptiQ serve fp16 KV (wall) | 8.21 | 8.94 | 8.30 | 8.05 |
| OptiQ serve KV8 | 8.50 | 8.14 | 8.08 | 8.16 |
| OptiQ serve KV4 | 7.21 | 7.00 | 7.42 | 7.42 |
| Ollama 27b-coding-nvfp4 (wall) | 8.07 | 8.78 | 9.26 | 9.28 |
| Ollama 27b-coding-nvfp4 (eval) | 9.45 | 9.23 | 9.67 | 9.73 |
| Ollama 27b-coding-mxfp8 (wall) | 5.16 | 5.53 | 5.68 | 5.66 |
| Ollama 27b-q4_K_M (wall) | 6.54 | 6.80 | 6.41 | 6.54 |

No path crosses ~10 tok/s wall-clock in Low Power. Ollama coding NVFP4 is consistently the best of the 27B routes at ~9.2–9.7 eval tok/s, but that is still far behind the 35B-A3B MoE at ~50–57 tok/s.
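
The wall vs. eval gap in the Ollama rows comes straight from its response metadata. A sketch of reading both from /api/generate (durations are reported in nanoseconds; the model tag matches the table but depends on what `ollama list` shows locally):

```python
# Read Ollama's internal eval speed alongside observed wall speed.
import time
import requests

def ollama_speeds(model: str, prompt: str, max_tokens: int):
    t0 = time.perf_counter()
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_predict": max_tokens},
        },
        timeout=600,
    ).json()
    wall = r["eval_count"] / (time.perf_counter() - t0)
    eval_speed = r["eval_count"] / (r["eval_duration"] / 1e9)
    return wall, eval_speed

wall, ev = ollama_speeds("qwen3.6:27b-coding-nvfp4", "Write a haiku.", 200)
print(f"wall {wall:.2f} tok/s, eval {ev:.2f} tok/s")
```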

High Power results

After discovering that the earlier runs had executed under Low Power, a targeted re-test was run with AC powermode 2 (High Power).

| Config | Case | Wall tok/s | Eval tok/s | Notes |
|---|---|---|---|---|
| Ollama 27b-coding-nvfp4 | short_200 | 23.06 | 24.22 | |
| Ollama 27b-coding-nvfp4 | code_200 | 23.72 | 25.28 | |
| Ollama 27b-coding-nvfp4 | long_11k_200 | 2.81 | 16.25 | wall hit by cold long prefill |
| Ollama 27b-nvfp4 | short_200 | 17.66 | 18.49 | non-coding variant |
| Ollama 27b-nvfp4 | code_200 | 18.67 | 19.77 | non-coding variant |
| oMLX 27B-4bit | short_200 | 14.70 | n/a | |
| oMLX 27B-4bit | code_200 | 13.31 | n/a | |
| oMLX 27B-4bit | long_11k_200 | 9.56 | n/a | |
| OptiQ serve fp16 KV | short_200 | 22.57 | n/a | text-only mixed precision |
| OptiQ serve fp16 KV | code_200 | 22.22 | n/a | text-only mixed precision |
| OptiQ serve fp16 KV | long_11k_200 | 2.90 | n/a | blank preview / poor long-context |

Long-context behavior

All 27B paths struggle with an 11K-token prompt: wall-clock speed drops to 2–10 tok/s because prompt prefill dominates the elapsed time. Ollama reports a higher internal eval speed (~16 tok/s for coding NVFP4), but on a cold long prompt the user-visible wall speed is still ~2.8 tok/s.

| Config | 11K prompt wall tok/s | 11K prompt eval tok/s |
|---|---|---|
| Ollama 27b-coding-nvfp4 | ~2.81 | ~16.25 |
| oMLX 27B-4bit | ~9.56 | n/a |
| OptiQ serve fp16 KV | ~2.90 | n/a |

For long-context agent sessions, measure both TTFT/prefill and decode separately. Internal decode improved in High Power, but end-to-end wall speed on cold long prompts remains much lower.
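
A sketch of that split measurement over a streaming OpenAI-compatible request, treating time-to-first-token as the prefill cost. URL and model name are placeholders as before, and one streamed content chunk is approximated as one token:

```python
# Separate TTFT (prefill-dominated) from decode speed on a long prompt.
import json
import time
import requests

def ttft_and_decode(prompt: str, max_tokens: int):
    t0 = time.perf_counter()
    ttft, chunks = None, 0
    with requests.post(
        "http://localhost:8080/v1/chat/completions",  # placeholder URL
        json={
            "model": "qwen3.6-27b-dense",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "stream": True,
        },
        stream=True,
        timeout=1200,
    ) as r:
        for line in r.iter_lines():
            if not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            choices = json.loads(line[6:]).get("choices") or []
            if choices and choices[0].get("delta", {}).get("content"):
                chunks += 1
                if ttft is None:
                    ttft = time.perf_counter() - t0  # prefill + first token
    total = time.perf_counter() - t0
    # Rough: treat each streamed content chunk as one token.
    return ttft, (chunks - 1) / (total - ttft)

ttft, decode = ttft_and_decode("<an ~11K-token agent transcript>", 200)
print(f"TTFT {ttft:.1f}s, decode ~{decode:.1f} tok/s")
```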

KV quantization and no-thinking mode

KV quantization did not help OptiQ: --kv-bits 8 produced no real decode gain, and --kv-bits 4 was slower. No-thinking mode (requesting the model to skip chain-of-thought) was likewise neutral or slower in practice, since the model sometimes adds reasoning anyway.
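
For reference, here is a sketch of one way to phrase that request, assuming a prompt-level instruction as described above rather than a runtime-level thinking switch; the wording is illustrative, not the verbatim prompt from these runs:

```python
# Illustrative no-thinking request; the model may still reason anyway.
no_thinking_messages = [
    {"role": "system",
     "content": "Answer directly. Do not emit chain-of-thought or <think> "
                "blocks before the final answer."},
    {"role": "user", "content": "Write a binary search over a sorted list."},
]
# Pass through the same harness as wall_tok_per_s above.
```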

Blocked acceleration routes

The most promising route, DFlash/speculative decoding, stays blocked: the model is gated on Hugging Face, so it remains untested here (see tl;dr). The one acceleration lever that could actually be exercised was oMLX's continuous batching.

oMLX concurrency (High Power)

oMLX 27B-4bit with continuous batching in High Power:

| Concurrent requests | Aggregate tok/s | Per-request tok/s |
|---|---|---|
| 1 | 17.28 | 17.31 |
| 2 | 21.87 | 10.94 / 10.97 |
| 4 | 30.53 | 7.63–7.69 |

Concurrency helps aggregate throughput but per-request speed drops. The 35B-A3B MoE scales much better (see main benchmark).
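
The concurrency numbers can be reproduced against any OpenAI-compatible endpoint along these lines: aggregate tok/s is total completion tokens over the batch's wall time, and per-request tok/s uses each request's own wall time. A sketch assuming httpx, with placeholder URL and model:

```python
# Fire N identical requests concurrently; report aggregate and per-request tok/s.
import asyncio
import time
import httpx

URL = "http://localhost:8080/v1/chat/completions"  # placeholder
BODY = {
    "model": "qwen3.6-27b-dense",  # placeholder
    "messages": [{"role": "user", "content": "Summarize KV caching."}],
    "max_tokens": 200,
}

async def one(client: httpx.AsyncClient):
    t0 = time.perf_counter()
    r = await client.post(URL, json=BODY, timeout=600)
    dt = time.perf_counter() - t0
    toks = r.json()["usage"]["completion_tokens"]
    return toks, toks / dt

async def bench(n: int):
    t0 = time.perf_counter()
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(one(client) for _ in range(n)))
    batch = time.perf_counter() - t0
    total = sum(t for t, _ in results)
    per_req = ", ".join(f"{s:.2f}" for _, s in results)
    print(f"n={n}: aggregate {total / batch:.2f} tok/s, per-request {per_req}")

for n in (1, 2, 4):
    asyncio.run(bench(n))
```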

Recommendation

| Use case | Best config | High Power speed |
|---|---|---|
| Short/medium text-only serving | OptiQ serve fp16 KV | ~22 tok/s wall |
| Ollama-compatible agents | qwen3.6:27b-coding-nvfp4 | ~18–25 eval tok/s (first run peak) |
| Pi/oMLX integration + concurrency | oMLX Qwen3.6-27B-4bit | ~13–15 wall tok/s |
| Long-context agent sessions | No clear winner | 2–10 wall tok/s depending on config |

The original "no path reaches 15–20 tok/s" conclusion was true for Low Power mode. In High Power, OptiQ serve and Ollama coding NVFP4 both reach the target for short/medium turns. The next real improvement for 27B dense would be authenticated DFlash/speculative decoding, which remains untested due to the gated model.

Context

For the broader benchmark including MoE models (Qwen3.6-35B-A3B, Gemma 4, DeepSeek V4 Flash) and concurrency behavior, see Local LLM inference on an M4 Max 128GB.