Running Qwen3.6 27B dense locally on M4 Max

2026-05-11 · Apple M4 Max · 128GB · macOS 26.3.1

tl;dr: In Low Power mode, no local path reaches 15 tok/s sustained decode for Qwen3.6 27B dense. In High Power mode, OptiQ serve and Ollama coding NVFP4 both hit roughly 18–25 tok/s on short/medium turns, but long-context work remains slow. DFlash/speculative decoding is the next frontier, but the model is gated on Hugging Face.

Qwen3.6 27B dense is known as one of the strongest small coding models. On an M4 Max with 128GB unified memory, the question is whether any local runtime path can make it fast enough for practical agent use, or whether the MoE 35B-A3B (with only ~3.5B active params) is always the better pick for speed.

This benchmark tests every practical local Apple-Silicon path for Qwen3.6 27B dense (oMLX, OptiQ, and Ollama NVFP4/MXFP8/Q4_K_M), plus KV quantization, no-thinking mode, and long-context behavior. Both Low Power and High Power macOS modes were tested.

Methodology

Five prompt/completion combinations, all driven via an OpenAI-compatible API:

| Case | Prompt | Max tokens | Prompt tokens |
|---|---|---|---|
| short_50 | Short (one-sentence question) | 50 | ~21 |
| short_200 | Short (one-sentence question) | 200 | ~21 |
| medium_200 | Medium (paragraph-length technical prompt) | 200 | ~85 |
| code_200 | Code generation prompt | 200 | ~35 |
| long_11k_200 | 11K-token context (agent session simulation) | 200 | ~11,100 |

Decode speed is reported as wall-clock tokens per second (completion tokens ÷ wall time); for oMLX and OptiQ this is the only number available. Where the runtime also exposes its own internal eval speed (Ollama), that number is reported alongside. Each case was run once per config per power mode.
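
The wall-clock number is just completion tokens over elapsed request time. Here is a minimal sketch of that measurement, assuming a local OpenAI-compatible server; the URL, port, and model name are placeholders, not the actual configs tested:

```python
# Minimal wall-clock tok/s harness against an OpenAI-compatible endpoint.
import time
import requests

BASE_URL = "http://localhost:8080/v1"   # assumption: your runtime's address
MODEL = "qwen3.6-27b-dense"             # assumption: served model name

def wall_tok_per_s(prompt: str, max_tokens: int) -> float:
    t0 = time.perf_counter()
    r = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=600,
    )
    elapsed = time.perf_counter() - t0
    # Most OpenAI-compatible servers report completion token counts here.
    completion_tokens = r.json()["usage"]["completion_tokens"]
    return completion_tokens / elapsed

print(f"{wall_tok_per_s('Explain KV caching in one paragraph.', 200):.2f} tok/s")
```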

Low Power results

All Low Power numbers use AC powermode 1. These represent a conservative baseline — High Power results are dramatically faster (1.4–3.8× depending on model/runtime, see the main benchmark post).
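
The "AC powermode" values map onto macOS pmset settings. A small sketch of reading and switching them, assuming pmset's powermode key (1 = Low Power, 2 = High Power on supported Apple Silicon machines; setting it requires sudo):

```python
# Check and set the macOS power mode used for these runs.
import subprocess

def current_power_settings() -> str:
    # `pmset -g custom` prints per-power-source settings, incl. powermode.
    return subprocess.run(
        ["pmset", "-g", "custom"], capture_output=True, text=True, check=True
    ).stdout

def set_ac_powermode(mode: int) -> None:
    # -c applies to AC (charger) power only, matching "AC powermode" above.
    subprocess.run(["sudo", "pmset", "-c", "powermode", str(mode)], check=True)

print(current_power_settings())
# set_ac_powermode(2)  # switch to High Power for the re-test
```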

All values are decode tok/s.

| Config | short_50 | short_200 | medium_200 | code_200 |
|---|---|---|---|---|
| oMLX 27B-4bit (wall) | 9.11 | 8.31 | 7.12 | 7.02 |
| oMLX 27B-4bit no-thinking | 8.95 | 8.41 | 5.08 | 6.45 |
| OptiQ serve fp16 KV (wall) | 8.21 | 8.94 | 8.30 | 8.05 |
| OptiQ serve KV8 | 8.50 | 8.14 | 8.08 | 8.16 |
| OptiQ serve KV4 | 7.21 | 7.00 | 7.42 | 7.42 |
| Ollama 27b-coding-nvfp4 (wall) | 8.07 | 8.78 | 9.26 | 9.28 |
| Ollama 27b-coding-nvfp4 (eval) | 9.45 | 9.23 | 9.67 | 9.73 |
| Ollama 27b-coding-mxfp8 (wall) | 5.16 | 5.53 | 5.68 | 5.66 |
| Ollama 27b-q4_K_M (wall) | 6.54 | 6.80 | 6.41 | 6.54 |

No path crosses ~10 tok/s wall-clock in Low Power. Ollama coding NVFP4 is consistently the best of the 27B routes at ~9.2–9.7 eval tok/s, but that is still far behind the 35B-A3B MoE at ~50–57 tok/s.
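
The wall vs. eval gap in the Ollama rows comes straight from its response metadata. A sketch of reading both from /api/generate (durations are reported in nanoseconds; the model tag matches the table but depends on what `ollama list` shows locally):

```python
# Read Ollama's internal eval speed alongside observed wall speed.
import time
import requests

def ollama_speeds(model: str, prompt: str, max_tokens: int):
    t0 = time.perf_counter()
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_predict": max_tokens},
        },
        timeout=600,
    ).json()
    wall = r["eval_count"] / (time.perf_counter() - t0)
    eval_speed = r["eval_count"] / (r["eval_duration"] / 1e9)
    return wall, eval_speed

wall, ev = ollama_speeds("qwen3.6:27b-coding-nvfp4", "Write a haiku.", 200)
print(f"wall {wall:.2f} tok/s, eval {ev:.2f} tok/s")
```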

High Power results

After discovering that the earlier runs had executed under Low Power, a targeted re-test was run with AC powermode 2 (High Power).

| Config | Case | Wall tok/s | Eval tok/s | Notes |
|---|---|---|---|---|
| Ollama 27b-coding-nvfp4 | short_200 | 23.06 | 24.22 | |
| Ollama 27b-coding-nvfp4 | code_200 | 23.72 | 25.28 | |
| Ollama 27b-coding-nvfp4 | long_11k_200 | 2.81 | 16.25 | wall hit by cold long prefill |
| Ollama 27b-nvfp4 | short_200 | 17.66 | 18.49 | non-coding variant |
| Ollama 27b-nvfp4 | code_200 | 18.67 | 19.77 | non-coding variant |
| oMLX 27B-4bit | short_200 | 14.70 | n/a | |
| oMLX 27B-4bit | code_200 | 13.31 | n/a | |
| oMLX 27B-4bit | long_11k_200 | 9.56 | n/a | |
| OptiQ serve fp16 KV | short_200 | 22.57 | n/a | text-only mixed precision |
| OptiQ serve fp16 KV | code_200 | 22.22 | n/a | text-only mixed precision |
| OptiQ serve fp16 KV | long_11k_200 | 2.90 | n/a | blank preview / poor long-context |

Long-context behavior

All 27B paths struggle with an 11K-token prompt: wall-clock speed drops to 2–10 tok/s because prompt prefill dominates the elapsed time. Ollama reports a higher internal eval speed (~16 tok/s for coding NVFP4), but on a cold long prompt the user-visible wall speed is still ~2.8 tok/s.

| Config | 11K prompt wall tok/s | 11K prompt eval tok/s |
|---|---|---|
| Ollama 27b-coding-nvfp4 | ~2.81 | ~16.25 |
| oMLX 27B-4bit | ~9.56 | n/a |
| OptiQ serve fp16 KV | ~2.90 | n/a |

For long-context agent sessions, measure both TTFT/prefill and decode separately. Internal decode improved in High Power, but end-to-end wall speed on cold long prompts remains much lower.
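
A sketch of that split measurement over a streaming OpenAI-compatible request, treating time-to-first-token as the prefill cost. URL and model name are placeholders as before, and one streamed content chunk is approximated as one token:

```python
# Separate TTFT (prefill-dominated) from decode speed on a long prompt.
import json
import time
import requests

def ttft_and_decode(prompt: str, max_tokens: int):
    t0 = time.perf_counter()
    ttft, chunks = None, 0
    with requests.post(
        "http://localhost:8080/v1/chat/completions",  # placeholder URL
        json={
            "model": "qwen3.6-27b-dense",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "stream": True,
        },
        stream=True,
        timeout=1200,
    ) as r:
        for line in r.iter_lines():
            if not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            choices = json.loads(line[6:]).get("choices") or []
            if choices and choices[0].get("delta", {}).get("content"):
                chunks += 1
                if ttft is None:
                    ttft = time.perf_counter() - t0  # prefill + first token
    total = time.perf_counter() - t0
    # Rough: treat each streamed content chunk as one token.
    return ttft, (chunks - 1) / (total - ttft)

ttft, decode = ttft_and_decode("<an ~11K-token agent transcript>", 200)
print(f"TTFT {ttft:.1f}s, decode ~{decode:.1f} tok/s")
```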

KV quantization and no-thinking mode

KV quantization did not help OptiQ: --kv-bits 8 produced no real decode gain, and --kv-bits 4 was slower. No-thinking mode (requesting the model to skip chain-of-thought) was likewise neutral or slower in practice, since the model sometimes adds reasoning anyway.
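
For reference, here is a sketch of one way to phrase that request, assuming a prompt-level instruction as described above rather than a runtime-level thinking switch; the wording is illustrative, not the verbatim prompt from these runs:

```python
# Illustrative no-thinking request; the model may still reason anyway.
no_thinking_messages = [
    {"role": "system",
     "content": "Answer directly. Do not emit chain-of-thought or <think> "
                "blocks before the final answer."},
    {"role": "user", "content": "Write a binary search over a sorted list."},
]
# Pass through the same harness as wall_tok_per_s above.
```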

Blocked acceleration routes

The most promising route, DFlash/speculative decoding, stays blocked: the model is gated on Hugging Face, so it remains untested here (see tl;dr). The one acceleration lever that could actually be exercised was oMLX's continuous batching.

oMLX concurrency (High Power)

oMLX 27B-4bit with continuous batching in High Power:

| Concurrent requests | Aggregate tok/s | Per-request tok/s |
|---|---|---|
| 1 | 17.28 | 17.31 |
| 2 | 21.87 | 10.94 / 10.97 |
| 4 | 30.53 | 7.63–7.69 |

Concurrency helps aggregate throughput but per-request speed drops. The 35B-A3B MoE scales much better (see main benchmark).
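
The concurrency numbers can be reproduced against any OpenAI-compatible endpoint along these lines: aggregate tok/s is total completion tokens over the batch's wall time, and per-request tok/s uses each request's own wall time. A sketch assuming httpx, with placeholder URL and model:

```python
# Fire N identical requests concurrently; report aggregate and per-request tok/s.
import asyncio
import time
import httpx

URL = "http://localhost:8080/v1/chat/completions"  # placeholder
BODY = {
    "model": "qwen3.6-27b-dense",  # placeholder
    "messages": [{"role": "user", "content": "Summarize KV caching."}],
    "max_tokens": 200,
}

async def one(client: httpx.AsyncClient):
    t0 = time.perf_counter()
    r = await client.post(URL, json=BODY, timeout=600)
    dt = time.perf_counter() - t0
    toks = r.json()["usage"]["completion_tokens"]
    return toks, toks / dt

async def bench(n: int):
    t0 = time.perf_counter()
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(one(client) for _ in range(n)))
    batch = time.perf_counter() - t0
    total = sum(t for t, _ in results)
    per_req = ", ".join(f"{s:.2f}" for _, s in results)
    print(f"n={n}: aggregate {total / batch:.2f} tok/s, per-request {per_req}")

for n in (1, 2, 4):
    asyncio.run(bench(n))
```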

Recommendation

| Use case | Best config | High Power speed |
|---|---|---|
| Short/medium text-only serving | OptiQ serve fp16 KV | ~22 tok/s wall |
| Ollama-compatible agents | qwen3.6:27b-coding-nvfp4 | ~18–25 eval tok/s (first run peak) |
| Pi/oMLX integration + concurrency | oMLX Qwen3.6-27B-4bit | ~13–15 wall tok/s |
| Long-context agent sessions | No clear winner | 2–10 wall tok/s depending on config |

The original "no path reaches 15–20 tok/s" conclusion was true for Low Power mode. In High Power, OptiQ serve and Ollama coding NVFP4 both reach the target for short/medium turns. The next real improvement for 27B dense would be authenticated DFlash/speculative decoding, which remains untested due to the gated model.

Context

For the broader benchmark including MoE models (Qwen3.6-35B-A3B, Gemma 4, DeepSeek V4 Flash) and concurrency behavior, see Local LLM inference on an M4 Max 128GB.