Qwen3.6 27B dense is known as one of the strongest small coding models. On an M4 Max with 128GB unified memory, the question is whether any local runtime path can make it fast enough for practical agent use, or whether the MoE 35B-A3B (with only ~3.5B active params) is always the better pick for speed.
This benchmark tests every practical local Apple-Silicon path for Qwen3.6 27B dense: oMLX, OptiQ, Ollama NVFP4/MXFP4/Q4_K_M, KV quantization, no-thinking mode, and long-context behavior. Both Low Power and High Power macOS modes were tested.
Methodology
Five prompt/completion combinations via OpenAI-compatible API:
| Case | Prompt | Max tokens | Prompt tokens |
|---|---|---|---|
| short_50 | Short (one-sentence question) | 50 | ~21 |
| short_200 | Short (one-sentence question) | 200 | ~21 |
| medium_200 | Medium (paragraph-length technical prompt) | 200 | ~85 |
| code_200 | Code generation prompt | 200 | ~35 |
| long_11k_200 | 11K-token context (agent session simulation) | 200 | ~11,100 |
Decode speed is reported as wall-clock tokens per second (completion tokens ÷ wall time) for every runtime; where the runtime exposes its own eval speed (Ollama), that internal number is reported as well. Each case was run once per config per power mode.
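For concreteness, here is a minimal sketch of how a single case can be timed against an OpenAI-compatible endpoint. The base URL, API key, model tag, and prompt text are placeholders rather than the exact harness; wall tok/s is simply completion tokens divided by elapsed wall time.

```python
# Minimal sketch of one timed case against an OpenAI-compatible endpoint.
# Base URL, API key, model tag, and prompt are placeholders, not the exact
# harness: swap them for the runtime under test (Ollama, oMLX, or OptiQ).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

def run_case(model: str, prompt: str, max_tokens: int) -> float:
    """Wall-clock tok/s: completion tokens divided by elapsed wall time."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,
    )
    elapsed = time.perf_counter() - start
    # Some local servers omit usage; fall back to a rough whitespace count.
    if resp.usage is not None:
        tokens = resp.usage.completion_tokens
    else:
        tokens = len(resp.choices[0].message.content.split())
    return tokens / elapsed

# Example: the short_200 case against a placeholder model tag.
print(run_case("qwen3.6:27b-coding-nvfp4", "Explain tail-call optimization.", 200))
```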
Low Power results
All Low Power numbers use AC powermode 1. These represent a conservative baseline — High Power results are dramatically faster (1.4–3.8× depending on model/runtime, see the main benchmark post).
| Config | short_50 | short_200 | medium_200 | code_200 |
|---|---|---|---|---|
| oMLX 27B-4bit (wall) | 9.11 | 8.31 | 7.12 | 7.02 |
| oMLX 27B-4bit no-thinking | 8.95 | 8.41 | 5.08 | 6.45 |
| OptiQ serve fp16 KV (wall) | 8.21 | 8.94 | 8.30 | 8.05 |
| OptiQ serve KV8 | 8.50 | 8.14 | 8.08 | 8.16 |
| OptiQ serve KV4 | 7.21 | 7.00 | 7.42 | 7.42 |
| Ollama 27b-coding-nvfp4 (wall) | 8.07 | 8.78 | 9.26 | 9.28 |
| Ollama 27b-coding-nvfp4 (eval) | 9.45 | 9.23 | 9.67 | 9.73 |
| Ollama 27b-coding-mxfp8 (wall) | 5.16 | 5.53 | 5.68 | 5.66 |
| Ollama 27b-q4_K_M (wall) | 6.54 | 6.80 | 6.41 | 6.54 |
No path crosses ~10 tok/s wall-clock in Low Power. Ollama coding NVFP4 is consistently the best of the 27B routes at ~9.2–9.7 eval tok/s, but that is still far behind the 35B-A3B MoE at ~50–57 tok/s.
High Power results
After identifying Low Power mode as the bottleneck in the initial runs, a targeted re-test was run with AC powermode 2.
| Config | Case | Wall tok/s | Eval tok/s | Notes |
|---|---|---|---|---|
| Ollama 27b-coding-nvfp4 | short_200 | 23.06 | 24.22 | |
| Ollama 27b-coding-nvfp4 | code_200 | 23.72 | 25.28 | |
| Ollama 27b-coding-nvfp4 | long_11k_200 | 2.81 | 16.25 | wall hit by cold long prefill |
| Ollama 27b-nvfp4 | short_200 | 17.66 | 18.49 | non-coding variant |
| Ollama 27b-nvfp4 | code_200 | 18.67 | 19.77 | non-coding variant |
| oMLX 27B-4bit | short_200 | 14.70 | — | |
| oMLX 27B-4bit | code_200 | 13.31 | — | |
| oMLX 27B-4bit | long_11k_200 | 9.56 | — | |
| OptiQ serve fp16 KV | short_200 | 22.57 | — | text-only mixed precision |
| OptiQ serve fp16 KV | code_200 | 22.22 | — | text-only mixed precision |
| OptiQ serve fp16 KV | long_11k_200 | 2.90 | — | blank preview / poor long-context |
Long-context behavior
All 27B paths struggle with an 11K-token prompt. Wall-clock speed drops to 2–10 tok/s because prompt prefill dominates the elapsed time. Ollama reports higher internal eval speed (~16 tok/s for coding NVFP4), but the user-visible wall speed on a cold long prompt is still ~2.8 tok/s. The gap is the prefill cost: at 16.25 eval tok/s, 200 completion tokens decode in roughly 12 s, while 2.81 wall tok/s implies about 71 s end to end, leaving roughly 59 s spent prefilling the ~11,100-token prompt (on the order of 190 prompt tok/s).
| Config | 11K prompt wall tok/s | 11K prompt eval tok/s |
|---|---|---|
| Ollama 27b-coding-nvfp4 | ~2.81 | ~16.25 |
| oMLX 27B-4bit | ~9.56 | — |
| OptiQ serve fp16 KV | ~2.90 | — |
For long-context agent sessions, measure both TTFT/prefill and decode separately. Internal decode improved in High Power, but end-to-end wall speed on cold long prompts remains much lower.
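One way to do that separation is with a streaming request, timing the first content chunk (roughly TTFT, i.e. end of prefill) and the remaining chunks (decode). A sketch under the same placeholder-endpoint assumptions as above; counting stream chunks only approximates completion tokens.

```python
# Sketch: measure TTFT (prefill) and decode speed separately via streaming.
# Endpoint and model tag are placeholders; counting stream chunks only
# approximates completion tokens.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

def ttft_and_decode(model: str, prompt: str, max_tokens: int):
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # end of prefill
            chunks += 1
    end = time.perf_counter()
    ttft = first_token_at - start                   # dominated by prompt prefill
    decode_tps = chunks / (end - first_token_at)    # decode-only speed
    return ttft, decode_tps
```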
KV quantization and no-thinking mode
KV quantization did not help OptiQ: `--kv-bits 8` produced no real decode gain, and `--kv-bits 4` was slower. No-thinking mode (asking the model to skip chain-of-thought) was also neutral or slower in practice, since the model sometimes emits reasoning anyway.
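For context, the sketch below shows the two patterns commonly used to request no-thinking output from Qwen3-family models over an OpenAI-compatible API (the in-prompt `/no_think` switch and the `enable_thinking` chat-template flag). Whether oMLX, OptiQ, or Ollama honor either mechanism is an assumption to verify per server; the endpoint URL and model tag are placeholders.

```python
# Two common ways to request no-thinking output from Qwen3-family models.
# Whether the local runtime (oMLX / OptiQ / Ollama) honors either mechanism
# is an assumption; endpoint URL and model tag are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

prompt = "Write a binary search in Python."

# 1. In-prompt soft switch: append /no_think to the user message.
resp = client.chat.completions.create(
    model="qwen3.6-27b-4bit",
    messages=[{"role": "user", "content": prompt + " /no_think"}],
    max_tokens=200,
)

# 2. Chat-template flag, if the server forwards chat_template_kwargs
#    (vLLM-style; not guaranteed for every local runtime).
resp = client.chat.completions.create(
    model="qwen3.6-27b-4bit",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=200,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```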
Blocked acceleration routes
- DFlash/speculative decode: `z-lab/Qwen3.6-27B-DFlash` is gated on Hugging Face and could not be tested without access.
- Generic MTP/draft: `Qwen3-0.6B-MLX-4bit` as the draft model failed with a tokenizer mismatch ("Draft model tokenizer does not match model tokenizer").
- mlx-lm server KV bits: `mlx_lm.server` does not accept `--kv-bits`; KV quantization can only be tested through `mlx_lm.generate` or `optiq serve`.
oMLX concurrency (High Power)
oMLX 27B-4bit with continuous batching in High Power:
| Concurrent requests | Aggregate tok/s | Per-request tok/s |
|---|---|---|
| 1 | 17.28 | 17.31 |
| 2 | 21.87 | 10.94 / 10.97 |
| 4 | 30.53 | 7.63–7.69 |
Concurrency helps aggregate throughput but per-request speed drops. The 35B-A3B MoE scales much better (see main benchmark).
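For reference, aggregate versus per-request throughput can be measured by firing N simultaneous requests. The sketch below assumes an OpenAI-compatible endpoint for the oMLX server, with the URL, model tag, and prompt as placeholders.

```python
# Sketch: fire N simultaneous requests and report aggregate vs. per-request
# tok/s. Endpoint URL, model tag, and prompt are placeholders.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="local")

async def one_request(model: str, prompt: str, max_tokens: int):
    start = time.perf_counter()
    resp = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.usage.completion_tokens, time.perf_counter() - start

async def main(n: int = 4):
    t0 = time.perf_counter()
    results = await asyncio.gather(
        *[one_request("qwen3.6-27b-4bit", "Summarize Python's GIL.", 200) for _ in range(n)]
    )
    wall = time.perf_counter() - t0
    print("aggregate tok/s:", sum(tok for tok, _ in results) / wall)
    for tok, elapsed in results:
        print("per-request tok/s:", tok / elapsed)

asyncio.run(main())
```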
Recommendation
| Use case | Best config | High Power speed |
|---|---|---|
| Short/medium text-only serving | OptiQ serve fp16 KV | ~22 tok/s wall |
| Ollama-compatible agents | qwen3.6:27b-coding-nvfp4 | ~18–25 eval tok/s (first run peak) |
| Pi/oMLX integration + concurrency | oMLX Qwen3.6-27B-4bit | ~13–15 wall tok/s |
| Long-context agent sessions | No clear winner | 2–10 wall tok/s depending on config |
Context
For the broader benchmark including MoE models (Qwen3.6-35B-A3B, Gemma 4, DeepSeek V4 Flash) and concurrency behavior, see Local LLM inference on an M4 Max 128GB.