Updated 2026-04-04
Real benchmarks.
Real hardware.
No cloud required.
We test local LLMs on a 512GB Mac Studio so you don't have to guess. Practical benchmarks for coding, agents, creative work, and more.
🖥️ Mac Studio M2 Ultra
512GB Unified • M2 Ultra (24-core CPU, 76-core GPU)
macOS 15.4 • All tests use llama.cpp or MLX
5 Models Tested
512GB Unified Memory
5 Benchmark Categories
Weekly New Results
Top Performers
#1 MiniMax M2.5
MiniMax • 456B MoE
28.4 tok/s
Best all-rounder for agentic tasks. Fast despite size. Our default for Polly and Scout.
agents • general • coding
Code 8.5 • Reason 8.8 • Creative 8.2 • Agents 9 • Context 8.7
#2 Qwen3 Coder Next
Alibaba • 32B
42.1 tok/s
Excellent for code generation. Fast inference, strong on structured outputs.
coding • refactoring • debugging
Code 9.2 • Reason 8.4 • Creative 7.5 • Agents 8.6 • Context 8.3
#3 Gemma 4 31B
Google • 31B
52.3 tok/s
Apache 2.0 licensed. 256K context. Native function calling. Strong all-rounder.
long-context • agents • general
Code 8.7 • Reason 8.9 • Creative 8.4 • Agents 8.8 • Context 9.2
#4 Qwen2.5 Coder 14B
Alibaba • 14B
68.5 tok/s
Fast and efficient for coding tasks. Fits on 32GB Macs easily.
coding • autocomplete • small-context
Code 8.9 • Reason 7.8 • Creative 7.2 • Agents 8 • Context 7.9
How We Test
⚡ Speed
Tokens per second and time to first token, measured across 100 generations.
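As a rough illustration of how the two speed numbers could be collected, here is a minimal Python sketch. It is not the harness used for these results. It assumes the runtime exposes each generation as an iterator that yields one token at a time, and it defines throughput as total tokens divided by wall time from request to last token (other definitions exclude the prompt-processing phase). The names measure_generation and summarize are hypothetical.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_generation(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time one streamed generation.

    Assumption: `stream` yields one token per iteration and starts
    generating when iteration begins.
    """
    start = clock()
    first = None
    last = start
    count = 0
    for _ in stream:
        last = clock()
        if first is None:
            first = last  # timestamp of the first token
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    elapsed = last - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return {
        "ttft_s": first - start,         # time to first token, seconds
        "tok_per_s": count / elapsed,    # tokens over total wall time
    }


def summarize(runs: list) -> dict:
    """Aggregate per-run measurements, e.g. across 100 generations."""
    ttft = [r["ttft_s"] for r in runs]
    tps = [r["tok_per_s"] for r in runs]
    return {
        "runs": len(runs),
        "ttft_s_mean": statistics.mean(ttft),
        "ttft_s_median": statistics.median(ttft),
        "tok_per_s_mean": statistics.mean(tps),
        "tok_per_s_median": statistics.median(tps),
    }
```

In practice the stream would be whatever token iterator the llama.cpp or MLX bindings provide. Reporting the median alongside the mean helps when a few runs are slowed by background activity.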
🎯 Task Accuracy
Real-world tasks: code completion, multi-step reasoning, creative writing, tool calling.
🧠 Memory Fit
Actual unified memory usage at different quantization levels, so you know what fits your Mac.
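For a first estimate before measuring, the weights-only footprint can be computed from parameter count and bits per weight. The Python sketch below is a back-of-the-envelope approximation, not the measurement method used here. It ignores the KV cache, activations, and runtime overhead, and the 25% headroom reserved for the OS and context is an arbitrary assumption. The function names are hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9.
    """
    return params_billion * bits_per_weight / 8


def fits(
    params_billion: float,
    bits_per_weight: float,
    ram_gb: float,
    headroom_fraction: float = 0.25,  # assumed reserve for OS, KV cache, etc.
) -> bool:
    """Rough check: do the weights fit in RAM after reserving headroom?"""
    budget_gb = ram_gb * (1 - headroom_fraction)
    return weight_memory_gb(params_billion, bits_per_weight) <= budget_gb


if __name__ == "__main__":
    # A 14B model at a nominal 4 bits per weight on a 32GB Mac.
    print(weight_memory_gb(14, 4), fits(14, 4, 32))
    # A 456B model at 8 and 4 bits per weight on a 512GB machine.
    print(weight_memory_gb(456, 8), fits(456, 8, 512))
    print(weight_memory_gb(456, 4), fits(456, 4, 512))
```

Real quantization formats store scaling data alongside the weights, so the effective bits per weight is usually somewhat above the nominal figure. Measured usage remains the number to trust.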