Updated 2026-07-18

Real benchmarks.
Real hardware.
No cloud required.

We test local LLMs on a 512GB Mac Studio so you don't have to guess. Practical benchmarks for coding, agents, creative work, and more.


🖥️

Mac Studio M2 Ultra

512GB Unified • M2 Ultra (24-core CPU, 76-core GPU)

macOS 15.4 • All tests use llama.cpp or MLX

Models Tested

512GB Unified Memory

Benchmark Categories

Weekly New Results

Top Performers

MiniMax M2.5

MiniMax • 456B MoE

28.4 tok/s

Best all-rounder for agentic tasks. Fast despite size. Our default for Polly and Scout.

Qwen3 Coder Next

Alibaba • 32B

42.1 tok/s

Excellent for code generation. Fast inference, strong on structured outputs.

coding, refactoring, debugging

Gemma 4 31B

Google • 31B

52.3 tok/s

Apache 2.0 licensed. 256K context. Native function calling. Strong all-rounder.

long-context, agents, general

Qwen2.5 Coder 14B

Alibaba • 14B

68.5 tok/s

Fast and efficient for coding tasks. Fits on 32GB Macs easily.

coding, autocomplete, small-context

How We Test

⚡

Speed

Tokens per second and time to first token, measured across 100 generations.
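For readers who want to reproduce these two numbers at home, here is a minimal sketch of how they could be collected from any streaming backend. It is an illustration, not our exact harness: `measure_generation` and `summarize` are hypothetical helper names, the stream is assumed to yield one token per item, and reporting the median across runs is an assumption (a mean would work the same way).

```python
import statistics
import time
from typing import Callable, Iterable


def measure_generation(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time one streamed generation (hypothetical helper).

    `stream` is assumed to yield one token per item, e.g. the streaming
    output of a llama.cpp or MLX wrapper.
    """
    start = clock()
    first_token_at = None
    last_token_at = start
    n_tokens = 0
    for _ in stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now
        last_token_at = now
        n_tokens += 1

    if n_tokens < 2:
        raise ValueError("need at least two tokens to compute throughput")

    # Time to first token covers prompt processing plus the first decode step.
    ttft = first_token_at - start
    # Decode throughput counts only the tokens after the first one,
    # so prompt processing does not drag the tok/s figure down.
    decode_time = last_token_at - first_token_at
    if decode_time <= 0:
        raise ValueError("clock did not advance between tokens")

    return {
        "ttft_s": ttft,
        "tok_per_s": (n_tokens - 1) / decode_time,
        "tokens": n_tokens,
    }


def summarize(runs: list) -> dict:
    """Aggregate many runs; the median is an assumed choice of statistic."""
    return {
        "runs": len(runs),
        "median_ttft_s": statistics.median(r["ttft_s"] for r in runs),
        "median_tok_per_s": statistics.median(r["tok_per_s"] for r in runs),
    }
```

Calling `measure_generation` once per generation and passing the collected results to `summarize` gives one tok/s and one time-to-first-token figure per model. The `clock` parameter exists only so the timing logic can be checked with a fake clock.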

🎯

Task Accuracy

Real-world tasks: code completion, multi-step reasoning, creative writing, tool calling.

🧠

Memory Fit

Actual unified memory usage at different quantizations so you know what fits your Mac.
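Before looking at measured numbers, a back-of-the-envelope estimate already tells you roughly what can fit. The sketch below computes weight size as parameters × bits per weight ÷ 8. It is a lower bound only: real usage is higher because of the KV cache, runtime buffers, and quantization formats that store slightly more than their nominal bits per weight. The function names and the 25% `reserve_fraction` are hypothetical, chosen for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(
    params_billions: float,
    bits_per_weight: float,
    total_memory_gb: float,
    reserve_fraction: float = 0.25,  # assumed headroom for OS, KV cache, buffers
) -> bool:
    """Rough check: do the weights fit in memory after reserving headroom?"""
    budget_gb = total_memory_gb * (1 - reserve_fraction)
    return weight_memory_gb(params_billions, bits_per_weight) <= budget_gb
```

For example, a 14B model at 4 bits per weight needs about 7 GB for weights, which is why it fits comfortably on a 32GB Mac, while a 456B model at 16 bits would need about 912 GB and does not fit even in 512GB without quantization.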

New models tested weekly

We benchmark new releases as they drop. Follow us for updates.

Follow on X
