MLX vs GGUF on Apple Silicon: Benchmarking the Same Local Model Two Ways
By Mihai Perdum
July 2, 2026 · 12 min read
You've pulled a local coding model onto your Mac and you're staring at two downloads of the same weights: a GGUF build and an MLX build. Which one do you actually run? The folklore says MLX is faster on Apple Silicon but you pay for it in quality. We got tired of guessing — so we built a harness that makes local coding agents produce real software, grades that software by running it, and then pointed the whole thing at the exact same model in both formats. This is what it found, and how it works.
Why you can't just ask the model how it did
Local coding models are very good at declaring victory. The tests pass, the summary says done, the diff looks plausible — and the actual feature is quietly broken. If you grade a model on its own report, you are measuring its confidence, not its competence. So the first design decision was the most important one: everything is graded by running it. Build an inventory store, add two items, ask for the total, and check the number against a golden value. If the program doesn't run, or prints the wrong answer, it doesn't count — no matter what the model claimed.
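Grading by running can be sketched in a few lines. This is a minimal illustration of the idea, not swarm-gym's actual grader: the function name `run_and_check` and the one-line stand-in "apps" are assumptions made for the example. The golden value is the `$20.00` total from the inventory check.

```python
import subprocess
import sys

def run_and_check(argv, golden, timeout_s=30):
    """Pass only if the command exits 0 and its stdout equals the golden value."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False  # did not start, or hung: counts as a failure
    return proc.returncode == 0 and proc.stdout.strip() == golden

# Hypothetical stand-ins for a built app, run through the current interpreter.
good_app = [sys.executable, "-c", "print(f'${4 * 5:.2f}')"]   # prints $20.00
wrong_app = [sys.executable, "-c", "print('$25.00')"]         # runs, wrong answer
crash_app = [sys.executable, "-c", "raise SystemExit(1)"]     # does not run cleanly

results = [run_and_check(app, "$20.00") for app in (good_app, wrong_app, crash_app)]
print(results)
```

Nothing the model says about its own work enters the verdict: only the exit code and the printed output do.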
The harness: swarm-gym
The harness is called swarm-gym, and it ships inside goose local-edition — our open-source fork of the goose agent, hardened for local models and extended with a multi-node build swarm. One run of swarm-gym does five things in sequence:
Plan — a planner model decomposes the spec into a small DAG of subtasks (core module, CLI entry, tests, an integrate-verify step) with explicit file ownership.
Execute — worker agents build those subtasks in parallel across the fleet, each confined to its own files, seeing the signatures of its siblings.
Judge — a semantic judge watches each worker's live activity and intervenes only on real trouble (a worker looping on a failing test, or shipping a stub).
Integrate-verify — a final sink task wires the modules together and runs a smoke gate: does the package import, does its entry point run.
Grade — deterministic verification runs the built app against golden values and records everything to a ledger.
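The plan step can be pictured as a small dependency graph with one owner per file. The sketch below is an assumption about the shape of such a plan, not swarm-gym's real data model: the task names, file paths, and helper functions are hypothetical. It uses the standard library's `graphlib` to group tasks into waves that can run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: task -> (files it owns, tasks it depends on).
PLAN = {
    "core": ({"inventory/store.py"}, set()),
    "cli": ({"inventory/cli.py"}, set()),
    "tests": ({"tests/test_store.py"}, set()),
    "integrate_verify": (set(), {"core", "cli", "tests"}),  # the final sink task
}

def check_ownership(plan):
    """Raise if two tasks claim the same file."""
    owner = {}
    for task, (files, _deps) in plan.items():
        for path in files:
            if path in owner:
                raise ValueError(f"{path} is owned by both {owner[path]} and {task}")
            owner[path] = task

def waves(plan):
    """Group tasks into waves; every task in a wave can run in parallel."""
    sorter = TopologicalSorter({task: deps for task, (_files, deps) in plan.items()})
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result

check_ownership(PLAN)
print(waves(PLAN))  # [['cli', 'core', 'tests'], ['integrate_verify']]
```

Explicit ownership is what lets the workers run concurrently without overwriting each other's files.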
The whole run is captured — a structured event log, every worker's tool calls, per-device dispatch counts — so a result is never just a number; it's traceable back to what each agent actually did.
"Swarm" isn't marketing. The build is genuinely distributed: the planner drafts a skeleton, the fleet details every subtask's spec in parallel, and workers run concurrently across three Apple Silicon nodes, each serving the model through LM Studio. A dynamic replanner fills idle workers with independent work so the machines stay busy, and a per-task idle watchdog re-routes a genuinely stalled worker to another node. The point of the swarm is to get a real, multi-module app out of small, weak local models — the kind that fall over if you ask a single context to build everything at once.
Two modes: benchmark and exploratory
swarm-gym runs in two explicit, ledger-tagged modes, and it matters which one produced a number.
Benchmark — a frozen suite of fixed specs replayed identically every time, so results from different models or runtimes pair up. Reproducible by construction. This article is entirely benchmark mode.
Exploratory — an operator-driven agent invents new prompts, vibes follow-ups, and tunes swarm knobs while an active monitor watches reasoning, output quality, and node utilisation (catching idle nodes and re-balancing). It's for finding failure modes and improvements, not for producing comparable numbers.
The archetypes
Benchmark mode replays three fixed app specs — the archetypes. They're chosen to stress the three axes where weak local models fail differently: data modelling, algorithms, and stateful logic. Each ships golden-value checks.
| Archetype | The app | What it stresses | A golden check |
| --- | --- | --- | --- |
| crud-multiformat | an inventory CLI (JSON + table output) | data modelling, serialisation, aggregation | add 4 × $5 → total is $20.00 |
| compute-parser | a recursive-descent calculator | parsing, precedence, associativity | 2^3^2 = 512 (right-associative) |
| transaction | a nested-transaction KV store | state, nested BEGIN/ROLLBACK, multi-command | nested rollback prints 2 1 1 |
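The compute-parser golden check hinges on associativity: read right to left, 2^3^2 is 2^(3^2) = 2^9 = 512, while a left-associative parser would produce (2^3)^2 = 64. Here is a minimal recursive-descent sketch that gets this right. It is an illustration written for this article, not the reference solution for the archetype, and it omits unary minus and other features a real spec would likely require.

```python
import re

TOKEN = re.compile(r"\s*(\d+\.?\d*|[-+*/^()])")

def tokenize(text):
    text, pos, tokens = text.rstrip(), 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"bad input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.i += 1
        return tok

    def expr(self):  # expr := term (('+'|'-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):  # term := power (('*'|'/') power)*
        value = self.power()
        while self.peek() in ("*", "/"):
            op = self.take()
            rhs = self.power()
            value = value * rhs if op == "*" else value / rhs
        return value

    def power(self):  # power := atom ('^' power)?  (right recursion: right-associative)
        base = self.atom()
        if self.peek() == "^":
            self.take()
            return base ** self.power()
        return base

    def atom(self):  # atom := number | '(' expr ')'
        tok = self.take()
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok[0].isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return float(tok) if "." in tok else int(tok)

def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value

print(evaluate("2^3^2"), evaluate("(2^3)^2"))  # 512 64
```

The `+` and `*` levels loop, which makes them left-associative; the `^` level recurses on its right operand, which is what yields 512.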
The transaction archetype is the mean one: it needs correct nested-rollback semantics and a multi-command exec path, and it's where models most often ship passing unit tests over a broken feature. Hold that thought.
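To make "nested-rollback semantics and a multi-command exec path" concrete, here is one way such a store could look. Everything below is an assumption for illustration: the article does not publish the spec's command grammar, so the `;` separator, the meaning of COUNT (number of keys holding a value), and the command sequence that produces `2 1 1` are all hypothetical. Argument validation is left out to keep the sketch short.

```python
class KVStore:
    def __init__(self):
        self.data = {}
        self.snapshots = []  # one saved copy of data per open BEGIN

    def execute(self, line):
        """Run one command; return the text to print, or None for silent commands."""
        cmd, *args = line.split()
        cmd = cmd.upper()
        if cmd == "SET":
            self.data[args[0]] = args[1]
        elif cmd == "GET":
            return self.data.get(args[0], "NULL")
        elif cmd == "COUNT":
            return str(sum(1 for value in self.data.values() if value == args[0]))
        elif cmd == "BEGIN":
            self.snapshots.append(dict(self.data))
        elif cmd == "ROLLBACK":
            if not self.snapshots:
                return "NO TRANSACTION"
            self.data = self.snapshots.pop()  # undo only the innermost transaction
        else:
            return f"UNKNOWN {cmd}"
        return None

def exec_script(script):
    """Multi-command exec path: run ';'-separated commands, join printed outputs."""
    store = KVStore()
    outputs = []
    for command in script.split(";"):
        if command.strip():
            result = store.execute(command)
            if result is not None:
                outputs.append(result)
    return " ".join(outputs)

script = "SET a 1; BEGIN; SET a 2; BEGIN; SET b 9; ROLLBACK; GET a; ROLLBACK; GET a; COUNT 1"
print(exec_script(script))  # 2 1 1
```

The inner ROLLBACK must restore the outer transaction's state (a is 2), and the outer one must restore the pre-transaction state (a is 1). A store that keeps a single snapshot instead of a stack gets the first GET wrong.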
The tiers
The suite scales through five tiers: smoke (one app), light (one of each archetype), medium (five of each), high (ten of each), and extreme (ten of each, harder specs). More repetitions mean the run measures the model's consistency, not just one lucky build. For this study we ran medium: 5 runs × 3 archetypes × 2 model builds = 30 full app builds.
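The build count follows directly from the tier definitions. A quick sketch of that arithmetic, using the tier sizes listed above (the function and constant names are made up for the example):

```python
ARCHETYPES = ("crud-multiformat", "compute-parser", "transaction")

# Runs per archetype in each tier; "smoke" is a single app, handled separately.
RUNS_PER_ARCHETYPE = {"light": 1, "medium": 5, "high": 10, "extreme": 10}

def total_builds(tier, variants):
    """Number of full app builds for a tier across the given number of model builds."""
    if tier == "smoke":
        return variants
    return RUNS_PER_ARCHETYPE[tier] * len(ARCHETYPES) * variants

print(total_builds("medium", variants=2))  # 30
```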
The experiment
A clean paired A/B. Same model, qwopus3.6-27b, in two builds: the original GGUF (from jackrong) and the MLX conversion (from mlx-community). Identical frozen prompts, identical harness binary, identical flags apart from the variant selector, identical 3-node fleet. The only variable in the entire setup was the model format.
```bash
# what each variant ran: byte-identical except the loaded model
python -m harness bench --tier medium --variant gguf
python -m harness bench --tier medium --variant mlx
```
Results: speed
Median build time per run — lower is faster:
| Spec | GGUF | MLX | Winner |
| --- | --- | --- | --- |
| Overall | 1945.8s | 1923.9s | ~tie |
| compute | 1860.5s | 2192.6s | GGUF +15% |
| txn | 1945.8s | 1722.9s | MLX +13% |
| crud | 2239.2s | 2342.9s | ~tie |
Overall it's a wash. But the consistency diverges sharply: MLX had lower run-to-run variance (stdev 580s vs 751s), a lower p90 (3050s vs 3601s), and — the headline — it never once hit our 60-minute per-run cap. GGUF hit that cap twice, both on the crud archetype. More on those caps below, because they turned out to be our bug, not the model's.
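For readers reproducing these consistency figures from their own runs, a sketch of the summary follows. The durations in it are invented for the example, and two choices are assumptions: the article does not say whether its stdev is the sample or population form, nor which percentile method produced the p90. This version uses the sample standard deviation and a nearest-rank p90.

```python
import statistics

CAP_S = 3600.0  # the 60-minute per-run cap, in seconds

def summarize(durations_s):
    """Median, sample stdev, nearest-rank p90, and number of runs that hit the cap."""
    ordered = sorted(durations_s)
    rank = -(-9 * len(ordered) // 10)  # ceil(0.9 * n) using integer arithmetic
    return {
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),
        "p90": ordered[rank - 1],
        "capped": sum(1 for d in ordered if d >= CAP_S),
    }

# Hypothetical run durations in seconds, not the article's data.
example = [1500.0, 1700.0, 1800.0, 1900.0, 2000.0, 2100.0, 2300.0, 2500.0, 3000.0, 3600.0]
summary = summarize(example)
print(summary["median"], summary["p90"], summary["capped"])  # 2050.0 3000.0 1
```

A capped run is recorded at the cap rather than at its true duration, so caps pull the p90 and the stdev upward without saying anything about how long the build would actually have taken.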
Results: quality
Does the built app actually work:
| Metric | GGUF | MLX |
| --- | --- | --- |
| Raw checks-pass | 80% | 87% |
| App works (judged by running) | ~87% (13/15) | ~87% (13/15) |
| Swarm task success | 96% | 100% |
| Per-spec checks, % (compute / txn / crud) | 100 / 60 / 80 | 80 / 80 / 100 |
On raw checks MLX looks ahead, 87% to 80%. But when you actually run every app, the gap closes to a tie (~13 of 15 each). GGUF's lower raw number is dragged down by grading artifacts — two runs that were capped mid-cleanup but had already produced a correct app, and one transient false-partial. Judged on whether the software works, the two builds are even.
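The raw figures are consistent with a plain average of the per-spec check rates, assuming each archetype contributes the same number of runs (five each in the medium tier). A short sketch of that arithmetic, using only numbers from the table:

```python
# Per-spec checks in percent, in the order compute / txn / crud.
per_spec = {"GGUF": (100, 60, 80), "MLX": (80, 80, 100)}

raw = {build: round(sum(rates) / len(rates)) for build, rates in per_spec.items()}
working = round(100 * 13 / 15)  # 13 of 15 apps work when run

print(raw, working)  # {'GGUF': 80, 'MLX': 87} 87
```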
The interesting part: what running the apps revealed
A benchmark that only prints medians is boring. The value is in what the harness caught by watching itself and by running the software. Three findings mattered.
1. The caps were a swarm bug, not the model
GGUF's two capped runs weren't the model grinding — they were the swarm's own integrate-verify step churning in circles after the app was already built and correct. The step is a heavy critical-path task with no wall-clock budget, and the judge's repeated "ok" verdict was a no-op with nothing to force it to finish. So a healthy-but-slow finisher could run until the external cap killed it. That's a scheduler bug. We shipped an env-gated fix (a graceful wall-clock cap on that step) and are re-benchmarking to see if it closes the last of the gap. This is the entire point of the harness: it doesn't just rank models, it turns each run into a concrete fix.
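An env-gated wall-clock budget of this kind could look roughly like the sketch below. It illustrates the general technique, not the fix we shipped: the variable name `SWARM_INTEGRATE_BUDGET_S` and both function names are hypothetical. The budget is checked between steps, so a step that is already running is allowed to finish.

```python
import os
import time

def budget_from_env(name="SWARM_INTEGRATE_BUDGET_S"):
    """Budget in seconds, or None when the variable is unset (the gate is off)."""
    raw = os.environ.get(name)
    return float(raw) if raw else None

def run_with_budget(steps, budget_s, clock=time.monotonic):
    """Run step callables in order; stop gracefully once the budget is spent.

    Returns (completed_count, timed_out).
    """
    start = clock()
    completed = 0
    for step in steps:
        if budget_s is not None and clock() - start >= budget_s:
            return completed, True  # out of time: stop before starting more work
        step()
        completed += 1
    return completed, False

# With the gate off, every step runs and nothing times out.
print(run_with_budget([lambda: None] * 3, None))  # (3, False)
```

Gating on an environment variable keeps the old behaviour as the default, so capped and uncapped runs can be compared on the same harness binary.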
2. Passing tests, broken feature — on both builds
The transaction archetype produced the same failure three separate times, across both model formats: the model wrote a KV store whose own unit tests passed, while a required feature was broken or missing — an exec path that printed nothing, or a COUNT command it simply forgot to implement. The generated tests didn't cover it, so the model shipped it green. Because this showed up on GGUF and MLX alike, it isn't a runtime difference — it's a weak-model completeness limit. And it's the clearest possible argument for grading by running a spec-derived command instead of trusting the model's own tests. The harness's golden checks caught every one of these; the model's tests caught none.
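The failure pattern is easy to reproduce in miniature. The sketch below is a contrived example, not output from any model in the study: a store that never implemented COUNT, a self-written test that only covers what exists, and a check derived from the spec. The assumed COUNT semantics (number of keys holding a given value) are an illustration, since the archetype's spec is not reproduced here.

```python
def make_store():
    """A deliberately incomplete store: SET and GET work, COUNT was never written."""
    data = {}

    def execute(line):
        cmd, *args = line.split()
        if cmd == "SET":
            data[args[0]] = args[1]
            return ""
        if cmd == "GET":
            return data.get(args[0], "NULL")
        return ""  # unknown commands are silently ignored

    return execute

def model_written_tests():
    """Covers only what was implemented, so it passes."""
    run = make_store()
    run("SET a 1")
    return run("GET a") == "1"

def spec_derived_check():
    """Derived from the spec: COUNT must report how many keys hold the value."""
    run = make_store()
    run("SET a 1")
    run("SET b 1")
    return run("COUNT 1") == "2"

print(model_written_tests(), spec_derived_check())  # True False
```

The test suite is green and the feature is missing. Only the check that came from outside the model's own work notices.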
3. Raw pass-rate lies; judged-by-running doesn't
If we'd stopped at the raw checks column, we'd have called MLX the quality winner by 7 points. Running the apps erased that lead: two of GGUF's three "failures" were a correct app the harness mis-scored (a cap and a transient). The lesson generalises beyond this benchmark — report what the software does, not what the grader's first pass said.
The verdict
Remarkably close, and the folklore was wrong. MLX is not a downgrade: it ties GGUF on whether the app works, and it's the steadier of the two, with lower variance and no capped runs. GGUF is faster specifically on the compute-parser archetype (about 15% on median build time) and has a higher peak, at the cost of being spikier. If you want a safe default on Apple Silicon, MLX is it. If your workload looks like the parser spec and you can tolerate more variance, GGUF has an edge. Either is a reasonable choice, which, given how much simpler the MLX story is to reason about, is itself a win for MLX.
Run it yourself
swarm-gym is open. One command runs a tier and prints a stats report plus CSVs you can paste straight into a write-up — the same numbers behind this article. Load your local models, then:
```bash
python -m harness bench --tier medium --variant mlx
```
The full benchmark, with the per-spec breakdown, lives on our Agentic Benchmarks page — where you can also publish your own run with our system. If you've got a model and a Mac, we'd genuinely like to see your numbers.