Blog - Atlassian Insights & Tutorials

You've pulled a local coding model onto your Mac and you're staring at two downloads of the same weights: a GGUF build and an MLX build. Which one do you actually run? The folklore says MLX is faster on Apple Silicon, but you pay for it in quality. We got tired of guessing, so we built a harness that makes local coding agents produce real software, grades that software by running it, and then pointed the whole thing at the exact same model in both formats. This is what we found, and how the harness works.

Why you can't just ask the model how it did

Local coding models are very good at declaring victory. The tests pass, the summary says done, the diff looks plausible — and the actual feature is quietly broken. If you grade a model on its own report, you are measuring its confidence, not its competence. So the first design decision was the most important one: everything is graded by running it. Build an inventory store, add two items, ask for the total, and check the number against a golden value. If the program doesn't run, or prints the wrong answer, it doesn't count — no matter what the model claimed.
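
To make "graded by running it" concrete, here is a minimal sketch of a golden-value check. It is not swarm-gym's actual grader: the function name `grade_by_running`, the timeout, and the stand-in app are all assumptions made for illustration. The only idea taken from the text is that a program counts only if it runs and prints the expected answer.

```python
import subprocess
import sys


def grade_by_running(command, golden_stdout, timeout_s=30):
    """Run a built program and compare what it prints to a golden value.

    Returns True only if the program exits cleanly AND prints the expected
    answer. A crash, a hang, or a wrong number all count as failure.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # could not start, or hung: does not count
    if result.returncode != 0:
        return False  # crashed: does not count
    return result.stdout.strip() == golden_stdout


# Hypothetical stand-in for a model-built inventory app: it "adds" two items
# and prints the total, so the sketch runs without any generated code.
fake_app = [sys.executable, "-c", "print(f'{2 * 5 + 10:.2f}')"]

print(grade_by_running(fake_app, "20.00"))  # the total matches the golden value
print(grade_by_running(fake_app, "25.00"))  # a wrong answer is a failure
```

Nothing the model says about its own work enters the function. The only inputs are a command to run and the value it should print.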

The harness: swarm-gym

The harness is called swarm-gym, and it ships inside goose local-edition — our open-source fork of the goose agent, hardened for local models and extended with a multi-node build swarm. One run of swarm-gym does five things in sequence:

1. Plan: a planner model decomposes the spec into a small DAG of subtasks (core module, CLI entry, tests, an integrate-verify step) with explicit file ownership.
2. Execute: worker agents build those subtasks in parallel across the fleet, each confined to its own files, seeing the signatures of its siblings.
3. Judge: a semantic judge watches each worker's live activity and intervenes only on real trouble (a worker looping on a failing test, or shipping a stub).
4. Integrate-verify: a final sink task wires the modules together and runs a smoke gate: does the package import, does its entry point run.
5. Grade: deterministic verification runs the built app against golden values and records everything to a ledger.
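
The planning step can be pictured as a small dependency graph. The sketch below is an illustration only: the task names, file paths, and dependency edges are hypothetical, and this is not swarm-gym's planner. It uses Python's standard `graphlib` to check two properties the text describes: no two subtasks own the same file, and the subtasks form a DAG that can be run in parallel waves ending at the integrate-verify sink.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: subtask -> (files it owns, subtasks it depends on).
plan = {
    "core": ({"inventory/core.py"}, set()),
    "cli": ({"inventory/cli.py"}, {"core"}),
    "tests": ({"tests/test_core.py"}, {"core"}),
    "integrate-verify": ({"inventory/__main__.py"}, {"cli", "tests"}),
}


def check_ownership(plan):
    """Return a file -> owner map, failing if two subtasks claim one file."""
    owner = {}
    for task, (files, _deps) in plan.items():
        for path in files:
            if path in owner:
                raise ValueError(
                    f"{path} is owned by both {owner[path]} and {task}"
                )
            owner[path] = task
    return owner


def parallel_waves(plan):
    """Group subtasks into waves whose dependencies are already finished."""
    sorter = TopologicalSorter(
        {task: deps for task, (_files, deps) in plan.items()}
    )
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


check_ownership(plan)
for wave in parallel_waves(plan):
    print(wave)
```

With these made-up edges, `cli` and `tests` land in the same wave, which is the point where workers could run side by side.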

The whole run is captured — a structured event log, every worker's tool calls, per-device dispatch counts — so a result is never just a number; it's traceable back to what each agent actually did.
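
As a rough picture of what "traceable" means here, the sketch below reads a JSON-lines event log and answers two questions: how many tasks each device was handed, and which tools a given worker called. The record shape, field names, and device names are invented for the example and are not swarm-gym's real log schema.

```python
import json
from collections import Counter

# Hypothetical event records, one JSON object per line.
raw_log = """\
{"event": "dispatch", "task": "core", "device": "mac-studio"}
{"event": "tool_call", "task": "core", "tool": "write_file"}
{"event": "dispatch", "task": "cli", "device": "macbook"}
{"event": "dispatch", "task": "tests", "device": "mac-studio"}
{"event": "tool_call", "task": "tests", "tool": "shell"}
"""

events = [json.loads(line) for line in raw_log.splitlines() if line.strip()]


def dispatch_counts(events):
    """Count how many subtasks were dispatched to each device."""
    return Counter(e["device"] for e in events if e["event"] == "dispatch")


def tool_calls_for(events, task):
    """List, in order, the tools one worker's task invoked."""
    return [
        e["tool"]
        for e in events
        if e["event"] == "tool_call" and e["task"] == task
    ]


print(dict(dispatch_counts(events)))
print(tool_calls_for(events, "tests"))
```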

Archetype	The app	What it stresses	A golden check
crud-multiformat	an inventory CLI (JSON + table output)	data modelling, serialisation, aggregation	add 4 × $5 → total is $20.00
compute-parser	a recursive-descent calculator	parsing, precedence, associativity	2^3^2 = 512 (right-associative)
transaction	a nested-transaction KV store	state, nested BEGIN/ROLLBACK, multi-command	nested rollback prints 2 1 1
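
To show why 2^3^2 is a useful golden check, here is a minimal recursive-descent calculator. It is our own sketch, not output from either build, and the grammar is an assumption (no unary minus, for instance). Right-associativity falls out of the `power` rule recursing on its right-hand side, while `+ - * /` loop and so associate to the left.

```python
import re

TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[-+*/^()])")


def tokenize(text):
    """Split an expression into number, operator, and parenthesis tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at {pos}: {text[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.i += 1
        return tok

    def expr(self):
        # expr := term (('+' | '-') term)*    loop, so left-associative
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        # term := power (('*' | '/') power)*  loop, so left-associative
        value = self.power()
        while self.peek() in ("*", "/"):
            op = self.take()
            rhs = self.power()
            value = value * rhs if op == "*" else value / rhs
        return value

    def power(self):
        # power := atom ('^' power)?          recursion, so right-associative
        base = self.atom()
        if self.peek() == "^":
            self.take()
            return base ** self.power()
        return base

    def atom(self):
        tok = self.take()
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok[0].isdigit():
            raise SyntaxError(f"unexpected token: {tok!r}")
        return float(tok) if "." in tok else int(tok)


def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input: {parser.peek()!r}")
    return value


print(evaluate("2^3^2"))    # parsed as 2^(3^2)
print(evaluate("(2^3)^2"))  # explicit left grouping, for contrast
```

A parser that handles `^` with the same loop as `*` would return 64 for the first expression, so a single golden value separates the two designs.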

Spec	GGUF	MLX	Winner
Overall	1945.8s	1923.9s	~tie
compute	1860.5s	2192.6s	GGUF +15%
txn	1945.8s	1722.9s	MLX +13%
crud	2239.2s	2342.9s	~tie

Metric	GGUF	MLX
Raw checks-pass	80%	87%
App works (judged by running)	~87% (13/15)	~87% (13/15)
Swarm task success	96%	100%
Per-spec checks (compute / txn / crud)	100 / 60 / 80	80 / 80 / 100
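
The rows of this table can be cross-checked with a few lines of arithmetic. One assumption is ours: the raw checks-pass figure is treated as the unweighted mean of the three per-spec scores. The numbers are consistent with that reading, but the table does not state it.

```python
# Per-spec check scores copied from the table above (percent).
per_spec = {
    "GGUF": {"compute": 100, "txn": 60, "crud": 80},
    "MLX": {"compute": 80, "txn": 80, "crud": 100},
}


def mean_pass_rate(scores):
    """Unweighted mean of per-spec pass rates, rounded to a whole percent."""
    return round(sum(scores.values()) / len(scores))


def app_works_rate(passed, total):
    """Share of built apps that gave the right answers when actually run."""
    return round(100 * passed / total)


for build, scores in per_spec.items():
    print(build, mean_pass_rate(scores), app_works_rate(13, 15))
```

On this reading the two builds differ by seven points on raw checks and not at all once each app is judged by running it.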

MLX vs GGUF on Apple Silicon: Benchmarking the Same Local Model Two Ways

Why you can't just ask the model how it did

The harness: swarm-gym

The swarm

Two modes: benchmark and exploratory

The archetypes

The tiers

The experiment

Results: speed

Results: quality

The interesting part: what running the apps revealed

1. The caps were a swarm bug, not the model

2. Passing tests, broken feature — on both builds

3. Raw pass-rate lies; judged-by-running doesn't

The verdict

Run it yourself