A Self-Verifying Swarm of Small Local Models — and Watching the Ceiling Move
Mihai Perdum
5 min read · July 1, 2026
Key takeaways
"It runs" and "tests pass" are not "it is correct." Every win was confirmed by executing the app on the spec's own commands and checking the values it printed.
On a same-spec A/B, a community qwopus3.6-27b variant beat the official qwen3.6-27b 3–2–0 — the model you load can matter more than any orchestration on top of it.
Small local models fail at wiring modules together, not at writing them. Almost every feature we built attacks integration and verification, not module quality.
The run status lied in both directions — green when the feature was dead, red when the app worked. Making the verdict honest was its own multi-week project.
The wager
A single large model can write a command-line app. That's not interesting anymore. The question we actually wanted to answer was harder: can a fleet of small local models — each far weaker than a frontier model, each running on a machine you already own — be coordinated into producing correct, runnable software?
The whole project rests on one doctrine, and it's worth stating before anything else because everything else is downstream of it: "it runs" and "tests pass" are not "it is correct." Every win we counted was confirmed by executing the produced app on the spec's own example commands and checking the values it printed — never by a green test suite. A weak model will cheerfully write forty-five tests that all pass while the real program crashes on the first realistic input, because those tests only ever exercised each module in isolation. So the verification had to be adversarial, and it had to run the binary.
What follows is the six-day arc of building that, told as it actually unfolded: one agent → a parallel fleet → keep the fleet busy → stop it flailing → make it agree on interfaces → make it verify itself → make it heal itself → make its verdict honest. More than 330 commits later, the last measured limiting factor was no longer the swarm.
The fork and the fabric
Upstream goose (from Block) is a single-agent coding CLI: one model, one conversation, one working directory. Goose local edition turns that into goose swarm — a multi-device orchestrator that farms subtasks across a fleet.
The fleet is three Macs on a LAN — an M4 Max MacBook, an M3 Ultra "workhorse," and a third node — each running qwopus3.6-27b-coder-mtp in LM Studio at about 20 GB, wired together over LM Link, all weighted equally. No cloud, no A100s.
Before any orchestration could exist, two unglamorous foundations had to land. The first was a hard context cap (GOOSE_LOCAL_CONTEXT_CAP) with per-turn proactive compaction — because goose's built-in check fires only once, before the agent loop starts, and on a local model a bloated context is both slow and quality-degrading. Local edition re-checks after every turn and compacts to stay lean. The second is a distinction to hold onto for the rest of this post: there are two entirely separate LLM layers in play. The swarm-under-test is always the local Qwen fleet. The harness that invents tasks and grades them — introduced at the end — is a separate "brain," Claude by default. Keep them straight and everything else makes sense.
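To make the first of those foundations concrete, here is a minimal sketch of a per-turn re-check. It is not goose's implementation: the function names, the four-characters-per-token estimate, and the "keep the last two messages" policy are all assumptions chosen for illustration, and `summarize` stands in for whatever compaction call is actually used.

```python
from typing import Callable, List


def estimate_tokens(messages: List[str]) -> int:
    # Crude stand-in for a tokenizer: roughly four characters per token.
    return sum(len(m) for m in messages) // 4


def compact_if_needed(messages: List[str], cap: int,
                      summarize: Callable[[List[str]], str]) -> List[str]:
    """Return messages untouched when under the cap, else fold the oldest into a summary."""
    if estimate_tokens(messages) <= cap:
        return messages
    recent = messages[-2:]                 # keep the latest turns verbatim
    summary = summarize(messages[:-2])     # hypothetical compaction call
    return [summary] + recent


def agent_loop(turns: List[str], cap: int,
               summarize: Callable[[List[str]], str]) -> List[str]:
    messages: List[str] = []
    for turn in turns:
        messages.append(turn)
        # The check runs after every turn, not once before the loop starts.
        messages = compact_if_needed(messages, cap, summarize)
    return messages
```

The point of the shape is only where the check sits: inside the loop, so the context never drifts far past the cap between turns.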
A real run: the swarm discovers the three resident qwopus nodes and runs SCOUT and PLAN across all of them in parallel — no node left idle while the smart model thinks.
From one agent to a task DAG
The concurrency core lives in its own crate, goose-swarm, and it is deliberately model-agnostic — it never imports an LLM client, so the entire scheduler is unit-testable against mocks. A spec is decomposed into a validated, weighted dependency graph: duplicate-id and unknown-dependency checks plus a Kahn's-algorithm cycle check at load time, and if a plan is partly broken it reports "N of M orderable" so you can see how much is salvageable rather than failing blind.
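The load-time checks described above can be sketched in a few lines. This is an illustration in Python rather than the crate's Rust, and the plan shape (a list of id and dependency pairs) and the error strings are assumptions. Unknown dependencies are reported and then ignored as edges, which is a simplification.

```python
from collections import deque


def validate_plan(plan):
    """plan: list of (task_id, [dependency ids]). Returns (order, errors)."""
    errors, tasks = [], {}
    for task_id, deps in plan:
        if task_id in tasks:
            errors.append(f"duplicate id: {task_id}")
            continue
        tasks[task_id] = list(deps)

    indegree = {t: 0 for t in tasks}
    dependents = {t: [] for t in tasks}
    for task_id, deps in tasks.items():
        for dep in deps:
            if dep not in tasks:
                errors.append(f"{task_id}: unknown dependency {dep}")
                continue
            indegree[task_id] += 1
            dependents[dep].append(task_id)

    # Kahn's algorithm: repeatedly release tasks whose dependencies are all done.
    queue = deque(sorted(t for t, n in indegree.items() if n == 0))
    order = []
    while queue:
        task_id = queue.popleft()
        order.append(task_id)
        for nxt in sorted(dependents[task_id]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Anything never released sits on or behind a cycle.
    if len(order) < len(tasks):
        errors.append(f"cycle detected: {len(order)} of {len(tasks)} orderable")
    return order, errors
```

A plan with one healthy chain and one two-task cycle comes back with the healthy part ordered and a "2 of 4 orderable" message, which is the salvageability signal the paragraph describes.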
One piece of shared state sits behind a single mutex, and the invariant that makes the whole thing safe is simple: mutation of the graph is always serialized; model calls never are. The loop locks, extracts a batch of work, unlocks, and only then makes the slow async model calls. The ready set is a max-heap keyed on fan-out, so the task that unblocks the most downstream work runs first, which shrinks the critical path. A deterministic tie-break keeps a run reproducible. That skeleton is what everything else hangs on.
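A small sketch of that invariant, under stated assumptions: the class and function names are hypothetical, the tie-break is taken to be the task id (the post does not say what it is), and the model call is a plain synchronous callable rather than the async call the real loop makes.

```python
import heapq
import threading


class ReadySet:
    """Ready tasks ordered by fan-out, highest first."""

    def __init__(self):
        self._lock = threading.Lock()
        self._heap = []

    def push(self, task_id, fan_out):
        with self._lock:
            # heapq is a min-heap, so fan-out is negated to pop the largest first.
            # The task id breaks ties, which keeps the order deterministic.
            heapq.heappush(self._heap, (-fan_out, task_id))

    def take_batch(self, n):
        with self._lock:
            batch = []
            while self._heap and len(batch) < n:
                batch.append(heapq.heappop(self._heap)[1])
        return batch


def run_once(ready, free_slots, call_model):
    batch = ready.take_batch(free_slots)    # the lock is held only inside take_batch
    return [call_model(t) for t in batch]   # slow calls run with the lock released
```

The lock never wraps `call_model`, so a slow node can only delay its own task, never the scheduler's bookkeeping.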
Keeping every node busy
A swarm that leaves nodes idle is pointless, so a surprising amount of the work went into pure utilization. The device picker is work-stealing first: the least-loaded node wins, and the planner's model preference is only a tie-breaker. That ordering is deliberate — if preference came first, every same-model task would pile onto one node and strand the rest, which is the opposite of what a swarm is for. On an identical-model fleet the only real differentiator is host speed, so the hardest tasks route to the proven-fastest node by observed milliseconds-per-task, seeded from a configured weight before any measurement exists.
The war stories here are the interesting part, because most were invisible in the swarm's own logs and only showed up when you watched the actual fleet. Equal-load ties always starved one machine until a dispatch-count rotation was added as the tie-break. Idle-time jobs like the judge now claim a device, because otherwise a worker dispatch and a judging call would stack two model calls on one node — a "+1 queued on gabee while workhorse sits idle" bug you could only see in lms ps, never in the scheduler's own event log. Even the planning phase was parallelized so no node sleeps while the smart model is thinking.
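The picker's ordering can be written as a single sort key. This is a sketch, not the scheduler's code: the `Device` fields are hypothetical, and placing the dispatch-count rotation after model preference is an assumption, since the post only says that load comes first and that both of the others act as tie-breaks.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    in_flight: int     # tasks currently running on this node
    dispatched: int    # total tasks ever sent to this node
    model: str


def pick_device(devices, preferred_model=None):
    def key(d):
        return (
            d.in_flight,                              # 1. least-loaded wins
            0 if d.model == preferred_model else 1,   # 2. preference only breaks ties
            d.dispatched,                             # 3. rotation so no node starves
            d.name,                                   # 4. stable last resort
        )
    return min(devices, key=key)
```

With load first in the tuple, a preferred but busy node never beats an idle one, which is the behaviour the section argues for.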
Stopping the flailing
Weak local models don't fail the way you expect. They fail behaviorally, in ways a static test never catches: they narrate what they're about to do instead of calling a tool, they over-read files into paralysis, they spam cd and mkdir, they cat-loop on the swarm's own scratch files, and — the classic — they announce "done" without ever writing a file.
Every one of these got a structural fix, and every fix was traced to a real session transcript rather than guessed. A write-first directive. Dependency content injected straight into the worker prompt, so a worker never has to read a sibling file — it's handed to them. Owned-file directories pre-created so no worker ever runs mkdir. Containment to the working directory with absolute paths only, after workers were caught wandering into sibling apps. A hard cap on tool output — 30,000 characters, tuned up from a pathological 8,000 that made a routine cat spill to a temp file, which the model then re-catted, which re-spilled, looping to a timeout. And the timeout itself carries a nuance worth stating plainly: it is a no-progress window, not a wall-clock limit. It resets on every agent event, so an honest task that legitimately runs 885 seconds is never killed — only a genuinely dead stream trips it.
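The no-progress window is easy to get subtly wrong, so here is a minimal sketch of the idea. The class name and the injectable clock are illustrative assumptions, and the 300-second window in the usage note is a made-up value, not the swarm's setting.

```python
import time


class NoProgressWatchdog:
    """Trips only when no agent event has arrived for `window_s` seconds."""

    def __init__(self, window_s, clock=time.monotonic):
        self.window_s = window_s
        self._clock = clock
        self._last_event = clock()

    def on_event(self):
        # Every agent event (tool call, token chunk, file write) resets the window.
        self._last_event = self._clock()

    def expired(self):
        return self._clock() - self._last_event > self.window_s
```

A task that emits an event every 100 seconds can run for 885 seconds under a 300-second window without ever tripping it, while a stream that goes silent for 301 seconds does.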
Making modules agree before anyone writes code
Here is the single most important failure class, and the one that shaped the most features. Parallel workers each pass their own isolation unit tests and then drift on the shared interface. One module calls add(a, b); another defines add(x, y, z). Worse — and this really happened — one file writes a database row as fixtures(league, home, away) while the schema module defines fixtures(league_id, home_team, away_team). The app compiles. Forty-five unit tests pass. It crashes only when you run the whole thing end to end.
The fix is CONTRACTS: before anyone writes real code, the fleet fans out one signature-only stub-generation call per module, and the results are assembled into a single frozen bundle — exact names, type-annotated signatures, empty bodies, and a # SCHEMA block listing every table with its exact column names and types. That bundle is injected into every worker prompt, so each worker builds against a stable interface instead of a moving one.
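To show what a frozen bundle might look like and why it helps, the sketch below pairs a tiny contract with a drift check. Both are assumptions for illustration: the column types and the `save_fixture` function are invented, and the post describes injecting the bundle into prompts, not running a checker like this one.

```python
import ast

# Hypothetical bundle: signature-only stubs plus a schema block.
CONTRACT = '''
# SCHEMA
# fixtures(league_id INTEGER, home_team TEXT, away_team TEXT)

def add(a: int, b: int) -> int: ...
def save_fixture(league_id: int, home_team: str, away_team: str) -> None: ...
'''


def signatures(source):
    """Map each top-level function name to its positional parameter names."""
    sigs = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            sigs[node.name] = [arg.arg for arg in node.args.args]
    return sigs


def contract_drift(contract_src, impl_src):
    """List every contracted function the implementation omits or reshapes."""
    want, got = signatures(contract_src), signatures(impl_src)
    problems = []
    for name, params in want.items():
        if name not in got:
            problems.append(f"{name}: missing")
        elif got[name] != params:
            problems.append(f"{name}: expected {params}, found {got[name]}")
    return problems
```

Run against a worker file that defines `add(x, y, z)`, the check reports the parameter mismatch by name, which is the kind of drift the bundle is meant to prevent in the first place.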
Success
Unit tests that never run the end-to-end pipeline lie. A weak fleet's tests pass module-by-module while the assembled program crashes on the first real command — which is exactly why almost every gate we built verifies the wired-together app, not the parts.
There's an honest limit lurking here that motivates a later section — a contract conveys signatures, not data shapes. A formatter that expects a list of dicts while the CLI hands it strings still slips through.
Judge by running: the gate stack
This is the heart of the "self-verifying" claim, and it's a stack of oracles, layered cheapest-and-most-deterministic first. Two design rules run through all of them. First, every deterministic gate is engineered to never fail on a missing tool or a timeout — those are always "inconclusive," never a red — so a gate only ever reds on a genuine, reproducible defect. Second, each corrective re-dispatch is bounded to exactly one attempt, because the traceback is the worker's instruction.
The smoke gate is the clearest illustration of how this evolved. It began as import-only: collect the tests (which surfaces cross-module import errors) and run --help (which proves the entry point exists). That shipped a member-list crash as a green build once, because --help never executes a real code path. So the capstone added a step that actually runs the generated test suite — the generated tests are the model's own representative invocations of its program, so running them needs zero command synthesis and catches exactly the runtime crashes the import checks were blind to.
collect-only imports the modules; the new pytest -q step actually runs them; --help proves the entry point works. Three deterministic oracles, and each is inconclusive — never a false red — on a missing tool.
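A sketch of the three-oracle shape, assuming the generated app is a Python package with a pytest suite. The function names and the timeout value are hypothetical. The part worth noticing is that a missing tool or a timeout maps to "inconclusive" before any exit code is inspected.

```python
import shutil
import subprocess
import sys


def run_oracle(cmd, timeout_s=120.0):
    """Return 'pass', 'fail', or 'inconclusive' for one command."""
    if shutil.which(cmd[0]) is None:
        return "inconclusive"      # a missing tool is not a defect in the app
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "inconclusive"      # neither is a timeout
    return "pass" if proc.returncode == 0 else "fail"


def smoke_gate(entry_module):
    steps = {
        "collect": ["pytest", "--collect-only", "-q"],          # import drift
        "run": ["pytest", "-q"],                                # runtime crashes
        "help": [sys.executable, "-m", entry_module, "--help"], # entry wiring
    }
    results = {name: run_oracle(cmd) for name, cmd in steps.items()}
    # Only a genuine failure turns the gate red; inconclusive never does.
    verdict = "red" if "fail" in results.values() else "not-red"
    return verdict, results
```

Calling `smoke_gate("byte_oracle")` from the app's working directory would run all three steps and return the verdict together with the per-step results.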
Above the smoke gate sit three more layers: a model-free AST reviewer that walks the import graph to catch a module that was built but wired to nothing, or a stub function whose whole body is pass; an in-flight semantic judge watching live workers; and an idle-node correctness pre-review whose findings are fed forward. The backstop under all of it is the integrate-verify sink: a final task that depends on every other, builds the advertised entry point, and runs it on the spec's exact commands with a golden-value check. That sink is what made the whole thing honest, and it backstopped genuine runtime bugs on three separate later apps.
One spec becomes a fleet of parallel subtasks and, if the gates pass, a program that runs. The judge watches every task the whole way through.
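The model-free reviewer's two checks can be approximated with Python's standard `ast` module. This is a simplified sketch, not the reviewer itself: it assumes a flat layout where module names match file names, ignores bare relative imports, and the function names are hypothetical.

```python
import ast


def imported_modules(source):
    """Top-level module names imported anywhere in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names


def unwired(modules, entry):
    """modules: name -> source. Return modules the entry point never reaches."""
    seen, stack = set(), [entry]
    while stack:
        name = stack.pop()
        if name in seen or name not in modules:
            continue               # already visited, or a third-party import
        seen.add(name)
        stack.extend(imported_modules(modules[name]))
    return sorted(set(modules) - seen)


def stub_functions(source):
    """Names of functions whose entire body is a single `pass`."""
    return [
        node.name
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.FunctionDef)
        and len(node.body) == 1
        and isinstance(node.body[0], ast.Pass)
    ]
```

Neither check runs the program or asks a model anything, which is why a passing suite cannot hide what they find.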
Making the verdict honest — in both directions
The subtlest, most underappreciated stretch of the whole project: the run status lied both ways, and each direction needed its own fix.
False green looked like 135 tests plus smoke all passing while the actual feature was dead — a worker had written a stray cli.py to the repo root, so the test suite imported that file and passed, while the real -m byte_oracle entry errored. False red looked like a working app reported as FAILED because the judge over-killed a slow-but-working worker, and that false kill cascaded through the dependency graph to block the final verifier.
The fixes split kill authority. Only unambiguous deterministic signals — won't-compile, wrote-nothing-while-reading-a-lot — are allowed to kill a worker. The LLM judge is advisory, and it only acts above a high 0.85-confidence bar, because the judge is itself a weak local model and its bad verdict must not cut a healthy worker. The elegant capstone is salvage: the "finalize-spin" verdict only fires after the owned file was written to disk, which means the worker did produce real output — so instead of failing it (and cascading that failure), the swarm marks it done and lets the integrate-verify sink be the real gate. The first fully honest clean win — "done: 7, failed: none, integrate-verify ran and passed" — was a genuine milestone, not a metric.
The judge counts tool calls and file writes rather than reading tokens, so it catches 'explores forever, writes nothing' behaviorally — and only deterministic signals are allowed to actually kill a worker.
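The split in kill authority reads naturally as a small decision function. The sketch below is an interpretation, not the swarm's code: the field names and the read limit of 20 are invented, and treating a high-confidence judge verdict as "advise" is an assumption about what "advisory" means in practice. Only the 0.85 bar and the finalize-spin salvage rule come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

JUDGE_CONFIDENCE_BAR = 0.85  # the bar quoted in the post


@dataclass
class WorkerSnapshot:
    compile_error: bool = False
    files_written: int = 0
    reads: int = 0
    judge_verdict: Optional[str] = None   # e.g. "finalize-spin"
    judge_confidence: float = 0.0


def decide(w: WorkerSnapshot, read_limit: int = 20) -> str:
    # Deterministic signals are the only path to "kill".
    if w.compile_error:
        return "kill"
    if w.files_written == 0 and w.reads > read_limit:
        return "kill"
    # Below the bar, the LLM judge's opinion is ignored entirely.
    if w.judge_verdict is None or w.judge_confidence < JUDGE_CONFIDENCE_BAR:
        return "continue"
    # Finalize-spin with a file on disk: mark done and let integrate-verify decide.
    if w.judge_verdict == "finalize-spin" and w.files_written > 0:
        return "salvage"
    return "advise"
```

No branch lets the judge's verdict produce "kill", so a wrong opinion from a weak model can cost at most a note, never a healthy worker.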
The A/B that justified the fleet
Early on we ran the most persuasive experiment of the whole campaign: a controlled A/B, same specs, same frozen binary, only the model changed. The fleet's community qwopus3.6-27b against the official qwen3.6-27b build. qwopus won the head-to-head, roughly 3–2–0 across the contested apps, and scored higher on every dimension we graded.
Same apps, same frozen binary — the model was the only variable. qwopus won 3–2–0 and scored higher on correctness, test depth, quality, and spec fidelity.
Two examples carry it. On a Barnsley-fern renderer, the official model committed the classic broken-default-path failure: the wired default carried corrupted fractal parameters and rendered a malformed fern, while a correct implementation sat unused in another file. qwopus wrote a single correct implementation and rendered a real fern. On a byte-content sniffer, the official model's failure was a built-but-unwired duplicate — a whole detection module sitting dead while the entry point re-implemented all 220 lines of it inline — and running the app, running the unit tests, and a manual human read all gave it a clean bill of health. Only the deterministic AST import-graph reviewer caught it. That pair makes two points at once: the model you load matters, and the gates catch what humans and green suites miss.
The ceiling moved
As we raised the bar to feature-dense apps in the 800-to-1500-line range, a three-win streak broke — by design — at an app we called UNIQ21: a contacts manager with two entities, multiple output formats everywhere, a JSON round-trip, revenue aggregation, and a ten-command surface. It failed on a genuine weak-model coding error at that combined complexity — a cross-module data-shape crash, dicts where strings were expected. Crucially, reading the trace showed this was a capability failure, not a mechanism gap, so no fix was built. (The discipline matters: you don't overbuild on one partial result from an unconfirmed cause.)
Calibration runs then proved each hard dimension worked individually — the round-trip alone, two dimensions together, two entities together. UNIQ21 had failed only on the full four-dimension combination. Then the headline: a later app built that exact four-dimension combination cleanly — the entry split into a parser plus command handlers, with the DB-schema contracts and stub-first discipline holding the modules consistent — and a second app in a different domain confirmed it.
Note
The app class that used to fail now builds clean and works — twice, across two domains. The cumulative-overload ceiling genuinely moved once the fixes targeted integration instead of module quality. And beyond it, entirely new axes landed as clean wins: a recursive-descent expression parser with correct right-associativity (2^3^2 = 512, no eval()), a jq-style JSON-path engine, and nested transactional rollback.
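For readers wondering why right-associativity is a meaningful test, the whole difference is one line of grammar: the power rule recurses on its right-hand side instead of looping. The sketch below is a hand-written illustration, not the code the swarm produced, and it handles only non-negative integers, `+`, `-`, `*`, `^`, and parentheses.

```python
import re


def evaluate(text):
    tokens = re.findall(r"\d+|[-+*^()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():                      # expr  := term (('+' | '-') term)*
        value = term()
        while peek() in ("+", "-"):
            op = take()
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term():                      # term  := power ('*' power)*
        value = power()
        while peek() == "*":
            take()
            value *= power()
        return value

    def power():                     # power := atom ('^' power)?
        base = atom()
        if peek() == "^":
            take()
            return base ** power()   # recursing on the right gives right-associativity
        return base

    def atom():                      # atom  := NUMBER | '(' expr ')'
        tok = take()
        if tok == "(":
            value = expr()
            take()                   # consume the closing ')'
            return value
        return int(tok)

    return evaluate_result(expr)


def evaluate_result(parse):
    # Kept separate so the entry rule is the only thing evaluate() calls.
    return parse()
```

Parsed this way, `2^3^2` groups as `2^(3^2)` and yields 512, whereas a left-to-right loop would group it as `(2^3)^2` and yield 64.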
The honest limits
This is where the credibility lives, so let me be plain about what still fails and why — framed as weak-model capacity, not swarm coordination. Recursive-algorithm cores can defeat the 27B across every attempt: a TypeScript expression evaluator that crashed on every input, a recursive JSON-schema validator it couldn't get right. But the ceiling is scoping-specific — the same model, given a dedicated module, produced a correct topological sort and a correct graph coloring on the first try. Weak-model self-repair of a hard runtime bug is unreliable: handed the exact compile error three times, it still couldn't fix an unterminated string literal. Cross-module data-shape consistency at cumulative complexity remains the real frontier, and a data-shape contract extension was honestly parked as low-confidence rather than shipped on a promise. And the one systemic non-correctness gap is speed: Python apps run about 40 to 47 minutes against a 15-to-25-minute goal, with the test tasks, the entry-point chokepoint, and the integrate-verify tail as the dominant sinks.
The autonomous loop that built all of it
To close the frame from the top: the whole campaign ran as an autonomous loop over a self-driving harness. An AI brain (Claude) invents a coding task, drives it through goose swarm across several "vibing" turns like a real, slightly demanding user, and verifies the output from logs and artifacts — but crucially, the real judge-by-running is a deterministic layer that builds, runs --help, executes the tests, and pipes the app end to end. The meta-lessons are worth ending on, because they're what kept the whole thing honest: read the actual session trace before blaming the model (an "over-eager judge" hypothesis was overturned by reading the killed attempts and finding genuine flailing); grade the output after the run finishes, never mid-run (a stale mid-run check once invented a bug the run's own later phases had already fixed); and capture the real process exit code, not a pipe's. Every knob in the stack exists because a specific benchmarked app broke without it.
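The exit-code lesson is worth a concrete demonstration, because the failure is silent. In the sketch below the function names are hypothetical, and the second function assumes a POSIX `sh` is available. It shows how a pipeline run through a shell reports the status of its last stage, while running the command directly and capturing output in-process preserves the real status.

```python
import subprocess
import sys

# A stand-in for an app that prints something and then fails.
FAILING = [sys.executable, "-c", "import sys; print('boom'); sys.exit(3)"]


def run_and_capture(cmd):
    """Run the command directly; the exit code is the app's own."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def run_through_pipe(shell_line):
    """Run a shell pipeline; without pipefail, the status is the last stage's."""
    return subprocess.run(["sh", "-c", shell_line], capture_output=True).returncode
```

A harness that logs via something like `app | tee run.log` and then reads the shell's status is grading `tee`, not the app, which is how a crashing run can be recorded as a success.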
The features, and why each exists
If you want the inventory rather than the narrative — here's what got built, grouped by phase, each with the failure it fixes and how it works.
Planning and research
Parallel research scouts — the planner writing scoping questions first is a serial bottleneck that idles the fleet, so instead fixed-lens scouts (codebase, libraries, architecture, edge-cases) fan out in parallel, one per device, read-only, each returning partial results if it overruns its budget.
Best-of-N skeleton planning — the smart 27B is the bottleneck and low-quant workers can't reliably emit a structured DAG, so it drafts several structural skeletons in parallel and a pure-Rust scorer picks the widest, flattest, least-conflicting one (rewarding parallel width up to fleet size, penalizing depth, file overlap, and chokepoints) using the same loader the executor uses, so planner and executor can never disagree.
Fleet detailing — expanding every terse subtask into an implementation-ready spec is fanned across the fleet, each detailer handed the subtask's exact owned filenames, because a detailer that invents a contradicting filename makes the worker write the wrong file and fail forever.
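The best-of-N scorer in the list above is written in Rust. The Python sketch below only illustrates the shape of such a score. Every weight is invented, and "chokepoint" is approximated here as a dependency level containing exactly one task, which may not match the real definition. It assumes a non-empty, acyclic plan whose dependencies all exist.

```python
def score_skeleton(tasks, fleet_size):
    """Score one candidate plan; higher is better.

    tasks maps a task id to {"deps": [ids], "files": [owned paths]}.
    """
    depth = {}

    def level(task_id):
        # Level 1 for roots; otherwise one more than the deepest dependency.
        if task_id not in depth:
            deps = tasks[task_id]["deps"]
            depth[task_id] = 1 + max((level(d) for d in deps), default=0)
        return depth[task_id]

    levels = [level(t) for t in tasks]
    per_level = {lv: levels.count(lv) for lv in set(levels)}
    width = max(per_level.values())
    chokepoints = sum(1 for n in per_level.values() if n == 1)

    # Count how many extra owners each file has beyond the first.
    owners = {}
    for spec in tasks.values():
        for path in spec["files"]:
            owners[path] = owners.get(path, 0) + 1
    overlap = sum(n - 1 for n in owners.values())

    return (2.0 * min(width, fleet_size)   # reward width, capped at fleet size
            - 1.0 * max(levels)            # penalize depth
            - 3.0 * overlap                # penalize shared files
            - 0.5 * chokepoints)           # penalize serial bottlenecks
```

Picking the winner is then `max(candidates, key=lambda c: score_skeleton(c, fleet_size))`. On a three-node fleet, a fan-out plan outscores a five-task chain, because width beyond the fleet size earns nothing while every extra level costs.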
Scheduling and utilization
Work-stealing, speed-aware scheduler — least-loaded node wins first (model preference is only a tie-break), and on an identical-model fleet the hardest tasks route to the fastest host by observed milliseconds-per-task.
Idle re-route with transient retry — an idle-based watchdog re-queues a genuinely stalled task onto a different node, while a slow-but-progressing local model runs untouched.
Dynamic replan — when two-plus slots go idle while work is still in flight, the planner injects fresh bonus tasks (tests, edge cases, hardening — never README or CI busywork) whose failure never fails the run.
Anti-flail worker discipline
Confidence / ASK gate — below a strength-scaled confidence floor (weaker planners ask sooner) the swarm generates genuine clarifying questions instead of committing to a bad decomposition; default-off so upstream builds stay byte-identical.
Stub-first / skeleton-first — write a compiling skeleton of each owned file first, which both exposes a bad import immediately and mechanically exempts the worker from the over-read kill, then fill the bodies.
CLI-contract and keyword-name rules — the entry file is the CLI and is where a weak worker most drifts the shape, so a structure contract freezes it (nested stays nested, positional stays positional, no renaming), including the rule that add_parser("import") must stay a verbatim string and never become import_.
DONE and hallucinated-completion gates — a worker that claims "done" with a syntactically broken file, or with no file written at all, is re-dispatched with the exact error as a supervisor note rather than blindly retried.
Interface agreement
DB-schema contracts / frozen interfaces — signature-only stubs plus an exact # SCHEMA block, frozen before execution and injected into every worker prompt, so cross-module signature and column drift can't happen.
Verification gates
Smoke gate — deterministic oracles: collect-only for import drift, then running the tests for runtime crashes, then --help for entry wiring; each inconclusive (never a false red) on a missing tool.
Model-free AST reviewer — walks the import graph to flag a built-but-unwired module or a stub function that a passing suite hides, chasing only new findings versus a pre-execution snapshot.
In-flight judge and idle-node pre-review — a live semantic judge that may only kill on deterministic signals, plus a spare node that correctness-reviews finished work and feeds its findings to the final verifier.
Integrate-verify sink — the backstop: a final task that builds and runs the advertised entry on the spec's exact commands with a golden-value check per command.
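A golden-value check, as used by the integrate-verify sink in the list above, is simple to sketch. The function name, the substring match, and the case format are assumptions. The real sink may compare values more strictly. Missing binaries and timeouts are reported as inconclusive, following the rule stated earlier for deterministic gates.

```python
import subprocess
import sys


def integrate_verify(cases, timeout_s=60.0):
    """cases: list of (argv, golden substring). Returns one status per case."""
    results = []
    for argv, golden in cases:
        try:
            proc = subprocess.run(argv, capture_output=True, text=True,
                                  timeout=timeout_s)
        except (FileNotFoundError, subprocess.TimeoutExpired):
            results.append("inconclusive")
            continue
        # Passing requires a clean exit and the expected value in the output.
        ok = proc.returncode == 0 and golden in proc.stdout
        results.append("pass" if ok else "fail")
    return results


# Example: each case runs a real command and checks the value it prints.
EXAMPLE_CASES = [
    ([sys.executable, "-c", "print(2 ** 3 ** 2)"], "512"),
    ([sys.executable, "-c", "print(64)"], "512"),
]
```

The second example case exits cleanly and still fails, which is the distinction between "it runs" and "it is correct" that the sink exists to enforce.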
Honesty and recovery
Salvage — a finalize-spin verdict fires only after a file was written, so salvaging that task as done (rather than failing and cascading it) lets the real verifier be the gate.
Model-free judge — kill authority split so only unambiguous deterministic signals can cut a worker; the LLM judge is advisory, high-confidence-gated, with a re-judge cooldown so it doesn't waste calls.
Infrastructure
Per-turn compaction — caps the effective context window and re-checks after every turn, because large context is slow and quality-degrading on local models.
MCP worker extensions — library-docs, web-search, and doc-processor tools built from runtime env at dispatch and handed only to workers, with a missing secret simply skipping that extension.
If there's a single takeaway, it's the shape of that list: almost none of it is about making the model write better code. It's about coordination, interface agreement, and verification — because on a fleet of small local models, that's where the leverage is. The companion tutorial, Inside goose-swarm, takes the same system apart at the level of the scheduler loop, the judge's thresholds, and the test harness that found every one of these failures in the first place.