Inside goose-swarm: How We Turned One Local Model Into a Self-Verifying Fleet
By Mihai Perdum
July 1, 2026 · 5 min read
Key takeaways
The concurrency core is decoupled from any LLM by three traits, so the whole scheduler is unit-testable against mocks — no model, no network.
The judge needs no model for the verdicts that can kill: it counts tool calls and file writes, so 'explores forever, writes nothing' is caught in ~150s, not 7 minutes.
Every gate is engineered to be inconclusive on a missing tool or timeout, so it only ever reds on a genuine, reproducible defect.
The test harness (swarm-gym) is where every one of these features came from — it reproduces the behavioral failures a static unit test can't see.
Inside goose-swarm
This is the engineering teardown of the swarm in goose local-edition — how one spec becomes a fleet of parallel subtasks, how a model-free scheduler keeps every node busy without stepping on itself, how the judge decides whether a worker is healthy, and how a stack of gates decides whether the finished program is any good. If the companion blog post is the why, this is the how, down to the thresholds.
The one-line arc to keep in mind: single agent → parallel fleet → keep the fleet busy → stop it flailing → make it agree on interfaces → make it verify itself → make it heal itself → make its verdict honest → prove each gate pays off on ever-harder apps. It was built across more than 330 commits in about a week, on a fleet of three qwopus3.6-27b-class models in LM Studio (gabee, mihai, workhorse) over LM Link, all weight-1.
The architecture: three traits, and why it's testable
The goose-swarm crate is the model-agnostic concurrency core. It owns the validated DAG, the scheduler engine, the judge, the replanner, and the event log — and it never imports an LLM client. Everything that actually touches a model lives behind three traits, implemented over in the CLI (goose-cli/src/commands/swarm.rs, a little over 7,000 lines):
```rust
trait TaskDispatcher { /* run one subtask on one device */ }
trait Judge { /* is a running task healthy? */ }
trait Replanner { /* add bonus work when nodes idle */ }
```
The payoff of that boundary is that the entire scheduler — attempt epochs, file locks, deadlock detection — is pinned by tests that never call a model. It also enforces the single most important concurrency rule in the codebase: the loop locks the mutex, extracts a plan, unlocks, and only then awaits the model call. Mutation of the graph is always serialized; model calls never are. And because the core is inert until an implementation is attached, every feature ships default-off behind an env gate, so the crate stays byte-identical to upstream when nothing is wired in.
| Trait | Core call-site | Implemented in goose-cli as |
| --- | --- | --- |
| TaskDispatcher | the executor spawn | a real goose agent conversation on one device |
| Judge / PreReviewer | idle-capacity ticks | deterministic checks + an LLM verdict |
| Replanner | idle-fill | a planner call for bonus subtasks |
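To make the testability claim concrete, here is a minimal sketch of a scheduler pass driven by a mock dispatcher. The `dispatch` signature, the `State` fields, and `run_pass` are illustrative assumptions, not the crate's real API, and the real loop presumably awaits an async call where this sketch makes a blocking one. What it does show is the lock, extract, unlock, then call ordering described above.

```rust
use std::sync::Mutex;

// Hypothetical, simplified signature; the real trait is likely async.
trait TaskDispatcher {
    fn dispatch(&self, task_id: &str, device: &str) -> Result<String, String>;
}

// Stand-in for the shared scheduler state (assumed fields).
struct State {
    ready: Vec<String>,
    done: Vec<String>,
}

// A mock that records calls instead of talking to a model.
struct MockDispatcher {
    calls: Mutex<Vec<(String, String)>>,
}

impl TaskDispatcher for MockDispatcher {
    fn dispatch(&self, task_id: &str, device: &str) -> Result<String, String> {
        self.calls
            .lock()
            .unwrap()
            .push((task_id.to_string(), device.to_string()));
        Ok(format!("{task_id} ok"))
    }
}

fn run_pass(state: &Mutex<State>, dispatcher: &dyn TaskDispatcher, device: &str) -> usize {
    // 1. Lock, extract a plan, unlock: the guard is dropped at the end of this block.
    let plan: Vec<String> = {
        let mut guard = state.lock().unwrap();
        std::mem::take(&mut guard.ready)
    };
    // 2. Call the (slow) dispatcher with no lock held.
    let mut completed = Vec::new();
    for id in &plan {
        if dispatcher.dispatch(id, device).is_ok() {
            completed.push(id.clone());
        }
    }
    // 3. Re-lock only to record results.
    let mut guard = state.lock().unwrap();
    let n = completed.len();
    guard.done.extend(completed);
    n
}
```

Because the dispatcher is a trait object, a test can assert on what the scheduler asked for without a model or a network.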
The scheduler loop: one owner, a fan-out heap
There is exactly one piece of shared State — it owns the DAG, the ready-heap, the per-device counters, and the set of held files. The ready set is a binary max-heap keyed on fan-out, and the ordering is worth seeing because it encodes a scheduling decision and a reproducibility decision at once:
```rust
use std::cmp::Ordering;

// higher fan-out pops first; ties break to the smallest id (deterministic)
impl Ord for Ranked {
    fn cmp(&self, other: &Self) -> Ordering {
        self.fan_out.cmp(&other.fan_out)
            .then_with(|| other.id.cmp(&self.id))
    }
}
```
Fan-out first, because the task that unblocks the most downstream work should start first — it shrinks the critical path. The inverted id comparison makes ties resolve to the lexicographically smallest id, so the same plan schedules the same way every time.
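A self-contained sketch of that ordering in a `std::collections::BinaryHeap` makes the tie-break easy to check. The `Ranked` fields match the snippet above; the `pop_order` helper and the sample ids are made up for illustration.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

#[derive(Debug, PartialEq, Eq)]
struct Ranked {
    fan_out: usize,
    id: String,
}

impl Ord for Ranked {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher fan-out is "greater"; on a tie the smaller id is "greater",
        // so the max-heap pops it first.
        self.fan_out
            .cmp(&other.fan_out)
            .then_with(|| other.id.cmp(&self.id))
    }
}

impl PartialOrd for Ranked {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

// Hypothetical helper: drain the heap and return ids in pop order.
fn pop_order(tasks: Vec<Ranked>) -> Vec<String> {
    let mut heap = BinaryHeap::from(tasks);
    let mut out = Vec::new();
    while let Some(r) = heap.pop() {
        out.push(r.id);
    }
    out
}
```

Feeding it two fan-out-3 tasks and two fan-out-1 tasks in any insertion order yields the same pop order, which is the reproducibility property the inverted comparison buys.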
Each pass claims everything it can place, and the claiming is drain-then-refill rather than peek: file-conflict and capacity decisions depend on claims made earlier in the same pass (each claim mutates the held-file set and the in-flight counts), so you have to pop, decide, mutate, and requeue. Each claim becomes a spawned task whose completion re-locks the state and records the result. A deadlock check — nothing dispatched, nothing in flight, but not everything terminal — emits a "stuck" event and bails rather than hanging.
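The drain-then-refill shape and the deadlock check can be sketched as follows. This is a simplification under stated assumptions: the ready set is passed as a `Vec` already in priority order (standing in for popping the heap), capacity is a single fleet-wide slot count, and `Task`, `claim_pass`, and `is_stuck` are hypothetical names.

```rust
use std::collections::HashSet;

struct Task {
    id: String,
    files: Vec<String>,
}

// One scheduling pass: pop each ready task, decide, mutate, and requeue the rest.
fn claim_pass(
    ready: &mut Vec<Task>,
    held: &mut HashSet<String>,
    mut free_slots: usize,
) -> Vec<String> {
    let mut claimed = Vec::new();
    let mut skipped = Vec::new();
    for task in ready.drain(..) {
        // The decision depends on claims made earlier in this same pass.
        let conflict = task.files.iter().any(|f| held.contains(f));
        if free_slots == 0 || conflict {
            skipped.push(task);
            continue;
        }
        held.extend(task.files.iter().cloned());
        free_slots -= 1;
        claimed.push(task.id);
    }
    // Refill: everything that could not be placed goes back on the ready set.
    *ready = skipped;
    claimed
}

// Deadlock check: nothing dispatched, nothing in flight, but not everything terminal.
fn is_stuck(dispatched: usize, in_flight: usize, all_terminal: bool) -> bool {
    dispatched == 0 && in_flight == 0 && !all_terminal
}
```

With two tasks that both own `x.py`, the first claim puts the file in the held set and the second is requeued within the same pass, which a peek-only loop would not see.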
Two subtleties earn their keep. The loop ticks every 15 seconds only when a judge or pre-reviewer is attached, so time-based thresholds still evaluate even when a lone stuck worker is emitting no completions; without those features it's purely completion-driven. And an attempt-epoch guard means that if a killed worker's future finally returns a stale Ok after its replacement is already running, the stale result is dropped instead of clobbering the healthy re-dispatch.
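The attempt-epoch guard is small enough to sketch in full. The `Attempts` type and its method names are assumptions for illustration; the idea is only that every dispatch bumps a per-task counter and a result is recorded only if it carries the current value.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Recorded {
    Applied,
    DroppedStale,
}

#[derive(Default)]
struct Attempts {
    epoch: HashMap<String, u64>,
    results: HashMap<String, String>,
}

impl Attempts {
    // Each (re-)dispatch bumps the task's epoch and hands it to the worker.
    fn dispatch(&mut self, task_id: &str) -> u64 {
        let e = self.epoch.entry(task_id.to_string()).or_insert(0);
        *e += 1;
        *e
    }

    // A result is accepted only if it comes from the current attempt.
    fn record(&mut self, task_id: &str, epoch: u64, result: &str) -> Recorded {
        if self.epoch.get(task_id) != Some(&epoch) {
            return Recorded::DroppedStale;
        }
        self.results.insert(task_id.to_string(), result.to_string());
        Recorded::Applied
    }
}
```

A killed worker that returns late still holds the old epoch, so its `Ok` is dropped rather than overwriting the replacement's state.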
pick_device: work-stealing first, then speed
Device selection filters to nodes with free capacity, drops the node to avoid (but never strands — it falls back to the full free set if that empties the pool), and then minimizes over a five-element tuple. Each slot is a distinct design decision:
in_flight comes first, and it is the work-stealing: the least-loaded node wins. If the planner's preferred_model were honored first, every same-model task would pile onto one node and leave the rest of the fleet idle.
speed applies only to hard tasks. It is the observed average milliseconds-per-task, seeded from a configured weight as a near-maximum before any measurement exists, so the heaviest task (including the high-fan-in integrate-verify sink) routes to the known-fastest host from the very first dispatch.
prefers_rank is only a tie-break: the planner's suggestion is honored only when it costs no imbalance.
weighted_load skews cumulative share toward faster hosts while rotating work so no host starves. Its dispatch-count component specifically fixed a last-device-starvation bug where equal-load ties always went to the same node.
Work-stealing in action: scouts and skeleton drafts fanned across all three nodes, and hard tasks routed to the fastest observed host.
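A minimal sketch of the selection, using only the four slots named above plus the device name as a final deterministic tie-break. The slot order, the field names, and that last tie-break are assumptions; the real tuple is not reproduced here.

```rust
#[derive(Debug)]
struct Device {
    name: String,
    in_flight: u32,
    capacity: u32,
    avg_ms: u64,        // observed average ms per task (assumed field)
    weighted_load: u64, // cumulative weighted share (assumed field)
}

fn pick_device<'a>(
    devices: &'a [Device],
    avoid: Option<&str>,
    hard: bool,
    preferred: Option<&str>,
) -> Option<&'a Device> {
    // Only nodes with free capacity are candidates.
    let free: Vec<&Device> = devices
        .iter()
        .filter(|d| d.in_flight < d.capacity)
        .collect();
    // Steer away from the failing node, but never strand the task.
    let steered: Vec<&Device> = free
        .iter()
        .copied()
        .filter(|d| Some(d.name.as_str()) != avoid)
        .collect();
    let pool = if steered.is_empty() { free } else { steered };
    pool.into_iter().min_by_key(|d| {
        let speed = if hard { d.avg_ms } else { 0 }; // speed only matters for hard tasks
        let prefers_rank: u8 = if Some(d.name.as_str()) == preferred { 0 } else { 1 };
        (d.in_flight, speed, prefers_rank, d.weighted_load, d.name.clone())
    })
}
```

Because `in_flight` is the first key, a preferred but busier node loses to any idle one, and the planner's preference only decides between nodes that are otherwise equal.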
Idle timeouts, Transient vs ContentRetry, and the supervisor note
The anti-stall spine spans the trait boundary, so it's best taught as a round-trip. The watchdog lives in the dispatcher: a timeout around the next stream event, which resets on every agent event. That's the crucial choice — it's a no-progress window, not a wall-clock limit. Killing on wall-clock murders honest slow work (a local model can legitimately take 885 seconds); the failure you actually want to catch is a stream that has gone dead. The default is 900 seconds of silence; zero disables it.
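The no-progress window can be sketched with a standard-library channel standing in for the agent's event stream. The function name and the use of `std::sync::mpsc` are assumptions for illustration (the real dispatcher is presumably async); the point is that the timer restarts on every received event.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

// Collect events until the stream closes, failing only if it goes silent
// for longer than `idle`. A zero `idle` disables the watchdog.
fn drain_with_idle_timeout(
    events: Receiver<String>,
    idle: Duration,
) -> Result<Vec<String>, String> {
    let mut seen = Vec::new();
    loop {
        // Each call starts a fresh window, so the timer resets on every event.
        let next = if idle.is_zero() {
            events.recv().map_err(|_| RecvTimeoutError::Disconnected)
        } else {
            events.recv_timeout(idle)
        };
        match next {
            Ok(event) => seen.push(event),
            Err(RecvTimeoutError::Disconnected) => return Ok(seen),
            Err(RecvTimeoutError::Timeout) => {
                return Err(format!("stalled: no event for {:?}", idle));
            }
        }
    }
}
```

A worker that emits one event every 899 seconds would run indefinitely under a 900-second window, which is the intended behavior: slow is fine, silent is not.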
On a trip, the error carries the word "stalled," which the classifier treats — alongside "model is unloaded," "server error," and connection failures — as a transient error. The scheduler re-queues the task at fan-out priority and records the failing node so the next attempt steers away from it. But only one class of retry threads a hint forward:
```text
Transient    → blind re-roll onto a different device (no hint)
ContentRetry → re-dispatch with the exact error as a SUPERVISOR NOTE
```
The distinction matters because a stale content note ("you wrote nothing") on a "model unloaded" infra retry would actively mislead the worker. When a content hint does apply, the dispatcher prepends it verbatim:
```text
SUPERVISOR NOTE — your previous attempt was stopped: {hint}
```
There are two content-retry sources: the hallucinated-completion guard (the worker called "done" but wrote nothing → "your very first action must be to write each of these files") and the opt-in done gate (an owned .py file won't parse → the exact syntax error is handed back).
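Put together, the round-trip looks roughly like this. The marker strings "stalled", "model is unloaded", and "server error" come from the text above; the "connection" marker, the `Retry` enum shape, and both function names are assumptions made for the sketch.

```rust
#[derive(Debug, PartialEq)]
enum Retry {
    Transient,
    ContentRetry { hint: String },
}

// Infra-flavored errors are transient: re-roll onto another device, no hint.
fn is_transient(error: &str) -> bool {
    let e = error.to_lowercase();
    // "connection" is an assumed stand-in for "connection failures".
    ["stalled", "model is unloaded", "server error", "connection"]
        .iter()
        .any(|marker| e.contains(marker))
}

// Only a content retry threads a hint into the next attempt's prompt.
fn next_prompt(base: &str, retry: &Retry) -> String {
    match retry {
        Retry::Transient => base.to_string(),
        Retry::ContentRetry { hint } => format!(
            "SUPERVISOR NOTE — your previous attempt was stopped: {hint}\n\n{base}"
        ),
    }
}
```

Keeping the hint inside the `ContentRetry` variant means a transient retry has nowhere to carry a stale note, so the misleading combination cannot be constructed.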
Parallel planning: best-of-N skeleton, a pure-Rust scorer, fleet detailing
Planning is three phases with a deliberately model-free merge, and the shape follows from one fact: the 27B is the bottleneck and low-quant workers can't reliably produce a structured DAG. So planning is one fast structural draft on the smart model, fleet-parallel prose detailing, and a pure-Rust merge — determinism buys reproducibility and avoids a second slow model call.
Phase one drafts several skeletons in parallel across the planner plus the worker models. The system prompt forces plan-only output: roughly two-to-three times the worker count in cohesive subtasks (not one per function — micro-tasks serialize badly on minutes-per-subtask models), dependency depth at most two, non-overlapping files, one layout, and a mandatory final integrate-verify sink. Each draft is wrapped in a 480-second wall-clock timeout — necessary precisely because the planner watchdog is idle-based and a runaway generation can stream for twenty minutes without ever going idle.
Phase two is the scorer, and it's pure Rust so scorer and executor can never disagree — it validates each candidate through the same DAG loader the live path uses, then ranks:
| Term | Direction | What it optimizes |
| --- | --- | --- |
| independent width | reward (up to fleet size) | parallelism on the first wave |
| depth > 2 | penalty | shorter critical path |
| file overlap | penalty | fewer two-writers-one-file conflicts |
| max fan-in | penalty | no single chokepoint everything waits on |
| size fit | reward | cohesive, not micro or mega tasks |
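As a rough sketch of how such a scorer might combine the first four terms: the unit weights, the `Subtask` shape (dependencies as indices of earlier subtasks), and the omission of the size-fit term are all assumptions here, not the shipped formula.

```rust
use std::collections::HashMap;

// Assumed shape: `deps` holds indices of earlier subtasks in the same plan.
struct Subtask {
    deps: Vec<usize>,
    files: Vec<&'static str>,
}

struct Metrics {
    width: usize,      // subtasks with no dependencies
    depth: usize,      // edges on the longest dependency chain
    overlap: usize,    // files owned by more than one subtask
    max_fan_in: usize, // largest dependency count on any subtask
}

fn metrics(plan: &[Subtask]) -> Metrics {
    let mut depth = vec![0usize; plan.len()];
    let mut owners: HashMap<&str, usize> = HashMap::new();
    for (i, task) in plan.iter().enumerate() {
        let d = task.deps.iter().map(|&p| depth[p] + 1).max().unwrap_or(0);
        depth[i] = d;
        for file in &task.files {
            *owners.entry(*file).or_insert(0) += 1;
        }
    }
    Metrics {
        width: plan.iter().filter(|t| t.deps.is_empty()).count(),
        depth: depth.iter().copied().max().unwrap_or(0),
        overlap: owners.values().filter(|&&n| n > 1).count(),
        max_fan_in: plan.iter().map(|t| t.deps.len()).max().unwrap_or(0),
    }
}

// Illustrative unit weights; the size-fit term is left out of this sketch.
fn score(plan: &[Subtask], fleet: usize) -> i64 {
    let m = metrics(plan);
    let width_reward = m.width.min(fleet) as i64; // capped at fleet size
    let depth_penalty = m.depth.saturating_sub(2) as i64;
    width_reward - depth_penalty - m.overlap as i64 - m.max_fan_in as i64
}
```

Under these weights a wide plan with one sink outranks a four-step chain that shares a file, which matches the direction of every row in the table.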
Phase three expands each one-line brief into a roughly 150-word spec, one call per device, and hands each detailer the subtask's exact owned filenames with an instruction to use them verbatim — because a detailer that invents formula_parser.py when the skeleton said parser.py makes the worker write the wrong file and fail its owned-file check on every attempt. (That exact filename drift once cascaded through the dependency graph and tanked five of seven subtasks.) With N=1 the whole path is byte-identical to the old single-planner behavior.
Plan confidence, the ASK gate, and dynamic replan
Two confidence signals gate whether the swarm asks the human before committing. The trustworthy one is a pure-Rust self-consistency score across the N drafts: how much the subtask counts, the owned-file sets (mean pairwise Jaccard), and the independent-task counts agree. The other is a verbalized self-rating from the planner, which is systematically overconfident, so the two are blended with a 0.7 weight on self-consistency.
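The owned-file part of the self-consistency score, and the blend, can be sketched as below. The 0.7 weight is from the text; giving the remaining 0.3 to the verbalized rating, returning 0.0 when there is only one draft, and the function names are assumptions.

```rust
use std::collections::HashSet;

// Jaccard similarity of two owned-file sets: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &HashSet<&'static str>, b: &HashSet<&'static str>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 1.0; // two empty sets agree trivially
    }
    a.intersection(b).count() as f64 / union as f64
}

// Mean over every unordered pair of drafts.
fn mean_pairwise_jaccard(drafts: &[HashSet<&'static str>]) -> f64 {
    let mut total = 0.0;
    let mut pairs = 0u32;
    for i in 0..drafts.len() {
        for j in (i + 1)..drafts.len() {
            total += jaccard(&drafts[i], &drafts[j]);
            pairs += 1;
        }
    }
    // Assumption: a solo draft has no agreement signal, so score it 0.0.
    if pairs == 0 { 0.0 } else { total / pairs as f64 }
}

// Assumed blend: 0.7 on self-consistency, the remaining 0.3 on the self-rating.
fn blended_confidence(self_consistency: f64, verbalized: f64) -> f64 {
    0.7 * self_consistency + 0.3 * verbalized
}
```

For three drafts where two agree exactly and the third shares one file of three, the pairwise values are 1, 1/3, and 1/3, for a mean of 5/9.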
Below a floor, the ASK gate has the swarm generate genuine user-decidable clarifying questions rather than commit to a low-confidence decomposition — and it forces parallel planning on, because a solo plan produces no agreement signal to measure. The floor is strength-scaled: a marker like 35b-a3b is parsed as a mixture-of-experts model exposing only about 3B active parameters, so it's treated as weak and asks sooner. In the autonomous harness this runs as a file handshake — the swarm writes its questions to .swarm/clarify-questions.json, emits a low-confidence event, and polls for an answers file — which is how the harness answers "as the human" without blocking.
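The strength parsing might look something like this sketch. The token rules (split on `-`, `_`, or `:`; an `a<N>b` token means N billion active parameters; otherwise the first `<N>b` token is the size), the cutoff parameter, and treating an unparseable name as weak are all assumptions made for illustration.

```rust
// Parse a token like "27b" or "3b" into billions of parameters.
fn parse_b(token: &str) -> Option<f64> {
    let digits = token.strip_suffix('b')?;
    if !digits.starts_with(|c: char| c.is_ascii_digit()) {
        return None;
    }
    digits.parse::<f64>().ok()
}

// Effective size in billions: the active count for a mixture-of-experts
// marker such as "35b-a3b", otherwise the first plain size token.
fn active_params_b(model: &str) -> Option<f64> {
    let lower = model.to_lowercase();
    let mut total = None;
    for token in lower.split(|c: char| c == '-' || c == '_' || c == ':') {
        if let Some(active) = token.strip_prefix('a').and_then(parse_b) {
            return Some(active);
        }
        if total.is_none() {
            total = parse_b(token);
        }
    }
    total
}

// Hypothetical cutoff; an unknown size is conservatively treated as weak.
fn is_weak(model: &str, cutoff_b: f64) -> bool {
    active_params_b(model).map_or(true, |b| b < cutoff_b)
}
```

Under this reading a 35B mixture-of-experts planner with 3B active parameters is scored as a 3B model, so it hits the ASK floor before a dense 27B would.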
Dynamic replan fills idle slots on the run's tail: when two-plus slots are free while a task is still in flight, the planner injects real extra work (tests, edge cases, hardening — never README or CI busywork), whose new ids may depend on done tasks but never failed ones, and whose failure is non-fatal. One honest result worth stating: re-planning after an ASK answer defaults off, because an A/B found it produced two equally-correct apps at a cost of about 15 minutes — flagged with its single-sample confound rather than asserted as a win.
CONTRACTS, stub-first, and the CLI-contract rules
Four mechanisms all fight the same number-one failure — parallel workers passing isolation tests while drifting on the shared interface — so they belong together.
Contracts fan one signature-stub call per module across the fleet before execution, freeze the bundle, and inject it into every worker prompt: exact type-annotated signatures, empty bodies, and a # SCHEMA block listing each table's exact column names and types. That # SCHEMA block is the specific fix for a real database drift where one module wrote fixtures(league, home, away) against a schema of fixtures(league_id, home_team, away_team).
Stub-first comes in two forms, both keyed on flipping "has written an owned file" true early so the worker is exempt from the over-read kill. For the entry file, the first write is a compiling skeleton with every subcommand registered and placeholder bodies, run once to confirm imports, then each handler is filled. For non-entry multi-file owners, the same skeleton-first rule applies — the fix for a "list, tree, find, cat, all written at the end, none written in time" cascade.
The CLI-contract rule freezes the shape of the entry point for its worker: nested stays nested, global flags stay global, positional-versus-flag preserved exactly, no option renaming (--from/--to, not --source/--dest). It carries the keyword rule too: add_parser("import") must stay a verbatim string and never become import_, because argparse subcommand names are strings, not Python identifiers.
| Gate | What it injects | The failure it fixed |
| --- | --- | --- |
| contracts | frozen signatures + # SCHEMA block | cross-module signature + DB-column drift |
| skeleton-first | write a compiling entry skeleton first | over-read kill on a big multi-command entry |
| multifile-stub | stub every owned file first | multi-file owner writing nothing in time |
| cli-contract | frozen CLI shape + keyword rule | flat-vs-nested drift; import → import_ |
The judge: verdicts, thresholds, and salvage
The in-flight judge is the "stop it flailing / keep the verdict honest" core. Its verdict type is Ok | OverReading | Looping | BrokenCode | SpecDrift | Split. The verdicts that are allowed to kill come from a model-free function that runs in priority order and is trusted even before the LLM judge, because "code that won't compile, and a worker that read a lot while writing nothing, are not judgment calls."
The behavioral over-reading check is the point of pride: it catches "explores forever, writes nothing" at around 150 seconds, minutes before any wall-clock fallback. The critical exemption is that all of this requires the task to own files — which shields the file-less integrate-verify sink, a task that legitimately reads the entire program and never writes. That exemption exists because the sink was, in an earlier version, judge-killed three times in a row, making the run report a working app as FAILED.
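A sketch of what a model-free check in that spirit could look like. The verdict names come from the text; the `Progress` fields, the check order, and the tool-call threshold are assumptions, and only the roughly 150-second figure is taken from the article.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Ok,
    OverReading,
    Looping,
    BrokenCode,
    SpecDrift,
    Split,
}

// Counters a dispatcher could collect without asking a model (assumed fields).
struct Progress {
    owns_files: bool,
    elapsed_secs: u64,
    tool_calls: u32,
    file_writes: u32,
    compiles: bool,
}

const OVER_READ_SECS: u64 = 150; // from the text: caught at around 150 seconds
const MIN_TOOL_CALLS: u32 = 20;  // hypothetical threshold

fn deterministic_verdict(p: &Progress) -> Verdict {
    // File-less tasks such as the integrate-verify sink are exempt.
    if !p.owns_files {
        return Verdict::Ok;
    }
    // Assumed priority: code that does not compile is checked first.
    if p.file_writes > 0 && !p.compiles {
        return Verdict::BrokenCode;
    }
    // "Explores forever, writes nothing": many tool calls, zero writes.
    if p.file_writes == 0 && p.elapsed_secs >= OVER_READ_SECS && p.tool_calls >= MIN_TOOL_CALLS {
        return Verdict::OverReading;
    }
    Verdict::Ok
}
```

The sink passes even with hundreds of reads and no writes, because the ownership test short-circuits before any counter is consulted.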
When a verdict does fire, the outcome logic guards on the attempt epoch, then requires the problem to clear a 0.85-confidence bar to be actionable — the LLM judge is itself a weak local model, so observe-only is the default. Actionable problems either re-dispatch (abort the worker, thread a hint forward, count an intervention) or fail (only once the intervention cap is exhausted and the final attempt ran long enough to be terminal). The cap is two, because a hard task often needs a second "simplify" round, and judge kills are excluded from the transient-exhaustion budget so a supervisory kill never burns real retry attempts.
Salvage is the elegant capstone. When a non-test task terminal-fails via the Looping verdict and a non-empty owned file exists on disk, it's marked done rather than failed — because Looping only fires after the file was written, so the worker did produce output. Failing it would cascade through the dependency graph and report a working app as FAILED; salvaging it lets the integrate-verify sink be the real gate.
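The outcome logic and the salvage rule fit in one decision function. The 0.85 bar and the cap of two are from the text; the struct shapes, the `Action` names, and the choice to observe when the cap is spent but the attempt has not run long enough are assumptions.

```rust
const CONFIDENCE_BAR: f64 = 0.85;
const INTERVENTION_CAP: u32 = 2;

#[derive(Debug, PartialEq)]
enum Action {
    IgnoreStale,
    Observe,
    Redispatch,
    Salvage,
    Fail,
}

struct Finding {
    attempt_epoch: u64,
    confidence: f64,
    looping: bool,
}

struct TaskView {
    current_epoch: u64,
    interventions: u32,
    ran_long_enough: bool,
    is_test_task: bool,
    nonempty_owned_file: bool,
}

fn decide(f: &Finding, t: &TaskView) -> Action {
    // A verdict about a superseded attempt must not touch the live one.
    if f.attempt_epoch != t.current_epoch {
        return Action::IgnoreStale;
    }
    // Below the bar the verdict is advisory only.
    if f.confidence < CONFIDENCE_BAR {
        return Action::Observe;
    }
    // Actionable: intervene until the cap is exhausted.
    if t.interventions < INTERVENTION_CAP {
        return Action::Redispatch;
    }
    if !t.ran_long_enough {
        return Action::Observe; // assumed: not yet terminal
    }
    // Salvage: a looping non-test task that left real output counts as done.
    if f.looping && !t.is_test_task && t.nonempty_owned_file {
        return Action::Salvage;
    }
    Action::Fail
}
```

Ordering the epoch check first means a late verdict on a killed attempt can never count as an intervention against its replacement.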
Only deterministic signals may kill; the LLM verdict is advisory and high-confidence-gated, because the judge is itself a weak local model.
The post-run gates: three smoke oracles, an AST reviewer, a pre-review
Three independent gates run after the scheduler completes, each with exactly one bounded corrective re-dispatch — the traceback is the instruction.
The smoke gate dispatches by language; the Python path runs three oracles. First, pytest --collect-only surfaces the cross-module import errors that isolation tests miss. Second — the newest step — pytest -q actually runs the generated suite, plugging the exact hole the compile-only judge and import-only smoke were both blind to (one app shipped a member-list crash green because --help never touched the crashing path). Third, python3 -m <pkg> --help must exit zero, and the absence of a runnable entry point is itself a finding — that's the built-but-unwired class. Every one of these is engineered to be inconclusive, never a red, on a missing tool or a timeout: the smoke runner uses null stdin, a hard timeout, and kill-on-drop so a produced daemon can't hang the finish line.
collect-only for import drift, pytest -q for runtime crashes, --help for entry wiring — each inconclusive on a missing tool, so the gate only reds on a real defect.
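The "inconclusive, never red" rule is easy to get wrong, so here is a standard-library sketch of one oracle runner. The `Smoke` enum and `run_oracle` are hypothetical, and this version kills the child explicitly on timeout instead of relying on the kill-on-drop behavior mentioned above.

```rust
use std::io::ErrorKind;
use std::process::{Command, Stdio};
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum Smoke {
    Pass,
    Fail(i32),
    Inconclusive(&'static str),
}

fn run_oracle(program: &str, args: &[&str], timeout: Duration) -> Smoke {
    let spawned = Command::new(program)
        .args(args)
        .stdin(Stdio::null()) // null stdin: a prompt can never block the gate
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn();
    let mut child = match spawned {
        Ok(child) => child,
        // A missing tool says nothing about the generated app.
        Err(e) if e.kind() == ErrorKind::NotFound => return Smoke::Inconclusive("missing tool"),
        Err(_) => return Smoke::Inconclusive("could not start"),
    };
    let deadline = Instant::now() + timeout;
    loop {
        match child.try_wait() {
            Ok(Some(status)) => {
                return match status.code() {
                    Some(0) => Smoke::Pass,
                    Some(code) => Smoke::Fail(code),
                    None => Smoke::Inconclusive("killed by signal"),
                };
            }
            Ok(None) if Instant::now() >= deadline => {
                // Hard timeout: kill and reap so a daemon cannot hang the finish line.
                let _ = child.kill();
                let _ = child.wait();
                return Smoke::Inconclusive("timeout");
            }
            Ok(None) => std::thread::sleep(Duration::from_millis(20)),
            Err(_) => return Smoke::Inconclusive("wait failed"),
        }
    }
}
```

Only a real non-zero exit code produces `Fail`; every environmental problem lands in `Inconclusive`, so a red always points at the generated program.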
The model-free AST reviewer walks the built tree and flags two things: a non-test logic module imported by nobody (resolving from pkg import mod so a real __main__ entry isn't false-flagged), and a function whose entire body is pass, ..., or raise NotImplementedError (skipping dunders, @overload, and Protocols). It subtracts a pre-execution snapshot so only new findings are chased. Worth correcting a common misconception: it does not check import-of-undefined-symbol — that's smoke's collect-only job, because a static drift check false-positives on re-exports and star-imports — and it has no subcommand-handler check.
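The "imported by nobody" half reduces to a set difference once an import map exists. This sketch assumes that map has already been extracted from the Python sources (the parsing itself, including the `from pkg import mod` resolution and the stub-body check, is not shown), and that test modules are recognized by a `test_` prefix.

```rust
use std::collections::{BTreeMap, BTreeSet};

// `imports` maps each module to the modules it imports (assumed pre-extracted).
fn unwired_modules(imports: &BTreeMap<&str, Vec<&str>>, entry: &str) -> Vec<String> {
    // Every module that appears on the right-hand side of some import.
    let imported: BTreeSet<&str> = imports.values().flatten().copied().collect();
    let mut unwired = Vec::new();
    for (module, _) in imports {
        let name: &str = module;
        let is_test = name.starts_with("test_"); // assumed naming convention
        if name != entry && !is_test && !imported.contains(name) {
            unwired.push(name.to_string());
        }
    }
    unwired
}
```

A module imported only by its own test file is still wired by this definition, so a stricter variant would exclude imports that originate in test modules.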
Then the two model-based reviewers: the in-flight semantic judge (the LLM half of the section above, which is where SpecDrift comes from — it catches a CLI that diverges from the spec), and an idle-node pre-review that correctness-reviews a completed subtask and persists its findings so they're injected into the integrate-verify prompt. That's how the sink confirms and fixes a defect rather than merely greening the suite — in one run it turned an entry with all eight handlers stubbed as NotImplementedError into eight working handlers.
swarm-gym: the self-driving test harness
Now the part that produced everything above: the test harness. First, hold the two LLM layers apart, because confusing them makes the whole thing incomprehensible. Layer one is the swarm under test — goose swarm, always the local Qwen fleet. Layer two is the harness brain — the AI that invents tasks, drives follow-up turns, grades, and proposes knob tweaks — which is Claude by default (an Opus judge, a Sonnet generator), flippable to local Qwen for a fully offline gym. swarm-gym tests layer one using layer two. It's also the gate we run after pulling upstream goose changes.
A Claude brain invents the task, drives it through the local fleet, collects the evidence, and grades on seven dimensions — then cluster reds auto-tune a knob while behavioral reds send a human to read the trace and ship a structural fix.
| Module | Role |
| --- | --- |
| orchestrator.py | the multi-turn "vibing" loop |
| generator.py | invents the opener and each follow-up move |
| runner.py | shells goose swarm run --output-format json |
| collector.py | assembles the evidence bundle |
| verifier/ | the grading stack |
| brain/ | the pluggable Claude-or-Qwen transport |
| tweaker.py | proposes knob deltas, scope-guarded |
| ledger.py + report.py | append-only history and the HTML report |
It exists because weak local models fail behaviorally — narrating instead of calling tools, faking green suites, leaving stubs and scratch-file litter, over-reading into paralysis — and only realistic pressure surfaces those modes. A companion operator log pairs each red observation with the local-edition commit it drove.
How one episode runs, and how it's graded
An episode picks one of three archetypes, each stressing a different surface. Heavy-spec is a dense, fully-specified plan with many machine-checkable deterministic checks. Minimal-spec is a terse one-liner plus hidden requirements the swarm never sees — testing gap-filling judgment and MCP tool use. Continue-existing is an amendment on an evolving codebase, where the substrate is a previously-kept green app (falling back to heavy-spec if nothing green exists).
A session seeds a persona — demanding PM, vague user, perfectionist, feature-hungry, pragmatic — then runs open → collect → verify → next-move for up to six turns, where each move is a feature, fix, test pass, refactor, MCP feature, or direction change. One robustness detail is quietly important: if the swarm is killed and never prints its final JSON report, the collector rebuilds per-task data from the .swarm event log, so verification still works on a run that exited with a signal.
The grade has seven dimensions, and only three come from the AI judge:
| Dimension | Graded by | What it checks |
| --- | --- | --- |
| … | … | zero-tool-call ("narrated instead of acting"), stub/TODO smells |
| requirements | AI judge | met against visible and hidden requirements |
| code quality | AI judge | 1–5 review |
| bugs | AI judge | with file:line |
The key subtlety: the AI judge is a static reviewer — it never runs anything and never even sees the raw session traces. Execution is delegated entirely to the deterministic "checks" layer, and the traces are collected for the human operator. That separation is the whole "judge by running" philosophy encoded in the harness itself.
"It runs" and "tests pass" are not "it is correct." The AI judge can be fooled by a green suite; the deterministic checks layer builds the app and runs it end-to-end on real input. Four of the seven grading dimensions never ask a model anything — and those four are the ones that catch a fleet reporting PASS on a program that crashes.
Feedback then flows two ways. Cluster and distribution reds trigger a bounded, scope-guarded automatic knob-tweak A/B — but the guard rejects anything touching upstream core, so the harness can only ever adjust pool weights, the planner model, and the context cap. Behavioral and code reds get no auto-fix: the harness surfaces the hints and the traces, the operator reads the actual trace, and a structural fix is shipped to local-edition by hand.
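The scope guard amounts to an allowlist. In this sketch the three knob identifiers are made-up names for the pool weights, planner model, and context cap mentioned above, and `guard_tweak` is hypothetical.

```rust
// Assumed identifiers for the only knobs the harness may adjust.
const TUNABLE: [&str; 3] = ["pool_weights", "planner_model", "context_cap"];

// Reject a proposed tweak if it touches anything outside the allowlist.
fn guard_tweak(knobs: &[&str]) -> Result<(), String> {
    match knobs.iter().find(|k| !TUNABLE.contains(*k)) {
        Some(bad) => Err(format!("rejected: '{bad}' is outside the tunable scope")),
        None => Ok(()),
    }
}
```

An allowlist fails closed: a knob added to upstream core later is rejected by default until someone deliberately lists it.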
What the harness revealed
Every gate in this teardown exists because a benchmarked app broke without it. Here is the taxonomy, with the real apps that produced it:
| Failure class | Example app | The fix it drove |
| --- | --- | --- |
| lone-node stall | (various) | idle re-route + transient retry |
| contract drift, hidden by isolation tests | fsdrift: snapshot writes an ISO timestamp, diff parses a float; 45 tests pass, pipeline crashes | contracts + integrate-verify |
| built-but-unwired | byte-oracle: a dead detector module; running, tests, and a human all missed it | the model-free AST reviewer |
| no end-to-end run | a scheduler where add prints "Added" but list shows nothing (44 tests pass in isolation) | integrate-verify + smoke's pytest -q |
| detailer filename drift | a spreadsheet where the detailer invented formula_parser.py vs the skeleton's parser.py | thread exact owned filenames into detailing |
| DB-schema drift | a league app: fixtures(league, round, home, away) vs the schema's _id/_team/_num columns | the # SCHEMA contract block |
| cross-module data-shape mismatch | a contacts app: a formatter expects list-of-dicts, the CLI passes strings | still open: signatures aren't shapes |
The meta-lesson that connects all of them is the doctrine from the top: "it runs" and "tests pass" are not "it is correct." Run status lied in both directions — a false green where 135 tests passed against a stray root cli.py, and false negatives where working apps were reported FAILED by a finalize-spin kill. Every genuine win was confirmed by running the app.
Phase payoffs, the ceiling that moved, and staying honest
Each gate has to earn its wall-clock, so it was measured, not asserted:
| Phase | Without it | With it |
| --- | --- | --- |
| CONTRACTS | a spreadsheet failed 2/0/5/2; a scheduler shipped unwired | a three-for-three sweep across three distinct draw classes |
| AST reviewer | built-but-unwired shipped past running + tests + human | the single highest-signal gate for that class |
| integrate-verify | broken entries shipped green | backstopped runtime bugs three times over |
| CLI-contract freeze | flat-vs-nested drift failed the build | drift-class apps became compliant builds |
| skeleton-first | - | an honest wash on simple apps, a win on complex entries |
And the headline result, stated with its evidence. A three-win streak broke — by design — at a contacts app that combined two entities, multi-format output, a JSON round-trip, revenue aggregation, and a ten-command surface. Reading the trace showed genuine weak-model coding errors at that combined complexity, not a mechanism gap, so no fix was built — the discipline being that you don't overbuild on one partial from an unconfirmed cause. Calibration runs then proved each hard dimension worked individually, and finally two later apps — in two different domains — built that exact four-dimension combination cleanly, with zero non-ok verdicts. That's the cumulative-overload ceiling moving, N=2. One of them even closed a complete find-fix-validate loop: the import → import_ bug found in the first was fixed by a shipped CLI-contract rule and confirmed live in the second.
The honest residual limits are worth stating as clearly as the wins. Recursive-algorithm cores can defeat the 27B on every attempt, though the same model handles a topological sort or graph coloring correctly when the problem is scoped to a dedicated module. Weak-model self-repair of a hard runtime bug is unreliable. Cross-module data-shape consistency at cumulative complexity is the real open frontier — a data-shape contract was deliberately parked as low-confidence rather than shipped on a promise. Speculative execution was fully built but stays off, because its cwd-shadow "jail" isn't a real sandbox and the true fix lives in upstream core, out of scope. And the one systemic non-correctness gap is speed: Python apps run 40-plus minutes against a 15-to-25-minute goal.
The through-line is the measurement discipline that made the numbers trustworthy in the first place: judge by running, not by run status; grade after the run finishes, never mid-run (a stale mid-run check once retroactively invented a bug the run's own later phases had already fixed); capture the real exit code, not a pipe's; and read the actual session trace before blaming the model — the "over-eager judge" theory was overturned by reading the killed attempts and finding real flailing, which is the only reason a correct gate wasn't wrongly disarmed. Build the gate because an app broke without it; keep the gate because the app works with it; and never trust a green checkmark you didn't earn by running the thing.