LeanZero - Atlassian & AI Expertise, Made Personal

Trait	Core call-site	Implemented in goose-cli as
TaskDispatcher	the executor spawn	a real goose agent conversation on one device
Judge / PreReviewer	idle-capacity ticks	deterministic checks + an LLM verdict
Replanner	idle-fill	a planner call for bonus subtasks
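
One way to read the table: the core only ever calls a trait, so a test can swap the real goose agent conversation for a stub. The sketch below models only TaskDispatcher. The method name `dispatch`, the `Task` and `Outcome` types, `RecordingDispatcher`, and `run_ready` are hypothetical names chosen for illustration, not goose-swarm's actual signatures.

```rust
/// Hypothetical task record; the real type is not shown in the article.
struct Task {
    id: u32,
}

/// Hypothetical result of running one task.
#[derive(Debug, PartialEq)]
enum Outcome {
    Done,
    Failed,
}

/// The seam the core calls when it spawns an executor (assumed signature).
trait TaskDispatcher {
    fn dispatch(&mut self, task: &Task, device: &str) -> Outcome;
}

/// Test double: records each dispatch instead of starting an agent conversation.
struct RecordingDispatcher {
    seen: Vec<(u32, String)>,
}

impl TaskDispatcher for RecordingDispatcher {
    fn dispatch(&mut self, task: &Task, device: &str) -> Outcome {
        self.seen.push((task.id, device.to_string()));
        Outcome::Done
    }
}

/// Stand-in for the core loop: dispatch every ready task to one device and
/// count how many finished. It depends only on the trait, never on goose-cli.
fn run_ready<D: TaskDispatcher>(dispatcher: &mut D, ready: &[Task], device: &str) -> usize {
    let mut done = 0;
    for task in ready {
        if dispatcher.dispatch(task, device) == Outcome::Done {
            done += 1;
        }
    }
    done
}
```

Because `run_ready` is generic over the trait, the same loop can be driven by the recording stub in a unit test and by a real dispatcher in production.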


ready-set (Rust)

use std::cmp::Ordering;

// higher fan-out pops first; ties break to the smallest id (deterministic)
impl Ord for Ranked {
    fn cmp(&self, other: &Self) -> Ordering {
        self.fan_out.cmp(&other.fan_out)
            .then_with(|| other.id.cmp(&self.id))
    }
}
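
To see the ordering in action, here is a minimal sketch that puts `Ranked` into the standard library's max-heap. The field types (`usize`, `u32`) and the `pop_order` helper are assumptions made for the example. The `PartialOrd`, `PartialEq`, and `Eq` implementations are included only because `Ord` requires them.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// A ready task as the heap sees it (field types are assumptions).
#[derive(Debug, PartialEq, Eq)]
struct Ranked {
    fan_out: usize,
    id: u32,
}

impl Ord for Ranked {
    fn cmp(&self, other: &Self) -> Ordering {
        // Larger fan-out compares greater; on a tie the smaller id compares
        // greater, so a max-heap pops it first.
        self.fan_out
            .cmp(&other.fan_out)
            .then_with(|| other.id.cmp(&self.id))
    }
}

impl PartialOrd for Ranked {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Hypothetical helper: drain a max-heap and return ids in pop order.
fn pop_order(tasks: Vec<Ranked>) -> Vec<u32> {
    let mut heap: BinaryHeap<Ranked> = tasks.into_iter().collect();
    let mut order = Vec::new();
    while let Some(task) = heap.pop() {
        order.push(task.id);
    }
    order
}
```

Given tasks with (fan_out, id) of (1, 7), (3, 9), (3, 2), and (0, 1), `pop_order` returns `[2, 9, 7, 1]`: the two fan-out-3 tasks come first, with id 2 ahead of id 9.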

Term	Direction	What it optimizes
independent width	reward (up to fleet size)	parallelism on the first wave
depth > 2	penalty	shorter critical path
file overlap	penalty	fewer two-writers-one-file conflicts
max fan-in	penalty	no single chokepoint everything waits on
size fit	reward	cohesive, not micro or mega tasks
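
The five terms can be combined into a single score, as in the sketch below. The reward and penalty directions come from the table. The weights, the `PlanShape` field names, the linear form, and the `best_of_n` helper are all assumptions made for illustration, not the project's actual scorer.

```rust
/// Structural features of one candidate skeleton (hypothetical field names).
struct PlanShape {
    independent_width: usize, // tasks with no dependencies (first wave)
    depth: usize,             // length of the longest dependency chain
    file_overlaps: usize,     // pairs of tasks that write the same file
    max_fan_in: usize,        // most dependencies held by any single task
    well_sized_tasks: usize,  // tasks that are neither micro nor mega
    total_tasks: usize,
}

// Illustrative weights only; the real values are not given in the article.
const W_WIDTH: f64 = 1.0;
const W_DEPTH: f64 = 1.0;
const W_OVERLAP: f64 = 2.0;
const W_FAN_IN: f64 = 0.5;
const W_SIZE: f64 = 1.0;

/// Higher is better. Width is rewarded only up to the fleet size, and depth
/// is penalized only for the part beyond 2.
fn score(plan: &PlanShape, fleet_size: usize) -> f64 {
    let width = plan.independent_width.min(fleet_size) as f64;
    let extra_depth = plan.depth.saturating_sub(2) as f64;
    let size_fit = if plan.total_tasks == 0 {
        0.0
    } else {
        plan.well_sized_tasks as f64 / plan.total_tasks as f64
    };
    W_WIDTH * width
        - W_DEPTH * extra_depth
        - W_OVERLAP * plan.file_overlaps as f64
        - W_FAN_IN * plan.max_fan_in as f64
        + W_SIZE * size_fit
}

/// Index of the best-scoring candidate; ties keep the earliest candidate.
fn best_of_n(plans: &[PlanShape], fleet_size: usize) -> Option<usize> {
    let mut best: Option<(usize, f64)> = None;
    for (i, plan) in plans.iter().enumerate() {
        let s = score(plan, fleet_size);
        if best.map_or(true, |(_, b)| s > b) {
            best = Some((i, s));
        }
    }
    best.map(|(i, _)| i)
}
```

Keeping the scorer a pure function of the plan's shape means candidates can be ranked without another model call, and the same input always ranks the same way.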


Gate	What it injects	The failure it fixed
contracts	frozen signatures + # SCHEMA block	cross-module signature + DB-column drift
skeleton-first	write a compiling entry skeleton first	over-read kill on a big multi-command entry
multifile-stub	stub every owned file first	multi-file owner writing nothing in time
cli-contract	frozen CLI shape + keyword rule	flat-vs-nested drift; import → import_


Deterministic verdict	Confidence	Trigger
BrokenCode	1.0	an owned file won't parse
OverReading (behavioral)	0.9	owns files, nothing written, ≥ 90s elapsed, ≥ 16 tool calls (~150s)
OverReading (time fallback)	—	420s
Looping / finalize-spin	0.9	owned file written but untouched ≥ 420s
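
The triggers in the table can be written as a small rule function that needs no model call. The thresholds and confidences below are the ones listed in the table. The `TaskSnapshot` fields, the rule order, and the reading that the 420s time fallback applies only while nothing has been written are assumptions of this sketch.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    BrokenCode,
    OverReading,
    Looping,
}

/// What can be observed about a running task without calling a model.
/// All field names are hypothetical.
struct TaskSnapshot {
    owns_files: bool,
    owned_file_fails_to_parse: bool,
    files_written: usize,
    elapsed_secs: u64,
    tool_calls: u32,
    secs_since_last_write: Option<u64>,
}

const BEHAVIORAL_MIN_SECS: u64 = 90;
const BEHAVIORAL_MIN_TOOL_CALLS: u32 = 16;
const TIME_FALLBACK_SECS: u64 = 420;
const UNTOUCHED_SECS: u64 = 420;

/// Returns the verdict and its confidence, or None when no rule fires.
/// The time fallback carries no confidence in the table, hence Option<f32>.
fn deterministic_verdict(t: &TaskSnapshot) -> Option<(Verdict, Option<f32>)> {
    if t.owned_file_fails_to_parse {
        return Some((Verdict::BrokenCode, Some(1.0)));
    }
    let nothing_written = t.owns_files && t.files_written == 0;
    if nothing_written
        && t.elapsed_secs >= BEHAVIORAL_MIN_SECS
        && t.tool_calls >= BEHAVIORAL_MIN_TOOL_CALLS
    {
        return Some((Verdict::OverReading, Some(0.9)));
    }
    if nothing_written && t.elapsed_secs >= TIME_FALLBACK_SECS {
        return Some((Verdict::OverReading, None));
    }
    match t.secs_since_last_write {
        // A file was written, then left untouched for too long.
        Some(idle) if t.files_written > 0 && idle >= UNTOUCHED_SECS => {
            Some((Verdict::Looping, Some(0.9)))
        }
        _ => None,
    }
}
```

Under these assumptions, a task that owns files, has written nothing after 120s, and has made 20 tool calls is flagged as OverReading at 0.9, well before the 420s fallback would fire.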


Module	Role
orchestrator.py	the multi-turn "vibing" loop
generator.py	invents the opener and each follow-up move
runner.py	shells goose swarm run --output-format json
collector.py	assembles the evidence bundle
verifier/	the grading stack
brain/	the pluggable Claude-or-Qwen transport
tweaker.py	proposes knob deltas, scope-guarded
ledger.py + report.py	append-only history and the HTML report


Dimension	Source	Catches
swarm	deterministic	exit code + failed tasks
checks	deterministic	judge-by-running: build, --help, pytest -q, end-to-end pipe
cluster	deterministic	starved device, retries, MCP wiring
diagnostics	deterministic	zero-tool-call ("narrated instead of acting"), stub/TODO smells
requirements	AI judge	met against visible and hidden requirements
code quality	AI judge	1–5 review
bugs	AI judge	with file:line


Failure class	Example app	The fix it drove
lone-node stall	(various)	idle re-route + transient retry
contract drift, hidden by isolation tests	fsdrift — snapshot writes an ISO timestamp, diff parses a float; 45 tests pass, pipeline crashes	contracts + integrate-verify
built-but-unwired	byte-oracle — a dead detector module; running, tests, and a human all missed it	the model-free AST reviewer
no end-to-end run	a scheduler where add prints "Added" but list shows nothing (44 tests pass in isolation)	integrate-verify + smoke's pytest -q
detailer filename drift	a spreadsheet where the detailer invented formula_parser.py vs the skeleton's parser.py	thread exact owned filenames into detailing
DB-schema drift	a league app: fixtures(league, round, home, away) vs the schema's _id/_team/_num columns	the # SCHEMA contract block
cross-module data-shape mismatch	a contacts app: a formatter expects list-of-dicts, the CLI passes strings	still open — signatures aren't shapes


Phase	Without it	With it
CONTRACTS	a spreadsheet failed 2/0/5/2; a scheduler shipped unwired	a three-for-three sweep across three distinct draw classes
AST reviewer	built-but-unwired shipped past running + tests + human	the single highest-signal gate for that class
integrate-verify	broken entries shipped green	backstopped runtime bugs three times over
CLI-contract freeze	flat-vs-nested drift failed the build	drift-class apps became compliant builds
skeleton-first	—	an honest wash on simple apps, a win on complex entries


Inside goose-swarm: How We Turned One Local Model Into a Self-Verifying Fleet

Key takeaways


Inside goose-swarm

The architecture: three traits, and why it's testable

The scheduler loop: one owner, a fan-out heap

pick_device: work-stealing first, then speed

Idle timeouts, Transient vs ContentRetry, and the supervisor note

Parallel planning: best-of-N skeleton, a pure-Rust scorer, fleet detailing

Plan confidence, the ASK gate, and dynamic replan

CONTRACTS, stub-first, and the CLI-contract rules

The judge: verdicts, thresholds, and salvage

The post-run gates: three smoke oracles, an AST reviewer, a pre-review

swarm-gym: the self-driving test harness

How one episode runs, and how it's graded

What the harness revealed

Phase payoffs, the ceiling that moved, and staying honest