Fixes the large-PR token burn: a ~250K-token diff was re-sent every agent step across models × lenses × passes, draining a metered usage block in minutes. Small PRs are untouched (every mitigation is size-gated / no-op under threshold). - Re-platform the in-process review path onto executus run.Executor: context compaction (executus/compact, threshold from the model's real context window via executus/model), run-bounding, a per-PR budget gate (Ports.Budget), and the wrap-up nudge re-expressed as a run.Critic. Lens fan-out now uses executus/fanout. gadfly keeps its own model.go, so GADFLY_ENDPOINT_<NAME> aliases and the claude-code engine are unaffected. No majordomo bump; the binary stays static (executus core is majordomo+stdlib only). - Paginate get_diff (per-file `path` + start_line/limit) instead of dumping the whole diff; trim the recheck diff embed (60k -> 20k chars). - entrypoint.sh: downshift the fleet above GADFLY_HUGE_DIFF_BYTES (one cheap model, fewer lenses/steps, no recheck) + a swarm-wide GADFLY_PR_BUDGET_SECS wall-clock backstop (adds procps for pkill). All advisory; CI never fails. - README + CLAUDE.md + tests updated. Note: run.Result exposes no transcript, so the old transcript-based forced- finalization fallback is dropped; the wrap-up critic nudge is the remaining "always emit something" mechanism. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
12 KiB
Gadfly — Developer Guide
Gadfly (🪰) is an agentic adversarial code reviewer that runs in Gitea Actions. On a pull request it reads the checked-out repository with read-only tools, hunts for real problems, verifies each one against the actual code, and posts its findings as a comment. It is advisory only — it never blocks a merge.
This is a public, vibe-coded project (built largely by an AI agent). Keep that framing honest in the README; don't oversell it.
Project goals (keep changes aligned to these)
- Find real problems, not nits. The whole point of the agentic tools + two-pass recheck is to kill diff-only false positives. Anything that raises the false-positive rate (or removes verification) works against the project.
- Advisory, never blocking. Gadfly must never fail a CI job for review content, never merge, never deploy. Non-zero exit only on usage/config errors; even then run.sh posts a notice rather than failing. Do not add it to branch-protection required checks.
- Easy to turn on for any repo. Consumers should need only a ~15-line stub workflow + a
couple of secrets/vars. All real logic lives in the image (
entrypoint.sh), not in the consumer's YAML (Gitea's act_runner has weak YAML expression support). - Provider-agnostic. Powered by majordomo, so it can target Ollama (local/cloud), OpenAI, Anthropic, Google, or any OpenAI/Ollama-compatible endpoint. Don't re-hardcode a single provider.
- Portable & self-contained.
cmd/gadflydepends only on the Go stdlib, majordomo, and executus (whose core —run/compact/model/fanout/tool— is itself majordomo+stdlib only, so the binary stays static; do NOT pull executus'scontrib/storeor any battery that drags in a DB driver). No heavyweight deps, no coupling to any one consumer repo (e.g. mort). Gadfly is executus's canonical light consumer.
Architecture
cmd/gadfly/ the reviewer binary — pure producer of review markdown (stdout)
main.go orchestration: fan specialists out (executus/fanout), each a review pass + recheck
engine.go reviewEngine abstraction: executus run.Executor (majordomo agent loop +
compaction/bounding/budget/critic) vs claude-code CLI shell-out
executus.go executus wiring: tool.Registry over the repo tools, the run.Executor build
(compact + model context-limit threshold + per-PR budget + wrap-up critic)
specialists.go specialist lenses: built-ins, default suite, env + .gadfly.yml resolution
auto.go dynamic `auto` selection: a selector model picks lenses per-diff (may invent)
delegate.go worker-tier delegate_investigation tool (cheap sub-agent does legwork)
consolidate.go verdict parsing + one-comment consolidation (a section per specialist)
model.go provider/model + selector + worker resolution (majordomo.Parse) + endpoint aliases
tools.go the 5 read-only repo tools (read_file/list_dir/grep/find_files/get_diff)
recheck.go second-pass verification prompt + verdict recompute
*_test.go sandbox, recheck, wrap-up, spec/endpoint-parse, specialist-resolution tests
scripts/run.sh fetch PR diff+meta, run the binary, upsert ONE labeled PR comment
scripts/status-board.sh render+upsert ONE live status-board comment (per-model/per-lens progress)
scripts/system-prompt.txt the reviewer persona + verification discipline (generic, not repo-specific)
entrypoint.sh container brains: trigger gating, PR clone, model loop (the logic that
used to live in workflow YAML)
Dockerfile multi-stage; private-module creds via BuildKit secrets never reach the final image
.gitea/workflows/build-image.yml push main → :latest; tag v* → :<tag>+:latest; PR → build-only
.gitea/workflows/review-reusable.yml reusable (workflow_call) review job; resolves swarm config at
RUNTIME: consumer `with:` input → owner user-scope var (GADFLY_DEFAULT_MODELS /
_SPECIALISTS / _PROVIDER_CONCURRENCY / _PROVIDER_LENS_CONCURRENCY, +
GADFLY_ENDPOINT_RAGNAROS) → image default. Vars are injected per-run, so editing
one var retunes the whole fleet even though long-lived act_runners CACHE this file
by ref (a moved tag is NOT re-fetched — only a runtime value or a fresh @<sha>
bypasses the cache). Consumers subscribe with an ~8-line caller forwarding only the
secrets the reviewer needs and pinned to an immutable @<sha> (Phase 4);
gadfly's own adversarial-review.yml is a thin caller of it (dogfoods the path).
examples/ copy-paste consumer stub workflows for different providers
Data flow: consumer stub workflow → container entrypoint.sh (gate + clone) →
scripts/run.sh (per model) → cmd/gadfly binary (agentic review) → markdown → run.sh
upserts a PR comment as gitea-actions.
Two passes: a review pass drafts findings; an adversarial recheck pass independently
re-verifies each finding against the code and drops the unconfirmed ones, recomputing the
verdict. Verdict is one of: No material issues found / Minor issues / Blocking issues found.
Build / test
go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
go test ./...
gofmt -l cmd/ # must be clean
docker build -t gadfly:dev --secret id=REGISTRY_USER,env=REGISTRY_USER --secret id=REGISTRY_PASSWORD,env=REGISTRY_PASSWORD .
Run it locally against a real diff without CI:
git -C /path/to/repo diff main > /tmp/x.diff
GADFLY_PROVIDER=ollama GADFLY_MODEL=qwen2.5-coder:7b \
GADFLY_REPO_DIR=/path/to/repo GADFLY_DIFF_FILE=/tmp/x.diff \
GADFLY_SYSTEM_FILE=scripts/system-prompt.txt ./gadfly
Release / deploy
- Push to
main→ CI builds and pushes:latest(+:sha-<short>). - Tag
v*→ publishes:<tag>(+:latest). Pin consumers to:vNfor stability. - Required CI secrets:
REGISTRY_USER/REGISTRY_PASSWORD(registry push + read access to the private majordomo module). OptionalDISCORD_WEBHOOK_URL.
Configuration
The full env reference lives in the README (Specialists, Models & providers,
Configuration). Provider selection: GADFLY_PROVIDER (default ollama-cloud),
GADFLY_MODEL/GADFLY_MODELS, GADFLY_BASE_URL, GADFLY_API_KEY. Named endpoint aliases via
GADFLY_ENDPOINT_<NAME> / GADFLY_ALIAS_<NAME> (http-capable) and majordomo LLM_* DSNs
(HTTPS-only).
Specialists (the swarm): the reviewer runs a suite of focused lenses, one consolidated
comment with a section each. Default suite = security/correctness/maintainability/performance/
error-handling; opt-in built-ins = tests/docs/conventions/improvements. Select via
GADFLY_SPECIALISTS (csv or all); define/override via GADFLY_SPECIALIST_<NAME> env or a repo
.gadfly.yml (specialists: + define:). See cmd/gadfly/specialists.go. Cost ≈
specialists × models × 2 passes — the image/entrypoint default stays minimal (one model) for
that reason; the reusable workflow (review-reusable.yml) deliberately ships a heavier
opinionated default swarm (3 cloud + Claude Code, 5 lenses) for steve's own fleet, which consumers
inherit or override per-input.
Dynamic auto (GADFLY_SPECIALISTS=auto): a selector (GADFLY_SELECTOR_MODEL or the review
model) picks lenses per-diff and may invent ad-hoc ones (cmd/gadfly/auto.go). Worker-tier
(GADFLY_WORKER_MODEL): a delegate_investigation tool offloads grep/read legwork to a cheap
sub-agent (cmd/gadfly/delegate.go).
Tested vs untested: only the Ollama paths (local + OpenAI-compatible pointed at Ollama) are actually exercised. OpenAI/Anthropic/Google come from majordomo's abstraction and are untested (no spend). Keep the README honest about this; update it if that changes.
When making changes — maintenance rules
- Keep the README and
examples/current. Any change to env vars, flags, defaults, triggers, provider support, or the consumer stub MUST be reflected inREADME.mdand the relevant files underexamples/in the same change. The README'sConfigurationtable, theModels & providerstable, and the example workflows are the contract users rely on — stale docs are a bug. - Preserve the advisory-only invariant (goal #2). If you touch exit codes or the workflow, re-confirm a review can never fail/block a consumer's CI.
- Don't add mort-specific (or any single-consumer) assumptions to the binary or system prompt. The system prompt is intentionally generic; repo-specific conventions should be discovered by the agent at runtime (it can read the repo's own CONTRIBUTING/CLAUDE.md), not hardcoded here.
- Keep secrets out of image layers. Private-module creds flow via BuildKit
--mount=type=secretin the build stage only; never bake them into the final image or commit them. - Add a test when you add logic (see the
*_test.gopatterns). Keepgofmtclean andgo vetquiet.
Lessons
- majordomo's
LLM_*env DSNs are HTTPS-only (DSN.BaseURL()forceshttps://), so they can't express a plaintext local Ollama. That's why Gadfly adds the http-capableGADFLY_ENDPOINT_<NAME>="provider|base-url[|key]"mechanism (seecmd/gadfly/model.go). - Gitea
vars/secretsare not auto-exposed as env in a job — the consumer stub must map each one explicitly in itsenv:block (dynamic alias names can't be auto-enumerated). uses: docker://…:latestis CACHED by act_runner — a freshly-pushed:latestis often NOT re-pulled, so the job silently runs the previous image. For a run that must use a specific build (e.g. validating a just-pushed fix), pin the consumer stub to the immutable:sha-<short>tag the build publishes, not:latest.- Concurrency is per-provider (
entrypoint.sh): each provider is a lane, lanes run in parallel,cap(fromGADFLY_PROVIDER_CONCURRENCYelseGADFLY_CONCURRENCY, default 1) bounds models-at-once within a lane. The review timeout (GADFLY_TIMEOUT_SECS) is per-lens, not shared across the suite — a slow model can't starve later lenses (the original timeout bug). - Large-PR token burn: the agent loop re-sends the whole transcript every step, so a giant
diff (the old
get_diffdumped it untruncated, and it was embedded in both the review and recheck task) was re-transmitted ~steps × lenses × passes × models times — a ~250 K-token PR could drain a metered usage block in minutes. Fixed in three size-gated layers (small PRs untouched): paginatedget_diff+executus/compactcompaction in the binary; anentrypoint.shdownshift aboveGADFLY_HUGE_DIFF_BYTES(one cheap model, fewer lenses/steps, no recheck); and a swarm-wideGADFLY_PR_BUDGET_SECSwall-clock backstop. Compaction's threshold is intentionally LOW (GADFLY_COMPACT_RATIO0.45, not executus's 0.7) because the burning transcript on the embedded path rarely reaches 0.7×context. - executus re-platform: the in-process review path runs through
executus/run'srun.Executor(compaction, run-bounding,Ports.Budget, the wrap-up nudge asPorts.Critic), wiring it incmd/gadfly/executus.go. Gadfly KEEPS its ownmodel.goresolution (soGADFLY_ENDPOINT_<NAME>http aliases + the claude-code engine survive) and only handsrun.Executorthe already-resolved model via a trivial resolver — do NOT route review-model resolution throughmodel.ParseModelForContext(it bypasses gadfly's endpoint aliases).run.Resultexposes no transcript, so the old transcript-based forced-finalization fallback is gone; the wrap-up critic nudge is the remaining "always emit something" mechanism. The claude-code engine still shells out and is unaffected.