# Gadfly β€” Developer Guide Gadfly (πŸͺ°) is an **agentic adversarial code reviewer** that runs in Gitea Actions. On a pull request it reads the *checked-out repository* with read-only tools, hunts for real problems, verifies each one against the actual code, and posts its findings as a comment. It is **advisory only** β€” it never blocks a merge. > This is a public, **vibe-coded** project (built largely by an AI agent). Keep that framing > honest in the README; don't oversell it. ## Project goals (keep changes aligned to these) 1. **Find *real* problems, not nits.** The whole point of the agentic tools + two-pass recheck is to kill diff-only false positives. Anything that raises the false-positive rate (or removes verification) works against the project. 2. **Advisory, never blocking.** Gadfly must never fail a CI job for review *content*, never merge, never deploy. Non-zero exit only on usage/config errors; even then run.sh posts a notice rather than failing. Do not add it to branch-protection required checks. 3. **Easy to turn on for any repo.** Consumers should need only a ~15-line stub workflow + a couple of secrets/vars. All real logic lives in the image (`entrypoint.sh`), not in the consumer's YAML (Gitea's act_runner has weak YAML expression support). 4. **Provider-agnostic.** Powered by [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo), so it can target Ollama (local/cloud), OpenAI, Anthropic, Google, or any OpenAI/Ollama-compatible endpoint. Don't re-hardcode a single provider. 5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib, majordomo, and [executus](https://gitea.stevedudenhoeffer.com/steve/executus) (whose *core* β€” `run`/`compact`/ `model`/`fanout`/`tool` β€” is itself majordomo+stdlib only, so the binary stays static; do NOT pull executus's `contrib/store` or any battery that drags in a DB driver). No heavyweight deps, no coupling to any one consumer repo (e.g. mort). Gadfly is executus's canonical *light* consumer. ## Architecture ``` cmd/gadfly/ the reviewer binary β€” pure producer of review markdown (stdout) main.go orchestration: fan specialists out (executus/fanout), each a review pass + recheck engine.go reviewEngine abstraction: executus run.Executor (majordomo agent loop + compaction/bounding/budget/critic) vs claude-code CLI shell-out executus.go executus wiring: tool.Registry over the repo tools, the run.Executor build (compact + model context-limit threshold + per-PR budget + wrap-up critic) specialists.go specialist lenses: built-ins, default suite, env + .gadfly.yml resolution auto.go dynamic `auto` selection: a selector model picks lenses per-diff (may invent) delegate.go worker-tier delegate_investigation tool (cheap sub-agent does legwork) consolidate.go verdict parsing + one-comment consolidation (a section per specialist) model.go provider/model + selector + worker resolution (majordomo.Parse) + endpoint aliases tools.go the 5 read-only repo tools (read_file/list_dir/grep/find_files/get_diff) recheck.go second-pass verification prompt + verdict recompute *_test.go sandbox, recheck, wrap-up, spec/endpoint-parse, specialist-resolution tests scripts/run.sh fetch PR diff+meta, run the binary, upsert ONE labeled PR comment scripts/status-board.sh render+upsert ONE live status-board comment (per-model/per-lens progress) scripts/system-prompt.txt the reviewer persona + verification discipline (generic, not repo-specific) entrypoint.sh container brains: trigger gating, PR clone, model loop (the logic that used to live in workflow YAML) Dockerfile multi-stage; private-module creds via BuildKit secrets never reach the final image .gitea/workflows/build-image.yml push main β†’ :latest; tag v* β†’ :+:latest; PR β†’ build-only .gitea/workflows/review-reusable.yml reusable (workflow_call) review job; resolves swarm config at RUNTIME: consumer `with:` input β†’ owner user-scope var (GADFLY_DEFAULT_MODELS / _SPECIALISTS / _PROVIDER_CONCURRENCY / _PROVIDER_LENS_CONCURRENCY, + GADFLY_ENDPOINT_RAGNAROS) β†’ image default. Vars are injected per-run, so editing one var retunes the whole fleet even though long-lived act_runners CACHE this file by ref (a moved tag is NOT re-fetched β€” only a runtime value or a fresh @ bypasses the cache). Consumers subscribe with an ~8-line caller forwarding only the secrets the reviewer needs and pinned to an immutable @ (Phase 4); gadfly's own adversarial-review.yml is a thin caller of it (dogfoods the path). examples/ copy-paste consumer stub workflows for different providers ``` **Data flow:** consumer stub workflow β†’ container `entrypoint.sh` (gate + clone) β†’ `scripts/run.sh` (per model) β†’ `cmd/gadfly` binary (agentic review) β†’ markdown β†’ run.sh upserts a PR comment as `gitea-actions`. **Two passes:** a *review* pass drafts findings; an adversarial *recheck* pass independently re-verifies each finding against the code and drops the unconfirmed ones, recomputing the verdict. Verdict is one of: `No material issues found` / `Minor issues` / `Blocking issues found`. ## Build / test ```sh go build ./cmd/gadfly # needs read access to the private majordomo + executus modules go test ./... gofmt -l cmd/ # must be clean docker build -t gadfly:dev --secret id=REGISTRY_USER,env=REGISTRY_USER --secret id=REGISTRY_PASSWORD,env=REGISTRY_PASSWORD . ``` Run it locally against a real diff without CI: ```sh git -C /path/to/repo diff main > /tmp/x.diff GADFLY_PROVIDER=ollama GADFLY_MODEL=qwen2.5-coder:7b \ GADFLY_REPO_DIR=/path/to/repo GADFLY_DIFF_FILE=/tmp/x.diff \ GADFLY_SYSTEM_FILE=scripts/system-prompt.txt ./gadfly ``` ## Release / deploy - **Push to `main`** β†’ CI builds and pushes `:latest` (+ `:sha-`). - **Tag `v*`** β†’ publishes `:` (+ `:latest`). Pin consumers to `:vN` for stability. - Required CI secrets: `REGISTRY_USER` / `REGISTRY_PASSWORD` (registry push + read access to the private majordomo module). Optional `DISCORD_WEBHOOK_URL`. ## Configuration The full env reference lives in the **README** (`Specialists`, `Models & providers`, `Configuration`). Provider selection: `GADFLY_PROVIDER` (default `ollama-cloud`), `GADFLY_MODEL`/`GADFLY_MODELS`, `GADFLY_BASE_URL`, `GADFLY_API_KEY`. Named endpoint aliases via `GADFLY_ENDPOINT_` / `GADFLY_ALIAS_` (http-capable) and majordomo `LLM_*` DSNs (HTTPS-only). **Specialists (the swarm):** the reviewer runs a suite of focused lenses, one consolidated comment with a section each. Default suite = security/correctness/maintainability/performance/ error-handling; opt-in built-ins = tests/docs/conventions/improvements. Select via `GADFLY_SPECIALISTS` (csv or `all`); define/override via `GADFLY_SPECIALIST_` env or a repo `.gadfly.yml` (`specialists:` + `define:`). See `cmd/gadfly/specialists.go`. Cost β‰ˆ specialists Γ— models Γ— 2 passes β€” the **image/entrypoint** default stays minimal (one model) for that reason; the **reusable workflow** (`review-reusable.yml`) deliberately ships a heavier opinionated default swarm (3 cloud + Claude Code, 5 lenses) for steve's own fleet, which consumers inherit or override per-input. **Dynamic `auto`** (`GADFLY_SPECIALISTS=auto`): a selector (`GADFLY_SELECTOR_MODEL` or the review model) picks lenses per-diff and may invent ad-hoc ones (`cmd/gadfly/auto.go`). **Worker-tier** (`GADFLY_WORKER_MODEL`): a `delegate_investigation` tool offloads grep/read legwork to a cheap sub-agent (`cmd/gadfly/delegate.go`). **Tested vs untested:** only the Ollama paths (local + OpenAI-compatible pointed at Ollama) are actually exercised. OpenAI/Anthropic/Google come from majordomo's abstraction and are **untested** (no spend). Keep the README honest about this; update it if that changes. ## When making changes β€” maintenance rules - **Keep the README and `examples/` current.** Any change to env vars, flags, defaults, triggers, provider support, or the consumer stub MUST be reflected in `README.md` and the relevant files under `examples/` in the *same* change. The README's `Configuration` table, the `Models & providers` table, and the example workflows are the contract users rely on β€” stale docs are a bug. - **Preserve the advisory-only invariant** (goal #2). If you touch exit codes or the workflow, re-confirm a review can never fail/block a consumer's CI. - **Don't add mort-specific (or any single-consumer) assumptions** to the binary or system prompt. The system prompt is intentionally generic; repo-specific conventions should be discovered by the agent at runtime (it can read the repo's own CONTRIBUTING/CLAUDE.md), not hardcoded here. - **Keep secrets out of image layers.** Private-module creds flow via BuildKit `--mount=type=secret` in the build stage only; never bake them into the final image or commit them. - Add a test when you add logic (see the `*_test.go` patterns). Keep `gofmt` clean and `go vet` quiet. ## Lessons - majordomo's `LLM_*` env DSNs are **HTTPS-only** (`DSN.BaseURL()` forces `https://`), so they can't express a plaintext local Ollama. That's why Gadfly adds the http-capable `GADFLY_ENDPOINT_="provider|base-url[|key]"` mechanism (see `cmd/gadfly/model.go`). - Gitea `vars`/`secrets` are **not** auto-exposed as env in a job β€” the consumer stub must map each one explicitly in its `env:` block (dynamic alias names can't be auto-enumerated). - **`uses: docker://…:latest` is CACHED by act_runner** β€” a freshly-pushed `:latest` is often NOT re-pulled, so the job silently runs the previous image. For a run that must use a specific build (e.g. validating a just-pushed fix), pin the consumer stub to the immutable `:sha-` tag the build publishes, not `:latest`. - **Concurrency is per-provider** (`entrypoint.sh`): each provider is a lane, lanes run in parallel, `cap` (from `GADFLY_PROVIDER_CONCURRENCY` else `GADFLY_CONCURRENCY`, default 1) bounds models-at-once within a lane. The review timeout (`GADFLY_TIMEOUT_SECS`) is **per-lens**, not shared across the suite β€” a slow model can't starve later lenses (the original timeout bug). - **Large-PR token burn**: the agent loop re-sends the whole transcript every step, so a giant diff (the old `get_diff` dumped it untruncated, and it was embedded in both the review and recheck task) was re-transmitted ~steps Γ— lenses Γ— passes Γ— models times β€” a ~250 K-token PR could drain a metered usage block in minutes. Fixed in three size-gated layers (small PRs untouched): paginated `get_diff` + `executus/compact` compaction in the binary; an `entrypoint.sh` downshift above `GADFLY_HUGE_DIFF_BYTES` (one cheap model, fewer lenses/steps, no recheck); and a swarm-wide `GADFLY_PR_BUDGET_SECS` wall-clock backstop. Compaction's threshold is intentionally LOW (`GADFLY_COMPACT_RATIO` 0.45, not executus's 0.7) because the burning transcript on the embedded path rarely reaches 0.7Γ—context. - **executus re-platform**: the in-process review path runs through `executus/run`'s `run.Executor` (compaction, run-bounding, `Ports.Budget`, the wrap-up nudge as `Ports.Critic`), wiring it in `cmd/gadfly/executus.go`. Gadfly KEEPS its own `model.go` resolution (so `GADFLY_ENDPOINT_` http aliases + the claude-code engine survive) and only hands `run.Executor` the already-resolved model via a trivial resolver β€” do NOT route review-model resolution through `model.ParseModelForContext` (it bypasses gadfly's endpoint aliases). `run.Result` exposes no transcript, so the old transcript-based forced-finalization fallback is gone; the wrap-up critic nudge is the remaining "always emit something" mechanism. The claude-code engine still shells out and is unaffected.