Files
steve ac6ce06cdd
Build & push image / build-and-push (push) Successful in 33s
feat: re-platform agentic review onto executus + large-PR cost controls (#20)
Makes gadfly a consumer of executus (run.Executor compaction/bounding/budget/critic + fanout) and fixes the large-PR token burn in size-gated layers: paginated get_diff, downshift above GADFLY_HUGE_DIFF_BYTES, and a swarm-wide GADFLY_PR_BUDGET_SECS backstop. Small PRs untouched; advisory-only and the static binary preserved. Dogfood swarm reviewed it (6 models, 21 real findings graded + folded in).

Co-authored-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
Co-committed-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
2026-06-30 15:41:03 +00:00

176 lines
12 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Gadfly — Developer Guide
Gadfly (🪰) is an **agentic adversarial code reviewer** that runs in Gitea Actions. On a pull
request it reads the *checked-out repository* with read-only tools, hunts for real problems,
verifies each one against the actual code, and posts its findings as a comment. It is
**advisory only** — it never blocks a merge.
> This is a public, **vibe-coded** project (built largely by an AI agent). Keep that framing
> honest in the README; don't oversell it.
## Project goals (keep changes aligned to these)
1. **Find *real* problems, not nits.** The whole point of the agentic tools + two-pass
recheck is to kill diff-only false positives. Anything that raises the false-positive rate
(or removes verification) works against the project.
2. **Advisory, never blocking.** Gadfly must never fail a CI job for review *content*, never
merge, never deploy. Non-zero exit only on usage/config errors; even then run.sh posts a
notice rather than failing. Do not add it to branch-protection required checks.
3. **Easy to turn on for any repo.** Consumers should need only a ~15-line stub workflow + a
couple of secrets/vars. All real logic lives in the image (`entrypoint.sh`), not in the
consumer's YAML (Gitea's act_runner has weak YAML expression support).
4. **Provider-agnostic.** Powered by [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo),
so it can target Ollama (local/cloud), OpenAI, Anthropic, Google, or any
OpenAI/Ollama-compatible endpoint. Don't re-hardcode a single provider.
5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib, majordomo, and
[executus](https://gitea.stevedudenhoeffer.com/steve/executus) (whose *core*`run`/`compact`/
`model`/`fanout`/`tool` — is itself majordomo+stdlib only, so the binary stays static; do NOT
pull executus's `contrib/store` or any battery that drags in a DB driver). No heavyweight deps,
no coupling to any one consumer repo (e.g. mort). Gadfly is executus's canonical *light* consumer.
## Architecture
```
cmd/gadfly/ the reviewer binary — pure producer of review markdown (stdout)
main.go orchestration: fan specialists out (executus/fanout), each a review pass + recheck
engine.go reviewEngine abstraction: executus run.Executor (majordomo agent loop +
compaction/bounding/budget/critic) vs claude-code CLI shell-out
executus.go executus wiring: tool.Registry over the repo tools, the run.Executor build
(compact + model context-limit threshold + per-PR budget + wrap-up critic)
specialists.go specialist lenses: built-ins, default suite, env + .gadfly.yml resolution
auto.go dynamic `auto` selection: a selector model picks lenses per-diff (may invent)
delegate.go worker-tier delegate_investigation tool (cheap sub-agent does legwork)
consolidate.go verdict parsing + one-comment consolidation (a section per specialist)
model.go provider/model + selector + worker resolution (majordomo.Parse) + endpoint aliases
tools.go the 5 read-only repo tools (read_file/list_dir/grep/find_files/get_diff)
recheck.go second-pass verification prompt + verdict recompute
*_test.go sandbox, recheck, wrap-up, spec/endpoint-parse, specialist-resolution tests
scripts/run.sh fetch PR diff+meta, run the binary, upsert ONE labeled PR comment
scripts/status-board.sh render+upsert ONE live status-board comment (per-model/per-lens progress)
scripts/system-prompt.txt the reviewer persona + verification discipline (generic, not repo-specific)
entrypoint.sh container brains: trigger gating, PR clone, model loop (the logic that
used to live in workflow YAML)
Dockerfile multi-stage; private-module creds via BuildKit secrets never reach the final image
.gitea/workflows/build-image.yml push main → :latest; tag v* → :<tag>+:latest; PR → build-only
.gitea/workflows/review-reusable.yml reusable (workflow_call) review job; resolves swarm config at
RUNTIME: consumer `with:` input → owner user-scope var (GADFLY_DEFAULT_MODELS /
_SPECIALISTS / _PROVIDER_CONCURRENCY / _PROVIDER_LENS_CONCURRENCY, +
GADFLY_ENDPOINT_RAGNAROS) → image default. Vars are injected per-run, so editing
one var retunes the whole fleet even though long-lived act_runners CACHE this file
by ref (a moved tag is NOT re-fetched — only a runtime value or a fresh @<sha>
bypasses the cache). Consumers subscribe with an ~8-line caller forwarding only the
secrets the reviewer needs and pinned to an immutable @<sha> (Phase 4);
gadfly's own adversarial-review.yml is a thin caller of it (dogfoods the path).
examples/ copy-paste consumer stub workflows for different providers
```
**Data flow:** consumer stub workflow → container `entrypoint.sh` (gate + clone) →
`scripts/run.sh` (per model) → `cmd/gadfly` binary (agentic review) → markdown → run.sh
upserts a PR comment as `gitea-actions`.
**Two passes:** a *review* pass drafts findings; an adversarial *recheck* pass independently
re-verifies each finding against the code and drops the unconfirmed ones, recomputing the
verdict. Verdict is one of: `No material issues found` / `Minor issues` / `Blocking issues found`.
## Build / test
```sh
go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
go test ./...
gofmt -l cmd/ # must be clean
docker build -t gadfly:dev --secret id=REGISTRY_USER,env=REGISTRY_USER --secret id=REGISTRY_PASSWORD,env=REGISTRY_PASSWORD .
```
Run it locally against a real diff without CI:
```sh
git -C /path/to/repo diff main > /tmp/x.diff
GADFLY_PROVIDER=ollama GADFLY_MODEL=qwen2.5-coder:7b \
GADFLY_REPO_DIR=/path/to/repo GADFLY_DIFF_FILE=/tmp/x.diff \
GADFLY_SYSTEM_FILE=scripts/system-prompt.txt ./gadfly
```
## Release / deploy
- **Push to `main`** → CI builds and pushes `:latest` (+ `:sha-<short>`).
- **Tag `v*`** → publishes `:<tag>` (+ `:latest`). Pin consumers to `:vN` for stability.
- Required CI secrets: `REGISTRY_USER` / `REGISTRY_PASSWORD` (registry push + read access to the
private majordomo module). Optional `DISCORD_WEBHOOK_URL`.
## Configuration
The full env reference lives in the **README** (`Specialists`, `Models & providers`,
`Configuration`). Provider selection: `GADFLY_PROVIDER` (default `ollama-cloud`),
`GADFLY_MODEL`/`GADFLY_MODELS`, `GADFLY_BASE_URL`, `GADFLY_API_KEY`. Named endpoint aliases via
`GADFLY_ENDPOINT_<NAME>` / `GADFLY_ALIAS_<NAME>` (http-capable) and majordomo `LLM_*` DSNs
(HTTPS-only).
**Specialists (the swarm):** the reviewer runs a suite of focused lenses, one consolidated
comment with a section each. Default suite = security/correctness/maintainability/performance/
error-handling; opt-in built-ins = tests/docs/conventions/improvements. Select via
`GADFLY_SPECIALISTS` (csv or `all`); define/override via `GADFLY_SPECIALIST_<NAME>` env or a repo
`.gadfly.yml` (`specialists:` + `define:`). See `cmd/gadfly/specialists.go`. Cost ≈
specialists × models × 2 passes — the **image/entrypoint** default stays minimal (one model) for
that reason; the **reusable workflow** (`review-reusable.yml`) deliberately ships a heavier
opinionated default swarm (3 cloud + Claude Code, 5 lenses) for steve's own fleet, which consumers
inherit or override per-input.
**Dynamic `auto`** (`GADFLY_SPECIALISTS=auto`): a selector (`GADFLY_SELECTOR_MODEL` or the review
model) picks lenses per-diff and may invent ad-hoc ones (`cmd/gadfly/auto.go`). **Worker-tier**
(`GADFLY_WORKER_MODEL`): a `delegate_investigation` tool offloads grep/read legwork to a cheap
sub-agent (`cmd/gadfly/delegate.go`).
**Tested vs untested:** only the Ollama paths (local + OpenAI-compatible pointed at Ollama)
are actually exercised. OpenAI/Anthropic/Google come from majordomo's abstraction and are
**untested** (no spend). Keep the README honest about this; update it if that changes.
## When making changes — maintenance rules
- **Keep the README and `examples/` current.** Any change to env vars, flags, defaults,
triggers, provider support, or the consumer stub MUST be reflected in `README.md` and the
relevant files under `examples/` in the *same* change. The README's `Configuration` table,
the `Models & providers` table, and the example workflows are the contract users rely on —
stale docs are a bug.
- **Preserve the advisory-only invariant** (goal #2). If you touch exit codes or the workflow,
re-confirm a review can never fail/block a consumer's CI.
- **Don't add mort-specific (or any single-consumer) assumptions** to the binary or system
prompt. The system prompt is intentionally generic; repo-specific conventions should be
discovered by the agent at runtime (it can read the repo's own CONTRIBUTING/CLAUDE.md), not
hardcoded here.
- **Keep secrets out of image layers.** Private-module creds flow via BuildKit `--mount=type=secret`
in the build stage only; never bake them into the final image or commit them.
- Add a test when you add logic (see the `*_test.go` patterns). Keep `gofmt` clean and `go vet` quiet.
## Lessons
- majordomo's `LLM_*` env DSNs are **HTTPS-only** (`DSN.BaseURL()` forces `https://`), so they
can't express a plaintext local Ollama. That's why Gadfly adds the http-capable
`GADFLY_ENDPOINT_<NAME>="provider|base-url[|key]"` mechanism (see `cmd/gadfly/model.go`).
- Gitea `vars`/`secrets` are **not** auto-exposed as env in a job — the consumer stub must map
each one explicitly in its `env:` block (dynamic alias names can't be auto-enumerated).
- **`uses: docker://…:latest` is CACHED by act_runner** — a freshly-pushed `:latest` is often
NOT re-pulled, so the job silently runs the previous image. For a run that must use a specific
build (e.g. validating a just-pushed fix), pin the consumer stub to the immutable
`:sha-<short>` tag the build publishes, not `:latest`.
- **Concurrency is per-provider** (`entrypoint.sh`): each provider is a lane, lanes run in
parallel, `cap` (from `GADFLY_PROVIDER_CONCURRENCY` else `GADFLY_CONCURRENCY`, default 1) bounds
models-at-once within a lane. The review timeout (`GADFLY_TIMEOUT_SECS`) is **per-lens**, not
shared across the suite — a slow model can't starve later lenses (the original timeout bug).
- **Large-PR token burn**: the agent loop re-sends the whole transcript every step, so a giant
diff (the old `get_diff` dumped it untruncated, and it was embedded in both the review and
recheck task) was re-transmitted ~steps × lenses × passes × models times — a ~250 K-token PR
could drain a metered usage block in minutes. Fixed in three size-gated layers (small PRs
untouched): paginated `get_diff` + `executus/compact` compaction in the binary; an
`entrypoint.sh` downshift above `GADFLY_HUGE_DIFF_BYTES` (one cheap model, fewer lenses/steps,
no recheck); and a swarm-wide `GADFLY_PR_BUDGET_SECS` wall-clock backstop. Compaction's threshold
is intentionally LOW (`GADFLY_COMPACT_RATIO` 0.45, not executus's 0.7) because the burning
transcript on the embedded path rarely reaches 0.7×context.
- **executus re-platform**: the in-process review path runs through `executus/run`'s `run.Executor`
(compaction, run-bounding, `Ports.Budget`, the wrap-up nudge as `Ports.Critic`), wiring it in
`cmd/gadfly/executus.go`. Gadfly KEEPS its own `model.go` resolution (so `GADFLY_ENDPOINT_<NAME>`
http aliases + the claude-code engine survive) and only hands `run.Executor` the already-resolved
model via a trivial resolver — do NOT route review-model resolution through
`model.ParseModelForContext` (it bypasses gadfly's endpoint aliases). `run.Result` exposes no
transcript, so the old transcript-based forced-finalization fallback is gone; the wrap-up critic
nudge is the remaining "always emit something" mechanism. The claude-code engine still shells out
and is unaffected.