Files
steve ac6ce06cdd
Build & push image / build-and-push (push) Successful in 33s
feat: re-platform agentic review onto executus + large-PR cost controls (#20)
Makes gadfly a consumer of executus (run.Executor compaction/bounding/budget/critic + fanout) and fixes the large-PR token burn in size-gated layers: paginated get_diff, downshift above GADFLY_HUGE_DIFF_BYTES, and a swarm-wide GADFLY_PR_BUDGET_SECS backstop. Small PRs untouched; advisory-only and the static binary preserved. Dogfood swarm reviewed it (6 models, 21 real findings graded + folded in).

Co-authored-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
Co-committed-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
2026-06-30 15:41:03 +00:00

481 lines
30 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# 🪰 Gadfly
**An AI gadfly for your pull requests.** Gadfly is an *adversarial* code reviewer that
runs in Gitea Actions: on every PR it reads your actual repository, hunts for real
problems, verifies them against the code, and posts its findings as a comment. It does not
praise your code. A gadfly does not let things slide.
> ### 🤖 Heads up: this is a vibe-coded project
> Gadfly was built almost entirely by an AI agent (Claude Code), prompts and all — the
> reviewer's "brain" is a language model, and so was most of the author. It works and it's
> tested, but treat it accordingly: **it is advisory only, it never blocks a merge, and you
> should still review its reviews.** Issues and PRs welcome; expect the occasional
> AI-flavored rough edge.
## What makes it different
Most LLM "review my diff" bots read the diff in isolation and hallucinate problems they
can't actually see — a "missing import" that's three lines above the hunk, a "broken
caller" in a file they never opened. Gadfly is **agentic**: the model has read-only tools
over the checked-out repo and is *required* to use them before reporting anything.
- **Tools:** `read_file`, `list_dir`, `grep`, `find_files`, `get_diff`.
- **Verify-before-claiming discipline:** baked into the system prompt — open the file,
grep the symbol, or drop the finding.
- **Two passes:** a *review* pass drafts findings, then an adversarial *recheck* pass
independently re-verifies each one against the code and drops the ones it can't confirm,
recomputing the verdict. This is what kills "confident but wrong."
- **Semantic-bug hunting:** it's told not to trust a plausible-looking constant, conversion
factor, or formula — re-derive the expected value, because that's where real bugs hide.
Every review leads with a one-line verdict: **No material issues found**, **Minor issues**,
or **Blocking issues found**.
## Turn it on for a repo
Gadfly ships as a container image, so consuming repos don't build anything — they just run
it. Drop one file in your repo and set a couple of secrets/vars:
1. Copy a stub from [`examples/`](examples/) to `.gitea/workflows/adversarial-review.yml` in
your repo. Two flavors: the slim [`reusable.yml`](examples/reusable.yml) — a tiny caller of
Gadfly's **reusable workflow** (`uses: steve/gadfly/.gitea/workflows/review-reusable.yml@…`,
forwarding only the secrets the reviewer needs), whose **default swarm is set centrally via owner
variables** (see [Central config via variables](#central-config-via-variables)) and inherited by omitting `with:` — or the full self-contained
[`adversarial-review.yml`](examples/adversarial-review.yml) (Ollama Cloud default, with inline
notes for every provider / local Ollama / OpenAI-compatible / endpoint aliases). See the
[examples index](examples/README.md).
2. Add repo config:
- **secret** `OLLAMA_CLOUD_API_KEY` — your [Ollama Cloud](https://ollama.com) key (empty
⇒ Gadfly posts a harmless "not configured" notice instead of reviewing). *Not needed if
you point Gadfly at a different provider — see [Models & providers](#models--providers).*
- **var** `OLLAMA_REVIEW_MODELS` *(optional)* — comma-separated model ids
(default `qwen3-coder:480b-cloud,gpt-oss:120b-cloud`). One comment per model.
- **var** `GADFLY_ALLOWED_USERS` *(optional)* — who may re-trigger via comment; empty ⇒
any repo collaborator.
`GITEA_TOKEN` is provided automatically by Actions; comments post as the `gitea-actions`
user, scoped to that repo — no bot account needed.
## Models & providers
Gadfly is built on [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo), so the
reviewer model is not hard-wired — it can target anything majordomo supports. Pick a provider
by setting `GADFLY_PROVIDER` (used to prefix bare model ids); point at a custom endpoint with
`GADFLY_BASE_URL`; supply a key with `GADFLY_API_KEY` or the provider's standard env var. A
`GADFLY_MODEL`/`GADFLY_MODELS` value that already contains a `provider/` prefix (or is a
majordomo failover chain / alias) is used verbatim.
| Provider | `GADFLY_PROVIDER` | Key env | Status |
|----------|-------------------|---------|--------|
| **Ollama Cloud** (default) | `ollama-cloud` | `OLLAMA_API_KEY` / `OLLAMA_CLOUD_API_KEY` | ✅ in active use |
| **Local Ollama** | `ollama` | none (`OLLAMA_HOST` or `GADFLY_BASE_URL` for a remote daemon) | ✅ tested |
| **[foreman](https://gitea.stevedudenhoeffer.com/steve/foreman)** (native-Ollama queue daemon) | `foreman` + `GADFLY_BASE_URL`, or a `GADFLY_ENDPOINT_*` / `LLM_*` `foreman://` entry | optional bearer (via the endpoint/DSN) | ✅ native-Ollama path |
| **[llama-swap](https://github.com/mostlygeek/llama-swap)** (model-swapping proxy) | `llama-swap`/`llama-swaps` (un-hyphenated `llamaswap`/`llamaswaps` also accepted) + `GADFLY_BASE_URL` or a `GADFLY_ENDPOINT_*` entry, or an `LLM_*` `llama-swap://` / `llama-swaps://` DSN | optional bearer | ⚠️ wired, **untested** |
| **OpenAI-compatible** (incl. local Ollama's `/v1`) | `openai` + `GADFLY_BASE_URL` | `OPENAI_API_KEY` (any non-empty for Ollama) | ✅ tested against Ollama |
| **OpenAI** | `openai` | `OPENAI_API_KEY` | ⚠️ wired, **untested** |
| **Anthropic** | `anthropic` | `ANTHROPIC_API_KEY` | ⚠️ wired, **untested** |
| **Google (Gemini)** | `google` | `GOOGLE_API_KEY` / `GEMINI_API_KEY` | ⚠️ wired, **untested** |
> ### 🧪 Honest status
> Only the **Ollama** paths above are actually exercised. The OpenAI / Anthropic / Google
> providers come "for free" from majordomo's abstraction and *should* work, but I haven't
> spent money verifying them — treat them as untested. The OpenAI-**compatible** path **is**
> tested, because you can point it at a local Ollama (`GADFLY_BASE_URL=http://localhost:11434/v1`)
> and exercise the exact same code an OpenAI/OpenRouter endpoint would hit, for free. If you
> try a cloud provider and it works (or doesn't), please open an issue.
### Claude Code engine (`claude-code`)
Besides the majordomo model loop, Gadfly can review through the **[Claude Code](https://claude.com/claude-code)
CLI**: for each lens it shells out to `claude -p` *inside the checked-out repo*, so Claude Code
uses its **own** read tools (Read/Grep/Glob) to verify findings against real code, then Gadfly
parses the result and runs the same verdict-parse → recheck → consolidate → emit pipeline. The
CLI is bundled in the image (Node + `@anthropic-ai/claude-code`).
Select it as a model id — bare `claude-code` (CLI default model) or `claude-code/<model>` (the
suffix becomes `--model`, e.g. `claude-code/sonnet`, `claude-code/opus`). An optional
`:<thinking>` suffix forces an extended-thinking budget for that reviewer — `:max` (the high
"ultrathink" tier) or `:<n>` for a specific token budget — so you can run the same model at two
thinking depths as separate reviewers:
```yaml
GADFLY_MODELS: "claude-code/sonnet,claude-code/opus,claude-code/opus:max"
```
The thinking budget is applied via the `MAX_THINKING_TOKENS` env on the CLI subprocess; it's
best-effort (a no-op if the installed CLI build doesn't honor it).
Auth is read from the environment: the default is a **Pro/Max subscription** via
`CLAUDE_CODE_OAUTH_TOKEN` (from `claude setup-token`; no `--bare`), falling back to
`ANTHROPIC_API_KEY`. Don't set both. Tuning knobs (all optional):
| Env | Default | Meaning |
|-----|---------|---------|
| `GADFLY_CLAUDE_MODEL` | *(from the spec suffix)* | overrides the `--model` value |
| `GADFLY_CLAUDE_PERMISSION_MODE` | `plan` | `--permission-mode` (read-only `plan` keeps it from editing) |
| `GADFLY_CLAUDE_ALLOWED_TOOLS` | *(unset)* | `--allowedTools` value, passed verbatim (e.g. `Read,Grep,Glob`) |
| `GADFLY_CLAUDE_EXTRA_ARGS` | *(unset)* | extra CLI args, **whitespace-split** (no shell quoting) and appended after the defaults (e.g. `--max-turns 30`) |
| `GADFLY_CLAUDE_BIN` | `claude` | CLI binary path |
> These are **operator** knobs (workflow env), not PR-author input. Because
> `GADFLY_CLAUDE_EXTRA_ARGS` is appended *after* the defaults, it can override the
> read-only `--permission-mode plan` (e.g. passing `--permission-mode acceptEdits`),
> so keep it read-only unless you mean otherwise. It's whitespace-split, so values
> can't contain spaces — use `GADFLY_CLAUDE_ALLOWED_TOOLS` / `_PERMISSION_MODE` /
> `_MODEL` for those. The subprocess runs with a **minimal environment** (its auth
> token + `PATH`/`HOME`/locale/`GADFLY_CLAUDE_*`), not the runner's full env, so the
> Gitea token and provider keys aren't handed to the CLI.
**Alternate backends (example only, not validated here).** Because the subprocess env forwards
`ANTHROPIC_*` and `CLAUDE_*`, you can point the same engine at a non-Anthropic backend by setting
`ANTHROPIC_BASE_URL` (and `ANTHROPIC_AUTH_TOKEN`/`ANTHROPIC_API_KEY`) to an **Anthropic-API-compatible
proxy** — e.g. [claude-code-router](https://github.com/musistudio/claude-code-router) or LiteLLM in
front of Ollama — to run *Ollama models through Claude Code's harness* and compare it against the
native majordomo loop. Whether tool-use survives a given proxy/backend varies, so this is documented
as an example, not wired or tested here.
> **The Pro/Max path is dogfooded but otherwise lightly tested.** `claude-code/sonnet` now runs on
> gadfly's own PRs (see `.gitea/workflows/adversarial-review.yml`), but treat the engine as new —
> and note that subscription auth in automated CI is a gray area in Anthropic's terms. `auto`
> specialist selection and the `delegate_investigation` worker are majordomo-only and are skipped
> with this engine (Claude Code does its own legwork).
### Endpoint aliases via env vars
For multiple named backends (e.g. a couple of Ollama boxes on your LAN), register them by
name with env vars and then reference `name/model` in `GADFLY_MODEL`/`GADFLY_MODELS`:
```sh
# http-capable (Gadfly-native) — base URL used verbatim, so plaintext LAN works:
GADFLY_ENDPOINT_BIGBOX="ollama|http://192.168.1.50:11434"
GADFLY_ENDPOINT_GPU="openai|http://gpu.lan:8000/v1|sk-local"
GADFLY_ENDPOINT_M1="foreman|http://foreman-m1:8080|tok" # native-Ollama queue daemon
GADFLY_MODELS="bigbox/qwen2.5-coder:7b,gpu/llama3.1,m1/qwen3:14b"
# pure spec alias (a model, or a failover chain):
GADFLY_ALIAS_FAST="bigbox/qwen2.5-coder:7b,ollama-cloud/gpt-oss:120b-cloud"
GADFLY_MODEL="fast"
```
`<NAME>` is lowercased to form the registry name (`GADFLY_ENDPOINT_BIGBOX``bigbox`). This
is the same idea as majordomo's built-in **`LLM_*` env DSNs** (`LLM_BIGBOX=ollama://tok@host`,
`LLM_M1=foreman://tok@host`), which Gadfly also honors — but those are **HTTPS-only**, so for a
plaintext local Ollama or `http://` foreman use `GADFLY_ENDPOINT_*` instead.
> **Gitea Actions note:** repo `vars`/`secrets` aren't auto-exposed as env — add each alias to
> the stub workflow's `env:` block, e.g. `GADFLY_ENDPOINT_BIGBOX: ${{ vars.GADFLY_ENDPOINT_BIGBOX }}`.
## Specialists (the review swarm)
Instead of one generic reviewer, Gadfly runs a **suite of specialists** — each a focused lens
with its own review (+recheck) pass — and merges them into **one comment**, a collapsible
section per lens, led by an overall verdict (the worst across lenses; the optional
`improvements` lens never escalates it).
**Default suite** (when nothing is configured):
`security`, `correctness`, `maintainability` (code cleanliness), `performance`, `error-handling`.
**Also built in** (opt-in by name): `tests`, `docs`, `conventions`, and `improvements`
(strict & quiet — at most 12 high-value, non-blocking suggestions, silent otherwise).
Select which run with **`GADFLY_SPECIALISTS`** (comma-separated names, or `all`):
```yaml
GADFLY_SPECIALISTS: "security,correctness,maintainability,tests"
```
**Define your own** — two ways, which compose (env overrides file overrides built-ins):
```yaml
# 1. env: GADFLY_SPECIALIST_<NAME>="<focus>" (also overrides a built-in by reusing its name)
GADFLY_SPECIALIST_MIGRATIONS: "Review DB migrations for destructive or unindexed changes."
GADFLY_SPECIALISTS: "security,correctness,migrations"
```
```yaml
# 2. a repo .gadfly.yml at the repo root (version-controlled). See examples/.gadfly.yml:
specialists: [security, correctness, maintainability, migrations]
define:
- name: migrations
title: "🗃️ DB migrations"
focus: "Review schema migrations for destructive ops, missing indexes, table locks."
```
**Dynamic selection (`auto`):** set `GADFLY_SPECIALISTS: auto` and a selector model reads the
changed files + PR description and picks only the lenses that materially apply (and may invent
an ad-hoc one — e.g. a "migrations" lens for a schema change). The selector is
`GADFLY_SELECTOR_MODEL` if set (a cheap tier is ideal), else the review model. Capped and
de-duplicated; falls back to the default suite if selection fails.
**Worker-tier delegation:** set `GADFLY_WORKER_MODEL` (a cheap/fast model) to give every
reviewer a `delegate_investigation` tool — it offloads mechanical legwork (trace all callers,
gather every usage, check a pattern across files) to a worker sub-agent that returns a concise,
evidence-cited digest, so the expensive model reasons over summaries instead of raw file dumps.
Unset = no delegation (current behavior).
> **Cost:** each specialist is its own review+recheck, so cost ≈ *specialists × models × 2*.
> The default suite runs on a **single** model. Trim with `GADFLY_SPECIALISTS`, let `auto` pick
> only what a diff needs, and point heavy legwork at a cheap `GADFLY_WORKER_MODEL`.
### Concurrency (per-provider lanes)
With multiple models, each **provider** is its own lane and lanes run in **parallel**, so a fast
cloud provider isn't stuck behind a slow local box. Within a lane, at most `cap` models run at
once — `cap` comes from `GADFLY_PROVIDER_CONCURRENCY` (a `provider=N` map) else `GADFLY_CONCURRENCY`
(default `1`). The timeout is **per-lens** (`GADFLY_TIMEOUT_SECS`), so a slow model on one lens
can't starve the others.
```yaml
# One local box (serial — it serves one model at a time) + 3 cloud reviews at once,
# both lanes running concurrently:
GADFLY_PROVIDER_CONCURRENCY: "ollama-cloud=3,m1pro=1"
GADFLY_MODELS: "m1pro/qwen3:14b,qwen3-coder:480b-cloud,gpt-oss:120b-cloud"
```
A model's provider is the spec's first segment (`m1pro/…``m1pro`), or `GADFLY_PROVIDER`/
`ollama-cloud` for a bare id. Default (`cap 1`) keeps a single-provider pool fully sequential.
**Lens fan-out (within a model).** By default the specialist lenses run **sequentially** inside
each model (`GADFLY_LENS_CONCURRENCY=1`). Raise it to overlap the independent per-lens
review+recheck passes — the model then posts its consolidated comment as soon as its lenses
finish (so with sequential models, results stream in per model and per-model timings stay
clean). Like the model cap, it's **per-provider configurable**: `GADFLY_PROVIDER_LENS_CONCURRENCY`
takes a `provider=N` map keyed by the **same provider lanes** as `GADFLY_PROVIDER_CONCURRENCY`,
falling back to the `GADFLY_LENS_CONCURRENCY` scalar (default `1`). **It multiplies with the
model cap:** total in-flight requests ≈ *models-at-once × lenses-at-once*, so to fan lenses out
without oversubscribing a backend, keep its model cap low and raise its lens cap:
```yaml
# Per provider: cloud runs one model at a time but fans its 3 lenses out (3 concurrent requests);
# the slow local box stays fully serial. Both provider lanes still run in parallel.
GADFLY_PROVIDER_CONCURRENCY: "ollama-cloud=1,m1=1"
GADFLY_PROVIDER_LENS_CONCURRENCY: "ollama-cloud=3,m1=1"
GADFLY_SPECIALISTS: "security,correctness,error-handling"
```
### Live status board
When several models (each with several lenses) review a PR, the individual findings land in
**one comment per model** — but while that's in flight all you'd see is a row of
`⏳ Reviewing…` placeholders. So Gadfly also upserts **one consolidated status-board comment**
that aggregates every model's per-lens progress as it happens:
```
## 🪰 Gadfly — live review status
1/3 reviewers finished · updated 2026-06-27 18:14:56Z
#### `glm-5.2:cloud` · ollama-cloud — ⏳ 2/4 lenses
- ✅ security — No material issues found
- 🔄 correctness — running
- ⏸️ performance — queued
```
Each model process publishes its lenses (queued → running → finished + verdict) to a small
JSON file, and a background renderer in `entrypoint.sh` re-renders + upserts the single comment
every `GADFLY_STATUS_POLL_SECS` (default 12s) until the swarm finishes. It's advisory and
best-effort — the per-model findings comments are unaffected — and entirely separate from those.
Turn it off with `GADFLY_STATUS_BOARD=0`.
### Consensus consolidation
With **two or more models**, posting one comment each means a reader faces N walls of prose that
mostly agree. Instead Gadfly consolidates: every model writes its findings to a shared file, and
after the whole swarm finishes a single pass clusters those findings by location, counts **how
many models independently flagged each one**, and posts **one consensus comment**:
```
## 🪰 Gadfly review — consensus across 7 models
**Verdict: Blocking issues found** · 9 findings (3 with multi-model agreement)
| | Finding | Where | Models | Lens |
|--|--|--|--|--|
| 🔴 | Auth bypass: token not verified | `auth/login.go:42` | 6/7 | security |
| 🟠 | Unbounded retry loop | `sync/worker.go:88` | 3/7 | error-handling |
<details><summary>4 single-model findings (lower confidence)</summary> … </details>
<details><summary>Per-model detail</summary> … each model's full review, folded … </details>
```
Cross-model agreement is the strongest real-vs-false-positive signal available, so findings are
ranked by it (a lone low-severity finding folds away; a lone *critical* still surfaces). The
per-model comments are suppressed in this mode — each model's full review is preserved, folded,
inside the consensus comment — and nothing is lost: if consolidation can't run, Gadfly falls
back to posting the per-model comments. Controlled by `GADFLY_CONSOLIDATE`: `auto` (default — on
for ≥2 models), `1` (force on), `0` (force off, one comment per model). Single-model runs are
unaffected.
**Inline PR review.** Alongside the consensus comment, Gadfly also posts a single Gitea **pull
review** (state `COMMENT` — advisory, **never** request-changes or approve, so it can't block a
merge) whose inline comments anchor each consensus finding to the exact changed line it's about.
Only findings that land on a line in the diff are anchored (Gitea rejects comments off the diff);
the rest stay in the consensus comment. A re-run replaces the previous review instead of stacking.
It's the "reviewer integrated with Gitea" without the blocking — turn it off with
`GADFLY_INLINE_REVIEW=0`.
### Triggers
1. A **new/reopened/ready** non-draft PR — automatic.
2. Commenting **`@gadfly review`** on a PR — re-review on demand (gated to allowed users).
3. **workflow_dispatch** — manual, with a `pr_number` input.
(Pushing new commits does *not* auto-re-review — comment `@gadfly review` after pushing
fixes. This keeps usage down.)
> **Comment trigger needs the workflow on your default branch.** Gitea runs `issue_comment`
> workflows from the **default branch**, so `@gadfly review` only works once this stub is
> merged to `main` (the `pull_request` auto-trigger works from the PR branch immediately).
>
> **Security:** the example stubs gate the comment trigger with a job-level
> `if: github.event_name != 'issue_comment' || github.actor == '<you>'` so an untrusted
> commenter can't start a secret-bearing run — edit it to your maintainers and keep it in
> sync with `GADFLY_ALLOWED_USERS` (the in-container check). `@gadfly review` is plain-text
> matched (configurable via `GADFLY_TRIGGER_PHRASE`), so no bot account is required; comments
> post as `gitea-actions`.
## How it's packaged
```
cmd/gadfly/ the agentic reviewer binary (majordomo + Ollama Cloud); zero deps beyond stdlib + majordomo
scripts/run.sh fetches the PR diff, runs the reviewer, upserts one labeled comment
scripts/status-board.sh renders + upserts the single live status-board comment (per-lens progress)
scripts/system-prompt.txt the reviewer persona + verification discipline
entrypoint.sh the container brains: trigger gating, clone, model loop (logic lives here, not in YAML)
Dockerfile multi-stage; build-time module creds (BuildKit secrets) never reach the final image
.gitea/workflows/build-image.yml push to main → :latest; tag v* → :<tag> + :latest
examples/ the ~15-line stub a consuming repo drops in
```
The image is published to `gitea.stevedudenhoeffer.com/steve/gadfly`. Every push to `main`
rebuilds and republishes `:latest` (plus `:sha-<short>`); pushing a `v*` tag publishes that
pinned version (plus `:latest`). Pin full-stub consumers to a `:vN` image tag for stability, or track
`:latest` to ride main. **Reusable-workflow consumers should pin the workflow ref to an immutable
`review-reusable.yml@<sha>`** — long-lived act_runners *cache the workflow file by ref*, so a moved tag
(`@v1`) or `@main` is often **not** re-fetched and silently runs a stale copy. A fresh `@<sha>` is the
only reliable way to roll out a *structural* change to the reusable.
### Central config via variables
So you don't have to re-pin every consumer just to retune the swarm, the reusable resolves its config
at **runtime**`with:` input → owner **user/org-level variable** → image default — and variables are
injected per-run (not part of the cached file), so changing one variable propagates to every consumer
on its next review **without** a re-pin or a tag move:
| Variable (user/org scope) | Sets |
|---|---|
| `GADFLY_DEFAULT_MODELS` | `GADFLY_MODELS` (csv) |
| `GADFLY_DEFAULT_SPECIALISTS` | the lens suite |
| `GADFLY_DEFAULT_PROVIDER_CONCURRENCY` | models-at-once per provider |
| `GADFLY_DEFAULT_PROVIDER_LENS_CONCURRENCY` | lenses-at-once per provider |
| `GADFLY_ENDPOINT_RAGNAROS` | a named endpoint, e.g. `llamaswap\|https://host` |
Adding a *new* named endpoint still needs a one-line reusable edit (Gitea can't auto-expose arbitrary
`vars.GADFLY_ENDPOINT_*`); the values of already-wired ones are pure variables.
## Configuration (advanced)
The reviewer binary reads these (the stub/entrypoint set sane defaults):
| Env | Default | Meaning |
|-----|---------|---------|
| `GADFLY_MODEL` | — | model id, or `provider/model` spec, or majordomo alias/chain |
| `GADFLY_PROVIDER` | `ollama-cloud` | provider prefix for a bare model id |
| `GADFLY_BASE_URL` | — | override endpoint (OpenAI/Ollama-compatible servers) |
| `GADFLY_API_KEY` | — | provider key; falls back to the provider's standard env |
| `claude-code` model id | — | route a model through the bundled Claude Code CLI (`claude-code` / `claude-code/<model>`); see [Claude Code engine](#claude-code-engine-claude-code) for its `GADFLY_CLAUDE_*` knobs |
| `GADFLY_SPECIALISTS` | default suite | csv of lenses, `all`, or `auto` (dynamic selection) |
| `GADFLY_SELECTOR_MODEL` | review model | model that picks lenses in `auto` mode |
| `GADFLY_WORKER_MODEL` | — | cheap model for `delegate_investigation`; unset = no delegation |
| `GADFLY_WORKER_MAX_STEPS` | 8 | tool-step cap for a delegated worker run |
| `GADFLY_CONCURRENCY` | 1 | default max models run at once **per provider** |
| `GADFLY_PROVIDER_CONCURRENCY` | — | per-provider overrides, e.g. `ollama-cloud=3,m1pro=1` |
| `GADFLY_LENS_CONCURRENCY` | 1 | specialist lenses run at once **within a model** (× model cap = total in-flight) |
| `GADFLY_PROVIDER_LENS_CONCURRENCY` | — | per-provider lens overrides, same lanes as `GADFLY_PROVIDER_CONCURRENCY`, e.g. `ollama-cloud=3,m1=1` |
| `GADFLY_MAX_STEPS` | 24 | review-pass tool-step cap |
| `GADFLY_TIMEOUT_SECS` | 300 | deadline **per specialist lens** (review+recheck) |
| `GADFLY_RECHECK` | on | set `0`/`false` to skip the recheck pass |
| `GADFLY_RECHECK_MAX_STEPS` | 16 | recheck-pass step cap |
| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the **review** prompt (the full diff is reachable via the paginated `get_diff` tool, scoped per file with its `path` arg) |
| `GADFLY_RECHECK_DIFF_CHARS` | 20000 | diff chars embedded in the **recheck** prompt (smaller — the recheck pages `get_diff` for the hunks it verifies) |
| `GADFLY_COMPACT` | on | context compaction (via [executus](https://gitea.stevedudenhoeffer.com/steve/executus)): fold the transcript's runaway middle into a summary as it nears the model's context window, so a big diff + accumulating tool output can't balloon every step. `0` disables |
| `GADFLY_COMPACT_RATIO` | 0.45 | fraction of the model's context window at which compaction fires |
| `GADFLY_COMPACT_MODEL` | worker, else review model | cheap model the compactor uses to summarize the folded middle |
| `GADFLY_COMPACT_KEEP_RECENT` | 8 | most-recent messages kept verbatim during compaction |
| `GADFLY_COMPACT_SUMMARY_WORDS` | 200 | word cap on the compaction summary |
| `GADFLY_MODEL_CONTEXT_TOKENS` | *(auto)* | override the model's context-window size (tokens) for the compaction threshold; set it for self-hosted endpoints executus can't introspect (Ollama Cloud models resolve automatically) |
| `GADFLY_PR_TOKEN_BUDGET` | — | per-model token ceiling for this PR; once spent, remaining lenses/passes are skipped (advisory). 0 = off |
| `GADFLY_PR_TIME_BUDGET_SECS` | — | per-model wall-clock ceiling for this PR (advisory). 0 = off |
| `GADFLY_STATUS_BOARD` | on | set `0` to disable the live status-board comment |
| `GADFLY_STATUS_POLL_SECS` | 12 | how often the status board re-renders/upserts |
| `GADFLY_CONSOLIDATE` | `auto` | cross-model consensus comment: `auto` (on for ≥2 models), `1` (force on), `0` (off — one comment per model) |
| `GADFLY_INLINE_REVIEW` | on | when consolidating, also post a `COMMENT`-state PR review with inline comments on changed lines; `0` disables |
| `GADFLY_TRIGGER_PHRASE` | `@gadfly review` | comment phrase that re-triggers |
| `GADFLY_ALLOWED_USERS` | *(collaborators)* | comma-separated allow-list for comment triggers |
| `GADFLY_FINDINGS_URL` | — | gadfly-reports store base URL; set to enable findings telemetry (off when empty) |
| `GADFLY_FINDINGS_TOKEN` | — | bearer token for the gadfly-reports store (sent as `Authorization: Bearer …`) |
| `GADFLY_REPO` | *(from `GITEA_API`)* | `owner/repo` slug stamped on emitted runs/findings (set by `entrypoint.sh`) |
| `GADFLY_PR` | *(from event)* | PR number stamped on emitted runs/findings (set by `entrypoint.sh`) |
### Large-PR cost controls
A very large diff is the one thing that can blow the budget: every review step
re-sends it, multiplied across models × lenses × passes × steps (a single
~250 K-token PR can otherwise burn a whole metered usage block). Gadfly handles
big PRs in three layers, all **size-gated so small PRs are untouched**:
1. **Paginated `get_diff` + compaction** (reviewer binary, on by default) —
`get_diff` returns a paginated, optionally per-file window instead of the whole
diff, and once a transcript nears the model's context window its middle is
folded into a summary (powered by [executus](https://gitea.stevedudenhoeffer.com/steve/executus)'s
`compact`). Tune with the `GADFLY_COMPACT_*` knobs above.
2. **Downshift** (`entrypoint.sh`) — above `GADFLY_HUGE_DIFF_BYTES` the whole fleet
collapses to a single cheap model + a focused lens subset, fewer steps, and no
recheck. A finished shallow review beats a budget-nuking one, and the posted
comment says so.
3. **Hard backstop** (`entrypoint.sh`) — `GADFLY_PR_BUDGET_SECS` is a wall-clock
ceiling across the *entire* fleet; on expiry the review is stopped and whatever
was found so far is posted. Like everything else, it never fails CI.
| Env | Default | Meaning |
|-----|---------|---------|
| `GADFLY_HUGE_DIFF_BYTES` | 600000 | downshift the fleet when the PR diff exceeds this many bytes (0 = never downshift) |
| `GADFLY_HUGE_DIFF_MODELS` | first model | model(s) to run on a downshifted huge PR |
| `GADFLY_HUGE_DIFF_SPECIALISTS` | `security,correctness,error-handling` | lenses on a downshifted huge PR |
| `GADFLY_HUGE_DIFF_MAX_STEPS` | 12 | review step cap on a huge PR |
| `GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS` | 8 | recheck step cap on a huge PR |
| `GADFLY_HUGE_DIFF_RECHECK` | 0 | run the recheck pass on a huge PR (off by default) |
| `GADFLY_HUGE_DIFF_MAX_DIFF_CHARS` | 20000 | embedded review-diff chars on a huge PR |
| `GADFLY_PR_BUDGET_SECS` | — | swarm-wide wall-clock backstop; stops the whole fleet when reached (0 = off) |
## Findings telemetry (optional)
Gadfly can record what it found so model quality can be tracked over time. It is
**off by default** and purely advisory: set **`GADFLY_FINDINGS_URL`** to a
[gadfly-reports](https://gitea.stevedudenhoeffer.com/steve/gadfly-reports) store base URL and,
after each review, the binary best-effort `POST`s the run (`/runs`) and the
findings it surfaced (`/reports`) to that store. Add **`GADFLY_FINDINGS_TOKEN`**
to send an `Authorization: Bearer …` header. `entrypoint.sh` supplies the run
context (`GADFLY_REPO`, `GADFLY_PR`) automatically.
Findings are extracted heuristically from each lens's markdown — a `path:line`
reference anchors a finding, titled by the nearest preceding heading / numbered
item / bold lead-in. A lens whose verdict is **"No material issues found"**
emits **no** findings: its `path:line` references are verification notes
("verified X is safe"), not problems, so extracting them would record false
positives and unfairly penalize thorough clean-pass reviewers. The emit is
strictly best-effort: a short (~10s) timeout, any error (or a non-2xx response)
is logged to stderr only, and it **never** changes the review output or the exit
code.
## Building locally
```sh
go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
go test ./...
```
## License
MIT — see [LICENSE](LICENSE).