# 🪰 Gadfly
**An AI gadfly for your pull requests.** Gadfly is an *adversarial* code reviewer that
runs in Gitea Actions: on every PR it reads your actual repository, hunts for real
problems, verifies them against the code, and posts its findings as a comment. It does not
praise your code. A gadfly does not let things slide.
> ### 🤖 Heads up: this is a vibe-coded project
> Gadfly was built almost entirely by an AI agent (Claude Code), prompts and all — the
> reviewer's "brain" is a language model, and so was most of the author. It works and it's
> tested, but treat it accordingly: **it is advisory only, it never blocks a merge, and you
> should still review its reviews.** Issues and PRs welcome; expect the occasional
> AI-flavored rough edge.
## What makes it different
Most LLM "review my diff" bots read the diff in isolation and hallucinate problems they
can't actually see — a "missing import" that's three lines above the hunk, a "broken
caller" in a file they never opened. Gadfly is **agentic**: the model has read-only tools
over the checked-out repo and is *required* to use them before reporting anything.
- **Tools:** `read_file`, `list_dir`, `grep`, `find_files`, `get_diff`.
- **Verify-before-claiming discipline:** baked into the system prompt — open the file,
grep the symbol, or drop the finding.
- **Two passes:** a *review* pass drafts findings, then an adversarial *recheck* pass
independently re-verifies each one against the code and drops the ones it can't confirm,
recomputing the verdict. This is what kills "confident but wrong."
- **Semantic-bug hunting:** it's told not to trust a plausible-looking constant, conversion
factor, or formula — re-derive the expected value, because that's where real bugs hide.
Every review leads with a one-line verdict: **No material issues found**, **Minor issues**,
or **Blocking issues found**.
## Turn it on for a repo
Gadfly ships as a container image, so consuming repos don't build anything — they just run
it. Drop one file in your repo and set a couple of secrets/vars:
1. Copy a stub from [`examples/`](examples/) to `.gitea/workflows/adversarial-review.yml` in
   your repo. There are two flavors (see the [examples index](examples/README.md)):
   - the slim [`reusable.yml`](examples/reusable.yml), a tiny caller of Gadfly's **reusable
     workflow** (`uses: steve/gadfly/.gitea/workflows/review-reusable.yml@…`) that forwards only
     the secrets the reviewer needs. It ships a **default swarm** (3 cloud models + the Claude
     Code engine, 5-lens suite), which you inherit by omitting `with:` or override per input.
   - the full, self-contained [`adversarial-review.yml`](examples/adversarial-review.yml)
     (Ollama Cloud default, with inline notes for every provider / local Ollama /
     OpenAI-compatible / endpoint aliases).
2. Add repo config:
   - **secret** `OLLAMA_CLOUD_API_KEY`: your [Ollama Cloud](https://ollama.com) key (empty
     ⇒ Gadfly posts a harmless "not configured" notice instead of reviewing). *Not needed if
     you point Gadfly at a different provider; see [Models & providers](#models--providers).*
   - **var** `OLLAMA_REVIEW_MODELS` *(optional)*: comma-separated model ids
     (default `qwen3-coder:480b-cloud,gpt-oss:120b-cloud`). One comment per model.
   - **var** `GADFLY_ALLOWED_USERS` *(optional)*: who may re-trigger via comment; empty ⇒
     any repo collaborator.
`GITEA_TOKEN` is provided automatically by Actions; comments post as the `gitea-actions`
user, scoped to that repo — no bot account needed.
## Models & providers
Gadfly is built on [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo), so the
reviewer model is not hard-wired — it can target anything majordomo supports. Pick a provider
by setting `GADFLY_PROVIDER` (used to prefix bare model ids); point at a custom endpoint with
`GADFLY_BASE_URL`; supply a key with `GADFLY_API_KEY` or the provider's standard env var. A
`GADFLY_MODEL`/`GADFLY_MODELS` value that already contains a `provider/` prefix (or is a
majordomo failover chain / alias) is used verbatim.
| Provider | `GADFLY_PROVIDER` | Key env | Status |
|----------|-------------------|---------|--------|
| **Ollama Cloud** (default) | `ollama-cloud` | `OLLAMA_API_KEY` / `OLLAMA_CLOUD_API_KEY` | ✅ in active use |
| **Local Ollama** | `ollama` | none (`OLLAMA_HOST` or `GADFLY_BASE_URL` for a remote daemon) | ✅ tested |
| **[foreman](https://gitea.stevedudenhoeffer.com/steve/foreman)** (native-Ollama queue daemon) | `foreman` + `GADFLY_BASE_URL`, or a `GADFLY_ENDPOINT_*` / `LLM_*` `foreman://` entry | optional bearer (via the endpoint/DSN) | ✅ native-Ollama path |
| **[llama-swap](https://github.com/mostlygeek/llama-swap)** (model-swapping proxy) | `llama-swap`/`llama-swaps` (un-hyphenated `llamaswap`/`llamaswaps` also accepted) + `GADFLY_BASE_URL` or a `GADFLY_ENDPOINT_*` entry, or an `LLM_*` `llama-swap://` / `llama-swaps://` DSN | optional bearer | ⚠️ wired, **untested** |
| **OpenAI-compatible** (incl. local Ollama's `/v1`) | `openai` + `GADFLY_BASE_URL` | `OPENAI_API_KEY` (any non-empty for Ollama) | ✅ tested against Ollama |
| **OpenAI** | `openai` | `OPENAI_API_KEY` | ⚠️ wired, **untested** |
| **Anthropic** | `anthropic` | `ANTHROPIC_API_KEY` | ⚠️ wired, **untested** |
| **Google (Gemini)** | `google` | `GOOGLE_API_KEY` / `GEMINI_API_KEY` | ⚠️ wired, **untested** |
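As a rough illustration of the prefixing rule described above the table, the sketch below shows how a bare model id could be combined with `GADFLY_PROVIDER`. The `resolve_spec` helper is hypothetical and not part of Gadfly; it only checks for a `provider/` prefix, so bare aliases and failover chains (which majordomo resolves) are deliberately not modeled.

```sh
# Hypothetical helper, for illustration only. It mimics the documented rule:
# a bare model id gets the GADFLY_PROVIDER prefix (default: ollama-cloud),
# while a spec that already has a "provider/" prefix is used verbatim.
resolve_spec() {
  case "$1" in
    */*) printf '%s\n' "$1" ;;                                        # already prefixed
    *)   printf '%s/%s\n' "${GADFLY_PROVIDER:-ollama-cloud}" "$1" ;;  # bare id
  esac
}

resolve_spec "gpt-oss:120b-cloud"       # ollama-cloud/gpt-oss:120b-cloud when GADFLY_PROVIDER is unset
resolve_spec "bigbox/qwen2.5-coder:7b"  # bigbox/qwen2.5-coder:7b (unchanged)
```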
> ### 🧪 Honest status
> Only the **Ollama** paths above are actually exercised. The OpenAI / Anthropic / Google
> providers come "for free" from majordomo's abstraction and *should* work, but I haven't
> spent money verifying them — treat them as untested. The OpenAI-**compatible** path **is**
> tested, because you can point it at a local Ollama (`GADFLY_BASE_URL=http://localhost:11434/v1`)
> and exercise the exact same code an OpenAI/OpenRouter endpoint would hit, for free. If you
> try a cloud provider and it works (or doesn't), please open an issue.
### Claude Code engine (`claude-code`)
Besides the majordomo model loop, Gadfly can review through the **[Claude Code](https://claude.com/claude-code)
CLI**: for each lens it shells out to `claude -p` *inside the checked-out repo*, so Claude Code
uses its **own** read tools (Read/Grep/Glob) to verify findings against real code, then Gadfly
parses the result and runs the same verdict-parse → recheck → consolidate → emit pipeline. The
CLI is bundled in the image (Node + `@anthropic-ai/claude-code`).
Select it as a model id — bare `claude-code` (CLI default model) or `claude-code/<model>` (the
suffix becomes `--model`, e.g. `claude-code/sonnet`, `claude-code/opus`). An optional
`:<thinking>` suffix forces an extended-thinking budget for that reviewer — `:max` (the high
"ultrathink" tier) or `:<n>` for a specific token budget — so you can run the same model at two
thinking depths as separate reviewers:
```yaml
GADFLY_MODELS: "claude-code/sonnet,claude-code/opus,claude-code/opus:max"
```
The thinking budget is applied via the `MAX_THINKING_TOKENS` env on the CLI subprocess; it's
best-effort (a no-op if the installed CLI build doesn't honor it).
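To make the model-id grammar concrete, here is a minimal sketch that splits a `claude-code` spec into its model and thinking parts. `parse_claude_spec` is a hypothetical helper written for this README, not Gadfly's actual parser, and it assumes the thinking suffix is whatever follows the last `:`.

```sh
# Hypothetical parser for "claude-code[/<model>][:<thinking>]" specs.
parse_claude_spec() {
  rest=${1#claude-code}          # "", "/sonnet", or "/opus:max"
  thinking=
  case "$rest" in
    *:*) thinking=${rest##*:}    # text after the last ":" ("max" or a token count)
         rest=${rest%:*} ;;      # drop the ":<thinking>" suffix
  esac
  model=${rest#/}                # strip the leading "/"
  printf 'model=%s thinking=%s\n' "${model:-<cli-default>}" "${thinking:-<none>}"
}

parse_claude_spec "claude-code"           # model=<cli-default> thinking=<none>
parse_claude_spec "claude-code/sonnet"    # model=sonnet thinking=<none>
parse_claude_spec "claude-code/opus:max"  # model=opus thinking=max
```

How `max` or a numeric budget is translated into `MAX_THINKING_TOKENS` is left out, since that mapping is internal to Gadfly.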
Auth is read from the environment: the default is a **Pro/Max subscription** via
`CLAUDE_CODE_OAUTH_TOKEN` (from `claude setup-token`; no `--bare`), falling back to
`ANTHROPIC_API_KEY`. Don't set both. Tuning knobs (all optional):
| Env | Default | Meaning |
|-----|---------|---------|
| `GADFLY_CLAUDE_MODEL` | *(from the spec suffix)* | overrides the `--model` value |
| `GADFLY_CLAUDE_PERMISSION_MODE` | `plan` | `--permission-mode` (read-only `plan` keeps it from editing) |
| `GADFLY_CLAUDE_ALLOWED_TOOLS` | *(unset)* | `--allowedTools` value, passed verbatim (e.g. `Read,Grep,Glob`) |
| `GADFLY_CLAUDE_EXTRA_ARGS` | *(unset)* | extra CLI args, **whitespace-split** (no shell quoting) and appended after the defaults (e.g. `--max-turns 30`) |
| `GADFLY_CLAUDE_BIN` | `claude` | CLI binary path |
> These are **operator** knobs (workflow env), not PR-author input. Because
> `GADFLY_CLAUDE_EXTRA_ARGS` is appended *after* the defaults, it can override the
> read-only `--permission-mode plan` (e.g. passing `--permission-mode acceptEdits`),
> so keep it read-only unless you mean otherwise. It's whitespace-split, so values
> can't contain spaces — use `GADFLY_CLAUDE_ALLOWED_TOOLS` / `_PERMISSION_MODE` /
> `_MODEL` for those. The subprocess runs with a **minimal environment** (its auth
> token + `PATH`/`HOME`/locale/`GADFLY_CLAUDE_*`), not the runner's full env, so the
> Gitea token and provider keys aren't handed to the CLI.
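The whitespace-split behavior is easy to trip over, so here is a small sketch of what it implies. It uses the shell's own unquoted word splitting (with the default `IFS`) as a stand-in for Gadfly's splitter, which is an assumption; `--some-flag` is a made-up flag used only to show a quoted value being broken apart.

```sh
# Illustration only: count the arguments a whitespace-split would produce.
# Quote characters inside the value are NOT interpreted; they stay literal.
count_args() {
  set -f        # disable globbing so only whitespace splitting is shown
  set -- $1     # intentional unquoted expansion: split on whitespace
  set +f
  echo "$#"
}

count_args '--max-turns 30'            # 2: "--max-turns" and "30"
count_args '--some-flag "two words"'   # 3: the quoted value is split in two
```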
**Alternate backends (example only, not validated here).** Because the subprocess env forwards
`ANTHROPIC_*` and `CLAUDE_*`, you can point the same engine at a non-Anthropic backend by setting
`ANTHROPIC_BASE_URL` (and `ANTHROPIC_AUTH_TOKEN`/`ANTHROPIC_API_KEY`) to an **Anthropic-API-compatible
proxy** — e.g. [claude-code-router](https://github.com/musistudio/claude-code-router) or LiteLLM in
front of Ollama — to run *Ollama models through Claude Code's harness* and compare it against the
native majordomo loop. Whether tool-use survives a given proxy/backend varies, so this is documented
as an example, not wired or tested here.
> **The Pro/Max path is dogfooded but otherwise lightly tested.** `claude-code/sonnet` now runs on
> gadfly's own PRs (see `.gitea/workflows/adversarial-review.yml`), but treat the engine as new —
> and note that subscription auth in automated CI is a gray area in Anthropic's terms. `auto`
> specialist selection and the `delegate_investigation` worker are majordomo-only and are skipped
> with this engine (Claude Code does its own legwork).
### Endpoint aliases via env vars
For multiple named backends (e.g. a couple of Ollama boxes on your LAN), register them by
name with env vars and then reference `name/model` in `GADFLY_MODEL`/`GADFLY_MODELS`:
```sh
# http-capable (Gadfly-native) — base URL used verbatim, so plaintext LAN works:
GADFLY_ENDPOINT_BIGBOX="ollama|http://192.168.1.50:11434"
GADFLY_ENDPOINT_GPU="openai|http://gpu.lan:8000/v1|sk-local"
GADFLY_ENDPOINT_M1="foreman|http://foreman-m1:8080|tok" # native-Ollama queue daemon
GADFLY_MODELS="bigbox/qwen2.5-coder:7b,gpu/llama3.1,m1/qwen3:14b"
# pure spec alias (a model, or a failover chain):
GADFLY_ALIAS_FAST="bigbox/qwen2.5-coder:7b,ollama-cloud/gpt-oss:120b-cloud"
GADFLY_MODEL="fast"
```
`<NAME>` is lowercased to form the registry name (`GADFLY_ENDPOINT_BIGBOX` → `bigbox`). This
is the same idea as majordomo's built-in **`LLM_*` env DSNs** (`LLM_BIGBOX=ollama://tok@host`,
`LLM_M1=foreman://tok@host`), which Gadfly also honors — but those are **HTTPS-only**, so for a
plaintext local Ollama or `http://` foreman use `GADFLY_ENDPOINT_*` instead.
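The naming rule and value layout can be sketched as follows. `parse_endpoint` is a hypothetical helper, and the `provider|base-url|key` field order is inferred from the examples above rather than taken from Gadfly's source.

```sh
# Hypothetical parser: $1 = env var name, $2 = its "provider|base-url[|key]" value.
parse_endpoint() {
  # GADFLY_ENDPOINT_BIGBOX -> bigbox (strip the prefix, lowercase the rest)
  name=$(printf '%s' "${1#GADFLY_ENDPOINT_}" | tr '[:upper:]' '[:lower:]')
  # Split the value on "|"; a missing third field leaves key empty.
  IFS='|' read -r provider url key <<EOF
$2
EOF
  printf 'name=%s provider=%s url=%s key=%s\n' "$name" "$provider" "$url" "${key:-<none>}"
}

parse_endpoint GADFLY_ENDPOINT_BIGBOX "ollama|http://192.168.1.50:11434"
parse_endpoint GADFLY_ENDPOINT_GPU "openai|http://gpu.lan:8000/v1|sk-local"
```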
> **Gitea Actions note:** repo `vars`/`secrets` aren't auto-exposed as env — add each alias to
> the stub workflow's `env:` block, e.g. `GADFLY_ENDPOINT_BIGBOX: ${{ vars.GADFLY_ENDPOINT_BIGBOX }}`.
## Specialists (the review swarm)
Instead of one generic reviewer, Gadfly runs a **suite of specialists** — each a focused lens
with its own review (+recheck) pass — and merges them into **one comment**, a collapsible
section per lens, led by an overall verdict (the worst across lenses; the optional
`improvements` lens never escalates it).
**Default suite** (when nothing is configured):
`security`, `correctness`, `maintainability` (code cleanliness), `performance`, `error-handling`.
**Also built in** (opt-in by name): `tests`, `docs`, `conventions`, and `improvements`
(strict & quiet: at most 1-2 high-value, non-blocking suggestions, silent otherwise).
Select which run with **`GADFLY_SPECIALISTS`** (comma-separated names, or `all`):
```yaml
GADFLY_SPECIALISTS: "security,correctness,maintainability,tests"
```
**Define your own** — two ways, which compose (env overrides file overrides built-ins):
```yaml
# 1. env: GADFLY_SPECIALIST_<NAME>="<focus>" (also overrides a built-in by reusing its name)
GADFLY_SPECIALIST_MIGRATIONS: "Review DB migrations for destructive or unindexed changes."
GADFLY_SPECIALISTS: "security,correctness,migrations"
```
```yaml
# 2. a repo .gadfly.yml at the repo root (version-controlled). See examples/.gadfly.yml:
specialists: [security, correctness, maintainability, migrations]
define:
- name: migrations
title: "🗃️ DB migrations"
focus: "Review schema migrations for destructive ops, missing indexes, table locks."
```
**Dynamic selection (`auto`):** set `GADFLY_SPECIALISTS: auto` and a selector model reads the
changed files + PR description and picks only the lenses that materially apply (and may invent
an ad-hoc one — e.g. a "migrations" lens for a schema change). The selector is
`GADFLY_SELECTOR_MODEL` if set (a cheap tier is ideal), else the review model. Capped and
de-duplicated; falls back to the default suite if selection fails.
**Worker-tier delegation:** set `GADFLY_WORKER_MODEL` (a cheap/fast model) to give every
reviewer a `delegate_investigation` tool — it offloads mechanical legwork (trace all callers,
gather every usage, check a pattern across files) to a worker sub-agent that returns a concise,
evidence-cited digest, so the expensive model reasons over summaries instead of raw file dumps.
Unset = no delegation (current behavior).
> **Cost:** each specialist is its own review+recheck, so cost ≈ *specialists × models × 2*.
> The default suite runs on a **single** model. Trim with `GADFLY_SPECIALISTS`, let `auto` pick
> only what a diff needs, and point heavy legwork at a cheap `GADFLY_WORKER_MODEL`.
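For a quick back-of-the-envelope check of the cost note above, the formula can be evaluated directly. `estimate_passes` is a hypothetical helper that counts model passes (not tokens or money) and assumes the recheck pass is enabled.

```sh
# Rough pass count: specialists x models x 2 (review + recheck).
# $1 = number of specialists, $2 = number of models.
estimate_passes() {
  echo $(( $1 * $2 * 2 ))
}

estimate_passes 5 1   # default 5-lens suite on one model  -> 10
estimate_passes 5 3   # same suite on three models         -> 30
```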
### Concurrency (per-provider lanes)
With multiple models, each **provider** is its own lane and lanes run in **parallel**, so a fast
cloud provider isn't stuck behind a slow local box. Within a lane, at most `cap` models run at
once — `cap` comes from `GADFLY_PROVIDER_CONCURRENCY` (a `provider=N` map) else `GADFLY_CONCURRENCY`
(default `1`). The timeout is **per-lens** (`GADFLY_TIMEOUT_SECS`), so a slow model on one lens
can't starve the others.
```yaml
# One local box (serial — it serves one model at a time) + 3 cloud reviews at once,
# both lanes running concurrently:
GADFLY_PROVIDER_CONCURRENCY: "ollama-cloud=3,m1pro=1"
GADFLY_MODELS: "m1pro/qwen3:14b,qwen3-coder:480b-cloud,gpt-oss:120b-cloud"
```
A model's provider is the spec's first segment (`m1pro/…` → `m1pro`), or `GADFLY_PROVIDER`/
`ollama-cloud` for a bare id. Default (`cap 1`) keeps a single-provider pool fully sequential.
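A minimal sketch of the lane assignment just described, assuming it is a plain split on the first `/`. Both `model_lane` and `list_lanes` are hypothetical helpers, not Gadfly functions.

```sh
# Lane (provider) for one model spec: the first path segment, or
# GADFLY_PROVIDER / ollama-cloud for a bare id.
model_lane() {
  case "$1" in
    */*) printf '%s\n' "${1%%/*}" ;;
    *)   printf '%s\n' "${GADFLY_PROVIDER:-ollama-cloud}" ;;
  esac
}

# Print "spec -> lane" for each entry of a comma-separated GADFLY_MODELS value.
list_lanes() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r spec; do
    printf '%s -> %s\n' "$spec" "$(model_lane "$spec")"
  done
}

list_lanes "m1pro/qwen3:14b,qwen3-coder:480b-cloud,gpt-oss:120b-cloud"
```

With `GADFLY_PROVIDER` unset, the first model lands in the `m1pro` lane and the other two share the `ollama-cloud` lane.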
**Lens fan-out (within a model).** By default the specialist lenses run **sequentially** inside
each model (`GADFLY_LENS_CONCURRENCY=1`). Raise it to overlap the independent per-lens
review+recheck passes — the model then posts its consolidated comment as soon as its lenses
finish (so with sequential models, results stream in per model and per-model timings stay
clean). Like the model cap, it's **per-provider configurable**: `GADFLY_PROVIDER_LENS_CONCURRENCY`
takes a `provider=N` map keyed by the **same provider lanes** as `GADFLY_PROVIDER_CONCURRENCY`,
falling back to the `GADFLY_LENS_CONCURRENCY` scalar (default `1`). **It multiplies with the
model cap:** total in-flight requests ≈ *models-at-once × lenses-at-once*, so to fan lenses out
without oversubscribing a backend, keep its model cap low and raise its lens cap:
```yaml
# Per provider: cloud runs one model at a time but fans its 3 lenses out (3 concurrent requests);
# the slow local box stays fully serial. Both provider lanes still run in parallel.
GADFLY_PROVIDER_CONCURRENCY: "ollama-cloud=1,m1=1"
GADFLY_PROVIDER_LENS_CONCURRENCY: "ollama-cloud=3,m1=1"
GADFLY_SPECIALISTS: "security,correctness,error-handling"
```
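To see how the two caps multiply, the following sketch looks up both `provider=N` maps and prints the approximate number of in-flight requests per lane. `map_get` and `in_flight` are hypothetical helpers; the lookup order (per-provider entry, then the scalar, then `1`) follows the description above.

```sh
# $1 = "provider=N,..." map, $2 = provider, $3 = fallback value.
map_get() {
  padded=",$1,"
  case "$padded" in
    *",$2="*) value=${padded#*,"$2"=}        # text after ",provider="
              printf '%s\n' "${value%%,*}" ;; # up to the next comma
    *)        printf '%s\n' "$3" ;;
  esac
}

# Approximate in-flight requests for one provider lane.
in_flight() {
  models=$(map_get "${GADFLY_PROVIDER_CONCURRENCY:-}" "$1" "${GADFLY_CONCURRENCY:-1}")
  lenses=$(map_get "${GADFLY_PROVIDER_LENS_CONCURRENCY:-}" "$1" "${GADFLY_LENS_CONCURRENCY:-1}")
  echo $(( models * lenses ))
}

GADFLY_PROVIDER_CONCURRENCY="ollama-cloud=1,m1=1"
GADFLY_PROVIDER_LENS_CONCURRENCY="ollama-cloud=3,m1=1"
in_flight ollama-cloud   # 1 model x 3 lenses
in_flight m1             # 1 model x 1 lens
```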
### Live status board
When several models (each with several lenses) review a PR, the individual findings land in
**one comment per model** — but while that's in flight all you'd see is a row of
`⏳ Reviewing…` placeholders. So Gadfly also upserts **one consolidated status-board comment**
that aggregates every model's per-lens progress as it happens:
```
## 🪰 Gadfly — live review status
1/3 reviewers finished · updated 2026-06-27 18:14:56Z
#### `glm-5.2:cloud` · ollama-cloud — ⏳ 2/4 lenses
- ✅ security — No material issues found
- 🔄 correctness — running
- ⏸️ performance — queued
```
Each model process publishes its lenses (queued → running → finished + verdict) to a small
JSON file, and a background renderer in `entrypoint.sh` re-renders + upserts the single comment
every `GADFLY_STATUS_POLL_SECS` (default 12s) until the swarm finishes. It's advisory,
best-effort, and entirely separate from the per-model findings comments, which it never affects.
Turn it off with `GADFLY_STATUS_BOARD=0`.
### Triggers
1. A **new/reopened/ready** non-draft PR — automatic.
2. Commenting **`@gadfly review`** on a PR — re-review on demand (gated to allowed users).
3. **workflow_dispatch** — manual, with a `pr_number` input.
(Pushing new commits does *not* auto-re-review — comment `@gadfly review` after pushing
fixes. This keeps usage down.)
> **Comment trigger needs the workflow on your default branch.** Gitea runs `issue_comment`
> workflows from the **default branch**, so `@gadfly review` only works once this stub is
> merged to `main` (the `pull_request` auto-trigger works from the PR branch immediately).
>
> **Security:** the example stubs gate the comment trigger with a job-level
> `if: github.event_name != 'issue_comment' || github.actor == '<you>'` so an untrusted
> commenter can't start a secret-bearing run — edit it to your maintainers and keep it in
> sync with `GADFLY_ALLOWED_USERS` (the in-container check). `@gadfly review` is plain-text
> matched (configurable via `GADFLY_TRIGGER_PHRASE`), so no bot account is required; comments
> post as `gitea-actions`.
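The in-container gate can be pictured roughly like this. `should_trigger` is a hypothetical sketch, not the code in `entrypoint.sh`: it assumes a simple substring match on the phrase and an exact, whitespace-free comma-separated allow-list, and it cannot reproduce the collaborator lookup used when the list is empty. The user names are made up.

```sh
# $1 = comment body, $2 = commenting user. Returns 0 if a review should start.
should_trigger() {
  phrase="${GADFLY_TRIGGER_PHRASE:-@gadfly review}"
  case "$1" in
    *"$phrase"*) ;;          # phrase present, keep checking
    *) return 1 ;;           # no trigger phrase in the comment
  esac
  if [ -z "${GADFLY_ALLOWED_USERS:-}" ]; then
    return 0                 # empty list: the real check falls back to collaborators
  fi
  case ",$GADFLY_ALLOWED_USERS," in
    *",$2,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

GADFLY_ALLOWED_USERS="alice,bob"
should_trigger "pushed fixes, @gadfly review please" alice && echo "start review"
```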
## How it's packaged
```
cmd/gadfly/                       the agentic reviewer binary (majordomo + Ollama Cloud); zero deps beyond stdlib + majordomo
scripts/run.sh                    fetches the PR diff, runs the reviewer, upserts one labeled comment
scripts/status-board.sh           renders + upserts the single live status-board comment (per-lens progress)
scripts/system-prompt.txt         the reviewer persona + verification discipline
entrypoint.sh                     the container brains: trigger gating, clone, model loop (logic lives here, not in YAML)
Dockerfile                        multi-stage; build-time module creds (BuildKit secrets) never reach the final image
.gitea/workflows/build-image.yml  push to main → :latest; tag v* → :<tag> + :latest
examples/                         the ~15-line stub a consuming repo drops in
```
The image is published to `gitea.stevedudenhoeffer.com/steve/gadfly`. Every push to `main`
rebuilds and republishes `:latest` (plus `:sha-<short>`); pushing a `v*` tag publishes that
pinned version (plus `:latest`). Pin consumers to a `:vN` tag for stability, or track
`:latest` to ride main.
## Configuration (advanced)
The reviewer binary reads these (the stub/entrypoint set sane defaults):
| Env | Default | Meaning |
|-----|---------|---------|
| `GADFLY_MODEL` | — | model id, or `provider/model` spec, or majordomo alias/chain |
| `GADFLY_PROVIDER` | `ollama-cloud` | provider prefix for a bare model id |
| `GADFLY_BASE_URL` | — | override endpoint (OpenAI/Ollama-compatible servers) |
| `GADFLY_API_KEY` | — | provider key; falls back to the provider's standard env |
| `claude-code` model id | — | route a model through the bundled Claude Code CLI (`claude-code` / `claude-code/<model>`); see [Claude Code engine](#claude-code-engine-claude-code) for its `GADFLY_CLAUDE_*` knobs |
| `GADFLY_SPECIALISTS` | default suite | csv of lenses, `all`, or `auto` (dynamic selection) |
| `GADFLY_SELECTOR_MODEL` | review model | model that picks lenses in `auto` mode |
| `GADFLY_WORKER_MODEL` | — | cheap model for `delegate_investigation`; unset = no delegation |
| `GADFLY_WORKER_MAX_STEPS` | 8 | tool-step cap for a delegated worker run |
| `GADFLY_CONCURRENCY` | 1 | default max models run at once **per provider** |
| `GADFLY_PROVIDER_CONCURRENCY` | — | per-provider overrides, e.g. `ollama-cloud=3,m1pro=1` |
| `GADFLY_LENS_CONCURRENCY` | 1 | specialist lenses run at once **within a model** (× model cap = total in-flight) |
| `GADFLY_PROVIDER_LENS_CONCURRENCY` | — | per-provider lens overrides, same lanes as `GADFLY_PROVIDER_CONCURRENCY`, e.g. `ollama-cloud=3,m1=1` |
| `GADFLY_MAX_STEPS` | 24 | review-pass tool-step cap |
| `GADFLY_TIMEOUT_SECS` | 300 | deadline **per specialist lens** (review+recheck) |
| `GADFLY_RECHECK` | on | set `0`/`false` to skip the recheck pass |
| `GADFLY_RECHECK_MAX_STEPS` | 16 | recheck-pass step cap |
| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the prompt (full diff via `get_diff`) |
| `GADFLY_STATUS_BOARD` | on | set `0` to disable the live status-board comment |
| `GADFLY_STATUS_POLL_SECS` | 12 | how often the status board re-renders/upserts |
| `GADFLY_TRIGGER_PHRASE` | `@gadfly review` | comment phrase that re-triggers |
| `GADFLY_ALLOWED_USERS` | *(collaborators)* | comma-separated allow-list for comment triggers |
| `GADFLY_FINDINGS_URL` | — | gadfly-reports store base URL; set to enable findings telemetry (off when empty) |
| `GADFLY_FINDINGS_TOKEN` | — | bearer token for the gadfly-reports store (sent as `Authorization: Bearer …`) |
| `GADFLY_REPO` | *(from `GITEA_API`)* | `owner/repo` slug stamped on emitted runs/findings (set by `entrypoint.sh`) |
| `GADFLY_PR` | *(from event)* | PR number stamped on emitted runs/findings (set by `entrypoint.sh`) |
## Findings telemetry (optional)
Gadfly can record what it found so model quality can be tracked over time. It is
**off by default** and purely advisory: set **`GADFLY_FINDINGS_URL`** to a
[gadfly-reports](https://gitea.stevedudenhoeffer.com/steve/gadfly-reports) store base URL and,
after each review, the binary best-effort `POST`s the run (`/runs`) and the
findings it surfaced (`/reports`) to that store. Add **`GADFLY_FINDINGS_TOKEN`**
to send an `Authorization: Bearer …` header. `entrypoint.sh` supplies the run
context (`GADFLY_REPO`, `GADFLY_PR`) automatically.
Findings are extracted heuristically from each lens's markdown — a `path:line`
reference anchors a finding, titled by the nearest preceding heading / numbered
item / bold lead-in. A lens whose verdict is **"No material issues found"**
emits **no** findings: its `path:line` references are verification notes
("verified X is safe"), not problems, so extracting them would record false
positives and unfairly penalize thorough clean-pass reviewers. The emit is
strictly best-effort: a short (~10s) timeout, any error (or a non-2xx response)
is logged to stderr only, and it **never** changes the review output or the exit
code.
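The extraction rule can be approximated with a short filter. This is only a sketch of the idea, not the binary's heuristic: `extract_findings` is a hypothetical helper, the regular expression is a guess at what counts as a `path:line` reference, it ignores titles entirely, and it relies on `grep -o`, which is widely available but not required by POSIX. The sample file names in the usage lines are invented.

```sh
# Reads one lens's markdown on stdin and prints its path:line anchors.
# A clean-pass lens ("No material issues found") yields no findings at all.
extract_findings() {
  lens=$(cat)
  case "$lens" in
    *"No material issues found"*) return 0 ;;   # verification notes, not problems
  esac
  printf '%s\n' "$lens" \
    | grep -oE '[A-Za-z0-9_./-]+\.[A-Za-z0-9]+:[0-9]+' \
    | sort -u
}

printf '%s\n' '**Blocking issues found**' 'Nil deref at `cmd/gadfly/main.go:42`.' | extract_findings
printf '%s\n' '**No material issues found**' 'Verified scripts/run.sh:7 is safe.' | extract_findings
```

The first call prints one anchor; the second prints nothing, matching the clean-pass rule.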
## Building locally
```sh
go build ./cmd/gadfly # needs read access to the private majordomo module
go test ./...
```
## License
MIT — see [LICENSE](LICENSE).