Files
gadfly/README.md
T
steve d7f364d803
Build & push image / build-and-push (push) Successful in 8s
feat: optional findings telemetry — emit runs+findings to a gadfly-reports store
After each review the binary POSTs the run + its heuristically-extracted findings to GADFLY_FINDINGS_URL (off unless set). Advisory: any error only goes to stderr — never touches stdout, the exit code, or the review. stdlib net/http only (no new deps). entrypoint.sh derives GADFLY_REPO/GADFLY_PR and passes through GADFLY_FINDINGS_URL/GADFLY_FINDINGS_TOKEN. Also renames store references from the old 'docket' name to 'gadfly-reports'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 09:09:51 -04:00

288 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# 🪰 Gadfly
**An AI gadfly for your pull requests.** Gadfly is an *adversarial* code reviewer that
runs in Gitea Actions: on every PR it reads your actual repository, hunts for real
problems, verifies them against the code, and posts its findings as a comment. It does not
praise your code. A gadfly does not let things slide.
> ### 🤖 Heads up: this is a vibe-coded project
> Gadfly was built almost entirely by an AI agent (Claude Code), prompts and all — the
> reviewer's "brain" is a language model, and so was most of the author. It works and it's
> tested, but treat it accordingly: **it is advisory only, it never blocks a merge, and you
> should still review its reviews.** Issues and PRs welcome; expect the occasional
> AI-flavored rough edge.
## What makes it different
Most LLM "review my diff" bots read the diff in isolation and hallucinate problems they
can't actually see — a "missing import" that's three lines above the hunk, a "broken
caller" in a file they never opened. Gadfly is **agentic**: the model has read-only tools
over the checked-out repo and is *required* to use them before reporting anything.
- **Tools:** `read_file`, `list_dir`, `grep`, `find_files`, `get_diff`.
- **Verify-before-claiming discipline:** baked into the system prompt — open the file,
grep the symbol, or drop the finding.
- **Two passes:** a *review* pass drafts findings, then an adversarial *recheck* pass
independently re-verifies each one against the code and drops the ones it can't confirm,
recomputing the verdict. This is what kills "confident but wrong."
- **Semantic-bug hunting:** it's told not to trust a plausible-looking constant, conversion
factor, or formula — re-derive the expected value, because that's where real bugs hide.
Every review leads with a one-line verdict: **No material issues found**, **Minor issues**,
or **Blocking issues found**.
## Turn it on for a repo
Gadfly ships as a container image, so consuming repos don't build anything — they just run
it. Drop one file in your repo and set a couple of secrets/vars:
1. Copy a stub from [`examples/`](examples/) to `.gitea/workflows/adversarial-review.yml` in
your repo — [`adversarial-review.yml`](examples/adversarial-review.yml) for the Ollama Cloud
default, or a provider-specific one (local Ollama, OpenAI-compatible, endpoint aliases). See
the [examples index](examples/README.md).
2. Add repo config:
- **secret** `OLLAMA_CLOUD_API_KEY` — your [Ollama Cloud](https://ollama.com) key (empty
⇒ Gadfly posts a harmless "not configured" notice instead of reviewing). *Not needed if
you point Gadfly at a different provider — see [Models & providers](#models--providers).*
- **var** `OLLAMA_REVIEW_MODELS` *(optional)* — comma-separated model ids
(default `qwen3-coder:480b-cloud,gpt-oss:120b-cloud`). One comment per model.
- **var** `GADFLY_ALLOWED_USERS` *(optional)* — who may re-trigger via comment; empty ⇒
any repo collaborator.
`GITEA_TOKEN` is provided automatically by Actions; comments post as the `gitea-actions`
user, scoped to that repo — no bot account needed.
## Models & providers
Gadfly is built on [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo), so the
reviewer model is not hard-wired — it can target anything majordomo supports. Pick a provider
by setting `GADFLY_PROVIDER` (used to prefix bare model ids); point at a custom endpoint with
`GADFLY_BASE_URL`; supply a key with `GADFLY_API_KEY` or the provider's standard env var. A
`GADFLY_MODEL`/`GADFLY_MODELS` value that already contains a `provider/` prefix (or is a
majordomo failover chain / alias) is used verbatim.
| Provider | `GADFLY_PROVIDER` | Key env | Status |
|----------|-------------------|---------|--------|
| **Ollama Cloud** (default) | `ollama-cloud` | `OLLAMA_API_KEY` / `OLLAMA_CLOUD_API_KEY` | ✅ in active use |
| **Local Ollama** | `ollama` | none (`OLLAMA_HOST` or `GADFLY_BASE_URL` for a remote daemon) | ✅ tested |
| **[foreman](https://gitea.stevedudenhoeffer.com/steve/foreman)** (native-Ollama queue daemon) | `foreman` + `GADFLY_BASE_URL`, or a `GADFLY_ENDPOINT_*` / `LLM_*` `foreman://` entry | optional bearer (via the endpoint/DSN) | ✅ native-Ollama path |
| **OpenAI-compatible** (incl. local Ollama's `/v1`) | `openai` + `GADFLY_BASE_URL` | `OPENAI_API_KEY` (any non-empty for Ollama) | ✅ tested against Ollama |
| **OpenAI** | `openai` | `OPENAI_API_KEY` | ⚠️ wired, **untested** |
| **Anthropic** | `anthropic` | `ANTHROPIC_API_KEY` | ⚠️ wired, **untested** |
| **Google (Gemini)** | `google` | `GOOGLE_API_KEY` / `GEMINI_API_KEY` | ⚠️ wired, **untested** |
> ### 🧪 Honest status
> Only the **Ollama** paths above are actually exercised. The OpenAI / Anthropic / Google
> providers come "for free" from majordomo's abstraction and *should* work, but I haven't
> spent money verifying them — treat them as untested. The OpenAI-**compatible** path **is**
> tested, because you can point it at a local Ollama (`GADFLY_BASE_URL=http://localhost:11434/v1`)
> and exercise the exact same code an OpenAI/OpenRouter endpoint would hit, for free. If you
> try a cloud provider and it works (or doesn't), please open an issue.
### Endpoint aliases via env vars
For multiple named backends (e.g. a couple of Ollama boxes on your LAN), register them by
name with env vars and then reference `name/model` in `GADFLY_MODEL`/`GADFLY_MODELS`:
```sh
# http-capable (Gadfly-native) — base URL used verbatim, so plaintext LAN works:
GADFLY_ENDPOINT_BIGBOX="ollama|http://192.168.1.50:11434"
GADFLY_ENDPOINT_GPU="openai|http://gpu.lan:8000/v1|sk-local"
GADFLY_ENDPOINT_M1="foreman|http://foreman-m1:8080|tok" # native-Ollama queue daemon
GADFLY_MODELS="bigbox/qwen2.5-coder:7b,gpu/llama3.1,m1/qwen3:14b"
# pure spec alias (a model, or a failover chain):
GADFLY_ALIAS_FAST="bigbox/qwen2.5-coder:7b,ollama-cloud/gpt-oss:120b-cloud"
GADFLY_MODEL="fast"
```
`<NAME>` is lowercased to form the registry name (`GADFLY_ENDPOINT_BIGBOX``bigbox`). This
is the same idea as majordomo's built-in **`LLM_*` env DSNs** (`LLM_BIGBOX=ollama://tok@host`,
`LLM_M1=foreman://tok@host`), which Gadfly also honors — but those are **HTTPS-only**, so for a
plaintext local Ollama or `http://` foreman use `GADFLY_ENDPOINT_*` instead.
> **Gitea Actions note:** repo `vars`/`secrets` aren't auto-exposed as env — add each alias to
> the stub workflow's `env:` block, e.g. `GADFLY_ENDPOINT_BIGBOX: ${{ vars.GADFLY_ENDPOINT_BIGBOX }}`.
## Specialists (the review swarm)
Instead of one generic reviewer, Gadfly runs a **suite of specialists** — each a focused lens
with its own review (+recheck) pass — and merges them into **one comment**, a collapsible
section per lens, led by an overall verdict (the worst across lenses; the optional
`improvements` lens never escalates it).
**Default suite** (when nothing is configured):
`security`, `correctness`, `maintainability` (code cleanliness), `performance`, `error-handling`.
**Also built in** (opt-in by name): `tests`, `docs`, `conventions`, and `improvements`
(strict & quiet — at most 12 high-value, non-blocking suggestions, silent otherwise).
Select which run with **`GADFLY_SPECIALISTS`** (comma-separated names, or `all`):
```yaml
GADFLY_SPECIALISTS: "security,correctness,maintainability,tests"
```
**Define your own** — two ways, which compose (env overrides file overrides built-ins):
```yaml
# 1. env: GADFLY_SPECIALIST_<NAME>="<focus>" (also overrides a built-in by reusing its name)
GADFLY_SPECIALIST_MIGRATIONS: "Review DB migrations for destructive or unindexed changes."
GADFLY_SPECIALISTS: "security,correctness,migrations"
```
```yaml
# 2. a repo .gadfly.yml at the repo root (version-controlled). See examples/.gadfly.yml:
specialists: [security, correctness, maintainability, migrations]
define:
- name: migrations
title: "🗃️ DB migrations"
focus: "Review schema migrations for destructive ops, missing indexes, table locks."
```
**Dynamic selection (`auto`):** set `GADFLY_SPECIALISTS: auto` and a selector model reads the
changed files + PR description and picks only the lenses that materially apply (and may invent
an ad-hoc one — e.g. a "migrations" lens for a schema change). The selector is
`GADFLY_SELECTOR_MODEL` if set (a cheap tier is ideal), else the review model. Capped and
de-duplicated; falls back to the default suite if selection fails.
**Worker-tier delegation:** set `GADFLY_WORKER_MODEL` (a cheap/fast model) to give every
reviewer a `delegate_investigation` tool — it offloads mechanical legwork (trace all callers,
gather every usage, check a pattern across files) to a worker sub-agent that returns a concise,
evidence-cited digest, so the expensive model reasons over summaries instead of raw file dumps.
Unset = no delegation (current behavior).
> **Cost:** each specialist is its own review+recheck, so cost ≈ *specialists × models × 2*.
> The default suite runs on a **single** model. Trim with `GADFLY_SPECIALISTS`, let `auto` pick
> only what a diff needs, and point heavy legwork at a cheap `GADFLY_WORKER_MODEL`.
### Concurrency (per-provider lanes)
With multiple models, each **provider** is its own lane and lanes run in **parallel**, so a fast
cloud provider isn't stuck behind a slow local box. Within a lane, at most `cap` models run at
once — `cap` comes from `GADFLY_PROVIDER_CONCURRENCY` (a `provider=N` map) else `GADFLY_CONCURRENCY`
(default `1`). The timeout is **per-lens** (`GADFLY_TIMEOUT_SECS`), so a slow model on one lens
can't starve the others.
```yaml
# One local box (serial — it serves one model at a time) + 3 cloud reviews at once,
# both lanes running concurrently:
GADFLY_PROVIDER_CONCURRENCY: "ollama-cloud=3,m1pro=1"
GADFLY_MODELS: "m1pro/qwen3:14b,qwen3-coder:480b-cloud,gpt-oss:120b-cloud"
```
A model's provider is the spec's first segment (`m1pro/…``m1pro`), or `GADFLY_PROVIDER`/
`ollama-cloud` for a bare id. Default (`cap 1`) keeps a single-provider pool fully sequential.
**Lens fan-out (within a model).** By default the specialist lenses run **sequentially** inside
each model (`GADFLY_LENS_CONCURRENCY=1`). Raise it to overlap the independent per-lens
review+recheck passes — the model then posts its consolidated comment as soon as its lenses
finish (so with sequential models, results stream in per model and per-model timings stay
clean). Like the model cap, it's **per-provider configurable**: `GADFLY_PROVIDER_LENS_CONCURRENCY`
takes a `provider=N` map keyed by the **same provider lanes** as `GADFLY_PROVIDER_CONCURRENCY`,
falling back to the `GADFLY_LENS_CONCURRENCY` scalar (default `1`). **It multiplies with the
model cap:** total in-flight requests ≈ *models-at-once × lenses-at-once*, so to fan lenses out
without oversubscribing a backend, keep its model cap low and raise its lens cap:
```yaml
# Per provider: cloud runs one model at a time but fans its 3 lenses out (3 concurrent requests);
# the slow local box stays fully serial. Both provider lanes still run in parallel.
GADFLY_PROVIDER_CONCURRENCY: "ollama-cloud=1,m1=1"
GADFLY_PROVIDER_LENS_CONCURRENCY: "ollama-cloud=3,m1=1"
GADFLY_SPECIALISTS: "security,correctness,error-handling"
```
### Triggers
1. A **new/reopened/ready** non-draft PR — automatic.
2. Commenting **`@gadfly review`** on a PR — re-review on demand (gated to allowed users).
3. **workflow_dispatch** — manual, with a `pr_number` input.
(Pushing new commits does *not* auto-re-review — comment `@gadfly review` after pushing
fixes. This keeps usage down.)
> **Comment trigger needs the workflow on your default branch.** Gitea runs `issue_comment`
> workflows from the **default branch**, so `@gadfly review` only works once this stub is
> merged to `main` (the `pull_request` auto-trigger works from the PR branch immediately).
>
> **Security:** the example stubs gate the comment trigger with a job-level
> `if: github.event_name != 'issue_comment' || github.actor == '<you>'` so an untrusted
> commenter can't start a secret-bearing run — edit it to your maintainers and keep it in
> sync with `GADFLY_ALLOWED_USERS` (the in-container check). `@gadfly review` is plain-text
> matched (configurable via `GADFLY_TRIGGER_PHRASE`), so no bot account is required; comments
> post as `gitea-actions`.
## How it's packaged
```
cmd/gadfly/ the agentic reviewer binary (majordomo + Ollama Cloud); zero deps beyond stdlib + majordomo
scripts/run.sh fetches the PR diff, runs the reviewer, upserts one labeled comment
scripts/system-prompt.txt the reviewer persona + verification discipline
entrypoint.sh the container brains: trigger gating, clone, model loop (logic lives here, not in YAML)
Dockerfile multi-stage; build-time module creds (BuildKit secrets) never reach the final image
.gitea/workflows/build-image.yml push to main → :latest; tag v* → :<tag> + :latest
examples/ the ~15-line stub a consuming repo drops in
```
The image is published to `gitea.stevedudenhoeffer.com/steve/gadfly`. Every push to `main`
rebuilds and republishes `:latest` (plus `:sha-<short>`); pushing a `v*` tag publishes that
pinned version (plus `:latest`). Pin consumers to a `:vN` tag for stability, or track
`:latest` to ride main.
## Configuration (advanced)
The reviewer binary reads these (the stub/entrypoint set sane defaults):
| Env | Default | Meaning |
|-----|---------|---------|
| `GADFLY_MODEL` | — | model id, or `provider/model` spec, or majordomo alias/chain |
| `GADFLY_PROVIDER` | `ollama-cloud` | provider prefix for a bare model id |
| `GADFLY_BASE_URL` | — | override endpoint (OpenAI/Ollama-compatible servers) |
| `GADFLY_API_KEY` | — | provider key; falls back to the provider's standard env |
| `GADFLY_SPECIALISTS` | default suite | csv of lenses, `all`, or `auto` (dynamic selection) |
| `GADFLY_SELECTOR_MODEL` | review model | model that picks lenses in `auto` mode |
| `GADFLY_WORKER_MODEL` | — | cheap model for `delegate_investigation`; unset = no delegation |
| `GADFLY_WORKER_MAX_STEPS` | 8 | tool-step cap for a delegated worker run |
| `GADFLY_CONCURRENCY` | 1 | default max models run at once **per provider** |
| `GADFLY_PROVIDER_CONCURRENCY` | — | per-provider overrides, e.g. `ollama-cloud=3,m1pro=1` |
| `GADFLY_LENS_CONCURRENCY` | 1 | specialist lenses run at once **within a model** (× model cap = total in-flight) |
| `GADFLY_PROVIDER_LENS_CONCURRENCY` | — | per-provider lens overrides, same lanes as `GADFLY_PROVIDER_CONCURRENCY`, e.g. `ollama-cloud=3,m1=1` |
| `GADFLY_MAX_STEPS` | 24 | review-pass tool-step cap |
| `GADFLY_TIMEOUT_SECS` | 300 | deadline **per specialist lens** (review+recheck) |
| `GADFLY_RECHECK` | on | set `0`/`false` to skip the recheck pass |
| `GADFLY_RECHECK_MAX_STEPS` | 16 | recheck-pass step cap |
| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the prompt (full diff via `get_diff`) |
| `GADFLY_TRIGGER_PHRASE` | `@gadfly review` | comment phrase that re-triggers |
| `GADFLY_ALLOWED_USERS` | *(collaborators)* | comma-separated allow-list for comment triggers |
| `GADFLY_FINDINGS_URL` | — | gadfly-reports store base URL; set to enable findings telemetry (off when empty) |
| `GADFLY_FINDINGS_TOKEN` | — | bearer token for the gadfly-reports store (sent as `Authorization: Bearer …`) |
| `GADFLY_REPO` | *(from `GITEA_API`)* | `owner/repo` slug stamped on emitted runs/findings (set by `entrypoint.sh`) |
| `GADFLY_PR` | *(from event)* | PR number stamped on emitted runs/findings (set by `entrypoint.sh`) |
## Findings telemetry (optional)
Gadfly can record what it found so model quality can be tracked over time. It is
**off by default** and purely advisory: set **`GADFLY_FINDINGS_URL`** to a
[gadfly-reports](https://gitea.stevedudenhoeffer.com/steve/gadfly-reports) store base URL and,
after each review, the binary best-effort `POST`s the run (`/runs`) and the
findings it surfaced (`/reports`) to that store. Add **`GADFLY_FINDINGS_TOKEN`**
to send an `Authorization: Bearer …` header. `entrypoint.sh` supplies the run
context (`GADFLY_REPO`, `GADFLY_PR`) automatically.
Findings are extracted heuristically from each lens's markdown — a `path:line`
reference anchors a finding, titled by the nearest preceding heading / numbered
item / bold lead-in. The emit is strictly best-effort: a short (~10s) timeout,
any error (or a non-2xx response) is logged to stderr only, and it **never**
changes the review output or the exit code.
## Building locally
```sh
go build ./cmd/gadfly # needs read access to the private majordomo module
go test ./...
```
## License
MIT — see [LICENSE](LICENSE).