b860119332
Fixes the large-PR token burn: a ~250K-token diff was re-sent every agent step across models × lenses × passes, draining a metered usage block in minutes. Small PRs are untouched (every mitigation is size-gated / no-op under threshold). - Re-platform the in-process review path onto executus run.Executor: context compaction (executus/compact, threshold from the model's real context window via executus/model), run-bounding, a per-PR budget gate (Ports.Budget), and the wrap-up nudge re-expressed as a run.Critic. Lens fan-out now uses executus/fanout. gadfly keeps its own model.go, so GADFLY_ENDPOINT_<NAME> aliases and the claude-code engine are unaffected. No majordomo bump; the binary stays static (executus core is majordomo+stdlib only). - Paginate get_diff (per-file `path` + start_line/limit) instead of dumping the whole diff; trim the recheck diff embed (60k -> 20k chars). - entrypoint.sh: downshift the fleet above GADFLY_HUGE_DIFF_BYTES (one cheap model, fewer lenses/steps, no recheck) + a swarm-wide GADFLY_PR_BUDGET_SECS wall-clock backstop (adds procps for pkill). All advisory; CI never fails. - README + CLAUDE.md + tests updated. Note: run.Result exposes no transcript, so the old transcript-based forced- finalization fallback is dropped; the wrap-up critic nudge is the remaining "always emit something" mechanism. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
176 lines
12 KiB
Markdown
176 lines
12 KiB
Markdown
# Gadfly — Developer Guide
|
||
|
||
Gadfly (🪰) is an **agentic adversarial code reviewer** that runs in Gitea Actions. On a pull
|
||
request it reads the *checked-out repository* with read-only tools, hunts for real problems,
|
||
verifies each one against the actual code, and posts its findings as a comment. It is
|
||
**advisory only** — it never blocks a merge.
|
||
|
||
> This is a public, **vibe-coded** project (built largely by an AI agent). Keep that framing
|
||
> honest in the README; don't oversell it.
|
||
|
||
## Project goals (keep changes aligned to these)
|
||
|
||
1. **Find *real* problems, not nits.** The whole point of the agentic tools + two-pass
|
||
recheck is to kill diff-only false positives. Anything that raises the false-positive rate
|
||
(or removes verification) works against the project.
|
||
2. **Advisory, never blocking.** Gadfly must never fail a CI job for review *content*, never
|
||
merge, never deploy. Non-zero exit only on usage/config errors; even then run.sh posts a
|
||
notice rather than failing. Do not add it to branch-protection required checks.
|
||
3. **Easy to turn on for any repo.** Consumers should need only a ~15-line stub workflow + a
|
||
couple of secrets/vars. All real logic lives in the image (`entrypoint.sh`), not in the
|
||
consumer's YAML (Gitea's act_runner has weak YAML expression support).
|
||
4. **Provider-agnostic.** Powered by [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo),
|
||
so it can target Ollama (local/cloud), OpenAI, Anthropic, Google, or any
|
||
OpenAI/Ollama-compatible endpoint. Don't re-hardcode a single provider.
|
||
5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib, majordomo, and
|
||
[executus](https://gitea.stevedudenhoeffer.com/steve/executus) (whose *core* — `run`/`compact`/
|
||
`model`/`fanout`/`tool` — is itself majordomo+stdlib only, so the binary stays static; do NOT
|
||
pull executus's `contrib/store` or any battery that drags in a DB driver). No heavyweight deps,
|
||
no coupling to any one consumer repo (e.g. mort). Gadfly is executus's canonical *light* consumer.
|
||
|
||
## Architecture
|
||
|
||
```
|
||
cmd/gadfly/ the reviewer binary — pure producer of review markdown (stdout)
|
||
main.go orchestration: fan specialists out (executus/fanout), each a review pass + recheck
|
||
engine.go reviewEngine abstraction: executus run.Executor (majordomo agent loop +
|
||
compaction/bounding/budget/critic) vs claude-code CLI shell-out
|
||
executus.go executus wiring: tool.Registry over the repo tools, the run.Executor build
|
||
(compact + model context-limit threshold + per-PR budget + wrap-up critic)
|
||
specialists.go specialist lenses: built-ins, default suite, env + .gadfly.yml resolution
|
||
auto.go dynamic `auto` selection: a selector model picks lenses per-diff (may invent)
|
||
delegate.go worker-tier delegate_investigation tool (cheap sub-agent does legwork)
|
||
consolidate.go verdict parsing + one-comment consolidation (a section per specialist)
|
||
model.go provider/model + selector + worker resolution (majordomo.Parse) + endpoint aliases
|
||
tools.go the 5 read-only repo tools (read_file/list_dir/grep/find_files/get_diff)
|
||
recheck.go second-pass verification prompt + verdict recompute
|
||
*_test.go sandbox, recheck, wrap-up, spec/endpoint-parse, specialist-resolution tests
|
||
scripts/run.sh fetch PR diff+meta, run the binary, upsert ONE labeled PR comment
|
||
scripts/status-board.sh render+upsert ONE live status-board comment (per-model/per-lens progress)
|
||
scripts/system-prompt.txt the reviewer persona + verification discipline (generic, not repo-specific)
|
||
entrypoint.sh container brains: trigger gating, PR clone, model loop (the logic that
|
||
used to live in workflow YAML)
|
||
Dockerfile multi-stage; private-module creds via BuildKit secrets never reach the final image
|
||
.gitea/workflows/build-image.yml push main → :latest; tag v* → :<tag>+:latest; PR → build-only
|
||
.gitea/workflows/review-reusable.yml reusable (workflow_call) review job; resolves swarm config at
|
||
RUNTIME: consumer `with:` input → owner user-scope var (GADFLY_DEFAULT_MODELS /
|
||
_SPECIALISTS / _PROVIDER_CONCURRENCY / _PROVIDER_LENS_CONCURRENCY, +
|
||
GADFLY_ENDPOINT_RAGNAROS) → image default. Vars are injected per-run, so editing
|
||
one var retunes the whole fleet even though long-lived act_runners CACHE this file
|
||
by ref (a moved tag is NOT re-fetched — only a runtime value or a fresh @<sha>
|
||
bypasses the cache). Consumers subscribe with an ~8-line caller forwarding only the
|
||
secrets the reviewer needs and pinned to an immutable @<sha> (Phase 4);
|
||
gadfly's own adversarial-review.yml is a thin caller of it (dogfoods the path).
|
||
examples/ copy-paste consumer stub workflows for different providers
|
||
```
|
||
|
||
**Data flow:** consumer stub workflow → container `entrypoint.sh` (gate + clone) →
|
||
`scripts/run.sh` (per model) → `cmd/gadfly` binary (agentic review) → markdown → run.sh
|
||
upserts a PR comment as `gitea-actions`.
|
||
|
||
**Two passes:** a *review* pass drafts findings; an adversarial *recheck* pass independently
|
||
re-verifies each finding against the code and drops the unconfirmed ones, recomputing the
|
||
verdict. Verdict is one of: `No material issues found` / `Minor issues` / `Blocking issues found`.
|
||
|
||
## Build / test
|
||
|
||
```sh
|
||
go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
|
||
go test ./...
|
||
gofmt -l cmd/ # must be clean
|
||
docker build -t gadfly:dev --secret id=REGISTRY_USER,env=REGISTRY_USER --secret id=REGISTRY_PASSWORD,env=REGISTRY_PASSWORD .
|
||
```
|
||
|
||
Run it locally against a real diff without CI:
|
||
|
||
```sh
|
||
git -C /path/to/repo diff main > /tmp/x.diff
|
||
GADFLY_PROVIDER=ollama GADFLY_MODEL=qwen2.5-coder:7b \
|
||
GADFLY_REPO_DIR=/path/to/repo GADFLY_DIFF_FILE=/tmp/x.diff \
|
||
GADFLY_SYSTEM_FILE=scripts/system-prompt.txt ./gadfly
|
||
```
|
||
|
||
## Release / deploy
|
||
|
||
- **Push to `main`** → CI builds and pushes `:latest` (+ `:sha-<short>`).
|
||
- **Tag `v*`** → publishes `:<tag>` (+ `:latest`). Pin consumers to `:vN` for stability.
|
||
- Required CI secrets: `REGISTRY_USER` / `REGISTRY_PASSWORD` (registry push + read access to the
|
||
private majordomo module). Optional `DISCORD_WEBHOOK_URL`.
|
||
|
||
## Configuration
|
||
|
||
The full env reference lives in the **README** (`Specialists`, `Models & providers`,
|
||
`Configuration`). Provider selection: `GADFLY_PROVIDER` (default `ollama-cloud`),
|
||
`GADFLY_MODEL`/`GADFLY_MODELS`, `GADFLY_BASE_URL`, `GADFLY_API_KEY`. Named endpoint aliases via
|
||
`GADFLY_ENDPOINT_<NAME>` / `GADFLY_ALIAS_<NAME>` (http-capable) and majordomo `LLM_*` DSNs
|
||
(HTTPS-only).
|
||
|
||
**Specialists (the swarm):** the reviewer runs a suite of focused lenses, one consolidated
|
||
comment with a section each. Default suite = security/correctness/maintainability/performance/
|
||
error-handling; opt-in built-ins = tests/docs/conventions/improvements. Select via
|
||
`GADFLY_SPECIALISTS` (csv or `all`); define/override via `GADFLY_SPECIALIST_<NAME>` env or a repo
|
||
`.gadfly.yml` (`specialists:` + `define:`). See `cmd/gadfly/specialists.go`. Cost ≈
|
||
specialists × models × 2 passes — the **image/entrypoint** default stays minimal (one model) for
|
||
that reason; the **reusable workflow** (`review-reusable.yml`) deliberately ships a heavier
|
||
opinionated default swarm (3 cloud + Claude Code, 5 lenses) for steve's own fleet, which consumers
|
||
inherit or override per-input.
|
||
**Dynamic `auto`** (`GADFLY_SPECIALISTS=auto`): a selector (`GADFLY_SELECTOR_MODEL` or the review
|
||
model) picks lenses per-diff and may invent ad-hoc ones (`cmd/gadfly/auto.go`). **Worker-tier**
|
||
(`GADFLY_WORKER_MODEL`): a `delegate_investigation` tool offloads grep/read legwork to a cheap
|
||
sub-agent (`cmd/gadfly/delegate.go`).
|
||
|
||
**Tested vs untested:** only the Ollama paths (local + OpenAI-compatible pointed at Ollama)
|
||
are actually exercised. OpenAI/Anthropic/Google come from majordomo's abstraction and are
|
||
**untested** (no spend). Keep the README honest about this; update it if that changes.
|
||
|
||
## When making changes — maintenance rules
|
||
|
||
- **Keep the README and `examples/` current.** Any change to env vars, flags, defaults,
|
||
triggers, provider support, or the consumer stub MUST be reflected in `README.md` and the
|
||
relevant files under `examples/` in the *same* change. The README's `Configuration` table,
|
||
the `Models & providers` table, and the example workflows are the contract users rely on —
|
||
stale docs are a bug.
|
||
- **Preserve the advisory-only invariant** (goal #2). If you touch exit codes or the workflow,
|
||
re-confirm a review can never fail/block a consumer's CI.
|
||
- **Don't add mort-specific (or any single-consumer) assumptions** to the binary or system
|
||
prompt. The system prompt is intentionally generic; repo-specific conventions should be
|
||
discovered by the agent at runtime (it can read the repo's own CONTRIBUTING/CLAUDE.md), not
|
||
hardcoded here.
|
||
- **Keep secrets out of image layers.** Private-module creds flow via BuildKit `--mount=type=secret`
|
||
in the build stage only; never bake them into the final image or commit them.
|
||
- Add a test when you add logic (see the `*_test.go` patterns). Keep `gofmt` clean and `go vet` quiet.
|
||
|
||
## Lessons
|
||
|
||
- majordomo's `LLM_*` env DSNs are **HTTPS-only** (`DSN.BaseURL()` forces `https://`), so they
|
||
can't express a plaintext local Ollama. That's why Gadfly adds the http-capable
|
||
`GADFLY_ENDPOINT_<NAME>="provider|base-url[|key]"` mechanism (see `cmd/gadfly/model.go`).
|
||
- Gitea `vars`/`secrets` are **not** auto-exposed as env in a job — the consumer stub must map
|
||
each one explicitly in its `env:` block (dynamic alias names can't be auto-enumerated).
|
||
- **`uses: docker://…:latest` is CACHED by act_runner** — a freshly-pushed `:latest` is often
|
||
NOT re-pulled, so the job silently runs the previous image. For a run that must use a specific
|
||
build (e.g. validating a just-pushed fix), pin the consumer stub to the immutable
|
||
`:sha-<short>` tag the build publishes, not `:latest`.
|
||
- **Concurrency is per-provider** (`entrypoint.sh`): each provider is a lane, lanes run in
|
||
parallel, `cap` (from `GADFLY_PROVIDER_CONCURRENCY` else `GADFLY_CONCURRENCY`, default 1) bounds
|
||
models-at-once within a lane. The review timeout (`GADFLY_TIMEOUT_SECS`) is **per-lens**, not
|
||
shared across the suite — a slow model can't starve later lenses (the original timeout bug).
|
||
- **Large-PR token burn**: the agent loop re-sends the whole transcript every step, so a giant
|
||
diff (the old `get_diff` dumped it untruncated, and it was embedded in both the review and
|
||
recheck task) was re-transmitted ~steps × lenses × passes × models times — a ~250 K-token PR
|
||
could drain a metered usage block in minutes. Fixed in three size-gated layers (small PRs
|
||
untouched): paginated `get_diff` + `executus/compact` compaction in the binary; an
|
||
`entrypoint.sh` downshift above `GADFLY_HUGE_DIFF_BYTES` (one cheap model, fewer lenses/steps,
|
||
no recheck); and a swarm-wide `GADFLY_PR_BUDGET_SECS` wall-clock backstop. Compaction's threshold
|
||
is intentionally LOW (`GADFLY_COMPACT_RATIO` 0.45, not executus's 0.7) because the burning
|
||
transcript on the embedded path rarely reaches 0.7×context.
|
||
- **executus re-platform**: the in-process review path runs through `executus/run`'s `run.Executor`
|
||
(compaction, run-bounding, `Ports.Budget`, the wrap-up nudge as `Ports.Critic`), wiring it in
|
||
`cmd/gadfly/executus.go`. Gadfly KEEPS its own `model.go` resolution (so `GADFLY_ENDPOINT_<NAME>`
|
||
http aliases + the claude-code engine survive) and only hands `run.Executor` the already-resolved
|
||
model via a trivial resolver — do NOT route review-model resolution through
|
||
`model.ParseModelForContext` (it bypasses gadfly's endpoint aliases). `run.Result` exposes no
|
||
transcript, so the old transcript-based forced-finalization fallback is gone; the wrap-up critic
|
||
nudge is the remaining "always emit something" mechanism. The claude-code engine still shells out
|
||
and is unaffected.
|