feat: re-platform agentic review onto executus + large-PR cost controls
Build & push image / build-and-push (pull_request) Successful in 11s
Adversarial Review (Gadfly) / review (pull_request) Successful in 25m56s

Fixes the large-PR token burn: a ~250K-token diff was re-sent every agent
step across models × lenses × passes, draining a metered usage block in
minutes. Small PRs are untouched (every mitigation is size-gated / no-op
under threshold).

- Re-platform the in-process review path onto executus run.Executor: context
  compaction (executus/compact, threshold from the model's real context window
  via executus/model), run-bounding, a per-PR budget gate (Ports.Budget), and
  the wrap-up nudge re-expressed as a run.Critic. Lens fan-out now uses
  executus/fanout. gadfly keeps its own model.go, so GADFLY_ENDPOINT_<NAME>
  aliases and the claude-code engine are unaffected. No majordomo bump; the
  binary stays static (executus core is majordomo+stdlib only).
- Paginate get_diff (per-file `path` + start_line/limit) instead of dumping the
  whole diff; trim the recheck diff embed (60k -> 20k chars).
- entrypoint.sh: downshift the fleet above GADFLY_HUGE_DIFF_BYTES (one cheap
  model, fewer lenses/steps, no recheck) + a swarm-wide GADFLY_PR_BUDGET_SECS
  wall-clock backstop (adds procps for pkill). All advisory; CI never fails.
- README + CLAUDE.md + tests updated.

Note: run.Result exposes no transcript, so the old transcript-based forced-
finalization fallback is dropped; the wrap-up critic nudge is the remaining
"always emit something" mechanism.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-30 10:43:34 -04:00
parent 5007597cf9
commit b860119332
19 changed files with 935 additions and 224 deletions
+42 -2
View File
@@ -396,7 +396,16 @@ The reviewer binary reads these (the stub/entrypoint set sane defaults):
| `GADFLY_TIMEOUT_SECS` | 300 | deadline **per specialist lens** (review+recheck) |
| `GADFLY_RECHECK` | on | set `0`/`false` to skip the recheck pass |
| `GADFLY_RECHECK_MAX_STEPS` | 16 | recheck-pass step cap |
| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the prompt (full diff via `get_diff`) |
| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the **review** prompt (the full diff is reachable via the paginated `get_diff` tool, scoped per file with its `path` arg) |
| `GADFLY_RECHECK_DIFF_CHARS` | 20000 | diff chars embedded in the **recheck** prompt (smaller — the recheck pages `get_diff` for the hunks it verifies) |
| `GADFLY_COMPACT` | on | context compaction (via [executus](https://gitea.stevedudenhoeffer.com/steve/executus)): fold the transcript's runaway middle into a summary as it nears the model's context window, so a big diff + accumulating tool output can't balloon every step. `0` disables |
| `GADFLY_COMPACT_RATIO` | 0.45 | fraction of the model's context window at which compaction fires |
| `GADFLY_COMPACT_MODEL` | worker, else review model | cheap model the compactor uses to summarize the folded middle |
| `GADFLY_COMPACT_KEEP_RECENT` | 8 | most-recent messages kept verbatim during compaction |
| `GADFLY_COMPACT_SUMMARY_WORDS` | 200 | word cap on the compaction summary |
| `GADFLY_MODEL_CONTEXT_TOKENS` | *(auto)* | override the model's context-window size (tokens) for the compaction threshold; set it for self-hosted endpoints executus can't introspect (Ollama Cloud models resolve automatically) |
| `GADFLY_PR_TOKEN_BUDGET` | — | per-model token ceiling for this PR; once spent, remaining lenses/passes are skipped (advisory). 0 = off |
| `GADFLY_PR_TIME_BUDGET_SECS` | — | per-model wall-clock ceiling for this PR (advisory). 0 = off |
| `GADFLY_STATUS_BOARD` | on | set `0` to disable the live status-board comment |
| `GADFLY_STATUS_POLL_SECS` | 12 | how often the status board re-renders/upserts |
| `GADFLY_CONSOLIDATE` | `auto` | cross-model consensus comment: `auto` (on for ≥2 models), `1` (force on), `0` (off — one comment per model) |
@@ -408,6 +417,37 @@ The reviewer binary reads these (the stub/entrypoint set sane defaults):
| `GADFLY_REPO` | *(from `GITEA_API`)* | `owner/repo` slug stamped on emitted runs/findings (set by `entrypoint.sh`) |
| `GADFLY_PR` | *(from event)* | PR number stamped on emitted runs/findings (set by `entrypoint.sh`) |
### Large-PR cost controls
A very large diff is the one thing that can blow the budget: every review step
re-sends it, multiplied across models × lenses × passes × steps (a single
~250 K-token PR can otherwise burn a whole metered usage block). Gadfly handles
big PRs in three layers, all **size-gated so small PRs are untouched**:
1. **Paginated `get_diff` + compaction** (reviewer binary, on by default) —
`get_diff` returns a paginated, optionally per-file window instead of the whole
diff, and once a transcript nears the model's context window its middle is
folded into a summary (powered by [executus](https://gitea.stevedudenhoeffer.com/steve/executus)'s
`compact`). Tune with the `GADFLY_COMPACT_*` knobs above.
2. **Downshift** (`entrypoint.sh`) — above `GADFLY_HUGE_DIFF_BYTES` the whole fleet
collapses to a single cheap model + a focused lens subset, fewer steps, and no
recheck. A finished shallow review beats a budget-nuking one, and the posted
comment says so.
3. **Hard backstop** (`entrypoint.sh`) — `GADFLY_PR_BUDGET_SECS` is a wall-clock
ceiling across the *entire* fleet; on expiry the review is stopped and whatever
was found so far is posted. Like everything else, it never fails CI.
| Env | Default | Meaning |
|-----|---------|---------|
| `GADFLY_HUGE_DIFF_BYTES` | 600000 | downshift the fleet when the PR diff exceeds this many bytes (0 = never downshift) |
| `GADFLY_HUGE_DIFF_MODELS` | first model | model(s) to run on a downshifted huge PR |
| `GADFLY_HUGE_DIFF_SPECIALISTS` | `security,correctness,error-handling` | lenses on a downshifted huge PR |
| `GADFLY_HUGE_DIFF_MAX_STEPS` | 12 | review step cap on a huge PR |
| `GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS` | 8 | recheck step cap on a huge PR |
| `GADFLY_HUGE_DIFF_RECHECK` | 0 | run the recheck pass on a huge PR (off by default) |
| `GADFLY_HUGE_DIFF_MAX_DIFF_CHARS` | 20000 | embedded review-diff chars on a huge PR |
| `GADFLY_PR_BUDGET_SECS` | — | swarm-wide wall-clock backstop; stops the whole fleet when reached (0 = off) |
## Findings telemetry (optional)
Gadfly can record what it found so model quality can be tracked over time. It is
@@ -431,7 +471,7 @@ code.
## Building locally
```sh
go build ./cmd/gadfly # needs read access to the private majordomo module
go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
go test ./...
```