Compare commits

9 Commits

Author SHA1 Message Date
steve ac6ce06cdd feat: re-platform agentic review onto executus + large-PR cost controls (#20)
Build & push image / build-and-push (push) Successful in 33s
Makes gadfly a consumer of executus (run.Executor compaction/bounding/budget/critic + fanout) and fixes the large-PR token burn in size-gated layers: paginated get_diff, downshift above GADFLY_HUGE_DIFF_BYTES, and a swarm-wide GADFLY_PR_BUDGET_SECS backstop. Small PRs untouched; advisory-only and the static binary preserved. Dogfood swarm reviewed it (6 models, 21 real findings graded + folded in).

Co-authored-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
Co-committed-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
2026-06-30 15:41:03 +00:00
steve 5007597cf9 chore(reusable): bump image pin to sha-3095ebf (inline PR review live)
Phase 3: gadfly's own multi-model reviews now also post a COMMENT-state PR
review with inline comments anchored to changed lines. External consumers
re-pin separately.

[skip ci]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 22:00:39 -04:00
steve 3095ebff23 feat: inline COMMENT-state PR review (findings anchored to changed lines) (#18)
Build & push image / build-and-push (push) Successful in 8s
Co-authored-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
Co-committed-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
2026-06-29 01:59:36 +00:00
steve 8f5adc91b2 chore(reusable): bump image pin to sha-88f74aa (consensus consolidation live)
Phase 2: gadfly's own multi-model reviews now post ONE cross-model consensus
comment instead of N per-model comments. External consumers re-pin separately.

[skip ci]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 18:57:38 -04:00
steve 88f74aa768 feat: cross-model consensus consolidation (one ranked comment, not N walls) (#17)
Build & push image / build-and-push (push) Successful in 9s
Co-authored-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
Co-committed-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
2026-06-28 22:56:15 +00:00
steve 84b891b1ba chore(reusable): bump image pin to sha-5397160 (structured findings contract)
Makes the Phase 1 gadfly-findings contract live for gadfly's own dogfood
reviews (the local-ref reusable). External consumers re-pin separately.

[skip ci]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 18:25:16 -04:00
steve 53971603d3 feat: structured findings contract (machine-readable gadfly-findings block) (#16)
Build & push image / build-and-push (push) Successful in 5s
Co-authored-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
Co-committed-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
2026-06-28 22:23:02 +00:00
steve f49699fc12 Merge pull request 'docs: correct examples/reusable.yml pin guidance (prefer @sha; runners cache @v1)' (#15) from test/trigger-check into main
Reviewed-on: #15
2026-06-28 22:09:07 +00:00
Steve Dudenhoeffer 6e87a3e73f docs: correct examples/reusable.yml pin guidance (runners cache @v1; prefer @sha)
Adversarial Review (Gadfly) / review (pull_request) Successful in 3m4s
The @v1 comment claimed it auto-updates on releases, but long-lived act_runners
cache the reusable by ref so a moved tag isn't re-fetched. Recommend an
immutable @<sha>; routine tuning rides owner variables.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 02:10:35 -04:00
30 changed files with 2804 additions and 287 deletions
+5 -1
View File
@@ -60,6 +60,8 @@ on:
worker_model: { type: string, default: "" } # GADFLY_WORKER_MODEL worker_model: { type: string, default: "" } # GADFLY_WORKER_MODEL
allowed_users: { type: string, default: "" } # GADFLY_ALLOWED_USERS (consumer-specific; set in your stub) allowed_users: { type: string, default: "" } # GADFLY_ALLOWED_USERS (consumer-specific; set in your stub)
trigger_phrase: { type: string, default: "" } # GADFLY_TRIGGER_PHRASE trigger_phrase: { type: string, default: "" } # GADFLY_TRIGGER_PHRASE
consolidate: { type: string, default: "" } # GADFLY_CONSOLIDATE — "" => auto (one consensus comment for >=2 models); "0" => one comment per model
inline_review: { type: string, default: "" } # GADFLY_INLINE_REVIEW — "" => on (post a COMMENT-state PR review with inline comments on changed lines); "0" => off
# Job wall-clock cap. 90 as a default: the 5-lens suite across a slow lane # Job wall-clock cap. 90 as a default: the 5-lens suite across a slow lane
# (claude-code with extended thinking) over two passes can run long. # (claude-code with extended thinking) over two passes can run long.
timeout_minutes: { type: number, default: 90 } timeout_minutes: { type: number, default: 90 }
@@ -92,7 +94,7 @@ jobs:
runs-on: ubuntu-latest runs-on: ubuntu-latest
timeout-minutes: ${{ inputs.timeout_minutes }} timeout-minutes: ${{ inputs.timeout_minutes }}
steps: steps:
- uses: docker://gitea.stevedudenhoeffer.com/steve/gadfly:sha-c342bdb - uses: docker://gitea.stevedudenhoeffer.com/steve/gadfly:sha-3095ebf
env: env:
# --- event context (from the CALLER's github.*) ------------------- # --- event context (from the CALLER's github.*) -------------------
GITEA_API: ${{ github.server_url }}/api/v1/repos/${{ github.repository }} GITEA_API: ${{ github.server_url }}/api/v1/repos/${{ github.repository }}
@@ -143,3 +145,5 @@ jobs:
GADFLY_WORKER_MODEL: ${{ inputs.worker_model }} GADFLY_WORKER_MODEL: ${{ inputs.worker_model }}
GADFLY_ALLOWED_USERS: ${{ inputs.allowed_users }} GADFLY_ALLOWED_USERS: ${{ inputs.allowed_users }}
GADFLY_TRIGGER_PHRASE: ${{ inputs.trigger_phrase }} GADFLY_TRIGGER_PHRASE: ${{ inputs.trigger_phrase }}
GADFLY_CONSOLIDATE: ${{ inputs.consolidate }}
GADFLY_INLINE_REVIEW: ${{ inputs.inline_review }}
+29 -5
View File
@@ -22,15 +22,21 @@ verifies each one against the actual code, and posts its findings as a comment.
4. **Provider-agnostic.** Powered by [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo), 4. **Provider-agnostic.** Powered by [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo),
so it can target Ollama (local/cloud), OpenAI, Anthropic, Google, or any so it can target Ollama (local/cloud), OpenAI, Anthropic, Google, or any
OpenAI/Ollama-compatible endpoint. Don't re-hardcode a single provider. OpenAI/Ollama-compatible endpoint. Don't re-hardcode a single provider.
5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib + majordomo. Keep 5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib, majordomo, and
it that way — no heavyweight deps, no coupling to any one consumer repo (e.g. mort). [executus](https://gitea.stevedudenhoeffer.com/steve/executus) (whose *core*`run`/`compact`/
`model`/`fanout`/`tool` — is itself majordomo+stdlib only, so the binary stays static; do NOT
pull executus's `contrib/store` or any battery that drags in a DB driver). No heavyweight deps,
no coupling to any one consumer repo (e.g. mort). Gadfly is executus's canonical *light* consumer.
## Architecture ## Architecture
``` ```
cmd/gadfly/ the reviewer binary — pure producer of review markdown (stdout) cmd/gadfly/ the reviewer binary — pure producer of review markdown (stdout)
main.go orchestration: loop specialists, each a review pass + adversarial recheck main.go orchestration: fan specialists out (executus/fanout), each a review pass + recheck
engine.go reviewEngine abstraction: majordomo agent loop vs claude-code CLI shell-out engine.go reviewEngine abstraction: executus run.Executor (majordomo agent loop +
compaction/bounding/budget/critic) vs claude-code CLI shell-out
executus.go executus wiring: tool.Registry over the repo tools, the run.Executor build
(compact + model context-limit threshold + per-PR budget + wrap-up critic)
specialists.go specialist lenses: built-ins, default suite, env + .gadfly.yml resolution specialists.go specialist lenses: built-ins, default suite, env + .gadfly.yml resolution
auto.go dynamic `auto` selection: a selector model picks lenses per-diff (may invent) auto.go dynamic `auto` selection: a selector model picks lenses per-diff (may invent)
delegate.go worker-tier delegate_investigation tool (cheap sub-agent does legwork) delegate.go worker-tier delegate_investigation tool (cheap sub-agent does legwork)
@@ -69,7 +75,7 @@ verdict. Verdict is one of: `No material issues found` / `Minor issues` / `Block
## Build / test ## Build / test
```sh ```sh
go build ./cmd/gadfly # needs read access to the private majordomo module go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
go test ./... go test ./...
gofmt -l cmd/ # must be clean gofmt -l cmd/ # must be clean
docker build -t gadfly:dev --secret id=REGISTRY_USER,env=REGISTRY_USER --secret id=REGISTRY_PASSWORD,env=REGISTRY_PASSWORD . docker build -t gadfly:dev --secret id=REGISTRY_USER,env=REGISTRY_USER --secret id=REGISTRY_PASSWORD,env=REGISTRY_PASSWORD .
@@ -149,3 +155,21 @@ are actually exercised. OpenAI/Anthropic/Google come from majordomo's abstractio
parallel, `cap` (from `GADFLY_PROVIDER_CONCURRENCY` else `GADFLY_CONCURRENCY`, default 1) bounds parallel, `cap` (from `GADFLY_PROVIDER_CONCURRENCY` else `GADFLY_CONCURRENCY`, default 1) bounds
models-at-once within a lane. The review timeout (`GADFLY_TIMEOUT_SECS`) is **per-lens**, not models-at-once within a lane. The review timeout (`GADFLY_TIMEOUT_SECS`) is **per-lens**, not
shared across the suite — a slow model can't starve later lenses (the original timeout bug). shared across the suite — a slow model can't starve later lenses (the original timeout bug).
- **Large-PR token burn**: the agent loop re-sends the whole transcript every step, so a giant
diff (the old `get_diff` dumped it untruncated, and it was embedded in both the review and
recheck task) was re-transmitted ~steps × lenses × passes × models times — a ~250 K-token PR
could drain a metered usage block in minutes. Fixed in three size-gated layers (small PRs
untouched): paginated `get_diff` + `executus/compact` compaction in the binary; an
`entrypoint.sh` downshift above `GADFLY_HUGE_DIFF_BYTES` (one cheap model, fewer lenses/steps,
no recheck); and a swarm-wide `GADFLY_PR_BUDGET_SECS` wall-clock backstop. Compaction's threshold
is intentionally LOW (`GADFLY_COMPACT_RATIO` 0.45, not executus's 0.7) because the burning
transcript on the embedded path rarely reaches 0.7×context.
- **executus re-platform**: the in-process review path runs through `executus/run`'s `run.Executor`
(compaction, run-bounding, `Ports.Budget`, the wrap-up nudge as `Ports.Critic`), wiring it in
`cmd/gadfly/executus.go`. Gadfly KEEPS its own `model.go` resolution (so `GADFLY_ENDPOINT_<NAME>`
http aliases + the claude-code engine survive) and only hands `run.Executor` the already-resolved
model via a trivial resolver — do NOT route review-model resolution through
`model.ParseModelForContext` (it bypasses gadfly's endpoint aliases). `run.Result` exposes no
transcript, so the old transcript-based forced-finalization fallback is gone; the wrap-up critic
nudge is the remaining "always emit something" mechanism. The claude-code engine still shells out
and is unaffected.
+4 -1
View File
@@ -24,7 +24,10 @@ RUN --mount=type=cache,target=/go/pkg/mod \
go build -trimpath -ldflags="-s -w" -o /out/gadfly ./cmd/gadfly go build -trimpath -ldflags="-s -w" -o /out/gadfly ./cmd/gadfly
FROM alpine:3.20 FROM alpine:3.20
RUN apk add --no-cache bash git curl jq ca-certificates nodejs npm # procps provides pkill/pgrep, which entrypoint.sh's per-PR wall-clock backstop
# (GADFLY_PR_BUDGET_SECS) uses to stop the review subtrees — busybox's applets
# are not guaranteed to include them.
RUN apk add --no-cache bash git curl jq ca-certificates nodejs npm procps
# Bundle the Claude Code CLI so the `claude-code` review engine works out of the # Bundle the Claude Code CLI so the `claude-code` review engine works out of the
# box (GADFLY_MODELS=claude-code or claude-code/<model>). This adds Node + the # box (GADFLY_MODELS=claude-code or claude-code/<model>). This adds Node + the
# CLI to the image (notably larger); ollama-only users pay the size but nothing # CLI to the image (notably larger); ollama-only users pay the size but nothing
+81 -2
View File
@@ -277,6 +277,43 @@ every `GADFLY_STATUS_POLL_SECS` (default 12s) until the swarm finishes. It's adv
best-effort — the per-model findings comments are unaffected — and entirely separate from those. best-effort — the per-model findings comments are unaffected — and entirely separate from those.
Turn it off with `GADFLY_STATUS_BOARD=0`. Turn it off with `GADFLY_STATUS_BOARD=0`.
### Consensus consolidation
With **two or more models**, posting one comment each means a reader faces N walls of prose that
mostly agree. Instead Gadfly consolidates: every model writes its findings to a shared file, and
after the whole swarm finishes a single pass clusters those findings by location, counts **how
many models independently flagged each one**, and posts **one consensus comment**:
```
## 🪰 Gadfly review — consensus across 7 models
**Verdict: Blocking issues found** · 9 findings (3 with multi-model agreement)
| | Finding | Where | Models | Lens |
|--|--|--|--|--|
| 🔴 | Auth bypass: token not verified | `auth/login.go:42` | 6/7 | security |
| 🟠 | Unbounded retry loop | `sync/worker.go:88` | 3/7 | error-handling |
<details><summary>4 single-model findings (lower confidence)</summary> … </details>
<details><summary>Per-model detail</summary> … each model's full review, folded … </details>
```
Cross-model agreement is the strongest real-vs-false-positive signal available, so findings are
ranked by it (a lone low-severity finding folds away; a lone *critical* still surfaces). The
per-model comments are suppressed in this mode — each model's full review is preserved, folded,
inside the consensus comment — and nothing is lost: if consolidation can't run, Gadfly falls
back to posting the per-model comments. Controlled by `GADFLY_CONSOLIDATE`: `auto` (default — on
for ≥2 models), `1` (force on), `0` (force off, one comment per model). Single-model runs are
unaffected.
**Inline PR review.** Alongside the consensus comment, Gadfly also posts a single Gitea **pull
review** (state `COMMENT` — advisory, **never** request-changes or approve, so it can't block a
merge) whose inline comments anchor each consensus finding to the exact changed line it's about.
Only findings that land on a line in the diff are anchored (Gitea rejects comments off the diff);
the rest stay in the consensus comment. A re-run replaces the previous review instead of stacking.
It's the "reviewer integrated with Gitea" without the blocking — turn it off with
`GADFLY_INLINE_REVIEW=0`.
### Triggers ### Triggers
1. A **new/reopened/ready** non-draft PR — automatic. 1. A **new/reopened/ready** non-draft PR — automatic.
@@ -359,9 +396,20 @@ The reviewer binary reads these (the stub/entrypoint set sane defaults):
| `GADFLY_TIMEOUT_SECS` | 300 | deadline **per specialist lens** (review+recheck) | | `GADFLY_TIMEOUT_SECS` | 300 | deadline **per specialist lens** (review+recheck) |
| `GADFLY_RECHECK` | on | set `0`/`false` to skip the recheck pass | | `GADFLY_RECHECK` | on | set `0`/`false` to skip the recheck pass |
| `GADFLY_RECHECK_MAX_STEPS` | 16 | recheck-pass step cap | | `GADFLY_RECHECK_MAX_STEPS` | 16 | recheck-pass step cap |
| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the prompt (full diff via `get_diff`) | | `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the **review** prompt (the full diff is reachable via the paginated `get_diff` tool, scoped per file with its `path` arg) |
| `GADFLY_RECHECK_DIFF_CHARS` | 20000 | diff chars embedded in the **recheck** prompt (smaller — the recheck pages `get_diff` for the hunks it verifies) |
| `GADFLY_COMPACT` | on | context compaction (via [executus](https://gitea.stevedudenhoeffer.com/steve/executus)): fold the transcript's runaway middle into a summary as it nears the model's context window, so a big diff + accumulating tool output can't balloon every step. `0` disables |
| `GADFLY_COMPACT_RATIO` | 0.45 | fraction of the model's context window at which compaction fires |
| `GADFLY_COMPACT_MODEL` | worker, else review model | cheap model the compactor uses to summarize the folded middle |
| `GADFLY_COMPACT_KEEP_RECENT` | 8 | most-recent messages kept verbatim during compaction |
| `GADFLY_COMPACT_SUMMARY_WORDS` | 200 | word cap on the compaction summary |
| `GADFLY_MODEL_CONTEXT_TOKENS` | *(auto)* | override the model's context-window size (tokens) for the compaction threshold; set it for self-hosted endpoints executus can't introspect (Ollama Cloud models resolve automatically) |
| `GADFLY_PR_TOKEN_BUDGET` | — | per-model token ceiling for this PR; once spent, remaining lenses/passes are skipped (advisory). 0 = off |
| `GADFLY_PR_TIME_BUDGET_SECS` | — | per-model wall-clock ceiling for this PR (advisory). 0 = off |
| `GADFLY_STATUS_BOARD` | on | set `0` to disable the live status-board comment | | `GADFLY_STATUS_BOARD` | on | set `0` to disable the live status-board comment |
| `GADFLY_STATUS_POLL_SECS` | 12 | how often the status board re-renders/upserts | | `GADFLY_STATUS_POLL_SECS` | 12 | how often the status board re-renders/upserts |
| `GADFLY_CONSOLIDATE` | `auto` | cross-model consensus comment: `auto` (on for ≥2 models), `1` (force on), `0` (off — one comment per model) |
| `GADFLY_INLINE_REVIEW` | on | when consolidating, also post a `COMMENT`-state PR review with inline comments on changed lines; `0` disables |
| `GADFLY_TRIGGER_PHRASE` | `@gadfly review` | comment phrase that re-triggers | | `GADFLY_TRIGGER_PHRASE` | `@gadfly review` | comment phrase that re-triggers |
| `GADFLY_ALLOWED_USERS` | *(collaborators)* | comma-separated allow-list for comment triggers | | `GADFLY_ALLOWED_USERS` | *(collaborators)* | comma-separated allow-list for comment triggers |
| `GADFLY_FINDINGS_URL` | — | gadfly-reports store base URL; set to enable findings telemetry (off when empty) | | `GADFLY_FINDINGS_URL` | — | gadfly-reports store base URL; set to enable findings telemetry (off when empty) |
@@ -369,6 +417,37 @@ The reviewer binary reads these (the stub/entrypoint set sane defaults):
| `GADFLY_REPO` | *(from `GITEA_API`)* | `owner/repo` slug stamped on emitted runs/findings (set by `entrypoint.sh`) | | `GADFLY_REPO` | *(from `GITEA_API`)* | `owner/repo` slug stamped on emitted runs/findings (set by `entrypoint.sh`) |
| `GADFLY_PR` | *(from event)* | PR number stamped on emitted runs/findings (set by `entrypoint.sh`) | | `GADFLY_PR` | *(from event)* | PR number stamped on emitted runs/findings (set by `entrypoint.sh`) |
### Large-PR cost controls
A very large diff is the one thing that can blow the budget: every review step
re-sends it, multiplied across models × lenses × passes × steps (a single
~250 K-token PR can otherwise burn a whole metered usage block). Gadfly handles
big PRs in three layers, all **size-gated so small PRs are untouched**:
1. **Paginated `get_diff` + compaction** (reviewer binary, on by default) —
`get_diff` returns a paginated, optionally per-file window instead of the whole
diff, and once a transcript nears the model's context window its middle is
folded into a summary (powered by [executus](https://gitea.stevedudenhoeffer.com/steve/executus)'s
`compact`). Tune with the `GADFLY_COMPACT_*` knobs above.
2. **Downshift** (`entrypoint.sh`) — above `GADFLY_HUGE_DIFF_BYTES` the whole fleet
collapses to a single cheap model + a focused lens subset, fewer steps, and no
recheck. A finished shallow review beats a budget-nuking one, and the posted
comment says so.
3. **Hard backstop** (`entrypoint.sh`) — `GADFLY_PR_BUDGET_SECS` is a wall-clock
ceiling across the *entire* fleet; on expiry the review is stopped and whatever
was found so far is posted. Like everything else, it never fails CI.
| Env | Default | Meaning |
|-----|---------|---------|
| `GADFLY_HUGE_DIFF_BYTES` | 600000 | downshift the fleet when the PR diff exceeds this many bytes (0 = never downshift) |
| `GADFLY_HUGE_DIFF_MODELS` | first model | model(s) to run on a downshifted huge PR |
| `GADFLY_HUGE_DIFF_SPECIALISTS` | `security,correctness,error-handling` | lenses on a downshifted huge PR |
| `GADFLY_HUGE_DIFF_MAX_STEPS` | 12 | review step cap on a huge PR |
| `GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS` | 8 | recheck step cap on a huge PR |
| `GADFLY_HUGE_DIFF_RECHECK` | 0 | run the recheck pass on a huge PR (off by default) |
| `GADFLY_HUGE_DIFF_MAX_DIFF_CHARS` | 20000 | embedded review-diff chars on a huge PR |
| `GADFLY_PR_BUDGET_SECS` | — | swarm-wide wall-clock backstop; stops the whole fleet when reached (0 = off) |
## Findings telemetry (optional) ## Findings telemetry (optional)
Gadfly can record what it found so model quality can be tracked over time. It is Gadfly can record what it found so model quality can be tracked over time. It is
@@ -392,7 +471,7 @@ code.
## Building locally ## Building locally
```sh ```sh
go build ./cmd/gadfly # needs read access to the private majordomo module go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
go test ./... go test ./...
``` ```
+477
View File
@@ -0,0 +1,477 @@
package main
// Cross-model consensus consolidation. The swarm runs each model independently
// (entrypoint.sh fans them out across provider lanes); historically each model
// posted its OWN comment, so a reader faced N walls of prose that mostly agreed.
//
// Instead, every model writes its findings to a shared directory
// (GADFLY_FINDINGS_OUT, one JSON file per model), and after the whole swarm
// finishes a single consolidation pass (GADFLY_CONSOLIDATE_DIR) clusters those
// findings by location, counts how many models independently flagged each one,
// and renders ONE comment: an agreement-ranked table up top (cross-model
// agreement is the strongest real-vs-false-positive signal we have), with each
// model's full review folded below for drill-down.
//
// This file owns: the per-model artifact (modelFindings), writing it
// (writeFindingsOut), reading the directory back, clustering, and rendering the
// consensus markdown (renderConsensus). It depends only on the stdlib.
import (
"encoding/json"
"errors"
"fmt"
"os"
"path/filepath"
"sort"
"strings"
)
// consensusMarker tags the single consolidated comment so entrypoint.sh can
// upsert it in place across re-runs (mirrors run.sh's per-model marker).
const consensusMarker = "<!-- gadfly-consensus -->"
// modelFindings is the per-model artifact written to GADFLY_FINDINGS_OUT. It
// carries enough to rebuild the consolidated comment without re-running models:
// the structured findings (for clustering) plus the full rendered review (for
// the folded per-model drill-down).
type modelFindings struct {
Model string `json:"model"`
Provider string `json:"provider"`
Verdict string `json:"verdict"` // worst lens verdict, as a label
Errored bool `json:"errored"` // the model produced no usable review (every lens failed, or the run crashed)
Markdown string `json:"markdown"` // full rendered per-model review (findings block already stripped)
Findings []outFinding `json:"findings"`
}
// outFinding is one finding in the per-model artifact (a flattened `finding`
// plus its lens).
type outFinding struct {
Lens string `json:"lens"`
File string `json:"file"`
Line int `json:"line"`
Severity string `json:"severity"`
Confidence string `json:"confidence"`
Title string `json:"title"`
Detail string `json:"detail"`
}
// collectFindings returns a lens result's findings with severity always filled
// in: per-finding when the structured block supplied it, else derived from the
// lens verdict (so heuristic-scraped findings still carry a canonical word). A
// clean or errored lens yields nothing. Shared by the telemetry emit and the
// per-model findings file so both agree on what a finding is.
func collectFindings(r specialistResult) []finding {
if r.errored || r.verdict == verdictClean {
return nil
}
fs := extractStructuredFindingsOrScrape(r)
lensSev := r.verdict.severity()
for i := range fs {
if fs[i].severity == "" {
fs[i].severity = lensSev
}
}
return fs
}
// writeFindingsOut writes this model's findings + rendered review to
// GADFLY_FINDINGS_OUT for the later consolidation pass. No-op unless the env is
// set. Best-effort: any error is logged to stderr and never affects the review
// (it runs after the markdown is already on stdout).
func writeFindingsOut(results []specialistResult) {
path := strings.TrimSpace(os.Getenv("GADFLY_FINDINGS_OUT"))
if path == "" {
return
}
mf := modelFindings{
Model: strings.TrimSpace(os.Getenv("GADFLY_MODEL")),
Provider: modelProvider(),
Verdict: worstVerdict(results).label(),
Errored: allErrored(results),
Markdown: renderConsolidated(results),
}
for _, r := range results {
for _, f := range collectFindings(r) {
mf.Findings = append(mf.Findings, outFinding{
Lens: r.spec.Name,
File: f.file,
Line: f.line,
Severity: f.severity,
Confidence: f.confidence,
Title: f.title,
Detail: f.detail,
})
}
}
data, err := json.Marshal(mf)
if err != nil {
fmt.Fprintln(os.Stderr, "gadfly: marshal findings out:", err)
return
}
// Defensive: make sure the parent dir exists (entrypoint creates it, but a
// missing dir would otherwise silently drop this model from the consensus).
if dir := filepath.Dir(path); dir != "" {
_ = os.MkdirAll(dir, 0o755)
}
if err := os.WriteFile(path, data, 0o644); err != nil {
fmt.Fprintln(os.Stderr, "gadfly: write findings out:", err)
}
}
// allErrored reports whether every lens of a review failed (so the model
// produced no usable findings). Such a model is recorded but excluded from the
// consensus agreement denominator — counting it would dilute every ratio with a
// model that never actually reviewed.
func allErrored(results []specialistResult) bool {
if len(results) == 0 {
return true
}
for _, r := range results {
if !r.errored {
return false
}
}
return true
}
// runConsolidate is the consolidation entry point (GADFLY_CONSOLIDATE_DIR set):
// read every per-model artifact in the directory, render the consensus comment
// to stdout. Errors are fatal to THIS process only — entrypoint.sh treats a
// failed consolidation as advisory and falls back to per-model comments.
func runConsolidate() error {
dir := strings.TrimSpace(os.Getenv("GADFLY_CONSOLIDATE_DIR"))
if dir == "" {
return errors.New("GADFLY_CONSOLIDATE_DIR is empty")
}
entries, err := os.ReadDir(dir)
if err != nil {
return fmt.Errorf("read consolidate dir: %w", err)
}
var models []modelFindings
for _, e := range entries {
if e.IsDir() || !strings.HasSuffix(e.Name(), ".json") {
continue
}
data, err := os.ReadFile(filepath.Join(dir, e.Name()))
if err != nil {
fmt.Fprintln(os.Stderr, "gadfly: read", e.Name(), err)
continue
}
var mf modelFindings
if err := json.Unmarshal(data, &mf); err != nil {
fmt.Fprintln(os.Stderr, "gadfly: parse", e.Name(), err)
continue
}
if strings.TrimSpace(mf.Model) == "" {
continue
}
models = append(models, mf)
}
if len(models) == 0 {
return errors.New("no model findings to consolidate")
}
// Cluster once, then render the consensus comment and (best-effort) the inline
// PR review from the same clusters so the two views can't drift.
clusters := clusterFindings(models)
// Lead with the marker so entrypoint.sh can upsert this comment in place
// (same pattern as run.sh's per-model marker); it appends the advisory footer.
fmt.Println(consensusMarker)
fmt.Println(renderConsensus(models, clusters))
// Inline PR review (COMMENT state) anchoring findings to changed lines.
// Best-effort: no-op without diff/API creds, never affects stdout/exit.
postInlineReview(clusters)
return nil
}
// cluster is a group of findings (across models) judged to be the same issue:
// same file, lines within lineTolerance of the cluster's current span. The span
// [line,maxLine] slides as members join, so a chain of nearby findings merges
// instead of splitting once it drifts past the first line.
type cluster struct {
file string
line int // representative (smallest) line
maxLine int // largest line in the cluster — the span's upper edge
severity string
title string
detail string // highest-severity report's detail; rendered in the inline review comment
models map[string]bool
lenses map[string]bool
}
// findingRef is one model's finding (carrying which model reported it), used
// while grouping findings into clusters.
type findingRef struct {
f outFinding
model string
}
// lineTolerance: a finding in the same file within this many lines of a
// cluster's current span is treated as the same issue (models often cite a line
// or two apart).
const lineTolerance = 3
// renderConsensus builds the single consolidated comment body from every model's
// findings. It does NOT emit the marker or advisory footer — entrypoint.sh wraps
// it (mirroring run.sh's per-model framing).
func renderConsensus(models []modelFindings, clusters []cluster) string {
// effective = models that actually produced a review. Errored models are
// shown (folded, below) but excluded from the agreement denominator so a
// failed model doesn't dilute every ratio.
effective := 0
for _, m := range models {
if !m.Errored {
effective++
}
}
errored := len(models) - effective
worst := verdictClean
for _, m := range models {
if v := parseVerdict(m.Verdict); v > worst {
worst = v
}
}
// Partition in one pass: "headline" findings (multi-model agreement, OR a
// lone CRITICAL) vs folded "single-model" lower-confidence findings. Also
// count multi-model agreements for the summary line.
var headline, folded []cluster
agreed := 0
for _, c := range clusters {
if len(c.models) >= 2 {
agreed++
}
if len(c.models) >= 2 || sevRank(c.severity) >= sevRank("critical") {
headline = append(headline, c)
} else {
folded = append(folded, c)
}
}
var b strings.Builder
fmt.Fprintf(&b, "## 🪰 Gadfly review — consensus across %d model%s", effective, plural(effective))
if errored > 0 {
fmt.Fprintf(&b, " (%d failed)", errored)
}
b.WriteString("\n\n")
fmt.Fprintf(&b, "**Verdict: %s** · %d finding%s (%d with multi-model agreement)\n",
worst.label(), len(clusters), plural(len(clusters)), agreed)
if len(headline) > 0 {
b.WriteString("\n| | Finding | Where | Models | Lens |\n|--|--|--|--|--|\n")
for _, c := range headline {
fmt.Fprintf(&b, "| %s | %s | `%s` | %d/%d | %s |\n",
sevIcon(c.severity), mdCell(c.title), mdCell(location(c.file, c.line)),
len(c.models), effective, mdCell(lensList(c.lenses)))
}
} else if len(clusters) == 0 {
b.WriteString("\nNo material issues found by consensus.\n")
}
// else: only single-model findings — they're shown folded below, so don't
// claim "no material issues" (there are some, just none with consensus).
if len(folded) > 0 {
fmt.Fprintf(&b, "\n<details><summary>%d single-model finding%s (lower confidence)</summary>\n\n",
len(folded), plural(len(folded)))
b.WriteString("| | Finding | Where | Model | Lens |\n|--|--|--|--|--|\n")
for _, c := range folded {
fmt.Fprintf(&b, "| %s | %s | `%s` | %s | %s |\n",
sevIcon(c.severity), mdCell(c.title), mdCell(location(c.file, c.line)),
mdCell(oneModel(c.models)), mdCell(lensList(c.lenses)))
}
b.WriteString("\n</details>\n")
}
// Per-model full reviews, folded for drill-down (nothing is lost).
b.WriteString("\n<details><summary>Per-model detail</summary>\n")
for _, m := range models {
body := strings.TrimSpace(m.Markdown)
if body == "" {
body = "_(no output)_"
}
verdict := m.Verdict
if m.Errored {
verdict = "⚠️ reviewer failed"
}
fmt.Fprintf(&b, "\n<details><summary><b>%s</b> (%s) — %s</summary>\n\n%s\n\n</details>\n",
mdCell(m.Model), mdCell(m.Provider), verdict, body)
}
b.WriteString("\n</details>")
return b.String()
}
// clusterFindings groups every model's findings into cross-model clusters,
// sorted by agreement (desc), then severity (desc), then location.
func clusterFindings(models []modelFindings) []cluster {
// Group by file, then greedily merge by line proximity.
byFile := map[string][]findingRef{}
for _, m := range models {
for _, f := range m.Findings {
if strings.TrimSpace(f.File) == "" {
continue
}
byFile[f.File] = append(byFile[f.File], findingRef{f, m.Model})
}
}
var clusters []cluster
for file, items := range byFile {
sort.SliceStable(items, func(i, j int) bool { return items[i].f.Line < items[j].f.Line })
// Cluster within THIS file only (clusters never span files), so the inner
// scan is over same-file clusters, not every cluster seen so far.
var fileClusters []cluster
for _, it := range items {
placed := false
for ci := range fileClusters {
c := &fileClusters[ci]
// Join if the line falls within the cluster's span, widened by the
// tolerance on both edges — so the window slides as the span grows.
if it.f.Line >= c.line-lineTolerance && it.f.Line <= c.maxLine+lineTolerance {
mergeIntoCluster(c, it.f, it.model)
placed = true
break
}
}
if !placed {
c := cluster{
file: file,
line: it.f.Line,
maxLine: it.f.Line,
severity: it.f.Severity,
title: it.f.Title,
detail: it.f.Detail,
models: map[string]bool{},
lenses: map[string]bool{},
}
mergeIntoCluster(&c, it.f, it.model)
fileClusters = append(fileClusters, c)
}
}
clusters = append(clusters, fileClusters...)
}
sort.SliceStable(clusters, func(i, j int) bool {
if len(clusters[i].models) != len(clusters[j].models) {
return len(clusters[i].models) > len(clusters[j].models)
}
if sevRank(clusters[i].severity) != sevRank(clusters[j].severity) {
return sevRank(clusters[i].severity) > sevRank(clusters[j].severity)
}
if clusters[i].file != clusters[j].file {
return clusters[i].file < clusters[j].file
}
return clusters[i].line < clusters[j].line
})
return clusters
}
// mergeIntoCluster folds one finding into a cluster: union the model/lens sets,
// widen the [line,maxLine] span, and keep the highest-severity report's title.
func mergeIntoCluster(c *cluster, f outFinding, model string) {
if model != "" {
c.models[model] = true
}
if f.Lens != "" {
c.lenses[f.Lens] = true
}
if f.Line > 0 && (c.line == 0 || f.Line < c.line) {
c.line = f.Line
}
if f.Line > c.maxLine {
c.maxLine = f.Line
}
// Backfill an empty title/detail from any report, regardless of severity, so a
// higher-severity-but-terse finding doesn't leave the cluster without context.
if strings.TrimSpace(c.title) == "" && strings.TrimSpace(f.Title) != "" {
c.title = f.Title
}
if strings.TrimSpace(c.detail) == "" && strings.TrimSpace(f.Detail) != "" {
c.detail = f.Detail
}
// A strictly-higher-severity report takes over the title/detail.
if sevRank(f.Severity) > sevRank(c.severity) {
c.severity = f.Severity
if strings.TrimSpace(f.Title) != "" {
c.title = f.Title
}
if strings.TrimSpace(f.Detail) != "" {
c.detail = f.Detail
}
}
}
// sevRank orders the canonical severity words for sorting/comparison.
func sevRank(s string) int {
switch strings.ToLower(strings.TrimSpace(s)) {
case "critical":
return 5
case "high":
return 4
case "medium":
return 3
case "small":
return 2
case "trivial":
return 1
default:
return 0
}
}
// sevIcon is the at-a-glance severity badge for the consensus table.
func sevIcon(s string) string {
switch strings.ToLower(strings.TrimSpace(s)) {
case "critical", "high":
return "🔴"
case "medium":
return "🟠"
case "small":
return "🟡"
default:
return "⚪"
}
}
func location(file string, line int) string {
if line > 0 {
return fmt.Sprintf("%s:%d", file, line)
}
return file
}
func lensList(lenses map[string]bool) string {
out := make([]string, 0, len(lenses))
for l := range lenses {
out = append(out, l)
}
sort.Strings(out)
return strings.Join(out, ", ")
}
func oneModel(models map[string]bool) string {
for m := range models {
return m
}
return ""
}
// mdCell makes a string safe for a one-line markdown table cell: collapse
// newlines, escape pipes (which delimit columns), and neutralize backticks
// (a stray one would break an inline-code span — a backslash can't escape it
// inside code, so replace with an apostrophe). Inputs are model-influenced, so
// this keeps a malformed file path or title from breaking the table.
func mdCell(s string) string {
s = strings.ReplaceAll(s, "\n", " ")
s = strings.ReplaceAll(s, "|", "\\|")
s = strings.ReplaceAll(s, "`", "'")
return strings.TrimSpace(s)
}
func plural(n int) string {
if n == 1 {
return ""
}
return "s"
}
+217
View File
@@ -0,0 +1,217 @@
package main
import (
"encoding/json"
"os"
"path/filepath"
"strings"
"testing"
)
func TestClusterFindingsAgreementAndTolerance(t *testing.T) {
models := []modelFindings{
{Model: "m1", Verdict: "Blocking issues found", Findings: []outFinding{
{Lens: "security", File: "a.go", Line: 10, Severity: "high", Title: "auth bypass"},
{Lens: "perf", File: "b.go", Line: 5, Severity: "trivial", Title: "tiny nit"},
}},
{Model: "m2", Verdict: "Minor issues", Findings: []outFinding{
{Lens: "security", File: "a.go", Line: 11, Severity: "critical", Title: "auth bypass (crit)"}, // within tolerance of a.go:10
}},
{Model: "m3", Verdict: "Minor issues", Findings: []outFinding{
{Lens: "correctness", File: "a.go", Line: 10, Severity: "medium", Title: "auth bypass"},
}},
}
clusters := clusterFindings(models)
if len(clusters) != 2 {
t.Fatalf("want 2 clusters (a.go:10±, b.go:5), got %d: %+v", len(clusters), clusters)
}
// First cluster (highest agreement) is the a.go auth one: 3 models, severity
// escalated to critical, representative line the smallest (10).
c := clusters[0]
if len(c.models) != 3 {
t.Errorf("want 3 models on the top cluster, got %d", len(c.models))
}
if c.severity != "critical" {
t.Errorf("want escalated severity critical, got %q", c.severity)
}
if c.line != 10 {
t.Errorf("want representative line 10, got %d", c.line)
}
if !c.lenses["security"] || !c.lenses["correctness"] {
t.Errorf("want union of lenses, got %v", c.lenses)
}
}
func TestRenderConsensusFoldsSingleModelNits(t *testing.T) {
models := []modelFindings{
{Model: "m1", Provider: "p", Verdict: "Blocking issues found", Markdown: "m1 detail", Findings: []outFinding{
{Lens: "security", File: "a.go", Line: 10, Severity: "high", Title: "auth bypass"},
{Lens: "perf", File: "b.go", Line: 5, Severity: "trivial", Title: "tiny nit"},
}},
{Model: "m2", Provider: "p", Verdict: "Minor issues", Markdown: "m2 detail", Findings: []outFinding{
{Lens: "security", File: "a.go", Line: 10, Severity: "high", Title: "auth bypass"},
}},
}
out := renderConsensus(models, clusterFindings(models))
// Headline table: the agreed finding with a 2/2 badge.
if !strings.Contains(out, "2/2") {
t.Errorf("expected a 2/2 agreement badge in headline:\n%s", out)
}
if !strings.Contains(out, "auth bypass") || !strings.Contains(out, "a.go:10") {
t.Errorf("headline missing the consensus finding:\n%s", out)
}
// The lone trivial finding is folded, not in the headline table.
if !strings.Contains(out, "single-model finding") {
t.Errorf("expected a folded single-model section:\n%s", out)
}
// Per-model detail is preserved (folded).
if !strings.Contains(out, "m1 detail") || !strings.Contains(out, "m2 detail") {
t.Errorf("per-model detail not preserved:\n%s", out)
}
}
func TestRenderConsensusHighSeverityLoneFindingStaysHeadline(t *testing.T) {
// A single model, single critical finding must still surface in the headline
// (not be folded as "low confidence").
models := []modelFindings{
{Model: "solo", Verdict: "Blocking issues found", Markdown: "x", Findings: []outFinding{
{Lens: "security", File: "a.go", Line: 1, Severity: "critical", Title: "rce"},
}},
}
out := renderConsensus(models, clusterFindings(models))
headline := out
if i := strings.Index(out, "single-model finding"); i >= 0 {
headline = out[:i]
}
if !strings.Contains(headline, "rce") {
t.Errorf("lone critical should be in the headline, not folded:\n%s", out)
}
}
func TestClusterSlidingWindowMergesChain(t *testing.T) {
// Findings at 10, 13, 16 (each 3 apart) from three models must merge into ONE
// cluster — the window slides with the span instead of anchoring at line 10.
models := []modelFindings{
{Model: "m1", Findings: []outFinding{{Lens: "x", File: "a.go", Line: 10, Severity: "medium", Title: "t"}}},
{Model: "m2", Findings: []outFinding{{Lens: "x", File: "a.go", Line: 13, Severity: "medium", Title: "t"}}},
{Model: "m3", Findings: []outFinding{{Lens: "x", File: "a.go", Line: 16, Severity: "medium", Title: "t"}}},
}
clusters := clusterFindings(models)
if len(clusters) != 1 {
t.Fatalf("chain 10/13/16 should merge into 1 cluster, got %d", len(clusters))
}
if len(clusters[0].models) != 3 {
t.Errorf("want 3 models in the merged cluster, got %d", len(clusters[0].models))
}
}
func TestRenderConsensusExcludesErroredFromDenominator(t *testing.T) {
models := []modelFindings{
{Model: "m1", Verdict: "Minor issues", Markdown: "a", Findings: []outFinding{
{Lens: "security", File: "a.go", Line: 9, Severity: "medium", Title: "leak"}}},
{Model: "m2", Verdict: "Minor issues", Markdown: "b", Findings: []outFinding{
{Lens: "security", File: "a.go", Line: 9, Severity: "medium", Title: "leak"}}},
{Model: "broken", Verdict: "reviewer failed", Errored: true, Markdown: "boom"},
}
out := renderConsensus(models, clusterFindings(models))
// Denominator is the 2 effective models, not 3; the failure is noted.
if !strings.Contains(out, "2/2") {
t.Errorf("errored model must be excluded from the denominator (want 2/2):\n%s", out)
}
if !strings.Contains(out, "1 failed") {
t.Errorf("expected a '1 failed' note:\n%s", out)
}
if !strings.Contains(out, "reviewer failed") {
t.Errorf("errored model should still appear (folded) as failed:\n%s", out)
}
}
func TestRenderConsensusLoneHighFolds(t *testing.T) {
// A single-model HIGH (not critical) folds — only consensus or a lone CRITICAL
// earns the headline, so a lone Blocking-lens finding doesn't reintroduce noise.
models := []modelFindings{
{Model: "solo", Verdict: "Blocking issues found", Markdown: "x", Findings: []outFinding{
{Lens: "security", File: "a.go", Line: 1, Severity: "high", Title: "maybe-bug"}}},
}
out := renderConsensus(models, clusterFindings(models))
head := out
if i := strings.Index(out, "single-model finding"); i >= 0 {
head = out[:i]
}
if strings.Contains(head, "maybe-bug") {
t.Errorf("a lone HIGH should fold, not headline:\n%s", out)
}
}
func TestWriteAndConsolidateRoundTrip(t *testing.T) {
dir := t.TempDir()
// Two model artifacts on disk.
write := func(name string, mf modelFindings) {
data, _ := json.Marshal(mf)
if err := os.WriteFile(filepath.Join(dir, name), data, 0o644); err != nil {
t.Fatal(err)
}
}
write("m1.json", modelFindings{Model: "m1", Provider: "ollama", Verdict: "Minor issues", Markdown: "md1",
Findings: []outFinding{{Lens: "security", File: "x.go", Line: 3, Severity: "medium", Title: "leak"}}})
write("m2.json", modelFindings{Model: "m2", Provider: "ollama", Verdict: "Minor issues", Markdown: "md2",
Findings: []outFinding{{Lens: "security", File: "x.go", Line: 3, Severity: "high", Title: "leak"}}})
// A junk file must be skipped, not crash consolidation.
if err := os.WriteFile(filepath.Join(dir, "notes.txt"), []byte("ignore me"), 0o644); err != nil {
t.Fatal(err)
}
t.Setenv("GADFLY_CONSOLIDATE_DIR", dir)
// runConsolidate prints to stdout; capture it.
out := captureStdout(t, func() {
if err := runConsolidate(); err != nil {
t.Fatalf("runConsolidate: %v", err)
}
})
if !strings.HasPrefix(strings.TrimSpace(out), consensusMarker) {
t.Errorf("consolidated output must lead with the marker:\n%s", out)
}
if !strings.Contains(out, "2/2") || !strings.Contains(out, "x.go:3") {
t.Errorf("expected the agreed x.go:3 finding at 2/2:\n%s", out)
}
}
func TestRunConsolidateEmptyDirErrors(t *testing.T) {
t.Setenv("GADFLY_CONSOLIDATE_DIR", t.TempDir())
if err := runConsolidate(); err == nil {
t.Error("want an error for an empty consolidate dir (entrypoint falls back)")
}
}
// captureStdout redirects os.Stdout for the duration of fn and returns what was
// written.
func captureStdout(t *testing.T, fn func()) string {
t.Helper()
orig := os.Stdout
r, w, err := os.Pipe()
if err != nil {
t.Fatal(err)
}
os.Stdout = w
defer func() { os.Stdout = orig }() // restore even if fn panics
done := make(chan string)
go func() {
var sb strings.Builder
buf := make([]byte, 4096)
for {
n, err := r.Read(buf)
if n > 0 {
sb.Write(buf[:n])
}
if err != nil {
break
}
}
r.Close()
done <- sb.String()
}()
fn()
w.Close()
return <-done
}
+19 -1
View File
@@ -28,6 +28,22 @@ func (v verdict) label() string {
} }
} }
// severity maps a verdict to a canonical severity word (the same vocabulary as
// the structured findings: critical/high/medium/small/trivial). Used as the
// raw_severity for heuristic-scraped findings, which carry no per-finding
// severity of their own — so the telemetry store sees one consistent vocabulary
// instead of a mix of canonical words and full verdict phrases.
func (v verdict) severity() string {
switch v {
case verdictBlocking:
return "high"
case verdictMinor:
return "small"
default:
return "trivial"
}
}
// parseVerdict extracts a specialist's self-reported verdict from its output. // parseVerdict extracts a specialist's self-reported verdict from its output.
// The base prompt tells each lens to lead with one of the three phrases. // The base prompt tells each lens to lead with one of the three phrases.
func parseVerdict(out string) verdict { func parseVerdict(out string) verdict {
@@ -100,7 +116,9 @@ func renderConsolidated(results []specialistResult) string {
headline, len(results), strings.Join(specialistNames(results), ", ")) headline, len(results), strings.Join(specialistNames(results), ", "))
for _, r := range results { for _, r := range results {
body := strings.TrimSpace(r.out) // Strip the machine-readable ```gadfly-findings block — it's for tooling
// (telemetry/consolidation), not for human readers of the comment.
body := strings.TrimSpace(stripFindingsBlock(r.out))
if body == "" { if body == "" {
body = "_(no output)_" body = "_(no output)_"
} }
+24 -23
View File
@@ -10,11 +10,14 @@ package main
// the review markdown consumed by run.sh) and never changes the exit code. // the review markdown consumed by run.sh) and never changes the exit code.
// - It depends only on the Go stdlib (net/http). // - It depends only on the Go stdlib (net/http).
// //
// Findings are extracted heuristically from each lens's markdown: a "path:line" // Findings come PRIMARILY from each lens's machine-readable ```gadfly-findings
// reference (e.g. run/executor.go:166) anchors a finding, whose title is the // block (findings.go: exact file/line + per-finding severity/confidence). When a
// nearest preceding markdown heading / numbered item / bold lead-in (else the // model emits no parseable block, emit falls back to a heuristic prose scrape: a
// first sentence of the finding's paragraph). This is best-effort signal for the // "path:line" reference (e.g. run/executor.go:166) anchors a finding, whose title
// store to aggregate — not a structured contract the reviewer guarantees. // is the nearest preceding markdown heading / numbered item / bold lead-in (else
// the first sentence of the finding's paragraph). The scrape is best-effort
// signal for the store to aggregate, not a structured contract the reviewer
// guarantees.
import ( import (
"bytes" "bytes"
@@ -50,11 +53,16 @@ var pathLineRe = regexp.MustCompile(`([A-Za-z0-9_./-]+\.[A-Za-z0-9]+):(\d+)`)
var numberedRe = regexp.MustCompile(`^\d+[.)]\s+(.+)$`) var numberedRe = regexp.MustCompile(`^\d+[.)]\s+(.+)$`)
// finding is one extracted issue: where it points and a derived human title. // finding is one extracted issue: where it points and a derived human title.
// severity/confidence are populated from the structured ```gadfly-findings block
// (extractStructuredFindings); they are empty for findings recovered by the
// heuristic prose scrape (parseFindings), which has no per-finding signal.
type finding struct { type finding struct {
file string file string
line int line int
title string title string
detail string detail string
severity string // per-finding: critical/high/medium/small/trivial ("" if heuristic)
confidence string // per-finding: high/medium/low ("" if heuristic)
} }
// runPayload is the POST /runs body. Field names match the gadfly-reports API EXACTLY. // runPayload is the POST /runs body. Field names match the gadfly-reports API EXACTLY.
@@ -86,6 +94,7 @@ type reportPayload struct {
Provider string `json:"provider"` Provider string `json:"provider"`
RunID string `json:"run_id"` RunID string `json:"run_id"`
RawSeverity string `json:"raw_severity"` RawSeverity string `json:"raw_severity"`
Confidence string `json:"confidence"` // per-finding high/medium/low ("" if heuristic)
Detail string `json:"detail"` Detail string `json:"detail"`
} }
@@ -116,21 +125,12 @@ func emit(results []specialistResult, elapsed time.Duration) {
// InputTokens/OutputTokens/CostUSD stay nil -> JSON null (not metered). // InputTokens/OutputTokens/CostUSD stay nil -> JSON null (not metered).
} }
// collectFindings (consensus.go) skips clean/errored lenses and fills in each
// finding's severity (per-finding when structured, else derived from the lens
// verdict), so every reportPayload carries a canonical raw_severity.
var reports []reportPayload var reports []reportPayload
for _, r := range results { for _, r := range results {
if r.errored { for _, f := range collectFindings(r) {
continue // a failed lens contributes no findings
}
// A lens that reports "No material issues found" has nothing to flag —
// its path:line references are verification notes ("verified X at
// file:line is safe"), not problems. Extracting them pollutes the
// findings store with false positives and unfairly penalizes thorough
// reviewers that do clean passes, so a clean lens emits no findings.
if r.verdict == verdictClean {
continue
}
sev := r.verdict.label()
for _, f := range parseFindings(r.spec, r.out) {
reports = append(reports, reportPayload{ reports = append(reports, reportPayload{
Repo: repo, Repo: repo,
PR: pr, PR: pr,
@@ -141,7 +141,8 @@ func emit(results []specialistResult, elapsed time.Duration) {
Model: model, Model: model,
Provider: provider, Provider: provider,
RunID: runID, RunID: runID,
RawSeverity: sev, RawSeverity: f.severity,
Confidence: f.confidence,
Detail: f.detail, Detail: f.detail,
}) })
} }
+6 -3
View File
@@ -157,7 +157,7 @@ func TestEmit_PostsRunsAndReports(t *testing.T) {
if len(reportBody) != 2 { if len(reportBody) != 2 {
t.Fatalf("/reports array length = %d, want 2", len(reportBody)) t.Fatalf("/reports array length = %d, want 2", len(reportBody))
} }
for _, k := range []string{"repo", "pr", "lens", "file", "line", "title", "model", "provider", "run_id", "raw_severity", "detail"} { for _, k := range []string{"repo", "pr", "lens", "file", "line", "title", "model", "provider", "run_id", "raw_severity", "confidence", "detail"} {
if _, ok := reportBody[0][k]; !ok { if _, ok := reportBody[0][k]; !ok {
t.Errorf("/reports[0] missing field %q (got keys %v)", k, keysOf(reportBody[0])) t.Errorf("/reports[0] missing field %q (got keys %v)", k, keysOf(reportBody[0]))
} }
@@ -171,8 +171,11 @@ func TestEmit_PostsRunsAndReports(t *testing.T) {
if reportBody[0]["line"] != float64(166) { if reportBody[0]["line"] != float64(166) {
t.Errorf("reports[0].line = %v, want 166", reportBody[0]["line"]) t.Errorf("reports[0].line = %v, want 166", reportBody[0]["line"])
} }
if reportBody[0]["raw_severity"] != "Blocking issues found" { // No structured block in sampleLensMarkdown, so this is a heuristic-scraped
t.Errorf("reports[0].raw_severity = %v, want 'Blocking issues found'", reportBody[0]["raw_severity"]) // finding: raw_severity is the canonical word derived from the lens verdict
// (Blocking -> "high"), not the full verdict phrase.
if reportBody[0]["raw_severity"] != "high" {
t.Errorf("reports[0].raw_severity = %v, want 'high'", reportBody[0]["raw_severity"])
} }
if reportBody[0]["run_id"] != "owner/repo#7:ollama-cloud/qwen3" { if reportBody[0]["run_id"] != "owner/repo#7:ollama-cloud/qwen3" {
t.Errorf("reports[0].run_id = %v, want owner/repo#7:ollama-cloud/qwen3", reportBody[0]["run_id"]) t.Errorf("reports[0].run_id = %v, want owner/repo#7:ollama-cloud/qwen3", reportBody[0]["run_id"])
+8 -4
View File
@@ -33,14 +33,18 @@ type reviewEngine interface {
runPass(ctx context.Context, system, task string, maxSteps int) (string, error) runPass(ctx context.Context, system, task string, maxSteps int) (string, error)
} }
// majordomoEngine drives the in-process majordomo agent over the repo sandbox. // majordomoEngine drives the in-process review path over the repo sandbox. It no
// longer calls majordomo's agent loop directly: each pass runs through an
// executus run.Executor (see executus.go), which adds context compaction, run
// bounding, the per-PR budget gate, and the wrap-up critic. mdl is retained only
// so auto-select can fall back to the review model as its selector.
type majordomoEngine struct { type majordomoEngine struct {
mdl llm.Model rex *reviewExecutor
fsTools *repoFS mdl llm.Model
} }
func (e *majordomoEngine) runPass(ctx context.Context, system, task string, maxSteps int) (string, error) { func (e *majordomoEngine) runPass(ctx context.Context, system, task string, maxSteps int) (string, error) {
return runAgent(ctx, e.mdl, e.fsTools, system, task, maxSteps) return e.rex.run(ctx, system, task, maxSteps)
} }
// claudeCodeEngine reviews by shelling out to the `claude` CLI (Claude Code) in // claudeCodeEngine reviews by shelling out to the `claude` CLI (Claude Code) in
+371
View File
@@ -0,0 +1,371 @@
package main
// executus.go wires gadfly's agentic review path onto the executus run kernel
// (gitea.stevedudenhoeffer.com/steve/executus), layered above majordomo. The
// majordomoEngine no longer drives majordomo's agent loop directly; it builds a
// run.Executor that gives gadfly, for free:
//
// - context compaction (executus/compact): once the transcript a step would
// SEND crosses a token threshold derived from the model's real context
// window, the runaway middle is folded into a one-paragraph summary by a
// cheap summarizer model — so a big diff + accumulating read_file/grep
// results can't balloon every re-sent step (the large-PR burn).
// - run bounding + a per-PR spend budget (executus/run Ports.Budget): a hard
// token/seconds ceiling so a pathological PR can't drain the usage block.
// - the wrap-up nudge, re-expressed as an executus Critic (Ports.Critic): the
// steer that tells a step-hungry model to stop investigating and write its
// answer is now the critic seam, not a bespoke RunOption.
//
// Everything degrades to today's behavior when unconfigured: nil summarizer or a
// 0 context window disables compaction; nil budget disables the ceiling; the
// claude-code engine shells out and is unaffected by any of this.
//
// gadfly keeps its own model.go resolution (so GADFLY_ENDPOINT_<NAME> http
// aliases, failover chains, and the claude-code engine all survive) — the
// run.Executor is handed gadfly's already-resolved model via a trivial resolver,
// not routed through executus's tier table.
import (
"context"
"errors"
"fmt"
"os"
"strconv"
"strings"
"sync"
"sync/atomic"
"time"
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
"gitea.stevedudenhoeffer.com/steve/executus/compact"
"gitea.stevedudenhoeffer.com/steve/executus/model"
exrun "gitea.stevedudenhoeffer.com/steve/executus/run"
exectool "gitea.stevedudenhoeffer.com/steve/executus/tool"
)
const (
// defaultCompactRatio is the fraction of the model's context window at which
// compaction fires. It is deliberately LOWER than executus's own 0.7 default:
// on the large-PR burn the per-step transcript is the embedded diff (~17K) plus
// accumulating read_file results, which rarely reaches 0.7×262K≈183K — so a
// 0.7 threshold never bites. ~0.45×262K≈118K folds the runaway middle while a
// transcript is still well under the cap. Override with GADFLY_COMPACT_RATIO.
defaultCompactRatio = 0.45
// defaultCompactKeepRecent / defaultCompactSummaryWords mirror executus's own
// compactor defaults; surfaced as gadfly env knobs for tuning.
defaultCompactKeepRecent = 8
defaultCompactSummaryWords = 200
// contextTokenLookupTimeout bounds the one-shot /api/show call that resolves a
// cloud model's context window at executor-build time. Kept short so a slow or
// unreachable endpoint adds at most this to startup before degrading to
// no-compaction (rather than the provider cache's default 15s).
contextTokenLookupTimeout = 5 * time.Second
)
// runSeq mints a unique-per-process RunID suffix for each executor run so audit
// and the run kernel can tell one pass from another within a binary process.
var runSeq atomic.Uint64
// wrappedTool adapts an already-built majordomo llm.Tool (gadfly's sandboxed
// read_file/grep/get_diff/… closures over the repoFS) to executus's tool.Tool
// interface so the run kernel can build a toolbox from them by name. gadfly's
// tools need no caller/channel identity, so BuildLLM ignores the Invocation and
// returns the pre-built tool; Permission is the zero value (private, ungated).
type wrappedTool struct{ t llm.Tool }
func (w wrappedTool) Name() string { return w.t.Name }
func (w wrappedTool) Description() string { return w.t.Description }
func (w wrappedTool) Permission() exectool.Permission { return exectool.Permission{} }
func (w wrappedTool) BuildLLM(_ exectool.Invocation) llm.Tool { return w.t }
// gadflyToolRegistry registers the repo's read-only tools (plus the optional
// delegate_investigation worker tool) in a fresh executus tool.Registry and
// returns it along with the tool names for RunnableAgent.LowLevelTools.
func gadflyToolRegistry(fs *repoFS) (exectool.Registry, []string, error) {
reg := exectool.NewRegistry()
tools := fs.allTools()
names := make([]string, 0, len(tools))
for _, t := range tools {
if err := reg.Register(wrappedTool{t: t}); err != nil {
return nil, nil, fmt.Errorf("register tool %q: %w", t.Name, err)
}
names = append(names, t.Name)
}
return reg, names, nil
}
// gadflyBudget is gadfly's per-PR spend ceiling, satisfying run.Ports.Budget.
// It gates a run BEFORE it makes any model call (Check) once the process has
// spent its token or wall-clock allowance on this PR. Tokens are fed in
// out-of-band via addUsage (the Budget interface's Commit only carries seconds);
// the engine calls addUsage after each pass with run.Result.Usage. A nil
// *gadflyBudget is never installed — caps of 0 mean "unlimited", so the port is
// only wired when at least one cap is set.
//
// The guard is PASS-granular: Check runs before each pass, so it stops the NEXT
// pass once the budget is spent but cannot abort a single runaway pass mid-flight.
// The swarm-wide GADFLY_PR_BUDGET_SECS wall-clock backstop (entrypoint.sh) is what
// bounds a mid-pass runaway.
type gadflyBudget struct {
mu sync.Mutex
maxTokens int64
maxSeconds float64
tokens int64
seconds float64
}
// newPRBudget builds the per-PR budget from env, or nil when neither cap is set
// (the default — the swarm-wide ceiling lives in entrypoint.sh; this is the
// per-process belt to its suspenders).
func newPRBudget() *gadflyBudget {
toks := envInt("GADFLY_PR_TOKEN_BUDGET", 0)
secs := envInt("GADFLY_PR_TIME_BUDGET_SECS", 0)
if toks <= 0 && secs <= 0 {
return nil
}
return &gadflyBudget{maxTokens: int64(toks), maxSeconds: float64(secs)}
}
func (b *gadflyBudget) Check(_ context.Context, _ string) error {
if b == nil {
return nil
}
b.mu.Lock()
defer b.mu.Unlock()
if b.maxTokens > 0 && b.tokens >= b.maxTokens {
return fmt.Errorf("gadfly: per-PR token budget exhausted (%d/%d)", b.tokens, b.maxTokens)
}
if b.maxSeconds > 0 && b.seconds >= b.maxSeconds {
return fmt.Errorf("gadfly: per-PR time budget exhausted (%.0f/%.0fs)", b.seconds, b.maxSeconds)
}
return nil
}
func (b *gadflyBudget) Commit(_ context.Context, _ string, runtimeSeconds float64) {
if b == nil {
return
}
b.mu.Lock()
defer b.mu.Unlock()
b.seconds += runtimeSeconds
}
// addUsage records a finished pass's token spend toward the budget. Safe on nil.
func (b *gadflyBudget) addUsage(u llm.Usage) {
if b == nil {
return
}
b.mu.Lock()
defer b.mu.Unlock()
b.tokens += int64(u.InputTokens) + int64(u.OutputTokens)
}
// wrapUpCritic re-expresses gadfly's wrap-up nudge as an executus run.Critic:
// once a run comes within wrapUpReserve steps of its cap, Steer() injects the
// "stop calling tools and write your final answer" message so a thorough model
// spends its last steps finalizing instead of hard-failing empty. It sets no
// hard deadline (Deadline()==zero) and never raises the step ceiling
// (MaxSteps()==0, defer to the run's MaxIterations) — it is purely the nudge.
type wrapUpCritic struct{ reserve int }
func (c *wrapUpCritic) Monitor(_ context.Context, info exrun.RunInfo, _ time.Duration) exrun.CriticHandle {
return &wrapUpHandle{maxSteps: info.MaxIterations, reserve: c.reserve}
}
type wrapUpHandle struct {
mu sync.Mutex
maxSteps int
reserve int
done int // steps completed so far
nudged bool
}
func (h *wrapUpHandle) RecordStep(iter int, _ *llm.Response) {
h.mu.Lock()
h.done = iter + 1
h.mu.Unlock()
}
func (h *wrapUpHandle) RecordToolStart(string, string) {}
func (h *wrapUpHandle) Steer() []llm.Message {
h.mu.Lock()
defer h.mu.Unlock()
at := h.maxSteps - h.reserve
if at < 1 {
at = 1
}
if !h.nudged && h.maxSteps > 0 && h.done >= at {
h.nudged = true
return []llm.Message{llm.UserText(wrapUpInstruction)}
}
return nil
}
func (h *wrapUpHandle) Deadline() time.Time { return time.Time{} }
func (h *wrapUpHandle) MaxSteps() int { return 0 }
func (h *wrapUpHandle) KillCause() error { return nil }
func (h *wrapUpHandle) Stop() {}
// reviewExecutor bundles a run.Executor with the per-run wiring the engine needs
// for each pass (the tool names to expose, the model spec to report as the tier,
// the per-PR caller id, and the budget to feed token usage into).
type reviewExecutor struct {
ex *exrun.Executor
toolNames []string
modelSpec string
callerID string
budget *gadflyBudget
}
// newReviewExecutor builds the run.Executor for the in-process majordomo review
// path. mdl is gadfly's already-resolved review model; summarizer is the cheap
// model the compactor uses (nil disables compaction). Compaction also needs the
// model's context window (resolved once here, not per pass); a 0 window likewise
// disables it. The budget (may be nil) becomes the run.Ports.Budget gate.
func newReviewExecutor(fs *repoFS, mdl, summarizer llm.Model, modelSpec string, budget *gadflyBudget) (*reviewExecutor, error) {
reg, names, err := gadflyToolRegistry(fs)
if err != nil {
return nil, err
}
// gadfly resolves exactly one model per process; the run kernel's resolver
// just hands that model back regardless of the tier string it is asked for.
modelsResolver := func(ctx context.Context, _ string) (context.Context, llm.Model, error) {
return ctx, mdl, nil
}
var compactor compact.CompactorFactory
var ctxTokens func(string) int
if summarizer != nil && compactionEnabled() {
if window := resolveContextTokens(modelSpec); window > 0 {
sumResolver := func(ctx context.Context, _ string) (context.Context, llm.Model, error) {
return ctx, summarizer, nil
}
compactor = compact.NewCompactor(compact.CompactorConfig{
Models: sumResolver,
KeepRecent: envInt("GADFLY_COMPACT_KEEP_RECENT", defaultCompactKeepRecent),
SummaryWordCap: envInt("GADFLY_COMPACT_SUMMARY_WORDS", defaultCompactSummaryWords),
})
ctxTokens = func(string) int { return window } // memoized: one window per process
}
}
var ports exrun.Ports
ports.Critic = &wrapUpCritic{reserve: wrapUpReserve()}
if budget != nil {
ports.Budget = budget
}
cfg := exrun.Config{
Registry: reg,
Models: modelsResolver,
Compactor: compactor,
ContextTokens: ctxTokens,
Defaults: exrun.Defaults{
// MaxIterations/MaxRuntime are intentionally omitted: every pass sets its
// own per-run cap on the RunnableAgent below (the review and recheck caps
// differ), so a Defaults value here would always be overridden — dead. This
// leaves only the cross-pass guards + the compaction ratio.
MaxConsecutiveToolErrors: 4,
MaxSameToolCallRepeats: 4,
CompactionThresholdRatio: compactRatio(),
FallbackTier: modelSpec,
},
Ports: ports,
}
return &reviewExecutor{
ex: exrun.New(cfg),
toolNames: names,
modelSpec: modelSpec,
callerID: prCallerID(),
budget: budget,
}, nil
}
// run executes one agent pass (review or recheck) through the run kernel and
// returns the model's final text. An empty answer with no error is reported as
// an error so the caller (reviewWithSpecialist) renders the advisory "reviewer
// failed to complete" notice rather than a blank section.
func (r *reviewExecutor) run(ctx context.Context, system, task string, maxSteps int) (string, error) {
res := r.ex.Run(ctx, exrun.RunnableAgent{
Name: "gadfly-review",
SystemPrompt: system,
ModelTier: r.modelSpec,
MaxIterations: maxSteps,
MaxRuntime: reviewTimeout(),
LowLevelTools: r.toolNames,
Critic: exrun.CriticConfig{Enabled: true},
}, exectool.Invocation{
RunID: fmt.Sprintf("gadfly-%d", runSeq.Add(1)),
CallerID: r.callerID,
}, task)
// Feed token spend toward the per-PR budget out-of-band (Commit carries only
// seconds; the executor already called it). Safe on a nil budget.
r.budget.addUsage(res.Usage)
if res.Err != nil {
return "", res.Err
}
if out := strings.TrimSpace(res.Output); out != "" {
return out, nil
}
return "", errors.New("agent produced no output")
}
// prCallerID is the budget/audit caller key: the repo + PR, so a budget keyed on
// it is naturally per-PR. Falls back to "local" for an out-of-CI run.
func prCallerID() string {
repo := strings.TrimSpace(os.Getenv("GADFLY_REPO"))
pr := strings.TrimSpace(os.Getenv("GADFLY_PR"))
if repo == "" && pr == "" {
return "local"
}
return repo + "#" + pr
}
// compactionEnabled reports whether context compaction should be wired. On
// unless GADFLY_COMPACT is explicitly falsey.
func compactionEnabled() bool { return envBool("GADFLY_COMPACT", true) }
// compactRatio is the compaction threshold as a fraction of the model context
// window (GADFLY_COMPACT_RATIO), clamped to (0,1]; default defaultCompactRatio.
func compactRatio() float64 {
v := strings.TrimSpace(os.Getenv("GADFLY_COMPACT_RATIO"))
if v == "" {
return defaultCompactRatio
}
f, err := strconv.ParseFloat(v, 64)
if err != nil || f <= 0 || f > 1 {
return defaultCompactRatio
}
return f
}
// resolveContextTokens returns the review model's context window in tokens, used
// to set the compaction threshold. GADFLY_MODEL_CONTEXT_TOKENS overrides it
// (needed for custom/self-hosted endpoints executus can't introspect); otherwise
// it asks executus/model, which knows the static catalog and can fetch an
// Ollama Cloud model's limit via /api/show (one call, at executor-build time).
// Returns 0 — disabling compaction — for an unknown model, mirroring executus's
// "unknown ⇒ don't budget" contract.
func resolveContextTokens(modelSpec string) int {
if v := envInt("GADFLY_MODEL_CONTEXT_TOKENS", 0); v > 0 {
return v
}
key := strings.TrimSpace(os.Getenv("GADFLY_API_KEY"))
if key == "" {
key = strings.TrimSpace(os.Getenv("OLLAMA_API_KEY"))
}
cache := model.NewCloudOllamaLimitCache("", key, nil)
ctx, cancel := context.WithTimeout(context.Background(), contextTokenLookupTimeout)
defer cancel()
if n, ok := model.MaxContextTokensResolving(ctx, modelSpec, cache); ok {
return n
}
// Unknown model or a failed lookup (e.g. no key / unreachable endpoint): don't
// guess — compaction is disabled. Log it so a misconfiguration is debuggable
// rather than silently dropping the protection. Set GADFLY_MODEL_CONTEXT_TOKENS
// to force a window for an endpoint executus can't introspect.
fmt.Fprintf(os.Stderr, "gadfly: no context window resolved for %q; compaction disabled (set GADFLY_MODEL_CONTEXT_TOKENS to enable it)\n", modelSpec)
return 0
}
+140
View File
@@ -0,0 +1,140 @@
package main
import (
"context"
"testing"
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
)
func TestGadflyBudget(t *testing.T) {
ctx := context.Background()
// A nil budget never blocks and never panics.
var nilB *gadflyBudget
if err := nilB.Check(ctx, "pr"); err != nil {
t.Errorf("nil budget Check should be nil, got %v", err)
}
nilB.Commit(ctx, "pr", 10)
nilB.addUsage(llm.Usage{InputTokens: 5})
// Token ceiling: passes until usage crosses it.
b := &gadflyBudget{maxTokens: 100}
if err := b.Check(ctx, "pr"); err != nil {
t.Fatalf("fresh budget should pass, got %v", err)
}
b.addUsage(llm.Usage{InputTokens: 60, OutputTokens: 50}) // 110 >= 100
if err := b.Check(ctx, "pr"); err == nil {
t.Error("budget over the token cap should reject the next run")
}
// Seconds ceiling, accumulated via Commit.
s := &gadflyBudget{maxSeconds: 30}
s.Commit(ctx, "pr", 31)
if err := s.Check(ctx, "pr"); err == nil {
t.Error("budget over the time cap should reject the next run")
}
}
func TestNewPRBudget(t *testing.T) {
t.Setenv("GADFLY_PR_TOKEN_BUDGET", "")
t.Setenv("GADFLY_PR_TIME_BUDGET_SECS", "")
if newPRBudget() != nil {
t.Error("no caps set should yield a nil (disabled) budget")
}
t.Setenv("GADFLY_PR_TOKEN_BUDGET", "1000")
if b := newPRBudget(); b == nil || b.maxTokens != 1000 {
t.Errorf("token cap should build a budget with maxTokens=1000, got %+v", b)
}
}
func TestCompactRatio(t *testing.T) {
t.Setenv("GADFLY_COMPACT_RATIO", "")
if got := compactRatio(); got != defaultCompactRatio {
t.Errorf("default ratio = %v, want %v", got, defaultCompactRatio)
}
t.Setenv("GADFLY_COMPACT_RATIO", "0.6")
if got := compactRatio(); got != 0.6 {
t.Errorf("ratio override = %v, want 0.6", got)
}
for _, bad := range []string{"0", "-1", "2", "nope"} {
t.Setenv("GADFLY_COMPACT_RATIO", bad)
if got := compactRatio(); got != defaultCompactRatio {
t.Errorf("invalid ratio %q should fall back to the default, got %v", bad, got)
}
}
}
func TestResolveContextTokensOverride(t *testing.T) {
// The explicit override short-circuits any model introspection (no network).
t.Setenv("GADFLY_MODEL_CONTEXT_TOKENS", "123456")
if got := resolveContextTokens("anything"); got != 123456 {
t.Errorf("explicit context-token override = %d, want 123456", got)
}
}
func TestPRCallerID(t *testing.T) {
t.Setenv("GADFLY_REPO", "")
t.Setenv("GADFLY_PR", "")
if got := prCallerID(); got != "local" {
t.Errorf("no repo/PR should be %q, got %q", "local", got)
}
t.Setenv("GADFLY_REPO", "steve/mort")
t.Setenv("GADFLY_PR", "1367")
if got := prCallerID(); got != "steve/mort#1367" {
t.Errorf("callerID = %q, want steve/mort#1367", got)
}
}
func TestCompactionEnabled(t *testing.T) {
for _, v := range []string{"", "1", "true", "yes"} {
t.Setenv("GADFLY_COMPACT", v)
if !compactionEnabled() {
t.Errorf("GADFLY_COMPACT=%q should be enabled", v)
}
}
for _, v := range []string{"0", "false", "no", "off"} {
t.Setenv("GADFLY_COMPACT", v)
if compactionEnabled() {
t.Errorf("GADFLY_COMPACT=%q should be disabled", v)
}
}
}
func TestGadflyToolRegistry(t *testing.T) {
fs, err := newRepoFS(t.TempDir(), "diff")
if err != nil {
t.Fatal(err)
}
_, names, err := gadflyToolRegistry(fs)
if err != nil {
t.Fatalf("gadflyToolRegistry: %v", err)
}
want := map[string]bool{"read_file": true, "list_dir": true, "grep": true, "find_files": true, "get_diff": true}
has := func(n string) bool {
for _, g := range names {
if g == n {
return true
}
}
return false
}
for n := range want {
if !has(n) {
t.Errorf("registry missing tool %q (got %v)", n, names)
}
}
if has("delegate_investigation") {
t.Error("delegate_investigation must be absent without a worker model")
}
// With a worker model the delegate tool is registered too.
fs.worker = fakeModel(t, "x")
_, names, err = gadflyToolRegistry(fs)
if err != nil {
t.Fatalf("gadflyToolRegistry with worker: %v", err)
}
if !has("delegate_investigation") {
t.Errorf("delegate_investigation should be registered with a worker model, got %v", names)
}
}
+236
View File
@@ -0,0 +1,236 @@
package main
// Structured findings: the machine-readable contract between a lens's output and
// the telemetry/consolidation pipeline. Each lens is asked (see
// scripts/system-prompt.txt) to append a fenced ```gadfly-findings code block
// holding a JSON array of its findings. Parsing that exact block is far more
// reliable than scraping prose with a path:line regex (emit.go's heuristic),
// and it carries PER-FINDING severity + confidence the prose verdict can't.
//
// Everything here degrades gracefully: a missing, unterminated, or malformed
// block makes extractStructuredFindings return ok=false (and yield no findings),
// so the caller falls back to the heuristic scrape — a weak model that ignores
// the contract still contributes findings, exactly as before.
import (
"encoding/json"
"strconv"
"strings"
)
// structuredFinding mirrors one element of the ```gadfly-findings JSON array.
// Line is json.Number so we tolerate both 123 and "123" from less-precise models.
type structuredFinding struct {
File string `json:"file"`
Line json.Number `json:"line"`
Severity string `json:"severity"`
Confidence string `json:"confidence"`
Title string `json:"title"`
Detail string `json:"detail"` // optional; the prose paragraph is used when absent
}
// findingsFence is the info-string that tags the machine-readable block.
const findingsFence = "gadfly-findings"
// extractStructuredFindings parses the ```gadfly-findings JSON block out of a
// lens's markdown. It returns the findings and ok=true when a TERMINATED block is
// present AND parses as a JSON array (an empty array is valid "nothing found" —
// ok is still true). A missing, unterminated, or unparseable block returns
// ok=false so the caller falls back to the heuristic scrape.
//
// Findings are deduped by file:line (keeping the first, matching parseFindings),
// findings with no usable file are dropped, and each title/detail is backfilled
// from the prose when the JSON omits it (best of both: exact location + the human
// context the model already wrote).
func extractStructuredFindings(out string) ([]finding, bool) {
lines := strings.Split(out, "\n")
start, end, ok := findingsSpan(lines)
if !ok {
return nil, false
}
var raw []structuredFinding
if err := json.Unmarshal([]byte(strings.Join(lines[start+1:end], "\n")), &raw); err != nil {
return nil, false
}
findings := make([]finding, 0, len(raw))
seen := map[string]bool{}
var prose map[string]string // built lazily — only if a finding lacks its own detail
for _, sf := range raw {
file := strings.TrimSpace(sf.File)
if file == "" {
continue
}
ln := 0
if n, err := strconv.Atoi(strings.TrimSpace(sf.Line.String())); err == nil && n > 0 {
ln = n
}
key := file + ":" + strconv.Itoa(ln)
if ln > 0 { // only dedupe concrete locations; unknown-line findings are kept
if seen[key] {
continue
}
seen[key] = true
}
detail := strings.TrimSpace(sf.Detail)
if detail == "" && ln > 0 {
if prose == nil {
prose = proseParagraphs(out)
}
detail = prose[key]
}
title := strings.TrimSpace(sf.Title)
if title == "" { // never empty: fall back to the prose detail, then the location
if detail != "" {
title = truncate(detail, 120)
} else if ln > 0 {
title = key
} else {
title = file
}
}
findings = append(findings, finding{
file: file,
line: ln,
title: title,
detail: truncate(detail, 500),
severity: normalizeSeverity(sf.Severity),
confidence: normalizeConfidence(sf.Confidence),
})
if len(findings) >= maxFindingsPerLens {
break
}
}
return findings, true
}
// extractStructuredFindingsOrScrape returns a lens's findings, preferring the
// structured ```gadfly-findings block and falling back to the heuristic prose
// scrape when the block is absent, unterminated/malformed, OR parsed to zero
// usable findings (e.g. an empty [] emitted alongside real prose findings).
// Factored out of emit() so the fallback rule is unit-testable.
func extractStructuredFindingsOrScrape(r specialistResult) []finding {
if fs, _ := extractStructuredFindings(r.out); len(fs) > 0 {
return fs
}
return parseFindings(r.spec, r.out)
}
// stripFindingsBlock removes every TERMINATED ```gadfly-findings block from out so
// the machine-readable JSON never shows in the rendered comment. An UNTERMINATED
// fence is left in place — treating it as a block would swallow the rest of the
// comment (e.g. when a model's output was truncated mid-block). Trailing
// whitespace is trimmed.
func stripFindingsBlock(out string) string {
lines := strings.Split(out, "\n")
for {
start, end, ok := findingsSpan(lines)
if !ok {
break
}
lines = append(lines[:start:start], lines[end+1:]...)
}
return strings.TrimRight(strings.Join(lines, "\n"), "\n")
}
// findingsSpan returns the [start,end] inclusive line indices of the first
// TERMINATED ```gadfly-findings block in lines (start = the opening fence, end =
// the closing fence), or ok=false when there is none or it is unterminated.
// extract and strip share it so they always agree on what is (and isn't) a block.
func findingsSpan(lines []string) (start, end int, ok bool) {
start = -1
for i, ln := range lines {
if isFindingsOpen(ln) {
start = i
break
}
}
if start < 0 {
return 0, 0, false
}
for j := start + 1; j < len(lines); j++ {
if isFenceClose(lines[j]) {
return start, j, true
}
}
return 0, 0, false // unterminated
}
// fenceInfo returns the info-string (text after the backticks) of a code-fence
// line and whether the line opens/closes a fence at all. A bare ``` yields ("",
// true); ```gadfly-findings yields ("gadfly-findings", true).
func fenceInfo(line string) (string, bool) {
t := strings.TrimSpace(line)
if !strings.HasPrefix(t, "```") {
return "", false
}
return strings.TrimSpace(strings.TrimLeft(t, "`")), true
}
// isFindingsOpen reports whether line opens a ```gadfly-findings block, matching
// the info-string EXACTLY (not as a substring) so a fence like ```not-findings
// can't masquerade as ours.
func isFindingsOpen(line string) bool {
info, ok := fenceInfo(line)
return ok && info == findingsFence
}
// isFenceClose reports whether line is a bare closing fence (``` with no info).
func isFenceClose(line string) bool {
info, ok := fenceInfo(line)
return ok && info == ""
}
// proseParagraphs maps "file:line" -> the prose paragraph that first references
// it, so a structured finding without its own detail can borrow the human
// context the model already wrote. Built from the markdown OUTSIDE the findings
// block (the block is JSON, not prose).
func proseParagraphs(out string) map[string]string {
prose := stripFindingsBlock(out)
lines := strings.Split(prose, "\n")
m := map[string]string{}
for _, loc := range pathLineRe.FindAllStringSubmatchIndex(prose, -1) {
key := prose[loc[2]:loc[3]] + ":" + prose[loc[4]:loc[5]]
if _, dup := m[key]; dup {
continue
}
li := strings.Count(prose[:loc[0]], "\n")
m[key] = paragraphAt(lines, li)
}
return m
}
// normalizeSeverity maps a model's severity word onto the canonical set
// (critical/high/medium/small/trivial), accepting common synonyms. An
// unrecognized value is returned lowercased so the store still sees the raw word.
func normalizeSeverity(s string) string {
switch t := strings.ToLower(strings.TrimSpace(s)); t {
case "critical", "crit", "blocker", "blocking":
return "critical"
case "high", "major", "severe":
return "high"
case "medium", "moderate":
return "medium"
case "small", "low", "minor":
return "small"
case "trivial", "nit", "nitpick", "info", "informational", "style", "cosmetic":
return "trivial"
default:
return t
}
}
// normalizeConfidence maps a model's confidence word onto high/medium/low,
// accepting common synonyms; an unrecognized value is returned lowercased.
func normalizeConfidence(s string) string {
switch t := strings.ToLower(strings.TrimSpace(s)); t {
case "high", "certain", "confirmed", "verified":
return "high"
case "medium", "med", "moderate":
return "medium"
case "low", "unsure", "tentative", "unverified", "speculative":
return "low"
default:
return t
}
}
+146
View File
@@ -0,0 +1,146 @@
package main
import (
"strings"
"testing"
)
const sampleReview = "**Blocking issues found**\n\n" +
"- **Unauthenticated endpoint** — `model.go:184` leaks PR content to a third party.\n" +
"- A nit about naming in `util.go:12`.\n\n" +
"```gadfly-findings\n" +
"[\n" +
" {\"file\": \"model.go\", \"line\": 184, \"severity\": \"high\", \"confidence\": \"high\", \"title\": \"Unauthenticated endpoint\"},\n" +
" {\"file\": \"util.go\", \"line\": 12, \"severity\": \"nit\", \"confidence\": \"medium\", \"title\": \"naming\"}\n" +
"]\n" +
"```\n"
func TestExtractStructuredFindings(t *testing.T) {
fs, ok := extractStructuredFindings(sampleReview)
if !ok {
t.Fatal("expected ok=true for a well-formed block")
}
if len(fs) != 2 {
t.Fatalf("want 2 findings, got %d", len(fs))
}
if fs[0].file != "model.go" || fs[0].line != 184 || fs[0].severity != "high" || fs[0].confidence != "high" {
t.Errorf("finding[0] mismatch: %+v", fs[0])
}
// "nit" must normalize to the canonical "trivial".
if fs[1].severity != "trivial" {
t.Errorf("want severity normalized to trivial, got %q", fs[1].severity)
}
// Detail is borrowed from the prose paragraph referencing the same file:line.
if fs[0].detail == "" {
t.Error("expected detail borrowed from prose, got empty")
}
}
func TestExtractStructuredFindingsEmptyArray(t *testing.T) {
out := "No material issues found\n\n```gadfly-findings\n[]\n```\n"
fs, ok := extractStructuredFindings(out)
if !ok {
t.Fatal("an empty array is a valid block; want ok=true")
}
if len(fs) != 0 {
t.Fatalf("want 0 findings, got %d", len(fs))
}
}
func TestExtractStructuredFindingsFallback(t *testing.T) {
// No block at all -> ok=false so the caller uses the heuristic scrape.
if _, ok := extractStructuredFindings("Minor issues\n\n- something at `x.go:1`\n"); ok {
t.Error("want ok=false when there is no block")
}
// Malformed JSON -> ok=false (graceful fallback).
bad := "Minor issues\n\n```gadfly-findings\n{not json}\n```\n"
if _, ok := extractStructuredFindings(bad); ok {
t.Error("want ok=false for malformed JSON")
}
}
func TestExtractStructuredFindingsStringLine(t *testing.T) {
// Tolerate a quoted line number from a less-precise model.
out := "Minor issues\n\n```gadfly-findings\n[{\"file\":\"a.go\",\"line\":\"42\",\"severity\":\"medium\",\"title\":\"x\"}]\n```\n"
fs, ok := extractStructuredFindings(out)
if !ok || len(fs) != 1 || fs[0].line != 42 {
t.Fatalf("want one finding at line 42, got ok=%v %+v", ok, fs)
}
}
func TestStripFindingsBlock(t *testing.T) {
stripped := stripFindingsBlock(sampleReview)
if strings.Contains(stripped, findingsFence) {
t.Errorf("block not stripped: %q", stripped)
}
// The prose findings must survive.
if !strings.Contains(stripped, "Unauthenticated endpoint") || !strings.Contains(stripped, "util.go:12") {
t.Errorf("prose lost during strip: %q", stripped)
}
}
func TestStripFindingsBlockUnterminated(t *testing.T) {
// A truncated, unterminated block must NOT swallow the prose before it.
out := "Minor issues\n\n- real finding at `x.go:1`\n\n```gadfly-findings\n[{\"file\":\"x.go\""
got := stripFindingsBlock(out)
if !strings.Contains(got, "real finding at `x.go:1`") {
t.Errorf("unterminated block swallowed the prose: %q", got)
}
}
func TestStripFindingsBlockNoBlock(t *testing.T) {
in := "Minor issues\n\n- finding at `x.go:9`"
if out := stripFindingsBlock(in); out != in {
t.Errorf("strip changed block-free text: %q != %q", out, in)
}
}
func TestNormalizeSeverity(t *testing.T) {
cases := map[string]string{
"Critical": "critical", "blocker": "critical",
"major": "high", "HIGH": "high",
"moderate": "medium",
"minor": "small", "low": "small", // "minor" and "low" both map to small (consistently)
"nit": "trivial", "Style": "trivial",
"weird": "weird", // unknown passes through, lowercased
}
for in, want := range cases {
if got := normalizeSeverity(in); got != want {
t.Errorf("normalizeSeverity(%q) = %q, want %q", in, got, want)
}
}
}
func TestNormalizeConfidence(t *testing.T) {
cases := map[string]string{
"High": "high", "confirmed": "high",
"MEDIUM": "medium", "moderate": "medium",
"low": "low", "unverified": "low",
"hunch": "hunch", // unknown passes through, lowercased
}
for in, want := range cases {
if got := normalizeConfidence(in); got != want {
t.Errorf("normalizeConfidence(%q) = %q, want %q", in, got, want)
}
}
}
func TestExtractStructuredFindingsFallbackOnEmpty(t *testing.T) {
// A non-clean lens that emitted an empty [] but listed prose findings must
// fall through to the heuristic scrape, not silently drop everything.
out := "Minor issues\n\n- bug at `pkg/a.go:7`\n\n```gadfly-findings\n[]\n```\n"
r := specialistResult{spec: Specialist{Name: "correctness"}, out: out, verdict: verdictMinor}
fs := extractStructuredFindingsOrScrape(r)
if len(fs) == 0 {
t.Fatal("empty [] must fall back to the heuristic scrape, got no findings")
}
if fs[0].file != "pkg/a.go" || fs[0].line != 7 {
t.Errorf("heuristic fallback wrong: %+v", fs[0])
}
}
func TestVerdictSeverity(t *testing.T) {
if verdictBlocking.severity() != "high" || verdictMinor.severity() != "small" || verdictUnknown.severity() != "trivial" {
t.Error("verdict.severity mapping changed unexpectedly")
}
}
+3 -3
View File
@@ -104,7 +104,7 @@ func TestRunSpecialists_FansOut(t *testing.T) {
} }
specs := threeLenses() specs := threeLenses()
results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff") results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff")
if got := peak(); got != 3 { if got := peak(); got != 3 {
t.Errorf("peak concurrent lenses = %d, want 3", got) t.Errorf("peak concurrent lenses = %d, want 3", got)
@@ -124,7 +124,7 @@ func TestRunSpecialists_SequentialByDefault(t *testing.T) {
} }
specs := threeLenses() specs := threeLenses()
results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff") results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff")
if got := peak(); got != 1 { if got := peak(); got != 1 {
t.Errorf("peak concurrent lenses = %d, want 1 (sequential by default)", got) t.Errorf("peak concurrent lenses = %d, want 1 (sequential by default)", got)
@@ -146,7 +146,7 @@ func TestRunSpecialists_PerProviderFanOut(t *testing.T) {
} }
specs := threeLenses() specs := threeLenses()
results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff") results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff")
if got := peak(); got != 3 { if got := peak(); got != 3 {
t.Errorf("peak concurrent lenses = %d, want 3 (m1 per-provider override)", got) t.Errorf("peak concurrent lenses = %d, want 3 (m1 per-provider override)", got)
+89 -140
View File
@@ -39,10 +39,9 @@
// GADFLY_TITLE PR title (optional). // GADFLY_TITLE PR title (optional).
// GADFLY_BODY PR description (optional). // GADFLY_BODY PR description (optional).
// GADFLY_MAX_STEPS review-pass step cap (optional, default 24). // GADFLY_MAX_STEPS review-pass step cap (optional, default 24).
// GADFLY_WRAPUP_RESERVE steps before the cap at which the agent is told to // GADFLY_WRAPUP_RESERVE steps before the cap at which the wrap-up critic nudges
// stop investigating and write its answer (optional, // the agent to stop investigating and write its answer
// default 4). Plus a tool-free finalization fallback // (optional, default 4).
// guarantees a step-exhausted pass still emits output.
// GADFLY_RECHECK set to 0/false to skip the recheck pass (optional, default on). // GADFLY_RECHECK set to 0/false to skip the recheck pass (optional, default on).
// GADFLY_RECHECK_MAX_STEPS recheck-pass step cap (optional, default 16). // GADFLY_RECHECK_MAX_STEPS recheck-pass step cap (optional, default 16).
// GADFLY_TIMEOUT_SECS overall deadline in seconds, shared by both passes (optional, default 300). // GADFLY_TIMEOUT_SECS overall deadline in seconds, shared by both passes (optional, default 300).
@@ -55,8 +54,8 @@
// lanes as GADFLY_PROVIDER_CONCURRENCY (e.g. // lanes as GADFLY_PROVIDER_CONCURRENCY (e.g.
// "ollama-cloud=3,m1=1"). Wins over GADFLY_LENS_CONCURRENCY // "ollama-cloud=3,m1=1"). Wins over GADFLY_LENS_CONCURRENCY
// for the model's provider; falls back to it otherwise. // for the model's provider; falls back to it otherwise.
// GADFLY_MAX_DIFF_CHARS diff chars embedded in the prompt (optional, default 60000; // GADFLY_MAX_DIFF_CHARS diff chars embedded in the review prompt (optional, default 60000;
// the full diff is always available via the get_diff tool). // the full diff is reachable via the paginated get_diff tool).
// //
// On success it prints the review to stdout and exits 0. On a usage/config or // On success it prints the review to stdout and exits 0. On a usage/config or
// model error it prints a diagnostic to stderr and exits non-zero; run.sh then // model error it prints a diagnostic to stderr and exits non-zero; run.sh then
@@ -70,11 +69,9 @@ import (
"os" "os"
"strconv" "strconv"
"strings" "strings"
"sync"
"time" "time"
"gitea.stevedudenhoeffer.com/steve/majordomo/agent" "gitea.stevedudenhoeffer.com/steve/executus/fanout"
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
) )
const ( const (
@@ -107,13 +104,6 @@ const wrapUpInstruction = "⚠️ You are almost out of your investigation budge
"Do not begin any new investigation. If a finding could not be confirmed, drop it or mark it explicitly as unverified. " + "Do not begin any new investigation. If a finding could not be confirmed, drop it or mark it explicitly as unverified. " +
"Output the review in the required format right now." "Output the review in the required format right now."
// finalizeInstruction is the user message sent on the tool-free fallback pass
// when the agent exhausted its budget (or tripped a loop guard) without ever
// producing a final answer. It forces the model to synthesize whatever it has.
const finalizeInstruction = "You have run out of investigation steps. Do NOT call any tools. " +
"Based solely on what you have already gathered above, write your final answer now in the required format. " +
"If you could not confirm some findings, omit them or mark them as unverified, but produce the answer."
func main() { func main() {
if err := run(); err != nil { if err := run(); err != nil {
fmt.Fprintln(os.Stderr, "gadfly:", err) fmt.Fprintln(os.Stderr, "gadfly:", err)
@@ -123,6 +113,14 @@ func main() {
func run() error { func run() error {
start := time.Now() start := time.Now()
// Consolidation mode: not a review at all — read the per-model findings the
// swarm wrote and print the single cross-model consensus comment. entrypoint.sh
// runs this once, after every model has finished.
if strings.TrimSpace(os.Getenv("GADFLY_CONSOLIDATE_DIR")) != "" {
return runConsolidate()
}
repoDir := os.Getenv("GADFLY_REPO_DIR") repoDir := os.Getenv("GADFLY_REPO_DIR")
diffFile := os.Getenv("GADFLY_DIFF_FILE") diffFile := os.Getenv("GADFLY_DIFF_FILE")
systemFile := os.Getenv("GADFLY_SYSTEM_FILE") systemFile := os.Getenv("GADFLY_SYSTEM_FILE")
@@ -169,7 +167,18 @@ func run() error {
} else if worker != nil { } else if worker != nil {
fsTools.worker = worker fsTools.worker = worker
} }
eng = &majordomoEngine{mdl: mdl, fsTools: fsTools} // The context compactor needs a cheap summarizer; reuse the worker model
// when present, else the review model. A bad explicit GADFLY_COMPACT_MODEL
// just disables compaction rather than sinking the review.
summarizer, serr := resolveSummarizerModel(mdl, fsTools.worker)
if serr != nil {
fmt.Fprintln(os.Stderr, "gadfly: compaction summarizer disabled:", serr)
}
rex, rerr := newReviewExecutor(fsTools, mdl, summarizer, os.Getenv("GADFLY_MODEL"), newPRBudget())
if rerr != nil {
return fmt.Errorf("build review executor: %w", rerr)
}
eng = &majordomoEngine{rex: rex, mdl: mdl}
} }
specialists, registry, auto, serrs := resolveSpecialists(repoDir) specialists, registry, auto, serrs := resolveSpecialists(repoDir)
@@ -214,59 +223,64 @@ func run() error {
// Optional, best-effort telemetry. OFF unless GADFLY_FINDINGS_URL is set; // Optional, best-effort telemetry. OFF unless GADFLY_FINDINGS_URL is set;
// any failure is logged to stderr and never affects stdout or the exit code. // any failure is logged to stderr and never affects stdout or the exit code.
emit(results, time.Since(start)) emit(results, time.Since(start))
// Optional per-model findings artifact for the cross-model consolidation
// pass. No-op unless GADFLY_FINDINGS_OUT is set (entrypoint sets it for a
// multi-model swarm). Best-effort, never affects stdout or the exit code.
writeFindingsOut(results)
return nil return nil
} }
// runSpecialists reviews the diff through each lens and returns the results in // runSpecialists reviews the diff through each lens and returns the results in
// the SAME order as specialists, regardless of finish order. Up to // the SAME order as specialists, regardless of finish order. It uses executus's
// GADFLY_LENS_CONCURRENCY lenses run concurrently; the default of 1 keeps the // fanout primitive: up to GADFLY_LENS_CONCURRENCY lenses run concurrently (the
// suite sequential, exactly as before. Each lens already runs under its own // default of 1 keeps the suite sequential, exactly as before), and fanout.Run
// per-lens timeout (reviewWithSpecialist), so concurrency simply overlaps those // returns one result per lens in input order. Each lens already runs under its
// independent passes — and because reviewWithSpecialist builds a fresh toolbox // own per-lens timeout (reviewWithSpecialist) and the lenses only read the
// per pass and the lenses only read the immutable repoFS, they share no mutable // immutable repoFS, so concurrency simply overlaps independent passes.
// state. Results are stored by index so the consolidated comment keeps the
// configured lens order.
// //
// Caution: this fans out WITHIN one model. It multiplies with entrypoint.sh's // Caution: this fans out WITHIN one model. It multiplies with entrypoint.sh's
// per-provider model concurrency, so total concurrent backend requests ≈ // per-provider model concurrency, so total concurrent backend requests ≈
// (models at once) × (lenses at once). To fan lenses out without oversubscribing // (models at once) × (lenses at once). To fan lenses out without oversubscribing
// the backend, run models one at a time (provider lane cap 1) and raise this. // the backend, run models one at a time (provider lane cap 1) and raise this.
func runSpecialists(eng reviewEngine, base string, specialists []Specialist, task, diff string) []specialistResult { func runSpecialists(eng reviewEngine, base string, specialists []Specialist, task, diff string) []specialistResult {
results := make([]specialistResult, len(specialists))
// Optional live status board: publishes this model's per-lens progress to a // Optional live status board: publishes this model's per-lens progress to a
// file the entrypoint board renders. Inert (no-op) unless GADFLY_STATUS_FILE // file the entrypoint board renders. Inert (no-op) unless GADFLY_STATUS_FILE
// is set, so plain runs are unaffected. // is set, so plain runs are unaffected.
sw := newStatusWriter(os.Getenv("GADFLY_MODEL"), modelProvider(), specialists) sw := newStatusWriter(os.Getenv("GADFLY_MODEL"), modelProvider(), specialists)
conc := min(lensConcurrency(), len(specialists)) fanResults := fanout.Run(context.Background(), specialists, fanout.Options[Specialist]{
MaxConcurrent: lensConcurrency(),
}, func(_ context.Context, sp Specialist) (res specialistResult, _ error) {
// A panic in one lens must not crash the whole binary (which would kill
// every other lens's output) or leave this lens stuck at "running" on the
// status board. fanout does not recover fn panics, so we do it here:
// record the panic as an errored result and mark the lens finished.
defer func() {
if r := recover(); r != nil {
res = specialistResult{spec: sp, out: fmt.Sprintf("⚠️ This reviewer panicked: %v", r), verdict: verdictUnknown, errored: true}
sw.set(sp.Name, lensFinished, "", true)
}
}()
sw.set(sp.Name, lensRunning, "", false)
out, errored := reviewWithSpecialist(eng, base, sp, task, diff)
v := parseVerdict(out)
sw.set(sp.Name, lensFinished, v.label(), errored)
return specialistResult{spec: sp, out: out, verdict: v, errored: errored}, nil
})
sem := make(chan struct{}, conc) // fanout guarantees input order; its Result.Err is set only when the context
var wg sync.WaitGroup // is cancelled before a lens ran (reviewWithSpecialist embeds its own failures
for i, sp := range specialists { // in the result), so surface that as an errored lens rather than dropping it.
wg.Add(1) results := make([]specialistResult, len(specialists))
sem <- struct{}{} // blocks once `conc` lenses are already in flight for i, r := range fanResults {
go func(i int, sp Specialist) { if r.Err != nil {
defer wg.Done() results[i] = specialistResult{spec: specialists[i], out: fmt.Sprintf("⚠️ This reviewer did not run: %v", r.Err), verdict: verdictUnknown, errored: true}
defer func() { <-sem }() sw.set(specialists[i].Name, lensFinished, "", true)
// A panic in one lens must not crash the whole binary (which would continue
// kill every other lens's output) or leave this lens stuck at }
// "running" on the status board. Recover, record it as an errored results[i] = r.Value
// result, and mark the lens finished so the board can complete.
defer func() {
if r := recover(); r != nil {
results[i] = specialistResult{spec: sp, out: fmt.Sprintf("⚠️ This reviewer panicked: %v", r), verdict: verdictUnknown, errored: true}
sw.set(sp.Name, lensFinished, "", true)
}
}()
sw.set(sp.Name, lensRunning, "", false)
out, errored := reviewWithSpecialist(eng, base, sp, task, diff)
v := parseVerdict(out)
results[i] = specialistResult{spec: sp, out: out, verdict: v, errored: errored}
sw.set(sp.Name, lensFinished, v.label(), errored)
}(i, sp)
} }
wg.Wait()
return results return results
} }
@@ -307,8 +321,7 @@ func providerOverride(envName, provider string) (int, bool) {
// returned bool is true when the review pass failed (rendered as an inline // returned bool is true when the review pass failed (rendered as an inline
// notice — advisory; one lens failing never sinks the others or the job). // notice — advisory; one lens failing never sinks the others or the job).
func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, diff string) (string, bool) { func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, diff string) (string, bool) {
timeout := time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second ctx, cancel := context.WithTimeout(context.Background(), reviewTimeout())
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel() defer cancel()
draft, err := eng.runPass(ctx, composeSpecialistPrompt(base, sp), task, draft, err := eng.runPass(ctx, composeSpecialistPrompt(base, sp), task,
@@ -331,90 +344,6 @@ func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, di
return final, false return final, false
} }
// runAgent runs one agent pass (its own fresh toolbox over the sandbox) and
// returns the final answer. An empty answer is an error — the caller decides
// whether that is fatal (review pass) or recoverable (recheck pass). A
// non-empty answer that ended on a budget/guard error is still returned: the
// model wrote its output, then ran out of steps.
//
// Two mechanisms keep a step-hungry model from hard-failing with no output:
// 1. A wrap-up steer: once the run comes within wrapUpReserve steps of the
// cap, a forceful "stop calling tools, write your final answer" message is
// injected so the model spends its remaining steps finalizing.
// 2. A finalization fallback: if the loop still ends empty (the model ignored
// the nudge, or a loop guard tripped), one tool-free model call forces a
// final answer out of the transcript already gathered.
func runAgent(ctx context.Context, mdl llm.Model, fsTools *repoFS, system, task string, maxSteps int) (string, error) {
box, err := fsTools.toolbox()
if err != nil {
return "", err
}
loop := agent.New(mdl, system,
agent.WithToolbox(box),
agent.WithMaxSteps(maxSteps),
// Guard rails: stop the model from spinning on failing or identical
// tool calls instead of writing its answer.
agent.WithToolErrorLimits(4, 4),
)
wrapUpAt := maxSteps - wrapUpReserve()
if wrapUpAt < 1 {
wrapUpAt = 1
}
var completed int // steps finished so far (updated after each step)
nudged := false
res, runErr := loop.Run(ctx, task,
agent.OnStep(func(s agent.Step) { completed = s.Index + 1 }),
agent.WithSteer(func() []llm.Message {
if !nudged && completed >= wrapUpAt {
nudged = true
return []llm.Message{llm.UserText(wrapUpInstruction)}
}
return nil
}),
)
out := ""
if res != nil {
out = strings.TrimSpace(res.Output)
}
if out != "" {
return out, nil
}
// No final answer. If we still have budget on the clock and a transcript to
// work from, force a tool-free finalization rather than losing the pass.
if res != nil && len(res.Messages) > 0 && ctx.Err() == nil {
if forced := forceFinalAnswer(ctx, mdl, system, res.Messages); forced != "" {
return forced, nil
}
}
if runErr != nil {
return "", runErr
}
return "", errors.New("agent produced no output")
}
// forceFinalAnswer makes one tool-free model call to squeeze a final answer out
// of an agent that exhausted its step budget without producing one. Tools are
// forbidden (ToolChoice "none") so the model must synthesize from the transcript
// instead of investigating further. Best-effort: any error or empty reply
// returns "" and the caller falls back to its normal empty-output handling.
func forceFinalAnswer(ctx context.Context, mdl llm.Model, system string, transcript []llm.Message) string {
msgs := append(append([]llm.Message(nil), transcript...), llm.UserText(finalizeInstruction))
resp, err := mdl.Generate(ctx, llm.Request{
System: system,
Messages: msgs,
ToolChoice: "none",
})
if err != nil || resp == nil {
return ""
}
return strings.TrimSpace(resp.Text())
}
// wrapUpReserve is how many steps before the cap the wrap-up nudge fires, // wrapUpReserve is how many steps before the cap the wrap-up nudge fires,
// overridable via GADFLY_WRAPUP_RESERVE. // overridable via GADFLY_WRAPUP_RESERVE.
func wrapUpReserve() int { func wrapUpReserve() int {
@@ -431,7 +360,7 @@ func buildTask(diff string) string {
truncNote := "" truncNote := ""
if maxDiff > 0 && len(diff) > maxDiff { if maxDiff > 0 && len(diff) > maxDiff {
diff = diff[:maxDiff] diff = diff[:maxDiff]
truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars in this message; read the changed files (or call get_diff, if available) for the full text.]", maxDiff) truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars in this message; page the full diff with get_diff (paginated; pass a `path` to scope it to one file) or read the changed files.]", maxDiff)
} }
var b strings.Builder var b strings.Builder
@@ -458,3 +387,23 @@ func envInt(name string, def int) int {
} }
return n return n
} }
// envBool reads a boolean-ish env var: def when unset, false for an explicit
// falsey value (0/false/no/off), true otherwise. The shared spelling for
// gadfly's "on unless disabled" opt-out flags (GADFLY_RECHECK, GADFLY_COMPACT).
func envBool(name string, def bool) bool {
switch strings.ToLower(strings.TrimSpace(os.Getenv(name))) {
case "":
return def
case "0", "false", "no", "off":
return false
default:
return true
}
}
// reviewTimeout is the per-specialist-lens deadline (GADFLY_TIMEOUT_SECS), shared
// across a lens's review+recheck passes and applied as each pass's run cap.
func reviewTimeout() time.Duration {
return time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second
}
+21
View File
@@ -127,6 +127,27 @@ func resolveWorkerModel() (llm.Model, error) {
return majordomo.Parse(buildSpec(provider, spec)) return majordomo.Parse(buildSpec(provider, spec))
} }
// resolveSummarizerModel picks the model the context compactor uses to compress
// the runaway middle of a transcript. It should be CHEAP, since it fires once per
// compaction: GADFLY_COMPACT_MODEL if set (honoring GADFLY_PROVIDER for a bare
// id), else the delegate worker model when one is configured (already cheap by
// design), else the review model itself. Returns (nil, err) only on an explicit
// bad GADFLY_COMPACT_MODEL spec — the caller logs it and simply runs without
// compaction rather than failing the review.
func resolveSummarizerModel(review, worker llm.Model) (llm.Model, error) {
if spec := strings.TrimSpace(os.Getenv("GADFLY_COMPACT_MODEL")); spec != "" {
provider := strings.TrimSpace(os.Getenv("GADFLY_PROVIDER"))
if provider == "" {
provider = defaultProvider
}
return majordomo.Parse(buildSpec(provider, spec))
}
if worker != nil {
return worker, nil
}
return review, nil
}
// buildSpec turns (provider, model) into a majordomo spec. A model id that // buildSpec turns (provider, model) into a majordomo spec. A model id that
// already carries a "provider/" prefix (or is a multi-element failover chain) // already carries a "provider/" prefix (or is a multi-element failover chain)
// is passed through verbatim; a bare id is prefixed with the provider. // is passed through verbatim; a bare id is prefixed with the provider.
+15 -11
View File
@@ -2,7 +2,6 @@ package main
import ( import (
"fmt" "fmt"
"os"
"strings" "strings"
) )
@@ -11,6 +10,13 @@ import (
// than discovering them. // than discovering them.
const defaultRecheckMaxSteps = 16 const defaultRecheckMaxSteps = 16
// defaultRecheckDiffChars caps the diff embedded in the recheck task. It is much
// smaller than the review task's GADFLY_MAX_DIFF_CHARS: the recheck already has
// the draft findings to verify and can pull the exact hunks it needs via the
// paginated get_diff tool (optionally scoped to a path), so re-embedding the
// whole diff on every recheck step is pure burn. Override: GADFLY_RECHECK_DIFF_CHARS.
const defaultRecheckDiffChars = 20000
// recheckSystemPrompt drives the second, adversarial verification pass. The // recheckSystemPrompt drives the second, adversarial verification pass. The
// model is given a DRAFT review and must independently confirm each finding // model is given a DRAFT review and must independently confirm each finding
// against the real code before letting it survive — the antidote to a // against the real code before letting it survive — the antidote to a
@@ -49,18 +55,16 @@ Output rules:
- Do NOT invent new findings; this is a verification gate, not a fresh review. - Do NOT invent new findings; this is a verification gate, not a fresh review.
- Do NOT include meta-commentary about the verification process or which - Do NOT include meta-commentary about the verification process or which
findings you dropped — output only the final, corrected review markdown. findings you dropped — output only the final, corrected review markdown.
- The draft ends with a fenced ` + "`gadfly-findings`" + ` JSON block. Regenerate it
so it lists ONLY the findings that SURVIVED your verification, in the same schema
({"file","line","severity","confidence","title"}; severity one of
critical/high/medium/small/trivial, confidence one of high/medium/low). If every
finding was dropped, emit an empty array ` + "`[]`" + `. Keep the block last.
- When done investigating, STOP calling tools and reply with the review.` - When done investigating, STOP calling tools and reply with the review.`
// recheckEnabled reports whether the verification pass should run. On unless // recheckEnabled reports whether the verification pass should run. On unless
// GADFLY_RECHECK is explicitly a falsey value. // GADFLY_RECHECK is explicitly a falsey value.
func recheckEnabled() bool { func recheckEnabled() bool { return envBool("GADFLY_RECHECK", true) }
switch strings.ToLower(strings.TrimSpace(os.Getenv("GADFLY_RECHECK"))) {
case "0", "false", "no", "off":
return false
default:
return true
}
}
// shouldRecheck decides whether to run the verification pass for a given draft. // shouldRecheck decides whether to run the verification pass for a given draft.
// A clean "no material issues" draft has nothing to verify, so it is skipped // A clean "no material issues" draft has nothing to verify, so it is skipped
@@ -79,11 +83,11 @@ func shouldRecheck(draft string) bool {
// scrutinize, with the full diff available via get_diff (and embedded here, // scrutinize, with the full diff available via get_diff (and embedded here,
// truncated, to save a tool call). // truncated, to save a tool call).
func buildRecheckTask(draft, diff string) string { func buildRecheckTask(draft, diff string) string {
maxDiff := envInt("GADFLY_MAX_DIFF_CHARS", defaultMaxDiffChars) maxDiff := envInt("GADFLY_RECHECK_DIFF_CHARS", defaultRecheckDiffChars)
truncNote := "" truncNote := ""
if maxDiff > 0 && len(diff) > maxDiff { if maxDiff > 0 && len(diff) > maxDiff {
diff = diff[:maxDiff] diff = diff[:maxDiff]
truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars here; read the changed files (or call get_diff, if available) for the full text.]", maxDiff) truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars here; call get_diff (paginated; pass a `path` to scope it to one file) or read the changed files for the rest.]", maxDiff)
} }
var b strings.Builder var b strings.Builder
+29 -8
View File
@@ -49,7 +49,7 @@ func TestRecheckEnabled(t *testing.T) {
} }
func TestBuildRecheckTask(t *testing.T) { func TestBuildRecheckTask(t *testing.T) {
t.Setenv("GADFLY_MAX_DIFF_CHARS", "") t.Setenv("GADFLY_RECHECK_DIFF_CHARS", "")
draft := "VERDICT: Blocking issues found\n- foo.go:1 broken" draft := "VERDICT: Blocking issues found\n- foo.go:1 broken"
out := buildRecheckTask(draft, "diff --git a/x b/x\n+y\n") out := buildRecheckTask(draft, "diff --git a/x b/x\n+y\n")
if !strings.Contains(out, draft) { if !strings.Contains(out, draft) {
@@ -77,25 +77,46 @@ func fakeModel(t *testing.T, reply string) llm.Model {
return m return m
} }
func TestRunAgent_ReturnsOutput(t *testing.T) { // newTestReviewExecutor builds a reviewExecutor over a fake model + repo for unit
// tests: no compaction summarizer and no budget, so it exercises the bare agent
// loop through the executus run kernel without any network call.
func newTestReviewExecutor(t *testing.T, mdl llm.Model, fs *repoFS) *reviewExecutor {
t.Helper()
rex, err := newReviewExecutor(fs, mdl, nil, "mock", nil)
if err != nil {
t.Fatal(err)
}
return rex
}
// testEngine wraps newTestReviewExecutor in a majordomoEngine for the
// runSpecialists-level tests.
func testEngine(t *testing.T, mdl llm.Model, fs *repoFS) *majordomoEngine {
t.Helper()
return &majordomoEngine{rex: newTestReviewExecutor(t, mdl, fs), mdl: mdl}
}
func TestReviewExecutor_ReturnsOutput(t *testing.T) {
fs, err := newRepoFS(t.TempDir(), "diff") fs, err := newRepoFS(t.TempDir(), "diff")
if err != nil { if err != nil {
t.Fatal(err) t.Fatal(err)
} }
mdl := fakeModel(t, " corrected review: No material issues found. ") mdl := fakeModel(t, " corrected review: No material issues found. ")
out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4) rex := newTestReviewExecutor(t, mdl, fs)
out, err := rex.run(context.Background(), "sys", "task", 4)
if err != nil { if err != nil {
t.Fatalf("runAgent: %v", err) t.Fatalf("run: %v", err)
} }
if out != "corrected review: No material issues found." { if out != "corrected review: No material issues found." {
t.Errorf("runAgent should return trimmed model output, got %q", out) t.Errorf("run should return trimmed model output, got %q", out)
} }
} }
func TestRunAgent_EmptyIsError(t *testing.T) { func TestReviewExecutor_EmptyIsError(t *testing.T) {
fs, _ := newRepoFS(t.TempDir(), "diff") fs, _ := newRepoFS(t.TempDir(), "diff")
mdl := fakeModel(t, " ") mdl := fakeModel(t, " ")
if _, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4); err == nil { rex := newTestReviewExecutor(t, mdl, fs)
t.Error("runAgent should error on empty model output") if _, err := rex.run(context.Background(), "sys", "task", 4); err == nil {
t.Error("run should error on empty model output")
} }
} }
+296
View File
@@ -0,0 +1,296 @@
package main
// Inline PR review. After the consensus comment is rendered, Gadfly also posts a
// single Gitea pull review (state COMMENT — advisory, NEVER request-changes or
// approve) whose inline comments anchor consensus findings to the exact changed
// lines. The issue comment stays the ranked overview; the review puts each
// finding next to the code it's about — the "reviewer integrated with Gitea" the
// project wanted, without ever blocking a merge.
//
// All of this is best-effort: disabled by GADFLY_INLINE_REVIEW=0 or when the diff
// / API creds aren't available, only anchors findings that land on a line present
// in the diff (Gitea rejects comments off the diff), and any error is logged to
// stderr without touching the consensus comment (already on stdout) or the exit
// code.
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
"os"
"regexp"
"strconv"
"strings"
"time"
)
const (
// inlineReviewMarker tags our review body so a re-run can delete the previous
// one instead of stacking duplicate inline comments.
inlineReviewMarker = "<!-- gadfly-inline-review -->"
// maxInlineComments caps how many inline comments one review carries, so a
// huge diff can't produce a wall of annotations. Clusters are pre-sorted by
// agreement×severity, so the cap keeps the most important.
maxInlineComments = 25
inlineReviewHTTPTimeout = 20 * time.Second
)
// reviewComment is one inline comment in a Gitea pull review. Field names match
// the Gitea API EXACTLY (new_position = line in the new/head file).
type reviewComment struct {
Path string `json:"path"`
Body string `json:"body"`
NewPosition int `json:"new_position"`
}
// createReview is the POST /pulls/{n}/reviews body. event is ALWAYS "COMMENT".
type createReview struct {
Body string `json:"body"`
Event string `json:"event"`
Comments []reviewComment `json:"comments"`
}
// postInlineReview posts one COMMENT-state pull review with inline comments for
// consensus findings on changed lines. No-op + best-effort (see file comment).
func postInlineReview(clusters []cluster) {
if strings.EqualFold(strings.TrimSpace(os.Getenv("GADFLY_INLINE_REVIEW")), "0") {
return
}
api := strings.TrimRight(strings.TrimSpace(os.Getenv("GITEA_API")), "/")
token := strings.TrimSpace(os.Getenv("GITEA_TOKEN"))
pr := strings.TrimSpace(os.Getenv("GADFLY_PR"))
diffPath := strings.TrimSpace(os.Getenv("GADFLY_DIFF_FILE"))
if api == "" || token == "" || pr == "" || diffPath == "" {
return
}
diff, err := os.ReadFile(diffPath)
if err != nil {
fmt.Fprintln(os.Stderr, "gadfly: inline review: read diff:", err)
return
}
comments := inlineComments(clusters, parseDiffNewLines(string(diff)))
if len(comments) == 0 {
return // nothing anchors to a changed line; the consensus comment covers it
}
client := &http.Client{Timeout: inlineReviewHTTPTimeout}
base := fmt.Sprintf("%s/pulls/%s/reviews", api, pr)
deletePriorReviews(client, base, token) // avoid stacking on re-runs
body := fmt.Sprintf("%s\n🪰 **Gadfly consensus review** — %d inline finding%s on changed lines. See the consensus comment for the full ranked summary.\n\n<sub>Advisory only — does not block merge.</sub>",
inlineReviewMarker, len(comments), plural(len(comments)))
if err := giteaSend(client, http.MethodPost, base, token, createReview{Body: body, Event: "COMMENT", Comments: comments}); err != nil {
fmt.Fprintln(os.Stderr, "gadfly: inline review post:", err)
}
}
// inlineComments builds inline comments for the clusters that anchor to a line
// present in the diff, in priority order (clusters are pre-sorted), capped.
func inlineComments(clusters []cluster, addable map[string]map[int]bool) []reviewComment {
var out []reviewComment
for _, c := range clusters {
path := normPath(c.file)
anchor := anchorLine(addable[path], c.line, c.maxLine)
if anchor == 0 {
continue
}
out = append(out, reviewComment{Path: path, NewPosition: anchor, Body: inlineBody(c)})
if len(out) >= maxInlineComments {
break
}
}
return out
}
// anchorLine returns the first line in [lo,hi] that is an added line in the diff,
// or 0 if none. Scanning the cluster's whole span (not just its representative
// line) anchors a finding whose min line is just outside the diff but whose span
// still overlaps a changed line.
func anchorLine(added map[int]bool, lo, hi int) int {
if added == nil || lo <= 0 {
return 0
}
if hi < lo { // single-line cluster (maxLine unset)
hi = lo
}
for ln := lo; ln <= hi; ln++ {
if added[ln] {
return ln
}
}
return 0
}
// inlineBody renders one inline comment: severity + title, who flagged it, detail.
func inlineBody(c cluster) string {
var b strings.Builder
fmt.Fprintf(&b, "%s **%s**", sevIcon(c.severity), strings.TrimSpace(c.title))
fmt.Fprintf(&b, "\n\n_%s · flagged by %d model%s_", lensList(c.lenses), len(c.models), plural(len(c.models)))
if d := strings.TrimSpace(c.detail); d != "" {
fmt.Fprintf(&b, "\n\n%s", d)
}
b.WriteString("\n\n<sub>🪰 Gadfly · advisory</sub>")
return b.String()
}
// parseDiffNewLines returns, per file, the set of NEW-file line numbers that were
// ADDED in the unified diff — the safest lines for an inline comment to anchor to
// (Gitea reliably accepts comments on added lines). Context lines are walked to
// keep the line counter correct but are NOT recorded: anchoring only to added
// lines avoids the all-or-nothing review POST being rejected for an off-change
// anchor. Hunk lengths from the @@ header bound each hunk, so a content line that
// happens to start with "+++ " or "@@" is still read as content, not a header.
func parseDiffNewLines(diff string) map[string]map[int]bool {
out := map[string]map[int]bool{}
var file string
var newLine, oldRem, newRem int
inHunk := false
for _, line := range strings.Split(diff, "\n") {
if inHunk && (newRem > 0 || oldRem > 0) {
switch {
case strings.HasPrefix(line, "+"):
record(out, file, newLine) // added line — anchorable
newLine++
newRem--
case strings.HasPrefix(line, "-"):
oldRem--
case strings.HasPrefix(line, "\\"): // "\ No newline at end of file"
default: // context line (leading space, or an empty line): advance, don't record
newLine++
newRem--
oldRem--
}
if newRem <= 0 && oldRem <= 0 {
inHunk = false
}
continue
}
switch {
case strings.HasPrefix(line, "+++ "):
file = normPath(strings.TrimPrefix(line, "+++ "))
if file == "/dev/null" {
file = ""
}
case strings.HasPrefix(line, "@@"):
if m := hunkRe.FindStringSubmatch(line); m != nil && file != "" {
newLine, _ = strconv.Atoi(m[3])
oldRem = atoiOr(m[2], 1)
newRem = atoiOr(m[4], 1)
inHunk = newRem > 0 || oldRem > 0
}
}
}
return out
}
// hunkRe captures a unified-diff hunk header's old/new start+length:
// @@ -<oldStart>[,<oldLen>] +<newStart>[,<newLen>] @@
var hunkRe = regexp.MustCompile(`^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@`)
// normPath trims a unified-diff path down to a repo-relative one: strip a single
// leading "a/" or "b/" prefix (and any "./"), and surrounding whitespace. Applied
// to BOTH the diff paths and a finding's file so they match even when a model
// writes "./pkg/x.go" or the diff carries the "b/" prefix.
func normPath(p string) string {
p = strings.TrimSpace(p)
p = strings.TrimPrefix(p, "./")
if strings.HasPrefix(p, "a/") || strings.HasPrefix(p, "b/") {
p = p[2:]
}
return p
}
func record(out map[string]map[int]bool, file string, line int) {
if file == "" {
return
}
if out[file] == nil {
out[file] = map[int]bool{}
}
out[file][line] = true
}
// atoiOr parses s, returning def when s is empty or unparseable. Used for the
// optional hunk-length fields (absent => length 1).
func atoiOr(s string, def int) int {
if s == "" {
return def
}
if n, err := strconv.Atoi(s); err == nil {
return n
}
return def
}
// deletePriorReviews removes our previous inline reviews (matched by the body
// marker) so a re-run replaces rather than stacks. Best-effort and quiet.
func deletePriorReviews(client *http.Client, base, token string) {
const perPage = 50
for page := 1; page <= 10; page++ { // bound the scan, but page past 50 so a stale marked review isn't missed
req, err := http.NewRequest(http.MethodGet, fmt.Sprintf("%s?limit=%d&page=%d", base, perPage, page), nil)
if err != nil {
return
}
req.Header.Set("Authorization", "token "+token)
resp, err := client.Do(req)
if err != nil {
return
}
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
io.Copy(io.Discard, resp.Body)
resp.Body.Close()
return
}
var reviews []struct {
ID int `json:"id"`
Body string `json:"body"`
}
_ = json.NewDecoder(resp.Body).Decode(&reviews)
resp.Body.Close()
for _, r := range reviews {
if !strings.Contains(r.Body, inlineReviewMarker) {
continue
}
dreq, err := http.NewRequest(http.MethodDelete, fmt.Sprintf("%s/%d", base, r.ID), nil)
if err != nil {
continue
}
dreq.Header.Set("Authorization", "token "+token)
if dresp, err := client.Do(dreq); err == nil {
io.Copy(io.Discard, dresp.Body)
dresp.Body.Close()
}
}
if len(reviews) < perPage {
return // last page
}
}
}
// giteaSend marshals payload and sends it with Gitea's "token" auth scheme,
// treating a non-2xx response as an error (with a snippet of the body).
func giteaSend(client *http.Client, method, url, token string, payload any) error {
body, err := json.Marshal(payload)
if err != nil {
return err
}
req, err := http.NewRequest(method, url, bytes.NewReader(body))
if err != nil {
return err
}
req.Header.Set("Content-Type", "application/json")
req.Header.Set("Authorization", "token "+token)
resp, err := client.Do(req)
if err != nil {
return err
}
defer resp.Body.Close()
snippet, _ := io.ReadAll(io.LimitReader(resp.Body, 2048))
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
return fmt.Errorf("%s %s: status %d: %s", method, url, resp.StatusCode, strings.TrimSpace(string(snippet)))
}
return nil
}
+107
View File
@@ -0,0 +1,107 @@
package main
import (
"strings"
"testing"
)
func TestParseDiffNewLines(t *testing.T) {
diff := "diff --git a/a.go b/a.go\n" +
"index 111..222 100644\n" +
"--- a/a.go\n" +
"+++ b/a.go\n" +
"@@ -1,3 +1,4 @@\n" +
" line1\n" +
"-old2\n" +
"+new2\n" +
"+new3\n" +
" line4\n"
got := parseDiffNewLines(diff)
// Only ADDED lines anchor: new2 (line 2) and new3 (line 3). Context lines 1
// and 4 are walked for counting but not recorded.
want := map[int]bool{2: true, 3: true}
if len(got["a.go"]) != len(want) {
t.Fatalf("a.go lines = %v, want %v (added-only)", got["a.go"], want)
}
for ln := range want {
if !got["a.go"][ln] {
t.Errorf("expected a.go line %d anchorable", ln)
}
}
if got["a.go"][1] || got["a.go"][4] {
t.Error("context lines 1/4 should not be anchorable (added-only)")
}
}
func TestParseDiffNewLinesContentLooksLikeHeader(t *testing.T) {
// An added line whose CONTENT is "++ weird" appears as "+++ weird" in the
// diff. Hunk-length tracking must read it as content, not a file header.
diff := "--- a/x\n+++ b/x\n@@ -1,1 +1,2 @@\n ctx\n+++ weird\n"
got := parseDiffNewLines(diff)
if got["x"][1] { // context line, not recorded (added-only)
t.Errorf("line 1 is context, should not anchor: %v", got["x"])
}
if !got["x"][2] { // the "+++ weird" added line
t.Errorf("want added line 2 anchorable, got %v", got["x"])
}
}
func TestAnchorLineScansSpan(t *testing.T) {
// A cluster spanning 10..14 whose min line (10) isn't in the diff but whose
// span includes added line 14 must anchor to 14, not be dropped.
added := map[string]map[int]bool{"a.go": {14: true}}
clusters := []cluster{{file: "a.go", line: 10, maxLine: 14, severity: "high", title: "t", models: set("m1"), lenses: set("x")}}
cs := inlineComments(clusters, added)
if len(cs) != 1 || cs[0].NewPosition != 14 {
t.Fatalf("want anchor at 14 via span scan, got %+v", cs)
}
}
func TestParseDiffNewLinesMultiFile(t *testing.T) {
diff := "diff --git a/one.go b/one.go\n--- a/one.go\n+++ b/one.go\n@@ -0,0 +1,2 @@\n+a\n+b\n" +
"diff --git a/two.go b/two.go\n--- a/two.go\n+++ b/two.go\n@@ -5,0 +6,1 @@\n+c\n"
got := parseDiffNewLines(diff)
if !got["one.go"][1] || !got["one.go"][2] {
t.Errorf("one.go: want 1,2 got %v", got["one.go"])
}
if !got["two.go"][6] {
t.Errorf("two.go: want 6 got %v", got["two.go"])
}
}
func TestInlineCommentsNormalizesPaths(t *testing.T) {
// Diff path "b/pkg/x.go" -> "pkg/x.go"; a finding written as "./pkg/x.go" must
// still anchor (both normalize to the same repo-relative path).
addable := parseDiffNewLines("--- a/pkg/x.go\n+++ b/pkg/x.go\n@@ -0,0 +1,1 @@\n+bug\n")
clusters := []cluster{{file: "./pkg/x.go", line: 1, severity: "high", title: "t", models: set("m1"), lenses: set("x")}}
if cs := inlineComments(clusters, addable); len(cs) != 1 || cs[0].Path != "pkg/x.go" {
t.Errorf("path normalization failed to anchor: %+v", cs)
}
}
func TestInlineCommentsFiltersToDiffLines(t *testing.T) {
addable := map[string]map[int]bool{"a.go": {10: true, 11: true}}
clusters := []cluster{
{file: "a.go", line: 10, severity: "high", title: "anchored", models: set("m1", "m2"), lenses: set("security"), detail: "d"},
{file: "a.go", line: 99, severity: "high", title: "off-diff line", models: set("m1"), lenses: set("security")},
{file: "b.go", line: 1, severity: "high", title: "off-diff file", models: set("m1"), lenses: set("security")},
}
cs := inlineComments(clusters, addable)
if len(cs) != 1 {
t.Fatalf("want 1 anchorable inline comment, got %d", len(cs))
}
if cs[0].Path != "a.go" || cs[0].NewPosition != 10 {
t.Errorf("wrong anchor: %+v", cs[0])
}
if !strings.Contains(cs[0].Body, "anchored") || !strings.Contains(cs[0].Body, "2 model") {
t.Errorf("inline body missing title/agreement: %q", cs[0].Body)
}
}
func set(xs ...string) map[string]bool {
m := map[string]bool{}
for _, x := range xs {
m[x] = true
}
return m
}
+143 -16
View File
@@ -9,6 +9,7 @@ import (
"regexp" "regexp"
"sort" "sort"
"strings" "strings"
"sync"
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm" llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
) )
@@ -17,11 +18,13 @@ import (
// every tool caps how much it can pull in one call — a runaway read_file or // every tool caps how much it can pull in one call — a runaway read_file or
// grep would blow the window and stall the loop. // grep would blow the window and stall the loop.
const ( const (
maxFileBytes = 64 * 1024 // per read_file call maxFileBytes = 64 * 1024 // per read_file call
maxReadLines = 800 // per read_file call maxReadLines = 800 // per read_file call
maxGrepResults = 200 // per grep call maxGrepResults = 200 // per grep call
maxFindResults = 200 // per find_files call maxFindResults = 200 // per find_files call
maxLineLen = 400 // truncate any single returned line to this maxLineLen = 400 // truncate any single returned line to this
maxGetDiffLines = 800 // per get_diff call (paginated window)
maxGetDiffBytes = 64 * 1024 // per get_diff call
) )
// skipDirs are never descended into by grep / find_files — noise and bulk that // skipDirs are never descended into by grep / find_files — noise and bulk that
@@ -40,6 +43,12 @@ type repoFS struct {
root string // absolute, symlink-resolved repo root root string // absolute, symlink-resolved repo root
diff string // the full PR unified diff (served by get_diff) diff string // the full PR unified diff (served by get_diff)
worker llm.Model // optional cheap model for delegate_investigation; nil = no delegation worker llm.Model // optional cheap model for delegate_investigation; nil = no delegation
// diffLines caches the split diff so paging through get_diff doesn't re-split
// the whole (possibly large) diff on every call. The diff is immutable, so the
// cache is computed once and safe to share across concurrent lenses.
diffOnce sync.Once
diffLines []string
} }
// newRepoFS resolves root to an absolute, symlink-free path. // newRepoFS resolves root to an absolute, symlink-free path.
@@ -108,15 +117,24 @@ func (r *repoFS) fsTools() []llm.Tool {
} }
} }
// toolbox builds the reviewer's toolbox: the read-only repo tools, plus the // allTools is the single source of truth for the reviewer's tool set: the
// delegate_investigation tool when a worker model is configured. // read-only repo tools, plus delegate_investigation when a worker model is
func (r *repoFS) toolbox() (*llm.Toolbox, error) { // configured. Both the executus registry (gadflyToolRegistry, the production
box := llm.NewToolbox("gadfly") // path) and toolbox() build from this list.
func (r *repoFS) allTools() []llm.Tool {
tools := r.fsTools() tools := r.fsTools()
if r.worker != nil { if r.worker != nil {
tools = append(tools, r.delegateTool()) tools = append(tools, r.delegateTool())
} }
for _, t := range tools { return tools
}
// toolbox builds a majordomo toolbox from allTools(). The production review path
// now goes through executus's tool.Registry (see executus.go); this remains for
// the toolbox-level tests (the `call` helper).
func (r *repoFS) toolbox() (*llm.Toolbox, error) {
box := llm.NewToolbox("gadfly")
for _, t := range r.allTools() {
if err := box.Add(t); err != nil { if err := box.Add(t); err != nil {
return nil, fmt.Errorf("add tool %q: %w", t.Name, err) return nil, fmt.Errorf("add tool %q: %w", t.Name, err)
} }
@@ -396,15 +414,124 @@ func (r *repoFS) findFilesTool() llm.Tool {
) )
} }
type getDiffArgs struct {
Path string `json:"path,omitempty" description:"Optional changed-file path (e.g. pkg/foo/bar.go); returns ONLY that file's diff hunks. Omit for the whole diff. Use this on a large PR to pull just the file a finding is about."`
StartLine int `json:"start_line,omitempty" description:"Optional 1-based line to start from within the (whole or path-scoped) diff (default 1)."`
Limit int `json:"limit,omitempty" description:"Optional max number of diff lines to return (default/maximum 800)."`
}
func (r *repoFS) getDiffTool() llm.Tool { func (r *repoFS) getDiffTool() llm.Tool {
return llm.DefineTool[struct{}]( return llm.DefineTool[getDiffArgs](
"get_diff", "get_diff",
"Return the complete unified diff under review. The diff is also included (possibly truncated) in the task message; call this to get the full, untruncated text.", "Return the unified diff under review as a numbered, PAGINATED window (like read_file) — not the whole diff at once, so a huge PR can't blow the context window. Pass `path` to fetch just one changed file's hunks, or `start_line`/`limit` to page through. A truncated copy of the diff is also embedded in the task message.",
func(_ context.Context, _ struct{}) (any, error) { func(_ context.Context, args getDiffArgs) (any, error) {
if strings.TrimSpace(r.diff) == "" { scope := "the diff"
return "(empty diff)", nil var lines []string
if p := strings.TrimSpace(args.Path); p != "" {
lines = diffLinesForPath(r.diff, p)
if len(lines) == 0 {
return fmt.Sprintf("(no diff hunks for path %q; check it against the changed-files list — the path must match a file the diff touches)", p), nil
}
scope = "the diff for " + p
} else {
lines = r.diffAllLines()
if len(lines) == 0 {
return "(empty diff)", nil
}
} }
return r.diff, nil return windowDiff(lines, scope, args.StartLine, args.Limit), nil
}, },
) )
} }
// diffAllLines returns the whole diff split into lines, cached so paging never
// re-splits the (possibly large) diff.
func (r *repoFS) diffAllLines() []string {
r.diffOnce.Do(func() { r.diffLines = splitDiffLines(r.diff) })
return r.diffLines
}
// splitDiffLines splits a unified diff into lines, dropping the single trailing
// empty element a trailing newline produces — otherwise windowDiff would emit a
// blank final line and over-count the total by one.
func splitDiffLines(diff string) []string {
lines := strings.Split(diff, "\n")
if n := len(lines); n > 0 && lines[n-1] == "" {
lines = lines[:n-1]
}
return lines
}
// windowDiff returns a numbered, paginated slice of pre-split diff lines,
// mirroring read_file's caps (maxGetDiffLines / maxGetDiffBytes / maxLineLen) so
// a single get_diff call can never dump a multi-hundred-KB diff into the
// transcript — the amplifier behind the large-PR token burn. The full diff stays
// reachable by paging with start_line, or scoped per file via the path arg.
func windowDiff(lines []string, scope string, start, limit int) string {
total := len(lines)
if start < 1 {
start = 1
}
if limit <= 0 || limit > maxGetDiffLines {
limit = maxGetDiffLines
}
if start > total {
return fmt.Sprintf("(%s has %d lines; nothing at/after line %d)", scope, total, start)
}
var b strings.Builder
emitted := 0
i := start - 1
for ; i < total; i++ {
if emitted >= limit || b.Len() >= maxGetDiffBytes {
break
}
line := lines[i]
if len(line) > maxLineLen {
line = line[:maxLineLen] + "…"
}
fmt.Fprintf(&b, "%d\t%s\n", i+1, line)
emitted++
}
if i < total {
// i (0-based) is the first line NOT emitted; line i was the last shown.
fmt.Fprintf(&b, "... (%s truncated after line %d of %d; call get_diff again with start_line=%d for the rest, or pass a `path` to scope to one file)\n", scope, i, total, i+1)
}
return b.String()
}
// diffLinesForPath returns the unified-diff lines for one changed file: from the
// `diff --git` header that names path through to the next header (or end). The
// header names two path tokens (a/<old> b/<new>); a match is on a WHOLE token,
// so path "foo.go" does not pull in "barfoo.go" — and a trailing "/" scopes to a
// directory (e.g. "pkg/foo/" matches pkg/foo/bar.go).
func diffLinesForPath(diff, path string) []string {
want := strings.TrimPrefix(strings.TrimPrefix(strings.TrimSpace(path), "a/"), "b/")
var out []string
inSection := false
for _, ln := range splitDiffLines(diff) {
if strings.HasPrefix(ln, "diff --git ") {
inSection = diffHeaderNames(ln, want)
}
if inSection {
out = append(out, ln)
}
}
return out
}
// diffHeaderNames reports whether a `diff --git a/X b/Y` header names want as one
// of its (a/b-stripped) path tokens — exact whole-token match, or a directory
// prefix when want ends with "/".
func diffHeaderNames(header, want string) bool {
fields := strings.Fields(header)
if len(fields) < 3 {
return false
}
for _, f := range fields[2:] {
p := strings.TrimPrefix(strings.TrimPrefix(f, "a/"), "b/")
if p == want || (strings.HasSuffix(want, "/") && strings.HasPrefix(p, want)) {
return true
}
}
return false
}
+81 -3
View File
@@ -197,15 +197,93 @@ func TestFindFilesTool(t *testing.T) {
func TestGetDiffTool(t *testing.T) { func TestGetDiffTool(t *testing.T) {
root := buildFixtureRepo(t) root := buildFixtureRepo(t)
const diff = "diff --git a/x b/x\n+added line\n" const diff = "diff --git a/x b/x\n--- a/x\n+++ b/x\n+added line\n" +
"diff --git a/y.go b/y.go\n--- a/y.go\n+++ b/y.go\n+y change\n"
fs, _ := newRepoFS(root, diff) fs, _ := newRepoFS(root, diff)
// Default: the whole diff as a NUMBERED window (paginated), not a raw dump —
// so a huge PR can't be poured into the transcript in one call.
out, err := call(t, fs, "get_diff", map[string]any{})
if err != nil {
t.Fatalf("get_diff: %v", err)
}
if !strings.Contains(out, "1\tdiff --git a/x b/x") {
t.Errorf("get_diff should return a numbered window, got:\n%s", out)
}
if !strings.Contains(out, "+added line") || !strings.Contains(out, "+y change") {
t.Errorf("the full (short) diff window should include every hunk, got:\n%s", out)
}
// path filter: only the named file's hunks come back.
out, err = call(t, fs, "get_diff", map[string]any{"path": "y.go"})
if err != nil {
t.Fatalf("get_diff path: %v", err)
}
if !strings.Contains(out, "y change") {
t.Errorf("get_diff path=y.go should include y's hunk, got:\n%s", out)
}
if strings.Contains(out, "added line") {
t.Errorf("get_diff path=y.go must NOT include x's hunk, got:\n%s", out)
}
// unknown path: a clear note, never an error.
out, err = call(t, fs, "get_diff", map[string]any{"path": "nope.txt"})
if err != nil {
t.Fatalf("get_diff unknown path: %v", err)
}
if !strings.Contains(out, "no diff hunks") {
t.Errorf("get_diff for an unknown path should note no hunks, got:\n%s", out)
}
}
// TestGetDiffTool_Paginates: a diff longer than the per-call line cap is returned
// as a truncated window with a paging hint, and start_line pages past it — the
// mechanism that stops get_diff from dumping a multi-hundred-KB diff at once.
func TestGetDiffTool_Paginates(t *testing.T) {
diff := "diff --git a/big b/big\n" + strings.Repeat("+line\n", maxGetDiffLines+50)
fs, _ := newRepoFS(t.TempDir(), diff)
out, err := call(t, fs, "get_diff", map[string]any{}) out, err := call(t, fs, "get_diff", map[string]any{})
if err != nil { if err != nil {
t.Fatalf("get_diff: %v", err) t.Fatalf("get_diff: %v", err)
} }
if out != diff { if !strings.Contains(out, "truncated after line") {
t.Errorf("get_diff returned %q, want %q", out, diff) t.Error("a diff longer than the per-call cap should be truncated with a paging hint")
}
if strings.Contains(out, "801\t") {
t.Error("the first window must stop at the line cap, not reach line 801")
}
out, err = call(t, fs, "get_diff", map[string]any{"start_line": 805})
if err != nil {
t.Fatalf("get_diff page: %v", err)
}
if !strings.Contains(out, "805\t") {
t.Error("paging with start_line=805 should include line 805")
}
}
// TestDiffLinesForPath_Anchored: get_diff path= matches on a WHOLE path token,
// so "foo.go" never pulls in "barfoo.go" (the unanchored-substring weakness the
// swarm flagged), while a trailing "/" still scopes to a directory.
func TestDiffLinesForPath_Anchored(t *testing.T) {
diff := "diff --git a/foo.go b/foo.go\n+foo change\n" +
"diff --git a/barfoo.go b/barfoo.go\n+barfoo change\n" +
"diff --git a/pkg/x.go b/pkg/x.go\n+x change\n"
joined := strings.Join(diffLinesForPath(diff, "foo.go"), "\n")
if !strings.Contains(joined, "foo change") {
t.Errorf("foo.go should match its own hunk:\n%s", joined)
}
if strings.Contains(joined, "barfoo change") {
t.Errorf("foo.go must NOT match barfoo.go (unanchored substring regression):\n%s", joined)
}
if !strings.Contains(strings.Join(diffLinesForPath(diff, "pkg/"), "\n"), "x change") {
t.Error("a trailing-slash path should scope to the directory (pkg/ -> pkg/x.go)")
}
if len(diffLinesForPath(diff, "nope.go")) != 0 {
t.Error("an unknown path should yield no lines")
} }
} }
+20 -51
View File
@@ -37,10 +37,11 @@ func lastUserText(req llm.Request) string {
return req.Messages[len(req.Messages)-1].Text() return req.Messages[len(req.Messages)-1].Text()
} }
// TestRunAgent_WrapUpNudgeProducesAnswer: a model that keeps calling tools until // TestReviewExecutor_WrapUpNudgeProducesAnswer: a model that keeps calling tools
// it is nudged to wrap up should still finish inside its budget — the steer // until it is nudged to wrap up should still finish inside its budget — the steer
// message arrives a few steps before the cap and the model writes its answer. // message (delivered by the executus wrap-up critic a few steps before the cap)
func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) { // arrives and the model writes its answer.
func TestReviewExecutor_WrapUpNudgeProducesAnswer(t *testing.T) {
t.Setenv("GADFLY_WRAPUP_RESERVE", "4") t.Setenv("GADFLY_WRAPUP_RESERVE", "4")
final := "VERDICT: No material issues found." final := "VERDICT: No material issues found."
@@ -60,9 +61,10 @@ func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) {
} }
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n") fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 12) rex := newTestReviewExecutor(t, mdl, fs)
out, err := rex.run(context.Background(), "sys", "task", 12)
if err != nil { if err != nil {
t.Fatalf("runAgent should succeed via wrap-up nudge, got error: %v", err) t.Fatalf("run should succeed via wrap-up nudge, got error: %v", err)
} }
if out != final { if out != final {
t.Errorf("expected final review %q, got %q", final, out) t.Errorf("expected final review %q, got %q", final, out)
@@ -72,24 +74,19 @@ func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) {
} }
} }
// TestRunAgent_FinalizationFallback: a model that ignores the wrap-up nudge and // TestReviewExecutor_ExhaustionWithoutAnswerIsError: a model that ignores the
// spins on tools until the cap should NOT hard-fail — the tool-free finalization // wrap-up nudge and spins on tools until the step cap produces no final answer.
// pass forces a final answer out of the transcript. // The transcript-based forced-finalization fallback was removed in the executus
func TestRunAgent_FinalizationFallback(t *testing.T) { // re-platform (run.Result does not expose the loop transcript), so the pass now
// surfaces an error — which reviewWithSpecialist renders as an advisory "reviewer
// failed to complete" notice rather than a phantom success.
func TestReviewExecutor_ExhaustionWithoutAnswerIsError(t *testing.T) {
t.Setenv("GADFLY_WRAPUP_RESERVE", "2") t.Setenv("GADFLY_WRAPUP_RESERVE", "2")
final := "VERDICT: Minor issues\n- something"
forcedCalled := false
n := 0 n := 0
p := fake.New("fake", fake.WithDefault(func(_ string, req llm.Request) fake.Step { p := fake.New("fake", fake.WithDefault(func(_ string, _ llm.Request) fake.Step {
// Only the tool-free finalization pass forbids tools — reply there.
if req.ToolChoice == "none" {
forcedCalled = true
return fake.Reply(final)
}
// Otherwise keep spinning, ignoring the wrap-up nudge entirely.
n++ n++
return spinToolCall(n) return spinToolCall(n) // spin forever, ignoring the wrap-up nudge
})) }))
mdl, err := p.Model("mock") mdl, err := p.Model("mock")
if err != nil { if err != nil {
@@ -97,37 +94,9 @@ func TestRunAgent_FinalizationFallback(t *testing.T) {
} }
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n") fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 6) rex := newTestReviewExecutor(t, mdl, fs)
if err != nil { if _, err := rex.run(context.Background(), "sys", "task", 6); err == nil {
t.Fatalf("runAgent should recover via finalization fallback, got error: %v", err) t.Error("run should error when the model exhausts its steps without an answer")
}
if !forcedCalled {
t.Error("finalization fallback was never invoked")
}
if out != final {
t.Errorf("expected forced final answer %q, got %q", final, out)
}
}
// TestRunAgent_FallbackStillEmptyIsError: if even the tool-free finalization
// yields nothing, runAgent surfaces an error rather than a phantom success.
func TestRunAgent_FallbackStillEmptyIsError(t *testing.T) {
n := 0
p := fake.New("fake", fake.WithDefault(func(_ string, req llm.Request) fake.Step {
if req.ToolChoice == "none" {
return fake.Reply(" ") // finalization produces only whitespace
}
n++
return spinToolCall(n)
}))
mdl, err := p.Model("mock")
if err != nil {
t.Fatal(err)
}
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
if _, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4); err == nil {
t.Error("runAgent should error when the finalization fallback also yields no output")
} }
} }
+168 -2
View File
@@ -45,6 +45,10 @@
# GADFLY_FINDINGS_URL optional gadfly-reports store base URL; set to POST the run + # GADFLY_FINDINGS_URL optional gadfly-reports store base URL; set to POST the run +
# findings for model-quality tracking (off when empty) # findings for model-quality tracking (off when empty)
# GADFLY_FINDINGS_TOKEN optional bearer token for the gadfly-reports store # GADFLY_FINDINGS_TOKEN optional bearer token for the gadfly-reports store
# GADFLY_CONSOLIDATE cross-model consensus comment: "auto" (default; on for >=2
# models), "1" force on, "0" force off (one comment per model)
# GADFLY_INLINE_REVIEW when consolidating, also post a COMMENT-state PR review with
# inline comments on changed lines (default on; "0" disables)
set -uo pipefail set -uo pipefail
# One model by default: the specialist suite already provides breadth, so a # One model by default: the specialist suite already provides breadth, so a
@@ -64,6 +68,29 @@ die() { log "ERROR: $*"; exit 1; }
API() { curl -fsS --connect-timeout 20 --max-time 30 -H "Authorization: token ${GITEA_TOKEN}" "$@"; } API() { curl -fsS --connect-timeout 20 --max-time 30 -H "Authorization: token ${GITEA_TOKEN}" "$@"; }
# upsert_comment_body MARKER BODY — create or update (by leading MARKER) a single
# PR comment. Mirrors run.sh's per-model upsert; used for the consensus comment
# and the per-model fallback when consolidation is on.
upsert_comment_body() {
local marker="$1" body="$2" post_body existing_id="" page=1 cmts
post_body="$(jq -n --arg b "$body" '{body:$b}')"
while [ "$page" -le 10 ]; do
cmts="$(API "${GITEA_API}/issues/${PR}/comments?limit=50&page=${page}" || echo '[]')"
[ "$(echo "$cmts" | jq 'length')" = "0" ] && break
existing_id="$(echo "$cmts" | jq -r --arg m "$marker" \
'.[] | select(.body != null and (.body | startswith($m))) | .id' | head -n1)"
[ -n "$existing_id" ] && break
page=$((page+1))
done
if [ -n "$existing_id" ]; then
curl -sS --connect-timeout 20 --max-time 30 -X PATCH -H "Authorization: token ${GITEA_TOKEN}" \
-H "Content-Type: application/json" "${GITEA_API}/issues/comments/${existing_id}" -d "$post_body" >/dev/null
else
curl -sS --connect-timeout 20 --max-time 30 -X POST -H "Authorization: token ${GITEA_TOKEN}" \
-H "Content-Type: application/json" "${GITEA_API}/issues/${PR}/comments" -d "$post_body" >/dev/null
fi
}
# --- is the commenter allowed to trigger a re-review? ---------------------- # --- is the commenter allowed to trigger a re-review? ----------------------
actor_allowed() { actor_allowed() {
local actor="$1" local actor="$1"
@@ -156,6 +183,36 @@ export GADFLY_FINDINGS_TOKEN="${GADFLY_FINDINGS_TOKEN:-}"
MODELS="${GADFLY_MODELS:-${OLLAMA_REVIEW_MODELS:-$DEFAULT_MODELS}}" MODELS="${GADFLY_MODELS:-${OLLAMA_REVIEW_MODELS:-$DEFAULT_MODELS}}"
DEFAULT_CONC="${GADFLY_CONCURRENCY:-1}" DEFAULT_CONC="${GADFLY_CONCURRENCY:-1}"
# --- huge-PR downshift ------------------------------------------------------
# A very large diff is what burns the model budget: every review step re-sends
# it, multiplied across models × lenses × passes × steps (this is what nuked a
# whole Ollama Cloud block on one giant PR). entrypoint is the only process that
# spans the whole fleet, so the fleet-wide size decision lives here: size the PR
# diff ONCE, and above GADFLY_HUGE_DIFF_BYTES collapse to a single cheap model +
# a focused lens subset, fewer steps, no recheck, and a smaller embedded diff.
# A finished shallow review beats a budget-nuking one. All knobs override; set
# GADFLY_HUGE_DIFF_BYTES=0 to disable. Small PRs are never touched.
HUGE_PR=0
HUGE_DIFF_BYTES="${GADFLY_HUGE_DIFF_BYTES:-600000}"
if [ "$HUGE_DIFF_BYTES" -gt 0 ] 2>/dev/null; then
PR_DIFF_BYTES="$(API "${GITEA_API}/pulls/${PR}.diff" 2>/dev/null | wc -c | tr -d '[:space:]')"
[ -z "$PR_DIFF_BYTES" ] && PR_DIFF_BYTES=0
if [ "$PR_DIFF_BYTES" -gt "$HUGE_DIFF_BYTES" ] 2>/dev/null; then
HUGE_PR=1
log "huge PR: diff ${PR_DIFF_BYTES}B > ${HUGE_DIFF_BYTES}B — downshifting the fleet (advisory)"
MODELS="${GADFLY_HUGE_DIFF_MODELS:-${MODELS%%,*}}" # first model only by default
export GADFLY_SPECIALISTS="${GADFLY_HUGE_DIFF_SPECIALISTS:-security,correctness,error-handling}"
export GADFLY_MAX_STEPS="${GADFLY_HUGE_DIFF_MAX_STEPS:-12}"
export GADFLY_RECHECK_MAX_STEPS="${GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS:-8}"
export GADFLY_RECHECK="${GADFLY_HUGE_DIFF_RECHECK:-0}" # skip recheck on huge PRs
# The Go-visible name directly (run.sh prefers GADFLY_MAX_DIFF_CHARS over its
# own MAX_DIFF_CHARS), so the cap is honored without relying on run.sh's alias.
export GADFLY_MAX_DIFF_CHARS="${GADFLY_HUGE_DIFF_MAX_DIFF_CHARS:-20000}"
# Surfaced on each posted comment so the shallower review is self-explaining.
export GADFLY_NOTICE="⚠️ Large PR (${PR_DIFF_BYTES} bytes): Gadfly downshifted to a focused, single-model review to stay within budget — coverage is intentionally shallower. Consider splitting the PR for a deeper review."
fi
fi
provider_of() { case "$1" in */*) echo "${1%%/*}";; *) echo "${GADFLY_PROVIDER:-ollama-cloud}";; esac; } provider_of() { case "$1" in */*) echo "${1%%/*}";; *) echo "${GADFLY_PROVIDER:-ollama-cloud}";; esac; }
# Per-model status file path for the live board. The model id can contain '/' # Per-model status file path for the live board. The model id can contain '/'
@@ -176,10 +233,11 @@ provider_cap() { # provider -> concurrency (override map "p=N,...", else default
} }
review_one() { review_one() {
local sf="" local sf="" ff=""
[ "${GADFLY_STATUS_BOARD:-1}" != "0" ] && sf="$(status_file_for "$1")" [ "${GADFLY_STATUS_BOARD:-1}" != "0" ] && sf="$(status_file_for "$1")"
[ "$CONSOLIDATE" = "1" ] && ff="$(findings_file_for "$1")"
PROVIDER=ollama MODEL="$1" GADFLY_BIN="/usr/local/bin/gadfly" GADFLY_REPO_DIR="$REPO_DIR" \ PROVIDER=ollama MODEL="$1" GADFLY_BIN="/usr/local/bin/gadfly" GADFLY_REPO_DIR="$REPO_DIR" \
GADFLY_STATUS_FILE="$sf" \ GADFLY_STATUS_FILE="$sf" GADFLY_FINDINGS_OUT="$ff" GADFLY_CONSOLIDATE="$CONSOLIDATE" \
bash "${SCRIPTS_DIR}/run.sh" || log "model $1 failed (continuing)" bash "${SCRIPTS_DIR}/run.sh" || log "model $1 failed (continuing)"
# If the binary never wrote real status (run.sh skipped it: empty diff, no key, # If the binary never wrote real status (run.sh skipped it: empty diff, no key,
# binary missing), the pre-seed stays {started:0, done:false} and the board # binary missing), the pre-seed stays {started:0, done:false} and the board
@@ -196,6 +254,34 @@ IFS=',' read -ra _raw <<< "$MODELS" || true
MODEL_LIST=() MODEL_LIST=()
for raw in "${_raw[@]}"; do m="$(echo "$raw" | tr -d '[:space:]')"; [ -n "$m" ] && MODEL_LIST+=("$m"); done for raw in "${_raw[@]}"; do m="$(echo "$raw" | tr -d '[:space:]')"; [ -n "$m" ] && MODEL_LIST+=("$m"); done
# --- cross-model consolidation decision ------------------------------------
# With >=2 models, post ONE consensus comment (findings clustered + ranked by
# cross-model agreement) instead of N per-model walls of prose. Each model writes
# its findings to FINDINGS_DIR; a final pass (the binary in GADFLY_CONSOLIDATE_DIR
# mode) renders the consensus comment. GADFLY_CONSOLIDATE: "auto" (default; on for
# >=2 models), "1" force on, "0" force off (keep per-model comments).
FINDINGS_DIR="${WORKDIR}/findings"
CONSOLIDATE=0
case "${GADFLY_CONSOLIDATE:-auto}" in
1) CONSOLIDATE=1 ;;
0) CONSOLIDATE=0 ;;
*) [ "${#MODEL_LIST[@]}" -ge 2 ] && CONSOLIDATE=1 ;;
esac
# A model spec can contain '/' and ':' (e.g. claude-code/opus, qwen3:14b), so
# sanitize to a flat filename — but append a checksum of the raw spec so two
# specs that sanitize the same (foo:bar vs foo/bar -> foo_bar) don't collide onto
# one file and silently drop a model from the consensus.
findings_file_for() {
local safe sum
safe="$(echo "$1" | tr -c '[:alnum:]._-' '_')"
sum="$(printf '%s' "$1" | cksum | cut -d' ' -f1)"
echo "${FINDINGS_DIR}/${safe}-${sum}.json"
}
if [ "$CONSOLIDATE" = "1" ]; then
rm -rf "$FINDINGS_DIR"; mkdir -p "$FINDINGS_DIR"
log "consolidation ON: ${#MODEL_LIST[@]} models -> one consensus comment"
fi
# Distinct providers, in first-seen order (no associative arrays — portable). # Distinct providers, in first-seen order (no associative arrays — portable).
PROVIDERS="" PROVIDERS=""
for m in "${MODEL_LIST[@]}"; do for m in "${MODEL_LIST[@]}"; do
@@ -241,6 +327,32 @@ if [ "${GADFLY_STATUS_BOARD:-1}" != "0" ]; then
log "status board started (pid ${BOARD_PID})" log "status board started (pid ${BOARD_PID})"
fi fi
# --- swarm-wide hard backstop ----------------------------------------------
# A wall-clock ceiling across the WHOLE fleet, so a pathological PR can never
# drain the usage block however the models behave. entrypoint is the only
# process spanning every model, so a single "never exceed X" guard lives here.
# On expiry it stops the review subtrees (the binary + run.sh); whatever partial
# findings were gathered are still posted and the job never fails (advisory).
# GADFLY_PR_BUDGET_SECS=0 (default) disables it.
KILLER_PID=""
rm -f "${WORKDIR}/.budget_killed" "${WORKDIR}/.disarmed" 2>/dev/null || true
if [ "${GADFLY_PR_BUDGET_SECS:-0}" -gt 0 ] 2>/dev/null; then
(
sleep "${GADFLY_PR_BUDGET_SECS}"
log "PR wall-clock budget (${GADFLY_PR_BUDGET_SECS}s) reached — stopping the review fleet (advisory; partial findings still posted)"
: > "${WORKDIR}/.budget_killed"
pkill -TERM -f '/usr/local/bin/gadfly' 2>/dev/null || true
pkill -TERM -f "${SCRIPTS_DIR}/run.sh" 2>/dev/null || true
sleep 5
# Guard the delayed SIGKILL on the disarm marker: once the lanes finished and
# the watchdog was disarmed, the consolidation gadfly pass runs next, and a
# name-based KILL here must NOT catch it.
[ -f "${WORKDIR}/.disarmed" ] || pkill -KILL -f '/usr/local/bin/gadfly' 2>/dev/null || true
) &
KILLER_PID=$!
log "PR budget watchdog armed (${GADFLY_PR_BUDGET_SECS}s, pid ${KILLER_PID})"
fi
log "providers: ${PROVIDERS:-none}" log "providers: ${PROVIDERS:-none}"
# Each provider lane runs in parallel; cap is enforced within each lane. Track # Each provider lane runs in parallel; cap is enforced within each lane. Track
# the lane PIDs so we wait ONLY for the review work — not the status board, # the lane PIDs so we wait ONLY for the review work — not the status board,
@@ -252,9 +364,63 @@ for p in $PROVIDERS; do
done done
[ "${#LANE_PIDS[@]}" -gt 0 ] && wait "${LANE_PIDS[@]}" [ "${#LANE_PIDS[@]}" -gt 0 ] && wait "${LANE_PIDS[@]}"
# Reviews finished (or the watchdog killed them): disarm the watchdog so its
# delayed SIGKILL can't catch the consolidation pass that runs next. Drop the
# disarm marker FIRST so even a racing watchdog that already reached its KILL line
# skips it (the kill below also tears the watchdog subshell down during its sleep).
if [ -n "$KILLER_PID" ]; then
: > "${WORKDIR}/.disarmed"
kill "$KILLER_PID" 2>/dev/null || true
fi
# If the backstop fired, note it on the consensus comment (per-model comments
# were already posted during the run; a killed model surfaces as a failed lane).
if [ -f "${WORKDIR}/.budget_killed" ]; then
export GADFLY_NOTICE="${GADFLY_NOTICE:+${GADFLY_NOTICE} }⏱️ This review was stopped early by the per-PR time budget (GADFLY_PR_BUDGET_SECS); findings are partial."
fi
# Reviews are done: signal the board to render the final state once and exit. # Reviews are done: signal the board to render the final state once and exit.
if [ -n "$BOARD_PID" ]; then if [ -n "$BOARD_PID" ]; then
touch "${STATUS_DIR}/.done" 2>/dev/null || true touch "${STATUS_DIR}/.done" 2>/dev/null || true
wait "$BOARD_PID" 2>/dev/null || true wait "$BOARD_PID" 2>/dev/null || true
fi fi
# --- cross-model consensus comment -----------------------------------------
# Render ONE consensus comment from the per-model findings the swarm wrote. This
# is advisory and best-effort: if the consolidation pass produces nothing, fall
# back to posting each model's review as its own comment (the per-model comments
# were suppressed during the run), so a consolidation hiccup never loses output.
if [ "$CONSOLIDATE" = "1" ]; then
n_files="$(ls -1 "${FINDINGS_DIR}"/*.json 2>/dev/null | wc -l | tr -d '[:space:]')"
log "consolidating findings from ${n_files} model(s)"
# Fetch the PR diff so the consolidator can also post an inline PR review,
# anchoring findings to changed lines (GADFLY_DIFF_FILE). GITEA_API/GITEA_TOKEN/
# GADFLY_PR are already in the binary's environment. Best-effort: an empty diff
# just means no inline review.
DIFF_FILE="${WORKDIR}/pr.diff"
API "${GITEA_API}/pulls/${PR}.diff" > "$DIFF_FILE" 2>/dev/null || true
CONSENSUS="$(GADFLY_CONSOLIDATE_DIR="$FINDINGS_DIR" GADFLY_DIFF_FILE="$DIFF_FILE" \
/usr/local/bin/gadfly 2>"${WORKDIR}/consolidate.err" || true)"
if [ -n "$CONSENSUS" ]; then
NOTICE_BLOCK=""
[ -n "${GADFLY_NOTICE:-}" ] && NOTICE_BLOCK="> ${GADFLY_NOTICE}"$'\n\n'
BODY="$(printf '%s%s\n\n<sub>Automated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.</sub>' "$NOTICE_BLOCK" "$CONSENSUS")"
upsert_comment_body "<!-- gadfly-consensus -->" "$BODY"
log "consensus comment posted"
else
log "consolidation produced no output; falling back to per-model comments"
log "$(tail -c 500 "${WORKDIR}/consolidate.err" 2>/dev/null)"
for f in "${FINDINGS_DIR}"/*.json; do
[ -f "$f" ] || continue
m="$(jq -r '.model // ""' "$f" 2>/dev/null)"
[ -z "$m" ] && continue
prov="$(jq -r '.provider // ""' "$f" 2>/dev/null)"
md="$(jq -r '.markdown // ""' "$f" 2>/dev/null)"
marker="<!-- gadfly-review:ollama:${m} -->"
body="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s\n\n<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge.</sub>' \
"$marker" "$m" "$prov" "$md")"
upsert_comment_body "$marker" "$body"
done
fi
fi
log "done" log "done"
+7 -6
View File
@@ -14,10 +14,11 @@
# Forward ONLY the secrets the reviewer uses (least privilege) — see the # Forward ONLY the secrets the reviewer uses (least privilege) — see the
# `secrets:` block below. GITEA_TOKEN is automatic. `secrets: inherit` also works # `secrets:` block below. GITEA_TOKEN is automatic. `secrets: inherit` also works
# but hands the reusable EVERY secret in your repo (registry/deploy/db creds the # but hands the reusable EVERY secret in your repo (registry/deploy/db creds the
# review never touches), so prefer the explicit form. Pin @<ref>: use the @v1 # review never touches), so prefer the explicit form. Pin to an immutable
# release tag (a curated pointer moved on deliberate releases) for auto-updating # @<sha>: long-lived act_runners CACHE the reusable by ref, so a moved tag (@v1)
# stability, or a full @<sha> for an immutable pin. Avoid @main — it moves on # or @main is often not re-fetched and silently runs a stale copy. Bump the @<sha>
# every push and would change what runs with your forwarded secrets. # to adopt a structural change; routine swarm tuning rides owner variables (see
# the gadfly README "Central config via variables") with no re-pin needed.
# #
# For custom named endpoints (GADFLY_ENDPOINT_<NAME>) or a provider the reusable # For custom named endpoints (GADFLY_ENDPOINT_<NAME>) or a provider the reusable
# doesn't map, use the full stub in adversarial-review.yml instead. # doesn't map, use the full stub in adversarial-review.yml instead.
@@ -49,8 +50,8 @@ jobs:
if: >- if: >-
github.event_name != 'issue_comment' github.event_name != 'issue_comment'
|| (github.event.issue.pull_request && github.actor == 'your-username') || (github.event.issue.pull_request && github.actor == 'your-username')
# @v1 = curated release tag (auto-updates on releases); swap for a full @<sha> # Pin to an immutable @<sha> (runners cache the ref, so @v1/@main can run
# if you want an immutable pin. Don't use @main (moves on every push). # stale). Bump it for structural changes; tune the swarm via owner variables.
uses: steve/gadfly/.gitea/workflows/review-reusable.yml@v1 uses: steve/gadfly/.gitea/workflows/review-reusable.yml@v1
# Forward ONLY what the reviewer needs. Add provider keys you use # Forward ONLY what the reviewer needs. Add provider keys you use
# (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, GADFLY_API_KEY) and/or # (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, GADFLY_API_KEY) and/or
+2
View File
@@ -3,6 +3,7 @@ module gitea.stevedudenhoeffer.com/steve/gadfly
go 1.26.2 go 1.26.2
require ( require (
gitea.stevedudenhoeffer.com/steve/executus v0.1.4
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462 gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462
gopkg.in/yaml.v3 v3.0.1 gopkg.in/yaml.v3 v3.0.1
) )
@@ -17,6 +18,7 @@ require (
github.com/go-logr/stdr v1.2.2 // indirect github.com/go-logr/stdr v1.2.2 // indirect
github.com/google/go-cmp v0.7.0 // indirect github.com/google/go-cmp v0.7.0 // indirect
github.com/google/s2a-go v0.1.9 // indirect github.com/google/s2a-go v0.1.9 // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/googleapis/enterprise-certificate-proxy v0.3.17 // indirect github.com/googleapis/enterprise-certificate-proxy v0.3.17 // indirect
github.com/googleapis/gax-go/v2 v2.22.0 // indirect github.com/googleapis/gax-go/v2 v2.22.0 // indirect
github.com/gorilla/websocket v1.5.3 // indirect github.com/gorilla/websocket v1.5.3 // indirect
+2
View File
@@ -4,6 +4,8 @@ cloud.google.com/go/auth v0.20.0 h1:kXTssoVb4azsVDoUiF8KvxAqrsQcQtB53DcSgta74CA=
cloud.google.com/go/auth v0.20.0/go.mod h1:942/yi/itH1SsmpyrbnTMDgGfdy2BUqIKyd0cyYLc5Q= cloud.google.com/go/auth v0.20.0/go.mod h1:942/yi/itH1SsmpyrbnTMDgGfdy2BUqIKyd0cyYLc5Q=
cloud.google.com/go/compute/metadata v0.9.0 h1:pDUj4QMoPejqq20dK0Pg2N4yG9zIkYGdBtwLoEkH9Zs= cloud.google.com/go/compute/metadata v0.9.0 h1:pDUj4QMoPejqq20dK0Pg2N4yG9zIkYGdBtwLoEkH9Zs=
cloud.google.com/go/compute/metadata v0.9.0/go.mod h1:E0bWwX5wTnLPedCKqk3pJmVgCBSM6qQI1yTBdEb3C10= cloud.google.com/go/compute/metadata v0.9.0/go.mod h1:E0bWwX5wTnLPedCKqk3pJmVgCBSM6qQI1yTBdEb3C10=
gitea.stevedudenhoeffer.com/steve/executus v0.1.4 h1:4F99uCV3OVaE9ITFp0FjPiYxLUQO+WpE+wU2HCnpXNM=
gitea.stevedudenhoeffer.com/steve/executus v0.1.4/go.mod h1:WQP/lH+meU06OSNF0TQO/wQLcJCrMwpi0EMj5vSpVtk=
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462 h1:1crjE1YkWHLZ91tUDOxN/Y5cuOnJ56e0U9UADoFfEPY= gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462 h1:1crjE1YkWHLZ91tUDOxN/Y5cuOnJ56e0U9UADoFfEPY=
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462/go.mod h1:UZLveG17SmENt4sne2RSLIbioix30RZbRIQUzBAnOyY= gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462/go.mod h1:UZLveG17SmENt4sne2RSLIbioix30RZbRIQUzBAnOyY=
github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs= github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
+37 -6
View File
@@ -50,6 +50,13 @@ MAX_DIFF_CHARS="${MAX_DIFF_CHARS:-60000}"
MARKER="<!-- gadfly-review:${PROVIDER}:${MODEL} -->" MARKER="<!-- gadfly-review:${PROVIDER}:${MODEL} -->"
say() { echo "[gadfly-review:${PROVIDER}:${MODEL}] $*" >&2; } say() { echo "[gadfly-review:${PROVIDER}:${MODEL}] $*" >&2; }
# When the swarm is consolidating (GADFLY_CONSOLIDATE=1, set by entrypoint.sh for
# a multi-model run), this model does NOT post its own comment — it writes its
# findings to GADFLY_FINDINGS_OUT and a single cross-model consensus comment is
# posted after the whole swarm finishes. Live progress still shows on the status
# board. Default 0 (post a per-model comment, the standalone behavior).
CONSOLIDATE="${GADFLY_CONSOLIDATE:-0}"
# Display the model's ACTUAL backend: the provider segment of the spec # Display the model's ACTUAL backend: the provider segment of the spec
# ("m1pro/qwen3.6:35b-mlx" -> "m1pro"); a bare id uses GADFLY_PROVIDER (default # ("m1pro/qwen3.6:35b-mlx" -> "m1pro"); a bare id uses GADFLY_PROVIDER (default
# ollama-cloud). This is what the comment header shows, not the run.sh lane. # ollama-cloud). This is what the comment header shows, not the run.sh lane.
@@ -126,7 +133,9 @@ USR="$(printf 'PR #%s: %s\n\nDescription:\n%s\n\nUnified diff to review:\n```dif
# --- announce start (placeholder comment) ----------------------------------- # --- announce start (placeholder comment) -----------------------------------
START_TS="$(date +%s)" START_TS="$(date +%s)"
say "starting review with ${MODEL}" say "starting review with ${MODEL}"
upsert_comment "$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n⏳ Reviewing… this comment will update with findings and run time.' \ # Skip the per-model placeholder when consolidating (the consensus comment is
# posted later; live progress is on the status board).
[ "$CONSOLIDATE" = "1" ] || upsert_comment "$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n⏳ Reviewing… this comment will update with findings and run time.' \
"$MARKER" "$MODEL" "$MODEL_PROVIDER")" "$MARKER" "$MODEL" "$MODEL_PROVIDER")"
# --- call the model --------------------------------------------------------- # --- call the model ---------------------------------------------------------
@@ -168,8 +177,9 @@ case "$PROVIDER" in
GADFLY_SYSTEM_FILE="${SCRIPT_DIR}/system-prompt.txt" \ GADFLY_SYSTEM_FILE="${SCRIPT_DIR}/system-prompt.txt" \
GADFLY_TITLE="$TITLE" \ GADFLY_TITLE="$TITLE" \
GADFLY_BODY="$BODY" \ GADFLY_BODY="$BODY" \
GADFLY_MAX_DIFF_CHARS="$MAX_DIFF_CHARS" \ GADFLY_MAX_DIFF_CHARS="${GADFLY_MAX_DIFF_CHARS:-$MAX_DIFF_CHARS}" \
GADFLY_STATUS_FILE="${GADFLY_STATUS_FILE:-}" \ GADFLY_STATUS_FILE="${GADFLY_STATUS_FILE:-}" \
GADFLY_FINDINGS_OUT="${GADFLY_FINDINGS_OUT:-}" \
"$BIN" 2>"$ERR_FILE" "$BIN" 2>"$ERR_FILE"
)" )"
rc=$? rc=$?
@@ -204,7 +214,28 @@ esac
# --- assemble + post final comment (with run time) -------------------------- # --- assemble + post final comment (with run time) --------------------------
ELAPSED="$(( $(date +%s) - START_TS ))" ELAPSED="$(( $(date +%s) - START_TS ))"
DUR="$(fmt_duration "$ELAPSED")" DUR="$(fmt_duration "$ELAPSED")"
COMMENT="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s\n\n<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in %s</sub>' \
"$MARKER" "$MODEL" "$MODEL_PROVIDER" "$REVIEW" "$DUR")" # Consolidating: the binary writes its findings file on success. If it failed or
upsert_comment "$COMMENT" # was skipped (no file, or an empty one), write a stub so this model still shows
say "done in ${DUR}" # up in the consensus (as failed) and an all-models-fail run still posts a
# comment — never silently drop a model or the whole review.
if [ "$CONSOLIDATE" = "1" ] && [ -n "${GADFLY_FINDINGS_OUT:-}" ] && [ ! -s "${GADFLY_FINDINGS_OUT}" ]; then
jq -n --arg model "$MODEL" --arg provider "$MODEL_PROVIDER" --arg md "$REVIEW" \
'{model:$model, provider:$provider, verdict:"reviewer failed", errored:true, markdown:$md, findings:[]}' \
> "${GADFLY_FINDINGS_OUT}" 2>/dev/null || true
fi
# When consolidating, the binary has written this model's findings to
# GADFLY_FINDINGS_OUT; the consensus comment is posted by entrypoint.sh after the
# whole swarm finishes, so this model posts no comment of its own.
if [ "$CONSOLIDATE" = "1" ]; then
say "done in ${DUR} (consolidated; no per-model comment)"
else
# An optional one-line notice (e.g. entrypoint's huge-PR downshift advisory),
# shown under the header so a shallower review is self-explaining.
NOTICE_BLOCK=""
[ -n "${GADFLY_NOTICE:-}" ] && NOTICE_BLOCK="> ${GADFLY_NOTICE}"$'\n\n'
COMMENT="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s%s\n\n<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in %s</sub>' \
"$MARKER" "$MODEL" "$MODEL_PROVIDER" "$NOTICE_BLOCK" "$REVIEW" "$DUR")"
upsert_comment "$COMMENT"
say "done in ${DUR}"
fi
+21 -1
View File
@@ -16,7 +16,9 @@ state. USE THEM to verify before you report. Do not review the diff in isolation
- list_dir([path]) — list a directory. - list_dir([path]) — list a directory.
- grep(pattern[, path, max_results]) — RE2 regex search across the repo. - grep(pattern[, path, max_results]) — RE2 regex search across the repo.
- find_files(name[, max_results]) — locate a file by path substring. - find_files(name[, max_results]) — locate a file by path substring.
- get_diff() — the full unified diff (the task message may truncate it). - get_diff([path, start_line, limit]) — the unified diff as a paginated, numbered window;
pass `path` to fetch just one changed file's hunks (do this on a big PR instead of pulling
the whole diff at once).
Mandatory verification discipline — this is the whole point of giving you tools: Mandatory verification discipline — this is the whole point of giving you tools:
- Before claiming a missing/duplicate import, an undefined symbol, a wrong signature, - Before claiming a missing/duplicate import, an undefined symbol, a wrong signature,
@@ -43,3 +45,21 @@ Output rules:
- Only report issues you are reasonably confident are real after checking. If the diff - Only report issues you are reasonably confident are real after checking. If the diff
is clean, say so plainly rather than inventing nits. is clean, say so plainly rather than inventing nits.
- When you are done investigating, STOP calling tools and reply with the final review. - When you are done investigating, STOP calling tools and reply with the final review.
Machine-readable findings — AFTER the prose review, append ONE fenced code block,
tagged `gadfly-findings`, holding a JSON array of the SAME findings you described above
(this block is consumed by tooling and hidden from the rendered comment):
```gadfly-findings
[
{"file": "path/to/file.go", "line": 123, "severity": "high", "confidence": "high", "title": "one-line summary of the issue"}
]
```
- One object per real finding, in the same order as your prose. `file`/`line` must be a
concrete location you verified (the line the issue is at). `severity` is one of
`critical`, `high`, `medium`, `small`, `trivial`. `confidence` is your post-verification
confidence the issue is real: one of `high`, `medium`, `low`.
- Include ONLY genuine problems — never verification notes ("confirmed X is safe at f:line"),
and never an "Outside my lens:" aside. If your lens is clean, emit an empty array `[]`.
- This block is in ADDITION to the prose; do not drop the human-readable findings.