Compare commits
9 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| ac6ce06cdd | |||
| 5007597cf9 | |||
| 3095ebff23 | |||
| 8f5adc91b2 | |||
| 88f74aa768 | |||
| 84b891b1ba | |||
| 53971603d3 | |||
| f49699fc12 | |||
| 6e87a3e73f |
@@ -60,6 +60,8 @@ on:
|
|||||||
worker_model: { type: string, default: "" } # GADFLY_WORKER_MODEL
|
worker_model: { type: string, default: "" } # GADFLY_WORKER_MODEL
|
||||||
allowed_users: { type: string, default: "" } # GADFLY_ALLOWED_USERS (consumer-specific; set in your stub)
|
allowed_users: { type: string, default: "" } # GADFLY_ALLOWED_USERS (consumer-specific; set in your stub)
|
||||||
trigger_phrase: { type: string, default: "" } # GADFLY_TRIGGER_PHRASE
|
trigger_phrase: { type: string, default: "" } # GADFLY_TRIGGER_PHRASE
|
||||||
|
consolidate: { type: string, default: "" } # GADFLY_CONSOLIDATE — "" => auto (one consensus comment for >=2 models); "0" => one comment per model
|
||||||
|
inline_review: { type: string, default: "" } # GADFLY_INLINE_REVIEW — "" => on (post a COMMENT-state PR review with inline comments on changed lines); "0" => off
|
||||||
# Job wall-clock cap. 90 as a default: the 5-lens suite across a slow lane
|
# Job wall-clock cap. 90 as a default: the 5-lens suite across a slow lane
|
||||||
# (claude-code with extended thinking) over two passes can run long.
|
# (claude-code with extended thinking) over two passes can run long.
|
||||||
timeout_minutes: { type: number, default: 90 }
|
timeout_minutes: { type: number, default: 90 }
|
||||||
@@ -92,7 +94,7 @@ jobs:
|
|||||||
runs-on: ubuntu-latest
|
runs-on: ubuntu-latest
|
||||||
timeout-minutes: ${{ inputs.timeout_minutes }}
|
timeout-minutes: ${{ inputs.timeout_minutes }}
|
||||||
steps:
|
steps:
|
||||||
- uses: docker://gitea.stevedudenhoeffer.com/steve/gadfly:sha-c342bdb
|
- uses: docker://gitea.stevedudenhoeffer.com/steve/gadfly:sha-3095ebf
|
||||||
env:
|
env:
|
||||||
# --- event context (from the CALLER's github.*) -------------------
|
# --- event context (from the CALLER's github.*) -------------------
|
||||||
GITEA_API: ${{ github.server_url }}/api/v1/repos/${{ github.repository }}
|
GITEA_API: ${{ github.server_url }}/api/v1/repos/${{ github.repository }}
|
||||||
@@ -143,3 +145,5 @@ jobs:
|
|||||||
GADFLY_WORKER_MODEL: ${{ inputs.worker_model }}
|
GADFLY_WORKER_MODEL: ${{ inputs.worker_model }}
|
||||||
GADFLY_ALLOWED_USERS: ${{ inputs.allowed_users }}
|
GADFLY_ALLOWED_USERS: ${{ inputs.allowed_users }}
|
||||||
GADFLY_TRIGGER_PHRASE: ${{ inputs.trigger_phrase }}
|
GADFLY_TRIGGER_PHRASE: ${{ inputs.trigger_phrase }}
|
||||||
|
GADFLY_CONSOLIDATE: ${{ inputs.consolidate }}
|
||||||
|
GADFLY_INLINE_REVIEW: ${{ inputs.inline_review }}
|
||||||
|
|||||||
@@ -22,15 +22,21 @@ verifies each one against the actual code, and posts its findings as a comment.
|
|||||||
4. **Provider-agnostic.** Powered by [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo),
|
4. **Provider-agnostic.** Powered by [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo),
|
||||||
so it can target Ollama (local/cloud), OpenAI, Anthropic, Google, or any
|
so it can target Ollama (local/cloud), OpenAI, Anthropic, Google, or any
|
||||||
OpenAI/Ollama-compatible endpoint. Don't re-hardcode a single provider.
|
OpenAI/Ollama-compatible endpoint. Don't re-hardcode a single provider.
|
||||||
5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib + majordomo. Keep
|
5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib, majordomo, and
|
||||||
it that way — no heavyweight deps, no coupling to any one consumer repo (e.g. mort).
|
[executus](https://gitea.stevedudenhoeffer.com/steve/executus) (whose *core* — `run`/`compact`/
|
||||||
|
`model`/`fanout`/`tool` — is itself majordomo+stdlib only, so the binary stays static; do NOT
|
||||||
|
pull executus's `contrib/store` or any battery that drags in a DB driver). No heavyweight deps,
|
||||||
|
no coupling to any one consumer repo (e.g. mort). Gadfly is executus's canonical *light* consumer.
|
||||||
|
|
||||||
## Architecture
|
## Architecture
|
||||||
|
|
||||||
```
|
```
|
||||||
cmd/gadfly/ the reviewer binary — pure producer of review markdown (stdout)
|
cmd/gadfly/ the reviewer binary — pure producer of review markdown (stdout)
|
||||||
main.go orchestration: loop specialists, each a review pass + adversarial recheck
|
main.go orchestration: fan specialists out (executus/fanout), each a review pass + recheck
|
||||||
engine.go reviewEngine abstraction: majordomo agent loop vs claude-code CLI shell-out
|
engine.go reviewEngine abstraction: executus run.Executor (majordomo agent loop +
|
||||||
|
compaction/bounding/budget/critic) vs claude-code CLI shell-out
|
||||||
|
executus.go executus wiring: tool.Registry over the repo tools, the run.Executor build
|
||||||
|
(compact + model context-limit threshold + per-PR budget + wrap-up critic)
|
||||||
specialists.go specialist lenses: built-ins, default suite, env + .gadfly.yml resolution
|
specialists.go specialist lenses: built-ins, default suite, env + .gadfly.yml resolution
|
||||||
auto.go dynamic `auto` selection: a selector model picks lenses per-diff (may invent)
|
auto.go dynamic `auto` selection: a selector model picks lenses per-diff (may invent)
|
||||||
delegate.go worker-tier delegate_investigation tool (cheap sub-agent does legwork)
|
delegate.go worker-tier delegate_investigation tool (cheap sub-agent does legwork)
|
||||||
@@ -69,7 +75,7 @@ verdict. Verdict is one of: `No material issues found` / `Minor issues` / `Block
|
|||||||
## Build / test
|
## Build / test
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
go build ./cmd/gadfly # needs read access to the private majordomo module
|
go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
|
||||||
go test ./...
|
go test ./...
|
||||||
gofmt -l cmd/ # must be clean
|
gofmt -l cmd/ # must be clean
|
||||||
docker build -t gadfly:dev --secret id=REGISTRY_USER,env=REGISTRY_USER --secret id=REGISTRY_PASSWORD,env=REGISTRY_PASSWORD .
|
docker build -t gadfly:dev --secret id=REGISTRY_USER,env=REGISTRY_USER --secret id=REGISTRY_PASSWORD,env=REGISTRY_PASSWORD .
|
||||||
@@ -149,3 +155,21 @@ are actually exercised. OpenAI/Anthropic/Google come from majordomo's abstractio
|
|||||||
parallel, `cap` (from `GADFLY_PROVIDER_CONCURRENCY` else `GADFLY_CONCURRENCY`, default 1) bounds
|
parallel, `cap` (from `GADFLY_PROVIDER_CONCURRENCY` else `GADFLY_CONCURRENCY`, default 1) bounds
|
||||||
models-at-once within a lane. The review timeout (`GADFLY_TIMEOUT_SECS`) is **per-lens**, not
|
models-at-once within a lane. The review timeout (`GADFLY_TIMEOUT_SECS`) is **per-lens**, not
|
||||||
shared across the suite — a slow model can't starve later lenses (the original timeout bug).
|
shared across the suite — a slow model can't starve later lenses (the original timeout bug).
|
||||||
|
- **Large-PR token burn**: the agent loop re-sends the whole transcript every step, so a giant
|
||||||
|
diff (the old `get_diff` dumped it untruncated, and it was embedded in both the review and
|
||||||
|
recheck task) was re-transmitted ~steps × lenses × passes × models times — a ~250 K-token PR
|
||||||
|
could drain a metered usage block in minutes. Fixed in three size-gated layers (small PRs
|
||||||
|
untouched): paginated `get_diff` + `executus/compact` compaction in the binary; an
|
||||||
|
`entrypoint.sh` downshift above `GADFLY_HUGE_DIFF_BYTES` (one cheap model, fewer lenses/steps,
|
||||||
|
no recheck); and a swarm-wide `GADFLY_PR_BUDGET_SECS` wall-clock backstop. Compaction's threshold
|
||||||
|
is intentionally LOW (`GADFLY_COMPACT_RATIO` 0.45, not executus's 0.7) because the burning
|
||||||
|
transcript on the embedded path rarely reaches 0.7×context.
|
||||||
|
- **executus re-platform**: the in-process review path runs through `executus/run`'s `run.Executor`
|
||||||
|
(compaction, run-bounding, `Ports.Budget`, the wrap-up nudge as `Ports.Critic`), wiring it in
|
||||||
|
`cmd/gadfly/executus.go`. Gadfly KEEPS its own `model.go` resolution (so `GADFLY_ENDPOINT_<NAME>`
|
||||||
|
http aliases + the claude-code engine survive) and only hands `run.Executor` the already-resolved
|
||||||
|
model via a trivial resolver — do NOT route review-model resolution through
|
||||||
|
`model.ParseModelForContext` (it bypasses gadfly's endpoint aliases). `run.Result` exposes no
|
||||||
|
transcript, so the old transcript-based forced-finalization fallback is gone; the wrap-up critic
|
||||||
|
nudge is the remaining "always emit something" mechanism. The claude-code engine still shells out
|
||||||
|
and is unaffected.
|
||||||
|
|||||||
+4
-1
@@ -24,7 +24,10 @@ RUN --mount=type=cache,target=/go/pkg/mod \
|
|||||||
go build -trimpath -ldflags="-s -w" -o /out/gadfly ./cmd/gadfly
|
go build -trimpath -ldflags="-s -w" -o /out/gadfly ./cmd/gadfly
|
||||||
|
|
||||||
FROM alpine:3.20
|
FROM alpine:3.20
|
||||||
RUN apk add --no-cache bash git curl jq ca-certificates nodejs npm
|
# procps provides pkill/pgrep, which entrypoint.sh's per-PR wall-clock backstop
|
||||||
|
# (GADFLY_PR_BUDGET_SECS) uses to stop the review subtrees — busybox's applets
|
||||||
|
# are not guaranteed to include them.
|
||||||
|
RUN apk add --no-cache bash git curl jq ca-certificates nodejs npm procps
|
||||||
# Bundle the Claude Code CLI so the `claude-code` review engine works out of the
|
# Bundle the Claude Code CLI so the `claude-code` review engine works out of the
|
||||||
# box (GADFLY_MODELS=claude-code or claude-code/<model>). This adds Node + the
|
# box (GADFLY_MODELS=claude-code or claude-code/<model>). This adds Node + the
|
||||||
# CLI to the image (notably larger); ollama-only users pay the size but nothing
|
# CLI to the image (notably larger); ollama-only users pay the size but nothing
|
||||||
|
|||||||
@@ -277,6 +277,43 @@ every `GADFLY_STATUS_POLL_SECS` (default 12s) until the swarm finishes. It's adv
|
|||||||
best-effort — the per-model findings comments are unaffected — and entirely separate from those.
|
best-effort — the per-model findings comments are unaffected — and entirely separate from those.
|
||||||
Turn it off with `GADFLY_STATUS_BOARD=0`.
|
Turn it off with `GADFLY_STATUS_BOARD=0`.
|
||||||
|
|
||||||
|
### Consensus consolidation
|
||||||
|
|
||||||
|
With **two or more models**, posting one comment each means a reader faces N walls of prose that
|
||||||
|
mostly agree. Instead Gadfly consolidates: every model writes its findings to a shared file, and
|
||||||
|
after the whole swarm finishes a single pass clusters those findings by location, counts **how
|
||||||
|
many models independently flagged each one**, and posts **one consensus comment**:
|
||||||
|
|
||||||
|
```
|
||||||
|
## 🪰 Gadfly review — consensus across 7 models
|
||||||
|
|
||||||
|
**Verdict: Blocking issues found** · 9 findings (3 with multi-model agreement)
|
||||||
|
|
||||||
|
| | Finding | Where | Models | Lens |
|
||||||
|
|--|--|--|--|--|
|
||||||
|
| 🔴 | Auth bypass: token not verified | `auth/login.go:42` | 6/7 | security |
|
||||||
|
| 🟠 | Unbounded retry loop | `sync/worker.go:88` | 3/7 | error-handling |
|
||||||
|
|
||||||
|
<details><summary>4 single-model findings (lower confidence)</summary> … </details>
|
||||||
|
<details><summary>Per-model detail</summary> … each model's full review, folded … </details>
|
||||||
|
```
|
||||||
|
|
||||||
|
Cross-model agreement is the strongest real-vs-false-positive signal available, so findings are
|
||||||
|
ranked by it (a lone low-severity finding folds away; a lone *critical* still surfaces). The
|
||||||
|
per-model comments are suppressed in this mode — each model's full review is preserved, folded,
|
||||||
|
inside the consensus comment — and nothing is lost: if consolidation can't run, Gadfly falls
|
||||||
|
back to posting the per-model comments. Controlled by `GADFLY_CONSOLIDATE`: `auto` (default — on
|
||||||
|
for ≥2 models), `1` (force on), `0` (force off, one comment per model). Single-model runs are
|
||||||
|
unaffected.
|
||||||
|
|
||||||
|
**Inline PR review.** Alongside the consensus comment, Gadfly also posts a single Gitea **pull
|
||||||
|
review** (state `COMMENT` — advisory, **never** request-changes or approve, so it can't block a
|
||||||
|
merge) whose inline comments anchor each consensus finding to the exact changed line it's about.
|
||||||
|
Only findings that land on a line in the diff are anchored (Gitea rejects comments off the diff);
|
||||||
|
the rest stay in the consensus comment. A re-run replaces the previous review instead of stacking.
|
||||||
|
It's the "reviewer integrated with Gitea" without the blocking — turn it off with
|
||||||
|
`GADFLY_INLINE_REVIEW=0`.
|
||||||
|
|
||||||
### Triggers
|
### Triggers
|
||||||
|
|
||||||
1. A **new/reopened/ready** non-draft PR — automatic.
|
1. A **new/reopened/ready** non-draft PR — automatic.
|
||||||
@@ -359,9 +396,20 @@ The reviewer binary reads these (the stub/entrypoint set sane defaults):
|
|||||||
| `GADFLY_TIMEOUT_SECS` | 300 | deadline **per specialist lens** (review+recheck) |
|
| `GADFLY_TIMEOUT_SECS` | 300 | deadline **per specialist lens** (review+recheck) |
|
||||||
| `GADFLY_RECHECK` | on | set `0`/`false` to skip the recheck pass |
|
| `GADFLY_RECHECK` | on | set `0`/`false` to skip the recheck pass |
|
||||||
| `GADFLY_RECHECK_MAX_STEPS` | 16 | recheck-pass step cap |
|
| `GADFLY_RECHECK_MAX_STEPS` | 16 | recheck-pass step cap |
|
||||||
| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the prompt (full diff via `get_diff`) |
|
| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the **review** prompt (the full diff is reachable via the paginated `get_diff` tool, scoped per file with its `path` arg) |
|
||||||
|
| `GADFLY_RECHECK_DIFF_CHARS` | 20000 | diff chars embedded in the **recheck** prompt (smaller — the recheck pages `get_diff` for the hunks it verifies) |
|
||||||
|
| `GADFLY_COMPACT` | on | context compaction (via [executus](https://gitea.stevedudenhoeffer.com/steve/executus)): fold the transcript's runaway middle into a summary as it nears the model's context window, so a big diff + accumulating tool output can't balloon every step. `0` disables |
|
||||||
|
| `GADFLY_COMPACT_RATIO` | 0.45 | fraction of the model's context window at which compaction fires |
|
||||||
|
| `GADFLY_COMPACT_MODEL` | worker, else review model | cheap model the compactor uses to summarize the folded middle |
|
||||||
|
| `GADFLY_COMPACT_KEEP_RECENT` | 8 | most-recent messages kept verbatim during compaction |
|
||||||
|
| `GADFLY_COMPACT_SUMMARY_WORDS` | 200 | word cap on the compaction summary |
|
||||||
|
| `GADFLY_MODEL_CONTEXT_TOKENS` | *(auto)* | override the model's context-window size (tokens) for the compaction threshold; set it for self-hosted endpoints executus can't introspect (Ollama Cloud models resolve automatically) |
|
||||||
|
| `GADFLY_PR_TOKEN_BUDGET` | — | per-model token ceiling for this PR; once spent, remaining lenses/passes are skipped (advisory). 0 = off |
|
||||||
|
| `GADFLY_PR_TIME_BUDGET_SECS` | — | per-model wall-clock ceiling for this PR (advisory). 0 = off |
|
||||||
| `GADFLY_STATUS_BOARD` | on | set `0` to disable the live status-board comment |
|
| `GADFLY_STATUS_BOARD` | on | set `0` to disable the live status-board comment |
|
||||||
| `GADFLY_STATUS_POLL_SECS` | 12 | how often the status board re-renders/upserts |
|
| `GADFLY_STATUS_POLL_SECS` | 12 | how often the status board re-renders/upserts |
|
||||||
|
| `GADFLY_CONSOLIDATE` | `auto` | cross-model consensus comment: `auto` (on for ≥2 models), `1` (force on), `0` (off — one comment per model) |
|
||||||
|
| `GADFLY_INLINE_REVIEW` | on | when consolidating, also post a `COMMENT`-state PR review with inline comments on changed lines; `0` disables |
|
||||||
| `GADFLY_TRIGGER_PHRASE` | `@gadfly review` | comment phrase that re-triggers |
|
| `GADFLY_TRIGGER_PHRASE` | `@gadfly review` | comment phrase that re-triggers |
|
||||||
| `GADFLY_ALLOWED_USERS` | *(collaborators)* | comma-separated allow-list for comment triggers |
|
| `GADFLY_ALLOWED_USERS` | *(collaborators)* | comma-separated allow-list for comment triggers |
|
||||||
| `GADFLY_FINDINGS_URL` | — | gadfly-reports store base URL; set to enable findings telemetry (off when empty) |
|
| `GADFLY_FINDINGS_URL` | — | gadfly-reports store base URL; set to enable findings telemetry (off when empty) |
|
||||||
@@ -369,6 +417,37 @@ The reviewer binary reads these (the stub/entrypoint set sane defaults):
|
|||||||
| `GADFLY_REPO` | *(from `GITEA_API`)* | `owner/repo` slug stamped on emitted runs/findings (set by `entrypoint.sh`) |
|
| `GADFLY_REPO` | *(from `GITEA_API`)* | `owner/repo` slug stamped on emitted runs/findings (set by `entrypoint.sh`) |
|
||||||
| `GADFLY_PR` | *(from event)* | PR number stamped on emitted runs/findings (set by `entrypoint.sh`) |
|
| `GADFLY_PR` | *(from event)* | PR number stamped on emitted runs/findings (set by `entrypoint.sh`) |
|
||||||
|
|
||||||
|
### Large-PR cost controls
|
||||||
|
|
||||||
|
A very large diff is the one thing that can blow the budget: every review step
|
||||||
|
re-sends it, multiplied across models × lenses × passes × steps (a single
|
||||||
|
~250 K-token PR can otherwise burn a whole metered usage block). Gadfly handles
|
||||||
|
big PRs in three layers, all **size-gated so small PRs are untouched**:
|
||||||
|
|
||||||
|
1. **Paginated `get_diff` + compaction** (reviewer binary, on by default) —
|
||||||
|
`get_diff` returns a paginated, optionally per-file window instead of the whole
|
||||||
|
diff, and once a transcript nears the model's context window its middle is
|
||||||
|
folded into a summary (powered by [executus](https://gitea.stevedudenhoeffer.com/steve/executus)'s
|
||||||
|
`compact`). Tune with the `GADFLY_COMPACT_*` knobs above.
|
||||||
|
2. **Downshift** (`entrypoint.sh`) — above `GADFLY_HUGE_DIFF_BYTES` the whole fleet
|
||||||
|
collapses to a single cheap model + a focused lens subset, fewer steps, and no
|
||||||
|
recheck. A finished shallow review beats a budget-nuking one, and the posted
|
||||||
|
comment says so.
|
||||||
|
3. **Hard backstop** (`entrypoint.sh`) — `GADFLY_PR_BUDGET_SECS` is a wall-clock
|
||||||
|
ceiling across the *entire* fleet; on expiry the review is stopped and whatever
|
||||||
|
was found so far is posted. Like everything else, it never fails CI.
|
||||||
|
|
||||||
|
| Env | Default | Meaning |
|
||||||
|
|-----|---------|---------|
|
||||||
|
| `GADFLY_HUGE_DIFF_BYTES` | 600000 | downshift the fleet when the PR diff exceeds this many bytes (0 = never downshift) |
|
||||||
|
| `GADFLY_HUGE_DIFF_MODELS` | first model | model(s) to run on a downshifted huge PR |
|
||||||
|
| `GADFLY_HUGE_DIFF_SPECIALISTS` | `security,correctness,error-handling` | lenses on a downshifted huge PR |
|
||||||
|
| `GADFLY_HUGE_DIFF_MAX_STEPS` | 12 | review step cap on a huge PR |
|
||||||
|
| `GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS` | 8 | recheck step cap on a huge PR |
|
||||||
|
| `GADFLY_HUGE_DIFF_RECHECK` | 0 | run the recheck pass on a huge PR (off by default) |
|
||||||
|
| `GADFLY_HUGE_DIFF_MAX_DIFF_CHARS` | 20000 | embedded review-diff chars on a huge PR |
|
||||||
|
| `GADFLY_PR_BUDGET_SECS` | — | swarm-wide wall-clock backstop; stops the whole fleet when reached (0 = off) |
|
||||||
|
|
||||||
## Findings telemetry (optional)
|
## Findings telemetry (optional)
|
||||||
|
|
||||||
Gadfly can record what it found so model quality can be tracked over time. It is
|
Gadfly can record what it found so model quality can be tracked over time. It is
|
||||||
@@ -392,7 +471,7 @@ code.
|
|||||||
## Building locally
|
## Building locally
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
go build ./cmd/gadfly # needs read access to the private majordomo module
|
go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
|
||||||
go test ./...
|
go test ./...
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,477 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
// Cross-model consensus consolidation. The swarm runs each model independently
|
||||||
|
// (entrypoint.sh fans them out across provider lanes); historically each model
|
||||||
|
// posted its OWN comment, so a reader faced N walls of prose that mostly agreed.
|
||||||
|
//
|
||||||
|
// Instead, every model writes its findings to a shared directory
|
||||||
|
// (GADFLY_FINDINGS_OUT, one JSON file per model), and after the whole swarm
|
||||||
|
// finishes a single consolidation pass (GADFLY_CONSOLIDATE_DIR) clusters those
|
||||||
|
// findings by location, counts how many models independently flagged each one,
|
||||||
|
// and renders ONE comment: an agreement-ranked table up top (cross-model
|
||||||
|
// agreement is the strongest real-vs-false-positive signal we have), with each
|
||||||
|
// model's full review folded below for drill-down.
|
||||||
|
//
|
||||||
|
// This file owns: the per-model artifact (modelFindings), writing it
|
||||||
|
// (writeFindingsOut), reading the directory back, clustering, and rendering the
|
||||||
|
// consensus markdown (renderConsensus). It depends only on the stdlib.
|
||||||
|
|
||||||
|
import (
|
||||||
|
"encoding/json"
|
||||||
|
"errors"
|
||||||
|
"fmt"
|
||||||
|
"os"
|
||||||
|
"path/filepath"
|
||||||
|
"sort"
|
||||||
|
"strings"
|
||||||
|
)
|
||||||
|
|
||||||
|
// consensusMarker tags the single consolidated comment so entrypoint.sh can
|
||||||
|
// upsert it in place across re-runs (mirrors run.sh's per-model marker).
|
||||||
|
const consensusMarker = "<!-- gadfly-consensus -->"
|
||||||
|
|
||||||
|
// modelFindings is the per-model artifact written to GADFLY_FINDINGS_OUT. It
|
||||||
|
// carries enough to rebuild the consolidated comment without re-running models:
|
||||||
|
// the structured findings (for clustering) plus the full rendered review (for
|
||||||
|
// the folded per-model drill-down).
|
||||||
|
type modelFindings struct {
|
||||||
|
Model string `json:"model"`
|
||||||
|
Provider string `json:"provider"`
|
||||||
|
Verdict string `json:"verdict"` // worst lens verdict, as a label
|
||||||
|
Errored bool `json:"errored"` // the model produced no usable review (every lens failed, or the run crashed)
|
||||||
|
Markdown string `json:"markdown"` // full rendered per-model review (findings block already stripped)
|
||||||
|
Findings []outFinding `json:"findings"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// outFinding is one finding in the per-model artifact (a flattened `finding`
|
||||||
|
// plus its lens).
|
||||||
|
type outFinding struct {
|
||||||
|
Lens string `json:"lens"`
|
||||||
|
File string `json:"file"`
|
||||||
|
Line int `json:"line"`
|
||||||
|
Severity string `json:"severity"`
|
||||||
|
Confidence string `json:"confidence"`
|
||||||
|
Title string `json:"title"`
|
||||||
|
Detail string `json:"detail"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// collectFindings returns a lens result's findings with severity always filled
|
||||||
|
// in: per-finding when the structured block supplied it, else derived from the
|
||||||
|
// lens verdict (so heuristic-scraped findings still carry a canonical word). A
|
||||||
|
// clean or errored lens yields nothing. Shared by the telemetry emit and the
|
||||||
|
// per-model findings file so both agree on what a finding is.
|
||||||
|
func collectFindings(r specialistResult) []finding {
|
||||||
|
if r.errored || r.verdict == verdictClean {
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
fs := extractStructuredFindingsOrScrape(r)
|
||||||
|
lensSev := r.verdict.severity()
|
||||||
|
for i := range fs {
|
||||||
|
if fs[i].severity == "" {
|
||||||
|
fs[i].severity = lensSev
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return fs
|
||||||
|
}
|
||||||
|
|
||||||
|
// writeFindingsOut writes this model's findings + rendered review to
|
||||||
|
// GADFLY_FINDINGS_OUT for the later consolidation pass. No-op unless the env is
|
||||||
|
// set. Best-effort: any error is logged to stderr and never affects the review
|
||||||
|
// (it runs after the markdown is already on stdout).
|
||||||
|
func writeFindingsOut(results []specialistResult) {
|
||||||
|
path := strings.TrimSpace(os.Getenv("GADFLY_FINDINGS_OUT"))
|
||||||
|
if path == "" {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
mf := modelFindings{
|
||||||
|
Model: strings.TrimSpace(os.Getenv("GADFLY_MODEL")),
|
||||||
|
Provider: modelProvider(),
|
||||||
|
Verdict: worstVerdict(results).label(),
|
||||||
|
Errored: allErrored(results),
|
||||||
|
Markdown: renderConsolidated(results),
|
||||||
|
}
|
||||||
|
for _, r := range results {
|
||||||
|
for _, f := range collectFindings(r) {
|
||||||
|
mf.Findings = append(mf.Findings, outFinding{
|
||||||
|
Lens: r.spec.Name,
|
||||||
|
File: f.file,
|
||||||
|
Line: f.line,
|
||||||
|
Severity: f.severity,
|
||||||
|
Confidence: f.confidence,
|
||||||
|
Title: f.title,
|
||||||
|
Detail: f.detail,
|
||||||
|
})
|
||||||
|
}
|
||||||
|
}
|
||||||
|
data, err := json.Marshal(mf)
|
||||||
|
if err != nil {
|
||||||
|
fmt.Fprintln(os.Stderr, "gadfly: marshal findings out:", err)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
// Defensive: make sure the parent dir exists (entrypoint creates it, but a
|
||||||
|
// missing dir would otherwise silently drop this model from the consensus).
|
||||||
|
if dir := filepath.Dir(path); dir != "" {
|
||||||
|
_ = os.MkdirAll(dir, 0o755)
|
||||||
|
}
|
||||||
|
if err := os.WriteFile(path, data, 0o644); err != nil {
|
||||||
|
fmt.Fprintln(os.Stderr, "gadfly: write findings out:", err)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// allErrored reports whether every lens of a review failed (so the model
|
||||||
|
// produced no usable findings). Such a model is recorded but excluded from the
|
||||||
|
// consensus agreement denominator — counting it would dilute every ratio with a
|
||||||
|
// model that never actually reviewed.
|
||||||
|
func allErrored(results []specialistResult) bool {
|
||||||
|
if len(results) == 0 {
|
||||||
|
return true
|
||||||
|
}
|
||||||
|
for _, r := range results {
|
||||||
|
if !r.errored {
|
||||||
|
return false
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return true
|
||||||
|
}
|
||||||
|
|
||||||
|
// runConsolidate is the consolidation entry point (GADFLY_CONSOLIDATE_DIR set):
|
||||||
|
// read every per-model artifact in the directory, render the consensus comment
|
||||||
|
// to stdout. Errors are fatal to THIS process only — entrypoint.sh treats a
|
||||||
|
// failed consolidation as advisory and falls back to per-model comments.
|
||||||
|
func runConsolidate() error {
|
||||||
|
dir := strings.TrimSpace(os.Getenv("GADFLY_CONSOLIDATE_DIR"))
|
||||||
|
if dir == "" {
|
||||||
|
return errors.New("GADFLY_CONSOLIDATE_DIR is empty")
|
||||||
|
}
|
||||||
|
entries, err := os.ReadDir(dir)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("read consolidate dir: %w", err)
|
||||||
|
}
|
||||||
|
var models []modelFindings
|
||||||
|
for _, e := range entries {
|
||||||
|
if e.IsDir() || !strings.HasSuffix(e.Name(), ".json") {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
data, err := os.ReadFile(filepath.Join(dir, e.Name()))
|
||||||
|
if err != nil {
|
||||||
|
fmt.Fprintln(os.Stderr, "gadfly: read", e.Name(), err)
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
var mf modelFindings
|
||||||
|
if err := json.Unmarshal(data, &mf); err != nil {
|
||||||
|
fmt.Fprintln(os.Stderr, "gadfly: parse", e.Name(), err)
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
if strings.TrimSpace(mf.Model) == "" {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
models = append(models, mf)
|
||||||
|
}
|
||||||
|
if len(models) == 0 {
|
||||||
|
return errors.New("no model findings to consolidate")
|
||||||
|
}
|
||||||
|
// Cluster once, then render the consensus comment and (best-effort) the inline
|
||||||
|
// PR review from the same clusters so the two views can't drift.
|
||||||
|
clusters := clusterFindings(models)
|
||||||
|
|
||||||
|
// Lead with the marker so entrypoint.sh can upsert this comment in place
|
||||||
|
// (same pattern as run.sh's per-model marker); it appends the advisory footer.
|
||||||
|
fmt.Println(consensusMarker)
|
||||||
|
fmt.Println(renderConsensus(models, clusters))
|
||||||
|
|
||||||
|
// Inline PR review (COMMENT state) anchoring findings to changed lines.
|
||||||
|
// Best-effort: no-op without diff/API creds, never affects stdout/exit.
|
||||||
|
postInlineReview(clusters)
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// cluster is a group of findings (across models) judged to be the same issue:
|
||||||
|
// same file, lines within lineTolerance of the cluster's current span. The span
|
||||||
|
// [line,maxLine] slides as members join, so a chain of nearby findings merges
|
||||||
|
// instead of splitting once it drifts past the first line.
|
||||||
|
type cluster struct {
|
||||||
|
file string
|
||||||
|
line int // representative (smallest) line
|
||||||
|
maxLine int // largest line in the cluster — the span's upper edge
|
||||||
|
severity string
|
||||||
|
title string
|
||||||
|
detail string // highest-severity report's detail; rendered in the inline review comment
|
||||||
|
models map[string]bool
|
||||||
|
lenses map[string]bool
|
||||||
|
}
|
||||||
|
|
||||||
|
// findingRef is one model's finding (carrying which model reported it), used
|
||||||
|
// while grouping findings into clusters.
|
||||||
|
type findingRef struct {
|
||||||
|
f outFinding
|
||||||
|
model string
|
||||||
|
}
|
||||||
|
|
||||||
|
// lineTolerance: a finding in the same file within this many lines of a
|
||||||
|
// cluster's current span is treated as the same issue (models often cite a line
|
||||||
|
// or two apart).
|
||||||
|
const lineTolerance = 3
|
||||||
|
|
||||||
|
// renderConsensus builds the single consolidated comment body from every model's
|
||||||
|
// findings. It does NOT emit the marker or advisory footer — entrypoint.sh wraps
|
||||||
|
// it (mirroring run.sh's per-model framing).
|
||||||
|
func renderConsensus(models []modelFindings, clusters []cluster) string {
|
||||||
|
// effective = models that actually produced a review. Errored models are
|
||||||
|
// shown (folded, below) but excluded from the agreement denominator so a
|
||||||
|
// failed model doesn't dilute every ratio.
|
||||||
|
effective := 0
|
||||||
|
for _, m := range models {
|
||||||
|
if !m.Errored {
|
||||||
|
effective++
|
||||||
|
}
|
||||||
|
}
|
||||||
|
errored := len(models) - effective
|
||||||
|
|
||||||
|
worst := verdictClean
|
||||||
|
for _, m := range models {
|
||||||
|
if v := parseVerdict(m.Verdict); v > worst {
|
||||||
|
worst = v
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Partition in one pass: "headline" findings (multi-model agreement, OR a
|
||||||
|
// lone CRITICAL) vs folded "single-model" lower-confidence findings. Also
|
||||||
|
// count multi-model agreements for the summary line.
|
||||||
|
var headline, folded []cluster
|
||||||
|
agreed := 0
|
||||||
|
for _, c := range clusters {
|
||||||
|
if len(c.models) >= 2 {
|
||||||
|
agreed++
|
||||||
|
}
|
||||||
|
if len(c.models) >= 2 || sevRank(c.severity) >= sevRank("critical") {
|
||||||
|
headline = append(headline, c)
|
||||||
|
} else {
|
||||||
|
folded = append(folded, c)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
var b strings.Builder
|
||||||
|
fmt.Fprintf(&b, "## 🪰 Gadfly review — consensus across %d model%s", effective, plural(effective))
|
||||||
|
if errored > 0 {
|
||||||
|
fmt.Fprintf(&b, " (%d failed)", errored)
|
||||||
|
}
|
||||||
|
b.WriteString("\n\n")
|
||||||
|
fmt.Fprintf(&b, "**Verdict: %s** · %d finding%s (%d with multi-model agreement)\n",
|
||||||
|
worst.label(), len(clusters), plural(len(clusters)), agreed)
|
||||||
|
|
||||||
|
if len(headline) > 0 {
|
||||||
|
b.WriteString("\n| | Finding | Where | Models | Lens |\n|--|--|--|--|--|\n")
|
||||||
|
for _, c := range headline {
|
||||||
|
fmt.Fprintf(&b, "| %s | %s | `%s` | %d/%d | %s |\n",
|
||||||
|
sevIcon(c.severity), mdCell(c.title), mdCell(location(c.file, c.line)),
|
||||||
|
len(c.models), effective, mdCell(lensList(c.lenses)))
|
||||||
|
}
|
||||||
|
} else if len(clusters) == 0 {
|
||||||
|
b.WriteString("\nNo material issues found by consensus.\n")
|
||||||
|
}
|
||||||
|
// else: only single-model findings — they're shown folded below, so don't
|
||||||
|
// claim "no material issues" (there are some, just none with consensus).
|
||||||
|
|
||||||
|
if len(folded) > 0 {
|
||||||
|
fmt.Fprintf(&b, "\n<details><summary>%d single-model finding%s (lower confidence)</summary>\n\n",
|
||||||
|
len(folded), plural(len(folded)))
|
||||||
|
b.WriteString("| | Finding | Where | Model | Lens |\n|--|--|--|--|--|\n")
|
||||||
|
for _, c := range folded {
|
||||||
|
fmt.Fprintf(&b, "| %s | %s | `%s` | %s | %s |\n",
|
||||||
|
sevIcon(c.severity), mdCell(c.title), mdCell(location(c.file, c.line)),
|
||||||
|
mdCell(oneModel(c.models)), mdCell(lensList(c.lenses)))
|
||||||
|
}
|
||||||
|
b.WriteString("\n</details>\n")
|
||||||
|
}
|
||||||
|
|
||||||
|
// Per-model full reviews, folded for drill-down (nothing is lost).
|
||||||
|
b.WriteString("\n<details><summary>Per-model detail</summary>\n")
|
||||||
|
for _, m := range models {
|
||||||
|
body := strings.TrimSpace(m.Markdown)
|
||||||
|
if body == "" {
|
||||||
|
body = "_(no output)_"
|
||||||
|
}
|
||||||
|
verdict := m.Verdict
|
||||||
|
if m.Errored {
|
||||||
|
verdict = "⚠️ reviewer failed"
|
||||||
|
}
|
||||||
|
fmt.Fprintf(&b, "\n<details><summary><b>%s</b> (%s) — %s</summary>\n\n%s\n\n</details>\n",
|
||||||
|
mdCell(m.Model), mdCell(m.Provider), verdict, body)
|
||||||
|
}
|
||||||
|
b.WriteString("\n</details>")
|
||||||
|
return b.String()
|
||||||
|
}
|
||||||
|
|
||||||
|
// clusterFindings groups every model's findings into cross-model clusters,
|
||||||
|
// sorted by agreement (desc), then severity (desc), then location.
|
||||||
|
func clusterFindings(models []modelFindings) []cluster {
|
||||||
|
// Group by file, then greedily merge by line proximity.
|
||||||
|
byFile := map[string][]findingRef{}
|
||||||
|
for _, m := range models {
|
||||||
|
for _, f := range m.Findings {
|
||||||
|
if strings.TrimSpace(f.File) == "" {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
byFile[f.File] = append(byFile[f.File], findingRef{f, m.Model})
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
var clusters []cluster
|
||||||
|
for file, items := range byFile {
|
||||||
|
sort.SliceStable(items, func(i, j int) bool { return items[i].f.Line < items[j].f.Line })
|
||||||
|
// Cluster within THIS file only (clusters never span files), so the inner
|
||||||
|
// scan is over same-file clusters, not every cluster seen so far.
|
||||||
|
var fileClusters []cluster
|
||||||
|
for _, it := range items {
|
||||||
|
placed := false
|
||||||
|
for ci := range fileClusters {
|
||||||
|
c := &fileClusters[ci]
|
||||||
|
// Join if the line falls within the cluster's span, widened by the
|
||||||
|
// tolerance on both edges — so the window slides as the span grows.
|
||||||
|
if it.f.Line >= c.line-lineTolerance && it.f.Line <= c.maxLine+lineTolerance {
|
||||||
|
mergeIntoCluster(c, it.f, it.model)
|
||||||
|
placed = true
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if !placed {
|
||||||
|
c := cluster{
|
||||||
|
file: file,
|
||||||
|
line: it.f.Line,
|
||||||
|
maxLine: it.f.Line,
|
||||||
|
severity: it.f.Severity,
|
||||||
|
title: it.f.Title,
|
||||||
|
detail: it.f.Detail,
|
||||||
|
models: map[string]bool{},
|
||||||
|
lenses: map[string]bool{},
|
||||||
|
}
|
||||||
|
mergeIntoCluster(&c, it.f, it.model)
|
||||||
|
fileClusters = append(fileClusters, c)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
clusters = append(clusters, fileClusters...)
|
||||||
|
}
|
||||||
|
|
||||||
|
sort.SliceStable(clusters, func(i, j int) bool {
|
||||||
|
if len(clusters[i].models) != len(clusters[j].models) {
|
||||||
|
return len(clusters[i].models) > len(clusters[j].models)
|
||||||
|
}
|
||||||
|
if sevRank(clusters[i].severity) != sevRank(clusters[j].severity) {
|
||||||
|
return sevRank(clusters[i].severity) > sevRank(clusters[j].severity)
|
||||||
|
}
|
||||||
|
if clusters[i].file != clusters[j].file {
|
||||||
|
return clusters[i].file < clusters[j].file
|
||||||
|
}
|
||||||
|
return clusters[i].line < clusters[j].line
|
||||||
|
})
|
||||||
|
return clusters
|
||||||
|
}
|
||||||
|
|
||||||
|
// mergeIntoCluster folds one finding into a cluster: union the model/lens sets,
|
||||||
|
// widen the [line,maxLine] span, and keep the highest-severity report's title.
|
||||||
|
func mergeIntoCluster(c *cluster, f outFinding, model string) {
|
||||||
|
if model != "" {
|
||||||
|
c.models[model] = true
|
||||||
|
}
|
||||||
|
if f.Lens != "" {
|
||||||
|
c.lenses[f.Lens] = true
|
||||||
|
}
|
||||||
|
if f.Line > 0 && (c.line == 0 || f.Line < c.line) {
|
||||||
|
c.line = f.Line
|
||||||
|
}
|
||||||
|
if f.Line > c.maxLine {
|
||||||
|
c.maxLine = f.Line
|
||||||
|
}
|
||||||
|
// Backfill an empty title/detail from any report, regardless of severity, so a
|
||||||
|
// higher-severity-but-terse finding doesn't leave the cluster without context.
|
||||||
|
if strings.TrimSpace(c.title) == "" && strings.TrimSpace(f.Title) != "" {
|
||||||
|
c.title = f.Title
|
||||||
|
}
|
||||||
|
if strings.TrimSpace(c.detail) == "" && strings.TrimSpace(f.Detail) != "" {
|
||||||
|
c.detail = f.Detail
|
||||||
|
}
|
||||||
|
// A strictly-higher-severity report takes over the title/detail.
|
||||||
|
if sevRank(f.Severity) > sevRank(c.severity) {
|
||||||
|
c.severity = f.Severity
|
||||||
|
if strings.TrimSpace(f.Title) != "" {
|
||||||
|
c.title = f.Title
|
||||||
|
}
|
||||||
|
if strings.TrimSpace(f.Detail) != "" {
|
||||||
|
c.detail = f.Detail
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// sevRank orders the canonical severity words for sorting/comparison.
|
||||||
|
func sevRank(s string) int {
|
||||||
|
switch strings.ToLower(strings.TrimSpace(s)) {
|
||||||
|
case "critical":
|
||||||
|
return 5
|
||||||
|
case "high":
|
||||||
|
return 4
|
||||||
|
case "medium":
|
||||||
|
return 3
|
||||||
|
case "small":
|
||||||
|
return 2
|
||||||
|
case "trivial":
|
||||||
|
return 1
|
||||||
|
default:
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// sevIcon is the at-a-glance severity badge for the consensus table.
|
||||||
|
func sevIcon(s string) string {
|
||||||
|
switch strings.ToLower(strings.TrimSpace(s)) {
|
||||||
|
case "critical", "high":
|
||||||
|
return "🔴"
|
||||||
|
case "medium":
|
||||||
|
return "🟠"
|
||||||
|
case "small":
|
||||||
|
return "🟡"
|
||||||
|
default:
|
||||||
|
return "⚪"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func location(file string, line int) string {
|
||||||
|
if line > 0 {
|
||||||
|
return fmt.Sprintf("%s:%d", file, line)
|
||||||
|
}
|
||||||
|
return file
|
||||||
|
}
|
||||||
|
|
||||||
|
func lensList(lenses map[string]bool) string {
|
||||||
|
out := make([]string, 0, len(lenses))
|
||||||
|
for l := range lenses {
|
||||||
|
out = append(out, l)
|
||||||
|
}
|
||||||
|
sort.Strings(out)
|
||||||
|
return strings.Join(out, ", ")
|
||||||
|
}
|
||||||
|
|
||||||
|
func oneModel(models map[string]bool) string {
|
||||||
|
for m := range models {
|
||||||
|
return m
|
||||||
|
}
|
||||||
|
return ""
|
||||||
|
}
|
||||||
|
|
||||||
|
// mdCell makes a string safe for a one-line markdown table cell: collapse
|
||||||
|
// newlines, escape pipes (which delimit columns), and neutralize backticks
|
||||||
|
// (a stray one would break an inline-code span — a backslash can't escape it
|
||||||
|
// inside code, so replace with an apostrophe). Inputs are model-influenced, so
|
||||||
|
// this keeps a malformed file path or title from breaking the table.
|
||||||
|
func mdCell(s string) string {
|
||||||
|
s = strings.ReplaceAll(s, "\n", " ")
|
||||||
|
s = strings.ReplaceAll(s, "|", "\\|")
|
||||||
|
s = strings.ReplaceAll(s, "`", "'")
|
||||||
|
return strings.TrimSpace(s)
|
||||||
|
}
|
||||||
|
|
||||||
|
func plural(n int) string {
|
||||||
|
if n == 1 {
|
||||||
|
return ""
|
||||||
|
}
|
||||||
|
return "s"
|
||||||
|
}
|
||||||
@@ -0,0 +1,217 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
import (
|
||||||
|
"encoding/json"
|
||||||
|
"os"
|
||||||
|
"path/filepath"
|
||||||
|
"strings"
|
||||||
|
"testing"
|
||||||
|
)
|
||||||
|
|
||||||
|
func TestClusterFindingsAgreementAndTolerance(t *testing.T) {
|
||||||
|
models := []modelFindings{
|
||||||
|
{Model: "m1", Verdict: "Blocking issues found", Findings: []outFinding{
|
||||||
|
{Lens: "security", File: "a.go", Line: 10, Severity: "high", Title: "auth bypass"},
|
||||||
|
{Lens: "perf", File: "b.go", Line: 5, Severity: "trivial", Title: "tiny nit"},
|
||||||
|
}},
|
||||||
|
{Model: "m2", Verdict: "Minor issues", Findings: []outFinding{
|
||||||
|
{Lens: "security", File: "a.go", Line: 11, Severity: "critical", Title: "auth bypass (crit)"}, // within tolerance of a.go:10
|
||||||
|
}},
|
||||||
|
{Model: "m3", Verdict: "Minor issues", Findings: []outFinding{
|
||||||
|
{Lens: "correctness", File: "a.go", Line: 10, Severity: "medium", Title: "auth bypass"},
|
||||||
|
}},
|
||||||
|
}
|
||||||
|
clusters := clusterFindings(models)
|
||||||
|
if len(clusters) != 2 {
|
||||||
|
t.Fatalf("want 2 clusters (a.go:10±, b.go:5), got %d: %+v", len(clusters), clusters)
|
||||||
|
}
|
||||||
|
// First cluster (highest agreement) is the a.go auth one: 3 models, severity
|
||||||
|
// escalated to critical, representative line the smallest (10).
|
||||||
|
c := clusters[0]
|
||||||
|
if len(c.models) != 3 {
|
||||||
|
t.Errorf("want 3 models on the top cluster, got %d", len(c.models))
|
||||||
|
}
|
||||||
|
if c.severity != "critical" {
|
||||||
|
t.Errorf("want escalated severity critical, got %q", c.severity)
|
||||||
|
}
|
||||||
|
if c.line != 10 {
|
||||||
|
t.Errorf("want representative line 10, got %d", c.line)
|
||||||
|
}
|
||||||
|
if !c.lenses["security"] || !c.lenses["correctness"] {
|
||||||
|
t.Errorf("want union of lenses, got %v", c.lenses)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestRenderConsensusFoldsSingleModelNits(t *testing.T) {
|
||||||
|
models := []modelFindings{
|
||||||
|
{Model: "m1", Provider: "p", Verdict: "Blocking issues found", Markdown: "m1 detail", Findings: []outFinding{
|
||||||
|
{Lens: "security", File: "a.go", Line: 10, Severity: "high", Title: "auth bypass"},
|
||||||
|
{Lens: "perf", File: "b.go", Line: 5, Severity: "trivial", Title: "tiny nit"},
|
||||||
|
}},
|
||||||
|
{Model: "m2", Provider: "p", Verdict: "Minor issues", Markdown: "m2 detail", Findings: []outFinding{
|
||||||
|
{Lens: "security", File: "a.go", Line: 10, Severity: "high", Title: "auth bypass"},
|
||||||
|
}},
|
||||||
|
}
|
||||||
|
out := renderConsensus(models, clusterFindings(models))
|
||||||
|
// Headline table: the agreed finding with a 2/2 badge.
|
||||||
|
if !strings.Contains(out, "2/2") {
|
||||||
|
t.Errorf("expected a 2/2 agreement badge in headline:\n%s", out)
|
||||||
|
}
|
||||||
|
if !strings.Contains(out, "auth bypass") || !strings.Contains(out, "a.go:10") {
|
||||||
|
t.Errorf("headline missing the consensus finding:\n%s", out)
|
||||||
|
}
|
||||||
|
// The lone trivial finding is folded, not in the headline table.
|
||||||
|
if !strings.Contains(out, "single-model finding") {
|
||||||
|
t.Errorf("expected a folded single-model section:\n%s", out)
|
||||||
|
}
|
||||||
|
// Per-model detail is preserved (folded).
|
||||||
|
if !strings.Contains(out, "m1 detail") || !strings.Contains(out, "m2 detail") {
|
||||||
|
t.Errorf("per-model detail not preserved:\n%s", out)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestRenderConsensusHighSeverityLoneFindingStaysHeadline(t *testing.T) {
|
||||||
|
// A single model, single critical finding must still surface in the headline
|
||||||
|
// (not be folded as "low confidence").
|
||||||
|
models := []modelFindings{
|
||||||
|
{Model: "solo", Verdict: "Blocking issues found", Markdown: "x", Findings: []outFinding{
|
||||||
|
{Lens: "security", File: "a.go", Line: 1, Severity: "critical", Title: "rce"},
|
||||||
|
}},
|
||||||
|
}
|
||||||
|
out := renderConsensus(models, clusterFindings(models))
|
||||||
|
headline := out
|
||||||
|
if i := strings.Index(out, "single-model finding"); i >= 0 {
|
||||||
|
headline = out[:i]
|
||||||
|
}
|
||||||
|
if !strings.Contains(headline, "rce") {
|
||||||
|
t.Errorf("lone critical should be in the headline, not folded:\n%s", out)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestClusterSlidingWindowMergesChain(t *testing.T) {
|
||||||
|
// Findings at 10, 13, 16 (each 3 apart) from three models must merge into ONE
|
||||||
|
// cluster — the window slides with the span instead of anchoring at line 10.
|
||||||
|
models := []modelFindings{
|
||||||
|
{Model: "m1", Findings: []outFinding{{Lens: "x", File: "a.go", Line: 10, Severity: "medium", Title: "t"}}},
|
||||||
|
{Model: "m2", Findings: []outFinding{{Lens: "x", File: "a.go", Line: 13, Severity: "medium", Title: "t"}}},
|
||||||
|
{Model: "m3", Findings: []outFinding{{Lens: "x", File: "a.go", Line: 16, Severity: "medium", Title: "t"}}},
|
||||||
|
}
|
||||||
|
clusters := clusterFindings(models)
|
||||||
|
if len(clusters) != 1 {
|
||||||
|
t.Fatalf("chain 10/13/16 should merge into 1 cluster, got %d", len(clusters))
|
||||||
|
}
|
||||||
|
if len(clusters[0].models) != 3 {
|
||||||
|
t.Errorf("want 3 models in the merged cluster, got %d", len(clusters[0].models))
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestRenderConsensusExcludesErroredFromDenominator(t *testing.T) {
|
||||||
|
models := []modelFindings{
|
||||||
|
{Model: "m1", Verdict: "Minor issues", Markdown: "a", Findings: []outFinding{
|
||||||
|
{Lens: "security", File: "a.go", Line: 9, Severity: "medium", Title: "leak"}}},
|
||||||
|
{Model: "m2", Verdict: "Minor issues", Markdown: "b", Findings: []outFinding{
|
||||||
|
{Lens: "security", File: "a.go", Line: 9, Severity: "medium", Title: "leak"}}},
|
||||||
|
{Model: "broken", Verdict: "reviewer failed", Errored: true, Markdown: "boom"},
|
||||||
|
}
|
||||||
|
out := renderConsensus(models, clusterFindings(models))
|
||||||
|
// Denominator is the 2 effective models, not 3; the failure is noted.
|
||||||
|
if !strings.Contains(out, "2/2") {
|
||||||
|
t.Errorf("errored model must be excluded from the denominator (want 2/2):\n%s", out)
|
||||||
|
}
|
||||||
|
if !strings.Contains(out, "1 failed") {
|
||||||
|
t.Errorf("expected a '1 failed' note:\n%s", out)
|
||||||
|
}
|
||||||
|
if !strings.Contains(out, "reviewer failed") {
|
||||||
|
t.Errorf("errored model should still appear (folded) as failed:\n%s", out)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestRenderConsensusLoneHighFolds(t *testing.T) {
|
||||||
|
// A single-model HIGH (not critical) folds — only consensus or a lone CRITICAL
|
||||||
|
// earns the headline, so a lone Blocking-lens finding doesn't reintroduce noise.
|
||||||
|
models := []modelFindings{
|
||||||
|
{Model: "solo", Verdict: "Blocking issues found", Markdown: "x", Findings: []outFinding{
|
||||||
|
{Lens: "security", File: "a.go", Line: 1, Severity: "high", Title: "maybe-bug"}}},
|
||||||
|
}
|
||||||
|
out := renderConsensus(models, clusterFindings(models))
|
||||||
|
head := out
|
||||||
|
if i := strings.Index(out, "single-model finding"); i >= 0 {
|
||||||
|
head = out[:i]
|
||||||
|
}
|
||||||
|
if strings.Contains(head, "maybe-bug") {
|
||||||
|
t.Errorf("a lone HIGH should fold, not headline:\n%s", out)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestWriteAndConsolidateRoundTrip(t *testing.T) {
|
||||||
|
dir := t.TempDir()
|
||||||
|
|
||||||
|
// Two model artifacts on disk.
|
||||||
|
write := func(name string, mf modelFindings) {
|
||||||
|
data, _ := json.Marshal(mf)
|
||||||
|
if err := os.WriteFile(filepath.Join(dir, name), data, 0o644); err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
write("m1.json", modelFindings{Model: "m1", Provider: "ollama", Verdict: "Minor issues", Markdown: "md1",
|
||||||
|
Findings: []outFinding{{Lens: "security", File: "x.go", Line: 3, Severity: "medium", Title: "leak"}}})
|
||||||
|
write("m2.json", modelFindings{Model: "m2", Provider: "ollama", Verdict: "Minor issues", Markdown: "md2",
|
||||||
|
Findings: []outFinding{{Lens: "security", File: "x.go", Line: 3, Severity: "high", Title: "leak"}}})
|
||||||
|
// A junk file must be skipped, not crash consolidation.
|
||||||
|
if err := os.WriteFile(filepath.Join(dir, "notes.txt"), []byte("ignore me"), 0o644); err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
|
||||||
|
t.Setenv("GADFLY_CONSOLIDATE_DIR", dir)
|
||||||
|
// runConsolidate prints to stdout; capture it.
|
||||||
|
out := captureStdout(t, func() {
|
||||||
|
if err := runConsolidate(); err != nil {
|
||||||
|
t.Fatalf("runConsolidate: %v", err)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
if !strings.HasPrefix(strings.TrimSpace(out), consensusMarker) {
|
||||||
|
t.Errorf("consolidated output must lead with the marker:\n%s", out)
|
||||||
|
}
|
||||||
|
if !strings.Contains(out, "2/2") || !strings.Contains(out, "x.go:3") {
|
||||||
|
t.Errorf("expected the agreed x.go:3 finding at 2/2:\n%s", out)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestRunConsolidateEmptyDirErrors(t *testing.T) {
|
||||||
|
t.Setenv("GADFLY_CONSOLIDATE_DIR", t.TempDir())
|
||||||
|
if err := runConsolidate(); err == nil {
|
||||||
|
t.Error("want an error for an empty consolidate dir (entrypoint falls back)")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// captureStdout redirects os.Stdout for the duration of fn and returns what was
|
||||||
|
// written.
|
||||||
|
func captureStdout(t *testing.T, fn func()) string {
|
||||||
|
t.Helper()
|
||||||
|
orig := os.Stdout
|
||||||
|
r, w, err := os.Pipe()
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
os.Stdout = w
|
||||||
|
defer func() { os.Stdout = orig }() // restore even if fn panics
|
||||||
|
done := make(chan string)
|
||||||
|
go func() {
|
||||||
|
var sb strings.Builder
|
||||||
|
buf := make([]byte, 4096)
|
||||||
|
for {
|
||||||
|
n, err := r.Read(buf)
|
||||||
|
if n > 0 {
|
||||||
|
sb.Write(buf[:n])
|
||||||
|
}
|
||||||
|
if err != nil {
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
r.Close()
|
||||||
|
done <- sb.String()
|
||||||
|
}()
|
||||||
|
fn()
|
||||||
|
w.Close()
|
||||||
|
return <-done
|
||||||
|
}
|
||||||
@@ -28,6 +28,22 @@ func (v verdict) label() string {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// severity maps a verdict to a canonical severity word (the same vocabulary as
|
||||||
|
// the structured findings: critical/high/medium/small/trivial). Used as the
|
||||||
|
// raw_severity for heuristic-scraped findings, which carry no per-finding
|
||||||
|
// severity of their own — so the telemetry store sees one consistent vocabulary
|
||||||
|
// instead of a mix of canonical words and full verdict phrases.
|
||||||
|
func (v verdict) severity() string {
|
||||||
|
switch v {
|
||||||
|
case verdictBlocking:
|
||||||
|
return "high"
|
||||||
|
case verdictMinor:
|
||||||
|
return "small"
|
||||||
|
default:
|
||||||
|
return "trivial"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// parseVerdict extracts a specialist's self-reported verdict from its output.
|
// parseVerdict extracts a specialist's self-reported verdict from its output.
|
||||||
// The base prompt tells each lens to lead with one of the three phrases.
|
// The base prompt tells each lens to lead with one of the three phrases.
|
||||||
func parseVerdict(out string) verdict {
|
func parseVerdict(out string) verdict {
|
||||||
@@ -100,7 +116,9 @@ func renderConsolidated(results []specialistResult) string {
|
|||||||
headline, len(results), strings.Join(specialistNames(results), ", "))
|
headline, len(results), strings.Join(specialistNames(results), ", "))
|
||||||
|
|
||||||
for _, r := range results {
|
for _, r := range results {
|
||||||
body := strings.TrimSpace(r.out)
|
// Strip the machine-readable ```gadfly-findings block — it's for tooling
|
||||||
|
// (telemetry/consolidation), not for human readers of the comment.
|
||||||
|
body := strings.TrimSpace(stripFindingsBlock(r.out))
|
||||||
if body == "" {
|
if body == "" {
|
||||||
body = "_(no output)_"
|
body = "_(no output)_"
|
||||||
}
|
}
|
||||||
|
|||||||
+20
-19
@@ -10,11 +10,14 @@ package main
|
|||||||
// the review markdown consumed by run.sh) and never changes the exit code.
|
// the review markdown consumed by run.sh) and never changes the exit code.
|
||||||
// - It depends only on the Go stdlib (net/http).
|
// - It depends only on the Go stdlib (net/http).
|
||||||
//
|
//
|
||||||
// Findings are extracted heuristically from each lens's markdown: a "path:line"
|
// Findings come PRIMARILY from each lens's machine-readable ```gadfly-findings
|
||||||
// reference (e.g. run/executor.go:166) anchors a finding, whose title is the
|
// block (findings.go: exact file/line + per-finding severity/confidence). When a
|
||||||
// nearest preceding markdown heading / numbered item / bold lead-in (else the
|
// model emits no parseable block, emit falls back to a heuristic prose scrape: a
|
||||||
// first sentence of the finding's paragraph). This is best-effort signal for the
|
// "path:line" reference (e.g. run/executor.go:166) anchors a finding, whose title
|
||||||
// store to aggregate — not a structured contract the reviewer guarantees.
|
// is the nearest preceding markdown heading / numbered item / bold lead-in (else
|
||||||
|
// the first sentence of the finding's paragraph). The scrape is best-effort
|
||||||
|
// signal for the store to aggregate, not a structured contract the reviewer
|
||||||
|
// guarantees.
|
||||||
|
|
||||||
import (
|
import (
|
||||||
"bytes"
|
"bytes"
|
||||||
@@ -50,11 +53,16 @@ var pathLineRe = regexp.MustCompile(`([A-Za-z0-9_./-]+\.[A-Za-z0-9]+):(\d+)`)
|
|||||||
var numberedRe = regexp.MustCompile(`^\d+[.)]\s+(.+)$`)
|
var numberedRe = regexp.MustCompile(`^\d+[.)]\s+(.+)$`)
|
||||||
|
|
||||||
// finding is one extracted issue: where it points and a derived human title.
|
// finding is one extracted issue: where it points and a derived human title.
|
||||||
|
// severity/confidence are populated from the structured ```gadfly-findings block
|
||||||
|
// (extractStructuredFindings); they are empty for findings recovered by the
|
||||||
|
// heuristic prose scrape (parseFindings), which has no per-finding signal.
|
||||||
type finding struct {
|
type finding struct {
|
||||||
file string
|
file string
|
||||||
line int
|
line int
|
||||||
title string
|
title string
|
||||||
detail string
|
detail string
|
||||||
|
severity string // per-finding: critical/high/medium/small/trivial ("" if heuristic)
|
||||||
|
confidence string // per-finding: high/medium/low ("" if heuristic)
|
||||||
}
|
}
|
||||||
|
|
||||||
// runPayload is the POST /runs body. Field names match the gadfly-reports API EXACTLY.
|
// runPayload is the POST /runs body. Field names match the gadfly-reports API EXACTLY.
|
||||||
@@ -86,6 +94,7 @@ type reportPayload struct {
|
|||||||
Provider string `json:"provider"`
|
Provider string `json:"provider"`
|
||||||
RunID string `json:"run_id"`
|
RunID string `json:"run_id"`
|
||||||
RawSeverity string `json:"raw_severity"`
|
RawSeverity string `json:"raw_severity"`
|
||||||
|
Confidence string `json:"confidence"` // per-finding high/medium/low ("" if heuristic)
|
||||||
Detail string `json:"detail"`
|
Detail string `json:"detail"`
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -116,21 +125,12 @@ func emit(results []specialistResult, elapsed time.Duration) {
|
|||||||
// InputTokens/OutputTokens/CostUSD stay nil -> JSON null (not metered).
|
// InputTokens/OutputTokens/CostUSD stay nil -> JSON null (not metered).
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// collectFindings (consensus.go) skips clean/errored lenses and fills in each
|
||||||
|
// finding's severity (per-finding when structured, else derived from the lens
|
||||||
|
// verdict), so every reportPayload carries a canonical raw_severity.
|
||||||
var reports []reportPayload
|
var reports []reportPayload
|
||||||
for _, r := range results {
|
for _, r := range results {
|
||||||
if r.errored {
|
for _, f := range collectFindings(r) {
|
||||||
continue // a failed lens contributes no findings
|
|
||||||
}
|
|
||||||
// A lens that reports "No material issues found" has nothing to flag —
|
|
||||||
// its path:line references are verification notes ("verified X at
|
|
||||||
// file:line is safe"), not problems. Extracting them pollutes the
|
|
||||||
// findings store with false positives and unfairly penalizes thorough
|
|
||||||
// reviewers that do clean passes, so a clean lens emits no findings.
|
|
||||||
if r.verdict == verdictClean {
|
|
||||||
continue
|
|
||||||
}
|
|
||||||
sev := r.verdict.label()
|
|
||||||
for _, f := range parseFindings(r.spec, r.out) {
|
|
||||||
reports = append(reports, reportPayload{
|
reports = append(reports, reportPayload{
|
||||||
Repo: repo,
|
Repo: repo,
|
||||||
PR: pr,
|
PR: pr,
|
||||||
@@ -141,7 +141,8 @@ func emit(results []specialistResult, elapsed time.Duration) {
|
|||||||
Model: model,
|
Model: model,
|
||||||
Provider: provider,
|
Provider: provider,
|
||||||
RunID: runID,
|
RunID: runID,
|
||||||
RawSeverity: sev,
|
RawSeverity: f.severity,
|
||||||
|
Confidence: f.confidence,
|
||||||
Detail: f.detail,
|
Detail: f.detail,
|
||||||
})
|
})
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -157,7 +157,7 @@ func TestEmit_PostsRunsAndReports(t *testing.T) {
|
|||||||
if len(reportBody) != 2 {
|
if len(reportBody) != 2 {
|
||||||
t.Fatalf("/reports array length = %d, want 2", len(reportBody))
|
t.Fatalf("/reports array length = %d, want 2", len(reportBody))
|
||||||
}
|
}
|
||||||
for _, k := range []string{"repo", "pr", "lens", "file", "line", "title", "model", "provider", "run_id", "raw_severity", "detail"} {
|
for _, k := range []string{"repo", "pr", "lens", "file", "line", "title", "model", "provider", "run_id", "raw_severity", "confidence", "detail"} {
|
||||||
if _, ok := reportBody[0][k]; !ok {
|
if _, ok := reportBody[0][k]; !ok {
|
||||||
t.Errorf("/reports[0] missing field %q (got keys %v)", k, keysOf(reportBody[0]))
|
t.Errorf("/reports[0] missing field %q (got keys %v)", k, keysOf(reportBody[0]))
|
||||||
}
|
}
|
||||||
@@ -171,8 +171,11 @@ func TestEmit_PostsRunsAndReports(t *testing.T) {
|
|||||||
if reportBody[0]["line"] != float64(166) {
|
if reportBody[0]["line"] != float64(166) {
|
||||||
t.Errorf("reports[0].line = %v, want 166", reportBody[0]["line"])
|
t.Errorf("reports[0].line = %v, want 166", reportBody[0]["line"])
|
||||||
}
|
}
|
||||||
if reportBody[0]["raw_severity"] != "Blocking issues found" {
|
// No structured block in sampleLensMarkdown, so this is a heuristic-scraped
|
||||||
t.Errorf("reports[0].raw_severity = %v, want 'Blocking issues found'", reportBody[0]["raw_severity"])
|
// finding: raw_severity is the canonical word derived from the lens verdict
|
||||||
|
// (Blocking -> "high"), not the full verdict phrase.
|
||||||
|
if reportBody[0]["raw_severity"] != "high" {
|
||||||
|
t.Errorf("reports[0].raw_severity = %v, want 'high'", reportBody[0]["raw_severity"])
|
||||||
}
|
}
|
||||||
if reportBody[0]["run_id"] != "owner/repo#7:ollama-cloud/qwen3" {
|
if reportBody[0]["run_id"] != "owner/repo#7:ollama-cloud/qwen3" {
|
||||||
t.Errorf("reports[0].run_id = %v, want owner/repo#7:ollama-cloud/qwen3", reportBody[0]["run_id"])
|
t.Errorf("reports[0].run_id = %v, want owner/repo#7:ollama-cloud/qwen3", reportBody[0]["run_id"])
|
||||||
|
|||||||
@@ -33,14 +33,18 @@ type reviewEngine interface {
|
|||||||
runPass(ctx context.Context, system, task string, maxSteps int) (string, error)
|
runPass(ctx context.Context, system, task string, maxSteps int) (string, error)
|
||||||
}
|
}
|
||||||
|
|
||||||
// majordomoEngine drives the in-process majordomo agent over the repo sandbox.
|
// majordomoEngine drives the in-process review path over the repo sandbox. It no
|
||||||
|
// longer calls majordomo's agent loop directly: each pass runs through an
|
||||||
|
// executus run.Executor (see executus.go), which adds context compaction, run
|
||||||
|
// bounding, the per-PR budget gate, and the wrap-up critic. mdl is retained only
|
||||||
|
// so auto-select can fall back to the review model as its selector.
|
||||||
type majordomoEngine struct {
|
type majordomoEngine struct {
|
||||||
|
rex *reviewExecutor
|
||||||
mdl llm.Model
|
mdl llm.Model
|
||||||
fsTools *repoFS
|
|
||||||
}
|
}
|
||||||
|
|
||||||
func (e *majordomoEngine) runPass(ctx context.Context, system, task string, maxSteps int) (string, error) {
|
func (e *majordomoEngine) runPass(ctx context.Context, system, task string, maxSteps int) (string, error) {
|
||||||
return runAgent(ctx, e.mdl, e.fsTools, system, task, maxSteps)
|
return e.rex.run(ctx, system, task, maxSteps)
|
||||||
}
|
}
|
||||||
|
|
||||||
// claudeCodeEngine reviews by shelling out to the `claude` CLI (Claude Code) in
|
// claudeCodeEngine reviews by shelling out to the `claude` CLI (Claude Code) in
|
||||||
|
|||||||
@@ -0,0 +1,371 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
// executus.go wires gadfly's agentic review path onto the executus run kernel
|
||||||
|
// (gitea.stevedudenhoeffer.com/steve/executus), layered above majordomo. The
|
||||||
|
// majordomoEngine no longer drives majordomo's agent loop directly; it builds a
|
||||||
|
// run.Executor that gives gadfly, for free:
|
||||||
|
//
|
||||||
|
// - context compaction (executus/compact): once the transcript a step would
|
||||||
|
// SEND crosses a token threshold derived from the model's real context
|
||||||
|
// window, the runaway middle is folded into a one-paragraph summary by a
|
||||||
|
// cheap summarizer model — so a big diff + accumulating read_file/grep
|
||||||
|
// results can't balloon every re-sent step (the large-PR burn).
|
||||||
|
// - run bounding + a per-PR spend budget (executus/run Ports.Budget): a hard
|
||||||
|
// token/seconds ceiling so a pathological PR can't drain the usage block.
|
||||||
|
// - the wrap-up nudge, re-expressed as an executus Critic (Ports.Critic): the
|
||||||
|
// steer that tells a step-hungry model to stop investigating and write its
|
||||||
|
// answer is now the critic seam, not a bespoke RunOption.
|
||||||
|
//
|
||||||
|
// Everything degrades to today's behavior when unconfigured: nil summarizer or a
|
||||||
|
// 0 context window disables compaction; nil budget disables the ceiling; the
|
||||||
|
// claude-code engine shells out and is unaffected by any of this.
|
||||||
|
//
|
||||||
|
// gadfly keeps its own model.go resolution (so GADFLY_ENDPOINT_<NAME> http
|
||||||
|
// aliases, failover chains, and the claude-code engine all survive) — the
|
||||||
|
// run.Executor is handed gadfly's already-resolved model via a trivial resolver,
|
||||||
|
// not routed through executus's tier table.
|
||||||
|
|
||||||
|
import (
|
||||||
|
"context"
|
||||||
|
"errors"
|
||||||
|
"fmt"
|
||||||
|
"os"
|
||||||
|
"strconv"
|
||||||
|
"strings"
|
||||||
|
"sync"
|
||||||
|
"sync/atomic"
|
||||||
|
"time"
|
||||||
|
|
||||||
|
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
|
||||||
|
|
||||||
|
"gitea.stevedudenhoeffer.com/steve/executus/compact"
|
||||||
|
"gitea.stevedudenhoeffer.com/steve/executus/model"
|
||||||
|
exrun "gitea.stevedudenhoeffer.com/steve/executus/run"
|
||||||
|
exectool "gitea.stevedudenhoeffer.com/steve/executus/tool"
|
||||||
|
)
|
||||||
|
|
||||||
|
const (
|
||||||
|
// defaultCompactRatio is the fraction of the model's context window at which
|
||||||
|
// compaction fires. It is deliberately LOWER than executus's own 0.7 default:
|
||||||
|
// on the large-PR burn the per-step transcript is the embedded diff (~17K) plus
|
||||||
|
// accumulating read_file results, which rarely reaches 0.7×262K≈183K — so a
|
||||||
|
// 0.7 threshold never bites. ~0.45×262K≈118K folds the runaway middle while a
|
||||||
|
// transcript is still well under the cap. Override with GADFLY_COMPACT_RATIO.
|
||||||
|
defaultCompactRatio = 0.45
|
||||||
|
// defaultCompactKeepRecent / defaultCompactSummaryWords mirror executus's own
|
||||||
|
// compactor defaults; surfaced as gadfly env knobs for tuning.
|
||||||
|
defaultCompactKeepRecent = 8
|
||||||
|
defaultCompactSummaryWords = 200
|
||||||
|
// contextTokenLookupTimeout bounds the one-shot /api/show call that resolves a
|
||||||
|
// cloud model's context window at executor-build time. Kept short so a slow or
|
||||||
|
// unreachable endpoint adds at most this to startup before degrading to
|
||||||
|
// no-compaction (rather than the provider cache's default 15s).
|
||||||
|
contextTokenLookupTimeout = 5 * time.Second
|
||||||
|
)
|
||||||
|
|
||||||
|
// runSeq mints a unique-per-process RunID suffix for each executor run so audit
|
||||||
|
// and the run kernel can tell one pass from another within a binary process.
|
||||||
|
var runSeq atomic.Uint64
|
||||||
|
|
||||||
|
// wrappedTool adapts an already-built majordomo llm.Tool (gadfly's sandboxed
|
||||||
|
// read_file/grep/get_diff/… closures over the repoFS) to executus's tool.Tool
|
||||||
|
// interface so the run kernel can build a toolbox from them by name. gadfly's
|
||||||
|
// tools need no caller/channel identity, so BuildLLM ignores the Invocation and
|
||||||
|
// returns the pre-built tool; Permission is the zero value (private, ungated).
|
||||||
|
type wrappedTool struct{ t llm.Tool }
|
||||||
|
|
||||||
|
func (w wrappedTool) Name() string { return w.t.Name }
|
||||||
|
func (w wrappedTool) Description() string { return w.t.Description }
|
||||||
|
func (w wrappedTool) Permission() exectool.Permission { return exectool.Permission{} }
|
||||||
|
func (w wrappedTool) BuildLLM(_ exectool.Invocation) llm.Tool { return w.t }
|
||||||
|
|
||||||
|
// gadflyToolRegistry registers the repo's read-only tools (plus the optional
|
||||||
|
// delegate_investigation worker tool) in a fresh executus tool.Registry and
|
||||||
|
// returns it along with the tool names for RunnableAgent.LowLevelTools.
|
||||||
|
func gadflyToolRegistry(fs *repoFS) (exectool.Registry, []string, error) {
|
||||||
|
reg := exectool.NewRegistry()
|
||||||
|
tools := fs.allTools()
|
||||||
|
names := make([]string, 0, len(tools))
|
||||||
|
for _, t := range tools {
|
||||||
|
if err := reg.Register(wrappedTool{t: t}); err != nil {
|
||||||
|
return nil, nil, fmt.Errorf("register tool %q: %w", t.Name, err)
|
||||||
|
}
|
||||||
|
names = append(names, t.Name)
|
||||||
|
}
|
||||||
|
return reg, names, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// gadflyBudget is gadfly's per-PR spend ceiling, satisfying run.Ports.Budget.
|
||||||
|
// It gates a run BEFORE it makes any model call (Check) once the process has
|
||||||
|
// spent its token or wall-clock allowance on this PR. Tokens are fed in
|
||||||
|
// out-of-band via addUsage (the Budget interface's Commit only carries seconds);
|
||||||
|
// the engine calls addUsage after each pass with run.Result.Usage. A nil
|
||||||
|
// *gadflyBudget is never installed — caps of 0 mean "unlimited", so the port is
|
||||||
|
// only wired when at least one cap is set.
|
||||||
|
//
|
||||||
|
// The guard is PASS-granular: Check runs before each pass, so it stops the NEXT
|
||||||
|
// pass once the budget is spent but cannot abort a single runaway pass mid-flight.
|
||||||
|
// The swarm-wide GADFLY_PR_BUDGET_SECS wall-clock backstop (entrypoint.sh) is what
|
||||||
|
// bounds a mid-pass runaway.
|
||||||
|
type gadflyBudget struct {
|
||||||
|
mu sync.Mutex
|
||||||
|
maxTokens int64
|
||||||
|
maxSeconds float64
|
||||||
|
tokens int64
|
||||||
|
seconds float64
|
||||||
|
}
|
||||||
|
|
||||||
|
// newPRBudget builds the per-PR budget from env, or nil when neither cap is set
|
||||||
|
// (the default — the swarm-wide ceiling lives in entrypoint.sh; this is the
|
||||||
|
// per-process belt to its suspenders).
|
||||||
|
func newPRBudget() *gadflyBudget {
|
||||||
|
toks := envInt("GADFLY_PR_TOKEN_BUDGET", 0)
|
||||||
|
secs := envInt("GADFLY_PR_TIME_BUDGET_SECS", 0)
|
||||||
|
if toks <= 0 && secs <= 0 {
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
return &gadflyBudget{maxTokens: int64(toks), maxSeconds: float64(secs)}
|
||||||
|
}
|
||||||
|
|
||||||
|
func (b *gadflyBudget) Check(_ context.Context, _ string) error {
|
||||||
|
if b == nil {
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
b.mu.Lock()
|
||||||
|
defer b.mu.Unlock()
|
||||||
|
if b.maxTokens > 0 && b.tokens >= b.maxTokens {
|
||||||
|
return fmt.Errorf("gadfly: per-PR token budget exhausted (%d/%d)", b.tokens, b.maxTokens)
|
||||||
|
}
|
||||||
|
if b.maxSeconds > 0 && b.seconds >= b.maxSeconds {
|
||||||
|
return fmt.Errorf("gadfly: per-PR time budget exhausted (%.0f/%.0fs)", b.seconds, b.maxSeconds)
|
||||||
|
}
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func (b *gadflyBudget) Commit(_ context.Context, _ string, runtimeSeconds float64) {
|
||||||
|
if b == nil {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
b.mu.Lock()
|
||||||
|
defer b.mu.Unlock()
|
||||||
|
b.seconds += runtimeSeconds
|
||||||
|
}
|
||||||
|
|
||||||
|
// addUsage records a finished pass's token spend toward the budget. Safe on nil.
|
||||||
|
func (b *gadflyBudget) addUsage(u llm.Usage) {
|
||||||
|
if b == nil {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
b.mu.Lock()
|
||||||
|
defer b.mu.Unlock()
|
||||||
|
b.tokens += int64(u.InputTokens) + int64(u.OutputTokens)
|
||||||
|
}
|
||||||
|
|
||||||
|
// wrapUpCritic re-expresses gadfly's wrap-up nudge as an executus run.Critic:
|
||||||
|
// once a run comes within wrapUpReserve steps of its cap, Steer() injects the
|
||||||
|
// "stop calling tools and write your final answer" message so a thorough model
|
||||||
|
// spends its last steps finalizing instead of hard-failing empty. It sets no
|
||||||
|
// hard deadline (Deadline()==zero) and never raises the step ceiling
|
||||||
|
// (MaxSteps()==0, defer to the run's MaxIterations) — it is purely the nudge.
|
||||||
|
type wrapUpCritic struct{ reserve int }
|
||||||
|
|
||||||
|
func (c *wrapUpCritic) Monitor(_ context.Context, info exrun.RunInfo, _ time.Duration) exrun.CriticHandle {
|
||||||
|
return &wrapUpHandle{maxSteps: info.MaxIterations, reserve: c.reserve}
|
||||||
|
}
|
||||||
|
|
||||||
|
type wrapUpHandle struct {
|
||||||
|
mu sync.Mutex
|
||||||
|
maxSteps int
|
||||||
|
reserve int
|
||||||
|
done int // steps completed so far
|
||||||
|
nudged bool
|
||||||
|
}
|
||||||
|
|
||||||
|
func (h *wrapUpHandle) RecordStep(iter int, _ *llm.Response) {
|
||||||
|
h.mu.Lock()
|
||||||
|
h.done = iter + 1
|
||||||
|
h.mu.Unlock()
|
||||||
|
}
|
||||||
|
func (h *wrapUpHandle) RecordToolStart(string, string) {}
|
||||||
|
func (h *wrapUpHandle) Steer() []llm.Message {
|
||||||
|
h.mu.Lock()
|
||||||
|
defer h.mu.Unlock()
|
||||||
|
at := h.maxSteps - h.reserve
|
||||||
|
if at < 1 {
|
||||||
|
at = 1
|
||||||
|
}
|
||||||
|
if !h.nudged && h.maxSteps > 0 && h.done >= at {
|
||||||
|
h.nudged = true
|
||||||
|
return []llm.Message{llm.UserText(wrapUpInstruction)}
|
||||||
|
}
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
func (h *wrapUpHandle) Deadline() time.Time { return time.Time{} }
|
||||||
|
func (h *wrapUpHandle) MaxSteps() int { return 0 }
|
||||||
|
func (h *wrapUpHandle) KillCause() error { return nil }
|
||||||
|
func (h *wrapUpHandle) Stop() {}
|
||||||
|
|
||||||
|
// reviewExecutor bundles a run.Executor with the per-run wiring the engine needs
|
||||||
|
// for each pass (the tool names to expose, the model spec to report as the tier,
|
||||||
|
// the per-PR caller id, and the budget to feed token usage into).
|
||||||
|
type reviewExecutor struct {
|
||||||
|
ex *exrun.Executor
|
||||||
|
toolNames []string
|
||||||
|
modelSpec string
|
||||||
|
callerID string
|
||||||
|
budget *gadflyBudget
|
||||||
|
}
|
||||||
|
|
||||||
|
// newReviewExecutor builds the run.Executor for the in-process majordomo review
|
||||||
|
// path. mdl is gadfly's already-resolved review model; summarizer is the cheap
|
||||||
|
// model the compactor uses (nil disables compaction). Compaction also needs the
|
||||||
|
// model's context window (resolved once here, not per pass); a 0 window likewise
|
||||||
|
// disables it. The budget (may be nil) becomes the run.Ports.Budget gate.
|
||||||
|
func newReviewExecutor(fs *repoFS, mdl, summarizer llm.Model, modelSpec string, budget *gadflyBudget) (*reviewExecutor, error) {
|
||||||
|
reg, names, err := gadflyToolRegistry(fs)
|
||||||
|
if err != nil {
|
||||||
|
return nil, err
|
||||||
|
}
|
||||||
|
|
||||||
|
// gadfly resolves exactly one model per process; the run kernel's resolver
|
||||||
|
// just hands that model back regardless of the tier string it is asked for.
|
||||||
|
modelsResolver := func(ctx context.Context, _ string) (context.Context, llm.Model, error) {
|
||||||
|
return ctx, mdl, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
var compactor compact.CompactorFactory
|
||||||
|
var ctxTokens func(string) int
|
||||||
|
if summarizer != nil && compactionEnabled() {
|
||||||
|
if window := resolveContextTokens(modelSpec); window > 0 {
|
||||||
|
sumResolver := func(ctx context.Context, _ string) (context.Context, llm.Model, error) {
|
||||||
|
return ctx, summarizer, nil
|
||||||
|
}
|
||||||
|
compactor = compact.NewCompactor(compact.CompactorConfig{
|
||||||
|
Models: sumResolver,
|
||||||
|
KeepRecent: envInt("GADFLY_COMPACT_KEEP_RECENT", defaultCompactKeepRecent),
|
||||||
|
SummaryWordCap: envInt("GADFLY_COMPACT_SUMMARY_WORDS", defaultCompactSummaryWords),
|
||||||
|
})
|
||||||
|
ctxTokens = func(string) int { return window } // memoized: one window per process
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
var ports exrun.Ports
|
||||||
|
ports.Critic = &wrapUpCritic{reserve: wrapUpReserve()}
|
||||||
|
if budget != nil {
|
||||||
|
ports.Budget = budget
|
||||||
|
}
|
||||||
|
|
||||||
|
cfg := exrun.Config{
|
||||||
|
Registry: reg,
|
||||||
|
Models: modelsResolver,
|
||||||
|
Compactor: compactor,
|
||||||
|
ContextTokens: ctxTokens,
|
||||||
|
Defaults: exrun.Defaults{
|
||||||
|
// MaxIterations/MaxRuntime are intentionally omitted: every pass sets its
|
||||||
|
// own per-run cap on the RunnableAgent below (the review and recheck caps
|
||||||
|
// differ), so a Defaults value here would always be overridden — dead. This
|
||||||
|
// leaves only the cross-pass guards + the compaction ratio.
|
||||||
|
MaxConsecutiveToolErrors: 4,
|
||||||
|
MaxSameToolCallRepeats: 4,
|
||||||
|
CompactionThresholdRatio: compactRatio(),
|
||||||
|
FallbackTier: modelSpec,
|
||||||
|
},
|
||||||
|
Ports: ports,
|
||||||
|
}
|
||||||
|
return &reviewExecutor{
|
||||||
|
ex: exrun.New(cfg),
|
||||||
|
toolNames: names,
|
||||||
|
modelSpec: modelSpec,
|
||||||
|
callerID: prCallerID(),
|
||||||
|
budget: budget,
|
||||||
|
}, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// run executes one agent pass (review or recheck) through the run kernel and
|
||||||
|
// returns the model's final text. An empty answer with no error is reported as
|
||||||
|
// an error so the caller (reviewWithSpecialist) renders the advisory "reviewer
|
||||||
|
// failed to complete" notice rather than a blank section.
|
||||||
|
func (r *reviewExecutor) run(ctx context.Context, system, task string, maxSteps int) (string, error) {
|
||||||
|
res := r.ex.Run(ctx, exrun.RunnableAgent{
|
||||||
|
Name: "gadfly-review",
|
||||||
|
SystemPrompt: system,
|
||||||
|
ModelTier: r.modelSpec,
|
||||||
|
MaxIterations: maxSteps,
|
||||||
|
MaxRuntime: reviewTimeout(),
|
||||||
|
LowLevelTools: r.toolNames,
|
||||||
|
Critic: exrun.CriticConfig{Enabled: true},
|
||||||
|
}, exectool.Invocation{
|
||||||
|
RunID: fmt.Sprintf("gadfly-%d", runSeq.Add(1)),
|
||||||
|
CallerID: r.callerID,
|
||||||
|
}, task)
|
||||||
|
|
||||||
|
// Feed token spend toward the per-PR budget out-of-band (Commit carries only
|
||||||
|
// seconds; the executor already called it). Safe on a nil budget.
|
||||||
|
r.budget.addUsage(res.Usage)
|
||||||
|
|
||||||
|
if res.Err != nil {
|
||||||
|
return "", res.Err
|
||||||
|
}
|
||||||
|
if out := strings.TrimSpace(res.Output); out != "" {
|
||||||
|
return out, nil
|
||||||
|
}
|
||||||
|
return "", errors.New("agent produced no output")
|
||||||
|
}
|
||||||
|
|
||||||
|
// prCallerID is the budget/audit caller key: the repo + PR, so a budget keyed on
|
||||||
|
// it is naturally per-PR. Falls back to "local" for an out-of-CI run.
|
||||||
|
func prCallerID() string {
|
||||||
|
repo := strings.TrimSpace(os.Getenv("GADFLY_REPO"))
|
||||||
|
pr := strings.TrimSpace(os.Getenv("GADFLY_PR"))
|
||||||
|
if repo == "" && pr == "" {
|
||||||
|
return "local"
|
||||||
|
}
|
||||||
|
return repo + "#" + pr
|
||||||
|
}
|
||||||
|
|
||||||
|
// compactionEnabled reports whether context compaction should be wired. On
|
||||||
|
// unless GADFLY_COMPACT is explicitly falsey.
|
||||||
|
func compactionEnabled() bool { return envBool("GADFLY_COMPACT", true) }
|
||||||
|
|
||||||
|
// compactRatio is the compaction threshold as a fraction of the model context
|
||||||
|
// window (GADFLY_COMPACT_RATIO), clamped to (0,1]; default defaultCompactRatio.
|
||||||
|
func compactRatio() float64 {
|
||||||
|
v := strings.TrimSpace(os.Getenv("GADFLY_COMPACT_RATIO"))
|
||||||
|
if v == "" {
|
||||||
|
return defaultCompactRatio
|
||||||
|
}
|
||||||
|
f, err := strconv.ParseFloat(v, 64)
|
||||||
|
if err != nil || f <= 0 || f > 1 {
|
||||||
|
return defaultCompactRatio
|
||||||
|
}
|
||||||
|
return f
|
||||||
|
}
|
||||||
|
|
||||||
|
// resolveContextTokens returns the review model's context window in tokens, used
|
||||||
|
// to set the compaction threshold. GADFLY_MODEL_CONTEXT_TOKENS overrides it
|
||||||
|
// (needed for custom/self-hosted endpoints executus can't introspect); otherwise
|
||||||
|
// it asks executus/model, which knows the static catalog and can fetch an
|
||||||
|
// Ollama Cloud model's limit via /api/show (one call, at executor-build time).
|
||||||
|
// Returns 0 — disabling compaction — for an unknown model, mirroring executus's
|
||||||
|
// "unknown ⇒ don't budget" contract.
|
||||||
|
func resolveContextTokens(modelSpec string) int {
|
||||||
|
if v := envInt("GADFLY_MODEL_CONTEXT_TOKENS", 0); v > 0 {
|
||||||
|
return v
|
||||||
|
}
|
||||||
|
key := strings.TrimSpace(os.Getenv("GADFLY_API_KEY"))
|
||||||
|
if key == "" {
|
||||||
|
key = strings.TrimSpace(os.Getenv("OLLAMA_API_KEY"))
|
||||||
|
}
|
||||||
|
cache := model.NewCloudOllamaLimitCache("", key, nil)
|
||||||
|
ctx, cancel := context.WithTimeout(context.Background(), contextTokenLookupTimeout)
|
||||||
|
defer cancel()
|
||||||
|
if n, ok := model.MaxContextTokensResolving(ctx, modelSpec, cache); ok {
|
||||||
|
return n
|
||||||
|
}
|
||||||
|
// Unknown model or a failed lookup (e.g. no key / unreachable endpoint): don't
|
||||||
|
// guess — compaction is disabled. Log it so a misconfiguration is debuggable
|
||||||
|
// rather than silently dropping the protection. Set GADFLY_MODEL_CONTEXT_TOKENS
|
||||||
|
// to force a window for an endpoint executus can't introspect.
|
||||||
|
fmt.Fprintf(os.Stderr, "gadfly: no context window resolved for %q; compaction disabled (set GADFLY_MODEL_CONTEXT_TOKENS to enable it)\n", modelSpec)
|
||||||
|
return 0
|
||||||
|
}
|
||||||
@@ -0,0 +1,140 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
import (
|
||||||
|
"context"
|
||||||
|
"testing"
|
||||||
|
|
||||||
|
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
|
||||||
|
)
|
||||||
|
|
||||||
|
func TestGadflyBudget(t *testing.T) {
|
||||||
|
ctx := context.Background()
|
||||||
|
|
||||||
|
// A nil budget never blocks and never panics.
|
||||||
|
var nilB *gadflyBudget
|
||||||
|
if err := nilB.Check(ctx, "pr"); err != nil {
|
||||||
|
t.Errorf("nil budget Check should be nil, got %v", err)
|
||||||
|
}
|
||||||
|
nilB.Commit(ctx, "pr", 10)
|
||||||
|
nilB.addUsage(llm.Usage{InputTokens: 5})
|
||||||
|
|
||||||
|
// Token ceiling: passes until usage crosses it.
|
||||||
|
b := &gadflyBudget{maxTokens: 100}
|
||||||
|
if err := b.Check(ctx, "pr"); err != nil {
|
||||||
|
t.Fatalf("fresh budget should pass, got %v", err)
|
||||||
|
}
|
||||||
|
b.addUsage(llm.Usage{InputTokens: 60, OutputTokens: 50}) // 110 >= 100
|
||||||
|
if err := b.Check(ctx, "pr"); err == nil {
|
||||||
|
t.Error("budget over the token cap should reject the next run")
|
||||||
|
}
|
||||||
|
|
||||||
|
// Seconds ceiling, accumulated via Commit.
|
||||||
|
s := &gadflyBudget{maxSeconds: 30}
|
||||||
|
s.Commit(ctx, "pr", 31)
|
||||||
|
if err := s.Check(ctx, "pr"); err == nil {
|
||||||
|
t.Error("budget over the time cap should reject the next run")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestNewPRBudget(t *testing.T) {
|
||||||
|
t.Setenv("GADFLY_PR_TOKEN_BUDGET", "")
|
||||||
|
t.Setenv("GADFLY_PR_TIME_BUDGET_SECS", "")
|
||||||
|
if newPRBudget() != nil {
|
||||||
|
t.Error("no caps set should yield a nil (disabled) budget")
|
||||||
|
}
|
||||||
|
t.Setenv("GADFLY_PR_TOKEN_BUDGET", "1000")
|
||||||
|
if b := newPRBudget(); b == nil || b.maxTokens != 1000 {
|
||||||
|
t.Errorf("token cap should build a budget with maxTokens=1000, got %+v", b)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestCompactRatio(t *testing.T) {
|
||||||
|
t.Setenv("GADFLY_COMPACT_RATIO", "")
|
||||||
|
if got := compactRatio(); got != defaultCompactRatio {
|
||||||
|
t.Errorf("default ratio = %v, want %v", got, defaultCompactRatio)
|
||||||
|
}
|
||||||
|
t.Setenv("GADFLY_COMPACT_RATIO", "0.6")
|
||||||
|
if got := compactRatio(); got != 0.6 {
|
||||||
|
t.Errorf("ratio override = %v, want 0.6", got)
|
||||||
|
}
|
||||||
|
for _, bad := range []string{"0", "-1", "2", "nope"} {
|
||||||
|
t.Setenv("GADFLY_COMPACT_RATIO", bad)
|
||||||
|
if got := compactRatio(); got != defaultCompactRatio {
|
||||||
|
t.Errorf("invalid ratio %q should fall back to the default, got %v", bad, got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestResolveContextTokensOverride(t *testing.T) {
|
||||||
|
// The explicit override short-circuits any model introspection (no network).
|
||||||
|
t.Setenv("GADFLY_MODEL_CONTEXT_TOKENS", "123456")
|
||||||
|
if got := resolveContextTokens("anything"); got != 123456 {
|
||||||
|
t.Errorf("explicit context-token override = %d, want 123456", got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestPRCallerID(t *testing.T) {
|
||||||
|
t.Setenv("GADFLY_REPO", "")
|
||||||
|
t.Setenv("GADFLY_PR", "")
|
||||||
|
if got := prCallerID(); got != "local" {
|
||||||
|
t.Errorf("no repo/PR should be %q, got %q", "local", got)
|
||||||
|
}
|
||||||
|
t.Setenv("GADFLY_REPO", "steve/mort")
|
||||||
|
t.Setenv("GADFLY_PR", "1367")
|
||||||
|
if got := prCallerID(); got != "steve/mort#1367" {
|
||||||
|
t.Errorf("callerID = %q, want steve/mort#1367", got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestCompactionEnabled(t *testing.T) {
|
||||||
|
for _, v := range []string{"", "1", "true", "yes"} {
|
||||||
|
t.Setenv("GADFLY_COMPACT", v)
|
||||||
|
if !compactionEnabled() {
|
||||||
|
t.Errorf("GADFLY_COMPACT=%q should be enabled", v)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
for _, v := range []string{"0", "false", "no", "off"} {
|
||||||
|
t.Setenv("GADFLY_COMPACT", v)
|
||||||
|
if compactionEnabled() {
|
||||||
|
t.Errorf("GADFLY_COMPACT=%q should be disabled", v)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestGadflyToolRegistry(t *testing.T) {
|
||||||
|
fs, err := newRepoFS(t.TempDir(), "diff")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
_, names, err := gadflyToolRegistry(fs)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("gadflyToolRegistry: %v", err)
|
||||||
|
}
|
||||||
|
want := map[string]bool{"read_file": true, "list_dir": true, "grep": true, "find_files": true, "get_diff": true}
|
||||||
|
has := func(n string) bool {
|
||||||
|
for _, g := range names {
|
||||||
|
if g == n {
|
||||||
|
return true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return false
|
||||||
|
}
|
||||||
|
for n := range want {
|
||||||
|
if !has(n) {
|
||||||
|
t.Errorf("registry missing tool %q (got %v)", n, names)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if has("delegate_investigation") {
|
||||||
|
t.Error("delegate_investigation must be absent without a worker model")
|
||||||
|
}
|
||||||
|
|
||||||
|
// With a worker model the delegate tool is registered too.
|
||||||
|
fs.worker = fakeModel(t, "x")
|
||||||
|
_, names, err = gadflyToolRegistry(fs)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("gadflyToolRegistry with worker: %v", err)
|
||||||
|
}
|
||||||
|
if !has("delegate_investigation") {
|
||||||
|
t.Errorf("delegate_investigation should be registered with a worker model, got %v", names)
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,236 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
// Structured findings: the machine-readable contract between a lens's output and
|
||||||
|
// the telemetry/consolidation pipeline. Each lens is asked (see
|
||||||
|
// scripts/system-prompt.txt) to append a fenced ```gadfly-findings code block
|
||||||
|
// holding a JSON array of its findings. Parsing that exact block is far more
|
||||||
|
// reliable than scraping prose with a path:line regex (emit.go's heuristic),
|
||||||
|
// and it carries PER-FINDING severity + confidence the prose verdict can't.
|
||||||
|
//
|
||||||
|
// Everything here degrades gracefully: a missing, unterminated, or malformed
|
||||||
|
// block makes extractStructuredFindings return ok=false (and yield no findings),
|
||||||
|
// so the caller falls back to the heuristic scrape — a weak model that ignores
|
||||||
|
// the contract still contributes findings, exactly as before.
|
||||||
|
|
||||||
|
import (
|
||||||
|
"encoding/json"
|
||||||
|
"strconv"
|
||||||
|
"strings"
|
||||||
|
)
|
||||||
|
|
||||||
|
// structuredFinding mirrors one element of the ```gadfly-findings JSON array.
|
||||||
|
// Line is json.Number so we tolerate both 123 and "123" from less-precise models.
|
||||||
|
type structuredFinding struct {
|
||||||
|
File string `json:"file"`
|
||||||
|
Line json.Number `json:"line"`
|
||||||
|
Severity string `json:"severity"`
|
||||||
|
Confidence string `json:"confidence"`
|
||||||
|
Title string `json:"title"`
|
||||||
|
Detail string `json:"detail"` // optional; the prose paragraph is used when absent
|
||||||
|
}
|
||||||
|
|
||||||
|
// findingsFence is the info-string that tags the machine-readable block.
|
||||||
|
const findingsFence = "gadfly-findings"
|
||||||
|
|
||||||
|
// extractStructuredFindings parses the ```gadfly-findings JSON block out of a
|
||||||
|
// lens's markdown. It returns the findings and ok=true when a TERMINATED block is
|
||||||
|
// present AND parses as a JSON array (an empty array is valid "nothing found" —
|
||||||
|
// ok is still true). A missing, unterminated, or unparseable block returns
|
||||||
|
// ok=false so the caller falls back to the heuristic scrape.
|
||||||
|
//
|
||||||
|
// Findings are deduped by file:line (keeping the first, matching parseFindings),
|
||||||
|
// findings with no usable file are dropped, and each title/detail is backfilled
|
||||||
|
// from the prose when the JSON omits it (best of both: exact location + the human
|
||||||
|
// context the model already wrote).
|
||||||
|
func extractStructuredFindings(out string) ([]finding, bool) {
|
||||||
|
lines := strings.Split(out, "\n")
|
||||||
|
start, end, ok := findingsSpan(lines)
|
||||||
|
if !ok {
|
||||||
|
return nil, false
|
||||||
|
}
|
||||||
|
var raw []structuredFinding
|
||||||
|
if err := json.Unmarshal([]byte(strings.Join(lines[start+1:end], "\n")), &raw); err != nil {
|
||||||
|
return nil, false
|
||||||
|
}
|
||||||
|
|
||||||
|
findings := make([]finding, 0, len(raw))
|
||||||
|
seen := map[string]bool{}
|
||||||
|
var prose map[string]string // built lazily — only if a finding lacks its own detail
|
||||||
|
for _, sf := range raw {
|
||||||
|
file := strings.TrimSpace(sf.File)
|
||||||
|
if file == "" {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
ln := 0
|
||||||
|
if n, err := strconv.Atoi(strings.TrimSpace(sf.Line.String())); err == nil && n > 0 {
|
||||||
|
ln = n
|
||||||
|
}
|
||||||
|
key := file + ":" + strconv.Itoa(ln)
|
||||||
|
if ln > 0 { // only dedupe concrete locations; unknown-line findings are kept
|
||||||
|
if seen[key] {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
seen[key] = true
|
||||||
|
}
|
||||||
|
detail := strings.TrimSpace(sf.Detail)
|
||||||
|
if detail == "" && ln > 0 {
|
||||||
|
if prose == nil {
|
||||||
|
prose = proseParagraphs(out)
|
||||||
|
}
|
||||||
|
detail = prose[key]
|
||||||
|
}
|
||||||
|
title := strings.TrimSpace(sf.Title)
|
||||||
|
if title == "" { // never empty: fall back to the prose detail, then the location
|
||||||
|
if detail != "" {
|
||||||
|
title = truncate(detail, 120)
|
||||||
|
} else if ln > 0 {
|
||||||
|
title = key
|
||||||
|
} else {
|
||||||
|
title = file
|
||||||
|
}
|
||||||
|
}
|
||||||
|
findings = append(findings, finding{
|
||||||
|
file: file,
|
||||||
|
line: ln,
|
||||||
|
title: title,
|
||||||
|
detail: truncate(detail, 500),
|
||||||
|
severity: normalizeSeverity(sf.Severity),
|
||||||
|
confidence: normalizeConfidence(sf.Confidence),
|
||||||
|
})
|
||||||
|
if len(findings) >= maxFindingsPerLens {
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return findings, true
|
||||||
|
}
|
||||||
|
|
||||||
|
// extractStructuredFindingsOrScrape returns a lens's findings, preferring the
|
||||||
|
// structured ```gadfly-findings block and falling back to the heuristic prose
|
||||||
|
// scrape when the block is absent, unterminated/malformed, OR parsed to zero
|
||||||
|
// usable findings (e.g. an empty [] emitted alongside real prose findings).
|
||||||
|
// Factored out of emit() so the fallback rule is unit-testable.
|
||||||
|
func extractStructuredFindingsOrScrape(r specialistResult) []finding {
|
||||||
|
if fs, _ := extractStructuredFindings(r.out); len(fs) > 0 {
|
||||||
|
return fs
|
||||||
|
}
|
||||||
|
return parseFindings(r.spec, r.out)
|
||||||
|
}
|
||||||
|
|
||||||
|
// stripFindingsBlock removes every TERMINATED ```gadfly-findings block from out so
|
||||||
|
// the machine-readable JSON never shows in the rendered comment. An UNTERMINATED
|
||||||
|
// fence is left in place — treating it as a block would swallow the rest of the
|
||||||
|
// comment (e.g. when a model's output was truncated mid-block). Trailing
|
||||||
|
// whitespace is trimmed.
|
||||||
|
func stripFindingsBlock(out string) string {
|
||||||
|
lines := strings.Split(out, "\n")
|
||||||
|
for {
|
||||||
|
start, end, ok := findingsSpan(lines)
|
||||||
|
if !ok {
|
||||||
|
break
|
||||||
|
}
|
||||||
|
lines = append(lines[:start:start], lines[end+1:]...)
|
||||||
|
}
|
||||||
|
return strings.TrimRight(strings.Join(lines, "\n"), "\n")
|
||||||
|
}
|
||||||
|
|
||||||
|
// findingsSpan returns the [start,end] inclusive line indices of the first
|
||||||
|
// TERMINATED ```gadfly-findings block in lines (start = the opening fence, end =
|
||||||
|
// the closing fence), or ok=false when there is none or it is unterminated.
|
||||||
|
// extract and strip share it so they always agree on what is (and isn't) a block.
|
||||||
|
func findingsSpan(lines []string) (start, end int, ok bool) {
|
||||||
|
start = -1
|
||||||
|
for i, ln := range lines {
|
||||||
|
if isFindingsOpen(ln) {
|
||||||
|
start = i
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if start < 0 {
|
||||||
|
return 0, 0, false
|
||||||
|
}
|
||||||
|
for j := start + 1; j < len(lines); j++ {
|
||||||
|
if isFenceClose(lines[j]) {
|
||||||
|
return start, j, true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return 0, 0, false // unterminated
|
||||||
|
}
|
||||||
|
|
||||||
|
// fenceInfo returns the info-string (text after the backticks) of a code-fence
|
||||||
|
// line and whether the line opens/closes a fence at all. A bare ``` yields ("",
|
||||||
|
// true); ```gadfly-findings yields ("gadfly-findings", true).
|
||||||
|
func fenceInfo(line string) (string, bool) {
|
||||||
|
t := strings.TrimSpace(line)
|
||||||
|
if !strings.HasPrefix(t, "```") {
|
||||||
|
return "", false
|
||||||
|
}
|
||||||
|
return strings.TrimSpace(strings.TrimLeft(t, "`")), true
|
||||||
|
}
|
||||||
|
|
||||||
|
// isFindingsOpen reports whether line opens a ```gadfly-findings block, matching
|
||||||
|
// the info-string EXACTLY (not as a substring) so a fence like ```not-findings
|
||||||
|
// can't masquerade as ours.
|
||||||
|
func isFindingsOpen(line string) bool {
|
||||||
|
info, ok := fenceInfo(line)
|
||||||
|
return ok && info == findingsFence
|
||||||
|
}
|
||||||
|
|
||||||
|
// isFenceClose reports whether line is a bare closing fence (``` with no info).
|
||||||
|
func isFenceClose(line string) bool {
|
||||||
|
info, ok := fenceInfo(line)
|
||||||
|
return ok && info == ""
|
||||||
|
}
|
||||||
|
|
||||||
|
// proseParagraphs maps "file:line" -> the prose paragraph that first references
|
||||||
|
// it, so a structured finding without its own detail can borrow the human
|
||||||
|
// context the model already wrote. Built from the markdown OUTSIDE the findings
|
||||||
|
// block (the block is JSON, not prose).
|
||||||
|
func proseParagraphs(out string) map[string]string {
|
||||||
|
prose := stripFindingsBlock(out)
|
||||||
|
lines := strings.Split(prose, "\n")
|
||||||
|
m := map[string]string{}
|
||||||
|
for _, loc := range pathLineRe.FindAllStringSubmatchIndex(prose, -1) {
|
||||||
|
key := prose[loc[2]:loc[3]] + ":" + prose[loc[4]:loc[5]]
|
||||||
|
if _, dup := m[key]; dup {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
li := strings.Count(prose[:loc[0]], "\n")
|
||||||
|
m[key] = paragraphAt(lines, li)
|
||||||
|
}
|
||||||
|
return m
|
||||||
|
}
|
||||||
|
|
||||||
|
// normalizeSeverity maps a model's severity word onto the canonical set
|
||||||
|
// (critical/high/medium/small/trivial), accepting common synonyms. An
|
||||||
|
// unrecognized value is returned lowercased so the store still sees the raw word.
|
||||||
|
func normalizeSeverity(s string) string {
|
||||||
|
switch t := strings.ToLower(strings.TrimSpace(s)); t {
|
||||||
|
case "critical", "crit", "blocker", "blocking":
|
||||||
|
return "critical"
|
||||||
|
case "high", "major", "severe":
|
||||||
|
return "high"
|
||||||
|
case "medium", "moderate":
|
||||||
|
return "medium"
|
||||||
|
case "small", "low", "minor":
|
||||||
|
return "small"
|
||||||
|
case "trivial", "nit", "nitpick", "info", "informational", "style", "cosmetic":
|
||||||
|
return "trivial"
|
||||||
|
default:
|
||||||
|
return t
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// normalizeConfidence maps a model's confidence word onto high/medium/low,
|
||||||
|
// accepting common synonyms; an unrecognized value is returned lowercased.
|
||||||
|
func normalizeConfidence(s string) string {
|
||||||
|
switch t := strings.ToLower(strings.TrimSpace(s)); t {
|
||||||
|
case "high", "certain", "confirmed", "verified":
|
||||||
|
return "high"
|
||||||
|
case "medium", "med", "moderate":
|
||||||
|
return "medium"
|
||||||
|
case "low", "unsure", "tentative", "unverified", "speculative":
|
||||||
|
return "low"
|
||||||
|
default:
|
||||||
|
return t
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,146 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
import (
|
||||||
|
"strings"
|
||||||
|
"testing"
|
||||||
|
)
|
||||||
|
|
||||||
|
const sampleReview = "**Blocking issues found**\n\n" +
|
||||||
|
"- **Unauthenticated endpoint** — `model.go:184` leaks PR content to a third party.\n" +
|
||||||
|
"- A nit about naming in `util.go:12`.\n\n" +
|
||||||
|
"```gadfly-findings\n" +
|
||||||
|
"[\n" +
|
||||||
|
" {\"file\": \"model.go\", \"line\": 184, \"severity\": \"high\", \"confidence\": \"high\", \"title\": \"Unauthenticated endpoint\"},\n" +
|
||||||
|
" {\"file\": \"util.go\", \"line\": 12, \"severity\": \"nit\", \"confidence\": \"medium\", \"title\": \"naming\"}\n" +
|
||||||
|
"]\n" +
|
||||||
|
"```\n"
|
||||||
|
|
||||||
|
func TestExtractStructuredFindings(t *testing.T) {
|
||||||
|
fs, ok := extractStructuredFindings(sampleReview)
|
||||||
|
if !ok {
|
||||||
|
t.Fatal("expected ok=true for a well-formed block")
|
||||||
|
}
|
||||||
|
if len(fs) != 2 {
|
||||||
|
t.Fatalf("want 2 findings, got %d", len(fs))
|
||||||
|
}
|
||||||
|
if fs[0].file != "model.go" || fs[0].line != 184 || fs[0].severity != "high" || fs[0].confidence != "high" {
|
||||||
|
t.Errorf("finding[0] mismatch: %+v", fs[0])
|
||||||
|
}
|
||||||
|
// "nit" must normalize to the canonical "trivial".
|
||||||
|
if fs[1].severity != "trivial" {
|
||||||
|
t.Errorf("want severity normalized to trivial, got %q", fs[1].severity)
|
||||||
|
}
|
||||||
|
// Detail is borrowed from the prose paragraph referencing the same file:line.
|
||||||
|
if fs[0].detail == "" {
|
||||||
|
t.Error("expected detail borrowed from prose, got empty")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestExtractStructuredFindingsEmptyArray(t *testing.T) {
|
||||||
|
out := "No material issues found\n\n```gadfly-findings\n[]\n```\n"
|
||||||
|
fs, ok := extractStructuredFindings(out)
|
||||||
|
if !ok {
|
||||||
|
t.Fatal("an empty array is a valid block; want ok=true")
|
||||||
|
}
|
||||||
|
if len(fs) != 0 {
|
||||||
|
t.Fatalf("want 0 findings, got %d", len(fs))
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestExtractStructuredFindingsFallback(t *testing.T) {
|
||||||
|
// No block at all -> ok=false so the caller uses the heuristic scrape.
|
||||||
|
if _, ok := extractStructuredFindings("Minor issues\n\n- something at `x.go:1`\n"); ok {
|
||||||
|
t.Error("want ok=false when there is no block")
|
||||||
|
}
|
||||||
|
// Malformed JSON -> ok=false (graceful fallback).
|
||||||
|
bad := "Minor issues\n\n```gadfly-findings\n{not json}\n```\n"
|
||||||
|
if _, ok := extractStructuredFindings(bad); ok {
|
||||||
|
t.Error("want ok=false for malformed JSON")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestExtractStructuredFindingsStringLine(t *testing.T) {
|
||||||
|
// Tolerate a quoted line number from a less-precise model.
|
||||||
|
out := "Minor issues\n\n```gadfly-findings\n[{\"file\":\"a.go\",\"line\":\"42\",\"severity\":\"medium\",\"title\":\"x\"}]\n```\n"
|
||||||
|
fs, ok := extractStructuredFindings(out)
|
||||||
|
if !ok || len(fs) != 1 || fs[0].line != 42 {
|
||||||
|
t.Fatalf("want one finding at line 42, got ok=%v %+v", ok, fs)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestStripFindingsBlock(t *testing.T) {
|
||||||
|
stripped := stripFindingsBlock(sampleReview)
|
||||||
|
if strings.Contains(stripped, findingsFence) {
|
||||||
|
t.Errorf("block not stripped: %q", stripped)
|
||||||
|
}
|
||||||
|
// The prose findings must survive.
|
||||||
|
if !strings.Contains(stripped, "Unauthenticated endpoint") || !strings.Contains(stripped, "util.go:12") {
|
||||||
|
t.Errorf("prose lost during strip: %q", stripped)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestStripFindingsBlockUnterminated(t *testing.T) {
|
||||||
|
// A truncated, unterminated block must NOT swallow the prose before it.
|
||||||
|
out := "Minor issues\n\n- real finding at `x.go:1`\n\n```gadfly-findings\n[{\"file\":\"x.go\""
|
||||||
|
got := stripFindingsBlock(out)
|
||||||
|
if !strings.Contains(got, "real finding at `x.go:1`") {
|
||||||
|
t.Errorf("unterminated block swallowed the prose: %q", got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestStripFindingsBlockNoBlock(t *testing.T) {
|
||||||
|
in := "Minor issues\n\n- finding at `x.go:9`"
|
||||||
|
if out := stripFindingsBlock(in); out != in {
|
||||||
|
t.Errorf("strip changed block-free text: %q != %q", out, in)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestNormalizeSeverity(t *testing.T) {
|
||||||
|
cases := map[string]string{
|
||||||
|
"Critical": "critical", "blocker": "critical",
|
||||||
|
"major": "high", "HIGH": "high",
|
||||||
|
"moderate": "medium",
|
||||||
|
"minor": "small", "low": "small", // "minor" and "low" both map to small (consistently)
|
||||||
|
"nit": "trivial", "Style": "trivial",
|
||||||
|
"weird": "weird", // unknown passes through, lowercased
|
||||||
|
}
|
||||||
|
for in, want := range cases {
|
||||||
|
if got := normalizeSeverity(in); got != want {
|
||||||
|
t.Errorf("normalizeSeverity(%q) = %q, want %q", in, got, want)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestNormalizeConfidence(t *testing.T) {
|
||||||
|
cases := map[string]string{
|
||||||
|
"High": "high", "confirmed": "high",
|
||||||
|
"MEDIUM": "medium", "moderate": "medium",
|
||||||
|
"low": "low", "unverified": "low",
|
||||||
|
"hunch": "hunch", // unknown passes through, lowercased
|
||||||
|
}
|
||||||
|
for in, want := range cases {
|
||||||
|
if got := normalizeConfidence(in); got != want {
|
||||||
|
t.Errorf("normalizeConfidence(%q) = %q, want %q", in, got, want)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestExtractStructuredFindingsFallbackOnEmpty(t *testing.T) {
|
||||||
|
// A non-clean lens that emitted an empty [] but listed prose findings must
|
||||||
|
// fall through to the heuristic scrape, not silently drop everything.
|
||||||
|
out := "Minor issues\n\n- bug at `pkg/a.go:7`\n\n```gadfly-findings\n[]\n```\n"
|
||||||
|
r := specialistResult{spec: Specialist{Name: "correctness"}, out: out, verdict: verdictMinor}
|
||||||
|
fs := extractStructuredFindingsOrScrape(r)
|
||||||
|
if len(fs) == 0 {
|
||||||
|
t.Fatal("empty [] must fall back to the heuristic scrape, got no findings")
|
||||||
|
}
|
||||||
|
if fs[0].file != "pkg/a.go" || fs[0].line != 7 {
|
||||||
|
t.Errorf("heuristic fallback wrong: %+v", fs[0])
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestVerdictSeverity(t *testing.T) {
|
||||||
|
if verdictBlocking.severity() != "high" || verdictMinor.severity() != "small" || verdictUnknown.severity() != "trivial" {
|
||||||
|
t.Error("verdict.severity mapping changed unexpectedly")
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -104,7 +104,7 @@ func TestRunSpecialists_FansOut(t *testing.T) {
|
|||||||
}
|
}
|
||||||
specs := threeLenses()
|
specs := threeLenses()
|
||||||
|
|
||||||
results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff")
|
results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff")
|
||||||
|
|
||||||
if got := peak(); got != 3 {
|
if got := peak(); got != 3 {
|
||||||
t.Errorf("peak concurrent lenses = %d, want 3", got)
|
t.Errorf("peak concurrent lenses = %d, want 3", got)
|
||||||
@@ -124,7 +124,7 @@ func TestRunSpecialists_SequentialByDefault(t *testing.T) {
|
|||||||
}
|
}
|
||||||
specs := threeLenses()
|
specs := threeLenses()
|
||||||
|
|
||||||
results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff")
|
results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff")
|
||||||
|
|
||||||
if got := peak(); got != 1 {
|
if got := peak(); got != 1 {
|
||||||
t.Errorf("peak concurrent lenses = %d, want 1 (sequential by default)", got)
|
t.Errorf("peak concurrent lenses = %d, want 1 (sequential by default)", got)
|
||||||
@@ -146,7 +146,7 @@ func TestRunSpecialists_PerProviderFanOut(t *testing.T) {
|
|||||||
}
|
}
|
||||||
specs := threeLenses()
|
specs := threeLenses()
|
||||||
|
|
||||||
results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff")
|
results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff")
|
||||||
|
|
||||||
if got := peak(); got != 3 {
|
if got := peak(); got != 3 {
|
||||||
t.Errorf("peak concurrent lenses = %d, want 3 (m1 per-provider override)", got)
|
t.Errorf("peak concurrent lenses = %d, want 3 (m1 per-provider override)", got)
|
||||||
|
|||||||
+81
-132
@@ -39,10 +39,9 @@
|
|||||||
// GADFLY_TITLE PR title (optional).
|
// GADFLY_TITLE PR title (optional).
|
||||||
// GADFLY_BODY PR description (optional).
|
// GADFLY_BODY PR description (optional).
|
||||||
// GADFLY_MAX_STEPS review-pass step cap (optional, default 24).
|
// GADFLY_MAX_STEPS review-pass step cap (optional, default 24).
|
||||||
// GADFLY_WRAPUP_RESERVE steps before the cap at which the agent is told to
|
// GADFLY_WRAPUP_RESERVE steps before the cap at which the wrap-up critic nudges
|
||||||
// stop investigating and write its answer (optional,
|
// the agent to stop investigating and write its answer
|
||||||
// default 4). Plus a tool-free finalization fallback
|
// (optional, default 4).
|
||||||
// guarantees a step-exhausted pass still emits output.
|
|
||||||
// GADFLY_RECHECK set to 0/false to skip the recheck pass (optional, default on).
|
// GADFLY_RECHECK set to 0/false to skip the recheck pass (optional, default on).
|
||||||
// GADFLY_RECHECK_MAX_STEPS recheck-pass step cap (optional, default 16).
|
// GADFLY_RECHECK_MAX_STEPS recheck-pass step cap (optional, default 16).
|
||||||
// GADFLY_TIMEOUT_SECS overall deadline in seconds, shared by both passes (optional, default 300).
|
// GADFLY_TIMEOUT_SECS overall deadline in seconds, shared by both passes (optional, default 300).
|
||||||
@@ -55,8 +54,8 @@
|
|||||||
// lanes as GADFLY_PROVIDER_CONCURRENCY (e.g.
|
// lanes as GADFLY_PROVIDER_CONCURRENCY (e.g.
|
||||||
// "ollama-cloud=3,m1=1"). Wins over GADFLY_LENS_CONCURRENCY
|
// "ollama-cloud=3,m1=1"). Wins over GADFLY_LENS_CONCURRENCY
|
||||||
// for the model's provider; falls back to it otherwise.
|
// for the model's provider; falls back to it otherwise.
|
||||||
// GADFLY_MAX_DIFF_CHARS diff chars embedded in the prompt (optional, default 60000;
|
// GADFLY_MAX_DIFF_CHARS diff chars embedded in the review prompt (optional, default 60000;
|
||||||
// the full diff is always available via the get_diff tool).
|
// the full diff is reachable via the paginated get_diff tool).
|
||||||
//
|
//
|
||||||
// On success it prints the review to stdout and exits 0. On a usage/config or
|
// On success it prints the review to stdout and exits 0. On a usage/config or
|
||||||
// model error it prints a diagnostic to stderr and exits non-zero; run.sh then
|
// model error it prints a diagnostic to stderr and exits non-zero; run.sh then
|
||||||
@@ -70,11 +69,9 @@ import (
|
|||||||
"os"
|
"os"
|
||||||
"strconv"
|
"strconv"
|
||||||
"strings"
|
"strings"
|
||||||
"sync"
|
|
||||||
"time"
|
"time"
|
||||||
|
|
||||||
"gitea.stevedudenhoeffer.com/steve/majordomo/agent"
|
"gitea.stevedudenhoeffer.com/steve/executus/fanout"
|
||||||
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
|
|
||||||
)
|
)
|
||||||
|
|
||||||
const (
|
const (
|
||||||
@@ -107,13 +104,6 @@ const wrapUpInstruction = "⚠️ You are almost out of your investigation budge
|
|||||||
"Do not begin any new investigation. If a finding could not be confirmed, drop it or mark it explicitly as unverified. " +
|
"Do not begin any new investigation. If a finding could not be confirmed, drop it or mark it explicitly as unverified. " +
|
||||||
"Output the review in the required format right now."
|
"Output the review in the required format right now."
|
||||||
|
|
||||||
// finalizeInstruction is the user message sent on the tool-free fallback pass
|
|
||||||
// when the agent exhausted its budget (or tripped a loop guard) without ever
|
|
||||||
// producing a final answer. It forces the model to synthesize whatever it has.
|
|
||||||
const finalizeInstruction = "You have run out of investigation steps. Do NOT call any tools. " +
|
|
||||||
"Based solely on what you have already gathered above, write your final answer now in the required format. " +
|
|
||||||
"If you could not confirm some findings, omit them or mark them as unverified, but produce the answer."
|
|
||||||
|
|
||||||
func main() {
|
func main() {
|
||||||
if err := run(); err != nil {
|
if err := run(); err != nil {
|
||||||
fmt.Fprintln(os.Stderr, "gadfly:", err)
|
fmt.Fprintln(os.Stderr, "gadfly:", err)
|
||||||
@@ -123,6 +113,14 @@ func main() {
|
|||||||
|
|
||||||
func run() error {
|
func run() error {
|
||||||
start := time.Now()
|
start := time.Now()
|
||||||
|
|
||||||
|
// Consolidation mode: not a review at all — read the per-model findings the
|
||||||
|
// swarm wrote and print the single cross-model consensus comment. entrypoint.sh
|
||||||
|
// runs this once, after every model has finished.
|
||||||
|
if strings.TrimSpace(os.Getenv("GADFLY_CONSOLIDATE_DIR")) != "" {
|
||||||
|
return runConsolidate()
|
||||||
|
}
|
||||||
|
|
||||||
repoDir := os.Getenv("GADFLY_REPO_DIR")
|
repoDir := os.Getenv("GADFLY_REPO_DIR")
|
||||||
diffFile := os.Getenv("GADFLY_DIFF_FILE")
|
diffFile := os.Getenv("GADFLY_DIFF_FILE")
|
||||||
systemFile := os.Getenv("GADFLY_SYSTEM_FILE")
|
systemFile := os.Getenv("GADFLY_SYSTEM_FILE")
|
||||||
@@ -169,7 +167,18 @@ func run() error {
|
|||||||
} else if worker != nil {
|
} else if worker != nil {
|
||||||
fsTools.worker = worker
|
fsTools.worker = worker
|
||||||
}
|
}
|
||||||
eng = &majordomoEngine{mdl: mdl, fsTools: fsTools}
|
// The context compactor needs a cheap summarizer; reuse the worker model
|
||||||
|
// when present, else the review model. A bad explicit GADFLY_COMPACT_MODEL
|
||||||
|
// just disables compaction rather than sinking the review.
|
||||||
|
summarizer, serr := resolveSummarizerModel(mdl, fsTools.worker)
|
||||||
|
if serr != nil {
|
||||||
|
fmt.Fprintln(os.Stderr, "gadfly: compaction summarizer disabled:", serr)
|
||||||
|
}
|
||||||
|
rex, rerr := newReviewExecutor(fsTools, mdl, summarizer, os.Getenv("GADFLY_MODEL"), newPRBudget())
|
||||||
|
if rerr != nil {
|
||||||
|
return fmt.Errorf("build review executor: %w", rerr)
|
||||||
|
}
|
||||||
|
eng = &majordomoEngine{rex: rex, mdl: mdl}
|
||||||
}
|
}
|
||||||
|
|
||||||
specialists, registry, auto, serrs := resolveSpecialists(repoDir)
|
specialists, registry, auto, serrs := resolveSpecialists(repoDir)
|
||||||
@@ -214,59 +223,64 @@ func run() error {
|
|||||||
// Optional, best-effort telemetry. OFF unless GADFLY_FINDINGS_URL is set;
|
// Optional, best-effort telemetry. OFF unless GADFLY_FINDINGS_URL is set;
|
||||||
// any failure is logged to stderr and never affects stdout or the exit code.
|
// any failure is logged to stderr and never affects stdout or the exit code.
|
||||||
emit(results, time.Since(start))
|
emit(results, time.Since(start))
|
||||||
|
|
||||||
|
// Optional per-model findings artifact for the cross-model consolidation
|
||||||
|
// pass. No-op unless GADFLY_FINDINGS_OUT is set (entrypoint sets it for a
|
||||||
|
// multi-model swarm). Best-effort, never affects stdout or the exit code.
|
||||||
|
writeFindingsOut(results)
|
||||||
return nil
|
return nil
|
||||||
}
|
}
|
||||||
|
|
||||||
// runSpecialists reviews the diff through each lens and returns the results in
|
// runSpecialists reviews the diff through each lens and returns the results in
|
||||||
// the SAME order as specialists, regardless of finish order. Up to
|
// the SAME order as specialists, regardless of finish order. It uses executus's
|
||||||
// GADFLY_LENS_CONCURRENCY lenses run concurrently; the default of 1 keeps the
|
// fanout primitive: up to GADFLY_LENS_CONCURRENCY lenses run concurrently (the
|
||||||
// suite sequential, exactly as before. Each lens already runs under its own
|
// default of 1 keeps the suite sequential, exactly as before), and fanout.Run
|
||||||
// per-lens timeout (reviewWithSpecialist), so concurrency simply overlaps those
|
// returns one result per lens in input order. Each lens already runs under its
|
||||||
// independent passes — and because reviewWithSpecialist builds a fresh toolbox
|
// own per-lens timeout (reviewWithSpecialist) and the lenses only read the
|
||||||
// per pass and the lenses only read the immutable repoFS, they share no mutable
|
// immutable repoFS, so concurrency simply overlaps independent passes.
|
||||||
// state. Results are stored by index so the consolidated comment keeps the
|
|
||||||
// configured lens order.
|
|
||||||
//
|
//
|
||||||
// Caution: this fans out WITHIN one model. It multiplies with entrypoint.sh's
|
// Caution: this fans out WITHIN one model. It multiplies with entrypoint.sh's
|
||||||
// per-provider model concurrency, so total concurrent backend requests ≈
|
// per-provider model concurrency, so total concurrent backend requests ≈
|
||||||
// (models at once) × (lenses at once). To fan lenses out without oversubscribing
|
// (models at once) × (lenses at once). To fan lenses out without oversubscribing
|
||||||
// the backend, run models one at a time (provider lane cap 1) and raise this.
|
// the backend, run models one at a time (provider lane cap 1) and raise this.
|
||||||
func runSpecialists(eng reviewEngine, base string, specialists []Specialist, task, diff string) []specialistResult {
|
func runSpecialists(eng reviewEngine, base string, specialists []Specialist, task, diff string) []specialistResult {
|
||||||
results := make([]specialistResult, len(specialists))
|
|
||||||
|
|
||||||
// Optional live status board: publishes this model's per-lens progress to a
|
// Optional live status board: publishes this model's per-lens progress to a
|
||||||
// file the entrypoint board renders. Inert (no-op) unless GADFLY_STATUS_FILE
|
// file the entrypoint board renders. Inert (no-op) unless GADFLY_STATUS_FILE
|
||||||
// is set, so plain runs are unaffected.
|
// is set, so plain runs are unaffected.
|
||||||
sw := newStatusWriter(os.Getenv("GADFLY_MODEL"), modelProvider(), specialists)
|
sw := newStatusWriter(os.Getenv("GADFLY_MODEL"), modelProvider(), specialists)
|
||||||
|
|
||||||
conc := min(lensConcurrency(), len(specialists))
|
fanResults := fanout.Run(context.Background(), specialists, fanout.Options[Specialist]{
|
||||||
|
MaxConcurrent: lensConcurrency(),
|
||||||
sem := make(chan struct{}, conc)
|
}, func(_ context.Context, sp Specialist) (res specialistResult, _ error) {
|
||||||
var wg sync.WaitGroup
|
// A panic in one lens must not crash the whole binary (which would kill
|
||||||
for i, sp := range specialists {
|
// every other lens's output) or leave this lens stuck at "running" on the
|
||||||
wg.Add(1)
|
// status board. fanout does not recover fn panics, so we do it here:
|
||||||
sem <- struct{}{} // blocks once `conc` lenses are already in flight
|
// record the panic as an errored result and mark the lens finished.
|
||||||
go func(i int, sp Specialist) {
|
|
||||||
defer wg.Done()
|
|
||||||
defer func() { <-sem }()
|
|
||||||
// A panic in one lens must not crash the whole binary (which would
|
|
||||||
// kill every other lens's output) or leave this lens stuck at
|
|
||||||
// "running" on the status board. Recover, record it as an errored
|
|
||||||
// result, and mark the lens finished so the board can complete.
|
|
||||||
defer func() {
|
defer func() {
|
||||||
if r := recover(); r != nil {
|
if r := recover(); r != nil {
|
||||||
results[i] = specialistResult{spec: sp, out: fmt.Sprintf("⚠️ This reviewer panicked: %v", r), verdict: verdictUnknown, errored: true}
|
res = specialistResult{spec: sp, out: fmt.Sprintf("⚠️ This reviewer panicked: %v", r), verdict: verdictUnknown, errored: true}
|
||||||
sw.set(sp.Name, lensFinished, "", true)
|
sw.set(sp.Name, lensFinished, "", true)
|
||||||
}
|
}
|
||||||
}()
|
}()
|
||||||
sw.set(sp.Name, lensRunning, "", false)
|
sw.set(sp.Name, lensRunning, "", false)
|
||||||
out, errored := reviewWithSpecialist(eng, base, sp, task, diff)
|
out, errored := reviewWithSpecialist(eng, base, sp, task, diff)
|
||||||
v := parseVerdict(out)
|
v := parseVerdict(out)
|
||||||
results[i] = specialistResult{spec: sp, out: out, verdict: v, errored: errored}
|
|
||||||
sw.set(sp.Name, lensFinished, v.label(), errored)
|
sw.set(sp.Name, lensFinished, v.label(), errored)
|
||||||
}(i, sp)
|
return specialistResult{spec: sp, out: out, verdict: v, errored: errored}, nil
|
||||||
|
})
|
||||||
|
|
||||||
|
// fanout guarantees input order; its Result.Err is set only when the context
|
||||||
|
// is cancelled before a lens ran (reviewWithSpecialist embeds its own failures
|
||||||
|
// in the result), so surface that as an errored lens rather than dropping it.
|
||||||
|
results := make([]specialistResult, len(specialists))
|
||||||
|
for i, r := range fanResults {
|
||||||
|
if r.Err != nil {
|
||||||
|
results[i] = specialistResult{spec: specialists[i], out: fmt.Sprintf("⚠️ This reviewer did not run: %v", r.Err), verdict: verdictUnknown, errored: true}
|
||||||
|
sw.set(specialists[i].Name, lensFinished, "", true)
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
results[i] = r.Value
|
||||||
}
|
}
|
||||||
wg.Wait()
|
|
||||||
return results
|
return results
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -307,8 +321,7 @@ func providerOverride(envName, provider string) (int, bool) {
|
|||||||
// returned bool is true when the review pass failed (rendered as an inline
|
// returned bool is true when the review pass failed (rendered as an inline
|
||||||
// notice — advisory; one lens failing never sinks the others or the job).
|
// notice — advisory; one lens failing never sinks the others or the job).
|
||||||
func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, diff string) (string, bool) {
|
func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, diff string) (string, bool) {
|
||||||
timeout := time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second
|
ctx, cancel := context.WithTimeout(context.Background(), reviewTimeout())
|
||||||
ctx, cancel := context.WithTimeout(context.Background(), timeout)
|
|
||||||
defer cancel()
|
defer cancel()
|
||||||
|
|
||||||
draft, err := eng.runPass(ctx, composeSpecialistPrompt(base, sp), task,
|
draft, err := eng.runPass(ctx, composeSpecialistPrompt(base, sp), task,
|
||||||
@@ -331,90 +344,6 @@ func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, di
|
|||||||
return final, false
|
return final, false
|
||||||
}
|
}
|
||||||
|
|
||||||
// runAgent runs one agent pass (its own fresh toolbox over the sandbox) and
|
|
||||||
// returns the final answer. An empty answer is an error — the caller decides
|
|
||||||
// whether that is fatal (review pass) or recoverable (recheck pass). A
|
|
||||||
// non-empty answer that ended on a budget/guard error is still returned: the
|
|
||||||
// model wrote its output, then ran out of steps.
|
|
||||||
//
|
|
||||||
// Two mechanisms keep a step-hungry model from hard-failing with no output:
|
|
||||||
// 1. A wrap-up steer: once the run comes within wrapUpReserve steps of the
|
|
||||||
// cap, a forceful "stop calling tools, write your final answer" message is
|
|
||||||
// injected so the model spends its remaining steps finalizing.
|
|
||||||
// 2. A finalization fallback: if the loop still ends empty (the model ignored
|
|
||||||
// the nudge, or a loop guard tripped), one tool-free model call forces a
|
|
||||||
// final answer out of the transcript already gathered.
|
|
||||||
func runAgent(ctx context.Context, mdl llm.Model, fsTools *repoFS, system, task string, maxSteps int) (string, error) {
|
|
||||||
box, err := fsTools.toolbox()
|
|
||||||
if err != nil {
|
|
||||||
return "", err
|
|
||||||
}
|
|
||||||
loop := agent.New(mdl, system,
|
|
||||||
agent.WithToolbox(box),
|
|
||||||
agent.WithMaxSteps(maxSteps),
|
|
||||||
// Guard rails: stop the model from spinning on failing or identical
|
|
||||||
// tool calls instead of writing its answer.
|
|
||||||
agent.WithToolErrorLimits(4, 4),
|
|
||||||
)
|
|
||||||
|
|
||||||
wrapUpAt := maxSteps - wrapUpReserve()
|
|
||||||
if wrapUpAt < 1 {
|
|
||||||
wrapUpAt = 1
|
|
||||||
}
|
|
||||||
var completed int // steps finished so far (updated after each step)
|
|
||||||
nudged := false
|
|
||||||
|
|
||||||
res, runErr := loop.Run(ctx, task,
|
|
||||||
agent.OnStep(func(s agent.Step) { completed = s.Index + 1 }),
|
|
||||||
agent.WithSteer(func() []llm.Message {
|
|
||||||
if !nudged && completed >= wrapUpAt {
|
|
||||||
nudged = true
|
|
||||||
return []llm.Message{llm.UserText(wrapUpInstruction)}
|
|
||||||
}
|
|
||||||
return nil
|
|
||||||
}),
|
|
||||||
)
|
|
||||||
|
|
||||||
out := ""
|
|
||||||
if res != nil {
|
|
||||||
out = strings.TrimSpace(res.Output)
|
|
||||||
}
|
|
||||||
if out != "" {
|
|
||||||
return out, nil
|
|
||||||
}
|
|
||||||
|
|
||||||
// No final answer. If we still have budget on the clock and a transcript to
|
|
||||||
// work from, force a tool-free finalization rather than losing the pass.
|
|
||||||
if res != nil && len(res.Messages) > 0 && ctx.Err() == nil {
|
|
||||||
if forced := forceFinalAnswer(ctx, mdl, system, res.Messages); forced != "" {
|
|
||||||
return forced, nil
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if runErr != nil {
|
|
||||||
return "", runErr
|
|
||||||
}
|
|
||||||
return "", errors.New("agent produced no output")
|
|
||||||
}
|
|
||||||
|
|
||||||
// forceFinalAnswer makes one tool-free model call to squeeze a final answer out
|
|
||||||
// of an agent that exhausted its step budget without producing one. Tools are
|
|
||||||
// forbidden (ToolChoice "none") so the model must synthesize from the transcript
|
|
||||||
// instead of investigating further. Best-effort: any error or empty reply
|
|
||||||
// returns "" and the caller falls back to its normal empty-output handling.
|
|
||||||
func forceFinalAnswer(ctx context.Context, mdl llm.Model, system string, transcript []llm.Message) string {
|
|
||||||
msgs := append(append([]llm.Message(nil), transcript...), llm.UserText(finalizeInstruction))
|
|
||||||
resp, err := mdl.Generate(ctx, llm.Request{
|
|
||||||
System: system,
|
|
||||||
Messages: msgs,
|
|
||||||
ToolChoice: "none",
|
|
||||||
})
|
|
||||||
if err != nil || resp == nil {
|
|
||||||
return ""
|
|
||||||
}
|
|
||||||
return strings.TrimSpace(resp.Text())
|
|
||||||
}
|
|
||||||
|
|
||||||
// wrapUpReserve is how many steps before the cap the wrap-up nudge fires,
|
// wrapUpReserve is how many steps before the cap the wrap-up nudge fires,
|
||||||
// overridable via GADFLY_WRAPUP_RESERVE.
|
// overridable via GADFLY_WRAPUP_RESERVE.
|
||||||
func wrapUpReserve() int {
|
func wrapUpReserve() int {
|
||||||
@@ -431,7 +360,7 @@ func buildTask(diff string) string {
|
|||||||
truncNote := ""
|
truncNote := ""
|
||||||
if maxDiff > 0 && len(diff) > maxDiff {
|
if maxDiff > 0 && len(diff) > maxDiff {
|
||||||
diff = diff[:maxDiff]
|
diff = diff[:maxDiff]
|
||||||
truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars in this message; read the changed files (or call get_diff, if available) for the full text.]", maxDiff)
|
truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars in this message; page the full diff with get_diff (paginated; pass a `path` to scope it to one file) or read the changed files.]", maxDiff)
|
||||||
}
|
}
|
||||||
|
|
||||||
var b strings.Builder
|
var b strings.Builder
|
||||||
@@ -458,3 +387,23 @@ func envInt(name string, def int) int {
|
|||||||
}
|
}
|
||||||
return n
|
return n
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// envBool reads a boolean-ish env var: def when unset, false for an explicit
|
||||||
|
// falsey value (0/false/no/off), true otherwise. The shared spelling for
|
||||||
|
// gadfly's "on unless disabled" opt-out flags (GADFLY_RECHECK, GADFLY_COMPACT).
|
||||||
|
func envBool(name string, def bool) bool {
|
||||||
|
switch strings.ToLower(strings.TrimSpace(os.Getenv(name))) {
|
||||||
|
case "":
|
||||||
|
return def
|
||||||
|
case "0", "false", "no", "off":
|
||||||
|
return false
|
||||||
|
default:
|
||||||
|
return true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// reviewTimeout is the per-specialist-lens deadline (GADFLY_TIMEOUT_SECS), shared
|
||||||
|
// across a lens's review+recheck passes and applied as each pass's run cap.
|
||||||
|
func reviewTimeout() time.Duration {
|
||||||
|
return time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second
|
||||||
|
}
|
||||||
|
|||||||
@@ -127,6 +127,27 @@ func resolveWorkerModel() (llm.Model, error) {
|
|||||||
return majordomo.Parse(buildSpec(provider, spec))
|
return majordomo.Parse(buildSpec(provider, spec))
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// resolveSummarizerModel picks the model the context compactor uses to compress
|
||||||
|
// the runaway middle of a transcript. It should be CHEAP, since it fires once per
|
||||||
|
// compaction: GADFLY_COMPACT_MODEL if set (honoring GADFLY_PROVIDER for a bare
|
||||||
|
// id), else the delegate worker model when one is configured (already cheap by
|
||||||
|
// design), else the review model itself. Returns (nil, err) only on an explicit
|
||||||
|
// bad GADFLY_COMPACT_MODEL spec — the caller logs it and simply runs without
|
||||||
|
// compaction rather than failing the review.
|
||||||
|
func resolveSummarizerModel(review, worker llm.Model) (llm.Model, error) {
|
||||||
|
if spec := strings.TrimSpace(os.Getenv("GADFLY_COMPACT_MODEL")); spec != "" {
|
||||||
|
provider := strings.TrimSpace(os.Getenv("GADFLY_PROVIDER"))
|
||||||
|
if provider == "" {
|
||||||
|
provider = defaultProvider
|
||||||
|
}
|
||||||
|
return majordomo.Parse(buildSpec(provider, spec))
|
||||||
|
}
|
||||||
|
if worker != nil {
|
||||||
|
return worker, nil
|
||||||
|
}
|
||||||
|
return review, nil
|
||||||
|
}
|
||||||
|
|
||||||
// buildSpec turns (provider, model) into a majordomo spec. A model id that
|
// buildSpec turns (provider, model) into a majordomo spec. A model id that
|
||||||
// already carries a "provider/" prefix (or is a multi-element failover chain)
|
// already carries a "provider/" prefix (or is a multi-element failover chain)
|
||||||
// is passed through verbatim; a bare id is prefixed with the provider.
|
// is passed through verbatim; a bare id is prefixed with the provider.
|
||||||
|
|||||||
+15
-11
@@ -2,7 +2,6 @@ package main
|
|||||||
|
|
||||||
import (
|
import (
|
||||||
"fmt"
|
"fmt"
|
||||||
"os"
|
|
||||||
"strings"
|
"strings"
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -11,6 +10,13 @@ import (
|
|||||||
// than discovering them.
|
// than discovering them.
|
||||||
const defaultRecheckMaxSteps = 16
|
const defaultRecheckMaxSteps = 16
|
||||||
|
|
||||||
|
// defaultRecheckDiffChars caps the diff embedded in the recheck task. It is much
|
||||||
|
// smaller than the review task's GADFLY_MAX_DIFF_CHARS: the recheck already has
|
||||||
|
// the draft findings to verify and can pull the exact hunks it needs via the
|
||||||
|
// paginated get_diff tool (optionally scoped to a path), so re-embedding the
|
||||||
|
// whole diff on every recheck step is pure burn. Override: GADFLY_RECHECK_DIFF_CHARS.
|
||||||
|
const defaultRecheckDiffChars = 20000
|
||||||
|
|
||||||
// recheckSystemPrompt drives the second, adversarial verification pass. The
|
// recheckSystemPrompt drives the second, adversarial verification pass. The
|
||||||
// model is given a DRAFT review and must independently confirm each finding
|
// model is given a DRAFT review and must independently confirm each finding
|
||||||
// against the real code before letting it survive — the antidote to a
|
// against the real code before letting it survive — the antidote to a
|
||||||
@@ -49,18 +55,16 @@ Output rules:
|
|||||||
- Do NOT invent new findings; this is a verification gate, not a fresh review.
|
- Do NOT invent new findings; this is a verification gate, not a fresh review.
|
||||||
- Do NOT include meta-commentary about the verification process or which
|
- Do NOT include meta-commentary about the verification process or which
|
||||||
findings you dropped — output only the final, corrected review markdown.
|
findings you dropped — output only the final, corrected review markdown.
|
||||||
|
- The draft ends with a fenced ` + "`gadfly-findings`" + ` JSON block. Regenerate it
|
||||||
|
so it lists ONLY the findings that SURVIVED your verification, in the same schema
|
||||||
|
({"file","line","severity","confidence","title"}; severity one of
|
||||||
|
critical/high/medium/small/trivial, confidence one of high/medium/low). If every
|
||||||
|
finding was dropped, emit an empty array ` + "`[]`" + `. Keep the block last.
|
||||||
- When done investigating, STOP calling tools and reply with the review.`
|
- When done investigating, STOP calling tools and reply with the review.`
|
||||||
|
|
||||||
// recheckEnabled reports whether the verification pass should run. On unless
|
// recheckEnabled reports whether the verification pass should run. On unless
|
||||||
// GADFLY_RECHECK is explicitly a falsey value.
|
// GADFLY_RECHECK is explicitly a falsey value.
|
||||||
func recheckEnabled() bool {
|
func recheckEnabled() bool { return envBool("GADFLY_RECHECK", true) }
|
||||||
switch strings.ToLower(strings.TrimSpace(os.Getenv("GADFLY_RECHECK"))) {
|
|
||||||
case "0", "false", "no", "off":
|
|
||||||
return false
|
|
||||||
default:
|
|
||||||
return true
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// shouldRecheck decides whether to run the verification pass for a given draft.
|
// shouldRecheck decides whether to run the verification pass for a given draft.
|
||||||
// A clean "no material issues" draft has nothing to verify, so it is skipped
|
// A clean "no material issues" draft has nothing to verify, so it is skipped
|
||||||
@@ -79,11 +83,11 @@ func shouldRecheck(draft string) bool {
|
|||||||
// scrutinize, with the full diff available via get_diff (and embedded here,
|
// scrutinize, with the full diff available via get_diff (and embedded here,
|
||||||
// truncated, to save a tool call).
|
// truncated, to save a tool call).
|
||||||
func buildRecheckTask(draft, diff string) string {
|
func buildRecheckTask(draft, diff string) string {
|
||||||
maxDiff := envInt("GADFLY_MAX_DIFF_CHARS", defaultMaxDiffChars)
|
maxDiff := envInt("GADFLY_RECHECK_DIFF_CHARS", defaultRecheckDiffChars)
|
||||||
truncNote := ""
|
truncNote := ""
|
||||||
if maxDiff > 0 && len(diff) > maxDiff {
|
if maxDiff > 0 && len(diff) > maxDiff {
|
||||||
diff = diff[:maxDiff]
|
diff = diff[:maxDiff]
|
||||||
truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars here; read the changed files (or call get_diff, if available) for the full text.]", maxDiff)
|
truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars here; call get_diff (paginated; pass a `path` to scope it to one file) or read the changed files for the rest.]", maxDiff)
|
||||||
}
|
}
|
||||||
|
|
||||||
var b strings.Builder
|
var b strings.Builder
|
||||||
|
|||||||
@@ -49,7 +49,7 @@ func TestRecheckEnabled(t *testing.T) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
func TestBuildRecheckTask(t *testing.T) {
|
func TestBuildRecheckTask(t *testing.T) {
|
||||||
t.Setenv("GADFLY_MAX_DIFF_CHARS", "")
|
t.Setenv("GADFLY_RECHECK_DIFF_CHARS", "")
|
||||||
draft := "VERDICT: Blocking issues found\n- foo.go:1 broken"
|
draft := "VERDICT: Blocking issues found\n- foo.go:1 broken"
|
||||||
out := buildRecheckTask(draft, "diff --git a/x b/x\n+y\n")
|
out := buildRecheckTask(draft, "diff --git a/x b/x\n+y\n")
|
||||||
if !strings.Contains(out, draft) {
|
if !strings.Contains(out, draft) {
|
||||||
@@ -77,25 +77,46 @@ func fakeModel(t *testing.T, reply string) llm.Model {
|
|||||||
return m
|
return m
|
||||||
}
|
}
|
||||||
|
|
||||||
func TestRunAgent_ReturnsOutput(t *testing.T) {
|
// newTestReviewExecutor builds a reviewExecutor over a fake model + repo for unit
|
||||||
|
// tests: no compaction summarizer and no budget, so it exercises the bare agent
|
||||||
|
// loop through the executus run kernel without any network call.
|
||||||
|
func newTestReviewExecutor(t *testing.T, mdl llm.Model, fs *repoFS) *reviewExecutor {
|
||||||
|
t.Helper()
|
||||||
|
rex, err := newReviewExecutor(fs, mdl, nil, "mock", nil)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
return rex
|
||||||
|
}
|
||||||
|
|
||||||
|
// testEngine wraps newTestReviewExecutor in a majordomoEngine for the
|
||||||
|
// runSpecialists-level tests.
|
||||||
|
func testEngine(t *testing.T, mdl llm.Model, fs *repoFS) *majordomoEngine {
|
||||||
|
t.Helper()
|
||||||
|
return &majordomoEngine{rex: newTestReviewExecutor(t, mdl, fs), mdl: mdl}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestReviewExecutor_ReturnsOutput(t *testing.T) {
|
||||||
fs, err := newRepoFS(t.TempDir(), "diff")
|
fs, err := newRepoFS(t.TempDir(), "diff")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatal(err)
|
t.Fatal(err)
|
||||||
}
|
}
|
||||||
mdl := fakeModel(t, " corrected review: No material issues found. ")
|
mdl := fakeModel(t, " corrected review: No material issues found. ")
|
||||||
out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4)
|
rex := newTestReviewExecutor(t, mdl, fs)
|
||||||
|
out, err := rex.run(context.Background(), "sys", "task", 4)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("runAgent: %v", err)
|
t.Fatalf("run: %v", err)
|
||||||
}
|
}
|
||||||
if out != "corrected review: No material issues found." {
|
if out != "corrected review: No material issues found." {
|
||||||
t.Errorf("runAgent should return trimmed model output, got %q", out)
|
t.Errorf("run should return trimmed model output, got %q", out)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
func TestRunAgent_EmptyIsError(t *testing.T) {
|
func TestReviewExecutor_EmptyIsError(t *testing.T) {
|
||||||
fs, _ := newRepoFS(t.TempDir(), "diff")
|
fs, _ := newRepoFS(t.TempDir(), "diff")
|
||||||
mdl := fakeModel(t, " ")
|
mdl := fakeModel(t, " ")
|
||||||
if _, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4); err == nil {
|
rex := newTestReviewExecutor(t, mdl, fs)
|
||||||
t.Error("runAgent should error on empty model output")
|
if _, err := rex.run(context.Background(), "sys", "task", 4); err == nil {
|
||||||
|
t.Error("run should error on empty model output")
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -0,0 +1,296 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
// Inline PR review. After the consensus comment is rendered, Gadfly also posts a
|
||||||
|
// single Gitea pull review (state COMMENT — advisory, NEVER request-changes or
|
||||||
|
// approve) whose inline comments anchor consensus findings to the exact changed
|
||||||
|
// lines. The issue comment stays the ranked overview; the review puts each
|
||||||
|
// finding next to the code it's about — the "reviewer integrated with Gitea" the
|
||||||
|
// project wanted, without ever blocking a merge.
|
||||||
|
//
|
||||||
|
// All of this is best-effort: disabled by GADFLY_INLINE_REVIEW=0 or when the diff
|
||||||
|
// / API creds aren't available, only anchors findings that land on a line present
|
||||||
|
// in the diff (Gitea rejects comments off the diff), and any error is logged to
|
||||||
|
// stderr without touching the consensus comment (already on stdout) or the exit
|
||||||
|
// code.
|
||||||
|
|
||||||
|
import (
|
||||||
|
"bytes"
|
||||||
|
"encoding/json"
|
||||||
|
"fmt"
|
||||||
|
"io"
|
||||||
|
"net/http"
|
||||||
|
"os"
|
||||||
|
"regexp"
|
||||||
|
"strconv"
|
||||||
|
"strings"
|
||||||
|
"time"
|
||||||
|
)
|
||||||
|
|
||||||
|
const (
|
||||||
|
// inlineReviewMarker tags our review body so a re-run can delete the previous
|
||||||
|
// one instead of stacking duplicate inline comments.
|
||||||
|
inlineReviewMarker = "<!-- gadfly-inline-review -->"
|
||||||
|
// maxInlineComments caps how many inline comments one review carries, so a
|
||||||
|
// huge diff can't produce a wall of annotations. Clusters are pre-sorted by
|
||||||
|
// agreement×severity, so the cap keeps the most important.
|
||||||
|
maxInlineComments = 25
|
||||||
|
inlineReviewHTTPTimeout = 20 * time.Second
|
||||||
|
)
|
||||||
|
|
||||||
|
// reviewComment is one inline comment in a Gitea pull review. Field names match
|
||||||
|
// the Gitea API EXACTLY (new_position = line in the new/head file).
|
||||||
|
type reviewComment struct {
|
||||||
|
Path string `json:"path"`
|
||||||
|
Body string `json:"body"`
|
||||||
|
NewPosition int `json:"new_position"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// createReview is the POST /pulls/{n}/reviews body. event is ALWAYS "COMMENT".
|
||||||
|
type createReview struct {
|
||||||
|
Body string `json:"body"`
|
||||||
|
Event string `json:"event"`
|
||||||
|
Comments []reviewComment `json:"comments"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// postInlineReview posts one COMMENT-state pull review with inline comments for
|
||||||
|
// consensus findings on changed lines. No-op + best-effort (see file comment).
|
||||||
|
func postInlineReview(clusters []cluster) {
|
||||||
|
if strings.EqualFold(strings.TrimSpace(os.Getenv("GADFLY_INLINE_REVIEW")), "0") {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
api := strings.TrimRight(strings.TrimSpace(os.Getenv("GITEA_API")), "/")
|
||||||
|
token := strings.TrimSpace(os.Getenv("GITEA_TOKEN"))
|
||||||
|
pr := strings.TrimSpace(os.Getenv("GADFLY_PR"))
|
||||||
|
diffPath := strings.TrimSpace(os.Getenv("GADFLY_DIFF_FILE"))
|
||||||
|
if api == "" || token == "" || pr == "" || diffPath == "" {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
diff, err := os.ReadFile(diffPath)
|
||||||
|
if err != nil {
|
||||||
|
fmt.Fprintln(os.Stderr, "gadfly: inline review: read diff:", err)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
comments := inlineComments(clusters, parseDiffNewLines(string(diff)))
|
||||||
|
if len(comments) == 0 {
|
||||||
|
return // nothing anchors to a changed line; the consensus comment covers it
|
||||||
|
}
|
||||||
|
|
||||||
|
client := &http.Client{Timeout: inlineReviewHTTPTimeout}
|
||||||
|
base := fmt.Sprintf("%s/pulls/%s/reviews", api, pr)
|
||||||
|
deletePriorReviews(client, base, token) // avoid stacking on re-runs
|
||||||
|
|
||||||
|
body := fmt.Sprintf("%s\n🪰 **Gadfly consensus review** — %d inline finding%s on changed lines. See the consensus comment for the full ranked summary.\n\n<sub>Advisory only — does not block merge.</sub>",
|
||||||
|
inlineReviewMarker, len(comments), plural(len(comments)))
|
||||||
|
if err := giteaSend(client, http.MethodPost, base, token, createReview{Body: body, Event: "COMMENT", Comments: comments}); err != nil {
|
||||||
|
fmt.Fprintln(os.Stderr, "gadfly: inline review post:", err)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// inlineComments builds inline comments for the clusters that anchor to a line
|
||||||
|
// present in the diff, in priority order (clusters are pre-sorted), capped.
|
||||||
|
func inlineComments(clusters []cluster, addable map[string]map[int]bool) []reviewComment {
|
||||||
|
var out []reviewComment
|
||||||
|
for _, c := range clusters {
|
||||||
|
path := normPath(c.file)
|
||||||
|
anchor := anchorLine(addable[path], c.line, c.maxLine)
|
||||||
|
if anchor == 0 {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
out = append(out, reviewComment{Path: path, NewPosition: anchor, Body: inlineBody(c)})
|
||||||
|
if len(out) >= maxInlineComments {
|
||||||
|
break
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return out
|
||||||
|
}
|
||||||
|
|
||||||
|
// anchorLine returns the first line in [lo,hi] that is an added line in the diff,
|
||||||
|
// or 0 if none. Scanning the cluster's whole span (not just its representative
|
||||||
|
// line) anchors a finding whose min line is just outside the diff but whose span
|
||||||
|
// still overlaps a changed line.
|
||||||
|
func anchorLine(added map[int]bool, lo, hi int) int {
|
||||||
|
if added == nil || lo <= 0 {
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
if hi < lo { // single-line cluster (maxLine unset)
|
||||||
|
hi = lo
|
||||||
|
}
|
||||||
|
for ln := lo; ln <= hi; ln++ {
|
||||||
|
if added[ln] {
|
||||||
|
return ln
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
// inlineBody renders one inline comment: severity + title, who flagged it, detail.
|
||||||
|
func inlineBody(c cluster) string {
|
||||||
|
var b strings.Builder
|
||||||
|
fmt.Fprintf(&b, "%s **%s**", sevIcon(c.severity), strings.TrimSpace(c.title))
|
||||||
|
fmt.Fprintf(&b, "\n\n_%s · flagged by %d model%s_", lensList(c.lenses), len(c.models), plural(len(c.models)))
|
||||||
|
if d := strings.TrimSpace(c.detail); d != "" {
|
||||||
|
fmt.Fprintf(&b, "\n\n%s", d)
|
||||||
|
}
|
||||||
|
b.WriteString("\n\n<sub>🪰 Gadfly · advisory</sub>")
|
||||||
|
return b.String()
|
||||||
|
}
|
||||||
|
|
||||||
|
// parseDiffNewLines returns, per file, the set of NEW-file line numbers that were
|
||||||
|
// ADDED in the unified diff — the safest lines for an inline comment to anchor to
|
||||||
|
// (Gitea reliably accepts comments on added lines). Context lines are walked to
|
||||||
|
// keep the line counter correct but are NOT recorded: anchoring only to added
|
||||||
|
// lines avoids the all-or-nothing review POST being rejected for an off-change
|
||||||
|
// anchor. Hunk lengths from the @@ header bound each hunk, so a content line that
|
||||||
|
// happens to start with "+++ " or "@@" is still read as content, not a header.
|
||||||
|
func parseDiffNewLines(diff string) map[string]map[int]bool {
|
||||||
|
out := map[string]map[int]bool{}
|
||||||
|
var file string
|
||||||
|
var newLine, oldRem, newRem int
|
||||||
|
inHunk := false
|
||||||
|
for _, line := range strings.Split(diff, "\n") {
|
||||||
|
if inHunk && (newRem > 0 || oldRem > 0) {
|
||||||
|
switch {
|
||||||
|
case strings.HasPrefix(line, "+"):
|
||||||
|
record(out, file, newLine) // added line — anchorable
|
||||||
|
newLine++
|
||||||
|
newRem--
|
||||||
|
case strings.HasPrefix(line, "-"):
|
||||||
|
oldRem--
|
||||||
|
case strings.HasPrefix(line, "\\"): // "\ No newline at end of file"
|
||||||
|
default: // context line (leading space, or an empty line): advance, don't record
|
||||||
|
newLine++
|
||||||
|
newRem--
|
||||||
|
oldRem--
|
||||||
|
}
|
||||||
|
if newRem <= 0 && oldRem <= 0 {
|
||||||
|
inHunk = false
|
||||||
|
}
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
switch {
|
||||||
|
case strings.HasPrefix(line, "+++ "):
|
||||||
|
file = normPath(strings.TrimPrefix(line, "+++ "))
|
||||||
|
if file == "/dev/null" {
|
||||||
|
file = ""
|
||||||
|
}
|
||||||
|
case strings.HasPrefix(line, "@@"):
|
||||||
|
if m := hunkRe.FindStringSubmatch(line); m != nil && file != "" {
|
||||||
|
newLine, _ = strconv.Atoi(m[3])
|
||||||
|
oldRem = atoiOr(m[2], 1)
|
||||||
|
newRem = atoiOr(m[4], 1)
|
||||||
|
inHunk = newRem > 0 || oldRem > 0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return out
|
||||||
|
}
|
||||||
|
|
||||||
|
// hunkRe captures a unified-diff hunk header's old/new start+length:
|
||||||
|
// @@ -<oldStart>[,<oldLen>] +<newStart>[,<newLen>] @@
|
||||||
|
var hunkRe = regexp.MustCompile(`^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@`)
|
||||||
|
|
||||||
|
// normPath trims a unified-diff path down to a repo-relative one: strip a single
|
||||||
|
// leading "a/" or "b/" prefix (and any "./"), and surrounding whitespace. Applied
|
||||||
|
// to BOTH the diff paths and a finding's file so they match even when a model
|
||||||
|
// writes "./pkg/x.go" or the diff carries the "b/" prefix.
|
||||||
|
func normPath(p string) string {
|
||||||
|
p = strings.TrimSpace(p)
|
||||||
|
p = strings.TrimPrefix(p, "./")
|
||||||
|
if strings.HasPrefix(p, "a/") || strings.HasPrefix(p, "b/") {
|
||||||
|
p = p[2:]
|
||||||
|
}
|
||||||
|
return p
|
||||||
|
}
|
||||||
|
|
||||||
|
func record(out map[string]map[int]bool, file string, line int) {
|
||||||
|
if file == "" {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
if out[file] == nil {
|
||||||
|
out[file] = map[int]bool{}
|
||||||
|
}
|
||||||
|
out[file][line] = true
|
||||||
|
}
|
||||||
|
|
||||||
|
// atoiOr parses s, returning def when s is empty or unparseable. Used for the
|
||||||
|
// optional hunk-length fields (absent => length 1).
|
||||||
|
func atoiOr(s string, def int) int {
|
||||||
|
if s == "" {
|
||||||
|
return def
|
||||||
|
}
|
||||||
|
if n, err := strconv.Atoi(s); err == nil {
|
||||||
|
return n
|
||||||
|
}
|
||||||
|
return def
|
||||||
|
}
|
||||||
|
|
||||||
|
// deletePriorReviews removes our previous inline reviews (matched by the body
|
||||||
|
// marker) so a re-run replaces rather than stacks. Best-effort and quiet.
|
||||||
|
func deletePriorReviews(client *http.Client, base, token string) {
|
||||||
|
const perPage = 50
|
||||||
|
for page := 1; page <= 10; page++ { // bound the scan, but page past 50 so a stale marked review isn't missed
|
||||||
|
req, err := http.NewRequest(http.MethodGet, fmt.Sprintf("%s?limit=%d&page=%d", base, perPage, page), nil)
|
||||||
|
if err != nil {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
req.Header.Set("Authorization", "token "+token)
|
||||||
|
resp, err := client.Do(req)
|
||||||
|
if err != nil {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
|
||||||
|
io.Copy(io.Discard, resp.Body)
|
||||||
|
resp.Body.Close()
|
||||||
|
return
|
||||||
|
}
|
||||||
|
var reviews []struct {
|
||||||
|
ID int `json:"id"`
|
||||||
|
Body string `json:"body"`
|
||||||
|
}
|
||||||
|
_ = json.NewDecoder(resp.Body).Decode(&reviews)
|
||||||
|
resp.Body.Close()
|
||||||
|
for _, r := range reviews {
|
||||||
|
if !strings.Contains(r.Body, inlineReviewMarker) {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
dreq, err := http.NewRequest(http.MethodDelete, fmt.Sprintf("%s/%d", base, r.ID), nil)
|
||||||
|
if err != nil {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
dreq.Header.Set("Authorization", "token "+token)
|
||||||
|
if dresp, err := client.Do(dreq); err == nil {
|
||||||
|
io.Copy(io.Discard, dresp.Body)
|
||||||
|
dresp.Body.Close()
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if len(reviews) < perPage {
|
||||||
|
return // last page
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// giteaSend marshals payload and sends it with Gitea's "token" auth scheme,
|
||||||
|
// treating a non-2xx response as an error (with a snippet of the body).
|
||||||
|
func giteaSend(client *http.Client, method, url, token string, payload any) error {
|
||||||
|
body, err := json.Marshal(payload)
|
||||||
|
if err != nil {
|
||||||
|
return err
|
||||||
|
}
|
||||||
|
req, err := http.NewRequest(method, url, bytes.NewReader(body))
|
||||||
|
if err != nil {
|
||||||
|
return err
|
||||||
|
}
|
||||||
|
req.Header.Set("Content-Type", "application/json")
|
||||||
|
req.Header.Set("Authorization", "token "+token)
|
||||||
|
resp, err := client.Do(req)
|
||||||
|
if err != nil {
|
||||||
|
return err
|
||||||
|
}
|
||||||
|
defer resp.Body.Close()
|
||||||
|
snippet, _ := io.ReadAll(io.LimitReader(resp.Body, 2048))
|
||||||
|
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
|
||||||
|
return fmt.Errorf("%s %s: status %d: %s", method, url, resp.StatusCode, strings.TrimSpace(string(snippet)))
|
||||||
|
}
|
||||||
|
return nil
|
||||||
|
}
|
||||||
@@ -0,0 +1,107 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
import (
|
||||||
|
"strings"
|
||||||
|
"testing"
|
||||||
|
)
|
||||||
|
|
||||||
|
func TestParseDiffNewLines(t *testing.T) {
|
||||||
|
diff := "diff --git a/a.go b/a.go\n" +
|
||||||
|
"index 111..222 100644\n" +
|
||||||
|
"--- a/a.go\n" +
|
||||||
|
"+++ b/a.go\n" +
|
||||||
|
"@@ -1,3 +1,4 @@\n" +
|
||||||
|
" line1\n" +
|
||||||
|
"-old2\n" +
|
||||||
|
"+new2\n" +
|
||||||
|
"+new3\n" +
|
||||||
|
" line4\n"
|
||||||
|
got := parseDiffNewLines(diff)
|
||||||
|
// Only ADDED lines anchor: new2 (line 2) and new3 (line 3). Context lines 1
|
||||||
|
// and 4 are walked for counting but not recorded.
|
||||||
|
want := map[int]bool{2: true, 3: true}
|
||||||
|
if len(got["a.go"]) != len(want) {
|
||||||
|
t.Fatalf("a.go lines = %v, want %v (added-only)", got["a.go"], want)
|
||||||
|
}
|
||||||
|
for ln := range want {
|
||||||
|
if !got["a.go"][ln] {
|
||||||
|
t.Errorf("expected a.go line %d anchorable", ln)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if got["a.go"][1] || got["a.go"][4] {
|
||||||
|
t.Error("context lines 1/4 should not be anchorable (added-only)")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestParseDiffNewLinesContentLooksLikeHeader(t *testing.T) {
|
||||||
|
// An added line whose CONTENT is "++ weird" appears as "+++ weird" in the
|
||||||
|
// diff. Hunk-length tracking must read it as content, not a file header.
|
||||||
|
diff := "--- a/x\n+++ b/x\n@@ -1,1 +1,2 @@\n ctx\n+++ weird\n"
|
||||||
|
got := parseDiffNewLines(diff)
|
||||||
|
if got["x"][1] { // context line, not recorded (added-only)
|
||||||
|
t.Errorf("line 1 is context, should not anchor: %v", got["x"])
|
||||||
|
}
|
||||||
|
if !got["x"][2] { // the "+++ weird" added line
|
||||||
|
t.Errorf("want added line 2 anchorable, got %v", got["x"])
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestAnchorLineScansSpan(t *testing.T) {
|
||||||
|
// A cluster spanning 10..14 whose min line (10) isn't in the diff but whose
|
||||||
|
// span includes added line 14 must anchor to 14, not be dropped.
|
||||||
|
added := map[string]map[int]bool{"a.go": {14: true}}
|
||||||
|
clusters := []cluster{{file: "a.go", line: 10, maxLine: 14, severity: "high", title: "t", models: set("m1"), lenses: set("x")}}
|
||||||
|
cs := inlineComments(clusters, added)
|
||||||
|
if len(cs) != 1 || cs[0].NewPosition != 14 {
|
||||||
|
t.Fatalf("want anchor at 14 via span scan, got %+v", cs)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestParseDiffNewLinesMultiFile(t *testing.T) {
|
||||||
|
diff := "diff --git a/one.go b/one.go\n--- a/one.go\n+++ b/one.go\n@@ -0,0 +1,2 @@\n+a\n+b\n" +
|
||||||
|
"diff --git a/two.go b/two.go\n--- a/two.go\n+++ b/two.go\n@@ -5,0 +6,1 @@\n+c\n"
|
||||||
|
got := parseDiffNewLines(diff)
|
||||||
|
if !got["one.go"][1] || !got["one.go"][2] {
|
||||||
|
t.Errorf("one.go: want 1,2 got %v", got["one.go"])
|
||||||
|
}
|
||||||
|
if !got["two.go"][6] {
|
||||||
|
t.Errorf("two.go: want 6 got %v", got["two.go"])
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestInlineCommentsNormalizesPaths(t *testing.T) {
|
||||||
|
// Diff path "b/pkg/x.go" -> "pkg/x.go"; a finding written as "./pkg/x.go" must
|
||||||
|
// still anchor (both normalize to the same repo-relative path).
|
||||||
|
addable := parseDiffNewLines("--- a/pkg/x.go\n+++ b/pkg/x.go\n@@ -0,0 +1,1 @@\n+bug\n")
|
||||||
|
clusters := []cluster{{file: "./pkg/x.go", line: 1, severity: "high", title: "t", models: set("m1"), lenses: set("x")}}
|
||||||
|
if cs := inlineComments(clusters, addable); len(cs) != 1 || cs[0].Path != "pkg/x.go" {
|
||||||
|
t.Errorf("path normalization failed to anchor: %+v", cs)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestInlineCommentsFiltersToDiffLines(t *testing.T) {
|
||||||
|
addable := map[string]map[int]bool{"a.go": {10: true, 11: true}}
|
||||||
|
clusters := []cluster{
|
||||||
|
{file: "a.go", line: 10, severity: "high", title: "anchored", models: set("m1", "m2"), lenses: set("security"), detail: "d"},
|
||||||
|
{file: "a.go", line: 99, severity: "high", title: "off-diff line", models: set("m1"), lenses: set("security")},
|
||||||
|
{file: "b.go", line: 1, severity: "high", title: "off-diff file", models: set("m1"), lenses: set("security")},
|
||||||
|
}
|
||||||
|
cs := inlineComments(clusters, addable)
|
||||||
|
if len(cs) != 1 {
|
||||||
|
t.Fatalf("want 1 anchorable inline comment, got %d", len(cs))
|
||||||
|
}
|
||||||
|
if cs[0].Path != "a.go" || cs[0].NewPosition != 10 {
|
||||||
|
t.Errorf("wrong anchor: %+v", cs[0])
|
||||||
|
}
|
||||||
|
if !strings.Contains(cs[0].Body, "anchored") || !strings.Contains(cs[0].Body, "2 model") {
|
||||||
|
t.Errorf("inline body missing title/agreement: %q", cs[0].Body)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func set(xs ...string) map[string]bool {
|
||||||
|
m := map[string]bool{}
|
||||||
|
for _, x := range xs {
|
||||||
|
m[x] = true
|
||||||
|
}
|
||||||
|
return m
|
||||||
|
}
|
||||||
+137
-10
@@ -9,6 +9,7 @@ import (
|
|||||||
"regexp"
|
"regexp"
|
||||||
"sort"
|
"sort"
|
||||||
"strings"
|
"strings"
|
||||||
|
"sync"
|
||||||
|
|
||||||
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
|
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
|
||||||
)
|
)
|
||||||
@@ -22,6 +23,8 @@ const (
|
|||||||
maxGrepResults = 200 // per grep call
|
maxGrepResults = 200 // per grep call
|
||||||
maxFindResults = 200 // per find_files call
|
maxFindResults = 200 // per find_files call
|
||||||
maxLineLen = 400 // truncate any single returned line to this
|
maxLineLen = 400 // truncate any single returned line to this
|
||||||
|
maxGetDiffLines = 800 // per get_diff call (paginated window)
|
||||||
|
maxGetDiffBytes = 64 * 1024 // per get_diff call
|
||||||
)
|
)
|
||||||
|
|
||||||
// skipDirs are never descended into by grep / find_files — noise and bulk that
|
// skipDirs are never descended into by grep / find_files — noise and bulk that
|
||||||
@@ -40,6 +43,12 @@ type repoFS struct {
|
|||||||
root string // absolute, symlink-resolved repo root
|
root string // absolute, symlink-resolved repo root
|
||||||
diff string // the full PR unified diff (served by get_diff)
|
diff string // the full PR unified diff (served by get_diff)
|
||||||
worker llm.Model // optional cheap model for delegate_investigation; nil = no delegation
|
worker llm.Model // optional cheap model for delegate_investigation; nil = no delegation
|
||||||
|
|
||||||
|
// diffLines caches the split diff so paging through get_diff doesn't re-split
|
||||||
|
// the whole (possibly large) diff on every call. The diff is immutable, so the
|
||||||
|
// cache is computed once and safe to share across concurrent lenses.
|
||||||
|
diffOnce sync.Once
|
||||||
|
diffLines []string
|
||||||
}
|
}
|
||||||
|
|
||||||
// newRepoFS resolves root to an absolute, symlink-free path.
|
// newRepoFS resolves root to an absolute, symlink-free path.
|
||||||
@@ -108,15 +117,24 @@ func (r *repoFS) fsTools() []llm.Tool {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// toolbox builds the reviewer's toolbox: the read-only repo tools, plus the
|
// allTools is the single source of truth for the reviewer's tool set: the
|
||||||
// delegate_investigation tool when a worker model is configured.
|
// read-only repo tools, plus delegate_investigation when a worker model is
|
||||||
func (r *repoFS) toolbox() (*llm.Toolbox, error) {
|
// configured. Both the executus registry (gadflyToolRegistry, the production
|
||||||
box := llm.NewToolbox("gadfly")
|
// path) and toolbox() build from this list.
|
||||||
|
func (r *repoFS) allTools() []llm.Tool {
|
||||||
tools := r.fsTools()
|
tools := r.fsTools()
|
||||||
if r.worker != nil {
|
if r.worker != nil {
|
||||||
tools = append(tools, r.delegateTool())
|
tools = append(tools, r.delegateTool())
|
||||||
}
|
}
|
||||||
for _, t := range tools {
|
return tools
|
||||||
|
}
|
||||||
|
|
||||||
|
// toolbox builds a majordomo toolbox from allTools(). The production review path
|
||||||
|
// now goes through executus's tool.Registry (see executus.go); this remains for
|
||||||
|
// the toolbox-level tests (the `call` helper).
|
||||||
|
func (r *repoFS) toolbox() (*llm.Toolbox, error) {
|
||||||
|
box := llm.NewToolbox("gadfly")
|
||||||
|
for _, t := range r.allTools() {
|
||||||
if err := box.Add(t); err != nil {
|
if err := box.Add(t); err != nil {
|
||||||
return nil, fmt.Errorf("add tool %q: %w", t.Name, err)
|
return nil, fmt.Errorf("add tool %q: %w", t.Name, err)
|
||||||
}
|
}
|
||||||
@@ -396,15 +414,124 @@ func (r *repoFS) findFilesTool() llm.Tool {
|
|||||||
)
|
)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
type getDiffArgs struct {
|
||||||
|
Path string `json:"path,omitempty" description:"Optional changed-file path (e.g. pkg/foo/bar.go); returns ONLY that file's diff hunks. Omit for the whole diff. Use this on a large PR to pull just the file a finding is about."`
|
||||||
|
StartLine int `json:"start_line,omitempty" description:"Optional 1-based line to start from within the (whole or path-scoped) diff (default 1)."`
|
||||||
|
Limit int `json:"limit,omitempty" description:"Optional max number of diff lines to return (default/maximum 800)."`
|
||||||
|
}
|
||||||
|
|
||||||
func (r *repoFS) getDiffTool() llm.Tool {
|
func (r *repoFS) getDiffTool() llm.Tool {
|
||||||
return llm.DefineTool[struct{}](
|
return llm.DefineTool[getDiffArgs](
|
||||||
"get_diff",
|
"get_diff",
|
||||||
"Return the complete unified diff under review. The diff is also included (possibly truncated) in the task message; call this to get the full, untruncated text.",
|
"Return the unified diff under review as a numbered, PAGINATED window (like read_file) — not the whole diff at once, so a huge PR can't blow the context window. Pass `path` to fetch just one changed file's hunks, or `start_line`/`limit` to page through. A truncated copy of the diff is also embedded in the task message.",
|
||||||
func(_ context.Context, _ struct{}) (any, error) {
|
func(_ context.Context, args getDiffArgs) (any, error) {
|
||||||
if strings.TrimSpace(r.diff) == "" {
|
scope := "the diff"
|
||||||
|
var lines []string
|
||||||
|
if p := strings.TrimSpace(args.Path); p != "" {
|
||||||
|
lines = diffLinesForPath(r.diff, p)
|
||||||
|
if len(lines) == 0 {
|
||||||
|
return fmt.Sprintf("(no diff hunks for path %q; check it against the changed-files list — the path must match a file the diff touches)", p), nil
|
||||||
|
}
|
||||||
|
scope = "the diff for " + p
|
||||||
|
} else {
|
||||||
|
lines = r.diffAllLines()
|
||||||
|
if len(lines) == 0 {
|
||||||
return "(empty diff)", nil
|
return "(empty diff)", nil
|
||||||
}
|
}
|
||||||
return r.diff, nil
|
}
|
||||||
|
return windowDiff(lines, scope, args.StartLine, args.Limit), nil
|
||||||
},
|
},
|
||||||
)
|
)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// diffAllLines returns the whole diff split into lines, cached so paging never
|
||||||
|
// re-splits the (possibly large) diff.
|
||||||
|
func (r *repoFS) diffAllLines() []string {
|
||||||
|
r.diffOnce.Do(func() { r.diffLines = splitDiffLines(r.diff) })
|
||||||
|
return r.diffLines
|
||||||
|
}
|
||||||
|
|
||||||
|
// splitDiffLines splits a unified diff into lines, dropping the single trailing
|
||||||
|
// empty element a trailing newline produces — otherwise windowDiff would emit a
|
||||||
|
// blank final line and over-count the total by one.
|
||||||
|
func splitDiffLines(diff string) []string {
|
||||||
|
lines := strings.Split(diff, "\n")
|
||||||
|
if n := len(lines); n > 0 && lines[n-1] == "" {
|
||||||
|
lines = lines[:n-1]
|
||||||
|
}
|
||||||
|
return lines
|
||||||
|
}
|
||||||
|
|
||||||
|
// windowDiff returns a numbered, paginated slice of pre-split diff lines,
|
||||||
|
// mirroring read_file's caps (maxGetDiffLines / maxGetDiffBytes / maxLineLen) so
|
||||||
|
// a single get_diff call can never dump a multi-hundred-KB diff into the
|
||||||
|
// transcript — the amplifier behind the large-PR token burn. The full diff stays
|
||||||
|
// reachable by paging with start_line, or scoped per file via the path arg.
|
||||||
|
func windowDiff(lines []string, scope string, start, limit int) string {
|
||||||
|
total := len(lines)
|
||||||
|
if start < 1 {
|
||||||
|
start = 1
|
||||||
|
}
|
||||||
|
if limit <= 0 || limit > maxGetDiffLines {
|
||||||
|
limit = maxGetDiffLines
|
||||||
|
}
|
||||||
|
if start > total {
|
||||||
|
return fmt.Sprintf("(%s has %d lines; nothing at/after line %d)", scope, total, start)
|
||||||
|
}
|
||||||
|
var b strings.Builder
|
||||||
|
emitted := 0
|
||||||
|
i := start - 1
|
||||||
|
for ; i < total; i++ {
|
||||||
|
if emitted >= limit || b.Len() >= maxGetDiffBytes {
|
||||||
|
break
|
||||||
|
}
|
||||||
|
line := lines[i]
|
||||||
|
if len(line) > maxLineLen {
|
||||||
|
line = line[:maxLineLen] + "…"
|
||||||
|
}
|
||||||
|
fmt.Fprintf(&b, "%d\t%s\n", i+1, line)
|
||||||
|
emitted++
|
||||||
|
}
|
||||||
|
if i < total {
|
||||||
|
// i (0-based) is the first line NOT emitted; line i was the last shown.
|
||||||
|
fmt.Fprintf(&b, "... (%s truncated after line %d of %d; call get_diff again with start_line=%d for the rest, or pass a `path` to scope to one file)\n", scope, i, total, i+1)
|
||||||
|
}
|
||||||
|
return b.String()
|
||||||
|
}
|
||||||
|
|
||||||
|
// diffLinesForPath returns the unified-diff lines for one changed file: from the
|
||||||
|
// `diff --git` header that names path through to the next header (or end). The
|
||||||
|
// header names two path tokens (a/<old> b/<new>); a match is on a WHOLE token,
|
||||||
|
// so path "foo.go" does not pull in "barfoo.go" — and a trailing "/" scopes to a
|
||||||
|
// directory (e.g. "pkg/foo/" matches pkg/foo/bar.go).
|
||||||
|
func diffLinesForPath(diff, path string) []string {
|
||||||
|
want := strings.TrimPrefix(strings.TrimPrefix(strings.TrimSpace(path), "a/"), "b/")
|
||||||
|
var out []string
|
||||||
|
inSection := false
|
||||||
|
for _, ln := range splitDiffLines(diff) {
|
||||||
|
if strings.HasPrefix(ln, "diff --git ") {
|
||||||
|
inSection = diffHeaderNames(ln, want)
|
||||||
|
}
|
||||||
|
if inSection {
|
||||||
|
out = append(out, ln)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return out
|
||||||
|
}
|
||||||
|
|
||||||
|
// diffHeaderNames reports whether a `diff --git a/X b/Y` header names want as one
|
||||||
|
// of its (a/b-stripped) path tokens — exact whole-token match, or a directory
|
||||||
|
// prefix when want ends with "/".
|
||||||
|
func diffHeaderNames(header, want string) bool {
|
||||||
|
fields := strings.Fields(header)
|
||||||
|
if len(fields) < 3 {
|
||||||
|
return false
|
||||||
|
}
|
||||||
|
for _, f := range fields[2:] {
|
||||||
|
p := strings.TrimPrefix(strings.TrimPrefix(f, "a/"), "b/")
|
||||||
|
if p == want || (strings.HasSuffix(want, "/") && strings.HasPrefix(p, want)) {
|
||||||
|
return true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return false
|
||||||
|
}
|
||||||
|
|||||||
@@ -197,15 +197,93 @@ func TestFindFilesTool(t *testing.T) {
|
|||||||
|
|
||||||
func TestGetDiffTool(t *testing.T) {
|
func TestGetDiffTool(t *testing.T) {
|
||||||
root := buildFixtureRepo(t)
|
root := buildFixtureRepo(t)
|
||||||
const diff = "diff --git a/x b/x\n+added line\n"
|
const diff = "diff --git a/x b/x\n--- a/x\n+++ b/x\n+added line\n" +
|
||||||
|
"diff --git a/y.go b/y.go\n--- a/y.go\n+++ b/y.go\n+y change\n"
|
||||||
fs, _ := newRepoFS(root, diff)
|
fs, _ := newRepoFS(root, diff)
|
||||||
|
|
||||||
|
// Default: the whole diff as a NUMBERED window (paginated), not a raw dump —
|
||||||
|
// so a huge PR can't be poured into the transcript in one call.
|
||||||
|
out, err := call(t, fs, "get_diff", map[string]any{})
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("get_diff: %v", err)
|
||||||
|
}
|
||||||
|
if !strings.Contains(out, "1\tdiff --git a/x b/x") {
|
||||||
|
t.Errorf("get_diff should return a numbered window, got:\n%s", out)
|
||||||
|
}
|
||||||
|
if !strings.Contains(out, "+added line") || !strings.Contains(out, "+y change") {
|
||||||
|
t.Errorf("the full (short) diff window should include every hunk, got:\n%s", out)
|
||||||
|
}
|
||||||
|
|
||||||
|
// path filter: only the named file's hunks come back.
|
||||||
|
out, err = call(t, fs, "get_diff", map[string]any{"path": "y.go"})
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("get_diff path: %v", err)
|
||||||
|
}
|
||||||
|
if !strings.Contains(out, "y change") {
|
||||||
|
t.Errorf("get_diff path=y.go should include y's hunk, got:\n%s", out)
|
||||||
|
}
|
||||||
|
if strings.Contains(out, "added line") {
|
||||||
|
t.Errorf("get_diff path=y.go must NOT include x's hunk, got:\n%s", out)
|
||||||
|
}
|
||||||
|
|
||||||
|
// unknown path: a clear note, never an error.
|
||||||
|
out, err = call(t, fs, "get_diff", map[string]any{"path": "nope.txt"})
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("get_diff unknown path: %v", err)
|
||||||
|
}
|
||||||
|
if !strings.Contains(out, "no diff hunks") {
|
||||||
|
t.Errorf("get_diff for an unknown path should note no hunks, got:\n%s", out)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestGetDiffTool_Paginates: a diff longer than the per-call line cap is returned
|
||||||
|
// as a truncated window with a paging hint, and start_line pages past it — the
|
||||||
|
// mechanism that stops get_diff from dumping a multi-hundred-KB diff at once.
|
||||||
|
func TestGetDiffTool_Paginates(t *testing.T) {
|
||||||
|
diff := "diff --git a/big b/big\n" + strings.Repeat("+line\n", maxGetDiffLines+50)
|
||||||
|
fs, _ := newRepoFS(t.TempDir(), diff)
|
||||||
|
|
||||||
out, err := call(t, fs, "get_diff", map[string]any{})
|
out, err := call(t, fs, "get_diff", map[string]any{})
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("get_diff: %v", err)
|
t.Fatalf("get_diff: %v", err)
|
||||||
}
|
}
|
||||||
if out != diff {
|
if !strings.Contains(out, "truncated after line") {
|
||||||
t.Errorf("get_diff returned %q, want %q", out, diff)
|
t.Error("a diff longer than the per-call cap should be truncated with a paging hint")
|
||||||
|
}
|
||||||
|
if strings.Contains(out, "801\t") {
|
||||||
|
t.Error("the first window must stop at the line cap, not reach line 801")
|
||||||
|
}
|
||||||
|
|
||||||
|
out, err = call(t, fs, "get_diff", map[string]any{"start_line": 805})
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("get_diff page: %v", err)
|
||||||
|
}
|
||||||
|
if !strings.Contains(out, "805\t") {
|
||||||
|
t.Error("paging with start_line=805 should include line 805")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestDiffLinesForPath_Anchored: get_diff path= matches on a WHOLE path token,
|
||||||
|
// so "foo.go" never pulls in "barfoo.go" (the unanchored-substring weakness the
|
||||||
|
// swarm flagged), while a trailing "/" still scopes to a directory.
|
||||||
|
func TestDiffLinesForPath_Anchored(t *testing.T) {
|
||||||
|
diff := "diff --git a/foo.go b/foo.go\n+foo change\n" +
|
||||||
|
"diff --git a/barfoo.go b/barfoo.go\n+barfoo change\n" +
|
||||||
|
"diff --git a/pkg/x.go b/pkg/x.go\n+x change\n"
|
||||||
|
|
||||||
|
joined := strings.Join(diffLinesForPath(diff, "foo.go"), "\n")
|
||||||
|
if !strings.Contains(joined, "foo change") {
|
||||||
|
t.Errorf("foo.go should match its own hunk:\n%s", joined)
|
||||||
|
}
|
||||||
|
if strings.Contains(joined, "barfoo change") {
|
||||||
|
t.Errorf("foo.go must NOT match barfoo.go (unanchored substring regression):\n%s", joined)
|
||||||
|
}
|
||||||
|
|
||||||
|
if !strings.Contains(strings.Join(diffLinesForPath(diff, "pkg/"), "\n"), "x change") {
|
||||||
|
t.Error("a trailing-slash path should scope to the directory (pkg/ -> pkg/x.go)")
|
||||||
|
}
|
||||||
|
if len(diffLinesForPath(diff, "nope.go")) != 0 {
|
||||||
|
t.Error("an unknown path should yield no lines")
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
+20
-51
@@ -37,10 +37,11 @@ func lastUserText(req llm.Request) string {
|
|||||||
return req.Messages[len(req.Messages)-1].Text()
|
return req.Messages[len(req.Messages)-1].Text()
|
||||||
}
|
}
|
||||||
|
|
||||||
// TestRunAgent_WrapUpNudgeProducesAnswer: a model that keeps calling tools until
|
// TestReviewExecutor_WrapUpNudgeProducesAnswer: a model that keeps calling tools
|
||||||
// it is nudged to wrap up should still finish inside its budget — the steer
|
// until it is nudged to wrap up should still finish inside its budget — the steer
|
||||||
// message arrives a few steps before the cap and the model writes its answer.
|
// message (delivered by the executus wrap-up critic a few steps before the cap)
|
||||||
func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) {
|
// arrives and the model writes its answer.
|
||||||
|
func TestReviewExecutor_WrapUpNudgeProducesAnswer(t *testing.T) {
|
||||||
t.Setenv("GADFLY_WRAPUP_RESERVE", "4")
|
t.Setenv("GADFLY_WRAPUP_RESERVE", "4")
|
||||||
|
|
||||||
final := "VERDICT: No material issues found."
|
final := "VERDICT: No material issues found."
|
||||||
@@ -60,9 +61,10 @@ func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) {
|
|||||||
}
|
}
|
||||||
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
|
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
|
||||||
|
|
||||||
out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 12)
|
rex := newTestReviewExecutor(t, mdl, fs)
|
||||||
|
out, err := rex.run(context.Background(), "sys", "task", 12)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("runAgent should succeed via wrap-up nudge, got error: %v", err)
|
t.Fatalf("run should succeed via wrap-up nudge, got error: %v", err)
|
||||||
}
|
}
|
||||||
if out != final {
|
if out != final {
|
||||||
t.Errorf("expected final review %q, got %q", final, out)
|
t.Errorf("expected final review %q, got %q", final, out)
|
||||||
@@ -72,24 +74,19 @@ func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// TestRunAgent_FinalizationFallback: a model that ignores the wrap-up nudge and
|
// TestReviewExecutor_ExhaustionWithoutAnswerIsError: a model that ignores the
|
||||||
// spins on tools until the cap should NOT hard-fail — the tool-free finalization
|
// wrap-up nudge and spins on tools until the step cap produces no final answer.
|
||||||
// pass forces a final answer out of the transcript.
|
// The transcript-based forced-finalization fallback was removed in the executus
|
||||||
func TestRunAgent_FinalizationFallback(t *testing.T) {
|
// re-platform (run.Result does not expose the loop transcript), so the pass now
|
||||||
|
// surfaces an error — which reviewWithSpecialist renders as an advisory "reviewer
|
||||||
|
// failed to complete" notice rather than a phantom success.
|
||||||
|
func TestReviewExecutor_ExhaustionWithoutAnswerIsError(t *testing.T) {
|
||||||
t.Setenv("GADFLY_WRAPUP_RESERVE", "2")
|
t.Setenv("GADFLY_WRAPUP_RESERVE", "2")
|
||||||
|
|
||||||
final := "VERDICT: Minor issues\n- something"
|
|
||||||
forcedCalled := false
|
|
||||||
n := 0
|
n := 0
|
||||||
p := fake.New("fake", fake.WithDefault(func(_ string, req llm.Request) fake.Step {
|
p := fake.New("fake", fake.WithDefault(func(_ string, _ llm.Request) fake.Step {
|
||||||
// Only the tool-free finalization pass forbids tools — reply there.
|
|
||||||
if req.ToolChoice == "none" {
|
|
||||||
forcedCalled = true
|
|
||||||
return fake.Reply(final)
|
|
||||||
}
|
|
||||||
// Otherwise keep spinning, ignoring the wrap-up nudge entirely.
|
|
||||||
n++
|
n++
|
||||||
return spinToolCall(n)
|
return spinToolCall(n) // spin forever, ignoring the wrap-up nudge
|
||||||
}))
|
}))
|
||||||
mdl, err := p.Model("mock")
|
mdl, err := p.Model("mock")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
@@ -97,37 +94,9 @@ func TestRunAgent_FinalizationFallback(t *testing.T) {
|
|||||||
}
|
}
|
||||||
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
|
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
|
||||||
|
|
||||||
out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 6)
|
rex := newTestReviewExecutor(t, mdl, fs)
|
||||||
if err != nil {
|
if _, err := rex.run(context.Background(), "sys", "task", 6); err == nil {
|
||||||
t.Fatalf("runAgent should recover via finalization fallback, got error: %v", err)
|
t.Error("run should error when the model exhausts its steps without an answer")
|
||||||
}
|
|
||||||
if !forcedCalled {
|
|
||||||
t.Error("finalization fallback was never invoked")
|
|
||||||
}
|
|
||||||
if out != final {
|
|
||||||
t.Errorf("expected forced final answer %q, got %q", final, out)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// TestRunAgent_FallbackStillEmptyIsError: if even the tool-free finalization
|
|
||||||
// yields nothing, runAgent surfaces an error rather than a phantom success.
|
|
||||||
func TestRunAgent_FallbackStillEmptyIsError(t *testing.T) {
|
|
||||||
n := 0
|
|
||||||
p := fake.New("fake", fake.WithDefault(func(_ string, req llm.Request) fake.Step {
|
|
||||||
if req.ToolChoice == "none" {
|
|
||||||
return fake.Reply(" ") // finalization produces only whitespace
|
|
||||||
}
|
|
||||||
n++
|
|
||||||
return spinToolCall(n)
|
|
||||||
}))
|
|
||||||
mdl, err := p.Model("mock")
|
|
||||||
if err != nil {
|
|
||||||
t.Fatal(err)
|
|
||||||
}
|
|
||||||
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
|
|
||||||
|
|
||||||
if _, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4); err == nil {
|
|
||||||
t.Error("runAgent should error when the finalization fallback also yields no output")
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
+168
-2
@@ -45,6 +45,10 @@
|
|||||||
# GADFLY_FINDINGS_URL optional gadfly-reports store base URL; set to POST the run +
|
# GADFLY_FINDINGS_URL optional gadfly-reports store base URL; set to POST the run +
|
||||||
# findings for model-quality tracking (off when empty)
|
# findings for model-quality tracking (off when empty)
|
||||||
# GADFLY_FINDINGS_TOKEN optional bearer token for the gadfly-reports store
|
# GADFLY_FINDINGS_TOKEN optional bearer token for the gadfly-reports store
|
||||||
|
# GADFLY_CONSOLIDATE cross-model consensus comment: "auto" (default; on for >=2
|
||||||
|
# models), "1" force on, "0" force off (one comment per model)
|
||||||
|
# GADFLY_INLINE_REVIEW when consolidating, also post a COMMENT-state PR review with
|
||||||
|
# inline comments on changed lines (default on; "0" disables)
|
||||||
set -uo pipefail
|
set -uo pipefail
|
||||||
|
|
||||||
# One model by default: the specialist suite already provides breadth, so a
|
# One model by default: the specialist suite already provides breadth, so a
|
||||||
@@ -64,6 +68,29 @@ die() { log "ERROR: $*"; exit 1; }
|
|||||||
|
|
||||||
API() { curl -fsS --connect-timeout 20 --max-time 30 -H "Authorization: token ${GITEA_TOKEN}" "$@"; }
|
API() { curl -fsS --connect-timeout 20 --max-time 30 -H "Authorization: token ${GITEA_TOKEN}" "$@"; }
|
||||||
|
|
||||||
|
# upsert_comment_body MARKER BODY — create or update (by leading MARKER) a single
|
||||||
|
# PR comment. Mirrors run.sh's per-model upsert; used for the consensus comment
|
||||||
|
# and the per-model fallback when consolidation is on.
|
||||||
|
upsert_comment_body() {
|
||||||
|
local marker="$1" body="$2" post_body existing_id="" page=1 cmts
|
||||||
|
post_body="$(jq -n --arg b "$body" '{body:$b}')"
|
||||||
|
while [ "$page" -le 10 ]; do
|
||||||
|
cmts="$(API "${GITEA_API}/issues/${PR}/comments?limit=50&page=${page}" || echo '[]')"
|
||||||
|
[ "$(echo "$cmts" | jq 'length')" = "0" ] && break
|
||||||
|
existing_id="$(echo "$cmts" | jq -r --arg m "$marker" \
|
||||||
|
'.[] | select(.body != null and (.body | startswith($m))) | .id' | head -n1)"
|
||||||
|
[ -n "$existing_id" ] && break
|
||||||
|
page=$((page+1))
|
||||||
|
done
|
||||||
|
if [ -n "$existing_id" ]; then
|
||||||
|
curl -sS --connect-timeout 20 --max-time 30 -X PATCH -H "Authorization: token ${GITEA_TOKEN}" \
|
||||||
|
-H "Content-Type: application/json" "${GITEA_API}/issues/comments/${existing_id}" -d "$post_body" >/dev/null
|
||||||
|
else
|
||||||
|
curl -sS --connect-timeout 20 --max-time 30 -X POST -H "Authorization: token ${GITEA_TOKEN}" \
|
||||||
|
-H "Content-Type: application/json" "${GITEA_API}/issues/${PR}/comments" -d "$post_body" >/dev/null
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
# --- is the commenter allowed to trigger a re-review? ----------------------
|
# --- is the commenter allowed to trigger a re-review? ----------------------
|
||||||
actor_allowed() {
|
actor_allowed() {
|
||||||
local actor="$1"
|
local actor="$1"
|
||||||
@@ -156,6 +183,36 @@ export GADFLY_FINDINGS_TOKEN="${GADFLY_FINDINGS_TOKEN:-}"
|
|||||||
MODELS="${GADFLY_MODELS:-${OLLAMA_REVIEW_MODELS:-$DEFAULT_MODELS}}"
|
MODELS="${GADFLY_MODELS:-${OLLAMA_REVIEW_MODELS:-$DEFAULT_MODELS}}"
|
||||||
DEFAULT_CONC="${GADFLY_CONCURRENCY:-1}"
|
DEFAULT_CONC="${GADFLY_CONCURRENCY:-1}"
|
||||||
|
|
||||||
|
# --- huge-PR downshift ------------------------------------------------------
|
||||||
|
# A very large diff is what burns the model budget: every review step re-sends
|
||||||
|
# it, multiplied across models × lenses × passes × steps (this is what nuked a
|
||||||
|
# whole Ollama Cloud block on one giant PR). entrypoint is the only process that
|
||||||
|
# spans the whole fleet, so the fleet-wide size decision lives here: size the PR
|
||||||
|
# diff ONCE, and above GADFLY_HUGE_DIFF_BYTES collapse to a single cheap model +
|
||||||
|
# a focused lens subset, fewer steps, no recheck, and a smaller embedded diff.
|
||||||
|
# A finished shallow review beats a budget-nuking one. All knobs override; set
|
||||||
|
# GADFLY_HUGE_DIFF_BYTES=0 to disable. Small PRs are never touched.
|
||||||
|
HUGE_PR=0
|
||||||
|
HUGE_DIFF_BYTES="${GADFLY_HUGE_DIFF_BYTES:-600000}"
|
||||||
|
if [ "$HUGE_DIFF_BYTES" -gt 0 ] 2>/dev/null; then
|
||||||
|
PR_DIFF_BYTES="$(API "${GITEA_API}/pulls/${PR}.diff" 2>/dev/null | wc -c | tr -d '[:space:]')"
|
||||||
|
[ -z "$PR_DIFF_BYTES" ] && PR_DIFF_BYTES=0
|
||||||
|
if [ "$PR_DIFF_BYTES" -gt "$HUGE_DIFF_BYTES" ] 2>/dev/null; then
|
||||||
|
HUGE_PR=1
|
||||||
|
log "huge PR: diff ${PR_DIFF_BYTES}B > ${HUGE_DIFF_BYTES}B — downshifting the fleet (advisory)"
|
||||||
|
MODELS="${GADFLY_HUGE_DIFF_MODELS:-${MODELS%%,*}}" # first model only by default
|
||||||
|
export GADFLY_SPECIALISTS="${GADFLY_HUGE_DIFF_SPECIALISTS:-security,correctness,error-handling}"
|
||||||
|
export GADFLY_MAX_STEPS="${GADFLY_HUGE_DIFF_MAX_STEPS:-12}"
|
||||||
|
export GADFLY_RECHECK_MAX_STEPS="${GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS:-8}"
|
||||||
|
export GADFLY_RECHECK="${GADFLY_HUGE_DIFF_RECHECK:-0}" # skip recheck on huge PRs
|
||||||
|
# The Go-visible name directly (run.sh prefers GADFLY_MAX_DIFF_CHARS over its
|
||||||
|
# own MAX_DIFF_CHARS), so the cap is honored without relying on run.sh's alias.
|
||||||
|
export GADFLY_MAX_DIFF_CHARS="${GADFLY_HUGE_DIFF_MAX_DIFF_CHARS:-20000}"
|
||||||
|
# Surfaced on each posted comment so the shallower review is self-explaining.
|
||||||
|
export GADFLY_NOTICE="⚠️ Large PR (${PR_DIFF_BYTES} bytes): Gadfly downshifted to a focused, single-model review to stay within budget — coverage is intentionally shallower. Consider splitting the PR for a deeper review."
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
provider_of() { case "$1" in */*) echo "${1%%/*}";; *) echo "${GADFLY_PROVIDER:-ollama-cloud}";; esac; }
|
provider_of() { case "$1" in */*) echo "${1%%/*}";; *) echo "${GADFLY_PROVIDER:-ollama-cloud}";; esac; }
|
||||||
|
|
||||||
# Per-model status file path for the live board. The model id can contain '/'
|
# Per-model status file path for the live board. The model id can contain '/'
|
||||||
@@ -176,10 +233,11 @@ provider_cap() { # provider -> concurrency (override map "p=N,...", else default
|
|||||||
}
|
}
|
||||||
|
|
||||||
review_one() {
|
review_one() {
|
||||||
local sf=""
|
local sf="" ff=""
|
||||||
[ "${GADFLY_STATUS_BOARD:-1}" != "0" ] && sf="$(status_file_for "$1")"
|
[ "${GADFLY_STATUS_BOARD:-1}" != "0" ] && sf="$(status_file_for "$1")"
|
||||||
|
[ "$CONSOLIDATE" = "1" ] && ff="$(findings_file_for "$1")"
|
||||||
PROVIDER=ollama MODEL="$1" GADFLY_BIN="/usr/local/bin/gadfly" GADFLY_REPO_DIR="$REPO_DIR" \
|
PROVIDER=ollama MODEL="$1" GADFLY_BIN="/usr/local/bin/gadfly" GADFLY_REPO_DIR="$REPO_DIR" \
|
||||||
GADFLY_STATUS_FILE="$sf" \
|
GADFLY_STATUS_FILE="$sf" GADFLY_FINDINGS_OUT="$ff" GADFLY_CONSOLIDATE="$CONSOLIDATE" \
|
||||||
bash "${SCRIPTS_DIR}/run.sh" || log "model $1 failed (continuing)"
|
bash "${SCRIPTS_DIR}/run.sh" || log "model $1 failed (continuing)"
|
||||||
# If the binary never wrote real status (run.sh skipped it: empty diff, no key,
|
# If the binary never wrote real status (run.sh skipped it: empty diff, no key,
|
||||||
# binary missing), the pre-seed stays {started:0, done:false} and the board
|
# binary missing), the pre-seed stays {started:0, done:false} and the board
|
||||||
@@ -196,6 +254,34 @@ IFS=',' read -ra _raw <<< "$MODELS" || true
|
|||||||
MODEL_LIST=()
|
MODEL_LIST=()
|
||||||
for raw in "${_raw[@]}"; do m="$(echo "$raw" | tr -d '[:space:]')"; [ -n "$m" ] && MODEL_LIST+=("$m"); done
|
for raw in "${_raw[@]}"; do m="$(echo "$raw" | tr -d '[:space:]')"; [ -n "$m" ] && MODEL_LIST+=("$m"); done
|
||||||
|
|
||||||
|
# --- cross-model consolidation decision ------------------------------------
|
||||||
|
# With >=2 models, post ONE consensus comment (findings clustered + ranked by
|
||||||
|
# cross-model agreement) instead of N per-model walls of prose. Each model writes
|
||||||
|
# its findings to FINDINGS_DIR; a final pass (the binary in GADFLY_CONSOLIDATE_DIR
|
||||||
|
# mode) renders the consensus comment. GADFLY_CONSOLIDATE: "auto" (default; on for
|
||||||
|
# >=2 models), "1" force on, "0" force off (keep per-model comments).
|
||||||
|
FINDINGS_DIR="${WORKDIR}/findings"
|
||||||
|
CONSOLIDATE=0
|
||||||
|
case "${GADFLY_CONSOLIDATE:-auto}" in
|
||||||
|
1) CONSOLIDATE=1 ;;
|
||||||
|
0) CONSOLIDATE=0 ;;
|
||||||
|
*) [ "${#MODEL_LIST[@]}" -ge 2 ] && CONSOLIDATE=1 ;;
|
||||||
|
esac
|
||||||
|
# A model spec can contain '/' and ':' (e.g. claude-code/opus, qwen3:14b), so
|
||||||
|
# sanitize to a flat filename — but append a checksum of the raw spec so two
|
||||||
|
# specs that sanitize the same (foo:bar vs foo/bar -> foo_bar) don't collide onto
|
||||||
|
# one file and silently drop a model from the consensus.
|
||||||
|
findings_file_for() {
|
||||||
|
local safe sum
|
||||||
|
safe="$(echo "$1" | tr -c '[:alnum:]._-' '_')"
|
||||||
|
sum="$(printf '%s' "$1" | cksum | cut -d' ' -f1)"
|
||||||
|
echo "${FINDINGS_DIR}/${safe}-${sum}.json"
|
||||||
|
}
|
||||||
|
if [ "$CONSOLIDATE" = "1" ]; then
|
||||||
|
rm -rf "$FINDINGS_DIR"; mkdir -p "$FINDINGS_DIR"
|
||||||
|
log "consolidation ON: ${#MODEL_LIST[@]} models -> one consensus comment"
|
||||||
|
fi
|
||||||
|
|
||||||
# Distinct providers, in first-seen order (no associative arrays — portable).
|
# Distinct providers, in first-seen order (no associative arrays — portable).
|
||||||
PROVIDERS=""
|
PROVIDERS=""
|
||||||
for m in "${MODEL_LIST[@]}"; do
|
for m in "${MODEL_LIST[@]}"; do
|
||||||
@@ -241,6 +327,32 @@ if [ "${GADFLY_STATUS_BOARD:-1}" != "0" ]; then
|
|||||||
log "status board started (pid ${BOARD_PID})"
|
log "status board started (pid ${BOARD_PID})"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
# --- swarm-wide hard backstop ----------------------------------------------
|
||||||
|
# A wall-clock ceiling across the WHOLE fleet, so a pathological PR can never
|
||||||
|
# drain the usage block however the models behave. entrypoint is the only
|
||||||
|
# process spanning every model, so a single "never exceed X" guard lives here.
|
||||||
|
# On expiry it stops the review subtrees (the binary + run.sh); whatever partial
|
||||||
|
# findings were gathered are still posted and the job never fails (advisory).
|
||||||
|
# GADFLY_PR_BUDGET_SECS=0 (default) disables it.
|
||||||
|
KILLER_PID=""
|
||||||
|
rm -f "${WORKDIR}/.budget_killed" "${WORKDIR}/.disarmed" 2>/dev/null || true
|
||||||
|
if [ "${GADFLY_PR_BUDGET_SECS:-0}" -gt 0 ] 2>/dev/null; then
|
||||||
|
(
|
||||||
|
sleep "${GADFLY_PR_BUDGET_SECS}"
|
||||||
|
log "PR wall-clock budget (${GADFLY_PR_BUDGET_SECS}s) reached — stopping the review fleet (advisory; partial findings still posted)"
|
||||||
|
: > "${WORKDIR}/.budget_killed"
|
||||||
|
pkill -TERM -f '/usr/local/bin/gadfly' 2>/dev/null || true
|
||||||
|
pkill -TERM -f "${SCRIPTS_DIR}/run.sh" 2>/dev/null || true
|
||||||
|
sleep 5
|
||||||
|
# Guard the delayed SIGKILL on the disarm marker: once the lanes finished and
|
||||||
|
# the watchdog was disarmed, the consolidation gadfly pass runs next, and a
|
||||||
|
# name-based KILL here must NOT catch it.
|
||||||
|
[ -f "${WORKDIR}/.disarmed" ] || pkill -KILL -f '/usr/local/bin/gadfly' 2>/dev/null || true
|
||||||
|
) &
|
||||||
|
KILLER_PID=$!
|
||||||
|
log "PR budget watchdog armed (${GADFLY_PR_BUDGET_SECS}s, pid ${KILLER_PID})"
|
||||||
|
fi
|
||||||
|
|
||||||
log "providers: ${PROVIDERS:-none}"
|
log "providers: ${PROVIDERS:-none}"
|
||||||
# Each provider lane runs in parallel; cap is enforced within each lane. Track
|
# Each provider lane runs in parallel; cap is enforced within each lane. Track
|
||||||
# the lane PIDs so we wait ONLY for the review work — not the status board,
|
# the lane PIDs so we wait ONLY for the review work — not the status board,
|
||||||
@@ -252,9 +364,63 @@ for p in $PROVIDERS; do
|
|||||||
done
|
done
|
||||||
[ "${#LANE_PIDS[@]}" -gt 0 ] && wait "${LANE_PIDS[@]}"
|
[ "${#LANE_PIDS[@]}" -gt 0 ] && wait "${LANE_PIDS[@]}"
|
||||||
|
|
||||||
|
# Reviews finished (or the watchdog killed them): disarm the watchdog so its
|
||||||
|
# delayed SIGKILL can't catch the consolidation pass that runs next. Drop the
|
||||||
|
# disarm marker FIRST so even a racing watchdog that already reached its KILL line
|
||||||
|
# skips it (the kill below also tears the watchdog subshell down during its sleep).
|
||||||
|
if [ -n "$KILLER_PID" ]; then
|
||||||
|
: > "${WORKDIR}/.disarmed"
|
||||||
|
kill "$KILLER_PID" 2>/dev/null || true
|
||||||
|
fi
|
||||||
|
|
||||||
|
# If the backstop fired, note it on the consensus comment (per-model comments
|
||||||
|
# were already posted during the run; a killed model surfaces as a failed lane).
|
||||||
|
if [ -f "${WORKDIR}/.budget_killed" ]; then
|
||||||
|
export GADFLY_NOTICE="${GADFLY_NOTICE:+${GADFLY_NOTICE} }⏱️ This review was stopped early by the per-PR time budget (GADFLY_PR_BUDGET_SECS); findings are partial."
|
||||||
|
fi
|
||||||
|
|
||||||
# Reviews are done: signal the board to render the final state once and exit.
|
# Reviews are done: signal the board to render the final state once and exit.
|
||||||
if [ -n "$BOARD_PID" ]; then
|
if [ -n "$BOARD_PID" ]; then
|
||||||
touch "${STATUS_DIR}/.done" 2>/dev/null || true
|
touch "${STATUS_DIR}/.done" 2>/dev/null || true
|
||||||
wait "$BOARD_PID" 2>/dev/null || true
|
wait "$BOARD_PID" 2>/dev/null || true
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
# --- cross-model consensus comment -----------------------------------------
|
||||||
|
# Render ONE consensus comment from the per-model findings the swarm wrote. This
|
||||||
|
# is advisory and best-effort: if the consolidation pass produces nothing, fall
|
||||||
|
# back to posting each model's review as its own comment (the per-model comments
|
||||||
|
# were suppressed during the run), so a consolidation hiccup never loses output.
|
||||||
|
if [ "$CONSOLIDATE" = "1" ]; then
|
||||||
|
n_files="$(ls -1 "${FINDINGS_DIR}"/*.json 2>/dev/null | wc -l | tr -d '[:space:]')"
|
||||||
|
log "consolidating findings from ${n_files} model(s)"
|
||||||
|
# Fetch the PR diff so the consolidator can also post an inline PR review,
|
||||||
|
# anchoring findings to changed lines (GADFLY_DIFF_FILE). GITEA_API/GITEA_TOKEN/
|
||||||
|
# GADFLY_PR are already in the binary's environment. Best-effort: an empty diff
|
||||||
|
# just means no inline review.
|
||||||
|
DIFF_FILE="${WORKDIR}/pr.diff"
|
||||||
|
API "${GITEA_API}/pulls/${PR}.diff" > "$DIFF_FILE" 2>/dev/null || true
|
||||||
|
CONSENSUS="$(GADFLY_CONSOLIDATE_DIR="$FINDINGS_DIR" GADFLY_DIFF_FILE="$DIFF_FILE" \
|
||||||
|
/usr/local/bin/gadfly 2>"${WORKDIR}/consolidate.err" || true)"
|
||||||
|
if [ -n "$CONSENSUS" ]; then
|
||||||
|
NOTICE_BLOCK=""
|
||||||
|
[ -n "${GADFLY_NOTICE:-}" ] && NOTICE_BLOCK="> ${GADFLY_NOTICE}"$'\n\n'
|
||||||
|
BODY="$(printf '%s%s\n\n<sub>Automated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.</sub>' "$NOTICE_BLOCK" "$CONSENSUS")"
|
||||||
|
upsert_comment_body "<!-- gadfly-consensus -->" "$BODY"
|
||||||
|
log "consensus comment posted"
|
||||||
|
else
|
||||||
|
log "consolidation produced no output; falling back to per-model comments"
|
||||||
|
log "$(tail -c 500 "${WORKDIR}/consolidate.err" 2>/dev/null)"
|
||||||
|
for f in "${FINDINGS_DIR}"/*.json; do
|
||||||
|
[ -f "$f" ] || continue
|
||||||
|
m="$(jq -r '.model // ""' "$f" 2>/dev/null)"
|
||||||
|
[ -z "$m" ] && continue
|
||||||
|
prov="$(jq -r '.provider // ""' "$f" 2>/dev/null)"
|
||||||
|
md="$(jq -r '.markdown // ""' "$f" 2>/dev/null)"
|
||||||
|
marker="<!-- gadfly-review:ollama:${m} -->"
|
||||||
|
body="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s\n\n<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge.</sub>' \
|
||||||
|
"$marker" "$m" "$prov" "$md")"
|
||||||
|
upsert_comment_body "$marker" "$body"
|
||||||
|
done
|
||||||
|
fi
|
||||||
|
fi
|
||||||
log "done"
|
log "done"
|
||||||
|
|||||||
@@ -14,10 +14,11 @@
|
|||||||
# Forward ONLY the secrets the reviewer uses (least privilege) — see the
|
# Forward ONLY the secrets the reviewer uses (least privilege) — see the
|
||||||
# `secrets:` block below. GITEA_TOKEN is automatic. `secrets: inherit` also works
|
# `secrets:` block below. GITEA_TOKEN is automatic. `secrets: inherit` also works
|
||||||
# but hands the reusable EVERY secret in your repo (registry/deploy/db creds the
|
# but hands the reusable EVERY secret in your repo (registry/deploy/db creds the
|
||||||
# review never touches), so prefer the explicit form. Pin @<ref>: use the @v1
|
# review never touches), so prefer the explicit form. Pin to an immutable
|
||||||
# release tag (a curated pointer moved on deliberate releases) for auto-updating
|
# @<sha>: long-lived act_runners CACHE the reusable by ref, so a moved tag (@v1)
|
||||||
# stability, or a full @<sha> for an immutable pin. Avoid @main — it moves on
|
# or @main is often not re-fetched and silently runs a stale copy. Bump the @<sha>
|
||||||
# every push and would change what runs with your forwarded secrets.
|
# to adopt a structural change; routine swarm tuning rides owner variables (see
|
||||||
|
# the gadfly README "Central config via variables") with no re-pin needed.
|
||||||
#
|
#
|
||||||
# For custom named endpoints (GADFLY_ENDPOINT_<NAME>) or a provider the reusable
|
# For custom named endpoints (GADFLY_ENDPOINT_<NAME>) or a provider the reusable
|
||||||
# doesn't map, use the full stub in adversarial-review.yml instead.
|
# doesn't map, use the full stub in adversarial-review.yml instead.
|
||||||
@@ -49,8 +50,8 @@ jobs:
|
|||||||
if: >-
|
if: >-
|
||||||
github.event_name != 'issue_comment'
|
github.event_name != 'issue_comment'
|
||||||
|| (github.event.issue.pull_request && github.actor == 'your-username')
|
|| (github.event.issue.pull_request && github.actor == 'your-username')
|
||||||
# @v1 = curated release tag (auto-updates on releases); swap for a full @<sha>
|
# Pin to an immutable @<sha> (runners cache the ref, so @v1/@main can run
|
||||||
# if you want an immutable pin. Don't use @main (moves on every push).
|
# stale). Bump it for structural changes; tune the swarm via owner variables.
|
||||||
uses: steve/gadfly/.gitea/workflows/review-reusable.yml@v1
|
uses: steve/gadfly/.gitea/workflows/review-reusable.yml@v1
|
||||||
# Forward ONLY what the reviewer needs. Add provider keys you use
|
# Forward ONLY what the reviewer needs. Add provider keys you use
|
||||||
# (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, GADFLY_API_KEY) and/or
|
# (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, GADFLY_API_KEY) and/or
|
||||||
|
|||||||
@@ -3,6 +3,7 @@ module gitea.stevedudenhoeffer.com/steve/gadfly
|
|||||||
go 1.26.2
|
go 1.26.2
|
||||||
|
|
||||||
require (
|
require (
|
||||||
|
gitea.stevedudenhoeffer.com/steve/executus v0.1.4
|
||||||
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462
|
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462
|
||||||
gopkg.in/yaml.v3 v3.0.1
|
gopkg.in/yaml.v3 v3.0.1
|
||||||
)
|
)
|
||||||
@@ -17,6 +18,7 @@ require (
|
|||||||
github.com/go-logr/stdr v1.2.2 // indirect
|
github.com/go-logr/stdr v1.2.2 // indirect
|
||||||
github.com/google/go-cmp v0.7.0 // indirect
|
github.com/google/go-cmp v0.7.0 // indirect
|
||||||
github.com/google/s2a-go v0.1.9 // indirect
|
github.com/google/s2a-go v0.1.9 // indirect
|
||||||
|
github.com/google/uuid v1.6.0 // indirect
|
||||||
github.com/googleapis/enterprise-certificate-proxy v0.3.17 // indirect
|
github.com/googleapis/enterprise-certificate-proxy v0.3.17 // indirect
|
||||||
github.com/googleapis/gax-go/v2 v2.22.0 // indirect
|
github.com/googleapis/gax-go/v2 v2.22.0 // indirect
|
||||||
github.com/gorilla/websocket v1.5.3 // indirect
|
github.com/gorilla/websocket v1.5.3 // indirect
|
||||||
|
|||||||
@@ -4,6 +4,8 @@ cloud.google.com/go/auth v0.20.0 h1:kXTssoVb4azsVDoUiF8KvxAqrsQcQtB53DcSgta74CA=
|
|||||||
cloud.google.com/go/auth v0.20.0/go.mod h1:942/yi/itH1SsmpyrbnTMDgGfdy2BUqIKyd0cyYLc5Q=
|
cloud.google.com/go/auth v0.20.0/go.mod h1:942/yi/itH1SsmpyrbnTMDgGfdy2BUqIKyd0cyYLc5Q=
|
||||||
cloud.google.com/go/compute/metadata v0.9.0 h1:pDUj4QMoPejqq20dK0Pg2N4yG9zIkYGdBtwLoEkH9Zs=
|
cloud.google.com/go/compute/metadata v0.9.0 h1:pDUj4QMoPejqq20dK0Pg2N4yG9zIkYGdBtwLoEkH9Zs=
|
||||||
cloud.google.com/go/compute/metadata v0.9.0/go.mod h1:E0bWwX5wTnLPedCKqk3pJmVgCBSM6qQI1yTBdEb3C10=
|
cloud.google.com/go/compute/metadata v0.9.0/go.mod h1:E0bWwX5wTnLPedCKqk3pJmVgCBSM6qQI1yTBdEb3C10=
|
||||||
|
gitea.stevedudenhoeffer.com/steve/executus v0.1.4 h1:4F99uCV3OVaE9ITFp0FjPiYxLUQO+WpE+wU2HCnpXNM=
|
||||||
|
gitea.stevedudenhoeffer.com/steve/executus v0.1.4/go.mod h1:WQP/lH+meU06OSNF0TQO/wQLcJCrMwpi0EMj5vSpVtk=
|
||||||
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462 h1:1crjE1YkWHLZ91tUDOxN/Y5cuOnJ56e0U9UADoFfEPY=
|
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462 h1:1crjE1YkWHLZ91tUDOxN/Y5cuOnJ56e0U9UADoFfEPY=
|
||||||
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462/go.mod h1:UZLveG17SmENt4sne2RSLIbioix30RZbRIQUzBAnOyY=
|
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462/go.mod h1:UZLveG17SmENt4sne2RSLIbioix30RZbRIQUzBAnOyY=
|
||||||
github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
|
github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
|
||||||
|
|||||||
+35
-4
@@ -50,6 +50,13 @@ MAX_DIFF_CHARS="${MAX_DIFF_CHARS:-60000}"
|
|||||||
MARKER="<!-- gadfly-review:${PROVIDER}:${MODEL} -->"
|
MARKER="<!-- gadfly-review:${PROVIDER}:${MODEL} -->"
|
||||||
say() { echo "[gadfly-review:${PROVIDER}:${MODEL}] $*" >&2; }
|
say() { echo "[gadfly-review:${PROVIDER}:${MODEL}] $*" >&2; }
|
||||||
|
|
||||||
|
# When the swarm is consolidating (GADFLY_CONSOLIDATE=1, set by entrypoint.sh for
|
||||||
|
# a multi-model run), this model does NOT post its own comment — it writes its
|
||||||
|
# findings to GADFLY_FINDINGS_OUT and a single cross-model consensus comment is
|
||||||
|
# posted after the whole swarm finishes. Live progress still shows on the status
|
||||||
|
# board. Default 0 (post a per-model comment, the standalone behavior).
|
||||||
|
CONSOLIDATE="${GADFLY_CONSOLIDATE:-0}"
|
||||||
|
|
||||||
# Display the model's ACTUAL backend: the provider segment of the spec
|
# Display the model's ACTUAL backend: the provider segment of the spec
|
||||||
# ("m1pro/qwen3.6:35b-mlx" -> "m1pro"); a bare id uses GADFLY_PROVIDER (default
|
# ("m1pro/qwen3.6:35b-mlx" -> "m1pro"); a bare id uses GADFLY_PROVIDER (default
|
||||||
# ollama-cloud). This is what the comment header shows, not the run.sh lane.
|
# ollama-cloud). This is what the comment header shows, not the run.sh lane.
|
||||||
@@ -126,7 +133,9 @@ USR="$(printf 'PR #%s: %s\n\nDescription:\n%s\n\nUnified diff to review:\n```dif
|
|||||||
# --- announce start (placeholder comment) -----------------------------------
|
# --- announce start (placeholder comment) -----------------------------------
|
||||||
START_TS="$(date +%s)"
|
START_TS="$(date +%s)"
|
||||||
say "starting review with ${MODEL}"
|
say "starting review with ${MODEL}"
|
||||||
upsert_comment "$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n⏳ Reviewing… this comment will update with findings and run time.' \
|
# Skip the per-model placeholder when consolidating (the consensus comment is
|
||||||
|
# posted later; live progress is on the status board).
|
||||||
|
[ "$CONSOLIDATE" = "1" ] || upsert_comment "$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n⏳ Reviewing… this comment will update with findings and run time.' \
|
||||||
"$MARKER" "$MODEL" "$MODEL_PROVIDER")"
|
"$MARKER" "$MODEL" "$MODEL_PROVIDER")"
|
||||||
|
|
||||||
# --- call the model ---------------------------------------------------------
|
# --- call the model ---------------------------------------------------------
|
||||||
@@ -168,8 +177,9 @@ case "$PROVIDER" in
|
|||||||
GADFLY_SYSTEM_FILE="${SCRIPT_DIR}/system-prompt.txt" \
|
GADFLY_SYSTEM_FILE="${SCRIPT_DIR}/system-prompt.txt" \
|
||||||
GADFLY_TITLE="$TITLE" \
|
GADFLY_TITLE="$TITLE" \
|
||||||
GADFLY_BODY="$BODY" \
|
GADFLY_BODY="$BODY" \
|
||||||
GADFLY_MAX_DIFF_CHARS="$MAX_DIFF_CHARS" \
|
GADFLY_MAX_DIFF_CHARS="${GADFLY_MAX_DIFF_CHARS:-$MAX_DIFF_CHARS}" \
|
||||||
GADFLY_STATUS_FILE="${GADFLY_STATUS_FILE:-}" \
|
GADFLY_STATUS_FILE="${GADFLY_STATUS_FILE:-}" \
|
||||||
|
GADFLY_FINDINGS_OUT="${GADFLY_FINDINGS_OUT:-}" \
|
||||||
"$BIN" 2>"$ERR_FILE"
|
"$BIN" 2>"$ERR_FILE"
|
||||||
)"
|
)"
|
||||||
rc=$?
|
rc=$?
|
||||||
@@ -204,7 +214,28 @@ esac
|
|||||||
# --- assemble + post final comment (with run time) --------------------------
|
# --- assemble + post final comment (with run time) --------------------------
|
||||||
ELAPSED="$(( $(date +%s) - START_TS ))"
|
ELAPSED="$(( $(date +%s) - START_TS ))"
|
||||||
DUR="$(fmt_duration "$ELAPSED")"
|
DUR="$(fmt_duration "$ELAPSED")"
|
||||||
COMMENT="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s\n\n<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in %s</sub>' \
|
|
||||||
"$MARKER" "$MODEL" "$MODEL_PROVIDER" "$REVIEW" "$DUR")"
|
# Consolidating: the binary writes its findings file on success. If it failed or
|
||||||
|
# was skipped (no file, or an empty one), write a stub so this model still shows
|
||||||
|
# up in the consensus (as failed) and an all-models-fail run still posts a
|
||||||
|
# comment — never silently drop a model or the whole review.
|
||||||
|
if [ "$CONSOLIDATE" = "1" ] && [ -n "${GADFLY_FINDINGS_OUT:-}" ] && [ ! -s "${GADFLY_FINDINGS_OUT}" ]; then
|
||||||
|
jq -n --arg model "$MODEL" --arg provider "$MODEL_PROVIDER" --arg md "$REVIEW" \
|
||||||
|
'{model:$model, provider:$provider, verdict:"reviewer failed", errored:true, markdown:$md, findings:[]}' \
|
||||||
|
> "${GADFLY_FINDINGS_OUT}" 2>/dev/null || true
|
||||||
|
fi
|
||||||
|
# When consolidating, the binary has written this model's findings to
|
||||||
|
# GADFLY_FINDINGS_OUT; the consensus comment is posted by entrypoint.sh after the
|
||||||
|
# whole swarm finishes, so this model posts no comment of its own.
|
||||||
|
if [ "$CONSOLIDATE" = "1" ]; then
|
||||||
|
say "done in ${DUR} (consolidated; no per-model comment)"
|
||||||
|
else
|
||||||
|
# An optional one-line notice (e.g. entrypoint's huge-PR downshift advisory),
|
||||||
|
# shown under the header so a shallower review is self-explaining.
|
||||||
|
NOTICE_BLOCK=""
|
||||||
|
[ -n "${GADFLY_NOTICE:-}" ] && NOTICE_BLOCK="> ${GADFLY_NOTICE}"$'\n\n'
|
||||||
|
COMMENT="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s%s\n\n<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in %s</sub>' \
|
||||||
|
"$MARKER" "$MODEL" "$MODEL_PROVIDER" "$NOTICE_BLOCK" "$REVIEW" "$DUR")"
|
||||||
upsert_comment "$COMMENT"
|
upsert_comment "$COMMENT"
|
||||||
say "done in ${DUR}"
|
say "done in ${DUR}"
|
||||||
|
fi
|
||||||
|
|||||||
@@ -16,7 +16,9 @@ state. USE THEM to verify before you report. Do not review the diff in isolation
|
|||||||
- list_dir([path]) — list a directory.
|
- list_dir([path]) — list a directory.
|
||||||
- grep(pattern[, path, max_results]) — RE2 regex search across the repo.
|
- grep(pattern[, path, max_results]) — RE2 regex search across the repo.
|
||||||
- find_files(name[, max_results]) — locate a file by path substring.
|
- find_files(name[, max_results]) — locate a file by path substring.
|
||||||
- get_diff() — the full unified diff (the task message may truncate it).
|
- get_diff([path, start_line, limit]) — the unified diff as a paginated, numbered window;
|
||||||
|
pass `path` to fetch just one changed file's hunks (do this on a big PR instead of pulling
|
||||||
|
the whole diff at once).
|
||||||
|
|
||||||
Mandatory verification discipline — this is the whole point of giving you tools:
|
Mandatory verification discipline — this is the whole point of giving you tools:
|
||||||
- Before claiming a missing/duplicate import, an undefined symbol, a wrong signature,
|
- Before claiming a missing/duplicate import, an undefined symbol, a wrong signature,
|
||||||
@@ -43,3 +45,21 @@ Output rules:
|
|||||||
- Only report issues you are reasonably confident are real after checking. If the diff
|
- Only report issues you are reasonably confident are real after checking. If the diff
|
||||||
is clean, say so plainly rather than inventing nits.
|
is clean, say so plainly rather than inventing nits.
|
||||||
- When you are done investigating, STOP calling tools and reply with the final review.
|
- When you are done investigating, STOP calling tools and reply with the final review.
|
||||||
|
|
||||||
|
Machine-readable findings — AFTER the prose review, append ONE fenced code block,
|
||||||
|
tagged `gadfly-findings`, holding a JSON array of the SAME findings you described above
|
||||||
|
(this block is consumed by tooling and hidden from the rendered comment):
|
||||||
|
|
||||||
|
```gadfly-findings
|
||||||
|
[
|
||||||
|
{"file": "path/to/file.go", "line": 123, "severity": "high", "confidence": "high", "title": "one-line summary of the issue"}
|
||||||
|
]
|
||||||
|
```
|
||||||
|
|
||||||
|
- One object per real finding, in the same order as your prose. `file`/`line` must be a
|
||||||
|
concrete location you verified (the line the issue is at). `severity` is one of
|
||||||
|
`critical`, `high`, `medium`, `small`, `trivial`. `confidence` is your post-verification
|
||||||
|
confidence the issue is real: one of `high`, `medium`, `low`.
|
||||||
|
- Include ONLY genuine problems — never verification notes ("confirmed X is safe at f:line"),
|
||||||
|
and never an "Outside my lens:" aside. If your lens is clean, emit an empty array `[]`.
|
||||||
|
- This block is in ADDITION to the prose; do not drop the human-readable findings.
|
||||||
|
|||||||
Reference in New Issue
Block a user