feat: re-platform agentic review onto executus + large-PR cost controls #20
@@ -22,15 +22,21 @@ verifies each one against the actual code, and posts its findings as a comment.
|
||||
4. **Provider-agnostic.** Powered by [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo),
|
||||
so it can target Ollama (local/cloud), OpenAI, Anthropic, Google, or any
|
||||
OpenAI/Ollama-compatible endpoint. Don't re-hardcode a single provider.
|
||||
5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib + majordomo. Keep
|
||||
it that way — no heavyweight deps, no coupling to any one consumer repo (e.g. mort).
|
||||
5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib, majordomo, and
|
||||
[executus](https://gitea.stevedudenhoeffer.com/steve/executus) (whose *core* — `run`/`compact`/
|
||||
`model`/`fanout`/`tool` — is itself majordomo+stdlib only, so the binary stays static; do NOT
|
||||
pull executus's `contrib/store` or any battery that drags in a DB driver). No heavyweight deps,
|
||||
no coupling to any one consumer repo (e.g. mort). Gadfly is executus's canonical *light* consumer.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
cmd/gadfly/ the reviewer binary — pure producer of review markdown (stdout)
|
||||
main.go orchestration: loop specialists, each a review pass + adversarial recheck
|
||||
engine.go reviewEngine abstraction: majordomo agent loop vs claude-code CLI shell-out
|
||||
main.go orchestration: fan specialists out (executus/fanout), each a review pass + recheck
|
||||
engine.go reviewEngine abstraction: executus run.Executor (majordomo agent loop +
|
||||
compaction/bounding/budget/critic) vs claude-code CLI shell-out
|
||||
executus.go executus wiring: tool.Registry over the repo tools, the run.Executor build
|
||||
(compact + model context-limit threshold + per-PR budget + wrap-up critic)
|
||||
specialists.go specialist lenses: built-ins, default suite, env + .gadfly.yml resolution
|
||||
auto.go dynamic `auto` selection: a selector model picks lenses per-diff (may invent)
|
||||
delegate.go worker-tier delegate_investigation tool (cheap sub-agent does legwork)
|
||||
@@ -69,7 +75,7 @@ verdict. Verdict is one of: `No material issues found` / `Minor issues` / `Block
|
||||
## Build / test
|
||||
|
||||
```sh
|
||||
go build ./cmd/gadfly # needs read access to the private majordomo module
|
||||
go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
|
||||
go test ./...
|
||||
gofmt -l cmd/ # must be clean
|
||||
docker build -t gadfly:dev --secret id=REGISTRY_USER,env=REGISTRY_USER --secret id=REGISTRY_PASSWORD,env=REGISTRY_PASSWORD .
|
||||
@@ -149,3 +155,21 @@ are actually exercised. OpenAI/Anthropic/Google come from majordomo's abstractio
|
||||
parallel, `cap` (from `GADFLY_PROVIDER_CONCURRENCY` else `GADFLY_CONCURRENCY`, default 1) bounds
|
||||
models-at-once within a lane. The review timeout (`GADFLY_TIMEOUT_SECS`) is **per-lens**, not
|
||||
shared across the suite — a slow model can't starve later lenses (the original timeout bug).
|
||||
- **Large-PR token burn**: the agent loop re-sends the whole transcript every step, so a giant
|
||||
diff (the old `get_diff` dumped it untruncated, and it was embedded in both the review and
|
||||
recheck task) was re-transmitted ~steps × lenses × passes × models times — a ~250 K-token PR
|
||||
could drain a metered usage block in minutes. Fixed in three size-gated layers (small PRs
|
||||
untouched): paginated `get_diff` + `executus/compact` compaction in the binary; an
|
||||
`entrypoint.sh` downshift above `GADFLY_HUGE_DIFF_BYTES` (one cheap model, fewer lenses/steps,
|
||||
no recheck); and a swarm-wide `GADFLY_PR_BUDGET_SECS` wall-clock backstop. Compaction's threshold
|
||||
is intentionally LOW (`GADFLY_COMPACT_RATIO` 0.45, not executus's 0.7) because the burning
|
||||
transcript on the embedded path rarely reaches 0.7×context.
|
||||
- **executus re-platform**: the in-process review path runs through `executus/run`'s `run.Executor`
|
||||
(compaction, run-bounding, `Ports.Budget`, the wrap-up nudge as `Ports.Critic`), wiring it in
|
||||
`cmd/gadfly/executus.go`. Gadfly KEEPS its own `model.go` resolution (so `GADFLY_ENDPOINT_<NAME>`
|
||||
http aliases + the claude-code engine survive) and only hands `run.Executor` the already-resolved
|
||||
model via a trivial resolver — do NOT route review-model resolution through
|
||||
`model.ParseModelForContext` (it bypasses gadfly's endpoint aliases). `run.Result` exposes no
|
||||
transcript, so the old transcript-based forced-finalization fallback is gone; the wrap-up critic
|
||||
nudge is the remaining "always emit something" mechanism. The claude-code engine still shells out
|
||||
and is unaffected.
|
||||
|
||||
@@ -24,7 +24,10 @@ RUN --mount=type=cache,target=/go/pkg/mod \
|
||||
go build -trimpath -ldflags="-s -w" -o /out/gadfly ./cmd/gadfly
|
||||
|
||||
FROM alpine:3.20
|
||||
RUN apk add --no-cache bash git curl jq ca-certificates nodejs npm
|
||||
# procps provides pkill/pgrep, which entrypoint.sh's per-PR wall-clock backstop
|
||||
# (GADFLY_PR_BUDGET_SECS) uses to stop the review subtrees — busybox's applets
|
||||
# are not guaranteed to include them.
|
||||
RUN apk add --no-cache bash git curl jq ca-certificates nodejs npm procps
|
||||
# Bundle the Claude Code CLI so the `claude-code` review engine works out of the
|
||||
# box (GADFLY_MODELS=claude-code or claude-code/<model>). This adds Node + the
|
||||
# CLI to the image (notably larger); ollama-only users pay the size but nothing
|
||||
|
||||
@@ -396,7 +396,16 @@ The reviewer binary reads these (the stub/entrypoint set sane defaults):
|
||||
| `GADFLY_TIMEOUT_SECS` | 300 | deadline **per specialist lens** (review+recheck) |
|
||||
| `GADFLY_RECHECK` | on | set `0`/`false` to skip the recheck pass |
|
||||
| `GADFLY_RECHECK_MAX_STEPS` | 16 | recheck-pass step cap |
|
||||
| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the prompt (full diff via `get_diff`) |
|
||||
| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the **review** prompt (the full diff is reachable via the paginated `get_diff` tool, scoped per file with its `path` arg) |
|
||||
| `GADFLY_RECHECK_DIFF_CHARS` | 20000 | diff chars embedded in the **recheck** prompt (smaller — the recheck pages `get_diff` for the hunks it verifies) |
|
||||
| `GADFLY_COMPACT` | on | context compaction (via [executus](https://gitea.stevedudenhoeffer.com/steve/executus)): fold the transcript's runaway middle into a summary as it nears the model's context window, so a big diff + accumulating tool output can't balloon every step. `0` disables |
|
||||
| `GADFLY_COMPACT_RATIO` | 0.45 | fraction of the model's context window at which compaction fires |
|
||||
| `GADFLY_COMPACT_MODEL` | worker, else review model | cheap model the compactor uses to summarize the folded middle |
|
||||
| `GADFLY_COMPACT_KEEP_RECENT` | 8 | most-recent messages kept verbatim during compaction |
|
||||
| `GADFLY_COMPACT_SUMMARY_WORDS` | 200 | word cap on the compaction summary |
|
||||
| `GADFLY_MODEL_CONTEXT_TOKENS` | *(auto)* | override the model's context-window size (tokens) for the compaction threshold; set it for self-hosted endpoints executus can't introspect (Ollama Cloud models resolve automatically) |
|
||||
| `GADFLY_PR_TOKEN_BUDGET` | — | per-model token ceiling for this PR; once spent, remaining lenses/passes are skipped (advisory). 0 = off |
|
||||
| `GADFLY_PR_TIME_BUDGET_SECS` | — | per-model wall-clock ceiling for this PR (advisory). 0 = off |
|
||||
| `GADFLY_STATUS_BOARD` | on | set `0` to disable the live status-board comment |
|
||||
| `GADFLY_STATUS_POLL_SECS` | 12 | how often the status board re-renders/upserts |
|
||||
| `GADFLY_CONSOLIDATE` | `auto` | cross-model consensus comment: `auto` (on for ≥2 models), `1` (force on), `0` (off — one comment per model) |
|
||||
@@ -408,6 +417,37 @@ The reviewer binary reads these (the stub/entrypoint set sane defaults):
|
||||
| `GADFLY_REPO` | *(from `GITEA_API`)* | `owner/repo` slug stamped on emitted runs/findings (set by `entrypoint.sh`) |
|
||||
| `GADFLY_PR` | *(from event)* | PR number stamped on emitted runs/findings (set by `entrypoint.sh`) |
|
||||
|
||||
### Large-PR cost controls
|
||||
|
||||
A very large diff is the one thing that can blow the budget: every review step
|
||||
re-sends it, multiplied across models × lenses × passes × steps (a single
|
||||
~250 K-token PR can otherwise burn a whole metered usage block). Gadfly handles
|
||||
big PRs in three layers, all **size-gated so small PRs are untouched**:
|
||||
|
||||
1. **Paginated `get_diff` + compaction** (reviewer binary, on by default) —
|
||||
`get_diff` returns a paginated, optionally per-file window instead of the whole
|
||||
diff, and once a transcript nears the model's context window its middle is
|
||||
folded into a summary (powered by [executus](https://gitea.stevedudenhoeffer.com/steve/executus)'s
|
||||
`compact`). Tune with the `GADFLY_COMPACT_*` knobs above.
|
||||
2. **Downshift** (`entrypoint.sh`) — above `GADFLY_HUGE_DIFF_BYTES` the whole fleet
|
||||
collapses to a single cheap model + a focused lens subset, fewer steps, and no
|
||||
recheck. A finished shallow review beats a budget-nuking one, and the posted
|
||||
comment says so.
|
||||
3. **Hard backstop** (`entrypoint.sh`) — `GADFLY_PR_BUDGET_SECS` is a wall-clock
|
||||
ceiling across the *entire* fleet; on expiry the review is stopped and whatever
|
||||
was found so far is posted. Like everything else, it never fails CI.
|
||||
|
||||
| Env | Default | Meaning |
|
||||
|-----|---------|---------|
|
||||
| `GADFLY_HUGE_DIFF_BYTES` | 600000 | downshift the fleet when the PR diff exceeds this many bytes (0 = never downshift) |
|
||||
| `GADFLY_HUGE_DIFF_MODELS` | first model | model(s) to run on a downshifted huge PR |
|
||||
| `GADFLY_HUGE_DIFF_SPECIALISTS` | `security,correctness,error-handling` | lenses on a downshifted huge PR |
|
||||
| `GADFLY_HUGE_DIFF_MAX_STEPS` | 12 | review step cap on a huge PR |
|
||||
| `GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS` | 8 | recheck step cap on a huge PR |
|
||||
| `GADFLY_HUGE_DIFF_RECHECK` | 0 | run the recheck pass on a huge PR (off by default) |
|
||||
| `GADFLY_HUGE_DIFF_MAX_DIFF_CHARS` | 20000 | embedded review-diff chars on a huge PR |
|
||||
| `GADFLY_PR_BUDGET_SECS` | — | swarm-wide wall-clock backstop; stops the whole fleet when reached (0 = off) |
|
||||
|
||||
## Findings telemetry (optional)
|
||||
|
||||
Gadfly can record what it found so model quality can be tracked over time. It is
|
||||
@@ -431,7 +471,7 @@ code.
|
||||
## Building locally
|
||||
|
||||
```sh
|
||||
go build ./cmd/gadfly # needs read access to the private majordomo module
|
||||
go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
|
||||
go test ./...
|
||||
```
|
||||
|
||||
|
||||
@@ -33,14 +33,18 @@ type reviewEngine interface {
|
||||
runPass(ctx context.Context, system, task string, maxSteps int) (string, error)
|
||||
}
|
||||
|
||||
// majordomoEngine drives the in-process majordomo agent over the repo sandbox.
|
||||
// majordomoEngine drives the in-process review path over the repo sandbox. It no
|
||||
// longer calls majordomo's agent loop directly: each pass runs through an
|
||||
// executus run.Executor (see executus.go), which adds context compaction, run
|
||||
// bounding, the per-PR budget gate, and the wrap-up critic. mdl is retained only
|
||||
// so auto-select can fall back to the review model as its selector.
|
||||
type majordomoEngine struct {
|
||||
mdl llm.Model
|
||||
fsTools *repoFS
|
||||
rex *reviewExecutor
|
||||
mdl llm.Model
|
||||
}
|
||||
|
||||
func (e *majordomoEngine) runPass(ctx context.Context, system, task string, maxSteps int) (string, error) {
|
||||
return runAgent(ctx, e.mdl, e.fsTools, system, task, maxSteps)
|
||||
return e.rex.run(ctx, system, task, maxSteps)
|
||||
}
|
||||
|
||||
// claudeCodeEngine reviews by shelling out to the `claude` CLI (Claude Code) in
|
||||
|
||||
@@ -0,0 +1,371 @@
|
||||
package main
|
||||
|
||||
// executus.go wires gadfly's agentic review path onto the executus run kernel
|
||||
// (gitea.stevedudenhoeffer.com/steve/executus), layered above majordomo. The
|
||||
// majordomoEngine no longer drives majordomo's agent loop directly; it builds a
|
||||
// run.Executor that gives gadfly, for free:
|
||||
//
|
||||
// - context compaction (executus/compact): once the transcript a step would
|
||||
// SEND crosses a token threshold derived from the model's real context
|
||||
// window, the runaway middle is folded into a one-paragraph summary by a
|
||||
// cheap summarizer model — so a big diff + accumulating read_file/grep
|
||||
// results can't balloon every re-sent step (the large-PR burn).
|
||||
// - run bounding + a per-PR spend budget (executus/run Ports.Budget): a hard
|
||||
// token/seconds ceiling so a pathological PR can't drain the usage block.
|
||||
// - the wrap-up nudge, re-expressed as an executus Critic (Ports.Critic): the
|
||||
// steer that tells a step-hungry model to stop investigating and write its
|
||||
// answer is now the critic seam, not a bespoke RunOption.
|
||||
//
|
||||
// Everything degrades to today's behavior when unconfigured: nil summarizer or a
|
||||
|
|
||||
// 0 context window disables compaction; nil budget disables the ceiling; the
|
||||
// claude-code engine shells out and is unaffected by any of this.
|
||||
//
|
||||
// gadfly keeps its own model.go resolution (so GADFLY_ENDPOINT_<NAME> http
|
||||
// aliases, failover chains, and the claude-code engine all survive) — the
|
||||
// run.Executor is handed gadfly's already-resolved model via a trivial resolver,
|
||||
// not routed through executus's tier table.
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"fmt"
|
||||
"os"
|
||||
"strconv"
|
||||
"strings"
|
||||
"sync"
|
||||
"sync/atomic"
|
||||
"time"
|
||||
|
||||
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
|
||||
|
||||
"gitea.stevedudenhoeffer.com/steve/executus/compact"
|
||||
"gitea.stevedudenhoeffer.com/steve/executus/model"
|
||||
exrun "gitea.stevedudenhoeffer.com/steve/executus/run"
|
||||
exectool "gitea.stevedudenhoeffer.com/steve/executus/tool"
|
||||
)
|
||||
|
||||
const (
|
||||
// defaultCompactRatio is the fraction of the model's context window at which
|
||||
// compaction fires. It is deliberately LOWER than executus's own 0.7 default:
|
||||
// on the large-PR burn the per-step transcript is the embedded diff (~17K) plus
|
||||
// accumulating read_file results, which rarely reaches 0.7×262K≈183K — so a
|
||||
// 0.7 threshold never bites. ~0.45×262K≈118K folds the runaway middle while a
|
||||
// transcript is still well under the cap. Override with GADFLY_COMPACT_RATIO.
|
||||
defaultCompactRatio = 0.45
|
||||
// defaultCompactKeepRecent / defaultCompactSummaryWords mirror executus's own
|
||||
// compactor defaults; surfaced as gadfly env knobs for tuning.
|
||||
defaultCompactKeepRecent = 8
|
||||
defaultCompactSummaryWords = 200
|
||||
// contextTokenLookupTimeout bounds the one-shot /api/show call that resolves a
|
||||
// cloud model's context window at executor-build time. Kept short so a slow or
|
||||
// unreachable endpoint adds at most this to startup before degrading to
|
||||
// no-compaction (rather than the provider cache's default 15s).
|
||||
contextTokenLookupTimeout = 5 * time.Second
|
||||
)
|
||||
|
||||
// runSeq mints a unique-per-process RunID suffix for each executor run so audit
|
||||
// and the run kernel can tell one pass from another within a binary process.
|
||||
var runSeq atomic.Uint64
|
||||
|
||||
// wrappedTool adapts an already-built majordomo llm.Tool (gadfly's sandboxed
|
||||
// read_file/grep/get_diff/… closures over the repoFS) to executus's tool.Tool
|
||||
// interface so the run kernel can build a toolbox from them by name. gadfly's
|
||||
// tools need no caller/channel identity, so BuildLLM ignores the Invocation and
|
||||
// returns the pre-built tool; Permission is the zero value (private, ungated).
|
||||
type wrappedTool struct{ t llm.Tool }
|
||||
|
||||
func (w wrappedTool) Name() string { return w.t.Name }
|
||||
func (w wrappedTool) Description() string { return w.t.Description }
|
||||
func (w wrappedTool) Permission() exectool.Permission { return exectool.Permission{} }
|
||||
func (w wrappedTool) BuildLLM(_ exectool.Invocation) llm.Tool { return w.t }
|
||||
|
||||
// gadflyToolRegistry registers the repo's read-only tools (plus the optional
|
||||
|
gitea-actions
commented
🟡 gadflyToolRegistry re-derives fsTools()+delegateTool, duplicating toolbox()'s membership logic (two sources of truth) maintainability · flagged by 1 model
🪰 Gadfly · advisory 🟡 **gadflyToolRegistry re-derives fsTools()+delegateTool, duplicating toolbox()'s membership logic (two sources of truth)**
_maintainability · flagged by 1 model_
- **`cmd/gadfly/executus.go:82-84` vs `cmd/gadfly/tools.go:117-119` — duplicated tool-list construction.** `gadflyToolRegistry` re-derives `fs.fsTools()` + `if fs.worker != nil { append(fs.delegateTool()) }`, byte-for-byte the membership logic of `toolbox()`. Verified by reading both functions. Adding a new built-in tool to `fsTools()` requires both paths to stay in sync by hand.
<sub>🪰 Gadfly · advisory</sub>
|
||||
// delegate_investigation worker tool) in a fresh executus tool.Registry and
|
||||
// returns it along with the tool names for RunnableAgent.LowLevelTools.
|
||||
func gadflyToolRegistry(fs *repoFS) (exectool.Registry, []string, error) {
|
||||
reg := exectool.NewRegistry()
|
||||
tools := fs.allTools()
|
||||
names := make([]string, 0, len(tools))
|
||||
for _, t := range tools {
|
||||
if err := reg.Register(wrappedTool{t: t}); err != nil {
|
||||
return nil, nil, fmt.Errorf("register tool %q: %w", t.Name, err)
|
||||
}
|
||||
names = append(names, t.Name)
|
||||
}
|
||||
return reg, names, nil
|
||||
}
|
||||
|
||||
// gadflyBudget is gadfly's per-PR spend ceiling, satisfying run.Ports.Budget.
|
||||
// It gates a run BEFORE it makes any model call (Check) once the process has
|
||||
// spent its token or wall-clock allowance on this PR. Tokens are fed in
|
||||
// out-of-band via addUsage (the Budget interface's Commit only carries seconds);
|
||||
// the engine calls addUsage after each pass with run.Result.Usage. A nil
|
||||
// *gadflyBudget is never installed — caps of 0 mean "unlimited", so the port is
|
||||
// only wired when at least one cap is set.
|
||||
//
|
||||
// The guard is PASS-granular: Check runs before each pass, so it stops the NEXT
|
||||
// pass once the budget is spent but cannot abort a single runaway pass mid-flight.
|
||||
// The swarm-wide GADFLY_PR_BUDGET_SECS wall-clock backstop (entrypoint.sh) is what
|
||||
// bounds a mid-pass runaway.
|
||||
type gadflyBudget struct {
|
||||
mu sync.Mutex
|
||||
maxTokens int64
|
||||
maxSeconds float64
|
||||
tokens int64
|
||||
seconds float64
|
||||
}
|
||||
|
||||
// newPRBudget builds the per-PR budget from env, or nil when neither cap is set
|
||||
// (the default — the swarm-wide ceiling lives in entrypoint.sh; this is the
|
||||
// per-process belt to its suspenders).
|
||||
func newPRBudget() *gadflyBudget {
|
||||
toks := envInt("GADFLY_PR_TOKEN_BUDGET", 0)
|
||||
secs := envInt("GADFLY_PR_TIME_BUDGET_SECS", 0)
|
||||
if toks <= 0 && secs <= 0 {
|
||||
return nil
|
||||
}
|
||||
return &gadflyBudget{maxTokens: int64(toks), maxSeconds: float64(secs)}
|
||||
}
|
||||
|
||||
func (b *gadflyBudget) Check(_ context.Context, _ string) error {
|
||||
if b == nil {
|
||||
return nil
|
||||
}
|
||||
b.mu.Lock()
|
||||
defer b.mu.Unlock()
|
||||
if b.maxTokens > 0 && b.tokens >= b.maxTokens {
|
||||
return fmt.Errorf("gadfly: per-PR token budget exhausted (%d/%d)", b.tokens, b.maxTokens)
|
||||
}
|
||||
if b.maxSeconds > 0 && b.seconds >= b.maxSeconds {
|
||||
return fmt.Errorf("gadfly: per-PR time budget exhausted (%.0f/%.0fs)", b.seconds, b.maxSeconds)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func (b *gadflyBudget) Commit(_ context.Context, _ string, runtimeSeconds float64) {
|
||||
if b == nil {
|
||||
return
|
||||
}
|
||||
b.mu.Lock()
|
||||
defer b.mu.Unlock()
|
||||
b.seconds += runtimeSeconds
|
||||
}
|
||||
|
||||
// addUsage records a finished pass's token spend toward the budget. Safe on nil.
|
||||
func (b *gadflyBudget) addUsage(u llm.Usage) {
|
||||
if b == nil {
|
||||
return
|
||||
}
|
||||
b.mu.Lock()
|
||||
defer b.mu.Unlock()
|
||||
b.tokens += int64(u.InputTokens) + int64(u.OutputTokens)
|
||||
}
|
||||
|
||||
// wrapUpCritic re-expresses gadfly's wrap-up nudge as an executus run.Critic:
|
||||
// once a run comes within wrapUpReserve steps of its cap, Steer() injects the
|
||||
// "stop calling tools and write your final answer" message so a thorough model
|
||||
// spends its last steps finalizing instead of hard-failing empty. It sets no
|
||||
// hard deadline (Deadline()==zero) and never raises the step ceiling
|
||||
// (MaxSteps()==0, defer to the run's MaxIterations) — it is purely the nudge.
|
||||
type wrapUpCritic struct{ reserve int }
|
||||
|
||||
func (c *wrapUpCritic) Monitor(_ context.Context, info exrun.RunInfo, _ time.Duration) exrun.CriticHandle {
|
||||
return &wrapUpHandle{maxSteps: info.MaxIterations, reserve: c.reserve}
|
||||
}
|
||||
|
||||
type wrapUpHandle struct {
|
||||
mu sync.Mutex
|
||||
maxSteps int
|
||||
reserve int
|
||||
done int // steps completed so far
|
||||
nudged bool
|
||||
}
|
||||
|
||||
func (h *wrapUpHandle) RecordStep(iter int, _ *llm.Response) {
|
||||
h.mu.Lock()
|
||||
h.done = iter + 1
|
||||
h.mu.Unlock()
|
||||
}
|
||||
func (h *wrapUpHandle) RecordToolStart(string, string) {}
|
||||
func (h *wrapUpHandle) Steer() []llm.Message {
|
||||
h.mu.Lock()
|
||||
defer h.mu.Unlock()
|
||||
at := h.maxSteps - h.reserve
|
||||
if at < 1 {
|
||||
at = 1
|
||||
}
|
||||
if !h.nudged && h.maxSteps > 0 && h.done >= at {
|
||||
h.nudged = true
|
||||
return []llm.Message{llm.UserText(wrapUpInstruction)}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
func (h *wrapUpHandle) Deadline() time.Time { return time.Time{} }
|
||||
func (h *wrapUpHandle) MaxSteps() int { return 0 }
|
||||
func (h *wrapUpHandle) KillCause() error { return nil }
|
||||
func (h *wrapUpHandle) Stop() {}
|
||||
|
||||
// reviewExecutor bundles a run.Executor with the per-run wiring the engine needs
|
||||
// for each pass (the tool names to expose, the model spec to report as the tier,
|
||||
// the per-PR caller id, and the budget to feed token usage into).
|
||||
type reviewExecutor struct {
|
||||
ex *exrun.Executor
|
||||
toolNames []string
|
||||
modelSpec string
|
||||
callerID string
|
||||
budget *gadflyBudget
|
||||
}
|
||||
|
||||
// newReviewExecutor builds the run.Executor for the in-process majordomo review
|
||||
// path. mdl is gadfly's already-resolved review model; summarizer is the cheap
|
||||
// model the compactor uses (nil disables compaction). Compaction also needs the
|
||||
// model's context window (resolved once here, not per pass); a 0 window likewise
|
||||
// disables it. The budget (may be nil) becomes the run.Ports.Budget gate.
|
||||
func newReviewExecutor(fs *repoFS, mdl, summarizer llm.Model, modelSpec string, budget *gadflyBudget) (*reviewExecutor, error) {
|
||||
reg, names, err := gadflyToolRegistry(fs)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
// gadfly resolves exactly one model per process; the run kernel's resolver
|
||||
// just hands that model back regardless of the tier string it is asked for.
|
||||
modelsResolver := func(ctx context.Context, _ string) (context.Context, llm.Model, error) {
|
||||
return ctx, mdl, nil
|
||||
}
|
||||
|
||||
var compactor compact.CompactorFactory
|
||||
var ctxTokens func(string) int
|
||||
if summarizer != nil && compactionEnabled() {
|
||||
if window := resolveContextTokens(modelSpec); window > 0 {
|
||||
sumResolver := func(ctx context.Context, _ string) (context.Context, llm.Model, error) {
|
||||
return ctx, summarizer, nil
|
||||
}
|
||||
compactor = compact.NewCompactor(compact.CompactorConfig{
|
||||
Models: sumResolver,
|
||||
KeepRecent: envInt("GADFLY_COMPACT_KEEP_RECENT", defaultCompactKeepRecent),
|
||||
SummaryWordCap: envInt("GADFLY_COMPACT_SUMMARY_WORDS", defaultCompactSummaryWords),
|
||||
})
|
||||
ctxTokens = func(string) int { return window } // memoized: one window per process
|
||||
}
|
||||
}
|
||||
|
||||
var ports exrun.Ports
|
||||
ports.Critic = &wrapUpCritic{reserve: wrapUpReserve()}
|
||||
if budget != nil {
|
||||
ports.Budget = budget
|
||||
}
|
||||
|
||||
cfg := exrun.Config{
|
||||
|
gitea-actions
commented
🟡 GADFLY_TIMEOUT_SECS/MAX_STEPS configured in multiple places; Config.Defaults.MaxRuntime/MaxIterations duplicated verbatim with per-run RunnableAgent values maintainability · flagged by 1 model
🪰 Gadfly · advisory 🟡 **GADFLY_TIMEOUT_SECS/MAX_STEPS configured in multiple places; Config.Defaults.MaxRuntime/MaxIterations duplicated verbatim with per-run RunnableAgent values**
_maintainability · flagged by 1 model_
- `cmd/gadfly/executus.go:258` & `:285` (with `cmd/gadfly/main.go:325`) — **`GADFLY_TIMEOUT_SECS` / `GADFLY_MAX_STEPS` are configured in three/two places, and the `Config.Defaults` copies are redundant with the per-run values.** The expression `time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second` is written verbatim twice in `executus.go` (once in `Config.Defaults.MaxRuntime`, once in the per-run `RunnableAgent.MaxRuntime`), and the same timeout is *also* applied as th…
<sub>🪰 Gadfly · advisory</sub>
|
||||
Registry: reg,
|
||||
Models: modelsResolver,
|
||||
Compactor: compactor,
|
||||
ContextTokens: ctxTokens,
|
||||
Defaults: exrun.Defaults{
|
||||
// MaxIterations/MaxRuntime are intentionally omitted: every pass sets its
|
||||
// own per-run cap on the RunnableAgent below (the review and recheck caps
|
||||
// differ), so a Defaults value here would always be overridden — dead. This
|
||||
// leaves only the cross-pass guards + the compaction ratio.
|
||||
MaxConsecutiveToolErrors: 4,
|
||||
MaxSameToolCallRepeats: 4,
|
||||
CompactionThresholdRatio: compactRatio(),
|
||||
FallbackTier: modelSpec,
|
||||
},
|
||||
Ports: ports,
|
||||
}
|
||||
return &reviewExecutor{
|
||||
ex: exrun.New(cfg),
|
||||
toolNames: names,
|
||||
modelSpec: modelSpec,
|
||||
callerID: prCallerID(),
|
||||
budget: budget,
|
||||
}, nil
|
||||
}
|
||||
|
||||
|
gitea-actions
commented
🟡 Per-PR token budget only updated after each pass (addUsage post-run), so Budget.Check can't stop a single runaway pass mid-flight — token guard is pass-granular, weaker than its name implies error-handling, maintainability · flagged by 3 models
🪰 Gadfly · advisory 🟡 **Per-PR token budget only updated after each pass (addUsage post-run), so Budget.Check can't stop a single runaway pass mid-flight — token guard is pass-granular, weaker than its name implies**
_error-handling, maintainability · flagged by 3 models_
- **`cmd/gadfly/executus.go:283` (and `:123–155`, the `gadflyBudget` token gate) — the per-PR *token* budget can't bite mid-pass, only between passes.** `Budget.Check` is the executor's pre-call gate, but it reads `b.tokens`, which is only updated by `reviewExecutor.run` calling `r.budget.addUsage(res.Usage)` *after* `ex.Run` returns (the `run.Ports.Budget` interface only carries `Commit(seconds)`, with no per-call token feed — tokens are fed "out-of-band" once per finished pass). Consequence: t…
<sub>🪰 Gadfly · advisory</sub>
|
||||
// run executes one agent pass (review or recheck) through the run kernel and
|
||||
// returns the model's final text. An empty answer with no error is reported as
|
||||
// an error so the caller (reviewWithSpecialist) renders the advisory "reviewer
|
||||
// failed to complete" notice rather than a blank section.
|
||||
func (r *reviewExecutor) run(ctx context.Context, system, task string, maxSteps int) (string, error) {
|
||||
res := r.ex.Run(ctx, exrun.RunnableAgent{
|
||||
Name: "gadfly-review",
|
||||
SystemPrompt: system,
|
||||
ModelTier: r.modelSpec,
|
||||
MaxIterations: maxSteps,
|
||||
MaxRuntime: reviewTimeout(),
|
||||
LowLevelTools: r.toolNames,
|
||||
Critic: exrun.CriticConfig{Enabled: true},
|
||||
}, exectool.Invocation{
|
||||
RunID: fmt.Sprintf("gadfly-%d", runSeq.Add(1)),
|
||||
CallerID: r.callerID,
|
||||
}, task)
|
||||
|
||||
// Feed token spend toward the per-PR budget out-of-band (Commit carries only
|
||||
// seconds; the executor already called it). Safe on a nil budget.
|
||||
r.budget.addUsage(res.Usage)
|
||||
|
||||
if res.Err != nil {
|
||||
return "", res.Err
|
||||
}
|
||||
if out := strings.TrimSpace(res.Output); out != "" {
|
||||
return out, nil
|
||||
}
|
||||
return "", errors.New("agent produced no output")
|
||||
}
|
||||
|
||||
// prCallerID is the budget/audit caller key: the repo + PR, so a budget keyed on
|
||||
// it is naturally per-PR. Falls back to "local" for an out-of-CI run.
|
||||
func prCallerID() string {
|
||||
repo := strings.TrimSpace(os.Getenv("GADFLY_REPO"))
|
||||
pr := strings.TrimSpace(os.Getenv("GADFLY_PR"))
|
||||
|
gitea-actions
commented
🟡 compactionEnabled() is a verbatim copy of recheckEnabled(); extract a shared env-bool helper maintainability · flagged by 1 model
🪰 Gadfly · advisory 🟡 **compactionEnabled() is a verbatim copy of recheckEnabled(); extract a shared env-bool helper**
_maintainability · flagged by 1 model_
- **`cmd/gadfly/executus.go:319-326` — `compactionEnabled()` is a verbatim copy of `recheckEnabled()`.** It is byte-for-byte the same "on unless explicitly falsey" pattern as `recheck.go:68-75` (`switch strings.ToLower(strings.TrimSpace(os.Getenv(...)))` over `"0","false","no","off"` → `default: true`). This is exactly the copy-paste the lens targets: two identical env-bool parsers that will drift. Fix: extract one helper (e.g. `envBoolDefaultTrue(name string) bool`) and have both call it.
<sub>🪰 Gadfly · advisory</sub>
|
||||
if repo == "" && pr == "" {
|
||||
return "local"
|
||||
}
|
||||
return repo + "#" + pr
|
||||
}
|
||||
|
||||
// compactionEnabled reports whether context compaction should be wired. On
|
||||
// unless GADFLY_COMPACT is explicitly falsey.
|
||||
func compactionEnabled() bool { return envBool("GADFLY_COMPACT", true) }
|
||||
|
||||
// compactRatio is the compaction threshold as a fraction of the model context
|
||||
// window (GADFLY_COMPACT_RATIO), clamped to (0,1]; default defaultCompactRatio.
|
||||
func compactRatio() float64 {
|
||||
v := strings.TrimSpace(os.Getenv("GADFLY_COMPACT_RATIO"))
|
||||
if v == "" {
|
||||
return defaultCompactRatio
|
||||
}
|
||||
f, err := strconv.ParseFloat(v, 64)
|
||||
if err != nil || f <= 0 || f > 1 {
|
||||
return defaultCompactRatio
|
||||
}
|
||||
return f
|
||||
}
|
||||
|
||||
// resolveContextTokens returns the review model's context window in tokens, used
|
||||
// to set the compaction threshold. GADFLY_MODEL_CONTEXT_TOKENS overrides it
|
||||
// (needed for custom/self-hosted endpoints executus can't introspect); otherwise
|
||||
// it asks executus/model, which knows the static catalog and can fetch an
|
||||
// Ollama Cloud model's limit via /api/show (one call, at executor-build time).
|
||||
// Returns 0 — disabling compaction — for an unknown model, mirroring executus's
|
||||
// "unknown ⇒ don't budget" contract.
|
||||
func resolveContextTokens(modelSpec string) int {
|
||||
if v := envInt("GADFLY_MODEL_CONTEXT_TOKENS", 0); v > 0 {
|
||||
return v
|
||||
}
|
||||
key := strings.TrimSpace(os.Getenv("GADFLY_API_KEY"))
|
||||
if key == "" {
|
||||
key = strings.TrimSpace(os.Getenv("OLLAMA_API_KEY"))
|
||||
|
gitea-actions
commented
🟠 resolveContextTokens silently disables compaction on network errors with no log error-handling · flagged by 2 models 🪰 Gadfly · advisory 🟠 **resolveContextTokens silently disables compaction on network errors with no log**
_error-handling · flagged by 2 models_
<sub>🪰 Gadfly · advisory</sub>
|
||||
}
|
||||
cache := model.NewCloudOllamaLimitCache("", key, nil)
|
||||
ctx, cancel := context.WithTimeout(context.Background(), contextTokenLookupTimeout)
|
||||
defer cancel()
|
||||
if n, ok := model.MaxContextTokensResolving(ctx, modelSpec, cache); ok {
|
||||
return n
|
||||
}
|
||||
// Unknown model or a failed lookup (e.g. no key / unreachable endpoint): don't
|
||||
// guess — compaction is disabled. Log it so a misconfiguration is debuggable
|
||||
// rather than silently dropping the protection. Set GADFLY_MODEL_CONTEXT_TOKENS
|
||||
// to force a window for an endpoint executus can't introspect.
|
||||
fmt.Fprintf(os.Stderr, "gadfly: no context window resolved for %q; compaction disabled (set GADFLY_MODEL_CONTEXT_TOKENS to enable it)\n", modelSpec)
|
||||
return 0
|
||||
}
|
||||
@@ -0,0 +1,140 @@
|
||||
package main
|
||||
|
||||
import (
|
||||
"context"
|
||||
"testing"
|
||||
|
||||
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
|
||||
)
|
||||
|
||||
func TestGadflyBudget(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
|
||||
// A nil budget never blocks and never panics.
|
||||
var nilB *gadflyBudget
|
||||
if err := nilB.Check(ctx, "pr"); err != nil {
|
||||
t.Errorf("nil budget Check should be nil, got %v", err)
|
||||
}
|
||||
nilB.Commit(ctx, "pr", 10)
|
||||
nilB.addUsage(llm.Usage{InputTokens: 5})
|
||||
|
||||
// Token ceiling: passes until usage crosses it.
|
||||
b := &gadflyBudget{maxTokens: 100}
|
||||
if err := b.Check(ctx, "pr"); err != nil {
|
||||
t.Fatalf("fresh budget should pass, got %v", err)
|
||||
}
|
||||
b.addUsage(llm.Usage{InputTokens: 60, OutputTokens: 50}) // 110 >= 100
|
||||
if err := b.Check(ctx, "pr"); err == nil {
|
||||
t.Error("budget over the token cap should reject the next run")
|
||||
}
|
||||
|
||||
// Seconds ceiling, accumulated via Commit.
|
||||
s := &gadflyBudget{maxSeconds: 30}
|
||||
s.Commit(ctx, "pr", 31)
|
||||
if err := s.Check(ctx, "pr"); err == nil {
|
||||
t.Error("budget over the time cap should reject the next run")
|
||||
}
|
||||
}
|
||||
|
||||
func TestNewPRBudget(t *testing.T) {
|
||||
t.Setenv("GADFLY_PR_TOKEN_BUDGET", "")
|
||||
t.Setenv("GADFLY_PR_TIME_BUDGET_SECS", "")
|
||||
if newPRBudget() != nil {
|
||||
t.Error("no caps set should yield a nil (disabled) budget")
|
||||
}
|
||||
t.Setenv("GADFLY_PR_TOKEN_BUDGET", "1000")
|
||||
if b := newPRBudget(); b == nil || b.maxTokens != 1000 {
|
||||
t.Errorf("token cap should build a budget with maxTokens=1000, got %+v", b)
|
||||
}
|
||||
}
|
||||
|
||||
func TestCompactRatio(t *testing.T) {
|
||||
t.Setenv("GADFLY_COMPACT_RATIO", "")
|
||||
if got := compactRatio(); got != defaultCompactRatio {
|
||||
t.Errorf("default ratio = %v, want %v", got, defaultCompactRatio)
|
||||
}
|
||||
t.Setenv("GADFLY_COMPACT_RATIO", "0.6")
|
||||
if got := compactRatio(); got != 0.6 {
|
||||
t.Errorf("ratio override = %v, want 0.6", got)
|
||||
}
|
||||
for _, bad := range []string{"0", "-1", "2", "nope"} {
|
||||
t.Setenv("GADFLY_COMPACT_RATIO", bad)
|
||||
if got := compactRatio(); got != defaultCompactRatio {
|
||||
t.Errorf("invalid ratio %q should fall back to the default, got %v", bad, got)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestResolveContextTokensOverride(t *testing.T) {
|
||||
// The explicit override short-circuits any model introspection (no network).
|
||||
t.Setenv("GADFLY_MODEL_CONTEXT_TOKENS", "123456")
|
||||
if got := resolveContextTokens("anything"); got != 123456 {
|
||||
t.Errorf("explicit context-token override = %d, want 123456", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestPRCallerID(t *testing.T) {
|
||||
t.Setenv("GADFLY_REPO", "")
|
||||
t.Setenv("GADFLY_PR", "")
|
||||
if got := prCallerID(); got != "local" {
|
||||
t.Errorf("no repo/PR should be %q, got %q", "local", got)
|
||||
}
|
||||
t.Setenv("GADFLY_REPO", "steve/mort")
|
||||
t.Setenv("GADFLY_PR", "1367")
|
||||
if got := prCallerID(); got != "steve/mort#1367" {
|
||||
t.Errorf("callerID = %q, want steve/mort#1367", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestCompactionEnabled(t *testing.T) {
|
||||
for _, v := range []string{"", "1", "true", "yes"} {
|
||||
t.Setenv("GADFLY_COMPACT", v)
|
||||
if !compactionEnabled() {
|
||||
t.Errorf("GADFLY_COMPACT=%q should be enabled", v)
|
||||
}
|
||||
}
|
||||
for _, v := range []string{"0", "false", "no", "off"} {
|
||||
t.Setenv("GADFLY_COMPACT", v)
|
||||
if compactionEnabled() {
|
||||
t.Errorf("GADFLY_COMPACT=%q should be disabled", v)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestGadflyToolRegistry(t *testing.T) {
|
||||
fs, err := newRepoFS(t.TempDir(), "diff")
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
_, names, err := gadflyToolRegistry(fs)
|
||||
if err != nil {
|
||||
t.Fatalf("gadflyToolRegistry: %v", err)
|
||||
}
|
||||
want := map[string]bool{"read_file": true, "list_dir": true, "grep": true, "find_files": true, "get_diff": true}
|
||||
has := func(n string) bool {
|
||||
for _, g := range names {
|
||||
if g == n {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
for n := range want {
|
||||
if !has(n) {
|
||||
t.Errorf("registry missing tool %q (got %v)", n, names)
|
||||
}
|
||||
}
|
||||
if has("delegate_investigation") {
|
||||
t.Error("delegate_investigation must be absent without a worker model")
|
||||
}
|
||||
|
||||
// With a worker model the delegate tool is registered too.
|
||||
fs.worker = fakeModel(t, "x")
|
||||
_, names, err = gadflyToolRegistry(fs)
|
||||
if err != nil {
|
||||
t.Fatalf("gadflyToolRegistry with worker: %v", err)
|
||||
}
|
||||
if !has("delegate_investigation") {
|
||||
t.Errorf("delegate_investigation should be registered with a worker model, got %v", names)
|
||||
}
|
||||
}
|
||||
@@ -104,7 +104,7 @@ func TestRunSpecialists_FansOut(t *testing.T) {
|
||||
}
|
||||
specs := threeLenses()
|
||||
|
||||
results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff")
|
||||
results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff")
|
||||
|
||||
if got := peak(); got != 3 {
|
||||
t.Errorf("peak concurrent lenses = %d, want 3", got)
|
||||
@@ -124,7 +124,7 @@ func TestRunSpecialists_SequentialByDefault(t *testing.T) {
|
||||
}
|
||||
specs := threeLenses()
|
||||
|
||||
results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff")
|
||||
results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff")
|
||||
|
||||
if got := peak(); got != 1 {
|
||||
t.Errorf("peak concurrent lenses = %d, want 1 (sequential by default)", got)
|
||||
@@ -146,7 +146,7 @@ func TestRunSpecialists_PerProviderFanOut(t *testing.T) {
|
||||
}
|
||||
specs := threeLenses()
|
||||
|
||||
results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff")
|
||||
results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff")
|
||||
|
||||
if got := peak(); got != 3 {
|
||||
t.Errorf("peak concurrent lenses = %d, want 3 (m1 per-provider override)", got)
|
||||
|
||||
@@ -39,10 +39,9 @@
|
||||
// GADFLY_TITLE PR title (optional).
|
||||
// GADFLY_BODY PR description (optional).
|
||||
// GADFLY_MAX_STEPS review-pass step cap (optional, default 24).
|
||||
// GADFLY_WRAPUP_RESERVE steps before the cap at which the agent is told to
|
||||
// stop investigating and write its answer (optional,
|
||||
// default 4). Plus a tool-free finalization fallback
|
||||
// guarantees a step-exhausted pass still emits output.
|
||||
// GADFLY_WRAPUP_RESERVE steps before the cap at which the wrap-up critic nudges
|
||||
// the agent to stop investigating and write its answer
|
||||
// (optional, default 4).
|
||||
// GADFLY_RECHECK set to 0/false to skip the recheck pass (optional, default on).
|
||||
// GADFLY_RECHECK_MAX_STEPS recheck-pass step cap (optional, default 16).
|
||||
// GADFLY_TIMEOUT_SECS overall deadline in seconds, shared by both passes (optional, default 300).
|
||||
@@ -55,8 +54,8 @@
|
||||
// lanes as GADFLY_PROVIDER_CONCURRENCY (e.g.
|
||||
// "ollama-cloud=3,m1=1"). Wins over GADFLY_LENS_CONCURRENCY
|
||||
// for the model's provider; falls back to it otherwise.
|
||||
// GADFLY_MAX_DIFF_CHARS diff chars embedded in the prompt (optional, default 60000;
|
||||
// the full diff is always available via the get_diff tool).
|
||||
// GADFLY_MAX_DIFF_CHARS diff chars embedded in the review prompt (optional, default 60000;
|
||||
// the full diff is reachable via the paginated get_diff tool).
|
||||
//
|
||||
// On success it prints the review to stdout and exits 0. On a usage/config or
|
||||
// model error it prints a diagnostic to stderr and exits non-zero; run.sh then
|
||||
@@ -70,11 +69,9 @@ import (
|
||||
"os"
|
||||
"strconv"
|
||||
"strings"
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
"gitea.stevedudenhoeffer.com/steve/majordomo/agent"
|
||||
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
|
||||
"gitea.stevedudenhoeffer.com/steve/executus/fanout"
|
||||
)
|
||||
|
||||
const (
|
||||
@@ -107,13 +104,6 @@ const wrapUpInstruction = "⚠️ You are almost out of your investigation budge
|
||||
"Do not begin any new investigation. If a finding could not be confirmed, drop it or mark it explicitly as unverified. " +
|
||||
"Output the review in the required format right now."
|
||||
|
||||
// finalizeInstruction is the user message sent on the tool-free fallback pass
|
||||
// when the agent exhausted its budget (or tripped a loop guard) without ever
|
||||
// producing a final answer. It forces the model to synthesize whatever it has.
|
||||
const finalizeInstruction = "You have run out of investigation steps. Do NOT call any tools. " +
|
||||
"Based solely on what you have already gathered above, write your final answer now in the required format. " +
|
||||
"If you could not confirm some findings, omit them or mark them as unverified, but produce the answer."
|
||||
|
||||
func main() {
|
||||
if err := run(); err != nil {
|
||||
fmt.Fprintln(os.Stderr, "gadfly:", err)
|
||||
@@ -177,7 +167,18 @@ func run() error {
|
||||
} else if worker != nil {
|
||||
fsTools.worker = worker
|
||||
}
|
||||
eng = &majordomoEngine{mdl: mdl, fsTools: fsTools}
|
||||
// The context compactor needs a cheap summarizer; reuse the worker model
|
||||
// when present, else the review model. A bad explicit GADFLY_COMPACT_MODEL
|
||||
// just disables compaction rather than sinking the review.
|
||||
summarizer, serr := resolveSummarizerModel(mdl, fsTools.worker)
|
||||
if serr != nil {
|
||||
fmt.Fprintln(os.Stderr, "gadfly: compaction summarizer disabled:", serr)
|
||||
}
|
||||
rex, rerr := newReviewExecutor(fsTools, mdl, summarizer, os.Getenv("GADFLY_MODEL"), newPRBudget())
|
||||
if rerr != nil {
|
||||
return fmt.Errorf("build review executor: %w", rerr)
|
||||
}
|
||||
eng = &majordomoEngine{rex: rex, mdl: mdl}
|
||||
}
|
||||
|
||||
specialists, registry, auto, serrs := resolveSpecialists(repoDir)
|
||||
@@ -231,55 +232,55 @@ func run() error {
|
||||
}
|
||||
|
||||
// runSpecialists reviews the diff through each lens and returns the results in
|
||||
// the SAME order as specialists, regardless of finish order. Up to
|
||||
// GADFLY_LENS_CONCURRENCY lenses run concurrently; the default of 1 keeps the
|
||||
// suite sequential, exactly as before. Each lens already runs under its own
|
||||
// per-lens timeout (reviewWithSpecialist), so concurrency simply overlaps those
|
||||
// independent passes — and because reviewWithSpecialist builds a fresh toolbox
|
||||
// per pass and the lenses only read the immutable repoFS, they share no mutable
|
||||
// state. Results are stored by index so the consolidated comment keeps the
|
||||
// configured lens order.
|
||||
// the SAME order as specialists, regardless of finish order. It uses executus's
|
||||
// fanout primitive: up to GADFLY_LENS_CONCURRENCY lenses run concurrently (the
|
||||
// default of 1 keeps the suite sequential, exactly as before), and fanout.Run
|
||||
// returns one result per lens in input order. Each lens already runs under its
|
||||
// own per-lens timeout (reviewWithSpecialist) and the lenses only read the
|
||||
// immutable repoFS, so concurrency simply overlaps independent passes.
|
||||
//
|
||||
// Caution: this fans out WITHIN one model. It multiplies with entrypoint.sh's
|
||||
// per-provider model concurrency, so total concurrent backend requests ≈
|
||||
// (models at once) × (lenses at once). To fan lenses out without oversubscribing
|
||||
// the backend, run models one at a time (provider lane cap 1) and raise this.
|
||||
func runSpecialists(eng reviewEngine, base string, specialists []Specialist, task, diff string) []specialistResult {
|
||||
results := make([]specialistResult, len(specialists))
|
||||
|
||||
// Optional live status board: publishes this model's per-lens progress to a
|
||||
// file the entrypoint board renders. Inert (no-op) unless GADFLY_STATUS_FILE
|
||||
// is set, so plain runs are unaffected.
|
||||
sw := newStatusWriter(os.Getenv("GADFLY_MODEL"), modelProvider(), specialists)
|
||||
|
||||
conc := min(lensConcurrency(), len(specialists))
|
||||
fanResults := fanout.Run(context.Background(), specialists, fanout.Options[Specialist]{
|
||||
MaxConcurrent: lensConcurrency(),
|
||||
}, func(_ context.Context, sp Specialist) (res specialistResult, _ error) {
|
||||
// A panic in one lens must not crash the whole binary (which would kill
|
||||
// every other lens's output) or leave this lens stuck at "running" on the
|
||||
// status board. fanout does not recover fn panics, so we do it here:
|
||||
// record the panic as an errored result and mark the lens finished.
|
||||
defer func() {
|
||||
if r := recover(); r != nil {
|
||||
res = specialistResult{spec: sp, out: fmt.Sprintf("⚠️ This reviewer panicked: %v", r), verdict: verdictUnknown, errored: true}
|
||||
sw.set(sp.Name, lensFinished, "", true)
|
||||
}
|
||||
}()
|
||||
sw.set(sp.Name, lensRunning, "", false)
|
||||
out, errored := reviewWithSpecialist(eng, base, sp, task, diff)
|
||||
v := parseVerdict(out)
|
||||
sw.set(sp.Name, lensFinished, v.label(), errored)
|
||||
return specialistResult{spec: sp, out: out, verdict: v, errored: errored}, nil
|
||||
})
|
||||
|
||||
sem := make(chan struct{}, conc)
|
||||
var wg sync.WaitGroup
|
||||
for i, sp := range specialists {
|
||||
wg.Add(1)
|
||||
sem <- struct{}{} // blocks once `conc` lenses are already in flight
|
||||
go func(i int, sp Specialist) {
|
||||
defer wg.Done()
|
||||
defer func() { <-sem }()
|
||||
// A panic in one lens must not crash the whole binary (which would
|
||||
// kill every other lens's output) or leave this lens stuck at
|
||||
// "running" on the status board. Recover, record it as an errored
|
||||
// result, and mark the lens finished so the board can complete.
|
||||
defer func() {
|
||||
if r := recover(); r != nil {
|
||||
results[i] = specialistResult{spec: sp, out: fmt.Sprintf("⚠️ This reviewer panicked: %v", r), verdict: verdictUnknown, errored: true}
|
||||
sw.set(sp.Name, lensFinished, "", true)
|
||||
}
|
||||
}()
|
||||
sw.set(sp.Name, lensRunning, "", false)
|
||||
out, errored := reviewWithSpecialist(eng, base, sp, task, diff)
|
||||
v := parseVerdict(out)
|
||||
results[i] = specialistResult{spec: sp, out: out, verdict: v, errored: errored}
|
||||
sw.set(sp.Name, lensFinished, v.label(), errored)
|
||||
}(i, sp)
|
||||
// fanout guarantees input order; its Result.Err is set only when the context
|
||||
// is cancelled before a lens ran (reviewWithSpecialist embeds its own failures
|
||||
// in the result), so surface that as an errored lens rather than dropping it.
|
||||
results := make([]specialistResult, len(specialists))
|
||||
for i, r := range fanResults {
|
||||
if r.Err != nil {
|
||||
results[i] = specialistResult{spec: specialists[i], out: fmt.Sprintf("⚠️ This reviewer did not run: %v", r.Err), verdict: verdictUnknown, errored: true}
|
||||
sw.set(specialists[i].Name, lensFinished, "", true)
|
||||
continue
|
||||
}
|
||||
results[i] = r.Value
|
||||
}
|
||||
wg.Wait()
|
||||
return results
|
||||
}
|
||||
|
||||
@@ -320,8 +321,7 @@ func providerOverride(envName, provider string) (int, bool) {
|
||||
// returned bool is true when the review pass failed (rendered as an inline
|
||||
// notice — advisory; one lens failing never sinks the others or the job).
|
||||
func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, diff string) (string, bool) {
|
||||
timeout := time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second
|
||||
ctx, cancel := context.WithTimeout(context.Background(), timeout)
|
||||
ctx, cancel := context.WithTimeout(context.Background(), reviewTimeout())
|
||||
defer cancel()
|
||||
|
||||
draft, err := eng.runPass(ctx, composeSpecialistPrompt(base, sp), task,
|
||||
@@ -344,90 +344,6 @@ func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, di
|
||||
return final, false
|
||||
}
|
||||
|
||||
// runAgent runs one agent pass (its own fresh toolbox over the sandbox) and
|
||||
// returns the final answer. An empty answer is an error — the caller decides
|
||||
// whether that is fatal (review pass) or recoverable (recheck pass). A
|
||||
// non-empty answer that ended on a budget/guard error is still returned: the
|
||||
// model wrote its output, then ran out of steps.
|
||||
//
|
||||
// Two mechanisms keep a step-hungry model from hard-failing with no output:
|
||||
// 1. A wrap-up steer: once the run comes within wrapUpReserve steps of the
|
||||
// cap, a forceful "stop calling tools, write your final answer" message is
|
||||
// injected so the model spends its remaining steps finalizing.
|
||||
// 2. A finalization fallback: if the loop still ends empty (the model ignored
|
||||
// the nudge, or a loop guard tripped), one tool-free model call forces a
|
||||
// final answer out of the transcript already gathered.
|
||||
func runAgent(ctx context.Context, mdl llm.Model, fsTools *repoFS, system, task string, maxSteps int) (string, error) {
|
||||
box, err := fsTools.toolbox()
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
loop := agent.New(mdl, system,
|
||||
agent.WithToolbox(box),
|
||||
agent.WithMaxSteps(maxSteps),
|
||||
// Guard rails: stop the model from spinning on failing or identical
|
||||
// tool calls instead of writing its answer.
|
||||
agent.WithToolErrorLimits(4, 4),
|
||||
)
|
||||
|
||||
wrapUpAt := maxSteps - wrapUpReserve()
|
||||
if wrapUpAt < 1 {
|
||||
wrapUpAt = 1
|
||||
}
|
||||
var completed int // steps finished so far (updated after each step)
|
||||
nudged := false
|
||||
|
||||
res, runErr := loop.Run(ctx, task,
|
||||
agent.OnStep(func(s agent.Step) { completed = s.Index + 1 }),
|
||||
agent.WithSteer(func() []llm.Message {
|
||||
if !nudged && completed >= wrapUpAt {
|
||||
nudged = true
|
||||
return []llm.Message{llm.UserText(wrapUpInstruction)}
|
||||
}
|
||||
return nil
|
||||
}),
|
||||
)
|
||||
|
||||
out := ""
|
||||
if res != nil {
|
||||
out = strings.TrimSpace(res.Output)
|
||||
}
|
||||
if out != "" {
|
||||
return out, nil
|
||||
}
|
||||
|
||||
// No final answer. If we still have budget on the clock and a transcript to
|
||||
// work from, force a tool-free finalization rather than losing the pass.
|
||||
if res != nil && len(res.Messages) > 0 && ctx.Err() == nil {
|
||||
if forced := forceFinalAnswer(ctx, mdl, system, res.Messages); forced != "" {
|
||||
return forced, nil
|
||||
}
|
||||
}
|
||||
|
||||
if runErr != nil {
|
||||
return "", runErr
|
||||
}
|
||||
return "", errors.New("agent produced no output")
|
||||
}
|
||||
|
||||
// forceFinalAnswer makes one tool-free model call to squeeze a final answer out
|
||||
// of an agent that exhausted its step budget without producing one. Tools are
|
||||
// forbidden (ToolChoice "none") so the model must synthesize from the transcript
|
||||
// instead of investigating further. Best-effort: any error or empty reply
|
||||
// returns "" and the caller falls back to its normal empty-output handling.
|
||||
func forceFinalAnswer(ctx context.Context, mdl llm.Model, system string, transcript []llm.Message) string {
|
||||
msgs := append(append([]llm.Message(nil), transcript...), llm.UserText(finalizeInstruction))
|
||||
resp, err := mdl.Generate(ctx, llm.Request{
|
||||
System: system,
|
||||
Messages: msgs,
|
||||
ToolChoice: "none",
|
||||
})
|
||||
if err != nil || resp == nil {
|
||||
return ""
|
||||
}
|
||||
return strings.TrimSpace(resp.Text())
|
||||
}
|
||||
|
||||
// wrapUpReserve is how many steps before the cap the wrap-up nudge fires,
|
||||
// overridable via GADFLY_WRAPUP_RESERVE.
|
||||
func wrapUpReserve() int {
|
||||
@@ -444,7 +360,7 @@ func buildTask(diff string) string {
|
||||
truncNote := ""
|
||||
if maxDiff > 0 && len(diff) > maxDiff {
|
||||
diff = diff[:maxDiff]
|
||||
truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars in this message; read the changed files (or call get_diff, if available) for the full text.]", maxDiff)
|
||||
truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars in this message; page the full diff with get_diff (paginated; pass a `path` to scope it to one file) or read the changed files.]", maxDiff)
|
||||
}
|
||||
|
||||
var b strings.Builder
|
||||
@@ -471,3 +387,23 @@ func envInt(name string, def int) int {
|
||||
}
|
||||
return n
|
||||
}
|
||||
|
||||
// envBool reads a boolean-ish env var: def when unset, false for an explicit
|
||||
// falsey value (0/false/no/off), true otherwise. The shared spelling for
|
||||
// gadfly's "on unless disabled" opt-out flags (GADFLY_RECHECK, GADFLY_COMPACT).
|
||||
func envBool(name string, def bool) bool {
|
||||
switch strings.ToLower(strings.TrimSpace(os.Getenv(name))) {
|
||||
case "":
|
||||
return def
|
||||
case "0", "false", "no", "off":
|
||||
return false
|
||||
default:
|
||||
return true
|
||||
}
|
||||
}
|
||||
|
||||
// reviewTimeout is the per-specialist-lens deadline (GADFLY_TIMEOUT_SECS), shared
|
||||
// across a lens's review+recheck passes and applied as each pass's run cap.
|
||||
func reviewTimeout() time.Duration {
|
||||
return time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second
|
||||
}
|
||||
|
||||
@@ -127,6 +127,27 @@ func resolveWorkerModel() (llm.Model, error) {
|
||||
return majordomo.Parse(buildSpec(provider, spec))
|
||||
}
|
||||
|
||||
// resolveSummarizerModel picks the model the context compactor uses to compress
|
||||
// the runaway middle of a transcript. It should be CHEAP, since it fires once per
|
||||
// compaction: GADFLY_COMPACT_MODEL if set (honoring GADFLY_PROVIDER for a bare
|
||||
// id), else the delegate worker model when one is configured (already cheap by
|
||||
// design), else the review model itself. Returns (nil, err) only on an explicit
|
||||
// bad GADFLY_COMPACT_MODEL spec — the caller logs it and simply runs without
|
||||
// compaction rather than failing the review.
|
||||
func resolveSummarizerModel(review, worker llm.Model) (llm.Model, error) {
|
||||
if spec := strings.TrimSpace(os.Getenv("GADFLY_COMPACT_MODEL")); spec != "" {
|
||||
provider := strings.TrimSpace(os.Getenv("GADFLY_PROVIDER"))
|
||||
if provider == "" {
|
||||
provider = defaultProvider
|
||||
}
|
||||
return majordomo.Parse(buildSpec(provider, spec))
|
||||
}
|
||||
if worker != nil {
|
||||
return worker, nil
|
||||
}
|
||||
return review, nil
|
||||
}
|
||||
|
||||
// buildSpec turns (provider, model) into a majordomo spec. A model id that
|
||||
// already carries a "provider/" prefix (or is a multi-element failover chain)
|
||||
// is passed through verbatim; a bare id is prefixed with the provider.
|
||||
|
||||
@@ -2,7 +2,6 @@ package main
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"os"
|
||||
"strings"
|
||||
)
|
||||
|
||||
@@ -11,6 +10,13 @@ import (
|
||||
// than discovering them.
|
||||
const defaultRecheckMaxSteps = 16
|
||||
|
||||
// defaultRecheckDiffChars caps the diff embedded in the recheck task. It is much
|
||||
// smaller than the review task's GADFLY_MAX_DIFF_CHARS: the recheck already has
|
||||
// the draft findings to verify and can pull the exact hunks it needs via the
|
||||
// paginated get_diff tool (optionally scoped to a path), so re-embedding the
|
||||
// whole diff on every recheck step is pure burn. Override: GADFLY_RECHECK_DIFF_CHARS.
|
||||
const defaultRecheckDiffChars = 20000
|
||||
|
||||
// recheckSystemPrompt drives the second, adversarial verification pass. The
|
||||
// model is given a DRAFT review and must independently confirm each finding
|
||||
// against the real code before letting it survive — the antidote to a
|
||||
@@ -58,14 +64,7 @@ Output rules:
|
||||
|
||||
// recheckEnabled reports whether the verification pass should run. On unless
|
||||
// GADFLY_RECHECK is explicitly a falsey value.
|
||||
func recheckEnabled() bool {
|
||||
switch strings.ToLower(strings.TrimSpace(os.Getenv("GADFLY_RECHECK"))) {
|
||||
case "0", "false", "no", "off":
|
||||
return false
|
||||
default:
|
||||
return true
|
||||
}
|
||||
}
|
||||
func recheckEnabled() bool { return envBool("GADFLY_RECHECK", true) }
|
||||
|
||||
// shouldRecheck decides whether to run the verification pass for a given draft.
|
||||
// A clean "no material issues" draft has nothing to verify, so it is skipped
|
||||
@@ -84,11 +83,11 @@ func shouldRecheck(draft string) bool {
|
||||
// scrutinize, with the full diff available via get_diff (and embedded here,
|
||||
// truncated, to save a tool call).
|
||||
func buildRecheckTask(draft, diff string) string {
|
||||
maxDiff := envInt("GADFLY_MAX_DIFF_CHARS", defaultMaxDiffChars)
|
||||
maxDiff := envInt("GADFLY_RECHECK_DIFF_CHARS", defaultRecheckDiffChars)
|
||||
truncNote := ""
|
||||
if maxDiff > 0 && len(diff) > maxDiff {
|
||||
diff = diff[:maxDiff]
|
||||
truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars here; read the changed files (or call get_diff, if available) for the full text.]", maxDiff)
|
||||
truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars here; call get_diff (paginated; pass a `path` to scope it to one file) or read the changed files for the rest.]", maxDiff)
|
||||
}
|
||||
|
||||
var b strings.Builder
|
||||
|
||||
@@ -49,7 +49,7 @@ func TestRecheckEnabled(t *testing.T) {
|
||||
}
|
||||
|
||||
func TestBuildRecheckTask(t *testing.T) {
|
||||
t.Setenv("GADFLY_MAX_DIFF_CHARS", "")
|
||||
t.Setenv("GADFLY_RECHECK_DIFF_CHARS", "")
|
||||
draft := "VERDICT: Blocking issues found\n- foo.go:1 broken"
|
||||
out := buildRecheckTask(draft, "diff --git a/x b/x\n+y\n")
|
||||
if !strings.Contains(out, draft) {
|
||||
@@ -77,25 +77,46 @@ func fakeModel(t *testing.T, reply string) llm.Model {
|
||||
return m
|
||||
}
|
||||
|
||||
func TestRunAgent_ReturnsOutput(t *testing.T) {
|
||||
// newTestReviewExecutor builds a reviewExecutor over a fake model + repo for unit
|
||||
// tests: no compaction summarizer and no budget, so it exercises the bare agent
|
||||
// loop through the executus run kernel without any network call.
|
||||
func newTestReviewExecutor(t *testing.T, mdl llm.Model, fs *repoFS) *reviewExecutor {
|
||||
t.Helper()
|
||||
rex, err := newReviewExecutor(fs, mdl, nil, "mock", nil)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
return rex
|
||||
}
|
||||
|
||||
// testEngine wraps newTestReviewExecutor in a majordomoEngine for the
|
||||
// runSpecialists-level tests.
|
||||
func testEngine(t *testing.T, mdl llm.Model, fs *repoFS) *majordomoEngine {
|
||||
t.Helper()
|
||||
return &majordomoEngine{rex: newTestReviewExecutor(t, mdl, fs), mdl: mdl}
|
||||
}
|
||||
|
||||
func TestReviewExecutor_ReturnsOutput(t *testing.T) {
|
||||
fs, err := newRepoFS(t.TempDir(), "diff")
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
mdl := fakeModel(t, " corrected review: No material issues found. ")
|
||||
out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4)
|
||||
rex := newTestReviewExecutor(t, mdl, fs)
|
||||
out, err := rex.run(context.Background(), "sys", "task", 4)
|
||||
if err != nil {
|
||||
t.Fatalf("runAgent: %v", err)
|
||||
t.Fatalf("run: %v", err)
|
||||
}
|
||||
if out != "corrected review: No material issues found." {
|
||||
t.Errorf("runAgent should return trimmed model output, got %q", out)
|
||||
t.Errorf("run should return trimmed model output, got %q", out)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunAgent_EmptyIsError(t *testing.T) {
|
||||
func TestReviewExecutor_EmptyIsError(t *testing.T) {
|
||||
fs, _ := newRepoFS(t.TempDir(), "diff")
|
||||
mdl := fakeModel(t, " ")
|
||||
if _, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4); err == nil {
|
||||
t.Error("runAgent should error on empty model output")
|
||||
rex := newTestReviewExecutor(t, mdl, fs)
|
||||
if _, err := rex.run(context.Background(), "sys", "task", 4); err == nil {
|
||||
t.Error("run should error on empty model output")
|
||||
}
|
||||
}
|
||||
|
||||
@@ -9,6 +9,7 @@ import (
|
||||
"regexp"
|
||||
"sort"
|
||||
"strings"
|
||||
"sync"
|
||||
|
||||
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
|
||||
)
|
||||
@@ -17,11 +18,13 @@ import (
|
||||
// every tool caps how much it can pull in one call — a runaway read_file or
|
||||
// grep would blow the window and stall the loop.
|
||||
const (
|
||||
maxFileBytes = 64 * 1024 // per read_file call
|
||||
maxReadLines = 800 // per read_file call
|
||||
maxGrepResults = 200 // per grep call
|
||||
maxFindResults = 200 // per find_files call
|
||||
maxLineLen = 400 // truncate any single returned line to this
|
||||
maxFileBytes = 64 * 1024 // per read_file call
|
||||
maxReadLines = 800 // per read_file call
|
||||
maxGrepResults = 200 // per grep call
|
||||
maxFindResults = 200 // per find_files call
|
||||
maxLineLen = 400 // truncate any single returned line to this
|
||||
maxGetDiffLines = 800 // per get_diff call (paginated window)
|
||||
maxGetDiffBytes = 64 * 1024 // per get_diff call
|
||||
)
|
||||
|
||||
// skipDirs are never descended into by grep / find_files — noise and bulk that
|
||||
@@ -40,6 +43,12 @@ type repoFS struct {
|
||||
root string // absolute, symlink-resolved repo root
|
||||
diff string // the full PR unified diff (served by get_diff)
|
||||
worker llm.Model // optional cheap model for delegate_investigation; nil = no delegation
|
||||
|
||||
// diffLines caches the split diff so paging through get_diff doesn't re-split
|
||||
// the whole (possibly large) diff on every call. The diff is immutable, so the
|
||||
// cache is computed once and safe to share across concurrent lenses.
|
||||
diffOnce sync.Once
|
||||
diffLines []string
|
||||
}
|
||||
|
||||
// newRepoFS resolves root to an absolute, symlink-free path.
|
||||
@@ -108,15 +117,24 @@ func (r *repoFS) fsTools() []llm.Tool {
|
||||
}
|
||||
}
|
||||
|
||||
// toolbox builds the reviewer's toolbox: the read-only repo tools, plus the
|
||||
// delegate_investigation tool when a worker model is configured.
|
||||
func (r *repoFS) toolbox() (*llm.Toolbox, error) {
|
||||
box := llm.NewToolbox("gadfly")
|
||||
// allTools is the single source of truth for the reviewer's tool set: the
|
||||
// read-only repo tools, plus delegate_investigation when a worker model is
|
||||
// configured. Both the executus registry (gadflyToolRegistry, the production
|
||||
// path) and toolbox() build from this list.
|
||||
func (r *repoFS) allTools() []llm.Tool {
|
||||
tools := r.fsTools()
|
||||
if r.worker != nil {
|
||||
tools = append(tools, r.delegateTool())
|
||||
}
|
||||
for _, t := range tools {
|
||||
return tools
|
||||
}
|
||||
|
||||
// toolbox builds a majordomo toolbox from allTools(). The production review path
|
||||
// now goes through executus's tool.Registry (see executus.go); this remains for
|
||||
// the toolbox-level tests (the `call` helper).
|
||||
func (r *repoFS) toolbox() (*llm.Toolbox, error) {
|
||||
box := llm.NewToolbox("gadfly")
|
||||
for _, t := range r.allTools() {
|
||||
if err := box.Add(t); err != nil {
|
||||
return nil, fmt.Errorf("add tool %q: %w", t.Name, err)
|
||||
}
|
||||
@@ -396,15 +414,124 @@ func (r *repoFS) findFilesTool() llm.Tool {
|
||||
)
|
||||
}
|
||||
|
||||
type getDiffArgs struct {
|
||||
Path string `json:"path,omitempty" description:"Optional changed-file path (e.g. pkg/foo/bar.go); returns ONLY that file's diff hunks. Omit for the whole diff. Use this on a large PR to pull just the file a finding is about."`
|
||||
StartLine int `json:"start_line,omitempty" description:"Optional 1-based line to start from within the (whole or path-scoped) diff (default 1)."`
|
||||
Limit int `json:"limit,omitempty" description:"Optional max number of diff lines to return (default/maximum 800)."`
|
||||
}
|
||||
|
||||
func (r *repoFS) getDiffTool() llm.Tool {
|
||||
return llm.DefineTool[struct{}](
|
||||
return llm.DefineTool[getDiffArgs](
|
||||
"get_diff",
|
||||
"Return the complete unified diff under review. The diff is also included (possibly truncated) in the task message; call this to get the full, untruncated text.",
|
||||
func(_ context.Context, _ struct{}) (any, error) {
|
||||
if strings.TrimSpace(r.diff) == "" {
|
||||
return "(empty diff)", nil
|
||||
"Return the unified diff under review as a numbered, PAGINATED window (like read_file) — not the whole diff at once, so a huge PR can't blow the context window. Pass `path` to fetch just one changed file's hunks, or `start_line`/`limit` to page through. A truncated copy of the diff is also embedded in the task message.",
|
||||
func(_ context.Context, args getDiffArgs) (any, error) {
|
||||
scope := "the diff"
|
||||
var lines []string
|
||||
if p := strings.TrimSpace(args.Path); p != "" {
|
||||
lines = diffLinesForPath(r.diff, p)
|
||||
if len(lines) == 0 {
|
||||
return fmt.Sprintf("(no diff hunks for path %q; check it against the changed-files list — the path must match a file the diff touches)", p), nil
|
||||
}
|
||||
scope = "the diff for " + p
|
||||
|
gitea-actions
commented
🟡 windowDiff re-splits the whole diff per get_diff call; paging a large diff is O(n^2/limit) on the large-PR hot path correctness, performance · flagged by 2 models
🪰 Gadfly · advisory 🟡 **windowDiff re-splits the whole diff per get_diff call; paging a large diff is O(n^2/limit) on the large-PR hot path**
_correctness, performance · flagged by 2 models_
- **`cmd/gadfly/tools.go:435` (`windowDiff`) re-splits the entire diff on every `get_diff` call, even to return one 800-line window.** `strings.Split(diff, "\n")` materializes a `[]string` over the *whole* diff regardless of `start`/`limit`. On exactly the large PRs this feature targets, the agent is steered to call `get_diff` repeatedly — paging via `start_line` and re-fetching per `path`. Each call is O(total-diff-lines) work plus a fresh line-slice allocation, so **paging a large diff to its…
<sub>🪰 Gadfly · advisory</sub>
|
||||
} else {
|
||||
lines = r.diffAllLines()
|
||||
if len(lines) == 0 {
|
||||
return "(empty diff)", nil
|
||||
}
|
||||
}
|
||||
return r.diff, nil
|
||||
return windowDiff(lines, scope, args.StartLine, args.Limit), nil
|
||||
},
|
||||
)
|
||||
}
|
||||
|
||||
// diffAllLines returns the whole diff split into lines, cached so paging never
|
||||
// re-splits the (possibly large) diff.
|
||||
func (r *repoFS) diffAllLines() []string {
|
||||
r.diffOnce.Do(func() { r.diffLines = splitDiffLines(r.diff) })
|
||||
return r.diffLines
|
||||
}
|
||||
|
||||
// splitDiffLines splits a unified diff into lines, dropping the single trailing
|
||||
// empty element a trailing newline produces — otherwise windowDiff would emit a
|
||||
// blank final line and over-count the total by one.
|
||||
func splitDiffLines(diff string) []string {
|
||||
lines := strings.Split(diff, "\n")
|
||||
if n := len(lines); n > 0 && lines[n-1] == "" {
|
||||
lines = lines[:n-1]
|
||||
}
|
||||
|
gitea-actions
commented
🟡 windowDiff truncation message off-by-one: says 'truncated at line N' when N was the last emitted line error-handling · flagged by 1 model
🪰 Gadfly · advisory 🟡 **windowDiff truncation message off-by-one: says 'truncated at line N' when N was the last emitted line**
_error-handling · flagged by 1 model_
- **`windowDiff` truncation message off-by-one** (`cmd/gadfly/tools.go:461`): The message says `truncated at line %d` using `i` (0-based index of the first *unemitted* line), but the last *emitted* line was `i` (1-based). For example, after emitting lines 1–800, `i=800` (0-based), and the message reads "truncated at line 800 of 850" — but line 800 was the last line shown; truncation actually happened at line 801. The `start_line` hint correctly uses `i+1` (801), so the paging instruction is righ…
<sub>🪰 Gadfly · advisory</sub>
|
||||
return lines
|
||||
}
|
||||
|
||||
// windowDiff returns a numbered, paginated slice of pre-split diff lines,
|
||||
// mirroring read_file's caps (maxGetDiffLines / maxGetDiffBytes / maxLineLen) so
|
||||
// a single get_diff call can never dump a multi-hundred-KB diff into the
|
||||
// transcript — the amplifier behind the large-PR token burn. The full diff stays
|
||||
// reachable by paging with start_line, or scoped per file via the path arg.
|
||||
func windowDiff(lines []string, scope string, start, limit int) string {
|
||||
|
gitea-actions
commented
🟠 diffForPath re-splits the full diff on every scoped get_diff call, allocating O(diff_size) per invocation correctness, error-handling, performance · flagged by 5 models The finding is confirmed. The code at 🪰 Gadfly · advisory 🟠 **diffForPath re-splits the full diff on every scoped get_diff call, allocating O(diff_size) per invocation**
_correctness, error-handling, performance · flagged by 5 models_
The finding is confirmed. The code at `cmd/gadfly/tools.go:470-483` matches the draft's description exactly: `inSection = strings.Contains(ln, path)` is an unanchored substring match against the full `diff --git a/… b/…` header line, so a `path` that is a substring of another changed file's path (e.g. `foo.go` vs `vendor/foo.go`) pulls in both sections. This survives verification.
<sub>🪰 Gadfly · advisory</sub>
|
||||
total := len(lines)
|
||||
if start < 1 {
|
||||
start = 1
|
||||
}
|
||||
if limit <= 0 || limit > maxGetDiffLines {
|
||||
limit = maxGetDiffLines
|
||||
}
|
||||
if start > total {
|
||||
return fmt.Sprintf("(%s has %d lines; nothing at/after line %d)", scope, total, start)
|
||||
}
|
||||
var b strings.Builder
|
||||
emitted := 0
|
||||
i := start - 1
|
||||
for ; i < total; i++ {
|
||||
if emitted >= limit || b.Len() >= maxGetDiffBytes {
|
||||
break
|
||||
}
|
||||
line := lines[i]
|
||||
if len(line) > maxLineLen {
|
||||
line = line[:maxLineLen] + "…"
|
||||
}
|
||||
fmt.Fprintf(&b, "%d\t%s\n", i+1, line)
|
||||
emitted++
|
||||
}
|
||||
if i < total {
|
||||
// i (0-based) is the first line NOT emitted; line i was the last shown.
|
||||
fmt.Fprintf(&b, "... (%s truncated after line %d of %d; call get_diff again with start_line=%d for the rest, or pass a `path` to scope to one file)\n", scope, i, total, i+1)
|
||||
}
|
||||
return b.String()
|
||||
}
|
||||
|
||||
// diffLinesForPath returns the unified-diff lines for one changed file: from the
|
||||
// `diff --git` header that names path through to the next header (or end). The
|
||||
// header names two path tokens (a/<old> b/<new>); a match is on a WHOLE token,
|
||||
// so path "foo.go" does not pull in "barfoo.go" — and a trailing "/" scopes to a
|
||||
// directory (e.g. "pkg/foo/" matches pkg/foo/bar.go).
|
||||
func diffLinesForPath(diff, path string) []string {
|
||||
want := strings.TrimPrefix(strings.TrimPrefix(strings.TrimSpace(path), "a/"), "b/")
|
||||
var out []string
|
||||
inSection := false
|
||||
for _, ln := range splitDiffLines(diff) {
|
||||
if strings.HasPrefix(ln, "diff --git ") {
|
||||
inSection = diffHeaderNames(ln, want)
|
||||
}
|
||||
if inSection {
|
||||
out = append(out, ln)
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
// diffHeaderNames reports whether a `diff --git a/X b/Y` header names want as one
|
||||
// of its (a/b-stripped) path tokens — exact whole-token match, or a directory
|
||||
// prefix when want ends with "/".
|
||||
func diffHeaderNames(header, want string) bool {
|
||||
fields := strings.Fields(header)
|
||||
if len(fields) < 3 {
|
||||
return false
|
||||
}
|
||||
for _, f := range fields[2:] {
|
||||
p := strings.TrimPrefix(strings.TrimPrefix(f, "a/"), "b/")
|
||||
if p == want || (strings.HasSuffix(want, "/") && strings.HasPrefix(p, want)) {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
@@ -197,15 +197,93 @@ func TestFindFilesTool(t *testing.T) {
|
||||
|
||||
func TestGetDiffTool(t *testing.T) {
|
||||
root := buildFixtureRepo(t)
|
||||
const diff = "diff --git a/x b/x\n+added line\n"
|
||||
const diff = "diff --git a/x b/x\n--- a/x\n+++ b/x\n+added line\n" +
|
||||
"diff --git a/y.go b/y.go\n--- a/y.go\n+++ b/y.go\n+y change\n"
|
||||
fs, _ := newRepoFS(root, diff)
|
||||
|
||||
// Default: the whole diff as a NUMBERED window (paginated), not a raw dump —
|
||||
// so a huge PR can't be poured into the transcript in one call.
|
||||
out, err := call(t, fs, "get_diff", map[string]any{})
|
||||
if err != nil {
|
||||
t.Fatalf("get_diff: %v", err)
|
||||
}
|
||||
if !strings.Contains(out, "1\tdiff --git a/x b/x") {
|
||||
t.Errorf("get_diff should return a numbered window, got:\n%s", out)
|
||||
}
|
||||
if !strings.Contains(out, "+added line") || !strings.Contains(out, "+y change") {
|
||||
t.Errorf("the full (short) diff window should include every hunk, got:\n%s", out)
|
||||
}
|
||||
|
||||
// path filter: only the named file's hunks come back.
|
||||
out, err = call(t, fs, "get_diff", map[string]any{"path": "y.go"})
|
||||
if err != nil {
|
||||
t.Fatalf("get_diff path: %v", err)
|
||||
}
|
||||
if !strings.Contains(out, "y change") {
|
||||
t.Errorf("get_diff path=y.go should include y's hunk, got:\n%s", out)
|
||||
}
|
||||
if strings.Contains(out, "added line") {
|
||||
t.Errorf("get_diff path=y.go must NOT include x's hunk, got:\n%s", out)
|
||||
}
|
||||
|
||||
// unknown path: a clear note, never an error.
|
||||
out, err = call(t, fs, "get_diff", map[string]any{"path": "nope.txt"})
|
||||
if err != nil {
|
||||
t.Fatalf("get_diff unknown path: %v", err)
|
||||
}
|
||||
if !strings.Contains(out, "no diff hunks") {
|
||||
t.Errorf("get_diff for an unknown path should note no hunks, got:\n%s", out)
|
||||
}
|
||||
}
|
||||
|
||||
// TestGetDiffTool_Paginates: a diff longer than the per-call line cap is returned
|
||||
// as a truncated window with a paging hint, and start_line pages past it — the
|
||||
// mechanism that stops get_diff from dumping a multi-hundred-KB diff at once.
|
||||
func TestGetDiffTool_Paginates(t *testing.T) {
|
||||
diff := "diff --git a/big b/big\n" + strings.Repeat("+line\n", maxGetDiffLines+50)
|
||||
fs, _ := newRepoFS(t.TempDir(), diff)
|
||||
|
||||
out, err := call(t, fs, "get_diff", map[string]any{})
|
||||
if err != nil {
|
||||
t.Fatalf("get_diff: %v", err)
|
||||
}
|
||||
if out != diff {
|
||||
t.Errorf("get_diff returned %q, want %q", out, diff)
|
||||
if !strings.Contains(out, "truncated after line") {
|
||||
t.Error("a diff longer than the per-call cap should be truncated with a paging hint")
|
||||
}
|
||||
if strings.Contains(out, "801\t") {
|
||||
t.Error("the first window must stop at the line cap, not reach line 801")
|
||||
}
|
||||
|
||||
out, err = call(t, fs, "get_diff", map[string]any{"start_line": 805})
|
||||
if err != nil {
|
||||
t.Fatalf("get_diff page: %v", err)
|
||||
}
|
||||
if !strings.Contains(out, "805\t") {
|
||||
t.Error("paging with start_line=805 should include line 805")
|
||||
}
|
||||
}
|
||||
|
||||
// TestDiffLinesForPath_Anchored: get_diff path= matches on a WHOLE path token,
|
||||
// so "foo.go" never pulls in "barfoo.go" (the unanchored-substring weakness the
|
||||
// swarm flagged), while a trailing "/" still scopes to a directory.
|
||||
func TestDiffLinesForPath_Anchored(t *testing.T) {
|
||||
diff := "diff --git a/foo.go b/foo.go\n+foo change\n" +
|
||||
"diff --git a/barfoo.go b/barfoo.go\n+barfoo change\n" +
|
||||
"diff --git a/pkg/x.go b/pkg/x.go\n+x change\n"
|
||||
|
||||
joined := strings.Join(diffLinesForPath(diff, "foo.go"), "\n")
|
||||
if !strings.Contains(joined, "foo change") {
|
||||
t.Errorf("foo.go should match its own hunk:\n%s", joined)
|
||||
}
|
||||
if strings.Contains(joined, "barfoo change") {
|
||||
t.Errorf("foo.go must NOT match barfoo.go (unanchored substring regression):\n%s", joined)
|
||||
}
|
||||
|
||||
if !strings.Contains(strings.Join(diffLinesForPath(diff, "pkg/"), "\n"), "x change") {
|
||||
t.Error("a trailing-slash path should scope to the directory (pkg/ -> pkg/x.go)")
|
||||
}
|
||||
if len(diffLinesForPath(diff, "nope.go")) != 0 {
|
||||
t.Error("an unknown path should yield no lines")
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -37,10 +37,11 @@ func lastUserText(req llm.Request) string {
|
||||
return req.Messages[len(req.Messages)-1].Text()
|
||||
}
|
||||
|
||||
// TestRunAgent_WrapUpNudgeProducesAnswer: a model that keeps calling tools until
|
||||
// it is nudged to wrap up should still finish inside its budget — the steer
|
||||
// message arrives a few steps before the cap and the model writes its answer.
|
||||
func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) {
|
||||
// TestReviewExecutor_WrapUpNudgeProducesAnswer: a model that keeps calling tools
|
||||
// until it is nudged to wrap up should still finish inside its budget — the steer
|
||||
// message (delivered by the executus wrap-up critic a few steps before the cap)
|
||||
// arrives and the model writes its answer.
|
||||
func TestReviewExecutor_WrapUpNudgeProducesAnswer(t *testing.T) {
|
||||
t.Setenv("GADFLY_WRAPUP_RESERVE", "4")
|
||||
|
||||
final := "VERDICT: No material issues found."
|
||||
@@ -60,9 +61,10 @@ func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) {
|
||||
}
|
||||
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
|
||||
|
||||
out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 12)
|
||||
rex := newTestReviewExecutor(t, mdl, fs)
|
||||
out, err := rex.run(context.Background(), "sys", "task", 12)
|
||||
if err != nil {
|
||||
t.Fatalf("runAgent should succeed via wrap-up nudge, got error: %v", err)
|
||||
t.Fatalf("run should succeed via wrap-up nudge, got error: %v", err)
|
||||
}
|
||||
if out != final {
|
||||
t.Errorf("expected final review %q, got %q", final, out)
|
||||
@@ -72,24 +74,19 @@ func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
// TestRunAgent_FinalizationFallback: a model that ignores the wrap-up nudge and
|
||||
// spins on tools until the cap should NOT hard-fail — the tool-free finalization
|
||||
// pass forces a final answer out of the transcript.
|
||||
func TestRunAgent_FinalizationFallback(t *testing.T) {
|
||||
// TestReviewExecutor_ExhaustionWithoutAnswerIsError: a model that ignores the
|
||||
// wrap-up nudge and spins on tools until the step cap produces no final answer.
|
||||
// The transcript-based forced-finalization fallback was removed in the executus
|
||||
// re-platform (run.Result does not expose the loop transcript), so the pass now
|
||||
// surfaces an error — which reviewWithSpecialist renders as an advisory "reviewer
|
||||
// failed to complete" notice rather than a phantom success.
|
||||
func TestReviewExecutor_ExhaustionWithoutAnswerIsError(t *testing.T) {
|
||||
t.Setenv("GADFLY_WRAPUP_RESERVE", "2")
|
||||
|
||||
final := "VERDICT: Minor issues\n- something"
|
||||
forcedCalled := false
|
||||
n := 0
|
||||
p := fake.New("fake", fake.WithDefault(func(_ string, req llm.Request) fake.Step {
|
||||
// Only the tool-free finalization pass forbids tools — reply there.
|
||||
if req.ToolChoice == "none" {
|
||||
forcedCalled = true
|
||||
return fake.Reply(final)
|
||||
}
|
||||
// Otherwise keep spinning, ignoring the wrap-up nudge entirely.
|
||||
p := fake.New("fake", fake.WithDefault(func(_ string, _ llm.Request) fake.Step {
|
||||
n++
|
||||
return spinToolCall(n)
|
||||
return spinToolCall(n) // spin forever, ignoring the wrap-up nudge
|
||||
}))
|
||||
mdl, err := p.Model("mock")
|
||||
if err != nil {
|
||||
@@ -97,37 +94,9 @@ func TestRunAgent_FinalizationFallback(t *testing.T) {
|
||||
}
|
||||
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
|
||||
|
||||
out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 6)
|
||||
if err != nil {
|
||||
t.Fatalf("runAgent should recover via finalization fallback, got error: %v", err)
|
||||
}
|
||||
if !forcedCalled {
|
||||
t.Error("finalization fallback was never invoked")
|
||||
}
|
||||
if out != final {
|
||||
t.Errorf("expected forced final answer %q, got %q", final, out)
|
||||
}
|
||||
}
|
||||
|
||||
// TestRunAgent_FallbackStillEmptyIsError: if even the tool-free finalization
|
||||
// yields nothing, runAgent surfaces an error rather than a phantom success.
|
||||
func TestRunAgent_FallbackStillEmptyIsError(t *testing.T) {
|
||||
n := 0
|
||||
p := fake.New("fake", fake.WithDefault(func(_ string, req llm.Request) fake.Step {
|
||||
if req.ToolChoice == "none" {
|
||||
return fake.Reply(" ") // finalization produces only whitespace
|
||||
}
|
||||
n++
|
||||
return spinToolCall(n)
|
||||
}))
|
||||
mdl, err := p.Model("mock")
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
|
||||
|
||||
if _, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4); err == nil {
|
||||
t.Error("runAgent should error when the finalization fallback also yields no output")
|
||||
rex := newTestReviewExecutor(t, mdl, fs)
|
||||
if _, err := rex.run(context.Background(), "sys", "task", 6); err == nil {
|
||||
t.Error("run should error when the model exhausts its steps without an answer")
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -183,6 +183,36 @@ export GADFLY_FINDINGS_TOKEN="${GADFLY_FINDINGS_TOKEN:-}"
|
||||
MODELS="${GADFLY_MODELS:-${OLLAMA_REVIEW_MODELS:-$DEFAULT_MODELS}}"
|
||||
DEFAULT_CONC="${GADFLY_CONCURRENCY:-1}"
|
||||
|
||||
# --- huge-PR downshift ------------------------------------------------------
|
||||
# A very large diff is what burns the model budget: every review step re-sends
|
||||
# it, multiplied across models × lenses × passes × steps (this is what nuked a
|
||||
# whole Ollama Cloud block on one giant PR). entrypoint is the only process that
|
||||
# spans the whole fleet, so the fleet-wide size decision lives here: size the PR
|
||||
# diff ONCE, and above GADFLY_HUGE_DIFF_BYTES collapse to a single cheap model +
|
||||
# a focused lens subset, fewer steps, no recheck, and a smaller embedded diff.
|
||||
# A finished shallow review beats a budget-nuking one. All knobs override; set
|
||||
# GADFLY_HUGE_DIFF_BYTES=0 to disable. Small PRs are never touched.
|
||||
HUGE_PR=0
|
||||
HUGE_DIFF_BYTES="${GADFLY_HUGE_DIFF_BYTES:-600000}"
|
||||
if [ "$HUGE_DIFF_BYTES" -gt 0 ] 2>/dev/null; then
|
||||
PR_DIFF_BYTES="$(API "${GITEA_API}/pulls/${PR}.diff" 2>/dev/null | wc -c | tr -d '[:space:]')"
|
||||
[ -z "$PR_DIFF_BYTES" ] && PR_DIFF_BYTES=0
|
||||
if [ "$PR_DIFF_BYTES" -gt "$HUGE_DIFF_BYTES" ] 2>/dev/null; then
|
||||
HUGE_PR=1
|
||||
log "huge PR: diff ${PR_DIFF_BYTES}B > ${HUGE_DIFF_BYTES}B — downshifting the fleet (advisory)"
|
||||
MODELS="${GADFLY_HUGE_DIFF_MODELS:-${MODELS%%,*}}" # first model only by default
|
||||
export GADFLY_SPECIALISTS="${GADFLY_HUGE_DIFF_SPECIALISTS:-security,correctness,error-handling}"
|
||||
export GADFLY_MAX_STEPS="${GADFLY_HUGE_DIFF_MAX_STEPS:-12}"
|
||||
export GADFLY_RECHECK_MAX_STEPS="${GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS:-8}"
|
||||
export GADFLY_RECHECK="${GADFLY_HUGE_DIFF_RECHECK:-0}" # skip recheck on huge PRs
|
||||
# The Go-visible name directly (run.sh prefers GADFLY_MAX_DIFF_CHARS over its
|
||||
|
gitea-actions
commented
⚪ downshift exports MAX_DIFF_CHARS (not GADFLY_MAX_DIFF_CHARS); the aliasing lives only in run.sh and is invisible from the Go side maintainability · flagged by 1 model
🪰 Gadfly · advisory ⚪ **downshift exports MAX_DIFF_CHARS (not GADFLY_MAX_DIFF_CHARS); the aliasing lives only in run.sh and is invisible from the Go side**
_maintainability · flagged by 1 model_
- **`entrypoint.sh:208` — `MAX_DIFF_CHARS` env name is a shell-only alias invisible from the Go side.** The downshift sets `export MAX_DIFF_CHARS=...` with a comment "run.sh -> GADFLY_MAX_DIFF_CHARS." Confirmed: `scripts/run.sh:180` translates `MAX_DIFF_CHARS` to `GADFLY_MAX_DIFF_CHARS` when invoking the binary, and the Go tree reads only `GADFLY_MAX_DIFF_CHARS` (main.go:361) / `GADFLY_RECHECK_DIFF_CHARS` (recheck.go). `MAX_DIFF_CHARS` appears nowhere in Go. The indirection lives entirely in she…
<sub>🪰 Gadfly · advisory</sub>
|
||||
# own MAX_DIFF_CHARS), so the cap is honored without relying on run.sh's alias.
|
||||
export GADFLY_MAX_DIFF_CHARS="${GADFLY_HUGE_DIFF_MAX_DIFF_CHARS:-20000}"
|
||||
# Surfaced on each posted comment so the shallower review is self-explaining.
|
||||
export GADFLY_NOTICE="⚠️ Large PR (${PR_DIFF_BYTES} bytes): Gadfly downshifted to a focused, single-model review to stay within budget — coverage is intentionally shallower. Consider splitting the PR for a deeper review."
|
||||
fi
|
||||
fi
|
||||
|
||||
provider_of() { case "$1" in */*) echo "${1%%/*}";; *) echo "${GADFLY_PROVIDER:-ollama-cloud}";; esac; }
|
||||
|
||||
# Per-model status file path for the live board. The model id can contain '/'
|
||||
@@ -297,6 +327,32 @@ if [ "${GADFLY_STATUS_BOARD:-1}" != "0" ]; then
|
||||
log "status board started (pid ${BOARD_PID})"
|
||||
fi
|
||||
|
||||
# --- swarm-wide hard backstop ----------------------------------------------
|
||||
# A wall-clock ceiling across the WHOLE fleet, so a pathological PR can never
|
||||
# drain the usage block however the models behave. entrypoint is the only
|
||||
# process spanning every model, so a single "never exceed X" guard lives here.
|
||||
# On expiry it stops the review subtrees (the binary + run.sh); whatever partial
|
||||
# findings were gathered are still posted and the job never fails (advisory).
|
||||
# GADFLY_PR_BUDGET_SECS=0 (default) disables it.
|
||||
KILLER_PID=""
|
||||
rm -f "${WORKDIR}/.budget_killed" "${WORKDIR}/.disarmed" 2>/dev/null || true
|
||||
if [ "${GADFLY_PR_BUDGET_SECS:-0}" -gt 0 ] 2>/dev/null; then
|
||||
(
|
||||
sleep "${GADFLY_PR_BUDGET_SECS}"
|
||||
log "PR wall-clock budget (${GADFLY_PR_BUDGET_SECS}s) reached — stopping the review fleet (advisory; partial findings still posted)"
|
||||
: > "${WORKDIR}/.budget_killed"
|
||||
pkill -TERM -f '/usr/local/bin/gadfly' 2>/dev/null || true
|
||||
pkill -TERM -f "${SCRIPTS_DIR}/run.sh" 2>/dev/null || true
|
||||
|
gitea-actions
commented
🟠 Backstop KILL pkill can race the disarm and kill the consolidation gadfly pass error-handling · flagged by 1 model 🪰 Gadfly · advisory 🟠 **Backstop KILL pkill can race the disarm and kill the consolidation gadfly pass**
_error-handling · flagged by 1 model_
<sub>🪰 Gadfly · advisory</sub>
|
||||
sleep 5
|
||||
# Guard the delayed SIGKILL on the disarm marker: once the lanes finished and
|
||||
# the watchdog was disarmed, the consolidation gadfly pass runs next, and a
|
||||
# name-based KILL here must NOT catch it.
|
||||
[ -f "${WORKDIR}/.disarmed" ] || pkill -KILL -f '/usr/local/bin/gadfly' 2>/dev/null || true
|
||||
) &
|
||||
KILLER_PID=$!
|
||||
log "PR budget watchdog armed (${GADFLY_PR_BUDGET_SECS}s, pid ${KILLER_PID})"
|
||||
fi
|
||||
|
||||
log "providers: ${PROVIDERS:-none}"
|
||||
# Each provider lane runs in parallel; cap is enforced within each lane. Track
|
||||
# the lane PIDs so we wait ONLY for the review work — not the status board,
|
||||
@@ -308,6 +364,21 @@ for p in $PROVIDERS; do
|
||||
done
|
||||
|
gitea-actions
commented
🟠 Budget watchdog kill may not prevent delayed SIGKILL from catching consolidation pass error-handling · flagged by 1 model
🪰 Gadfly · advisory 🟠 **Budget watchdog kill may not prevent delayed SIGKILL from catching consolidation pass**
_error-handling · flagged by 1 model_
- **Budget watchdog race: `kill $KILLER_PID` may not prevent the delayed SIGKILL** (`entrypoint.sh:364`): The watchdog subshell runs `sleep $BUDGET; pkill -TERM ...; sleep 5; pkill -KILL ...`. If the budget expires and the watchdog enters its `sleep 5` between SIGTERM and SIGKILL, and the reviews then finish naturally during that 5-second window, the main script's `kill "$KILLER_PID"` sends SIGTERM to the subshell. However, the subshell is a bash process running `sleep 5` — SIGTERM will interrup…
<sub>🪰 Gadfly · advisory</sub>
|
||||
[ "${#LANE_PIDS[@]}" -gt 0 ] && wait "${LANE_PIDS[@]}"
|
||||
|
||||
# Reviews finished (or the watchdog killed them): disarm the watchdog so its
|
||||
# delayed SIGKILL can't catch the consolidation pass that runs next. Drop the
|
||||
# disarm marker FIRST so even a racing watchdog that already reached its KILL line
|
||||
# skips it (the kill below also tears the watchdog subshell down during its sleep).
|
||||
if [ -n "$KILLER_PID" ]; then
|
||||
: > "${WORKDIR}/.disarmed"
|
||||
kill "$KILLER_PID" 2>/dev/null || true
|
||||
fi
|
||||
|
||||
# If the backstop fired, note it on the consensus comment (per-model comments
|
||||
# were already posted during the run; a killed model surfaces as a failed lane).
|
||||
if [ -f "${WORKDIR}/.budget_killed" ]; then
|
||||
export GADFLY_NOTICE="${GADFLY_NOTICE:+${GADFLY_NOTICE} }⏱️ This review was stopped early by the per-PR time budget (GADFLY_PR_BUDGET_SECS); findings are partial."
|
||||
fi
|
||||
|
||||
# Reviews are done: signal the board to render the final state once and exit.
|
||||
if [ -n "$BOARD_PID" ]; then
|
||||
touch "${STATUS_DIR}/.done" 2>/dev/null || true
|
||||
@@ -331,7 +402,9 @@ if [ "$CONSOLIDATE" = "1" ]; then
|
||||
CONSENSUS="$(GADFLY_CONSOLIDATE_DIR="$FINDINGS_DIR" GADFLY_DIFF_FILE="$DIFF_FILE" \
|
||||
/usr/local/bin/gadfly 2>"${WORKDIR}/consolidate.err" || true)"
|
||||
if [ -n "$CONSENSUS" ]; then
|
||||
BODY="$(printf '%s\n\n<sub>Automated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.</sub>' "$CONSENSUS")"
|
||||
NOTICE_BLOCK=""
|
||||
[ -n "${GADFLY_NOTICE:-}" ] && NOTICE_BLOCK="> ${GADFLY_NOTICE}"$'\n\n'
|
||||
BODY="$(printf '%s%s\n\n<sub>Automated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.</sub>' "$NOTICE_BLOCK" "$CONSENSUS")"
|
||||
upsert_comment_body "<!-- gadfly-consensus -->" "$BODY"
|
||||
log "consensus comment posted"
|
||||
else
|
||||
|
||||
@@ -3,6 +3,7 @@ module gitea.stevedudenhoeffer.com/steve/gadfly
|
||||
go 1.26.2
|
||||
|
||||
require (
|
||||
gitea.stevedudenhoeffer.com/steve/executus v0.1.4
|
||||
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462
|
||||
gopkg.in/yaml.v3 v3.0.1
|
||||
)
|
||||
@@ -17,6 +18,7 @@ require (
|
||||
github.com/go-logr/stdr v1.2.2 // indirect
|
||||
github.com/google/go-cmp v0.7.0 // indirect
|
||||
github.com/google/s2a-go v0.1.9 // indirect
|
||||
github.com/google/uuid v1.6.0 // indirect
|
||||
github.com/googleapis/enterprise-certificate-proxy v0.3.17 // indirect
|
||||
github.com/googleapis/gax-go/v2 v2.22.0 // indirect
|
||||
github.com/gorilla/websocket v1.5.3 // indirect
|
||||
|
||||
@@ -4,6 +4,8 @@ cloud.google.com/go/auth v0.20.0 h1:kXTssoVb4azsVDoUiF8KvxAqrsQcQtB53DcSgta74CA=
|
||||
cloud.google.com/go/auth v0.20.0/go.mod h1:942/yi/itH1SsmpyrbnTMDgGfdy2BUqIKyd0cyYLc5Q=
|
||||
cloud.google.com/go/compute/metadata v0.9.0 h1:pDUj4QMoPejqq20dK0Pg2N4yG9zIkYGdBtwLoEkH9Zs=
|
||||
cloud.google.com/go/compute/metadata v0.9.0/go.mod h1:E0bWwX5wTnLPedCKqk3pJmVgCBSM6qQI1yTBdEb3C10=
|
||||
gitea.stevedudenhoeffer.com/steve/executus v0.1.4 h1:4F99uCV3OVaE9ITFp0FjPiYxLUQO+WpE+wU2HCnpXNM=
|
||||
gitea.stevedudenhoeffer.com/steve/executus v0.1.4/go.mod h1:WQP/lH+meU06OSNF0TQO/wQLcJCrMwpi0EMj5vSpVtk=
|
||||
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462 h1:1crjE1YkWHLZ91tUDOxN/Y5cuOnJ56e0U9UADoFfEPY=
|
||||
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462/go.mod h1:UZLveG17SmENt4sne2RSLIbioix30RZbRIQUzBAnOyY=
|
||||
github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
|
||||
|
||||
@@ -177,7 +177,7 @@ case "$PROVIDER" in
|
||||
GADFLY_SYSTEM_FILE="${SCRIPT_DIR}/system-prompt.txt" \
|
||||
GADFLY_TITLE="$TITLE" \
|
||||
GADFLY_BODY="$BODY" \
|
||||
GADFLY_MAX_DIFF_CHARS="$MAX_DIFF_CHARS" \
|
||||
GADFLY_MAX_DIFF_CHARS="${GADFLY_MAX_DIFF_CHARS:-$MAX_DIFF_CHARS}" \
|
||||
GADFLY_STATUS_FILE="${GADFLY_STATUS_FILE:-}" \
|
||||
GADFLY_FINDINGS_OUT="${GADFLY_FINDINGS_OUT:-}" \
|
||||
"$BIN" 2>"$ERR_FILE"
|
||||
@@ -230,8 +230,12 @@ fi
|
||||
if [ "$CONSOLIDATE" = "1" ]; then
|
||||
say "done in ${DUR} (consolidated; no per-model comment)"
|
||||
else
|
||||
COMMENT="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s\n\n<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in %s</sub>' \
|
||||
"$MARKER" "$MODEL" "$MODEL_PROVIDER" "$REVIEW" "$DUR")"
|
||||
# An optional one-line notice (e.g. entrypoint's huge-PR downshift advisory),
|
||||
# shown under the header so a shallower review is self-explaining.
|
||||
NOTICE_BLOCK=""
|
||||
[ -n "${GADFLY_NOTICE:-}" ] && NOTICE_BLOCK="> ${GADFLY_NOTICE}"$'\n\n'
|
||||
COMMENT="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s%s\n\n<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in %s</sub>' \
|
||||
"$MARKER" "$MODEL" "$MODEL_PROVIDER" "$NOTICE_BLOCK" "$REVIEW" "$DUR")"
|
||||
upsert_comment "$COMMENT"
|
||||
say "done in ${DUR}"
|
||||
fi
|
||||
|
||||
@@ -16,7 +16,9 @@ state. USE THEM to verify before you report. Do not review the diff in isolation
|
||||
- list_dir([path]) — list a directory.
|
||||
- grep(pattern[, path, max_results]) — RE2 regex search across the repo.
|
||||
- find_files(name[, max_results]) — locate a file by path substring.
|
||||
- get_diff() — the full unified diff (the task message may truncate it).
|
||||
- get_diff([path, start_line, limit]) — the unified diff as a paginated, numbered window;
|
||||
pass `path` to fetch just one changed file's hunks (do this on a big PR instead of pulling
|
||||
the whole diff at once).
|
||||
|
||||
Mandatory verification discipline — this is the whole point of giving you tools:
|
||||
- Before claiming a missing/duplicate import, an undefined symbol, a wrong signature,
|
||||
|
||||
⚪ file docstring claims 'degrades to today's behavior' but the old forced-finalization fallback was removed by this PR, so empty-output exhaustion no longer matches prior behavior
maintainability · flagged by 1 model
cmd/gadfly/executus.go:19-21— docstring claims "degrades to today's behavior," but the old forced-finalization fallback was removed by this PR. The file-level comment says "Everything degrades to today's behavior when unconfigured," yet the prior transcript-basedforceFinalAnswer/finalizeInstructionpath was deleted (confirmed in the diff and the renamedTestReviewExecutor_ExhaustionWithoutAnswerIsErrortest comment). On empty-output exhaustion the new code returns an error where…🪰 Gadfly · advisory