feat: re-platform agentic review onto executus + large-PR cost controls (#20)
Build & push image / build-and-push (push) Successful in 33s

Makes gadfly a consumer of executus (run.Executor compaction/bounding/budget/critic + fanout) and fixes the large-PR token burn in size-gated layers: paginated get_diff, downshift above GADFLY_HUGE_DIFF_BYTES, and a swarm-wide GADFLY_PR_BUDGET_SECS backstop. Small PRs untouched; advisory-only and the static binary preserved. Dogfood swarm reviewed it (6 models, 21 real findings graded + folded in).

Co-authored-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
Co-committed-by: Steve Dudenhoeffer <steve@stevedudenhoeffer.com>
This commit was merged in pull request #20.
This commit is contained in:
2026-06-30 15:41:03 +00:00
committed by steve
parent 5007597cf9
commit ac6ce06cdd
19 changed files with 1065 additions and 249 deletions
+29 -5
View File
@@ -22,15 +22,21 @@ verifies each one against the actual code, and posts its findings as a comment.
4. **Provider-agnostic.** Powered by [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo), 4. **Provider-agnostic.** Powered by [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo),
so it can target Ollama (local/cloud), OpenAI, Anthropic, Google, or any so it can target Ollama (local/cloud), OpenAI, Anthropic, Google, or any
OpenAI/Ollama-compatible endpoint. Don't re-hardcode a single provider. OpenAI/Ollama-compatible endpoint. Don't re-hardcode a single provider.
5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib + majordomo. Keep 5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib, majordomo, and
it that way — no heavyweight deps, no coupling to any one consumer repo (e.g. mort). [executus](https://gitea.stevedudenhoeffer.com/steve/executus) (whose *core*`run`/`compact`/
`model`/`fanout`/`tool` — is itself majordomo+stdlib only, so the binary stays static; do NOT
pull executus's `contrib/store` or any battery that drags in a DB driver). No heavyweight deps,
no coupling to any one consumer repo (e.g. mort). Gadfly is executus's canonical *light* consumer.
## Architecture ## Architecture
``` ```
cmd/gadfly/ the reviewer binary — pure producer of review markdown (stdout) cmd/gadfly/ the reviewer binary — pure producer of review markdown (stdout)
main.go orchestration: loop specialists, each a review pass + adversarial recheck main.go orchestration: fan specialists out (executus/fanout), each a review pass + recheck
engine.go reviewEngine abstraction: majordomo agent loop vs claude-code CLI shell-out engine.go reviewEngine abstraction: executus run.Executor (majordomo agent loop +
compaction/bounding/budget/critic) vs claude-code CLI shell-out
executus.go executus wiring: tool.Registry over the repo tools, the run.Executor build
(compact + model context-limit threshold + per-PR budget + wrap-up critic)
specialists.go specialist lenses: built-ins, default suite, env + .gadfly.yml resolution specialists.go specialist lenses: built-ins, default suite, env + .gadfly.yml resolution
auto.go dynamic `auto` selection: a selector model picks lenses per-diff (may invent) auto.go dynamic `auto` selection: a selector model picks lenses per-diff (may invent)
delegate.go worker-tier delegate_investigation tool (cheap sub-agent does legwork) delegate.go worker-tier delegate_investigation tool (cheap sub-agent does legwork)
@@ -69,7 +75,7 @@ verdict. Verdict is one of: `No material issues found` / `Minor issues` / `Block
## Build / test ## Build / test
```sh ```sh
go build ./cmd/gadfly # needs read access to the private majordomo module go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
go test ./... go test ./...
gofmt -l cmd/ # must be clean gofmt -l cmd/ # must be clean
docker build -t gadfly:dev --secret id=REGISTRY_USER,env=REGISTRY_USER --secret id=REGISTRY_PASSWORD,env=REGISTRY_PASSWORD . docker build -t gadfly:dev --secret id=REGISTRY_USER,env=REGISTRY_USER --secret id=REGISTRY_PASSWORD,env=REGISTRY_PASSWORD .
@@ -149,3 +155,21 @@ are actually exercised. OpenAI/Anthropic/Google come from majordomo's abstractio
parallel, `cap` (from `GADFLY_PROVIDER_CONCURRENCY` else `GADFLY_CONCURRENCY`, default 1) bounds parallel, `cap` (from `GADFLY_PROVIDER_CONCURRENCY` else `GADFLY_CONCURRENCY`, default 1) bounds
models-at-once within a lane. The review timeout (`GADFLY_TIMEOUT_SECS`) is **per-lens**, not models-at-once within a lane. The review timeout (`GADFLY_TIMEOUT_SECS`) is **per-lens**, not
shared across the suite — a slow model can't starve later lenses (the original timeout bug). shared across the suite — a slow model can't starve later lenses (the original timeout bug).
- **Large-PR token burn**: the agent loop re-sends the whole transcript every step, so a giant
diff (the old `get_diff` dumped it untruncated, and it was embedded in both the review and
recheck task) was re-transmitted ~steps × lenses × passes × models times — a ~250 K-token PR
could drain a metered usage block in minutes. Fixed in three size-gated layers (small PRs
untouched): paginated `get_diff` + `executus/compact` compaction in the binary; an
`entrypoint.sh` downshift above `GADFLY_HUGE_DIFF_BYTES` (one cheap model, fewer lenses/steps,
no recheck); and a swarm-wide `GADFLY_PR_BUDGET_SECS` wall-clock backstop. Compaction's threshold
is intentionally LOW (`GADFLY_COMPACT_RATIO` 0.45, not executus's 0.7) because the burning
transcript on the embedded path rarely reaches 0.7×context.
- **executus re-platform**: the in-process review path runs through `executus/run`'s `run.Executor`
(compaction, run-bounding, `Ports.Budget`, the wrap-up nudge as `Ports.Critic`), wiring it in
`cmd/gadfly/executus.go`. Gadfly KEEPS its own `model.go` resolution (so `GADFLY_ENDPOINT_<NAME>`
http aliases + the claude-code engine survive) and only hands `run.Executor` the already-resolved
model via a trivial resolver — do NOT route review-model resolution through
`model.ParseModelForContext` (it bypasses gadfly's endpoint aliases). `run.Result` exposes no
transcript, so the old transcript-based forced-finalization fallback is gone; the wrap-up critic
nudge is the remaining "always emit something" mechanism. The claude-code engine still shells out
and is unaffected.
+4 -1
View File
@@ -24,7 +24,10 @@ RUN --mount=type=cache,target=/go/pkg/mod \
go build -trimpath -ldflags="-s -w" -o /out/gadfly ./cmd/gadfly go build -trimpath -ldflags="-s -w" -o /out/gadfly ./cmd/gadfly
FROM alpine:3.20 FROM alpine:3.20
RUN apk add --no-cache bash git curl jq ca-certificates nodejs npm # procps provides pkill/pgrep, which entrypoint.sh's per-PR wall-clock backstop
# (GADFLY_PR_BUDGET_SECS) uses to stop the review subtrees — busybox's applets
# are not guaranteed to include them.
RUN apk add --no-cache bash git curl jq ca-certificates nodejs npm procps
# Bundle the Claude Code CLI so the `claude-code` review engine works out of the # Bundle the Claude Code CLI so the `claude-code` review engine works out of the
# box (GADFLY_MODELS=claude-code or claude-code/<model>). This adds Node + the # box (GADFLY_MODELS=claude-code or claude-code/<model>). This adds Node + the
# CLI to the image (notably larger); ollama-only users pay the size but nothing # CLI to the image (notably larger); ollama-only users pay the size but nothing
+42 -2
View File
@@ -396,7 +396,16 @@ The reviewer binary reads these (the stub/entrypoint set sane defaults):
| `GADFLY_TIMEOUT_SECS` | 300 | deadline **per specialist lens** (review+recheck) | | `GADFLY_TIMEOUT_SECS` | 300 | deadline **per specialist lens** (review+recheck) |
| `GADFLY_RECHECK` | on | set `0`/`false` to skip the recheck pass | | `GADFLY_RECHECK` | on | set `0`/`false` to skip the recheck pass |
| `GADFLY_RECHECK_MAX_STEPS` | 16 | recheck-pass step cap | | `GADFLY_RECHECK_MAX_STEPS` | 16 | recheck-pass step cap |
| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the prompt (full diff via `get_diff`) | | `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the **review** prompt (the full diff is reachable via the paginated `get_diff` tool, scoped per file with its `path` arg) |
| `GADFLY_RECHECK_DIFF_CHARS` | 20000 | diff chars embedded in the **recheck** prompt (smaller — the recheck pages `get_diff` for the hunks it verifies) |
| `GADFLY_COMPACT` | on | context compaction (via [executus](https://gitea.stevedudenhoeffer.com/steve/executus)): fold the transcript's runaway middle into a summary as it nears the model's context window, so a big diff + accumulating tool output can't balloon every step. `0` disables |
| `GADFLY_COMPACT_RATIO` | 0.45 | fraction of the model's context window at which compaction fires |
| `GADFLY_COMPACT_MODEL` | worker, else review model | cheap model the compactor uses to summarize the folded middle |
| `GADFLY_COMPACT_KEEP_RECENT` | 8 | most-recent messages kept verbatim during compaction |
| `GADFLY_COMPACT_SUMMARY_WORDS` | 200 | word cap on the compaction summary |
| `GADFLY_MODEL_CONTEXT_TOKENS` | *(auto)* | override the model's context-window size (tokens) for the compaction threshold; set it for self-hosted endpoints executus can't introspect (Ollama Cloud models resolve automatically) |
| `GADFLY_PR_TOKEN_BUDGET` | — | per-model token ceiling for this PR; once spent, remaining lenses/passes are skipped (advisory). 0 = off |
| `GADFLY_PR_TIME_BUDGET_SECS` | — | per-model wall-clock ceiling for this PR (advisory). 0 = off |
| `GADFLY_STATUS_BOARD` | on | set `0` to disable the live status-board comment | | `GADFLY_STATUS_BOARD` | on | set `0` to disable the live status-board comment |
| `GADFLY_STATUS_POLL_SECS` | 12 | how often the status board re-renders/upserts | | `GADFLY_STATUS_POLL_SECS` | 12 | how often the status board re-renders/upserts |
| `GADFLY_CONSOLIDATE` | `auto` | cross-model consensus comment: `auto` (on for ≥2 models), `1` (force on), `0` (off — one comment per model) | | `GADFLY_CONSOLIDATE` | `auto` | cross-model consensus comment: `auto` (on for ≥2 models), `1` (force on), `0` (off — one comment per model) |
@@ -408,6 +417,37 @@ The reviewer binary reads these (the stub/entrypoint set sane defaults):
| `GADFLY_REPO` | *(from `GITEA_API`)* | `owner/repo` slug stamped on emitted runs/findings (set by `entrypoint.sh`) | | `GADFLY_REPO` | *(from `GITEA_API`)* | `owner/repo` slug stamped on emitted runs/findings (set by `entrypoint.sh`) |
| `GADFLY_PR` | *(from event)* | PR number stamped on emitted runs/findings (set by `entrypoint.sh`) | | `GADFLY_PR` | *(from event)* | PR number stamped on emitted runs/findings (set by `entrypoint.sh`) |
### Large-PR cost controls
A very large diff is the one thing that can blow the budget: every review step
re-sends it, multiplied across models × lenses × passes × steps (a single
~250 K-token PR can otherwise burn a whole metered usage block). Gadfly handles
big PRs in three layers, all **size-gated so small PRs are untouched**:
1. **Paginated `get_diff` + compaction** (reviewer binary, on by default) —
`get_diff` returns a paginated, optionally per-file window instead of the whole
diff, and once a transcript nears the model's context window its middle is
folded into a summary (powered by [executus](https://gitea.stevedudenhoeffer.com/steve/executus)'s
`compact`). Tune with the `GADFLY_COMPACT_*` knobs above.
2. **Downshift** (`entrypoint.sh`) — above `GADFLY_HUGE_DIFF_BYTES` the whole fleet
collapses to a single cheap model + a focused lens subset, fewer steps, and no
recheck. A finished shallow review beats a budget-nuking one, and the posted
comment says so.
3. **Hard backstop** (`entrypoint.sh`) — `GADFLY_PR_BUDGET_SECS` is a wall-clock
ceiling across the *entire* fleet; on expiry the review is stopped and whatever
was found so far is posted. Like everything else, it never fails CI.
| Env | Default | Meaning |
|-----|---------|---------|
| `GADFLY_HUGE_DIFF_BYTES` | 600000 | downshift the fleet when the PR diff exceeds this many bytes (0 = never downshift) |
| `GADFLY_HUGE_DIFF_MODELS` | first model | model(s) to run on a downshifted huge PR |
| `GADFLY_HUGE_DIFF_SPECIALISTS` | `security,correctness,error-handling` | lenses on a downshifted huge PR |
| `GADFLY_HUGE_DIFF_MAX_STEPS` | 12 | review step cap on a huge PR |
| `GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS` | 8 | recheck step cap on a huge PR |
| `GADFLY_HUGE_DIFF_RECHECK` | 0 | run the recheck pass on a huge PR (off by default) |
| `GADFLY_HUGE_DIFF_MAX_DIFF_CHARS` | 20000 | embedded review-diff chars on a huge PR |
| `GADFLY_PR_BUDGET_SECS` | — | swarm-wide wall-clock backstop; stops the whole fleet when reached (0 = off) |
## Findings telemetry (optional) ## Findings telemetry (optional)
Gadfly can record what it found so model quality can be tracked over time. It is Gadfly can record what it found so model quality can be tracked over time. It is
@@ -431,7 +471,7 @@ code.
## Building locally ## Building locally
```sh ```sh
go build ./cmd/gadfly # needs read access to the private majordomo module go build ./cmd/gadfly # needs read access to the private majordomo + executus modules
go test ./... go test ./...
``` ```
+8 -4
View File
@@ -33,14 +33,18 @@ type reviewEngine interface {
runPass(ctx context.Context, system, task string, maxSteps int) (string, error) runPass(ctx context.Context, system, task string, maxSteps int) (string, error)
} }
// majordomoEngine drives the in-process majordomo agent over the repo sandbox. // majordomoEngine drives the in-process review path over the repo sandbox. It no
// longer calls majordomo's agent loop directly: each pass runs through an
// executus run.Executor (see executus.go), which adds context compaction, run
// bounding, the per-PR budget gate, and the wrap-up critic. mdl is retained only
// so auto-select can fall back to the review model as its selector.
type majordomoEngine struct { type majordomoEngine struct {
mdl llm.Model rex *reviewExecutor
fsTools *repoFS mdl llm.Model
} }
func (e *majordomoEngine) runPass(ctx context.Context, system, task string, maxSteps int) (string, error) { func (e *majordomoEngine) runPass(ctx context.Context, system, task string, maxSteps int) (string, error) {
return runAgent(ctx, e.mdl, e.fsTools, system, task, maxSteps) return e.rex.run(ctx, system, task, maxSteps)
} }
// claudeCodeEngine reviews by shelling out to the `claude` CLI (Claude Code) in // claudeCodeEngine reviews by shelling out to the `claude` CLI (Claude Code) in
+371
View File
@@ -0,0 +1,371 @@
package main
// executus.go wires gadfly's agentic review path onto the executus run kernel
// (gitea.stevedudenhoeffer.com/steve/executus), layered above majordomo. The
// majordomoEngine no longer drives majordomo's agent loop directly; it builds a
// run.Executor that gives gadfly, for free:
//
// - context compaction (executus/compact): once the transcript a step would
// SEND crosses a token threshold derived from the model's real context
// window, the runaway middle is folded into a one-paragraph summary by a
// cheap summarizer model — so a big diff + accumulating read_file/grep
// results can't balloon every re-sent step (the large-PR burn).
// - run bounding + a per-PR spend budget (executus/run Ports.Budget): a hard
// token/seconds ceiling so a pathological PR can't drain the usage block.
// - the wrap-up nudge, re-expressed as an executus Critic (Ports.Critic): the
// steer that tells a step-hungry model to stop investigating and write its
// answer is now the critic seam, not a bespoke RunOption.
//
// Everything degrades to today's behavior when unconfigured: nil summarizer or a
// 0 context window disables compaction; nil budget disables the ceiling; the
// claude-code engine shells out and is unaffected by any of this.
//
// gadfly keeps its own model.go resolution (so GADFLY_ENDPOINT_<NAME> http
// aliases, failover chains, and the claude-code engine all survive) — the
// run.Executor is handed gadfly's already-resolved model via a trivial resolver,
// not routed through executus's tier table.
import (
"context"
"errors"
"fmt"
"os"
"strconv"
"strings"
"sync"
"sync/atomic"
"time"
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
"gitea.stevedudenhoeffer.com/steve/executus/compact"
"gitea.stevedudenhoeffer.com/steve/executus/model"
exrun "gitea.stevedudenhoeffer.com/steve/executus/run"
exectool "gitea.stevedudenhoeffer.com/steve/executus/tool"
)
const (
// defaultCompactRatio is the fraction of the model's context window at which
// compaction fires. It is deliberately LOWER than executus's own 0.7 default:
// on the large-PR burn the per-step transcript is the embedded diff (~17K) plus
// accumulating read_file results, which rarely reaches 0.7×262K≈183K — so a
// 0.7 threshold never bites. ~0.45×262K≈118K folds the runaway middle while a
// transcript is still well under the cap. Override with GADFLY_COMPACT_RATIO.
defaultCompactRatio = 0.45
// defaultCompactKeepRecent / defaultCompactSummaryWords mirror executus's own
// compactor defaults; surfaced as gadfly env knobs for tuning.
defaultCompactKeepRecent = 8
defaultCompactSummaryWords = 200
// contextTokenLookupTimeout bounds the one-shot /api/show call that resolves a
// cloud model's context window at executor-build time. Kept short so a slow or
// unreachable endpoint adds at most this to startup before degrading to
// no-compaction (rather than the provider cache's default 15s).
contextTokenLookupTimeout = 5 * time.Second
)
// runSeq mints a unique-per-process RunID suffix for each executor run so audit
// and the run kernel can tell one pass from another within a binary process.
var runSeq atomic.Uint64
// wrappedTool adapts an already-built majordomo llm.Tool (gadfly's sandboxed
// read_file/grep/get_diff/… closures over the repoFS) to executus's tool.Tool
// interface so the run kernel can build a toolbox from them by name. gadfly's
// tools need no caller/channel identity, so BuildLLM ignores the Invocation and
// returns the pre-built tool; Permission is the zero value (private, ungated).
type wrappedTool struct{ t llm.Tool }
func (w wrappedTool) Name() string { return w.t.Name }
func (w wrappedTool) Description() string { return w.t.Description }
func (w wrappedTool) Permission() exectool.Permission { return exectool.Permission{} }
func (w wrappedTool) BuildLLM(_ exectool.Invocation) llm.Tool { return w.t }
// gadflyToolRegistry registers the repo's read-only tools (plus the optional
// delegate_investigation worker tool) in a fresh executus tool.Registry and
// returns it along with the tool names for RunnableAgent.LowLevelTools.
func gadflyToolRegistry(fs *repoFS) (exectool.Registry, []string, error) {
reg := exectool.NewRegistry()
tools := fs.allTools()
names := make([]string, 0, len(tools))
for _, t := range tools {
if err := reg.Register(wrappedTool{t: t}); err != nil {
return nil, nil, fmt.Errorf("register tool %q: %w", t.Name, err)
}
names = append(names, t.Name)
}
return reg, names, nil
}
// gadflyBudget is gadfly's per-PR spend ceiling, satisfying run.Ports.Budget.
// It gates a run BEFORE it makes any model call (Check) once the process has
// spent its token or wall-clock allowance on this PR. Tokens are fed in
// out-of-band via addUsage (the Budget interface's Commit only carries seconds);
// the engine calls addUsage after each pass with run.Result.Usage. A nil
// *gadflyBudget is never installed — caps of 0 mean "unlimited", so the port is
// only wired when at least one cap is set.
//
// The guard is PASS-granular: Check runs before each pass, so it stops the NEXT
// pass once the budget is spent but cannot abort a single runaway pass mid-flight.
// The swarm-wide GADFLY_PR_BUDGET_SECS wall-clock backstop (entrypoint.sh) is what
// bounds a mid-pass runaway.
type gadflyBudget struct {
mu sync.Mutex
maxTokens int64
maxSeconds float64
tokens int64
seconds float64
}
// newPRBudget builds the per-PR budget from env, or nil when neither cap is set
// (the default — the swarm-wide ceiling lives in entrypoint.sh; this is the
// per-process belt to its suspenders).
func newPRBudget() *gadflyBudget {
toks := envInt("GADFLY_PR_TOKEN_BUDGET", 0)
secs := envInt("GADFLY_PR_TIME_BUDGET_SECS", 0)
if toks <= 0 && secs <= 0 {
return nil
}
return &gadflyBudget{maxTokens: int64(toks), maxSeconds: float64(secs)}
}
func (b *gadflyBudget) Check(_ context.Context, _ string) error {
if b == nil {
return nil
}
b.mu.Lock()
defer b.mu.Unlock()
if b.maxTokens > 0 && b.tokens >= b.maxTokens {
return fmt.Errorf("gadfly: per-PR token budget exhausted (%d/%d)", b.tokens, b.maxTokens)
}
if b.maxSeconds > 0 && b.seconds >= b.maxSeconds {
return fmt.Errorf("gadfly: per-PR time budget exhausted (%.0f/%.0fs)", b.seconds, b.maxSeconds)
}
return nil
}
func (b *gadflyBudget) Commit(_ context.Context, _ string, runtimeSeconds float64) {
if b == nil {
return
}
b.mu.Lock()
defer b.mu.Unlock()
b.seconds += runtimeSeconds
}
// addUsage records a finished pass's token spend toward the budget. Safe on nil.
func (b *gadflyBudget) addUsage(u llm.Usage) {
if b == nil {
return
}
b.mu.Lock()
defer b.mu.Unlock()
b.tokens += int64(u.InputTokens) + int64(u.OutputTokens)
}
// wrapUpCritic re-expresses gadfly's wrap-up nudge as an executus run.Critic:
// once a run comes within wrapUpReserve steps of its cap, Steer() injects the
// "stop calling tools and write your final answer" message so a thorough model
// spends its last steps finalizing instead of hard-failing empty. It sets no
// hard deadline (Deadline()==zero) and never raises the step ceiling
// (MaxSteps()==0, defer to the run's MaxIterations) — it is purely the nudge.
type wrapUpCritic struct{ reserve int }
func (c *wrapUpCritic) Monitor(_ context.Context, info exrun.RunInfo, _ time.Duration) exrun.CriticHandle {
return &wrapUpHandle{maxSteps: info.MaxIterations, reserve: c.reserve}
}
type wrapUpHandle struct {
mu sync.Mutex
maxSteps int
reserve int
done int // steps completed so far
nudged bool
}
func (h *wrapUpHandle) RecordStep(iter int, _ *llm.Response) {
h.mu.Lock()
h.done = iter + 1
h.mu.Unlock()
}
func (h *wrapUpHandle) RecordToolStart(string, string) {}
func (h *wrapUpHandle) Steer() []llm.Message {
h.mu.Lock()
defer h.mu.Unlock()
at := h.maxSteps - h.reserve
if at < 1 {
at = 1
}
if !h.nudged && h.maxSteps > 0 && h.done >= at {
h.nudged = true
return []llm.Message{llm.UserText(wrapUpInstruction)}
}
return nil
}
func (h *wrapUpHandle) Deadline() time.Time { return time.Time{} }
func (h *wrapUpHandle) MaxSteps() int { return 0 }
func (h *wrapUpHandle) KillCause() error { return nil }
func (h *wrapUpHandle) Stop() {}
// reviewExecutor bundles a run.Executor with the per-run wiring the engine needs
// for each pass (the tool names to expose, the model spec to report as the tier,
// the per-PR caller id, and the budget to feed token usage into).
type reviewExecutor struct {
ex *exrun.Executor
toolNames []string
modelSpec string
callerID string
budget *gadflyBudget
}
// newReviewExecutor builds the run.Executor for the in-process majordomo review
// path. mdl is gadfly's already-resolved review model; summarizer is the cheap
// model the compactor uses (nil disables compaction). Compaction also needs the
// model's context window (resolved once here, not per pass); a 0 window likewise
// disables it. The budget (may be nil) becomes the run.Ports.Budget gate.
func newReviewExecutor(fs *repoFS, mdl, summarizer llm.Model, modelSpec string, budget *gadflyBudget) (*reviewExecutor, error) {
reg, names, err := gadflyToolRegistry(fs)
if err != nil {
return nil, err
}
// gadfly resolves exactly one model per process; the run kernel's resolver
// just hands that model back regardless of the tier string it is asked for.
modelsResolver := func(ctx context.Context, _ string) (context.Context, llm.Model, error) {
return ctx, mdl, nil
}
var compactor compact.CompactorFactory
var ctxTokens func(string) int
if summarizer != nil && compactionEnabled() {
if window := resolveContextTokens(modelSpec); window > 0 {
sumResolver := func(ctx context.Context, _ string) (context.Context, llm.Model, error) {
return ctx, summarizer, nil
}
compactor = compact.NewCompactor(compact.CompactorConfig{
Models: sumResolver,
KeepRecent: envInt("GADFLY_COMPACT_KEEP_RECENT", defaultCompactKeepRecent),
SummaryWordCap: envInt("GADFLY_COMPACT_SUMMARY_WORDS", defaultCompactSummaryWords),
})
ctxTokens = func(string) int { return window } // memoized: one window per process
}
}
var ports exrun.Ports
ports.Critic = &wrapUpCritic{reserve: wrapUpReserve()}
if budget != nil {
ports.Budget = budget
}
cfg := exrun.Config{
Registry: reg,
Models: modelsResolver,
Compactor: compactor,
ContextTokens: ctxTokens,
Defaults: exrun.Defaults{
// MaxIterations/MaxRuntime are intentionally omitted: every pass sets its
// own per-run cap on the RunnableAgent below (the review and recheck caps
// differ), so a Defaults value here would always be overridden — dead. This
// leaves only the cross-pass guards + the compaction ratio.
MaxConsecutiveToolErrors: 4,
MaxSameToolCallRepeats: 4,
CompactionThresholdRatio: compactRatio(),
FallbackTier: modelSpec,
},
Ports: ports,
}
return &reviewExecutor{
ex: exrun.New(cfg),
toolNames: names,
modelSpec: modelSpec,
callerID: prCallerID(),
budget: budget,
}, nil
}
// run executes one agent pass (review or recheck) through the run kernel and
// returns the model's final text. An empty answer with no error is reported as
// an error so the caller (reviewWithSpecialist) renders the advisory "reviewer
// failed to complete" notice rather than a blank section.
func (r *reviewExecutor) run(ctx context.Context, system, task string, maxSteps int) (string, error) {
res := r.ex.Run(ctx, exrun.RunnableAgent{
Name: "gadfly-review",
SystemPrompt: system,
ModelTier: r.modelSpec,
MaxIterations: maxSteps,
MaxRuntime: reviewTimeout(),
LowLevelTools: r.toolNames,
Critic: exrun.CriticConfig{Enabled: true},
}, exectool.Invocation{
RunID: fmt.Sprintf("gadfly-%d", runSeq.Add(1)),
CallerID: r.callerID,
}, task)
// Feed token spend toward the per-PR budget out-of-band (Commit carries only
// seconds; the executor already called it). Safe on a nil budget.
r.budget.addUsage(res.Usage)
if res.Err != nil {
return "", res.Err
}
if out := strings.TrimSpace(res.Output); out != "" {
return out, nil
}
return "", errors.New("agent produced no output")
}
// prCallerID is the budget/audit caller key: the repo + PR, so a budget keyed on
// it is naturally per-PR. Falls back to "local" for an out-of-CI run.
func prCallerID() string {
repo := strings.TrimSpace(os.Getenv("GADFLY_REPO"))
pr := strings.TrimSpace(os.Getenv("GADFLY_PR"))
if repo == "" && pr == "" {
return "local"
}
return repo + "#" + pr
}
// compactionEnabled reports whether context compaction should be wired. On
// unless GADFLY_COMPACT is explicitly falsey.
func compactionEnabled() bool { return envBool("GADFLY_COMPACT", true) }
// compactRatio is the compaction threshold as a fraction of the model context
// window (GADFLY_COMPACT_RATIO), clamped to (0,1]; default defaultCompactRatio.
func compactRatio() float64 {
v := strings.TrimSpace(os.Getenv("GADFLY_COMPACT_RATIO"))
if v == "" {
return defaultCompactRatio
}
f, err := strconv.ParseFloat(v, 64)
if err != nil || f <= 0 || f > 1 {
return defaultCompactRatio
}
return f
}
// resolveContextTokens returns the review model's context window in tokens, used
// to set the compaction threshold. GADFLY_MODEL_CONTEXT_TOKENS overrides it
// (needed for custom/self-hosted endpoints executus can't introspect); otherwise
// it asks executus/model, which knows the static catalog and can fetch an
// Ollama Cloud model's limit via /api/show (one call, at executor-build time).
// Returns 0 — disabling compaction — for an unknown model, mirroring executus's
// "unknown ⇒ don't budget" contract.
func resolveContextTokens(modelSpec string) int {
if v := envInt("GADFLY_MODEL_CONTEXT_TOKENS", 0); v > 0 {
return v
}
key := strings.TrimSpace(os.Getenv("GADFLY_API_KEY"))
if key == "" {
key = strings.TrimSpace(os.Getenv("OLLAMA_API_KEY"))
}
cache := model.NewCloudOllamaLimitCache("", key, nil)
ctx, cancel := context.WithTimeout(context.Background(), contextTokenLookupTimeout)
defer cancel()
if n, ok := model.MaxContextTokensResolving(ctx, modelSpec, cache); ok {
return n
}
// Unknown model or a failed lookup (e.g. no key / unreachable endpoint): don't
// guess — compaction is disabled. Log it so a misconfiguration is debuggable
// rather than silently dropping the protection. Set GADFLY_MODEL_CONTEXT_TOKENS
// to force a window for an endpoint executus can't introspect.
fmt.Fprintf(os.Stderr, "gadfly: no context window resolved for %q; compaction disabled (set GADFLY_MODEL_CONTEXT_TOKENS to enable it)\n", modelSpec)
return 0
}
+140
View File
@@ -0,0 +1,140 @@
package main
import (
"context"
"testing"
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
)
func TestGadflyBudget(t *testing.T) {
ctx := context.Background()
// A nil budget never blocks and never panics.
var nilB *gadflyBudget
if err := nilB.Check(ctx, "pr"); err != nil {
t.Errorf("nil budget Check should be nil, got %v", err)
}
nilB.Commit(ctx, "pr", 10)
nilB.addUsage(llm.Usage{InputTokens: 5})
// Token ceiling: passes until usage crosses it.
b := &gadflyBudget{maxTokens: 100}
if err := b.Check(ctx, "pr"); err != nil {
t.Fatalf("fresh budget should pass, got %v", err)
}
b.addUsage(llm.Usage{InputTokens: 60, OutputTokens: 50}) // 110 >= 100
if err := b.Check(ctx, "pr"); err == nil {
t.Error("budget over the token cap should reject the next run")
}
// Seconds ceiling, accumulated via Commit.
s := &gadflyBudget{maxSeconds: 30}
s.Commit(ctx, "pr", 31)
if err := s.Check(ctx, "pr"); err == nil {
t.Error("budget over the time cap should reject the next run")
}
}
func TestNewPRBudget(t *testing.T) {
t.Setenv("GADFLY_PR_TOKEN_BUDGET", "")
t.Setenv("GADFLY_PR_TIME_BUDGET_SECS", "")
if newPRBudget() != nil {
t.Error("no caps set should yield a nil (disabled) budget")
}
t.Setenv("GADFLY_PR_TOKEN_BUDGET", "1000")
if b := newPRBudget(); b == nil || b.maxTokens != 1000 {
t.Errorf("token cap should build a budget with maxTokens=1000, got %+v", b)
}
}
func TestCompactRatio(t *testing.T) {
t.Setenv("GADFLY_COMPACT_RATIO", "")
if got := compactRatio(); got != defaultCompactRatio {
t.Errorf("default ratio = %v, want %v", got, defaultCompactRatio)
}
t.Setenv("GADFLY_COMPACT_RATIO", "0.6")
if got := compactRatio(); got != 0.6 {
t.Errorf("ratio override = %v, want 0.6", got)
}
for _, bad := range []string{"0", "-1", "2", "nope"} {
t.Setenv("GADFLY_COMPACT_RATIO", bad)
if got := compactRatio(); got != defaultCompactRatio {
t.Errorf("invalid ratio %q should fall back to the default, got %v", bad, got)
}
}
}
func TestResolveContextTokensOverride(t *testing.T) {
// The explicit override short-circuits any model introspection (no network).
t.Setenv("GADFLY_MODEL_CONTEXT_TOKENS", "123456")
if got := resolveContextTokens("anything"); got != 123456 {
t.Errorf("explicit context-token override = %d, want 123456", got)
}
}
func TestPRCallerID(t *testing.T) {
t.Setenv("GADFLY_REPO", "")
t.Setenv("GADFLY_PR", "")
if got := prCallerID(); got != "local" {
t.Errorf("no repo/PR should be %q, got %q", "local", got)
}
t.Setenv("GADFLY_REPO", "steve/mort")
t.Setenv("GADFLY_PR", "1367")
if got := prCallerID(); got != "steve/mort#1367" {
t.Errorf("callerID = %q, want steve/mort#1367", got)
}
}
func TestCompactionEnabled(t *testing.T) {
for _, v := range []string{"", "1", "true", "yes"} {
t.Setenv("GADFLY_COMPACT", v)
if !compactionEnabled() {
t.Errorf("GADFLY_COMPACT=%q should be enabled", v)
}
}
for _, v := range []string{"0", "false", "no", "off"} {
t.Setenv("GADFLY_COMPACT", v)
if compactionEnabled() {
t.Errorf("GADFLY_COMPACT=%q should be disabled", v)
}
}
}
func TestGadflyToolRegistry(t *testing.T) {
fs, err := newRepoFS(t.TempDir(), "diff")
if err != nil {
t.Fatal(err)
}
_, names, err := gadflyToolRegistry(fs)
if err != nil {
t.Fatalf("gadflyToolRegistry: %v", err)
}
want := map[string]bool{"read_file": true, "list_dir": true, "grep": true, "find_files": true, "get_diff": true}
has := func(n string) bool {
for _, g := range names {
if g == n {
return true
}
}
return false
}
for n := range want {
if !has(n) {
t.Errorf("registry missing tool %q (got %v)", n, names)
}
}
if has("delegate_investigation") {
t.Error("delegate_investigation must be absent without a worker model")
}
// With a worker model the delegate tool is registered too.
fs.worker = fakeModel(t, "x")
_, names, err = gadflyToolRegistry(fs)
if err != nil {
t.Fatalf("gadflyToolRegistry with worker: %v", err)
}
if !has("delegate_investigation") {
t.Errorf("delegate_investigation should be registered with a worker model, got %v", names)
}
}
+3 -3
View File
@@ -104,7 +104,7 @@ func TestRunSpecialists_FansOut(t *testing.T) {
} }
specs := threeLenses() specs := threeLenses()
results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff") results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff")
if got := peak(); got != 3 { if got := peak(); got != 3 {
t.Errorf("peak concurrent lenses = %d, want 3", got) t.Errorf("peak concurrent lenses = %d, want 3", got)
@@ -124,7 +124,7 @@ func TestRunSpecialists_SequentialByDefault(t *testing.T) {
} }
specs := threeLenses() specs := threeLenses()
results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff") results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff")
if got := peak(); got != 1 { if got := peak(); got != 1 {
t.Errorf("peak concurrent lenses = %d, want 1 (sequential by default)", got) t.Errorf("peak concurrent lenses = %d, want 1 (sequential by default)", got)
@@ -146,7 +146,7 @@ func TestRunSpecialists_PerProviderFanOut(t *testing.T) {
} }
specs := threeLenses() specs := threeLenses()
results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff") results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff")
if got := peak(); got != 3 { if got := peak(); got != 3 {
t.Errorf("peak concurrent lenses = %d, want 3 (m1 per-provider override)", got) t.Errorf("peak concurrent lenses = %d, want 3 (m1 per-provider override)", got)
+76 -140
View File
@@ -39,10 +39,9 @@
// GADFLY_TITLE PR title (optional). // GADFLY_TITLE PR title (optional).
// GADFLY_BODY PR description (optional). // GADFLY_BODY PR description (optional).
// GADFLY_MAX_STEPS review-pass step cap (optional, default 24). // GADFLY_MAX_STEPS review-pass step cap (optional, default 24).
// GADFLY_WRAPUP_RESERVE steps before the cap at which the agent is told to // GADFLY_WRAPUP_RESERVE steps before the cap at which the wrap-up critic nudges
// stop investigating and write its answer (optional, // the agent to stop investigating and write its answer
// default 4). Plus a tool-free finalization fallback // (optional, default 4).
// guarantees a step-exhausted pass still emits output.
// GADFLY_RECHECK set to 0/false to skip the recheck pass (optional, default on). // GADFLY_RECHECK set to 0/false to skip the recheck pass (optional, default on).
// GADFLY_RECHECK_MAX_STEPS recheck-pass step cap (optional, default 16). // GADFLY_RECHECK_MAX_STEPS recheck-pass step cap (optional, default 16).
// GADFLY_TIMEOUT_SECS overall deadline in seconds, shared by both passes (optional, default 300). // GADFLY_TIMEOUT_SECS overall deadline in seconds, shared by both passes (optional, default 300).
@@ -55,8 +54,8 @@
// lanes as GADFLY_PROVIDER_CONCURRENCY (e.g. // lanes as GADFLY_PROVIDER_CONCURRENCY (e.g.
// "ollama-cloud=3,m1=1"). Wins over GADFLY_LENS_CONCURRENCY // "ollama-cloud=3,m1=1"). Wins over GADFLY_LENS_CONCURRENCY
// for the model's provider; falls back to it otherwise. // for the model's provider; falls back to it otherwise.
// GADFLY_MAX_DIFF_CHARS diff chars embedded in the prompt (optional, default 60000; // GADFLY_MAX_DIFF_CHARS diff chars embedded in the review prompt (optional, default 60000;
// the full diff is always available via the get_diff tool). // the full diff is reachable via the paginated get_diff tool).
// //
// On success it prints the review to stdout and exits 0. On a usage/config or // On success it prints the review to stdout and exits 0. On a usage/config or
// model error it prints a diagnostic to stderr and exits non-zero; run.sh then // model error it prints a diagnostic to stderr and exits non-zero; run.sh then
@@ -70,11 +69,9 @@ import (
"os" "os"
"strconv" "strconv"
"strings" "strings"
"sync"
"time" "time"
"gitea.stevedudenhoeffer.com/steve/majordomo/agent" "gitea.stevedudenhoeffer.com/steve/executus/fanout"
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
) )
const ( const (
@@ -107,13 +104,6 @@ const wrapUpInstruction = "⚠️ You are almost out of your investigation budge
"Do not begin any new investigation. If a finding could not be confirmed, drop it or mark it explicitly as unverified. " + "Do not begin any new investigation. If a finding could not be confirmed, drop it or mark it explicitly as unverified. " +
"Output the review in the required format right now." "Output the review in the required format right now."
// finalizeInstruction is the user message sent on the tool-free fallback pass
// when the agent exhausted its budget (or tripped a loop guard) without ever
// producing a final answer. It forces the model to synthesize whatever it has.
const finalizeInstruction = "You have run out of investigation steps. Do NOT call any tools. " +
"Based solely on what you have already gathered above, write your final answer now in the required format. " +
"If you could not confirm some findings, omit them or mark them as unverified, but produce the answer."
func main() { func main() {
if err := run(); err != nil { if err := run(); err != nil {
fmt.Fprintln(os.Stderr, "gadfly:", err) fmt.Fprintln(os.Stderr, "gadfly:", err)
@@ -177,7 +167,18 @@ func run() error {
} else if worker != nil { } else if worker != nil {
fsTools.worker = worker fsTools.worker = worker
} }
eng = &majordomoEngine{mdl: mdl, fsTools: fsTools} // The context compactor needs a cheap summarizer; reuse the worker model
// when present, else the review model. A bad explicit GADFLY_COMPACT_MODEL
// just disables compaction rather than sinking the review.
summarizer, serr := resolveSummarizerModel(mdl, fsTools.worker)
if serr != nil {
fmt.Fprintln(os.Stderr, "gadfly: compaction summarizer disabled:", serr)
}
rex, rerr := newReviewExecutor(fsTools, mdl, summarizer, os.Getenv("GADFLY_MODEL"), newPRBudget())
if rerr != nil {
return fmt.Errorf("build review executor: %w", rerr)
}
eng = &majordomoEngine{rex: rex, mdl: mdl}
} }
specialists, registry, auto, serrs := resolveSpecialists(repoDir) specialists, registry, auto, serrs := resolveSpecialists(repoDir)
@@ -231,55 +232,55 @@ func run() error {
} }
// runSpecialists reviews the diff through each lens and returns the results in // runSpecialists reviews the diff through each lens and returns the results in
// the SAME order as specialists, regardless of finish order. Up to // the SAME order as specialists, regardless of finish order. It uses executus's
// GADFLY_LENS_CONCURRENCY lenses run concurrently; the default of 1 keeps the // fanout primitive: up to GADFLY_LENS_CONCURRENCY lenses run concurrently (the
// suite sequential, exactly as before. Each lens already runs under its own // default of 1 keeps the suite sequential, exactly as before), and fanout.Run
// per-lens timeout (reviewWithSpecialist), so concurrency simply overlaps those // returns one result per lens in input order. Each lens already runs under its
// independent passes — and because reviewWithSpecialist builds a fresh toolbox // own per-lens timeout (reviewWithSpecialist) and the lenses only read the
// per pass and the lenses only read the immutable repoFS, they share no mutable // immutable repoFS, so concurrency simply overlaps independent passes.
// state. Results are stored by index so the consolidated comment keeps the
// configured lens order.
// //
// Caution: this fans out WITHIN one model. It multiplies with entrypoint.sh's // Caution: this fans out WITHIN one model. It multiplies with entrypoint.sh's
// per-provider model concurrency, so total concurrent backend requests ≈ // per-provider model concurrency, so total concurrent backend requests ≈
// (models at once) × (lenses at once). To fan lenses out without oversubscribing // (models at once) × (lenses at once). To fan lenses out without oversubscribing
// the backend, run models one at a time (provider lane cap 1) and raise this. // the backend, run models one at a time (provider lane cap 1) and raise this.
func runSpecialists(eng reviewEngine, base string, specialists []Specialist, task, diff string) []specialistResult { func runSpecialists(eng reviewEngine, base string, specialists []Specialist, task, diff string) []specialistResult {
results := make([]specialistResult, len(specialists))
// Optional live status board: publishes this model's per-lens progress to a // Optional live status board: publishes this model's per-lens progress to a
// file the entrypoint board renders. Inert (no-op) unless GADFLY_STATUS_FILE // file the entrypoint board renders. Inert (no-op) unless GADFLY_STATUS_FILE
// is set, so plain runs are unaffected. // is set, so plain runs are unaffected.
sw := newStatusWriter(os.Getenv("GADFLY_MODEL"), modelProvider(), specialists) sw := newStatusWriter(os.Getenv("GADFLY_MODEL"), modelProvider(), specialists)
conc := min(lensConcurrency(), len(specialists)) fanResults := fanout.Run(context.Background(), specialists, fanout.Options[Specialist]{
MaxConcurrent: lensConcurrency(),
}, func(_ context.Context, sp Specialist) (res specialistResult, _ error) {
// A panic in one lens must not crash the whole binary (which would kill
// every other lens's output) or leave this lens stuck at "running" on the
// status board. fanout does not recover fn panics, so we do it here:
// record the panic as an errored result and mark the lens finished.
defer func() {
if r := recover(); r != nil {
res = specialistResult{spec: sp, out: fmt.Sprintf("⚠️ This reviewer panicked: %v", r), verdict: verdictUnknown, errored: true}
sw.set(sp.Name, lensFinished, "", true)
}
}()
sw.set(sp.Name, lensRunning, "", false)
out, errored := reviewWithSpecialist(eng, base, sp, task, diff)
v := parseVerdict(out)
sw.set(sp.Name, lensFinished, v.label(), errored)
return specialistResult{spec: sp, out: out, verdict: v, errored: errored}, nil
})
sem := make(chan struct{}, conc) // fanout guarantees input order; its Result.Err is set only when the context
var wg sync.WaitGroup // is cancelled before a lens ran (reviewWithSpecialist embeds its own failures
for i, sp := range specialists { // in the result), so surface that as an errored lens rather than dropping it.
wg.Add(1) results := make([]specialistResult, len(specialists))
sem <- struct{}{} // blocks once `conc` lenses are already in flight for i, r := range fanResults {
go func(i int, sp Specialist) { if r.Err != nil {
defer wg.Done() results[i] = specialistResult{spec: specialists[i], out: fmt.Sprintf("⚠️ This reviewer did not run: %v", r.Err), verdict: verdictUnknown, errored: true}
defer func() { <-sem }() sw.set(specialists[i].Name, lensFinished, "", true)
// A panic in one lens must not crash the whole binary (which would continue
// kill every other lens's output) or leave this lens stuck at }
// "running" on the status board. Recover, record it as an errored results[i] = r.Value
// result, and mark the lens finished so the board can complete.
defer func() {
if r := recover(); r != nil {
results[i] = specialistResult{spec: sp, out: fmt.Sprintf("⚠️ This reviewer panicked: %v", r), verdict: verdictUnknown, errored: true}
sw.set(sp.Name, lensFinished, "", true)
}
}()
sw.set(sp.Name, lensRunning, "", false)
out, errored := reviewWithSpecialist(eng, base, sp, task, diff)
v := parseVerdict(out)
results[i] = specialistResult{spec: sp, out: out, verdict: v, errored: errored}
sw.set(sp.Name, lensFinished, v.label(), errored)
}(i, sp)
} }
wg.Wait()
return results return results
} }
@@ -320,8 +321,7 @@ func providerOverride(envName, provider string) (int, bool) {
// returned bool is true when the review pass failed (rendered as an inline // returned bool is true when the review pass failed (rendered as an inline
// notice — advisory; one lens failing never sinks the others or the job). // notice — advisory; one lens failing never sinks the others or the job).
func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, diff string) (string, bool) { func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, diff string) (string, bool) {
timeout := time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second ctx, cancel := context.WithTimeout(context.Background(), reviewTimeout())
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel() defer cancel()
draft, err := eng.runPass(ctx, composeSpecialistPrompt(base, sp), task, draft, err := eng.runPass(ctx, composeSpecialistPrompt(base, sp), task,
@@ -344,90 +344,6 @@ func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, di
return final, false return final, false
} }
// runAgent runs one agent pass (its own fresh toolbox over the sandbox) and
// returns the final answer. An empty answer is an error — the caller decides
// whether that is fatal (review pass) or recoverable (recheck pass). A
// non-empty answer that ended on a budget/guard error is still returned: the
// model wrote its output, then ran out of steps.
//
// Two mechanisms keep a step-hungry model from hard-failing with no output:
// 1. A wrap-up steer: once the run comes within wrapUpReserve steps of the
// cap, a forceful "stop calling tools, write your final answer" message is
// injected so the model spends its remaining steps finalizing.
// 2. A finalization fallback: if the loop still ends empty (the model ignored
// the nudge, or a loop guard tripped), one tool-free model call forces a
// final answer out of the transcript already gathered.
func runAgent(ctx context.Context, mdl llm.Model, fsTools *repoFS, system, task string, maxSteps int) (string, error) {
box, err := fsTools.toolbox()
if err != nil {
return "", err
}
loop := agent.New(mdl, system,
agent.WithToolbox(box),
agent.WithMaxSteps(maxSteps),
// Guard rails: stop the model from spinning on failing or identical
// tool calls instead of writing its answer.
agent.WithToolErrorLimits(4, 4),
)
wrapUpAt := maxSteps - wrapUpReserve()
if wrapUpAt < 1 {
wrapUpAt = 1
}
var completed int // steps finished so far (updated after each step)
nudged := false
res, runErr := loop.Run(ctx, task,
agent.OnStep(func(s agent.Step) { completed = s.Index + 1 }),
agent.WithSteer(func() []llm.Message {
if !nudged && completed >= wrapUpAt {
nudged = true
return []llm.Message{llm.UserText(wrapUpInstruction)}
}
return nil
}),
)
out := ""
if res != nil {
out = strings.TrimSpace(res.Output)
}
if out != "" {
return out, nil
}
// No final answer. If we still have budget on the clock and a transcript to
// work from, force a tool-free finalization rather than losing the pass.
if res != nil && len(res.Messages) > 0 && ctx.Err() == nil {
if forced := forceFinalAnswer(ctx, mdl, system, res.Messages); forced != "" {
return forced, nil
}
}
if runErr != nil {
return "", runErr
}
return "", errors.New("agent produced no output")
}
// forceFinalAnswer makes one tool-free model call to squeeze a final answer out
// of an agent that exhausted its step budget without producing one. Tools are
// forbidden (ToolChoice "none") so the model must synthesize from the transcript
// instead of investigating further. Best-effort: any error or empty reply
// returns "" and the caller falls back to its normal empty-output handling.
func forceFinalAnswer(ctx context.Context, mdl llm.Model, system string, transcript []llm.Message) string {
msgs := append(append([]llm.Message(nil), transcript...), llm.UserText(finalizeInstruction))
resp, err := mdl.Generate(ctx, llm.Request{
System: system,
Messages: msgs,
ToolChoice: "none",
})
if err != nil || resp == nil {
return ""
}
return strings.TrimSpace(resp.Text())
}
// wrapUpReserve is how many steps before the cap the wrap-up nudge fires, // wrapUpReserve is how many steps before the cap the wrap-up nudge fires,
// overridable via GADFLY_WRAPUP_RESERVE. // overridable via GADFLY_WRAPUP_RESERVE.
func wrapUpReserve() int { func wrapUpReserve() int {
@@ -444,7 +360,7 @@ func buildTask(diff string) string {
truncNote := "" truncNote := ""
if maxDiff > 0 && len(diff) > maxDiff { if maxDiff > 0 && len(diff) > maxDiff {
diff = diff[:maxDiff] diff = diff[:maxDiff]
truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars in this message; read the changed files (or call get_diff, if available) for the full text.]", maxDiff) truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars in this message; page the full diff with get_diff (paginated; pass a `path` to scope it to one file) or read the changed files.]", maxDiff)
} }
var b strings.Builder var b strings.Builder
@@ -471,3 +387,23 @@ func envInt(name string, def int) int {
} }
return n return n
} }
// envBool reads a boolean-ish env var: def when unset, false for an explicit
// falsey value (0/false/no/off), true otherwise. The shared spelling for
// gadfly's "on unless disabled" opt-out flags (GADFLY_RECHECK, GADFLY_COMPACT).
func envBool(name string, def bool) bool {
switch strings.ToLower(strings.TrimSpace(os.Getenv(name))) {
case "":
return def
case "0", "false", "no", "off":
return false
default:
return true
}
}
// reviewTimeout is the per-specialist-lens deadline (GADFLY_TIMEOUT_SECS), shared
// across a lens's review+recheck passes and applied as each pass's run cap.
func reviewTimeout() time.Duration {
return time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second
}
+21
View File
@@ -127,6 +127,27 @@ func resolveWorkerModel() (llm.Model, error) {
return majordomo.Parse(buildSpec(provider, spec)) return majordomo.Parse(buildSpec(provider, spec))
} }
// resolveSummarizerModel picks the model the context compactor uses to compress
// the runaway middle of a transcript. It should be CHEAP, since it fires once per
// compaction: GADFLY_COMPACT_MODEL if set (honoring GADFLY_PROVIDER for a bare
// id), else the delegate worker model when one is configured (already cheap by
// design), else the review model itself. Returns (nil, err) only on an explicit
// bad GADFLY_COMPACT_MODEL spec — the caller logs it and simply runs without
// compaction rather than failing the review.
func resolveSummarizerModel(review, worker llm.Model) (llm.Model, error) {
if spec := strings.TrimSpace(os.Getenv("GADFLY_COMPACT_MODEL")); spec != "" {
provider := strings.TrimSpace(os.Getenv("GADFLY_PROVIDER"))
if provider == "" {
provider = defaultProvider
}
return majordomo.Parse(buildSpec(provider, spec))
}
if worker != nil {
return worker, nil
}
return review, nil
}
// buildSpec turns (provider, model) into a majordomo spec. A model id that // buildSpec turns (provider, model) into a majordomo spec. A model id that
// already carries a "provider/" prefix (or is a multi-element failover chain) // already carries a "provider/" prefix (or is a multi-element failover chain)
// is passed through verbatim; a bare id is prefixed with the provider. // is passed through verbatim; a bare id is prefixed with the provider.
+10 -11
View File
@@ -2,7 +2,6 @@ package main
import ( import (
"fmt" "fmt"
"os"
"strings" "strings"
) )
@@ -11,6 +10,13 @@ import (
// than discovering them. // than discovering them.
const defaultRecheckMaxSteps = 16 const defaultRecheckMaxSteps = 16
// defaultRecheckDiffChars caps the diff embedded in the recheck task. It is much
// smaller than the review task's GADFLY_MAX_DIFF_CHARS: the recheck already has
// the draft findings to verify and can pull the exact hunks it needs via the
// paginated get_diff tool (optionally scoped to a path), so re-embedding the
// whole diff on every recheck step is pure burn. Override: GADFLY_RECHECK_DIFF_CHARS.
const defaultRecheckDiffChars = 20000
// recheckSystemPrompt drives the second, adversarial verification pass. The // recheckSystemPrompt drives the second, adversarial verification pass. The
// model is given a DRAFT review and must independently confirm each finding // model is given a DRAFT review and must independently confirm each finding
// against the real code before letting it survive — the antidote to a // against the real code before letting it survive — the antidote to a
@@ -58,14 +64,7 @@ Output rules:
// recheckEnabled reports whether the verification pass should run. On unless // recheckEnabled reports whether the verification pass should run. On unless
// GADFLY_RECHECK is explicitly a falsey value. // GADFLY_RECHECK is explicitly a falsey value.
func recheckEnabled() bool { func recheckEnabled() bool { return envBool("GADFLY_RECHECK", true) }
switch strings.ToLower(strings.TrimSpace(os.Getenv("GADFLY_RECHECK"))) {
case "0", "false", "no", "off":
return false
default:
return true
}
}
// shouldRecheck decides whether to run the verification pass for a given draft. // shouldRecheck decides whether to run the verification pass for a given draft.
// A clean "no material issues" draft has nothing to verify, so it is skipped // A clean "no material issues" draft has nothing to verify, so it is skipped
@@ -84,11 +83,11 @@ func shouldRecheck(draft string) bool {
// scrutinize, with the full diff available via get_diff (and embedded here, // scrutinize, with the full diff available via get_diff (and embedded here,
// truncated, to save a tool call). // truncated, to save a tool call).
func buildRecheckTask(draft, diff string) string { func buildRecheckTask(draft, diff string) string {
maxDiff := envInt("GADFLY_MAX_DIFF_CHARS", defaultMaxDiffChars) maxDiff := envInt("GADFLY_RECHECK_DIFF_CHARS", defaultRecheckDiffChars)
truncNote := "" truncNote := ""
if maxDiff > 0 && len(diff) > maxDiff { if maxDiff > 0 && len(diff) > maxDiff {
diff = diff[:maxDiff] diff = diff[:maxDiff]
truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars here; read the changed files (or call get_diff, if available) for the full text.]", maxDiff) truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars here; call get_diff (paginated; pass a `path` to scope it to one file) or read the changed files for the rest.]", maxDiff)
} }
var b strings.Builder var b strings.Builder
+29 -8
View File
@@ -49,7 +49,7 @@ func TestRecheckEnabled(t *testing.T) {
} }
func TestBuildRecheckTask(t *testing.T) { func TestBuildRecheckTask(t *testing.T) {
t.Setenv("GADFLY_MAX_DIFF_CHARS", "") t.Setenv("GADFLY_RECHECK_DIFF_CHARS", "")
draft := "VERDICT: Blocking issues found\n- foo.go:1 broken" draft := "VERDICT: Blocking issues found\n- foo.go:1 broken"
out := buildRecheckTask(draft, "diff --git a/x b/x\n+y\n") out := buildRecheckTask(draft, "diff --git a/x b/x\n+y\n")
if !strings.Contains(out, draft) { if !strings.Contains(out, draft) {
@@ -77,25 +77,46 @@ func fakeModel(t *testing.T, reply string) llm.Model {
return m return m
} }
func TestRunAgent_ReturnsOutput(t *testing.T) { // newTestReviewExecutor builds a reviewExecutor over a fake model + repo for unit
// tests: no compaction summarizer and no budget, so it exercises the bare agent
// loop through the executus run kernel without any network call.
func newTestReviewExecutor(t *testing.T, mdl llm.Model, fs *repoFS) *reviewExecutor {
t.Helper()
rex, err := newReviewExecutor(fs, mdl, nil, "mock", nil)
if err != nil {
t.Fatal(err)
}
return rex
}
// testEngine wraps newTestReviewExecutor in a majordomoEngine for the
// runSpecialists-level tests.
func testEngine(t *testing.T, mdl llm.Model, fs *repoFS) *majordomoEngine {
t.Helper()
return &majordomoEngine{rex: newTestReviewExecutor(t, mdl, fs), mdl: mdl}
}
func TestReviewExecutor_ReturnsOutput(t *testing.T) {
fs, err := newRepoFS(t.TempDir(), "diff") fs, err := newRepoFS(t.TempDir(), "diff")
if err != nil { if err != nil {
t.Fatal(err) t.Fatal(err)
} }
mdl := fakeModel(t, " corrected review: No material issues found. ") mdl := fakeModel(t, " corrected review: No material issues found. ")
out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4) rex := newTestReviewExecutor(t, mdl, fs)
out, err := rex.run(context.Background(), "sys", "task", 4)
if err != nil { if err != nil {
t.Fatalf("runAgent: %v", err) t.Fatalf("run: %v", err)
} }
if out != "corrected review: No material issues found." { if out != "corrected review: No material issues found." {
t.Errorf("runAgent should return trimmed model output, got %q", out) t.Errorf("run should return trimmed model output, got %q", out)
} }
} }
func TestRunAgent_EmptyIsError(t *testing.T) { func TestReviewExecutor_EmptyIsError(t *testing.T) {
fs, _ := newRepoFS(t.TempDir(), "diff") fs, _ := newRepoFS(t.TempDir(), "diff")
mdl := fakeModel(t, " ") mdl := fakeModel(t, " ")
if _, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4); err == nil { rex := newTestReviewExecutor(t, mdl, fs)
t.Error("runAgent should error on empty model output") if _, err := rex.run(context.Background(), "sys", "task", 4); err == nil {
t.Error("run should error on empty model output")
} }
} }
+143 -16
View File
@@ -9,6 +9,7 @@ import (
"regexp" "regexp"
"sort" "sort"
"strings" "strings"
"sync"
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm" llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
) )
@@ -17,11 +18,13 @@ import (
// every tool caps how much it can pull in one call — a runaway read_file or // every tool caps how much it can pull in one call — a runaway read_file or
// grep would blow the window and stall the loop. // grep would blow the window and stall the loop.
const ( const (
maxFileBytes = 64 * 1024 // per read_file call maxFileBytes = 64 * 1024 // per read_file call
maxReadLines = 800 // per read_file call maxReadLines = 800 // per read_file call
maxGrepResults = 200 // per grep call maxGrepResults = 200 // per grep call
maxFindResults = 200 // per find_files call maxFindResults = 200 // per find_files call
maxLineLen = 400 // truncate any single returned line to this maxLineLen = 400 // truncate any single returned line to this
maxGetDiffLines = 800 // per get_diff call (paginated window)
maxGetDiffBytes = 64 * 1024 // per get_diff call
) )
// skipDirs are never descended into by grep / find_files — noise and bulk that // skipDirs are never descended into by grep / find_files — noise and bulk that
@@ -40,6 +43,12 @@ type repoFS struct {
root string // absolute, symlink-resolved repo root root string // absolute, symlink-resolved repo root
diff string // the full PR unified diff (served by get_diff) diff string // the full PR unified diff (served by get_diff)
worker llm.Model // optional cheap model for delegate_investigation; nil = no delegation worker llm.Model // optional cheap model for delegate_investigation; nil = no delegation
// diffLines caches the split diff so paging through get_diff doesn't re-split
// the whole (possibly large) diff on every call. The diff is immutable, so the
// cache is computed once and safe to share across concurrent lenses.
diffOnce sync.Once
diffLines []string
} }
// newRepoFS resolves root to an absolute, symlink-free path. // newRepoFS resolves root to an absolute, symlink-free path.
@@ -108,15 +117,24 @@ func (r *repoFS) fsTools() []llm.Tool {
} }
} }
// toolbox builds the reviewer's toolbox: the read-only repo tools, plus the // allTools is the single source of truth for the reviewer's tool set: the
// delegate_investigation tool when a worker model is configured. // read-only repo tools, plus delegate_investigation when a worker model is
func (r *repoFS) toolbox() (*llm.Toolbox, error) { // configured. Both the executus registry (gadflyToolRegistry, the production
box := llm.NewToolbox("gadfly") // path) and toolbox() build from this list.
func (r *repoFS) allTools() []llm.Tool {
tools := r.fsTools() tools := r.fsTools()
if r.worker != nil { if r.worker != nil {
tools = append(tools, r.delegateTool()) tools = append(tools, r.delegateTool())
} }
for _, t := range tools { return tools
}
// toolbox builds a majordomo toolbox from allTools(). The production review path
// now goes through executus's tool.Registry (see executus.go); this remains for
// the toolbox-level tests (the `call` helper).
func (r *repoFS) toolbox() (*llm.Toolbox, error) {
box := llm.NewToolbox("gadfly")
for _, t := range r.allTools() {
if err := box.Add(t); err != nil { if err := box.Add(t); err != nil {
return nil, fmt.Errorf("add tool %q: %w", t.Name, err) return nil, fmt.Errorf("add tool %q: %w", t.Name, err)
} }
@@ -396,15 +414,124 @@ func (r *repoFS) findFilesTool() llm.Tool {
) )
} }
type getDiffArgs struct {
Path string `json:"path,omitempty" description:"Optional changed-file path (e.g. pkg/foo/bar.go); returns ONLY that file's diff hunks. Omit for the whole diff. Use this on a large PR to pull just the file a finding is about."`
StartLine int `json:"start_line,omitempty" description:"Optional 1-based line to start from within the (whole or path-scoped) diff (default 1)."`
Limit int `json:"limit,omitempty" description:"Optional max number of diff lines to return (default/maximum 800)."`
}
func (r *repoFS) getDiffTool() llm.Tool { func (r *repoFS) getDiffTool() llm.Tool {
return llm.DefineTool[struct{}]( return llm.DefineTool[getDiffArgs](
"get_diff", "get_diff",
"Return the complete unified diff under review. The diff is also included (possibly truncated) in the task message; call this to get the full, untruncated text.", "Return the unified diff under review as a numbered, PAGINATED window (like read_file) — not the whole diff at once, so a huge PR can't blow the context window. Pass `path` to fetch just one changed file's hunks, or `start_line`/`limit` to page through. A truncated copy of the diff is also embedded in the task message.",
func(_ context.Context, _ struct{}) (any, error) { func(_ context.Context, args getDiffArgs) (any, error) {
if strings.TrimSpace(r.diff) == "" { scope := "the diff"
return "(empty diff)", nil var lines []string
if p := strings.TrimSpace(args.Path); p != "" {
lines = diffLinesForPath(r.diff, p)
if len(lines) == 0 {
return fmt.Sprintf("(no diff hunks for path %q; check it against the changed-files list — the path must match a file the diff touches)", p), nil
}
scope = "the diff for " + p
} else {
lines = r.diffAllLines()
if len(lines) == 0 {
return "(empty diff)", nil
}
} }
return r.diff, nil return windowDiff(lines, scope, args.StartLine, args.Limit), nil
}, },
) )
} }
// diffAllLines returns the whole diff split into lines, cached so paging never
// re-splits the (possibly large) diff.
func (r *repoFS) diffAllLines() []string {
r.diffOnce.Do(func() { r.diffLines = splitDiffLines(r.diff) })
return r.diffLines
}
// splitDiffLines splits a unified diff into lines, dropping the single trailing
// empty element a trailing newline produces — otherwise windowDiff would emit a
// blank final line and over-count the total by one.
func splitDiffLines(diff string) []string {
lines := strings.Split(diff, "\n")
if n := len(lines); n > 0 && lines[n-1] == "" {
lines = lines[:n-1]
}
return lines
}
// windowDiff returns a numbered, paginated slice of pre-split diff lines,
// mirroring read_file's caps (maxGetDiffLines / maxGetDiffBytes / maxLineLen) so
// a single get_diff call can never dump a multi-hundred-KB diff into the
// transcript — the amplifier behind the large-PR token burn. The full diff stays
// reachable by paging with start_line, or scoped per file via the path arg.
func windowDiff(lines []string, scope string, start, limit int) string {
total := len(lines)
if start < 1 {
start = 1
}
if limit <= 0 || limit > maxGetDiffLines {
limit = maxGetDiffLines
}
if start > total {
return fmt.Sprintf("(%s has %d lines; nothing at/after line %d)", scope, total, start)
}
var b strings.Builder
emitted := 0
i := start - 1
for ; i < total; i++ {
if emitted >= limit || b.Len() >= maxGetDiffBytes {
break
}
line := lines[i]
if len(line) > maxLineLen {
line = line[:maxLineLen] + "…"
}
fmt.Fprintf(&b, "%d\t%s\n", i+1, line)
emitted++
}
if i < total {
// i (0-based) is the first line NOT emitted; line i was the last shown.
fmt.Fprintf(&b, "... (%s truncated after line %d of %d; call get_diff again with start_line=%d for the rest, or pass a `path` to scope to one file)\n", scope, i, total, i+1)
}
return b.String()
}
// diffLinesForPath returns the unified-diff lines for one changed file: from the
// `diff --git` header that names path through to the next header (or end). The
// header names two path tokens (a/<old> b/<new>); a match is on a WHOLE token,
// so path "foo.go" does not pull in "barfoo.go" — and a trailing "/" scopes to a
// directory (e.g. "pkg/foo/" matches pkg/foo/bar.go).
func diffLinesForPath(diff, path string) []string {
want := strings.TrimPrefix(strings.TrimPrefix(strings.TrimSpace(path), "a/"), "b/")
var out []string
inSection := false
for _, ln := range splitDiffLines(diff) {
if strings.HasPrefix(ln, "diff --git ") {
inSection = diffHeaderNames(ln, want)
}
if inSection {
out = append(out, ln)
}
}
return out
}
// diffHeaderNames reports whether a `diff --git a/X b/Y` header names want as one
// of its (a/b-stripped) path tokens — exact whole-token match, or a directory
// prefix when want ends with "/".
func diffHeaderNames(header, want string) bool {
fields := strings.Fields(header)
if len(fields) < 3 {
return false
}
for _, f := range fields[2:] {
p := strings.TrimPrefix(strings.TrimPrefix(f, "a/"), "b/")
if p == want || (strings.HasSuffix(want, "/") && strings.HasPrefix(p, want)) {
return true
}
}
return false
}
+81 -3
View File
@@ -197,15 +197,93 @@ func TestFindFilesTool(t *testing.T) {
func TestGetDiffTool(t *testing.T) { func TestGetDiffTool(t *testing.T) {
root := buildFixtureRepo(t) root := buildFixtureRepo(t)
const diff = "diff --git a/x b/x\n+added line\n" const diff = "diff --git a/x b/x\n--- a/x\n+++ b/x\n+added line\n" +
"diff --git a/y.go b/y.go\n--- a/y.go\n+++ b/y.go\n+y change\n"
fs, _ := newRepoFS(root, diff) fs, _ := newRepoFS(root, diff)
// Default: the whole diff as a NUMBERED window (paginated), not a raw dump —
// so a huge PR can't be poured into the transcript in one call.
out, err := call(t, fs, "get_diff", map[string]any{})
if err != nil {
t.Fatalf("get_diff: %v", err)
}
if !strings.Contains(out, "1\tdiff --git a/x b/x") {
t.Errorf("get_diff should return a numbered window, got:\n%s", out)
}
if !strings.Contains(out, "+added line") || !strings.Contains(out, "+y change") {
t.Errorf("the full (short) diff window should include every hunk, got:\n%s", out)
}
// path filter: only the named file's hunks come back.
out, err = call(t, fs, "get_diff", map[string]any{"path": "y.go"})
if err != nil {
t.Fatalf("get_diff path: %v", err)
}
if !strings.Contains(out, "y change") {
t.Errorf("get_diff path=y.go should include y's hunk, got:\n%s", out)
}
if strings.Contains(out, "added line") {
t.Errorf("get_diff path=y.go must NOT include x's hunk, got:\n%s", out)
}
// unknown path: a clear note, never an error.
out, err = call(t, fs, "get_diff", map[string]any{"path": "nope.txt"})
if err != nil {
t.Fatalf("get_diff unknown path: %v", err)
}
if !strings.Contains(out, "no diff hunks") {
t.Errorf("get_diff for an unknown path should note no hunks, got:\n%s", out)
}
}
// TestGetDiffTool_Paginates: a diff longer than the per-call line cap is returned
// as a truncated window with a paging hint, and start_line pages past it — the
// mechanism that stops get_diff from dumping a multi-hundred-KB diff at once.
func TestGetDiffTool_Paginates(t *testing.T) {
diff := "diff --git a/big b/big\n" + strings.Repeat("+line\n", maxGetDiffLines+50)
fs, _ := newRepoFS(t.TempDir(), diff)
out, err := call(t, fs, "get_diff", map[string]any{}) out, err := call(t, fs, "get_diff", map[string]any{})
if err != nil { if err != nil {
t.Fatalf("get_diff: %v", err) t.Fatalf("get_diff: %v", err)
} }
if out != diff { if !strings.Contains(out, "truncated after line") {
t.Errorf("get_diff returned %q, want %q", out, diff) t.Error("a diff longer than the per-call cap should be truncated with a paging hint")
}
if strings.Contains(out, "801\t") {
t.Error("the first window must stop at the line cap, not reach line 801")
}
out, err = call(t, fs, "get_diff", map[string]any{"start_line": 805})
if err != nil {
t.Fatalf("get_diff page: %v", err)
}
if !strings.Contains(out, "805\t") {
t.Error("paging with start_line=805 should include line 805")
}
}
// TestDiffLinesForPath_Anchored: get_diff path= matches on a WHOLE path token,
// so "foo.go" never pulls in "barfoo.go" (the unanchored-substring weakness the
// swarm flagged), while a trailing "/" still scopes to a directory.
func TestDiffLinesForPath_Anchored(t *testing.T) {
diff := "diff --git a/foo.go b/foo.go\n+foo change\n" +
"diff --git a/barfoo.go b/barfoo.go\n+barfoo change\n" +
"diff --git a/pkg/x.go b/pkg/x.go\n+x change\n"
joined := strings.Join(diffLinesForPath(diff, "foo.go"), "\n")
if !strings.Contains(joined, "foo change") {
t.Errorf("foo.go should match its own hunk:\n%s", joined)
}
if strings.Contains(joined, "barfoo change") {
t.Errorf("foo.go must NOT match barfoo.go (unanchored substring regression):\n%s", joined)
}
if !strings.Contains(strings.Join(diffLinesForPath(diff, "pkg/"), "\n"), "x change") {
t.Error("a trailing-slash path should scope to the directory (pkg/ -> pkg/x.go)")
}
if len(diffLinesForPath(diff, "nope.go")) != 0 {
t.Error("an unknown path should yield no lines")
} }
} }
+20 -51
View File
@@ -37,10 +37,11 @@ func lastUserText(req llm.Request) string {
return req.Messages[len(req.Messages)-1].Text() return req.Messages[len(req.Messages)-1].Text()
} }
// TestRunAgent_WrapUpNudgeProducesAnswer: a model that keeps calling tools until // TestReviewExecutor_WrapUpNudgeProducesAnswer: a model that keeps calling tools
// it is nudged to wrap up should still finish inside its budget — the steer // until it is nudged to wrap up should still finish inside its budget — the steer
// message arrives a few steps before the cap and the model writes its answer. // message (delivered by the executus wrap-up critic a few steps before the cap)
func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) { // arrives and the model writes its answer.
func TestReviewExecutor_WrapUpNudgeProducesAnswer(t *testing.T) {
t.Setenv("GADFLY_WRAPUP_RESERVE", "4") t.Setenv("GADFLY_WRAPUP_RESERVE", "4")
final := "VERDICT: No material issues found." final := "VERDICT: No material issues found."
@@ -60,9 +61,10 @@ func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) {
} }
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n") fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 12) rex := newTestReviewExecutor(t, mdl, fs)
out, err := rex.run(context.Background(), "sys", "task", 12)
if err != nil { if err != nil {
t.Fatalf("runAgent should succeed via wrap-up nudge, got error: %v", err) t.Fatalf("run should succeed via wrap-up nudge, got error: %v", err)
} }
if out != final { if out != final {
t.Errorf("expected final review %q, got %q", final, out) t.Errorf("expected final review %q, got %q", final, out)
@@ -72,24 +74,19 @@ func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) {
} }
} }
// TestRunAgent_FinalizationFallback: a model that ignores the wrap-up nudge and // TestReviewExecutor_ExhaustionWithoutAnswerIsError: a model that ignores the
// spins on tools until the cap should NOT hard-fail — the tool-free finalization // wrap-up nudge and spins on tools until the step cap produces no final answer.
// pass forces a final answer out of the transcript. // The transcript-based forced-finalization fallback was removed in the executus
func TestRunAgent_FinalizationFallback(t *testing.T) { // re-platform (run.Result does not expose the loop transcript), so the pass now
// surfaces an error — which reviewWithSpecialist renders as an advisory "reviewer
// failed to complete" notice rather than a phantom success.
func TestReviewExecutor_ExhaustionWithoutAnswerIsError(t *testing.T) {
t.Setenv("GADFLY_WRAPUP_RESERVE", "2") t.Setenv("GADFLY_WRAPUP_RESERVE", "2")
final := "VERDICT: Minor issues\n- something"
forcedCalled := false
n := 0 n := 0
p := fake.New("fake", fake.WithDefault(func(_ string, req llm.Request) fake.Step { p := fake.New("fake", fake.WithDefault(func(_ string, _ llm.Request) fake.Step {
// Only the tool-free finalization pass forbids tools — reply there.
if req.ToolChoice == "none" {
forcedCalled = true
return fake.Reply(final)
}
// Otherwise keep spinning, ignoring the wrap-up nudge entirely.
n++ n++
return spinToolCall(n) return spinToolCall(n) // spin forever, ignoring the wrap-up nudge
})) }))
mdl, err := p.Model("mock") mdl, err := p.Model("mock")
if err != nil { if err != nil {
@@ -97,37 +94,9 @@ func TestRunAgent_FinalizationFallback(t *testing.T) {
} }
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n") fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 6) rex := newTestReviewExecutor(t, mdl, fs)
if err != nil { if _, err := rex.run(context.Background(), "sys", "task", 6); err == nil {
t.Fatalf("runAgent should recover via finalization fallback, got error: %v", err) t.Error("run should error when the model exhausts its steps without an answer")
}
if !forcedCalled {
t.Error("finalization fallback was never invoked")
}
if out != final {
t.Errorf("expected forced final answer %q, got %q", final, out)
}
}
// TestRunAgent_FallbackStillEmptyIsError: if even the tool-free finalization
// yields nothing, runAgent surfaces an error rather than a phantom success.
func TestRunAgent_FallbackStillEmptyIsError(t *testing.T) {
n := 0
p := fake.New("fake", fake.WithDefault(func(_ string, req llm.Request) fake.Step {
if req.ToolChoice == "none" {
return fake.Reply(" ") // finalization produces only whitespace
}
n++
return spinToolCall(n)
}))
mdl, err := p.Model("mock")
if err != nil {
t.Fatal(err)
}
fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n")
if _, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4); err == nil {
t.Error("runAgent should error when the finalization fallback also yields no output")
} }
} }
+74 -1
View File
@@ -183,6 +183,36 @@ export GADFLY_FINDINGS_TOKEN="${GADFLY_FINDINGS_TOKEN:-}"
MODELS="${GADFLY_MODELS:-${OLLAMA_REVIEW_MODELS:-$DEFAULT_MODELS}}" MODELS="${GADFLY_MODELS:-${OLLAMA_REVIEW_MODELS:-$DEFAULT_MODELS}}"
DEFAULT_CONC="${GADFLY_CONCURRENCY:-1}" DEFAULT_CONC="${GADFLY_CONCURRENCY:-1}"
# --- huge-PR downshift ------------------------------------------------------
# A very large diff is what burns the model budget: every review step re-sends
# it, multiplied across models × lenses × passes × steps (this is what nuked a
# whole Ollama Cloud block on one giant PR). entrypoint is the only process that
# spans the whole fleet, so the fleet-wide size decision lives here: size the PR
# diff ONCE, and above GADFLY_HUGE_DIFF_BYTES collapse to a single cheap model +
# a focused lens subset, fewer steps, no recheck, and a smaller embedded diff.
# A finished shallow review beats a budget-nuking one. All knobs override; set
# GADFLY_HUGE_DIFF_BYTES=0 to disable. Small PRs are never touched.
HUGE_PR=0
HUGE_DIFF_BYTES="${GADFLY_HUGE_DIFF_BYTES:-600000}"
if [ "$HUGE_DIFF_BYTES" -gt 0 ] 2>/dev/null; then
PR_DIFF_BYTES="$(API "${GITEA_API}/pulls/${PR}.diff" 2>/dev/null | wc -c | tr -d '[:space:]')"
[ -z "$PR_DIFF_BYTES" ] && PR_DIFF_BYTES=0
if [ "$PR_DIFF_BYTES" -gt "$HUGE_DIFF_BYTES" ] 2>/dev/null; then
HUGE_PR=1
log "huge PR: diff ${PR_DIFF_BYTES}B > ${HUGE_DIFF_BYTES}B — downshifting the fleet (advisory)"
MODELS="${GADFLY_HUGE_DIFF_MODELS:-${MODELS%%,*}}" # first model only by default
export GADFLY_SPECIALISTS="${GADFLY_HUGE_DIFF_SPECIALISTS:-security,correctness,error-handling}"
export GADFLY_MAX_STEPS="${GADFLY_HUGE_DIFF_MAX_STEPS:-12}"
export GADFLY_RECHECK_MAX_STEPS="${GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS:-8}"
export GADFLY_RECHECK="${GADFLY_HUGE_DIFF_RECHECK:-0}" # skip recheck on huge PRs
# The Go-visible name directly (run.sh prefers GADFLY_MAX_DIFF_CHARS over its
# own MAX_DIFF_CHARS), so the cap is honored without relying on run.sh's alias.
export GADFLY_MAX_DIFF_CHARS="${GADFLY_HUGE_DIFF_MAX_DIFF_CHARS:-20000}"
# Surfaced on each posted comment so the shallower review is self-explaining.
export GADFLY_NOTICE="⚠️ Large PR (${PR_DIFF_BYTES} bytes): Gadfly downshifted to a focused, single-model review to stay within budget — coverage is intentionally shallower. Consider splitting the PR for a deeper review."
fi
fi
provider_of() { case "$1" in */*) echo "${1%%/*}";; *) echo "${GADFLY_PROVIDER:-ollama-cloud}";; esac; } provider_of() { case "$1" in */*) echo "${1%%/*}";; *) echo "${GADFLY_PROVIDER:-ollama-cloud}";; esac; }
# Per-model status file path for the live board. The model id can contain '/' # Per-model status file path for the live board. The model id can contain '/'
@@ -297,6 +327,32 @@ if [ "${GADFLY_STATUS_BOARD:-1}" != "0" ]; then
log "status board started (pid ${BOARD_PID})" log "status board started (pid ${BOARD_PID})"
fi fi
# --- swarm-wide hard backstop ----------------------------------------------
# A wall-clock ceiling across the WHOLE fleet, so a pathological PR can never
# drain the usage block however the models behave. entrypoint is the only
# process spanning every model, so a single "never exceed X" guard lives here.
# On expiry it stops the review subtrees (the binary + run.sh); whatever partial
# findings were gathered are still posted and the job never fails (advisory).
# GADFLY_PR_BUDGET_SECS=0 (default) disables it.
KILLER_PID=""
rm -f "${WORKDIR}/.budget_killed" "${WORKDIR}/.disarmed" 2>/dev/null || true
if [ "${GADFLY_PR_BUDGET_SECS:-0}" -gt 0 ] 2>/dev/null; then
(
sleep "${GADFLY_PR_BUDGET_SECS}"
log "PR wall-clock budget (${GADFLY_PR_BUDGET_SECS}s) reached — stopping the review fleet (advisory; partial findings still posted)"
: > "${WORKDIR}/.budget_killed"
pkill -TERM -f '/usr/local/bin/gadfly' 2>/dev/null || true
pkill -TERM -f "${SCRIPTS_DIR}/run.sh" 2>/dev/null || true
sleep 5
# Guard the delayed SIGKILL on the disarm marker: once the lanes finished and
# the watchdog was disarmed, the consolidation gadfly pass runs next, and a
# name-based KILL here must NOT catch it.
[ -f "${WORKDIR}/.disarmed" ] || pkill -KILL -f '/usr/local/bin/gadfly' 2>/dev/null || true
) &
KILLER_PID=$!
log "PR budget watchdog armed (${GADFLY_PR_BUDGET_SECS}s, pid ${KILLER_PID})"
fi
log "providers: ${PROVIDERS:-none}" log "providers: ${PROVIDERS:-none}"
# Each provider lane runs in parallel; cap is enforced within each lane. Track # Each provider lane runs in parallel; cap is enforced within each lane. Track
# the lane PIDs so we wait ONLY for the review work — not the status board, # the lane PIDs so we wait ONLY for the review work — not the status board,
@@ -308,6 +364,21 @@ for p in $PROVIDERS; do
done done
[ "${#LANE_PIDS[@]}" -gt 0 ] && wait "${LANE_PIDS[@]}" [ "${#LANE_PIDS[@]}" -gt 0 ] && wait "${LANE_PIDS[@]}"
# Reviews finished (or the watchdog killed them): disarm the watchdog so its
# delayed SIGKILL can't catch the consolidation pass that runs next. Drop the
# disarm marker FIRST so even a racing watchdog that already reached its KILL line
# skips it (the kill below also tears the watchdog subshell down during its sleep).
if [ -n "$KILLER_PID" ]; then
: > "${WORKDIR}/.disarmed"
kill "$KILLER_PID" 2>/dev/null || true
fi
# If the backstop fired, note it on the consensus comment (per-model comments
# were already posted during the run; a killed model surfaces as a failed lane).
if [ -f "${WORKDIR}/.budget_killed" ]; then
export GADFLY_NOTICE="${GADFLY_NOTICE:+${GADFLY_NOTICE} }⏱️ This review was stopped early by the per-PR time budget (GADFLY_PR_BUDGET_SECS); findings are partial."
fi
# Reviews are done: signal the board to render the final state once and exit. # Reviews are done: signal the board to render the final state once and exit.
if [ -n "$BOARD_PID" ]; then if [ -n "$BOARD_PID" ]; then
touch "${STATUS_DIR}/.done" 2>/dev/null || true touch "${STATUS_DIR}/.done" 2>/dev/null || true
@@ -331,7 +402,9 @@ if [ "$CONSOLIDATE" = "1" ]; then
CONSENSUS="$(GADFLY_CONSOLIDATE_DIR="$FINDINGS_DIR" GADFLY_DIFF_FILE="$DIFF_FILE" \ CONSENSUS="$(GADFLY_CONSOLIDATE_DIR="$FINDINGS_DIR" GADFLY_DIFF_FILE="$DIFF_FILE" \
/usr/local/bin/gadfly 2>"${WORKDIR}/consolidate.err" || true)" /usr/local/bin/gadfly 2>"${WORKDIR}/consolidate.err" || true)"
if [ -n "$CONSENSUS" ]; then if [ -n "$CONSENSUS" ]; then
BODY="$(printf '%s\n\n<sub>Automated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.</sub>' "$CONSENSUS")" NOTICE_BLOCK=""
[ -n "${GADFLY_NOTICE:-}" ] && NOTICE_BLOCK="> ${GADFLY_NOTICE}"$'\n\n'
BODY="$(printf '%s%s\n\n<sub>Automated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.</sub>' "$NOTICE_BLOCK" "$CONSENSUS")"
upsert_comment_body "<!-- gadfly-consensus -->" "$BODY" upsert_comment_body "<!-- gadfly-consensus -->" "$BODY"
log "consensus comment posted" log "consensus comment posted"
else else
+2
View File
@@ -3,6 +3,7 @@ module gitea.stevedudenhoeffer.com/steve/gadfly
go 1.26.2 go 1.26.2
require ( require (
gitea.stevedudenhoeffer.com/steve/executus v0.1.4
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462 gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462
gopkg.in/yaml.v3 v3.0.1 gopkg.in/yaml.v3 v3.0.1
) )
@@ -17,6 +18,7 @@ require (
github.com/go-logr/stdr v1.2.2 // indirect github.com/go-logr/stdr v1.2.2 // indirect
github.com/google/go-cmp v0.7.0 // indirect github.com/google/go-cmp v0.7.0 // indirect
github.com/google/s2a-go v0.1.9 // indirect github.com/google/s2a-go v0.1.9 // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/googleapis/enterprise-certificate-proxy v0.3.17 // indirect github.com/googleapis/enterprise-certificate-proxy v0.3.17 // indirect
github.com/googleapis/gax-go/v2 v2.22.0 // indirect github.com/googleapis/gax-go/v2 v2.22.0 // indirect
github.com/gorilla/websocket v1.5.3 // indirect github.com/gorilla/websocket v1.5.3 // indirect
+2
View File
@@ -4,6 +4,8 @@ cloud.google.com/go/auth v0.20.0 h1:kXTssoVb4azsVDoUiF8KvxAqrsQcQtB53DcSgta74CA=
cloud.google.com/go/auth v0.20.0/go.mod h1:942/yi/itH1SsmpyrbnTMDgGfdy2BUqIKyd0cyYLc5Q= cloud.google.com/go/auth v0.20.0/go.mod h1:942/yi/itH1SsmpyrbnTMDgGfdy2BUqIKyd0cyYLc5Q=
cloud.google.com/go/compute/metadata v0.9.0 h1:pDUj4QMoPejqq20dK0Pg2N4yG9zIkYGdBtwLoEkH9Zs= cloud.google.com/go/compute/metadata v0.9.0 h1:pDUj4QMoPejqq20dK0Pg2N4yG9zIkYGdBtwLoEkH9Zs=
cloud.google.com/go/compute/metadata v0.9.0/go.mod h1:E0bWwX5wTnLPedCKqk3pJmVgCBSM6qQI1yTBdEb3C10= cloud.google.com/go/compute/metadata v0.9.0/go.mod h1:E0bWwX5wTnLPedCKqk3pJmVgCBSM6qQI1yTBdEb3C10=
gitea.stevedudenhoeffer.com/steve/executus v0.1.4 h1:4F99uCV3OVaE9ITFp0FjPiYxLUQO+WpE+wU2HCnpXNM=
gitea.stevedudenhoeffer.com/steve/executus v0.1.4/go.mod h1:WQP/lH+meU06OSNF0TQO/wQLcJCrMwpi0EMj5vSpVtk=
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462 h1:1crjE1YkWHLZ91tUDOxN/Y5cuOnJ56e0U9UADoFfEPY= gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462 h1:1crjE1YkWHLZ91tUDOxN/Y5cuOnJ56e0U9UADoFfEPY=
gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462/go.mod h1:UZLveG17SmENt4sne2RSLIbioix30RZbRIQUzBAnOyY= gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462/go.mod h1:UZLveG17SmENt4sne2RSLIbioix30RZbRIQUzBAnOyY=
github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs= github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
+7 -3
View File
@@ -177,7 +177,7 @@ case "$PROVIDER" in
GADFLY_SYSTEM_FILE="${SCRIPT_DIR}/system-prompt.txt" \ GADFLY_SYSTEM_FILE="${SCRIPT_DIR}/system-prompt.txt" \
GADFLY_TITLE="$TITLE" \ GADFLY_TITLE="$TITLE" \
GADFLY_BODY="$BODY" \ GADFLY_BODY="$BODY" \
GADFLY_MAX_DIFF_CHARS="$MAX_DIFF_CHARS" \ GADFLY_MAX_DIFF_CHARS="${GADFLY_MAX_DIFF_CHARS:-$MAX_DIFF_CHARS}" \
GADFLY_STATUS_FILE="${GADFLY_STATUS_FILE:-}" \ GADFLY_STATUS_FILE="${GADFLY_STATUS_FILE:-}" \
GADFLY_FINDINGS_OUT="${GADFLY_FINDINGS_OUT:-}" \ GADFLY_FINDINGS_OUT="${GADFLY_FINDINGS_OUT:-}" \
"$BIN" 2>"$ERR_FILE" "$BIN" 2>"$ERR_FILE"
@@ -230,8 +230,12 @@ fi
if [ "$CONSOLIDATE" = "1" ]; then if [ "$CONSOLIDATE" = "1" ]; then
say "done in ${DUR} (consolidated; no per-model comment)" say "done in ${DUR} (consolidated; no per-model comment)"
else else
COMMENT="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s\n\n<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in %s</sub>' \ # An optional one-line notice (e.g. entrypoint's huge-PR downshift advisory),
"$MARKER" "$MODEL" "$MODEL_PROVIDER" "$REVIEW" "$DUR")" # shown under the header so a shallower review is self-explaining.
NOTICE_BLOCK=""
[ -n "${GADFLY_NOTICE:-}" ] && NOTICE_BLOCK="> ${GADFLY_NOTICE}"$'\n\n'
COMMENT="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s%s\n\n<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in %s</sub>' \
"$MARKER" "$MODEL" "$MODEL_PROVIDER" "$NOTICE_BLOCK" "$REVIEW" "$DUR")"
upsert_comment "$COMMENT" upsert_comment "$COMMENT"
say "done in ${DUR}" say "done in ${DUR}"
fi fi
+3 -1
View File
@@ -16,7 +16,9 @@ state. USE THEM to verify before you report. Do not review the diff in isolation
- list_dir([path]) — list a directory. - list_dir([path]) — list a directory.
- grep(pattern[, path, max_results]) — RE2 regex search across the repo. - grep(pattern[, path, max_results]) — RE2 regex search across the repo.
- find_files(name[, max_results]) — locate a file by path substring. - find_files(name[, max_results]) — locate a file by path substring.
- get_diff() — the full unified diff (the task message may truncate it). - get_diff([path, start_line, limit]) — the unified diff as a paginated, numbered window;
pass `path` to fetch just one changed file's hunks (do this on a big PR instead of pulling
the whole diff at once).
Mandatory verification discipline — this is the whole point of giving you tools: Mandatory verification discipline — this is the whole point of giving you tools:
- Before claiming a missing/duplicate import, an undefined symbol, a wrong signature, - Before claiming a missing/duplicate import, an undefined symbol, a wrong signature,