From b86011933238833d39d1a7bd07e506a2e50316a0 Mon Sep 17 00:00:00 2001 From: Steve Dudenhoeffer Date: Tue, 30 Jun 2026 10:43:34 -0400 Subject: [PATCH 1/2] feat: re-platform agentic review onto executus + large-PR cost controls MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fixes the large-PR token burn: a ~250K-token diff was re-sent every agent step across models × lenses × passes, draining a metered usage block in minutes. Small PRs are untouched (every mitigation is size-gated / no-op under threshold). - Re-platform the in-process review path onto executus run.Executor: context compaction (executus/compact, threshold from the model's real context window via executus/model), run-bounding, a per-PR budget gate (Ports.Budget), and the wrap-up nudge re-expressed as a run.Critic. Lens fan-out now uses executus/fanout. gadfly keeps its own model.go, so GADFLY_ENDPOINT_ aliases and the claude-code engine are unaffected. No majordomo bump; the binary stays static (executus core is majordomo+stdlib only). - Paginate get_diff (per-file `path` + start_line/limit) instead of dumping the whole diff; trim the recheck diff embed (60k -> 20k chars). - entrypoint.sh: downshift the fleet above GADFLY_HUGE_DIFF_BYTES (one cheap model, fewer lenses/steps, no recheck) + a swarm-wide GADFLY_PR_BUDGET_SECS wall-clock backstop (adds procps for pkill). All advisory; CI never fails. - README + CLAUDE.md + tests updated. Note: run.Result exposes no transcript, so the old transcript-based forced- finalization fallback is dropped; the wrap-up critic nudge is the remaining "always emit something" mechanism. Co-Authored-By: Claude Opus 4.8 (1M context) --- CLAUDE.md | 34 ++- Dockerfile | 5 +- README.md | 44 +++- cmd/gadfly/engine.go | 12 +- cmd/gadfly/executus.go | 364 ++++++++++++++++++++++++++++ cmd/gadfly/executus_test.go | 140 +++++++++++ cmd/gadfly/lens_concurrency_test.go | 6 +- cmd/gadfly/main.go | 182 ++++---------- cmd/gadfly/model.go | 21 ++ cmd/gadfly/recheck.go | 11 +- cmd/gadfly/recheck_test.go | 35 ++- cmd/gadfly/tools.go | 93 ++++++- cmd/gadfly/tools_test.go | 60 ++++- cmd/gadfly/wrapup_test.go | 71 ++---- entrypoint.sh | 65 ++++- go.mod | 2 + go.sum | 2 + scripts/run.sh | 8 +- scripts/system-prompt.txt | 4 +- 19 files changed, 935 insertions(+), 224 deletions(-) create mode 100644 cmd/gadfly/executus.go create mode 100644 cmd/gadfly/executus_test.go diff --git a/CLAUDE.md b/CLAUDE.md index b9d3d1a..09efb15 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -22,15 +22,21 @@ verifies each one against the actual code, and posts its findings as a comment. 4. **Provider-agnostic.** Powered by [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo), so it can target Ollama (local/cloud), OpenAI, Anthropic, Google, or any OpenAI/Ollama-compatible endpoint. Don't re-hardcode a single provider. -5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib + majordomo. Keep - it that way — no heavyweight deps, no coupling to any one consumer repo (e.g. mort). +5. **Portable & self-contained.** `cmd/gadfly` depends only on the Go stdlib, majordomo, and + [executus](https://gitea.stevedudenhoeffer.com/steve/executus) (whose *core* — `run`/`compact`/ + `model`/`fanout`/`tool` — is itself majordomo+stdlib only, so the binary stays static; do NOT + pull executus's `contrib/store` or any battery that drags in a DB driver). No heavyweight deps, + no coupling to any one consumer repo (e.g. mort). Gadfly is executus's canonical *light* consumer. ## Architecture ``` cmd/gadfly/ the reviewer binary — pure producer of review markdown (stdout) - main.go orchestration: loop specialists, each a review pass + adversarial recheck - engine.go reviewEngine abstraction: majordomo agent loop vs claude-code CLI shell-out + main.go orchestration: fan specialists out (executus/fanout), each a review pass + recheck + engine.go reviewEngine abstraction: executus run.Executor (majordomo agent loop + + compaction/bounding/budget/critic) vs claude-code CLI shell-out + executus.go executus wiring: tool.Registry over the repo tools, the run.Executor build + (compact + model context-limit threshold + per-PR budget + wrap-up critic) specialists.go specialist lenses: built-ins, default suite, env + .gadfly.yml resolution auto.go dynamic `auto` selection: a selector model picks lenses per-diff (may invent) delegate.go worker-tier delegate_investigation tool (cheap sub-agent does legwork) @@ -69,7 +75,7 @@ verdict. Verdict is one of: `No material issues found` / `Minor issues` / `Block ## Build / test ```sh -go build ./cmd/gadfly # needs read access to the private majordomo module +go build ./cmd/gadfly # needs read access to the private majordomo + executus modules go test ./... gofmt -l cmd/ # must be clean docker build -t gadfly:dev --secret id=REGISTRY_USER,env=REGISTRY_USER --secret id=REGISTRY_PASSWORD,env=REGISTRY_PASSWORD . @@ -149,3 +155,21 @@ are actually exercised. OpenAI/Anthropic/Google come from majordomo's abstractio parallel, `cap` (from `GADFLY_PROVIDER_CONCURRENCY` else `GADFLY_CONCURRENCY`, default 1) bounds models-at-once within a lane. The review timeout (`GADFLY_TIMEOUT_SECS`) is **per-lens**, not shared across the suite — a slow model can't starve later lenses (the original timeout bug). +- **Large-PR token burn**: the agent loop re-sends the whole transcript every step, so a giant + diff (the old `get_diff` dumped it untruncated, and it was embedded in both the review and + recheck task) was re-transmitted ~steps × lenses × passes × models times — a ~250 K-token PR + could drain a metered usage block in minutes. Fixed in three size-gated layers (small PRs + untouched): paginated `get_diff` + `executus/compact` compaction in the binary; an + `entrypoint.sh` downshift above `GADFLY_HUGE_DIFF_BYTES` (one cheap model, fewer lenses/steps, + no recheck); and a swarm-wide `GADFLY_PR_BUDGET_SECS` wall-clock backstop. Compaction's threshold + is intentionally LOW (`GADFLY_COMPACT_RATIO` 0.45, not executus's 0.7) because the burning + transcript on the embedded path rarely reaches 0.7×context. +- **executus re-platform**: the in-process review path runs through `executus/run`'s `run.Executor` + (compaction, run-bounding, `Ports.Budget`, the wrap-up nudge as `Ports.Critic`), wiring it in + `cmd/gadfly/executus.go`. Gadfly KEEPS its own `model.go` resolution (so `GADFLY_ENDPOINT_` + http aliases + the claude-code engine survive) and only hands `run.Executor` the already-resolved + model via a trivial resolver — do NOT route review-model resolution through + `model.ParseModelForContext` (it bypasses gadfly's endpoint aliases). `run.Result` exposes no + transcript, so the old transcript-based forced-finalization fallback is gone; the wrap-up critic + nudge is the remaining "always emit something" mechanism. The claude-code engine still shells out + and is unaffected. diff --git a/Dockerfile b/Dockerfile index cc7f356..114b3f1 100644 --- a/Dockerfile +++ b/Dockerfile @@ -24,7 +24,10 @@ RUN --mount=type=cache,target=/go/pkg/mod \ go build -trimpath -ldflags="-s -w" -o /out/gadfly ./cmd/gadfly FROM alpine:3.20 -RUN apk add --no-cache bash git curl jq ca-certificates nodejs npm +# procps provides pkill/pgrep, which entrypoint.sh's per-PR wall-clock backstop +# (GADFLY_PR_BUDGET_SECS) uses to stop the review subtrees — busybox's applets +# are not guaranteed to include them. +RUN apk add --no-cache bash git curl jq ca-certificates nodejs npm procps # Bundle the Claude Code CLI so the `claude-code` review engine works out of the # box (GADFLY_MODELS=claude-code or claude-code/). This adds Node + the # CLI to the image (notably larger); ollama-only users pay the size but nothing diff --git a/README.md b/README.md index dedc675..2b0eb99 100644 --- a/README.md +++ b/README.md @@ -396,7 +396,16 @@ The reviewer binary reads these (the stub/entrypoint set sane defaults): | `GADFLY_TIMEOUT_SECS` | 300 | deadline **per specialist lens** (review+recheck) | | `GADFLY_RECHECK` | on | set `0`/`false` to skip the recheck pass | | `GADFLY_RECHECK_MAX_STEPS` | 16 | recheck-pass step cap | -| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the prompt (full diff via `get_diff`) | +| `GADFLY_MAX_DIFF_CHARS` | 60000 | diff chars embedded in the **review** prompt (the full diff is reachable via the paginated `get_diff` tool, scoped per file with its `path` arg) | +| `GADFLY_RECHECK_DIFF_CHARS` | 20000 | diff chars embedded in the **recheck** prompt (smaller — the recheck pages `get_diff` for the hunks it verifies) | +| `GADFLY_COMPACT` | on | context compaction (via [executus](https://gitea.stevedudenhoeffer.com/steve/executus)): fold the transcript's runaway middle into a summary as it nears the model's context window, so a big diff + accumulating tool output can't balloon every step. `0` disables | +| `GADFLY_COMPACT_RATIO` | 0.45 | fraction of the model's context window at which compaction fires | +| `GADFLY_COMPACT_MODEL` | worker, else review model | cheap model the compactor uses to summarize the folded middle | +| `GADFLY_COMPACT_KEEP_RECENT` | 8 | most-recent messages kept verbatim during compaction | +| `GADFLY_COMPACT_SUMMARY_WORDS` | 200 | word cap on the compaction summary | +| `GADFLY_MODEL_CONTEXT_TOKENS` | *(auto)* | override the model's context-window size (tokens) for the compaction threshold; set it for self-hosted endpoints executus can't introspect (Ollama Cloud models resolve automatically) | +| `GADFLY_PR_TOKEN_BUDGET` | — | per-model token ceiling for this PR; once spent, remaining lenses/passes are skipped (advisory). 0 = off | +| `GADFLY_PR_TIME_BUDGET_SECS` | — | per-model wall-clock ceiling for this PR (advisory). 0 = off | | `GADFLY_STATUS_BOARD` | on | set `0` to disable the live status-board comment | | `GADFLY_STATUS_POLL_SECS` | 12 | how often the status board re-renders/upserts | | `GADFLY_CONSOLIDATE` | `auto` | cross-model consensus comment: `auto` (on for ≥2 models), `1` (force on), `0` (off — one comment per model) | @@ -408,6 +417,37 @@ The reviewer binary reads these (the stub/entrypoint set sane defaults): | `GADFLY_REPO` | *(from `GITEA_API`)* | `owner/repo` slug stamped on emitted runs/findings (set by `entrypoint.sh`) | | `GADFLY_PR` | *(from event)* | PR number stamped on emitted runs/findings (set by `entrypoint.sh`) | +### Large-PR cost controls + +A very large diff is the one thing that can blow the budget: every review step +re-sends it, multiplied across models × lenses × passes × steps (a single +~250 K-token PR can otherwise burn a whole metered usage block). Gadfly handles +big PRs in three layers, all **size-gated so small PRs are untouched**: + +1. **Paginated `get_diff` + compaction** (reviewer binary, on by default) — + `get_diff` returns a paginated, optionally per-file window instead of the whole + diff, and once a transcript nears the model's context window its middle is + folded into a summary (powered by [executus](https://gitea.stevedudenhoeffer.com/steve/executus)'s + `compact`). Tune with the `GADFLY_COMPACT_*` knobs above. +2. **Downshift** (`entrypoint.sh`) — above `GADFLY_HUGE_DIFF_BYTES` the whole fleet + collapses to a single cheap model + a focused lens subset, fewer steps, and no + recheck. A finished shallow review beats a budget-nuking one, and the posted + comment says so. +3. **Hard backstop** (`entrypoint.sh`) — `GADFLY_PR_BUDGET_SECS` is a wall-clock + ceiling across the *entire* fleet; on expiry the review is stopped and whatever + was found so far is posted. Like everything else, it never fails CI. + +| Env | Default | Meaning | +|-----|---------|---------| +| `GADFLY_HUGE_DIFF_BYTES` | 600000 | downshift the fleet when the PR diff exceeds this many bytes (0 = never downshift) | +| `GADFLY_HUGE_DIFF_MODELS` | first model | model(s) to run on a downshifted huge PR | +| `GADFLY_HUGE_DIFF_SPECIALISTS` | `security,correctness,error-handling` | lenses on a downshifted huge PR | +| `GADFLY_HUGE_DIFF_MAX_STEPS` | 12 | review step cap on a huge PR | +| `GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS` | 8 | recheck step cap on a huge PR | +| `GADFLY_HUGE_DIFF_RECHECK` | 0 | run the recheck pass on a huge PR (off by default) | +| `GADFLY_HUGE_DIFF_MAX_DIFF_CHARS` | 20000 | embedded review-diff chars on a huge PR | +| `GADFLY_PR_BUDGET_SECS` | — | swarm-wide wall-clock backstop; stops the whole fleet when reached (0 = off) | + ## Findings telemetry (optional) Gadfly can record what it found so model quality can be tracked over time. It is @@ -431,7 +471,7 @@ code. ## Building locally ```sh -go build ./cmd/gadfly # needs read access to the private majordomo module +go build ./cmd/gadfly # needs read access to the private majordomo + executus modules go test ./... ``` diff --git a/cmd/gadfly/engine.go b/cmd/gadfly/engine.go index 7fe4c09..6d87489 100644 --- a/cmd/gadfly/engine.go +++ b/cmd/gadfly/engine.go @@ -33,14 +33,18 @@ type reviewEngine interface { runPass(ctx context.Context, system, task string, maxSteps int) (string, error) } -// majordomoEngine drives the in-process majordomo agent over the repo sandbox. +// majordomoEngine drives the in-process review path over the repo sandbox. It no +// longer calls majordomo's agent loop directly: each pass runs through an +// executus run.Executor (see executus.go), which adds context compaction, run +// bounding, the per-PR budget gate, and the wrap-up critic. mdl is retained only +// so auto-select can fall back to the review model as its selector. type majordomoEngine struct { - mdl llm.Model - fsTools *repoFS + rex *reviewExecutor + mdl llm.Model } func (e *majordomoEngine) runPass(ctx context.Context, system, task string, maxSteps int) (string, error) { - return runAgent(ctx, e.mdl, e.fsTools, system, task, maxSteps) + return e.rex.run(ctx, system, task, maxSteps) } // claudeCodeEngine reviews by shelling out to the `claude` CLI (Claude Code) in diff --git a/cmd/gadfly/executus.go b/cmd/gadfly/executus.go new file mode 100644 index 0000000..41078e5 --- /dev/null +++ b/cmd/gadfly/executus.go @@ -0,0 +1,364 @@ +package main + +// executus.go wires gadfly's agentic review path onto the executus run kernel +// (gitea.stevedudenhoeffer.com/steve/executus), layered above majordomo. The +// majordomoEngine no longer drives majordomo's agent loop directly; it builds a +// run.Executor that gives gadfly, for free: +// +// - context compaction (executus/compact): once the transcript a step would +// SEND crosses a token threshold derived from the model's real context +// window, the runaway middle is folded into a one-paragraph summary by a +// cheap summarizer model — so a big diff + accumulating read_file/grep +// results can't balloon every re-sent step (the large-PR burn). +// - run bounding + a per-PR spend budget (executus/run Ports.Budget): a hard +// token/seconds ceiling so a pathological PR can't drain the usage block. +// - the wrap-up nudge, re-expressed as an executus Critic (Ports.Critic): the +// steer that tells a step-hungry model to stop investigating and write its +// answer is now the critic seam, not a bespoke RunOption. +// +// Everything degrades to today's behavior when unconfigured: nil summarizer or a +// 0 context window disables compaction; nil budget disables the ceiling; the +// claude-code engine shells out and is unaffected by any of this. +// +// gadfly keeps its own model.go resolution (so GADFLY_ENDPOINT_ http +// aliases, failover chains, and the claude-code engine all survive) — the +// run.Executor is handed gadfly's already-resolved model via a trivial resolver, +// not routed through executus's tier table. + +import ( + "context" + "errors" + "fmt" + "os" + "strconv" + "strings" + "sync" + "sync/atomic" + "time" + + llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm" + + "gitea.stevedudenhoeffer.com/steve/executus/compact" + "gitea.stevedudenhoeffer.com/steve/executus/model" + exrun "gitea.stevedudenhoeffer.com/steve/executus/run" + exectool "gitea.stevedudenhoeffer.com/steve/executus/tool" +) + +const ( + // defaultCompactRatio is the fraction of the model's context window at which + // compaction fires. It is deliberately LOWER than executus's own 0.7 default: + // on the large-PR burn the per-step transcript is the embedded diff (~17K) plus + // accumulating read_file results, which rarely reaches 0.7×262K≈183K — so a + // 0.7 threshold never bites. ~0.45×262K≈118K folds the runaway middle while a + // transcript is still well under the cap. Override with GADFLY_COMPACT_RATIO. + defaultCompactRatio = 0.45 + // defaultCompactKeepRecent / defaultCompactSummaryWords mirror executus's own + // compactor defaults; surfaced as gadfly env knobs for tuning. + defaultCompactKeepRecent = 8 + defaultCompactSummaryWords = 200 +) + +// runSeq mints a unique-per-process RunID suffix for each executor run so audit +// and the run kernel can tell one pass from another within a binary process. +var runSeq atomic.Uint64 + +// wrappedTool adapts an already-built majordomo llm.Tool (gadfly's sandboxed +// read_file/grep/get_diff/… closures over the repoFS) to executus's tool.Tool +// interface so the run kernel can build a toolbox from them by name. gadfly's +// tools need no caller/channel identity, so BuildLLM ignores the Invocation and +// returns the pre-built tool; Permission is the zero value (private, ungated). +type wrappedTool struct{ t llm.Tool } + +func (w wrappedTool) Name() string { return w.t.Name } +func (w wrappedTool) Description() string { return w.t.Description } +func (w wrappedTool) Permission() exectool.Permission { return exectool.Permission{} } +func (w wrappedTool) BuildLLM(_ exectool.Invocation) llm.Tool { return w.t } + +// gadflyToolRegistry registers the repo's read-only tools (plus the optional +// delegate_investigation worker tool) in a fresh executus tool.Registry and +// returns it along with the tool names for RunnableAgent.LowLevelTools. +func gadflyToolRegistry(fs *repoFS) (exectool.Registry, []string, error) { + reg := exectool.NewRegistry() + tools := fs.fsTools() + if fs.worker != nil { + tools = append(tools, fs.delegateTool()) + } + names := make([]string, 0, len(tools)) + for _, t := range tools { + if err := reg.Register(wrappedTool{t: t}); err != nil { + return nil, nil, fmt.Errorf("register tool %q: %w", t.Name, err) + } + names = append(names, t.Name) + } + return reg, names, nil +} + +// gadflyBudget is gadfly's per-PR spend ceiling, satisfying run.Ports.Budget. +// It gates a run BEFORE it makes any model call (Check) once the process has +// spent its token or wall-clock allowance on this PR. Tokens are fed in +// out-of-band via addUsage (the Budget interface's Commit only carries seconds); +// the engine calls addUsage after each pass with run.Result.Usage. A nil +// *gadflyBudget is never installed — caps of 0 mean "unlimited", so the port is +// only wired when at least one cap is set. +type gadflyBudget struct { + mu sync.Mutex + maxTokens int64 + maxSeconds float64 + tokens int64 + seconds float64 +} + +// newPRBudget builds the per-PR budget from env, or nil when neither cap is set +// (the default — the swarm-wide ceiling lives in entrypoint.sh; this is the +// per-process belt to its suspenders). +func newPRBudget() *gadflyBudget { + toks := envInt("GADFLY_PR_TOKEN_BUDGET", 0) + secs := envInt("GADFLY_PR_TIME_BUDGET_SECS", 0) + if toks <= 0 && secs <= 0 { + return nil + } + return &gadflyBudget{maxTokens: int64(toks), maxSeconds: float64(secs)} +} + +func (b *gadflyBudget) Check(_ context.Context, _ string) error { + if b == nil { + return nil + } + b.mu.Lock() + defer b.mu.Unlock() + if b.maxTokens > 0 && b.tokens >= b.maxTokens { + return fmt.Errorf("gadfly: per-PR token budget exhausted (%d/%d)", b.tokens, b.maxTokens) + } + if b.maxSeconds > 0 && b.seconds >= b.maxSeconds { + return fmt.Errorf("gadfly: per-PR time budget exhausted (%.0f/%.0fs)", b.seconds, b.maxSeconds) + } + return nil +} + +func (b *gadflyBudget) Commit(_ context.Context, _ string, runtimeSeconds float64) { + if b == nil { + return + } + b.mu.Lock() + defer b.mu.Unlock() + b.seconds += runtimeSeconds +} + +// addUsage records a finished pass's token spend toward the budget. Safe on nil. +func (b *gadflyBudget) addUsage(u llm.Usage) { + if b == nil { + return + } + b.mu.Lock() + defer b.mu.Unlock() + b.tokens += int64(u.InputTokens) + int64(u.OutputTokens) +} + +// wrapUpCritic re-expresses gadfly's wrap-up nudge as an executus run.Critic: +// once a run comes within wrapUpReserve steps of its cap, Steer() injects the +// "stop calling tools and write your final answer" message so a thorough model +// spends its last steps finalizing instead of hard-failing empty. It sets no +// hard deadline (Deadline()==zero) and never raises the step ceiling +// (MaxSteps()==0, defer to the run's MaxIterations) — it is purely the nudge. +type wrapUpCritic struct{ reserve int } + +func (c *wrapUpCritic) Monitor(_ context.Context, info exrun.RunInfo, _ time.Duration) exrun.CriticHandle { + return &wrapUpHandle{maxSteps: info.MaxIterations, reserve: c.reserve} +} + +type wrapUpHandle struct { + mu sync.Mutex + maxSteps int + reserve int + done int // steps completed so far + nudged bool +} + +func (h *wrapUpHandle) RecordStep(iter int, _ *llm.Response) { + h.mu.Lock() + h.done = iter + 1 + h.mu.Unlock() +} +func (h *wrapUpHandle) RecordToolStart(string, string) {} +func (h *wrapUpHandle) Steer() []llm.Message { + h.mu.Lock() + defer h.mu.Unlock() + at := h.maxSteps - h.reserve + if at < 1 { + at = 1 + } + if !h.nudged && h.maxSteps > 0 && h.done >= at { + h.nudged = true + return []llm.Message{llm.UserText(wrapUpInstruction)} + } + return nil +} +func (h *wrapUpHandle) Deadline() time.Time { return time.Time{} } +func (h *wrapUpHandle) MaxSteps() int { return 0 } +func (h *wrapUpHandle) KillCause() error { return nil } +func (h *wrapUpHandle) Stop() {} + +// reviewExecutor bundles a run.Executor with the per-run wiring the engine needs +// for each pass (the tool names to expose, the model spec to report as the tier, +// the per-PR caller id, and the budget to feed token usage into). +type reviewExecutor struct { + ex *exrun.Executor + toolNames []string + modelSpec string + callerID string + budget *gadflyBudget +} + +// newReviewExecutor builds the run.Executor for the in-process majordomo review +// path. mdl is gadfly's already-resolved review model; summarizer is the cheap +// model the compactor uses (nil disables compaction). Compaction also needs the +// model's context window (resolved once here, not per pass); a 0 window likewise +// disables it. The budget (may be nil) becomes the run.Ports.Budget gate. +func newReviewExecutor(fs *repoFS, mdl, summarizer llm.Model, modelSpec string, budget *gadflyBudget) (*reviewExecutor, error) { + reg, names, err := gadflyToolRegistry(fs) + if err != nil { + return nil, err + } + + // gadfly resolves exactly one model per process; the run kernel's resolver + // just hands that model back regardless of the tier string it is asked for. + modelsResolver := func(ctx context.Context, _ string) (context.Context, llm.Model, error) { + return ctx, mdl, nil + } + + var compactor compact.CompactorFactory + var ctxTokens func(string) int + if summarizer != nil && compactionEnabled() { + if window := resolveContextTokens(modelSpec); window > 0 { + sumResolver := func(ctx context.Context, _ string) (context.Context, llm.Model, error) { + return ctx, summarizer, nil + } + compactor = compact.NewCompactor(compact.CompactorConfig{ + Models: sumResolver, + KeepRecent: envInt("GADFLY_COMPACT_KEEP_RECENT", defaultCompactKeepRecent), + SummaryWordCap: envInt("GADFLY_COMPACT_SUMMARY_WORDS", defaultCompactSummaryWords), + }) + ctxTokens = func(string) int { return window } // memoized: one window per process + } + } + + var ports exrun.Ports + ports.Critic = &wrapUpCritic{reserve: wrapUpReserve()} + if budget != nil { + ports.Budget = budget + } + + cfg := exrun.Config{ + Registry: reg, + Models: modelsResolver, + Compactor: compactor, + ContextTokens: ctxTokens, + Defaults: exrun.Defaults{ + MaxIterations: envInt("GADFLY_MAX_STEPS", defaultMaxSteps), + MaxRuntime: time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second, + MaxConsecutiveToolErrors: 4, + MaxSameToolCallRepeats: 4, + CompactionThresholdRatio: compactRatio(), + FallbackTier: modelSpec, + }, + Ports: ports, + } + return &reviewExecutor{ + ex: exrun.New(cfg), + toolNames: names, + modelSpec: modelSpec, + callerID: prCallerID(), + budget: budget, + }, nil +} + +// run executes one agent pass (review or recheck) through the run kernel and +// returns the model's final text. An empty answer with no error is reported as +// an error so the caller (reviewWithSpecialist) renders the advisory "reviewer +// failed to complete" notice rather than a blank section. +func (r *reviewExecutor) run(ctx context.Context, system, task string, maxSteps int) (string, error) { + res := r.ex.Run(ctx, exrun.RunnableAgent{ + Name: "gadfly-review", + SystemPrompt: system, + ModelTier: r.modelSpec, + MaxIterations: maxSteps, + MaxRuntime: time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second, + LowLevelTools: r.toolNames, + Critic: exrun.CriticConfig{Enabled: true}, + }, exectool.Invocation{ + RunID: fmt.Sprintf("gadfly-%d", runSeq.Add(1)), + CallerID: r.callerID, + }, task) + + // Feed token spend toward the per-PR budget out-of-band (Commit carries only + // seconds; the executor already called it). Safe on a nil budget. + r.budget.addUsage(res.Usage) + + if res.Err != nil { + return "", res.Err + } + if out := strings.TrimSpace(res.Output); out != "" { + return out, nil + } + return "", errors.New("agent produced no output") +} + +// prCallerID is the budget/audit caller key: the repo + PR, so a budget keyed on +// it is naturally per-PR. Falls back to "local" for an out-of-CI run. +func prCallerID() string { + repo := strings.TrimSpace(os.Getenv("GADFLY_REPO")) + pr := strings.TrimSpace(os.Getenv("GADFLY_PR")) + if repo == "" && pr == "" { + return "local" + } + return repo + "#" + pr +} + +// compactionEnabled reports whether context compaction should be wired. On +// unless GADFLY_COMPACT is explicitly falsey. +func compactionEnabled() bool { + switch strings.ToLower(strings.TrimSpace(os.Getenv("GADFLY_COMPACT"))) { + case "0", "false", "no", "off": + return false + default: + return true + } +} + +// compactRatio is the compaction threshold as a fraction of the model context +// window (GADFLY_COMPACT_RATIO), clamped to (0,1]; default defaultCompactRatio. +func compactRatio() float64 { + v := strings.TrimSpace(os.Getenv("GADFLY_COMPACT_RATIO")) + if v == "" { + return defaultCompactRatio + } + f, err := strconv.ParseFloat(v, 64) + if err != nil || f <= 0 || f > 1 { + return defaultCompactRatio + } + return f +} + +// resolveContextTokens returns the review model's context window in tokens, used +// to set the compaction threshold. GADFLY_MODEL_CONTEXT_TOKENS overrides it +// (needed for custom/self-hosted endpoints executus can't introspect); otherwise +// it asks executus/model, which knows the static catalog and can fetch an +// Ollama Cloud model's limit via /api/show (one call, at executor-build time). +// Returns 0 — disabling compaction — for an unknown model, mirroring executus's +// "unknown ⇒ don't budget" contract. +func resolveContextTokens(modelSpec string) int { + if v := envInt("GADFLY_MODEL_CONTEXT_TOKENS", 0); v > 0 { + return v + } + key := strings.TrimSpace(os.Getenv("GADFLY_API_KEY")) + if key == "" { + key = strings.TrimSpace(os.Getenv("OLLAMA_API_KEY")) + } + cache := model.NewCloudOllamaLimitCache("", key, nil) + ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) + defer cancel() + if n, ok := model.MaxContextTokensResolving(ctx, modelSpec, cache); ok { + return n + } + return 0 +} diff --git a/cmd/gadfly/executus_test.go b/cmd/gadfly/executus_test.go new file mode 100644 index 0000000..f1a5c57 --- /dev/null +++ b/cmd/gadfly/executus_test.go @@ -0,0 +1,140 @@ +package main + +import ( + "context" + "testing" + + llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm" +) + +func TestGadflyBudget(t *testing.T) { + ctx := context.Background() + + // A nil budget never blocks and never panics. + var nilB *gadflyBudget + if err := nilB.Check(ctx, "pr"); err != nil { + t.Errorf("nil budget Check should be nil, got %v", err) + } + nilB.Commit(ctx, "pr", 10) + nilB.addUsage(llm.Usage{InputTokens: 5}) + + // Token ceiling: passes until usage crosses it. + b := &gadflyBudget{maxTokens: 100} + if err := b.Check(ctx, "pr"); err != nil { + t.Fatalf("fresh budget should pass, got %v", err) + } + b.addUsage(llm.Usage{InputTokens: 60, OutputTokens: 50}) // 110 >= 100 + if err := b.Check(ctx, "pr"); err == nil { + t.Error("budget over the token cap should reject the next run") + } + + // Seconds ceiling, accumulated via Commit. + s := &gadflyBudget{maxSeconds: 30} + s.Commit(ctx, "pr", 31) + if err := s.Check(ctx, "pr"); err == nil { + t.Error("budget over the time cap should reject the next run") + } +} + +func TestNewPRBudget(t *testing.T) { + t.Setenv("GADFLY_PR_TOKEN_BUDGET", "") + t.Setenv("GADFLY_PR_TIME_BUDGET_SECS", "") + if newPRBudget() != nil { + t.Error("no caps set should yield a nil (disabled) budget") + } + t.Setenv("GADFLY_PR_TOKEN_BUDGET", "1000") + if b := newPRBudget(); b == nil || b.maxTokens != 1000 { + t.Errorf("token cap should build a budget with maxTokens=1000, got %+v", b) + } +} + +func TestCompactRatio(t *testing.T) { + t.Setenv("GADFLY_COMPACT_RATIO", "") + if got := compactRatio(); got != defaultCompactRatio { + t.Errorf("default ratio = %v, want %v", got, defaultCompactRatio) + } + t.Setenv("GADFLY_COMPACT_RATIO", "0.6") + if got := compactRatio(); got != 0.6 { + t.Errorf("ratio override = %v, want 0.6", got) + } + for _, bad := range []string{"0", "-1", "2", "nope"} { + t.Setenv("GADFLY_COMPACT_RATIO", bad) + if got := compactRatio(); got != defaultCompactRatio { + t.Errorf("invalid ratio %q should fall back to the default, got %v", bad, got) + } + } +} + +func TestResolveContextTokensOverride(t *testing.T) { + // The explicit override short-circuits any model introspection (no network). + t.Setenv("GADFLY_MODEL_CONTEXT_TOKENS", "123456") + if got := resolveContextTokens("anything"); got != 123456 { + t.Errorf("explicit context-token override = %d, want 123456", got) + } +} + +func TestPRCallerID(t *testing.T) { + t.Setenv("GADFLY_REPO", "") + t.Setenv("GADFLY_PR", "") + if got := prCallerID(); got != "local" { + t.Errorf("no repo/PR should be %q, got %q", "local", got) + } + t.Setenv("GADFLY_REPO", "steve/mort") + t.Setenv("GADFLY_PR", "1367") + if got := prCallerID(); got != "steve/mort#1367" { + t.Errorf("callerID = %q, want steve/mort#1367", got) + } +} + +func TestCompactionEnabled(t *testing.T) { + for _, v := range []string{"", "1", "true", "yes"} { + t.Setenv("GADFLY_COMPACT", v) + if !compactionEnabled() { + t.Errorf("GADFLY_COMPACT=%q should be enabled", v) + } + } + for _, v := range []string{"0", "false", "no", "off"} { + t.Setenv("GADFLY_COMPACT", v) + if compactionEnabled() { + t.Errorf("GADFLY_COMPACT=%q should be disabled", v) + } + } +} + +func TestGadflyToolRegistry(t *testing.T) { + fs, err := newRepoFS(t.TempDir(), "diff") + if err != nil { + t.Fatal(err) + } + _, names, err := gadflyToolRegistry(fs) + if err != nil { + t.Fatalf("gadflyToolRegistry: %v", err) + } + want := map[string]bool{"read_file": true, "list_dir": true, "grep": true, "find_files": true, "get_diff": true} + has := func(n string) bool { + for _, g := range names { + if g == n { + return true + } + } + return false + } + for n := range want { + if !has(n) { + t.Errorf("registry missing tool %q (got %v)", n, names) + } + } + if has("delegate_investigation") { + t.Error("delegate_investigation must be absent without a worker model") + } + + // With a worker model the delegate tool is registered too. + fs.worker = fakeModel(t, "x") + _, names, err = gadflyToolRegistry(fs) + if err != nil { + t.Fatalf("gadflyToolRegistry with worker: %v", err) + } + if !has("delegate_investigation") { + t.Errorf("delegate_investigation should be registered with a worker model, got %v", names) + } +} diff --git a/cmd/gadfly/lens_concurrency_test.go b/cmd/gadfly/lens_concurrency_test.go index f91c008..4b67a6a 100644 --- a/cmd/gadfly/lens_concurrency_test.go +++ b/cmd/gadfly/lens_concurrency_test.go @@ -104,7 +104,7 @@ func TestRunSpecialists_FansOut(t *testing.T) { } specs := threeLenses() - results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff") + results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff") if got := peak(); got != 3 { t.Errorf("peak concurrent lenses = %d, want 3", got) @@ -124,7 +124,7 @@ func TestRunSpecialists_SequentialByDefault(t *testing.T) { } specs := threeLenses() - results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff") + results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff") if got := peak(); got != 1 { t.Errorf("peak concurrent lenses = %d, want 1 (sequential by default)", got) @@ -146,7 +146,7 @@ func TestRunSpecialists_PerProviderFanOut(t *testing.T) { } specs := threeLenses() - results := runSpecialists(&majordomoEngine{mdl: mdl, fsTools: fs}, "sys", specs, "task", "diff") + results := runSpecialists(testEngine(t, mdl, fs), "sys", specs, "task", "diff") if got := peak(); got != 3 { t.Errorf("peak concurrent lenses = %d, want 3 (m1 per-provider override)", got) diff --git a/cmd/gadfly/main.go b/cmd/gadfly/main.go index 57943a1..85b5b69 100644 --- a/cmd/gadfly/main.go +++ b/cmd/gadfly/main.go @@ -70,11 +70,9 @@ import ( "os" "strconv" "strings" - "sync" "time" - "gitea.stevedudenhoeffer.com/steve/majordomo/agent" - llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm" + "gitea.stevedudenhoeffer.com/steve/executus/fanout" ) const ( @@ -107,13 +105,6 @@ const wrapUpInstruction = "⚠️ You are almost out of your investigation budge "Do not begin any new investigation. If a finding could not be confirmed, drop it or mark it explicitly as unverified. " + "Output the review in the required format right now." -// finalizeInstruction is the user message sent on the tool-free fallback pass -// when the agent exhausted its budget (or tripped a loop guard) without ever -// producing a final answer. It forces the model to synthesize whatever it has. -const finalizeInstruction = "You have run out of investigation steps. Do NOT call any tools. " + - "Based solely on what you have already gathered above, write your final answer now in the required format. " + - "If you could not confirm some findings, omit them or mark them as unverified, but produce the answer." - func main() { if err := run(); err != nil { fmt.Fprintln(os.Stderr, "gadfly:", err) @@ -177,7 +168,18 @@ func run() error { } else if worker != nil { fsTools.worker = worker } - eng = &majordomoEngine{mdl: mdl, fsTools: fsTools} + // The context compactor needs a cheap summarizer; reuse the worker model + // when present, else the review model. A bad explicit GADFLY_COMPACT_MODEL + // just disables compaction rather than sinking the review. + summarizer, serr := resolveSummarizerModel(mdl, fsTools.worker) + if serr != nil { + fmt.Fprintln(os.Stderr, "gadfly: compaction summarizer disabled:", serr) + } + rex, rerr := newReviewExecutor(fsTools, mdl, summarizer, os.Getenv("GADFLY_MODEL"), newPRBudget()) + if rerr != nil { + return fmt.Errorf("build review executor: %w", rerr) + } + eng = &majordomoEngine{rex: rex, mdl: mdl} } specialists, registry, auto, serrs := resolveSpecialists(repoDir) @@ -231,55 +233,55 @@ func run() error { } // runSpecialists reviews the diff through each lens and returns the results in -// the SAME order as specialists, regardless of finish order. Up to -// GADFLY_LENS_CONCURRENCY lenses run concurrently; the default of 1 keeps the -// suite sequential, exactly as before. Each lens already runs under its own -// per-lens timeout (reviewWithSpecialist), so concurrency simply overlaps those -// independent passes — and because reviewWithSpecialist builds a fresh toolbox -// per pass and the lenses only read the immutable repoFS, they share no mutable -// state. Results are stored by index so the consolidated comment keeps the -// configured lens order. +// the SAME order as specialists, regardless of finish order. It uses executus's +// fanout primitive: up to GADFLY_LENS_CONCURRENCY lenses run concurrently (the +// default of 1 keeps the suite sequential, exactly as before), and fanout.Run +// returns one result per lens in input order. Each lens already runs under its +// own per-lens timeout (reviewWithSpecialist) and the lenses only read the +// immutable repoFS, so concurrency simply overlaps independent passes. // // Caution: this fans out WITHIN one model. It multiplies with entrypoint.sh's // per-provider model concurrency, so total concurrent backend requests ≈ // (models at once) × (lenses at once). To fan lenses out without oversubscribing // the backend, run models one at a time (provider lane cap 1) and raise this. func runSpecialists(eng reviewEngine, base string, specialists []Specialist, task, diff string) []specialistResult { - results := make([]specialistResult, len(specialists)) - // Optional live status board: publishes this model's per-lens progress to a // file the entrypoint board renders. Inert (no-op) unless GADFLY_STATUS_FILE // is set, so plain runs are unaffected. sw := newStatusWriter(os.Getenv("GADFLY_MODEL"), modelProvider(), specialists) - conc := min(lensConcurrency(), len(specialists)) + fanResults := fanout.Run(context.Background(), specialists, fanout.Options[Specialist]{ + MaxConcurrent: lensConcurrency(), + }, func(_ context.Context, sp Specialist) (res specialistResult, _ error) { + // A panic in one lens must not crash the whole binary (which would kill + // every other lens's output) or leave this lens stuck at "running" on the + // status board. fanout does not recover fn panics, so we do it here: + // record the panic as an errored result and mark the lens finished. + defer func() { + if r := recover(); r != nil { + res = specialistResult{spec: sp, out: fmt.Sprintf("⚠️ This reviewer panicked: %v", r), verdict: verdictUnknown, errored: true} + sw.set(sp.Name, lensFinished, "", true) + } + }() + sw.set(sp.Name, lensRunning, "", false) + out, errored := reviewWithSpecialist(eng, base, sp, task, diff) + v := parseVerdict(out) + sw.set(sp.Name, lensFinished, v.label(), errored) + return specialistResult{spec: sp, out: out, verdict: v, errored: errored}, nil + }) - sem := make(chan struct{}, conc) - var wg sync.WaitGroup - for i, sp := range specialists { - wg.Add(1) - sem <- struct{}{} // blocks once `conc` lenses are already in flight - go func(i int, sp Specialist) { - defer wg.Done() - defer func() { <-sem }() - // A panic in one lens must not crash the whole binary (which would - // kill every other lens's output) or leave this lens stuck at - // "running" on the status board. Recover, record it as an errored - // result, and mark the lens finished so the board can complete. - defer func() { - if r := recover(); r != nil { - results[i] = specialistResult{spec: sp, out: fmt.Sprintf("⚠️ This reviewer panicked: %v", r), verdict: verdictUnknown, errored: true} - sw.set(sp.Name, lensFinished, "", true) - } - }() - sw.set(sp.Name, lensRunning, "", false) - out, errored := reviewWithSpecialist(eng, base, sp, task, diff) - v := parseVerdict(out) - results[i] = specialistResult{spec: sp, out: out, verdict: v, errored: errored} - sw.set(sp.Name, lensFinished, v.label(), errored) - }(i, sp) + // fanout guarantees input order; its Result.Err is set only when the context + // is cancelled before a lens ran (reviewWithSpecialist embeds its own failures + // in the result), so surface that as an errored lens rather than dropping it. + results := make([]specialistResult, len(specialists)) + for i, r := range fanResults { + if r.Err != nil { + results[i] = specialistResult{spec: specialists[i], out: fmt.Sprintf("⚠️ This reviewer did not run: %v", r.Err), verdict: verdictUnknown, errored: true} + sw.set(specialists[i].Name, lensFinished, "", true) + continue + } + results[i] = r.Value } - wg.Wait() return results } @@ -344,90 +346,6 @@ func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, di return final, false } -// runAgent runs one agent pass (its own fresh toolbox over the sandbox) and -// returns the final answer. An empty answer is an error — the caller decides -// whether that is fatal (review pass) or recoverable (recheck pass). A -// non-empty answer that ended on a budget/guard error is still returned: the -// model wrote its output, then ran out of steps. -// -// Two mechanisms keep a step-hungry model from hard-failing with no output: -// 1. A wrap-up steer: once the run comes within wrapUpReserve steps of the -// cap, a forceful "stop calling tools, write your final answer" message is -// injected so the model spends its remaining steps finalizing. -// 2. A finalization fallback: if the loop still ends empty (the model ignored -// the nudge, or a loop guard tripped), one tool-free model call forces a -// final answer out of the transcript already gathered. -func runAgent(ctx context.Context, mdl llm.Model, fsTools *repoFS, system, task string, maxSteps int) (string, error) { - box, err := fsTools.toolbox() - if err != nil { - return "", err - } - loop := agent.New(mdl, system, - agent.WithToolbox(box), - agent.WithMaxSteps(maxSteps), - // Guard rails: stop the model from spinning on failing or identical - // tool calls instead of writing its answer. - agent.WithToolErrorLimits(4, 4), - ) - - wrapUpAt := maxSteps - wrapUpReserve() - if wrapUpAt < 1 { - wrapUpAt = 1 - } - var completed int // steps finished so far (updated after each step) - nudged := false - - res, runErr := loop.Run(ctx, task, - agent.OnStep(func(s agent.Step) { completed = s.Index + 1 }), - agent.WithSteer(func() []llm.Message { - if !nudged && completed >= wrapUpAt { - nudged = true - return []llm.Message{llm.UserText(wrapUpInstruction)} - } - return nil - }), - ) - - out := "" - if res != nil { - out = strings.TrimSpace(res.Output) - } - if out != "" { - return out, nil - } - - // No final answer. If we still have budget on the clock and a transcript to - // work from, force a tool-free finalization rather than losing the pass. - if res != nil && len(res.Messages) > 0 && ctx.Err() == nil { - if forced := forceFinalAnswer(ctx, mdl, system, res.Messages); forced != "" { - return forced, nil - } - } - - if runErr != nil { - return "", runErr - } - return "", errors.New("agent produced no output") -} - -// forceFinalAnswer makes one tool-free model call to squeeze a final answer out -// of an agent that exhausted its step budget without producing one. Tools are -// forbidden (ToolChoice "none") so the model must synthesize from the transcript -// instead of investigating further. Best-effort: any error or empty reply -// returns "" and the caller falls back to its normal empty-output handling. -func forceFinalAnswer(ctx context.Context, mdl llm.Model, system string, transcript []llm.Message) string { - msgs := append(append([]llm.Message(nil), transcript...), llm.UserText(finalizeInstruction)) - resp, err := mdl.Generate(ctx, llm.Request{ - System: system, - Messages: msgs, - ToolChoice: "none", - }) - if err != nil || resp == nil { - return "" - } - return strings.TrimSpace(resp.Text()) -} - // wrapUpReserve is how many steps before the cap the wrap-up nudge fires, // overridable via GADFLY_WRAPUP_RESERVE. func wrapUpReserve() int { @@ -444,7 +362,7 @@ func buildTask(diff string) string { truncNote := "" if maxDiff > 0 && len(diff) > maxDiff { diff = diff[:maxDiff] - truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars in this message; read the changed files (or call get_diff, if available) for the full text.]", maxDiff) + truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars in this message; page the full diff with get_diff (paginated; pass a `path` to scope it to one file) or read the changed files.]", maxDiff) } var b strings.Builder diff --git a/cmd/gadfly/model.go b/cmd/gadfly/model.go index 5376271..c969d10 100644 --- a/cmd/gadfly/model.go +++ b/cmd/gadfly/model.go @@ -127,6 +127,27 @@ func resolveWorkerModel() (llm.Model, error) { return majordomo.Parse(buildSpec(provider, spec)) } +// resolveSummarizerModel picks the model the context compactor uses to compress +// the runaway middle of a transcript. It should be CHEAP, since it fires once per +// compaction: GADFLY_COMPACT_MODEL if set (honoring GADFLY_PROVIDER for a bare +// id), else the delegate worker model when one is configured (already cheap by +// design), else the review model itself. Returns (nil, err) only on an explicit +// bad GADFLY_COMPACT_MODEL spec — the caller logs it and simply runs without +// compaction rather than failing the review. +func resolveSummarizerModel(review, worker llm.Model) (llm.Model, error) { + if spec := strings.TrimSpace(os.Getenv("GADFLY_COMPACT_MODEL")); spec != "" { + provider := strings.TrimSpace(os.Getenv("GADFLY_PROVIDER")) + if provider == "" { + provider = defaultProvider + } + return majordomo.Parse(buildSpec(provider, spec)) + } + if worker != nil { + return worker, nil + } + return review, nil +} + // buildSpec turns (provider, model) into a majordomo spec. A model id that // already carries a "provider/" prefix (or is a multi-element failover chain) // is passed through verbatim; a bare id is prefixed with the provider. diff --git a/cmd/gadfly/recheck.go b/cmd/gadfly/recheck.go index d23502c..e00bfa7 100644 --- a/cmd/gadfly/recheck.go +++ b/cmd/gadfly/recheck.go @@ -11,6 +11,13 @@ import ( // than discovering them. const defaultRecheckMaxSteps = 16 +// defaultRecheckDiffChars caps the diff embedded in the recheck task. It is much +// smaller than the review task's GADFLY_MAX_DIFF_CHARS: the recheck already has +// the draft findings to verify and can pull the exact hunks it needs via the +// paginated get_diff tool (optionally scoped to a path), so re-embedding the +// whole diff on every recheck step is pure burn. Override: GADFLY_RECHECK_DIFF_CHARS. +const defaultRecheckDiffChars = 20000 + // recheckSystemPrompt drives the second, adversarial verification pass. The // model is given a DRAFT review and must independently confirm each finding // against the real code before letting it survive — the antidote to a @@ -84,11 +91,11 @@ func shouldRecheck(draft string) bool { // scrutinize, with the full diff available via get_diff (and embedded here, // truncated, to save a tool call). func buildRecheckTask(draft, diff string) string { - maxDiff := envInt("GADFLY_MAX_DIFF_CHARS", defaultMaxDiffChars) + maxDiff := envInt("GADFLY_RECHECK_DIFF_CHARS", defaultRecheckDiffChars) truncNote := "" if maxDiff > 0 && len(diff) > maxDiff { diff = diff[:maxDiff] - truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars here; read the changed files (or call get_diff, if available) for the full text.]", maxDiff) + truncNote = fmt.Sprintf("\n\n[NOTE: diff truncated to %d chars here; call get_diff (paginated; pass a `path` to scope it to one file) or read the changed files for the rest.]", maxDiff) } var b strings.Builder diff --git a/cmd/gadfly/recheck_test.go b/cmd/gadfly/recheck_test.go index 877c393..75e8217 100644 --- a/cmd/gadfly/recheck_test.go +++ b/cmd/gadfly/recheck_test.go @@ -77,25 +77,46 @@ func fakeModel(t *testing.T, reply string) llm.Model { return m } -func TestRunAgent_ReturnsOutput(t *testing.T) { +// newTestReviewExecutor builds a reviewExecutor over a fake model + repo for unit +// tests: no compaction summarizer and no budget, so it exercises the bare agent +// loop through the executus run kernel without any network call. +func newTestReviewExecutor(t *testing.T, mdl llm.Model, fs *repoFS) *reviewExecutor { + t.Helper() + rex, err := newReviewExecutor(fs, mdl, nil, "mock", nil) + if err != nil { + t.Fatal(err) + } + return rex +} + +// testEngine wraps newTestReviewExecutor in a majordomoEngine for the +// runSpecialists-level tests. +func testEngine(t *testing.T, mdl llm.Model, fs *repoFS) *majordomoEngine { + t.Helper() + return &majordomoEngine{rex: newTestReviewExecutor(t, mdl, fs), mdl: mdl} +} + +func TestReviewExecutor_ReturnsOutput(t *testing.T) { fs, err := newRepoFS(t.TempDir(), "diff") if err != nil { t.Fatal(err) } mdl := fakeModel(t, " corrected review: No material issues found. ") - out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4) + rex := newTestReviewExecutor(t, mdl, fs) + out, err := rex.run(context.Background(), "sys", "task", 4) if err != nil { - t.Fatalf("runAgent: %v", err) + t.Fatalf("run: %v", err) } if out != "corrected review: No material issues found." { - t.Errorf("runAgent should return trimmed model output, got %q", out) + t.Errorf("run should return trimmed model output, got %q", out) } } -func TestRunAgent_EmptyIsError(t *testing.T) { +func TestReviewExecutor_EmptyIsError(t *testing.T) { fs, _ := newRepoFS(t.TempDir(), "diff") mdl := fakeModel(t, " ") - if _, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4); err == nil { - t.Error("runAgent should error on empty model output") + rex := newTestReviewExecutor(t, mdl, fs) + if _, err := rex.run(context.Background(), "sys", "task", 4); err == nil { + t.Error("run should error on empty model output") } } diff --git a/cmd/gadfly/tools.go b/cmd/gadfly/tools.go index 6a36868..ece4183 100644 --- a/cmd/gadfly/tools.go +++ b/cmd/gadfly/tools.go @@ -17,11 +17,13 @@ import ( // every tool caps how much it can pull in one call — a runaway read_file or // grep would blow the window and stall the loop. const ( - maxFileBytes = 64 * 1024 // per read_file call - maxReadLines = 800 // per read_file call - maxGrepResults = 200 // per grep call - maxFindResults = 200 // per find_files call - maxLineLen = 400 // truncate any single returned line to this + maxFileBytes = 64 * 1024 // per read_file call + maxReadLines = 800 // per read_file call + maxGrepResults = 200 // per grep call + maxFindResults = 200 // per find_files call + maxLineLen = 400 // truncate any single returned line to this + maxGetDiffLines = 800 // per get_diff call (paginated window) + maxGetDiffBytes = 64 * 1024 // per get_diff call ) // skipDirs are never descended into by grep / find_files — noise and bulk that @@ -396,15 +398,86 @@ func (r *repoFS) findFilesTool() llm.Tool { ) } +type getDiffArgs struct { + Path string `json:"path,omitempty" description:"Optional changed-file path (e.g. pkg/foo/bar.go); returns ONLY that file's diff hunks. Omit for the whole diff. Use this on a large PR to pull just the file a finding is about."` + StartLine int `json:"start_line,omitempty" description:"Optional 1-based line to start from within the (whole or path-scoped) diff (default 1)."` + Limit int `json:"limit,omitempty" description:"Optional max number of diff lines to return (default/maximum 800)."` +} + func (r *repoFS) getDiffTool() llm.Tool { - return llm.DefineTool[struct{}]( + return llm.DefineTool[getDiffArgs]( "get_diff", - "Return the complete unified diff under review. The diff is also included (possibly truncated) in the task message; call this to get the full, untruncated text.", - func(_ context.Context, _ struct{}) (any, error) { - if strings.TrimSpace(r.diff) == "" { + "Return the unified diff under review as a numbered, PAGINATED window (like read_file) — not the whole diff at once, so a huge PR can't blow the context window. Pass `path` to fetch just one changed file's hunks, or `start_line`/`limit` to page through. A truncated copy of the diff is also embedded in the task message.", + func(_ context.Context, args getDiffArgs) (any, error) { + diff := r.diff + scope := "the diff" + if p := strings.TrimSpace(args.Path); p != "" { + diff = diffForPath(r.diff, p) + if strings.TrimSpace(diff) == "" { + return fmt.Sprintf("(no diff hunks for path %q; check it against the changed-files list — the path must match a file the diff touches)", p), nil + } + scope = "the diff for " + p + } + if strings.TrimSpace(diff) == "" { return "(empty diff)", nil } - return r.diff, nil + return windowDiff(diff, scope, args.StartLine, args.Limit), nil }, ) } + +// windowDiff returns a numbered, paginated slice of a unified diff, mirroring +// read_file's caps (maxGetDiffLines / maxGetDiffBytes / maxLineLen) so a single +// get_diff call can never dump a multi-hundred-KB diff into the transcript — the +// amplifier behind the large-PR token burn. The full diff stays reachable by +// paging with start_line, or scoped per file via the path arg. +func windowDiff(diff, scope string, start, limit int) string { + lines := strings.Split(diff, "\n") + total := len(lines) + if start < 1 { + start = 1 + } + if limit <= 0 || limit > maxGetDiffLines { + limit = maxGetDiffLines + } + if start > total { + return fmt.Sprintf("(%s has %d lines; nothing at/after line %d)", scope, total, start) + } + var b strings.Builder + emitted := 0 + i := start - 1 + for ; i < total; i++ { + if emitted >= limit || b.Len() >= maxGetDiffBytes { + break + } + line := lines[i] + if len(line) > maxLineLen { + line = line[:maxLineLen] + "…" + } + fmt.Fprintf(&b, "%d\t%s\n", i+1, line) + emitted++ + } + if i < total { + fmt.Fprintf(&b, "... (%s truncated at line %d of %d; call get_diff again with start_line=%d for more, or pass a `path` to scope to one file)\n", scope, i, total, i+1) + } + return b.String() +} + +// diffForPath extracts the unified-diff section for one changed file: from the +// `diff --git` header that names the path through to the next `diff --git` +// header (or end). Matching is a substring on the header line, so either side +// (a/ or b/) hits, including a rename's two paths. +func diffForPath(diff, path string) string { + var b strings.Builder + inSection := false + for _, ln := range strings.Split(diff, "\n") { + if strings.HasPrefix(ln, "diff --git ") { + inSection = strings.Contains(ln, path) + } + if inSection { + b.WriteString(ln) + b.WriteByte('\n') + } + } + return b.String() +} diff --git a/cmd/gadfly/tools_test.go b/cmd/gadfly/tools_test.go index 3c2a2c4..10dc644 100644 --- a/cmd/gadfly/tools_test.go +++ b/cmd/gadfly/tools_test.go @@ -197,15 +197,69 @@ func TestFindFilesTool(t *testing.T) { func TestGetDiffTool(t *testing.T) { root := buildFixtureRepo(t) - const diff = "diff --git a/x b/x\n+added line\n" + const diff = "diff --git a/x b/x\n--- a/x\n+++ b/x\n+added line\n" + + "diff --git a/y.go b/y.go\n--- a/y.go\n+++ b/y.go\n+y change\n" fs, _ := newRepoFS(root, diff) + // Default: the whole diff as a NUMBERED window (paginated), not a raw dump — + // so a huge PR can't be poured into the transcript in one call. + out, err := call(t, fs, "get_diff", map[string]any{}) + if err != nil { + t.Fatalf("get_diff: %v", err) + } + if !strings.Contains(out, "1\tdiff --git a/x b/x") { + t.Errorf("get_diff should return a numbered window, got:\n%s", out) + } + if !strings.Contains(out, "+added line") || !strings.Contains(out, "+y change") { + t.Errorf("the full (short) diff window should include every hunk, got:\n%s", out) + } + + // path filter: only the named file's hunks come back. + out, err = call(t, fs, "get_diff", map[string]any{"path": "y.go"}) + if err != nil { + t.Fatalf("get_diff path: %v", err) + } + if !strings.Contains(out, "y change") { + t.Errorf("get_diff path=y.go should include y's hunk, got:\n%s", out) + } + if strings.Contains(out, "added line") { + t.Errorf("get_diff path=y.go must NOT include x's hunk, got:\n%s", out) + } + + // unknown path: a clear note, never an error. + out, err = call(t, fs, "get_diff", map[string]any{"path": "nope.txt"}) + if err != nil { + t.Fatalf("get_diff unknown path: %v", err) + } + if !strings.Contains(out, "no diff hunks") { + t.Errorf("get_diff for an unknown path should note no hunks, got:\n%s", out) + } +} + +// TestGetDiffTool_Paginates: a diff longer than the per-call line cap is returned +// as a truncated window with a paging hint, and start_line pages past it — the +// mechanism that stops get_diff from dumping a multi-hundred-KB diff at once. +func TestGetDiffTool_Paginates(t *testing.T) { + diff := "diff --git a/big b/big\n" + strings.Repeat("+line\n", maxGetDiffLines+50) + fs, _ := newRepoFS(t.TempDir(), diff) + out, err := call(t, fs, "get_diff", map[string]any{}) if err != nil { t.Fatalf("get_diff: %v", err) } - if out != diff { - t.Errorf("get_diff returned %q, want %q", out, diff) + if !strings.Contains(out, "truncated at line") { + t.Error("a diff longer than the per-call cap should be truncated with a paging hint") + } + if strings.Contains(out, "801\t") { + t.Error("the first window must stop at the line cap, not reach line 801") + } + + out, err = call(t, fs, "get_diff", map[string]any{"start_line": 805}) + if err != nil { + t.Fatalf("get_diff page: %v", err) + } + if !strings.Contains(out, "805\t") { + t.Error("paging with start_line=805 should include line 805") } } diff --git a/cmd/gadfly/wrapup_test.go b/cmd/gadfly/wrapup_test.go index 7ea1c0c..7d99a75 100644 --- a/cmd/gadfly/wrapup_test.go +++ b/cmd/gadfly/wrapup_test.go @@ -37,10 +37,11 @@ func lastUserText(req llm.Request) string { return req.Messages[len(req.Messages)-1].Text() } -// TestRunAgent_WrapUpNudgeProducesAnswer: a model that keeps calling tools until -// it is nudged to wrap up should still finish inside its budget — the steer -// message arrives a few steps before the cap and the model writes its answer. -func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) { +// TestReviewExecutor_WrapUpNudgeProducesAnswer: a model that keeps calling tools +// until it is nudged to wrap up should still finish inside its budget — the steer +// message (delivered by the executus wrap-up critic a few steps before the cap) +// arrives and the model writes its answer. +func TestReviewExecutor_WrapUpNudgeProducesAnswer(t *testing.T) { t.Setenv("GADFLY_WRAPUP_RESERVE", "4") final := "VERDICT: No material issues found." @@ -60,9 +61,10 @@ func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) { } fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n") - out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 12) + rex := newTestReviewExecutor(t, mdl, fs) + out, err := rex.run(context.Background(), "sys", "task", 12) if err != nil { - t.Fatalf("runAgent should succeed via wrap-up nudge, got error: %v", err) + t.Fatalf("run should succeed via wrap-up nudge, got error: %v", err) } if out != final { t.Errorf("expected final review %q, got %q", final, out) @@ -72,24 +74,19 @@ func TestRunAgent_WrapUpNudgeProducesAnswer(t *testing.T) { } } -// TestRunAgent_FinalizationFallback: a model that ignores the wrap-up nudge and -// spins on tools until the cap should NOT hard-fail — the tool-free finalization -// pass forces a final answer out of the transcript. -func TestRunAgent_FinalizationFallback(t *testing.T) { +// TestReviewExecutor_ExhaustionWithoutAnswerIsError: a model that ignores the +// wrap-up nudge and spins on tools until the step cap produces no final answer. +// The transcript-based forced-finalization fallback was removed in the executus +// re-platform (run.Result does not expose the loop transcript), so the pass now +// surfaces an error — which reviewWithSpecialist renders as an advisory "reviewer +// failed to complete" notice rather than a phantom success. +func TestReviewExecutor_ExhaustionWithoutAnswerIsError(t *testing.T) { t.Setenv("GADFLY_WRAPUP_RESERVE", "2") - final := "VERDICT: Minor issues\n- something" - forcedCalled := false n := 0 - p := fake.New("fake", fake.WithDefault(func(_ string, req llm.Request) fake.Step { - // Only the tool-free finalization pass forbids tools — reply there. - if req.ToolChoice == "none" { - forcedCalled = true - return fake.Reply(final) - } - // Otherwise keep spinning, ignoring the wrap-up nudge entirely. + p := fake.New("fake", fake.WithDefault(func(_ string, _ llm.Request) fake.Step { n++ - return spinToolCall(n) + return spinToolCall(n) // spin forever, ignoring the wrap-up nudge })) mdl, err := p.Model("mock") if err != nil { @@ -97,37 +94,9 @@ func TestRunAgent_FinalizationFallback(t *testing.T) { } fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n") - out, err := runAgent(context.Background(), mdl, fs, "sys", "task", 6) - if err != nil { - t.Fatalf("runAgent should recover via finalization fallback, got error: %v", err) - } - if !forcedCalled { - t.Error("finalization fallback was never invoked") - } - if out != final { - t.Errorf("expected forced final answer %q, got %q", final, out) - } -} - -// TestRunAgent_FallbackStillEmptyIsError: if even the tool-free finalization -// yields nothing, runAgent surfaces an error rather than a phantom success. -func TestRunAgent_FallbackStillEmptyIsError(t *testing.T) { - n := 0 - p := fake.New("fake", fake.WithDefault(func(_ string, req llm.Request) fake.Step { - if req.ToolChoice == "none" { - return fake.Reply(" ") // finalization produces only whitespace - } - n++ - return spinToolCall(n) - })) - mdl, err := p.Model("mock") - if err != nil { - t.Fatal(err) - } - fs, _ := newRepoFS(t.TempDir(), "diff --git a/x b/x\n+y\n") - - if _, err := runAgent(context.Background(), mdl, fs, "sys", "task", 4); err == nil { - t.Error("runAgent should error when the finalization fallback also yields no output") + rex := newTestReviewExecutor(t, mdl, fs) + if _, err := rex.run(context.Background(), "sys", "task", 6); err == nil { + t.Error("run should error when the model exhausts its steps without an answer") } } diff --git a/entrypoint.sh b/entrypoint.sh index 034f38f..35c2453 100644 --- a/entrypoint.sh +++ b/entrypoint.sh @@ -183,6 +183,34 @@ export GADFLY_FINDINGS_TOKEN="${GADFLY_FINDINGS_TOKEN:-}" MODELS="${GADFLY_MODELS:-${OLLAMA_REVIEW_MODELS:-$DEFAULT_MODELS}}" DEFAULT_CONC="${GADFLY_CONCURRENCY:-1}" +# --- huge-PR downshift ------------------------------------------------------ +# A very large diff is what burns the model budget: every review step re-sends +# it, multiplied across models × lenses × passes × steps (this is what nuked a +# whole Ollama Cloud block on one giant PR). entrypoint is the only process that +# spans the whole fleet, so the fleet-wide size decision lives here: size the PR +# diff ONCE, and above GADFLY_HUGE_DIFF_BYTES collapse to a single cheap model + +# a focused lens subset, fewer steps, no recheck, and a smaller embedded diff. +# A finished shallow review beats a budget-nuking one. All knobs override; set +# GADFLY_HUGE_DIFF_BYTES=0 to disable. Small PRs are never touched. +HUGE_PR=0 +HUGE_DIFF_BYTES="${GADFLY_HUGE_DIFF_BYTES:-600000}" +if [ "$HUGE_DIFF_BYTES" -gt 0 ] 2>/dev/null; then + PR_DIFF_BYTES="$(API "${GITEA_API}/pulls/${PR}.diff" 2>/dev/null | wc -c | tr -d '[:space:]')" + [ -z "$PR_DIFF_BYTES" ] && PR_DIFF_BYTES=0 + if [ "$PR_DIFF_BYTES" -gt "$HUGE_DIFF_BYTES" ] 2>/dev/null; then + HUGE_PR=1 + log "huge PR: diff ${PR_DIFF_BYTES}B > ${HUGE_DIFF_BYTES}B — downshifting the fleet (advisory)" + MODELS="${GADFLY_HUGE_DIFF_MODELS:-${MODELS%%,*}}" # first model only by default + export GADFLY_SPECIALISTS="${GADFLY_HUGE_DIFF_SPECIALISTS:-security,correctness,error-handling}" + export GADFLY_MAX_STEPS="${GADFLY_HUGE_DIFF_MAX_STEPS:-12}" + export GADFLY_RECHECK_MAX_STEPS="${GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS:-8}" + export GADFLY_RECHECK="${GADFLY_HUGE_DIFF_RECHECK:-0}" # skip recheck on huge PRs + export MAX_DIFF_CHARS="${GADFLY_HUGE_DIFF_MAX_DIFF_CHARS:-20000}" # run.sh -> GADFLY_MAX_DIFF_CHARS + # Surfaced on each posted comment so the shallower review is self-explaining. + export GADFLY_NOTICE="⚠️ Large PR (${PR_DIFF_BYTES} bytes): Gadfly downshifted to a focused, single-model review to stay within budget — coverage is intentionally shallower. Consider splitting the PR for a deeper review." + fi +fi + provider_of() { case "$1" in */*) echo "${1%%/*}";; *) echo "${GADFLY_PROVIDER:-ollama-cloud}";; esac; } # Per-model status file path for the live board. The model id can contain '/' @@ -297,6 +325,29 @@ if [ "${GADFLY_STATUS_BOARD:-1}" != "0" ]; then log "status board started (pid ${BOARD_PID})" fi +# --- swarm-wide hard backstop ---------------------------------------------- +# A wall-clock ceiling across the WHOLE fleet, so a pathological PR can never +# drain the usage block however the models behave. entrypoint is the only +# process spanning every model, so a single "never exceed X" guard lives here. +# On expiry it stops the review subtrees (the binary + run.sh); whatever partial +# findings were gathered are still posted and the job never fails (advisory). +# GADFLY_PR_BUDGET_SECS=0 (default) disables it. +KILLER_PID="" +rm -f "${WORKDIR}/.budget_killed" 2>/dev/null || true +if [ "${GADFLY_PR_BUDGET_SECS:-0}" -gt 0 ] 2>/dev/null; then + ( + sleep "${GADFLY_PR_BUDGET_SECS}" + log "PR wall-clock budget (${GADFLY_PR_BUDGET_SECS}s) reached — stopping the review fleet (advisory; partial findings still posted)" + : > "${WORKDIR}/.budget_killed" + pkill -TERM -f '/usr/local/bin/gadfly' 2>/dev/null || true + pkill -TERM -f "${SCRIPTS_DIR}/run.sh" 2>/dev/null || true + sleep 5 + pkill -KILL -f '/usr/local/bin/gadfly' 2>/dev/null || true + ) & + KILLER_PID=$! + log "PR budget watchdog armed (${GADFLY_PR_BUDGET_SECS}s, pid ${KILLER_PID})" +fi + log "providers: ${PROVIDERS:-none}" # Each provider lane runs in parallel; cap is enforced within each lane. Track # the lane PIDs so we wait ONLY for the review work — not the status board, @@ -308,6 +359,16 @@ for p in $PROVIDERS; do done [ "${#LANE_PIDS[@]}" -gt 0 ] && wait "${LANE_PIDS[@]}" +# Reviews finished (or the watchdog killed them): disarm the watchdog so its +# delayed SIGKILL can't catch the consolidation pass that runs next. +if [ -n "$KILLER_PID" ]; then kill "$KILLER_PID" 2>/dev/null || true; fi + +# If the backstop fired, note it on the consensus comment (per-model comments +# were already posted during the run; a killed model surfaces as a failed lane). +if [ -f "${WORKDIR}/.budget_killed" ]; then + export GADFLY_NOTICE="${GADFLY_NOTICE:+${GADFLY_NOTICE} }⏱️ This review was stopped early by the per-PR time budget (GADFLY_PR_BUDGET_SECS); findings are partial." +fi + # Reviews are done: signal the board to render the final state once and exit. if [ -n "$BOARD_PID" ]; then touch "${STATUS_DIR}/.done" 2>/dev/null || true @@ -331,7 +392,9 @@ if [ "$CONSOLIDATE" = "1" ]; then CONSENSUS="$(GADFLY_CONSOLIDATE_DIR="$FINDINGS_DIR" GADFLY_DIFF_FILE="$DIFF_FILE" \ /usr/local/bin/gadfly 2>"${WORKDIR}/consolidate.err" || true)" if [ -n "$CONSENSUS" ]; then - BODY="$(printf '%s\n\nAutomated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.' "$CONSENSUS")" + NOTICE_BLOCK="" + [ -n "${GADFLY_NOTICE:-}" ] && NOTICE_BLOCK="> ${GADFLY_NOTICE}"$'\n\n' + BODY="$(printf '%s%s\n\nAutomated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.' "$NOTICE_BLOCK" "$CONSENSUS")" upsert_comment_body "" "$BODY" log "consensus comment posted" else diff --git a/go.mod b/go.mod index 5868b92..efecde8 100644 --- a/go.mod +++ b/go.mod @@ -3,6 +3,7 @@ module gitea.stevedudenhoeffer.com/steve/gadfly go 1.26.2 require ( + gitea.stevedudenhoeffer.com/steve/executus v0.1.4 gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462 gopkg.in/yaml.v3 v3.0.1 ) @@ -17,6 +18,7 @@ require ( github.com/go-logr/stdr v1.2.2 // indirect github.com/google/go-cmp v0.7.0 // indirect github.com/google/s2a-go v0.1.9 // indirect + github.com/google/uuid v1.6.0 // indirect github.com/googleapis/enterprise-certificate-proxy v0.3.17 // indirect github.com/googleapis/gax-go/v2 v2.22.0 // indirect github.com/gorilla/websocket v1.5.3 // indirect diff --git a/go.sum b/go.sum index b41b816..581c478 100644 --- a/go.sum +++ b/go.sum @@ -4,6 +4,8 @@ cloud.google.com/go/auth v0.20.0 h1:kXTssoVb4azsVDoUiF8KvxAqrsQcQtB53DcSgta74CA= cloud.google.com/go/auth v0.20.0/go.mod h1:942/yi/itH1SsmpyrbnTMDgGfdy2BUqIKyd0cyYLc5Q= cloud.google.com/go/compute/metadata v0.9.0 h1:pDUj4QMoPejqq20dK0Pg2N4yG9zIkYGdBtwLoEkH9Zs= cloud.google.com/go/compute/metadata v0.9.0/go.mod h1:E0bWwX5wTnLPedCKqk3pJmVgCBSM6qQI1yTBdEb3C10= +gitea.stevedudenhoeffer.com/steve/executus v0.1.4 h1:4F99uCV3OVaE9ITFp0FjPiYxLUQO+WpE+wU2HCnpXNM= +gitea.stevedudenhoeffer.com/steve/executus v0.1.4/go.mod h1:WQP/lH+meU06OSNF0TQO/wQLcJCrMwpi0EMj5vSpVtk= gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462 h1:1crjE1YkWHLZ91tUDOxN/Y5cuOnJ56e0U9UADoFfEPY= gitea.stevedudenhoeffer.com/steve/majordomo v0.0.0-20260627225659-aa25b2c33462/go.mod h1:UZLveG17SmENt4sne2RSLIbioix30RZbRIQUzBAnOyY= github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs= diff --git a/scripts/run.sh b/scripts/run.sh index 0a5fe49..6e0a686 100644 --- a/scripts/run.sh +++ b/scripts/run.sh @@ -230,8 +230,12 @@ fi if [ "$CONSOLIDATE" = "1" ]; then say "done in ${DUR} (consolidated; no per-model comment)" else - COMMENT="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s\n\nAutomated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in %s' \ - "$MARKER" "$MODEL" "$MODEL_PROVIDER" "$REVIEW" "$DUR")" + # An optional one-line notice (e.g. entrypoint's huge-PR downshift advisory), + # shown under the header so a shallower review is self-explaining. + NOTICE_BLOCK="" + [ -n "${GADFLY_NOTICE:-}" ] && NOTICE_BLOCK="> ${GADFLY_NOTICE}"$'\n\n' + COMMENT="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s%s\n\nAutomated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in %s' \ + "$MARKER" "$MODEL" "$MODEL_PROVIDER" "$NOTICE_BLOCK" "$REVIEW" "$DUR")" upsert_comment "$COMMENT" say "done in ${DUR}" fi diff --git a/scripts/system-prompt.txt b/scripts/system-prompt.txt index ff4ebe8..5b1d82e 100644 --- a/scripts/system-prompt.txt +++ b/scripts/system-prompt.txt @@ -16,7 +16,9 @@ state. USE THEM to verify before you report. Do not review the diff in isolation - list_dir([path]) — list a directory. - grep(pattern[, path, max_results]) — RE2 regex search across the repo. - find_files(name[, max_results]) — locate a file by path substring. -- get_diff() — the full unified diff (the task message may truncate it). +- get_diff([path, start_line, limit]) — the unified diff as a paginated, numbered window; + pass `path` to fetch just one changed file's hunks (do this on a big PR instead of pulling + the whole diff at once). Mandatory verification discipline — this is the whole point of giving you tools: - Before claiming a missing/duplicate import, an undefined symbol, a wrong signature, -- 2.52.0 From f5ca813af3cc8e3ec7e41b766b41c9b20a8940d6 Mon Sep 17 00:00:00 2001 From: Steve Dudenhoeffer Date: Tue, 30 Jun 2026 11:32:23 -0400 Subject: [PATCH 2/2] fix: address gadfly swarm review of the executus re-platform Folds in the real findings the dogfood swarm raised on PR #20 (21 graded real, 1 false positive): - tools.go: anchor get_diff `path` matching to whole path tokens (foo.go no longer pulls in barfoo.go; a trailing "/" still scopes a directory); split the diff once + cache it; drop the spurious trailing blank line; fix the "truncated after line N" off-by-one wording. Share one tool-list source (allTools) between toolbox() and the executus registry. - executus.go: drop the dead Config.Defaults caps (per-run RunnableAgent always overrides them); shared envBool/reviewTimeout helpers; resolveContextTokens logs a failed lookup and uses a 5s timeout (was 15s); note the budget guard is pass-granular (the wall-clock backstop covers mid-pass). - main.go/recheck.go: shared envBool; fix package-doc drift (the removed finalization fallback, the paginated get_diff). - entrypoint.sh/run.sh: export GADFLY_MAX_DIFF_CHARS directly (run.sh prefers it); guard the watchdog's delayed SIGKILL on a .disarmed marker so it can't catch the consolidation pass. - tests: anchoring test; corrected obsolete env var + truncation-wording asserts. Co-Authored-By: Claude Opus 4.8 (1M context) --- cmd/gadfly/executus.go | 39 +++++++------ cmd/gadfly/main.go | 34 ++++++++--- cmd/gadfly/recheck.go | 10 +--- cmd/gadfly/recheck_test.go | 2 +- cmd/gadfly/tools.go | 114 +++++++++++++++++++++++++++---------- cmd/gadfly/tools_test.go | 26 ++++++++- entrypoint.sh | 20 +++++-- scripts/run.sh | 2 +- 8 files changed, 176 insertions(+), 71 deletions(-) diff --git a/cmd/gadfly/executus.go b/cmd/gadfly/executus.go index 41078e5..3f0b04e 100644 --- a/cmd/gadfly/executus.go +++ b/cmd/gadfly/executus.go @@ -56,6 +56,11 @@ const ( // compactor defaults; surfaced as gadfly env knobs for tuning. defaultCompactKeepRecent = 8 defaultCompactSummaryWords = 200 + // contextTokenLookupTimeout bounds the one-shot /api/show call that resolves a + // cloud model's context window at executor-build time. Kept short so a slow or + // unreachable endpoint adds at most this to startup before degrading to + // no-compaction (rather than the provider cache's default 15s). + contextTokenLookupTimeout = 5 * time.Second ) // runSeq mints a unique-per-process RunID suffix for each executor run so audit @@ -79,10 +84,7 @@ func (w wrappedTool) BuildLLM(_ exectool.Invocation) llm.Tool { return w.t } // returns it along with the tool names for RunnableAgent.LowLevelTools. func gadflyToolRegistry(fs *repoFS) (exectool.Registry, []string, error) { reg := exectool.NewRegistry() - tools := fs.fsTools() - if fs.worker != nil { - tools = append(tools, fs.delegateTool()) - } + tools := fs.allTools() names := make([]string, 0, len(tools)) for _, t := range tools { if err := reg.Register(wrappedTool{t: t}); err != nil { @@ -100,6 +102,11 @@ func gadflyToolRegistry(fs *repoFS) (exectool.Registry, []string, error) { // the engine calls addUsage after each pass with run.Result.Usage. A nil // *gadflyBudget is never installed — caps of 0 mean "unlimited", so the port is // only wired when at least one cap is set. +// +// The guard is PASS-granular: Check runs before each pass, so it stops the NEXT +// pass once the budget is spent but cannot abort a single runaway pass mid-flight. +// The swarm-wide GADFLY_PR_BUDGET_SECS wall-clock backstop (entrypoint.sh) is what +// bounds a mid-pass runaway. type gadflyBudget struct { mu sync.Mutex maxTokens int64 @@ -254,8 +261,10 @@ func newReviewExecutor(fs *repoFS, mdl, summarizer llm.Model, modelSpec string, Compactor: compactor, ContextTokens: ctxTokens, Defaults: exrun.Defaults{ - MaxIterations: envInt("GADFLY_MAX_STEPS", defaultMaxSteps), - MaxRuntime: time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second, + // MaxIterations/MaxRuntime are intentionally omitted: every pass sets its + // own per-run cap on the RunnableAgent below (the review and recheck caps + // differ), so a Defaults value here would always be overridden — dead. This + // leaves only the cross-pass guards + the compaction ratio. MaxConsecutiveToolErrors: 4, MaxSameToolCallRepeats: 4, CompactionThresholdRatio: compactRatio(), @@ -282,7 +291,7 @@ func (r *reviewExecutor) run(ctx context.Context, system, task string, maxSteps SystemPrompt: system, ModelTier: r.modelSpec, MaxIterations: maxSteps, - MaxRuntime: time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second, + MaxRuntime: reviewTimeout(), LowLevelTools: r.toolNames, Critic: exrun.CriticConfig{Enabled: true}, }, exectool.Invocation{ @@ -316,14 +325,7 @@ func prCallerID() string { // compactionEnabled reports whether context compaction should be wired. On // unless GADFLY_COMPACT is explicitly falsey. -func compactionEnabled() bool { - switch strings.ToLower(strings.TrimSpace(os.Getenv("GADFLY_COMPACT"))) { - case "0", "false", "no", "off": - return false - default: - return true - } -} +func compactionEnabled() bool { return envBool("GADFLY_COMPACT", true) } // compactRatio is the compaction threshold as a fraction of the model context // window (GADFLY_COMPACT_RATIO), clamped to (0,1]; default defaultCompactRatio. @@ -355,10 +357,15 @@ func resolveContextTokens(modelSpec string) int { key = strings.TrimSpace(os.Getenv("OLLAMA_API_KEY")) } cache := model.NewCloudOllamaLimitCache("", key, nil) - ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) + ctx, cancel := context.WithTimeout(context.Background(), contextTokenLookupTimeout) defer cancel() if n, ok := model.MaxContextTokensResolving(ctx, modelSpec, cache); ok { return n } + // Unknown model or a failed lookup (e.g. no key / unreachable endpoint): don't + // guess — compaction is disabled. Log it so a misconfiguration is debuggable + // rather than silently dropping the protection. Set GADFLY_MODEL_CONTEXT_TOKENS + // to force a window for an endpoint executus can't introspect. + fmt.Fprintf(os.Stderr, "gadfly: no context window resolved for %q; compaction disabled (set GADFLY_MODEL_CONTEXT_TOKENS to enable it)\n", modelSpec) return 0 } diff --git a/cmd/gadfly/main.go b/cmd/gadfly/main.go index 85b5b69..18e6dc5 100644 --- a/cmd/gadfly/main.go +++ b/cmd/gadfly/main.go @@ -39,10 +39,9 @@ // GADFLY_TITLE PR title (optional). // GADFLY_BODY PR description (optional). // GADFLY_MAX_STEPS review-pass step cap (optional, default 24). -// GADFLY_WRAPUP_RESERVE steps before the cap at which the agent is told to -// stop investigating and write its answer (optional, -// default 4). Plus a tool-free finalization fallback -// guarantees a step-exhausted pass still emits output. +// GADFLY_WRAPUP_RESERVE steps before the cap at which the wrap-up critic nudges +// the agent to stop investigating and write its answer +// (optional, default 4). // GADFLY_RECHECK set to 0/false to skip the recheck pass (optional, default on). // GADFLY_RECHECK_MAX_STEPS recheck-pass step cap (optional, default 16). // GADFLY_TIMEOUT_SECS overall deadline in seconds, shared by both passes (optional, default 300). @@ -55,8 +54,8 @@ // lanes as GADFLY_PROVIDER_CONCURRENCY (e.g. // "ollama-cloud=3,m1=1"). Wins over GADFLY_LENS_CONCURRENCY // for the model's provider; falls back to it otherwise. -// GADFLY_MAX_DIFF_CHARS diff chars embedded in the prompt (optional, default 60000; -// the full diff is always available via the get_diff tool). +// GADFLY_MAX_DIFF_CHARS diff chars embedded in the review prompt (optional, default 60000; +// the full diff is reachable via the paginated get_diff tool). // // On success it prints the review to stdout and exits 0. On a usage/config or // model error it prints a diagnostic to stderr and exits non-zero; run.sh then @@ -322,8 +321,7 @@ func providerOverride(envName, provider string) (int, bool) { // returned bool is true when the review pass failed (rendered as an inline // notice — advisory; one lens failing never sinks the others or the job). func reviewWithSpecialist(eng reviewEngine, base string, sp Specialist, task, diff string) (string, bool) { - timeout := time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second - ctx, cancel := context.WithTimeout(context.Background(), timeout) + ctx, cancel := context.WithTimeout(context.Background(), reviewTimeout()) defer cancel() draft, err := eng.runPass(ctx, composeSpecialistPrompt(base, sp), task, @@ -389,3 +387,23 @@ func envInt(name string, def int) int { } return n } + +// envBool reads a boolean-ish env var: def when unset, false for an explicit +// falsey value (0/false/no/off), true otherwise. The shared spelling for +// gadfly's "on unless disabled" opt-out flags (GADFLY_RECHECK, GADFLY_COMPACT). +func envBool(name string, def bool) bool { + switch strings.ToLower(strings.TrimSpace(os.Getenv(name))) { + case "": + return def + case "0", "false", "no", "off": + return false + default: + return true + } +} + +// reviewTimeout is the per-specialist-lens deadline (GADFLY_TIMEOUT_SECS), shared +// across a lens's review+recheck passes and applied as each pass's run cap. +func reviewTimeout() time.Duration { + return time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second +} diff --git a/cmd/gadfly/recheck.go b/cmd/gadfly/recheck.go index e00bfa7..0e14539 100644 --- a/cmd/gadfly/recheck.go +++ b/cmd/gadfly/recheck.go @@ -2,7 +2,6 @@ package main import ( "fmt" - "os" "strings" ) @@ -65,14 +64,7 @@ Output rules: // recheckEnabled reports whether the verification pass should run. On unless // GADFLY_RECHECK is explicitly a falsey value. -func recheckEnabled() bool { - switch strings.ToLower(strings.TrimSpace(os.Getenv("GADFLY_RECHECK"))) { - case "0", "false", "no", "off": - return false - default: - return true - } -} +func recheckEnabled() bool { return envBool("GADFLY_RECHECK", true) } // shouldRecheck decides whether to run the verification pass for a given draft. // A clean "no material issues" draft has nothing to verify, so it is skipped diff --git a/cmd/gadfly/recheck_test.go b/cmd/gadfly/recheck_test.go index 75e8217..0ffe267 100644 --- a/cmd/gadfly/recheck_test.go +++ b/cmd/gadfly/recheck_test.go @@ -49,7 +49,7 @@ func TestRecheckEnabled(t *testing.T) { } func TestBuildRecheckTask(t *testing.T) { - t.Setenv("GADFLY_MAX_DIFF_CHARS", "") + t.Setenv("GADFLY_RECHECK_DIFF_CHARS", "") draft := "VERDICT: Blocking issues found\n- foo.go:1 broken" out := buildRecheckTask(draft, "diff --git a/x b/x\n+y\n") if !strings.Contains(out, draft) { diff --git a/cmd/gadfly/tools.go b/cmd/gadfly/tools.go index ece4183..3c1b21b 100644 --- a/cmd/gadfly/tools.go +++ b/cmd/gadfly/tools.go @@ -9,6 +9,7 @@ import ( "regexp" "sort" "strings" + "sync" llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm" ) @@ -42,6 +43,12 @@ type repoFS struct { root string // absolute, symlink-resolved repo root diff string // the full PR unified diff (served by get_diff) worker llm.Model // optional cheap model for delegate_investigation; nil = no delegation + + // diffLines caches the split diff so paging through get_diff doesn't re-split + // the whole (possibly large) diff on every call. The diff is immutable, so the + // cache is computed once and safe to share across concurrent lenses. + diffOnce sync.Once + diffLines []string } // newRepoFS resolves root to an absolute, symlink-free path. @@ -110,15 +117,24 @@ func (r *repoFS) fsTools() []llm.Tool { } } -// toolbox builds the reviewer's toolbox: the read-only repo tools, plus the -// delegate_investigation tool when a worker model is configured. -func (r *repoFS) toolbox() (*llm.Toolbox, error) { - box := llm.NewToolbox("gadfly") +// allTools is the single source of truth for the reviewer's tool set: the +// read-only repo tools, plus delegate_investigation when a worker model is +// configured. Both the executus registry (gadflyToolRegistry, the production +// path) and toolbox() build from this list. +func (r *repoFS) allTools() []llm.Tool { tools := r.fsTools() if r.worker != nil { tools = append(tools, r.delegateTool()) } - for _, t := range tools { + return tools +} + +// toolbox builds a majordomo toolbox from allTools(). The production review path +// now goes through executus's tool.Registry (see executus.go); this remains for +// the toolbox-level tests (the `call` helper). +func (r *repoFS) toolbox() (*llm.Toolbox, error) { + box := llm.NewToolbox("gadfly") + for _, t := range r.allTools() { if err := box.Add(t); err != nil { return nil, fmt.Errorf("add tool %q: %w", t.Name, err) } @@ -409,30 +425,49 @@ func (r *repoFS) getDiffTool() llm.Tool { "get_diff", "Return the unified diff under review as a numbered, PAGINATED window (like read_file) — not the whole diff at once, so a huge PR can't blow the context window. Pass `path` to fetch just one changed file's hunks, or `start_line`/`limit` to page through. A truncated copy of the diff is also embedded in the task message.", func(_ context.Context, args getDiffArgs) (any, error) { - diff := r.diff scope := "the diff" + var lines []string if p := strings.TrimSpace(args.Path); p != "" { - diff = diffForPath(r.diff, p) - if strings.TrimSpace(diff) == "" { + lines = diffLinesForPath(r.diff, p) + if len(lines) == 0 { return fmt.Sprintf("(no diff hunks for path %q; check it against the changed-files list — the path must match a file the diff touches)", p), nil } scope = "the diff for " + p + } else { + lines = r.diffAllLines() + if len(lines) == 0 { + return "(empty diff)", nil + } } - if strings.TrimSpace(diff) == "" { - return "(empty diff)", nil - } - return windowDiff(diff, scope, args.StartLine, args.Limit), nil + return windowDiff(lines, scope, args.StartLine, args.Limit), nil }, ) } -// windowDiff returns a numbered, paginated slice of a unified diff, mirroring -// read_file's caps (maxGetDiffLines / maxGetDiffBytes / maxLineLen) so a single -// get_diff call can never dump a multi-hundred-KB diff into the transcript — the -// amplifier behind the large-PR token burn. The full diff stays reachable by -// paging with start_line, or scoped per file via the path arg. -func windowDiff(diff, scope string, start, limit int) string { +// diffAllLines returns the whole diff split into lines, cached so paging never +// re-splits the (possibly large) diff. +func (r *repoFS) diffAllLines() []string { + r.diffOnce.Do(func() { r.diffLines = splitDiffLines(r.diff) }) + return r.diffLines +} + +// splitDiffLines splits a unified diff into lines, dropping the single trailing +// empty element a trailing newline produces — otherwise windowDiff would emit a +// blank final line and over-count the total by one. +func splitDiffLines(diff string) []string { lines := strings.Split(diff, "\n") + if n := len(lines); n > 0 && lines[n-1] == "" { + lines = lines[:n-1] + } + return lines +} + +// windowDiff returns a numbered, paginated slice of pre-split diff lines, +// mirroring read_file's caps (maxGetDiffLines / maxGetDiffBytes / maxLineLen) so +// a single get_diff call can never dump a multi-hundred-KB diff into the +// transcript — the amplifier behind the large-PR token burn. The full diff stays +// reachable by paging with start_line, or scoped per file via the path arg. +func windowDiff(lines []string, scope string, start, limit int) string { total := len(lines) if start < 1 { start = 1 @@ -458,26 +493,45 @@ func windowDiff(diff, scope string, start, limit int) string { emitted++ } if i < total { - fmt.Fprintf(&b, "... (%s truncated at line %d of %d; call get_diff again with start_line=%d for more, or pass a `path` to scope to one file)\n", scope, i, total, i+1) + // i (0-based) is the first line NOT emitted; line i was the last shown. + fmt.Fprintf(&b, "... (%s truncated after line %d of %d; call get_diff again with start_line=%d for the rest, or pass a `path` to scope to one file)\n", scope, i, total, i+1) } return b.String() } -// diffForPath extracts the unified-diff section for one changed file: from the -// `diff --git` header that names the path through to the next `diff --git` -// header (or end). Matching is a substring on the header line, so either side -// (a/ or b/) hits, including a rename's two paths. -func diffForPath(diff, path string) string { - var b strings.Builder +// diffLinesForPath returns the unified-diff lines for one changed file: from the +// `diff --git` header that names path through to the next header (or end). The +// header names two path tokens (a/ b/); a match is on a WHOLE token, +// so path "foo.go" does not pull in "barfoo.go" — and a trailing "/" scopes to a +// directory (e.g. "pkg/foo/" matches pkg/foo/bar.go). +func diffLinesForPath(diff, path string) []string { + want := strings.TrimPrefix(strings.TrimPrefix(strings.TrimSpace(path), "a/"), "b/") + var out []string inSection := false - for _, ln := range strings.Split(diff, "\n") { + for _, ln := range splitDiffLines(diff) { if strings.HasPrefix(ln, "diff --git ") { - inSection = strings.Contains(ln, path) + inSection = diffHeaderNames(ln, want) } if inSection { - b.WriteString(ln) - b.WriteByte('\n') + out = append(out, ln) } } - return b.String() + return out +} + +// diffHeaderNames reports whether a `diff --git a/X b/Y` header names want as one +// of its (a/b-stripped) path tokens — exact whole-token match, or a directory +// prefix when want ends with "/". +func diffHeaderNames(header, want string) bool { + fields := strings.Fields(header) + if len(fields) < 3 { + return false + } + for _, f := range fields[2:] { + p := strings.TrimPrefix(strings.TrimPrefix(f, "a/"), "b/") + if p == want || (strings.HasSuffix(want, "/") && strings.HasPrefix(p, want)) { + return true + } + } + return false } diff --git a/cmd/gadfly/tools_test.go b/cmd/gadfly/tools_test.go index 10dc644..4704162 100644 --- a/cmd/gadfly/tools_test.go +++ b/cmd/gadfly/tools_test.go @@ -247,7 +247,7 @@ func TestGetDiffTool_Paginates(t *testing.T) { if err != nil { t.Fatalf("get_diff: %v", err) } - if !strings.Contains(out, "truncated at line") { + if !strings.Contains(out, "truncated after line") { t.Error("a diff longer than the per-call cap should be truncated with a paging hint") } if strings.Contains(out, "801\t") { @@ -263,6 +263,30 @@ func TestGetDiffTool_Paginates(t *testing.T) { } } +// TestDiffLinesForPath_Anchored: get_diff path= matches on a WHOLE path token, +// so "foo.go" never pulls in "barfoo.go" (the unanchored-substring weakness the +// swarm flagged), while a trailing "/" still scopes to a directory. +func TestDiffLinesForPath_Anchored(t *testing.T) { + diff := "diff --git a/foo.go b/foo.go\n+foo change\n" + + "diff --git a/barfoo.go b/barfoo.go\n+barfoo change\n" + + "diff --git a/pkg/x.go b/pkg/x.go\n+x change\n" + + joined := strings.Join(diffLinesForPath(diff, "foo.go"), "\n") + if !strings.Contains(joined, "foo change") { + t.Errorf("foo.go should match its own hunk:\n%s", joined) + } + if strings.Contains(joined, "barfoo change") { + t.Errorf("foo.go must NOT match barfoo.go (unanchored substring regression):\n%s", joined) + } + + if !strings.Contains(strings.Join(diffLinesForPath(diff, "pkg/"), "\n"), "x change") { + t.Error("a trailing-slash path should scope to the directory (pkg/ -> pkg/x.go)") + } + if len(diffLinesForPath(diff, "nope.go")) != 0 { + t.Error("an unknown path should yield no lines") + } +} + func TestNewRepoFS_BadRoot(t *testing.T) { // A file (not a directory) is rejected. f := filepath.Join(t.TempDir(), "afile") diff --git a/entrypoint.sh b/entrypoint.sh index 35c2453..7927022 100644 --- a/entrypoint.sh +++ b/entrypoint.sh @@ -205,7 +205,9 @@ if [ "$HUGE_DIFF_BYTES" -gt 0 ] 2>/dev/null; then export GADFLY_MAX_STEPS="${GADFLY_HUGE_DIFF_MAX_STEPS:-12}" export GADFLY_RECHECK_MAX_STEPS="${GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS:-8}" export GADFLY_RECHECK="${GADFLY_HUGE_DIFF_RECHECK:-0}" # skip recheck on huge PRs - export MAX_DIFF_CHARS="${GADFLY_HUGE_DIFF_MAX_DIFF_CHARS:-20000}" # run.sh -> GADFLY_MAX_DIFF_CHARS + # The Go-visible name directly (run.sh prefers GADFLY_MAX_DIFF_CHARS over its + # own MAX_DIFF_CHARS), so the cap is honored without relying on run.sh's alias. + export GADFLY_MAX_DIFF_CHARS="${GADFLY_HUGE_DIFF_MAX_DIFF_CHARS:-20000}" # Surfaced on each posted comment so the shallower review is self-explaining. export GADFLY_NOTICE="⚠️ Large PR (${PR_DIFF_BYTES} bytes): Gadfly downshifted to a focused, single-model review to stay within budget — coverage is intentionally shallower. Consider splitting the PR for a deeper review." fi @@ -333,7 +335,7 @@ fi # findings were gathered are still posted and the job never fails (advisory). # GADFLY_PR_BUDGET_SECS=0 (default) disables it. KILLER_PID="" -rm -f "${WORKDIR}/.budget_killed" 2>/dev/null || true +rm -f "${WORKDIR}/.budget_killed" "${WORKDIR}/.disarmed" 2>/dev/null || true if [ "${GADFLY_PR_BUDGET_SECS:-0}" -gt 0 ] 2>/dev/null; then ( sleep "${GADFLY_PR_BUDGET_SECS}" @@ -342,7 +344,10 @@ if [ "${GADFLY_PR_BUDGET_SECS:-0}" -gt 0 ] 2>/dev/null; then pkill -TERM -f '/usr/local/bin/gadfly' 2>/dev/null || true pkill -TERM -f "${SCRIPTS_DIR}/run.sh" 2>/dev/null || true sleep 5 - pkill -KILL -f '/usr/local/bin/gadfly' 2>/dev/null || true + # Guard the delayed SIGKILL on the disarm marker: once the lanes finished and + # the watchdog was disarmed, the consolidation gadfly pass runs next, and a + # name-based KILL here must NOT catch it. + [ -f "${WORKDIR}/.disarmed" ] || pkill -KILL -f '/usr/local/bin/gadfly' 2>/dev/null || true ) & KILLER_PID=$! log "PR budget watchdog armed (${GADFLY_PR_BUDGET_SECS}s, pid ${KILLER_PID})" @@ -360,8 +365,13 @@ done [ "${#LANE_PIDS[@]}" -gt 0 ] && wait "${LANE_PIDS[@]}" # Reviews finished (or the watchdog killed them): disarm the watchdog so its -# delayed SIGKILL can't catch the consolidation pass that runs next. -if [ -n "$KILLER_PID" ]; then kill "$KILLER_PID" 2>/dev/null || true; fi +# delayed SIGKILL can't catch the consolidation pass that runs next. Drop the +# disarm marker FIRST so even a racing watchdog that already reached its KILL line +# skips it (the kill below also tears the watchdog subshell down during its sleep). +if [ -n "$KILLER_PID" ]; then + : > "${WORKDIR}/.disarmed" + kill "$KILLER_PID" 2>/dev/null || true +fi # If the backstop fired, note it on the consensus comment (per-model comments # were already posted during the run; a killed model surfaces as a failed lane). diff --git a/scripts/run.sh b/scripts/run.sh index 6e0a686..1731e6a 100644 --- a/scripts/run.sh +++ b/scripts/run.sh @@ -177,7 +177,7 @@ case "$PROVIDER" in GADFLY_SYSTEM_FILE="${SCRIPT_DIR}/system-prompt.txt" \ GADFLY_TITLE="$TITLE" \ GADFLY_BODY="$BODY" \ - GADFLY_MAX_DIFF_CHARS="$MAX_DIFF_CHARS" \ + GADFLY_MAX_DIFF_CHARS="${GADFLY_MAX_DIFF_CHARS:-$MAX_DIFF_CHARS}" \ GADFLY_STATUS_FILE="${GADFLY_STATUS_FILE:-}" \ GADFLY_FINDINGS_OUT="${GADFLY_FINDINGS_OUT:-}" \ "$BIN" 2>"$ERR_FILE" -- 2.52.0