feat: re-platform agentic review onto executus + large-PR cost controls #20

Merged
steve merged 2 commits from feat/executus-replatform into main 2026-06-30 15:41:03 +00:00
Owner

Why

A very large PR (mort #1367: ~1.01 MB / ~253K-token diff) burned a full 5-hour Ollama Cloud block in ~15 minutes. Root cause: the agent loop re-sends the whole transcript every step, and the diff entered the prompt three ways (review embed, the recheck embed, and get_diff dumping it untruncated), so it was re-transmitted across steps × lenses × passes × models (~30 cloud passes). Small PRs were never affected.

What

Makes gadfly a real consumer of executus (its canonical light-tier host) and fixes the burn in size-gated layers — small PRs are untouched.

executus re-platform

  • In-process review path now runs on executus/run.Executor: context compaction (executus/compact, threshold from the model's real context window via executus/model), run-bounding, a per-PR budget gate (Ports.Budget), and the wrap-up nudge re-expressed as a run.Critic.
  • Lens fan-out now uses executus/fanout.
  • gadfly keeps its own model.goGADFLY_ENDPOINT_<NAME> http aliases + the claude-code engine are unaffected. No majordomo bump; the binary stays static (executus core is majordomo+stdlib only).

Burn fix (three layers + the per-PR budget)

  1. Paginated get_diff (per-file path + start_line/limit) instead of dumping the whole diff; recheck diff embed trimmed 60k → 20k. Compaction at a deliberately low GADFLY_COMPACT_RATIO=0.45 (the embedded path rarely reaches 0.7×context).
  2. Downshift (entrypoint.sh) above GADFLY_HUGE_DIFF_BYTES → one cheap model, focused lenses, fewer steps, no recheck — with a self-explaining notice on the comment.
  3. Hard backstop GADFLY_PR_BUDGET_SECS — swarm-wide wall-clock kill (adds procps). All advisory; CI never fails.

Notes / trade-off

  • run.Result exposes no transcript, so the old transcript-based forced-finalization fallback is dropped; the wrap-up critic nudge is the remaining "always emit something" mechanism. (Could be restored via a small executus enhancement.)
  • README + CLAUDE.md + tests updated in the same change. gofmt/go vet/go test -race/static CGO_ENABLED=0 build all green (85 tests).

Advisory-only invariant re-confirmed: budget rejections and the backstop kill post advisory notices; the job never fails.

🤖 Generated with Claude Code

## Why A very large PR ([mort #1367](https://gitea.stevedudenhoeffer.com/steve/mort/pulls/1367): ~1.01 MB / ~253K-token diff) **burned a full 5-hour Ollama Cloud block in ~15 minutes**. Root cause: the agent loop re-sends the whole transcript every step, and the diff entered the prompt three ways (review embed, the recheck embed, and `get_diff` dumping it **untruncated**), so it was re-transmitted across `steps × lenses × passes × models` (~30 cloud passes). Small PRs were never affected. ## What Makes gadfly a real consumer of **[executus](https://gitea.stevedudenhoeffer.com/steve/executus)** (its canonical light-tier host) and fixes the burn in size-gated layers — **small PRs are untouched**. **executus re-platform** - In-process review path now runs on **`executus/run.Executor`**: context **compaction** (`executus/compact`, threshold from the model's real context window via `executus/model`), run-bounding, a per-PR **budget gate** (`Ports.Budget`), and the wrap-up nudge re-expressed as a **`run.Critic`**. - Lens fan-out now uses **`executus/fanout`**. - gadfly **keeps its own `model.go`** → `GADFLY_ENDPOINT_<NAME>` http aliases + the claude-code engine are unaffected. **No majordomo bump**; the binary stays **static** (executus core is majordomo+stdlib only). **Burn fix (three layers + the per-PR budget)** 1. Paginated `get_diff` (per-file `path` + `start_line`/`limit`) instead of dumping the whole diff; recheck diff embed trimmed 60k → 20k. Compaction at a deliberately low `GADFLY_COMPACT_RATIO=0.45` (the embedded path rarely reaches 0.7×context). 2. **Downshift** (`entrypoint.sh`) above `GADFLY_HUGE_DIFF_BYTES` → one cheap model, focused lenses, fewer steps, no recheck — with a self-explaining notice on the comment. 3. **Hard backstop** `GADFLY_PR_BUDGET_SECS` — swarm-wide wall-clock kill (adds `procps`). All advisory; **CI never fails**. ## Notes / trade-off - `run.Result` exposes no transcript, so the old transcript-based forced-finalization fallback is **dropped**; the wrap-up critic nudge is the remaining "always emit something" mechanism. (Could be restored via a small executus enhancement.) - README + CLAUDE.md + tests updated in the same change. `gofmt`/`go vet`/`go test -race`/static `CGO_ENABLED=0` build all green (85 tests). Advisory-only invariant re-confirmed: budget rejections and the backstop kill post advisory notices; the job never fails. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
steve added 1 commit 2026-06-30 14:44:31 +00:00
feat: re-platform agentic review onto executus + large-PR cost controls
Build & push image / build-and-push (pull_request) Successful in 11s
Adversarial Review (Gadfly) / review (pull_request) Successful in 25m56s
b860119332
Fixes the large-PR token burn: a ~250K-token diff was re-sent every agent
step across models × lenses × passes, draining a metered usage block in
minutes. Small PRs are untouched (every mitigation is size-gated / no-op
under threshold).

- Re-platform the in-process review path onto executus run.Executor: context
  compaction (executus/compact, threshold from the model's real context window
  via executus/model), run-bounding, a per-PR budget gate (Ports.Budget), and
  the wrap-up nudge re-expressed as a run.Critic. Lens fan-out now uses
  executus/fanout. gadfly keeps its own model.go, so GADFLY_ENDPOINT_<NAME>
  aliases and the claude-code engine are unaffected. No majordomo bump; the
  binary stays static (executus core is majordomo+stdlib only).
- Paginate get_diff (per-file `path` + start_line/limit) instead of dumping the
  whole diff; trim the recheck diff embed (60k -> 20k chars).
- entrypoint.sh: downshift the fleet above GADFLY_HUGE_DIFF_BYTES (one cheap
  model, fewer lenses/steps, no recheck) + a swarm-wide GADFLY_PR_BUDGET_SECS
  wall-clock backstop (adds procps for pkill). All advisory; CI never fails.
- README + CLAUDE.md + tests updated.

Note: run.Result exposes no transcript, so the old transcript-based forced-
finalization fallback is dropped; the wrap-up critic nudge is the remaining
"always emit something" mechanism.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

🪰 Gadfly — live review status

6/6 reviewers finished · updated 2026-06-30 15:10:27Z

claude-code/opus · claude-code — done

  • security — No material issues found
  • correctness — Minor issues
  • maintainability — Minor issues
  • performance — Minor issues
  • error-handling — No material issues found

claude-code/opus:max · claude-code — done

  • security — No material issues found
  • correctness — Minor issues
  • maintainability — Minor issues
  • performance — No material issues found
  • error-handling — Minor issues

deepseek-v4-pro:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — Minor issues
  • maintainability — Minor issues
  • performance — Minor issues
  • error-handling — Blocking issues found

glm-5.2:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — Minor issues
  • maintainability — Minor issues
  • performance — No material issues found
  • error-handling — Minor issues

kimi-k2.6:cloud · ollama-cloud — done

  • security — No material issues found
  • ⚠️ correctness — could not complete
  • maintainability — Minor issues
  • performance — No material issues found
  • error-handling — Minor issues

qwen3.5:397b-cloud · ollama-cloud — done

  • ⚠️ security — could not complete
  • ⚠️ correctness — could not complete
  • maintainability — No material issues found
  • performance — No material issues found
  • ⚠️ error-handling — could not complete

Live status board. Findings are posted in each model's own comment. Advisory only — does not block merge.

<!-- gadfly-status-board --> ## 🪰 Gadfly — live review status 6/6 reviewers finished · updated 2026-06-30 15:10:27Z #### `claude-code/opus` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — Minor issues - ✅ **error-handling** — No material issues found #### `claude-code/opus:max` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `deepseek-v4-pro:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — Minor issues - ✅ **error-handling** — Blocking issues found #### `glm-5.2:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `kimi-k2.6:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ⚠️ **correctness** — could not complete - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `qwen3.5:397b-cloud` · ollama-cloud — ✅ done - ⚠️ **security** — could not complete - ⚠️ **correctness** — could not complete - ✅ **maintainability** — No material issues found - ✅ **performance** — No material issues found - ⚠️ **error-handling** — could not complete <sub>Live status board. Findings are posted in each model's own comment. Advisory only — does not block merge.</sub>
gitea-actions bot reviewed 2026-06-30 15:10:28 +00:00
gitea-actions bot left a comment

🪰 Gadfly consensus review — 12 inline findings on changed lines. See the consensus comment for the full ranked summary.

Advisory only — does not block merge.

<!-- gadfly-inline-review --> 🪰 **Gadfly consensus review** — 12 inline findings on changed lines. See the consensus comment for the full ranked summary. <sub>Advisory only — does not block merge.</sub>
@@ -0,0 +16,4 @@
// steer that tells a step-hungry model to stop investigating and write its
// answer is now the critic seam, not a bespoke RunOption.
//
// Everything degrades to today's behavior when unconfigured: nil summarizer or a

file docstring claims 'degrades to today's behavior' but the old forced-finalization fallback was removed by this PR, so empty-output exhaustion no longer matches prior behavior

maintainability · flagged by 1 model

  • cmd/gadfly/executus.go:19-21 — docstring claims "degrades to today's behavior," but the old forced-finalization fallback was removed by this PR. The file-level comment says "Everything degrades to today's behavior when unconfigured," yet the prior transcript-based forceFinalAnswer/finalizeInstruction path was deleted (confirmed in the diff and the renamed TestReviewExecutor_ExhaustionWithoutAnswerIsError test comment). On empty-output exhaustion the new code returns an error where…

🪰 Gadfly · advisory

⚪ **file docstring claims 'degrades to today's behavior' but the old forced-finalization fallback was removed by this PR, so empty-output exhaustion no longer matches prior behavior** _maintainability · flagged by 1 model_ - **`cmd/gadfly/executus.go:19-21` — docstring claims "degrades to today's behavior," but the old forced-finalization fallback was removed by this PR.** The file-level comment says "Everything degrades to today's behavior when unconfigured," yet the prior transcript-based `forceFinalAnswer`/`finalizeInstruction` path was deleted (confirmed in the diff and the renamed `TestReviewExecutor_ExhaustionWithoutAnswerIsError` test comment). On empty-output exhaustion the new code returns an error where… <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +79,4 @@
// returns it along with the tool names for RunnableAgent.LowLevelTools.
func gadflyToolRegistry(fs *repoFS) (exectool.Registry, []string, error) {
reg := exectool.NewRegistry()
tools := fs.fsTools()

🟡 gadflyToolRegistry re-derives fsTools()+delegateTool, duplicating toolbox()'s membership logic (two sources of truth)

maintainability · flagged by 1 model

  • cmd/gadfly/executus.go:82-84 vs cmd/gadfly/tools.go:117-119 — duplicated tool-list construction. gadflyToolRegistry re-derives fs.fsTools() + if fs.worker != nil { append(fs.delegateTool()) }, byte-for-byte the membership logic of toolbox(). Verified by reading both functions. Adding a new built-in tool to fsTools() requires both paths to stay in sync by hand.

🪰 Gadfly · advisory

🟡 **gadflyToolRegistry re-derives fsTools()+delegateTool, duplicating toolbox()'s membership logic (two sources of truth)** _maintainability · flagged by 1 model_ - **`cmd/gadfly/executus.go:82-84` vs `cmd/gadfly/tools.go:117-119` — duplicated tool-list construction.** `gadflyToolRegistry` re-derives `fs.fsTools()` + `if fs.worker != nil { append(fs.delegateTool()) }`, byte-for-byte the membership logic of `toolbox()`. Verified by reading both functions. Adding a new built-in tool to `fsTools()` requires both paths to stay in sync by hand. <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +255,4 @@
ContextTokens: ctxTokens,
Defaults: exrun.Defaults{
MaxIterations: envInt("GADFLY_MAX_STEPS", defaultMaxSteps),
MaxRuntime: time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second,

🟡 GADFLY_TIMEOUT_SECS/MAX_STEPS configured in multiple places; Config.Defaults.MaxRuntime/MaxIterations duplicated verbatim with per-run RunnableAgent values

maintainability · flagged by 1 model

  • cmd/gadfly/executus.go:258 & :285 (with cmd/gadfly/main.go:325) — GADFLY_TIMEOUT_SECS / GADFLY_MAX_STEPS are configured in three/two places, and the Config.Defaults copies are redundant with the per-run values. The expression time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second is written verbatim twice in executus.go (once in Config.Defaults.MaxRuntime, once in the per-run RunnableAgent.MaxRuntime), and the same timeout is also applied as th…

🪰 Gadfly · advisory

🟡 **GADFLY_TIMEOUT_SECS/MAX_STEPS configured in multiple places; Config.Defaults.MaxRuntime/MaxIterations duplicated verbatim with per-run RunnableAgent values** _maintainability · flagged by 1 model_ - `cmd/gadfly/executus.go:258` & `:285` (with `cmd/gadfly/main.go:325`) — **`GADFLY_TIMEOUT_SECS` / `GADFLY_MAX_STEPS` are configured in three/two places, and the `Config.Defaults` copies are redundant with the per-run values.** The expression `time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second` is written verbatim twice in `executus.go` (once in `Config.Defaults.MaxRuntime`, once in the per-run `RunnableAgent.MaxRuntime`), and the same timeout is *also* applied as th… <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +280,4 @@
res := r.ex.Run(ctx, exrun.RunnableAgent{
Name: "gadfly-review",
SystemPrompt: system,
ModelTier: r.modelSpec,

🟡 Per-PR token budget only updated after each pass (addUsage post-run), so Budget.Check can't stop a single runaway pass mid-flight — token guard is pass-granular, weaker than its name implies

error-handling, maintainability · flagged by 3 models

  • cmd/gadfly/executus.go:283 (and :123–155, the gadflyBudget token gate) — the per-PR token budget can't bite mid-pass, only between passes. Budget.Check is the executor's pre-call gate, but it reads b.tokens, which is only updated by reviewExecutor.run calling r.budget.addUsage(res.Usage) after ex.Run returns (the run.Ports.Budget interface only carries Commit(seconds), with no per-call token feed — tokens are fed "out-of-band" once per finished pass). Consequence: t…

🪰 Gadfly · advisory

🟡 **Per-PR token budget only updated after each pass (addUsage post-run), so Budget.Check can't stop a single runaway pass mid-flight — token guard is pass-granular, weaker than its name implies** _error-handling, maintainability · flagged by 3 models_ - **`cmd/gadfly/executus.go:283` (and `:123–155`, the `gadflyBudget` token gate) — the per-PR *token* budget can't bite mid-pass, only between passes.** `Budget.Check` is the executor's pre-call gate, but it reads `b.tokens`, which is only updated by `reviewExecutor.run` calling `r.budget.addUsage(res.Usage)` *after* `ex.Run` returns (the `run.Ports.Budget` interface only carries `Commit(seconds)`, with no per-call token feed — tokens are fed "out-of-band" once per finished pass). Consequence: t… <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +316,4 @@
// compactionEnabled reports whether context compaction should be wired. On
// unless GADFLY_COMPACT is explicitly falsey.
func compactionEnabled() bool {

🟡 compactionEnabled() is a verbatim copy of recheckEnabled(); extract a shared env-bool helper

maintainability · flagged by 1 model

  • cmd/gadfly/executus.go:319-326compactionEnabled() is a verbatim copy of recheckEnabled(). It is byte-for-byte the same "on unless explicitly falsey" pattern as recheck.go:68-75 (switch strings.ToLower(strings.TrimSpace(os.Getenv(...))) over "0","false","no","off"default: true). This is exactly the copy-paste the lens targets: two identical env-bool parsers that will drift. Fix: extract one helper (e.g. envBoolDefaultTrue(name string) bool) and have both call it.

🪰 Gadfly · advisory

🟡 **compactionEnabled() is a verbatim copy of recheckEnabled(); extract a shared env-bool helper** _maintainability · flagged by 1 model_ - **`cmd/gadfly/executus.go:319-326` — `compactionEnabled()` is a verbatim copy of `recheckEnabled()`.** It is byte-for-byte the same "on unless explicitly falsey" pattern as `recheck.go:68-75` (`switch strings.ToLower(strings.TrimSpace(os.Getenv(...)))` over `"0","false","no","off"` → `default: true`). This is exactly the copy-paste the lens targets: two identical env-bool parsers that will drift. Fix: extract one helper (e.g. `envBoolDefaultTrue(name string) bool`) and have both call it. <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +354,4 @@
if key == "" {
key = strings.TrimSpace(os.Getenv("OLLAMA_API_KEY"))
}
cache := model.NewCloudOllamaLimitCache("", key, nil)

🟠 resolveContextTokens silently disables compaction on network errors with no log

error-handling · flagged by 2 models

🪰 Gadfly · advisory

🟠 **resolveContextTokens silently disables compaction on network errors with no log** _error-handling · flagged by 2 models_ <sub>🪰 Gadfly · advisory</sub>
@@ -411,0 +432,4 @@
// amplifier behind the large-PR token burn. The full diff stays reachable by
// paging with start_line, or scoped per file via the path arg.
func windowDiff(diff, scope string, start, limit int) string {
lines := strings.Split(diff, "\n")

🟡 windowDiff re-splits the whole diff per get_diff call; paging a large diff is O(n^2/limit) on the large-PR hot path

correctness, performance · flagged by 2 models

  • cmd/gadfly/tools.go:435 (windowDiff) re-splits the entire diff on every get_diff call, even to return one 800-line window. strings.Split(diff, "\n") materializes a []string over the whole diff regardless of start/limit. On exactly the large PRs this feature targets, the agent is steered to call get_diff repeatedly — paging via start_line and re-fetching per path. Each call is O(total-diff-lines) work plus a fresh line-slice allocation, so **paging a large diff to its…

🪰 Gadfly · advisory

🟡 **windowDiff re-splits the whole diff per get_diff call; paging a large diff is O(n^2/limit) on the large-PR hot path** _correctness, performance · flagged by 2 models_ - **`cmd/gadfly/tools.go:435` (`windowDiff`) re-splits the entire diff on every `get_diff` call, even to return one 800-line window.** `strings.Split(diff, "\n")` materializes a `[]string` over the *whole* diff regardless of `start`/`limit`. On exactly the large PRs this feature targets, the agent is steered to call `get_diff` repeatedly — paging via `start_line` and re-fetching per `path`. Each call is O(total-diff-lines) work plus a fresh line-slice allocation, so **paging a large diff to its… <sub>🪰 Gadfly · advisory</sub>
@@ -411,0 +458,4 @@
emitted++
}
if i < total {
fmt.Fprintf(&b, "... (%s truncated at line %d of %d; call get_diff again with start_line=%d for more, or pass a `path` to scope to one file)\n", scope, i, total, i+1)

🟡 windowDiff truncation message off-by-one: says 'truncated at line N' when N was the last emitted line

error-handling · flagged by 1 model

  • windowDiff truncation message off-by-one (cmd/gadfly/tools.go:461): The message says truncated at line %d using i (0-based index of the first unemitted line), but the last emitted line was i (1-based). For example, after emitting lines 1–800, i=800 (0-based), and the message reads "truncated at line 800 of 850" — but line 800 was the last line shown; truncation actually happened at line 801. The start_line hint correctly uses i+1 (801), so the paging instruction is righ…

🪰 Gadfly · advisory

🟡 **windowDiff truncation message off-by-one: says 'truncated at line N' when N was the last emitted line** _error-handling · flagged by 1 model_ - **`windowDiff` truncation message off-by-one** (`cmd/gadfly/tools.go:461`): The message says `truncated at line %d` using `i` (0-based index of the first *unemitted* line), but the last *emitted* line was `i` (1-based). For example, after emitting lines 1–800, `i=800` (0-based), and the message reads "truncated at line 800 of 850" — but line 800 was the last line shown; truncation actually happened at line 801. The `start_line` hint correctly uses `i+1` (801), so the paging instruction is righ… <sub>🪰 Gadfly · advisory</sub>
@@ -411,0 +467,4 @@
// `diff --git` header that names the path through to the next `diff --git`
// header (or end). Matching is a substring on the header line, so either side
// (a/<path> or b/<path>) hits, including a rename's two paths.
func diffForPath(diff, path string) string {

🟠 diffForPath re-splits the full diff on every scoped get_diff call, allocating O(diff_size) per invocation

correctness, error-handling, performance · flagged by 5 models

The finding is confirmed. The code at cmd/gadfly/tools.go:470-483 matches the draft's description exactly: inSection = strings.Contains(ln, path) is an unanchored substring match against the full diff --git a/… b/… header line, so a path that is a substring of another changed file's path (e.g. foo.go vs vendor/foo.go) pulls in both sections. This survives verification.

🪰 Gadfly · advisory

🟠 **diffForPath re-splits the full diff on every scoped get_diff call, allocating O(diff_size) per invocation** _correctness, error-handling, performance · flagged by 5 models_ The finding is confirmed. The code at `cmd/gadfly/tools.go:470-483` matches the draft's description exactly: `inSection = strings.Contains(ln, path)` is an unanchored substring match against the full `diff --git a/… b/…` header line, so a `path` that is a substring of another changed file's path (e.g. `foo.go` vs `vendor/foo.go`) pulls in both sections. This survives verification. <sub>🪰 Gadfly · advisory</sub>
entrypoint.sh Outdated
@@ -186,0 +205,4 @@
export GADFLY_MAX_STEPS="${GADFLY_HUGE_DIFF_MAX_STEPS:-12}"
export GADFLY_RECHECK_MAX_STEPS="${GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS:-8}"
export GADFLY_RECHECK="${GADFLY_HUGE_DIFF_RECHECK:-0}" # skip recheck on huge PRs
export MAX_DIFF_CHARS="${GADFLY_HUGE_DIFF_MAX_DIFF_CHARS:-20000}" # run.sh -> GADFLY_MAX_DIFF_CHARS

downshift exports MAX_DIFF_CHARS (not GADFLY_MAX_DIFF_CHARS); the aliasing lives only in run.sh and is invisible from the Go side

maintainability · flagged by 1 model

  • entrypoint.sh:208MAX_DIFF_CHARS env name is a shell-only alias invisible from the Go side. The downshift sets export MAX_DIFF_CHARS=... with a comment "run.sh -> GADFLY_MAX_DIFF_CHARS." Confirmed: scripts/run.sh:180 translates MAX_DIFF_CHARS to GADFLY_MAX_DIFF_CHARS when invoking the binary, and the Go tree reads only GADFLY_MAX_DIFF_CHARS (main.go:361) / GADFLY_RECHECK_DIFF_CHARS (recheck.go). MAX_DIFF_CHARS appears nowhere in Go. The indirection lives entirely in she…

🪰 Gadfly · advisory

⚪ **downshift exports MAX_DIFF_CHARS (not GADFLY_MAX_DIFF_CHARS); the aliasing lives only in run.sh and is invisible from the Go side** _maintainability · flagged by 1 model_ - **`entrypoint.sh:208` — `MAX_DIFF_CHARS` env name is a shell-only alias invisible from the Go side.** The downshift sets `export MAX_DIFF_CHARS=...` with a comment "run.sh -> GADFLY_MAX_DIFF_CHARS." Confirmed: `scripts/run.sh:180` translates `MAX_DIFF_CHARS` to `GADFLY_MAX_DIFF_CHARS` when invoking the binary, and the Go tree reads only `GADFLY_MAX_DIFF_CHARS` (main.go:361) / `GADFLY_RECHECK_DIFF_CHARS` (recheck.go). `MAX_DIFF_CHARS` appears nowhere in Go. The indirection lives entirely in she… <sub>🪰 Gadfly · advisory</sub>
@@ -300,0 +342,4 @@
pkill -TERM -f '/usr/local/bin/gadfly' 2>/dev/null || true
pkill -TERM -f "${SCRIPTS_DIR}/run.sh" 2>/dev/null || true
sleep 5
pkill -KILL -f '/usr/local/bin/gadfly' 2>/dev/null || true

🟠 Backstop KILL pkill can race the disarm and kill the consolidation gadfly pass

error-handling · flagged by 1 model

🪰 Gadfly · advisory

🟠 **Backstop KILL pkill can race the disarm and kill the consolidation gadfly pass** _error-handling · flagged by 1 model_ <sub>🪰 Gadfly · advisory</sub>
entrypoint.sh Outdated
@@ -310,1 +361,4 @@
# Reviews finished (or the watchdog killed them): disarm the watchdog so its
# delayed SIGKILL can't catch the consolidation pass that runs next.
if [ -n "$KILLER_PID" ]; then kill "$KILLER_PID" 2>/dev/null || true; fi

🟠 Budget watchdog kill may not prevent delayed SIGKILL from catching consolidation pass

error-handling · flagged by 1 model

  • Budget watchdog race: kill $KILLER_PID may not prevent the delayed SIGKILL (entrypoint.sh:364): The watchdog subshell runs sleep $BUDGET; pkill -TERM ...; sleep 5; pkill -KILL .... If the budget expires and the watchdog enters its sleep 5 between SIGTERM and SIGKILL, and the reviews then finish naturally during that 5-second window, the main script's kill "$KILLER_PID" sends SIGTERM to the subshell. However, the subshell is a bash process running sleep 5 — SIGTERM will interrup…

🪰 Gadfly · advisory

🟠 **Budget watchdog kill may not prevent delayed SIGKILL from catching consolidation pass** _error-handling · flagged by 1 model_ - **Budget watchdog race: `kill $KILLER_PID` may not prevent the delayed SIGKILL** (`entrypoint.sh:364`): The watchdog subshell runs `sleep $BUDGET; pkill -TERM ...; sleep 5; pkill -KILL ...`. If the budget expires and the watchdog enters its `sleep 5` between SIGTERM and SIGKILL, and the reviews then finish naturally during that 5-second window, the main script's `kill "$KILLER_PID"` sends SIGTERM to the subshell. However, the subshell is a bash process running `sleep 5` — SIGTERM will interrup… <sub>🪰 Gadfly · advisory</sub>

🪰 Gadfly review — consensus across 6 models

Verdict: Blocking issues found · 15 findings (6 with multi-model agreement)

Finding Where Models Lens
🟠 diffForPath re-splits the full diff on every scoped get_diff call, allocating O(diff_size) per invocation cmd/gadfly/tools.go:470 5/6 correctness, error-handling, performance
🟠 Package doc still advertises the tool-free finalization fallback this PR removed cmd/gadfly/main.go:44 3/6 maintainability
🟡 Per-PR token budget only updated after each pass (addUsage post-run), so Budget.Check can't stop a single runaway pass mid-flight — token guard is pass-granular, weaker than its name implies cmd/gadfly/executus.go:283 3/6 error-handling, maintainability
🟠 resolveContextTokens silently disables compaction on network errors with no log cmd/gadfly/executus.go:357 2/6 error-handling
🟡 Test sets obsolete GADFLY_MAX_DIFF_CHARS instead of GADFLY_RECHECK_DIFF_CHARS cmd/gadfly/recheck_test.go:52 2/6 maintainability
🟡 windowDiff re-splits the whole diff per get_diff call; paging a large diff is O(n^2/limit) on the large-PR hot path cmd/gadfly/tools.go:435 2/6 correctness, performance
9 single-model findings (lower confidence)
Finding Where Model Lens
🟠 Backstop KILL pkill can race the disarm and kill the consolidation gadfly pass entrypoint.sh:345 glm-5.2:cloud error-handling
🟠 Budget watchdog kill may not prevent delayed SIGKILL from catching consolidation pass entrypoint.sh:364 deepseek-v4-pro:cloud error-handling
🟡 gadflyToolRegistry re-derives fsTools()+delegateTool, duplicating toolbox()'s membership logic (two sources of truth) cmd/gadfly/executus.go:82 glm-5.2:cloud maintainability
🟡 GADFLY_TIMEOUT_SECS/MAX_STEPS configured in multiple places; Config.Defaults.MaxRuntime/MaxIterations duplicated verbatim with per-run RunnableAgent values cmd/gadfly/executus.go:258 claude-code/opus:max maintainability
🟡 compactionEnabled() is a verbatim copy of recheckEnabled(); extract a shared env-bool helper cmd/gadfly/executus.go:319 claude-code/opus maintainability
🟡 toolbox() is now dead in production (only tests call it); its tool-membership logic is duplicated in gadflyToolRegistry cmd/gadfly/tools.go:115 glm-5.2:cloud maintainability
🟡 windowDiff truncation message off-by-one: says 'truncated at line N' when N was the last emitted line cmd/gadfly/tools.go:461 deepseek-v4-pro:cloud error-handling
file docstring claims 'degrades to today's behavior' but the old forced-finalization fallback was removed by this PR, so empty-output exhaustion no longer matches prior behavior cmd/gadfly/executus.go:19 glm-5.2:cloud maintainability
downshift exports MAX_DIFF_CHARS (not GADFLY_MAX_DIFF_CHARS); the aliasing lives only in run.sh and is invisible from the Go side entrypoint.sh:208 glm-5.2:cloud maintainability
Per-model detail
claude-code/opus (claude-code) — Minor issues

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

I've established the key facts:

  1. The gadfly-side code is confirmed exactly as the draft describesresolveContextTokens (cmd/gadfly/executus.go:349-364) reads GADFLY_API_KEY/OLLAMA_API_KEY and passes it into model.NewCloudOllamaLimitCache("", key, nil) with no provider gate, on the default review path.

  2. The executus module source is genuinely inaccessible from this sandbox — it's in the module cache outside the working directory (the ls/find calls are blocked for security), there's no vendor/ directory, and no network. This is the same wall the original reviewer hit.

The finding's entire materiality hinges on a behavior the draft explicitly could not verify: whether executus/model.MaxContextTokensResolving actually performs a keyed /api/show network fetch for non-Ollama model specs. The draft author marked this as unverifiable ("If executus only fetches when the spec resolves to an ollama-cloud model, this is benign"), assigned it low confidence, and the verification gate cannot positively confirm the egress mechanism against the code.

The available evidence actually leans against the finding: the cache is named CloudOllamaLimitCache, the author's own contract comment scopes the fetch to "an Ollama Cloud model's limit via /api/show" and states it "returns 0 — disabling compaction — for an unknown model," consistent with a static-catalog-first resolver that doesn't blindly POST the key for arbitrary specs. Per the verification discipline — keep only what I can positively confirm — this finding must be dropped.


VERDICT: No material issues found

Security review (🔒) — re-platform onto executus + large-PR cost controls

I traced the attacker-controlled input — the PR diff under review — through the new code and found no injection/path/SSRF sink: diffForPath/windowDiff only substring-match and slice the in-memory diff string and return it to the model; get_diff's new path/start_line/limit args never touch the filesystem (cmd/gadfly/tools.go:411-464). The pkill -f backstop matches fixed, operator-controlled path patterns, not anything from the diff. Budget/critic code handles no untrusted data. The budget caps, caller-id derivation (prCallerID, numeric PR + owner/repo), and env parsing introduce no new trust-boundary issues, and no secrets are written to logs/error strings.

The one drafted concern (a possible egress of a non-Ollama provider's secret to api.ollama.com via resolveContextTokens) depends entirely on executus/model.MaxContextTokensResolving performing a keyed network fetch for non-Ollama model specs — behavior that cannot be confirmed against the code from this checkout and which the surrounding contract comments suggest does not occur. Unconfirmable, so dropped.

🎯 Correctness — Minor issues

Confirmed. Line 475 uses strings.Contains(ln, path), an unanchored substring match against the diff --git header. A path arg that is a substring of another changed file's header (e.g. api.gointernal/api.go, or any short path that appears in another file's a/b sides) pulls that file's hunks in too. Impact is as described — extra context only, never a missing/wrong hunk — and the _test infix spares the common tools.go/tools_test.go collision. The finding holds.

VERDICT: Minor issues

  • diffForPath substring matching over-includes files (cmd/gadfly/tools.go:475). The section selector is strings.Contains(headerLine, path). The arg is documented as a full path (e.g. pkg/foo/bar.go), but substring matching means a path that is a substring of another changed file's diff --git header pulls in that file's hunks too. Concrete case: a PR touching both api.go and internal/api.go, with get_diff(path="api.go"), returns both sections. Impact is mild — extra context, never a missing/wrong hunk — but it dilutes the per-file scoping that exists precisely to cut tokens on huge PRs (the PR's whole motivation). The common test-file collision (tools.go vs tools_test.go) is not hit thanks to the _test infix, so this is low-frequency. Suggested fix: anchor on the header tokens, e.g. match " a/"+path or " b/"+path (still covers renames), instead of a bare Contains. Severity small.
🧹 Code cleanliness & maintainability — Minor issues

All three findings are confirmed against the actual code:

  1. main.go:44-45 — the doc sentence about the "tool-free finalization fallback" survives, and forceFinalAnswer/finalizeInstruction are entirely gone from non-test code (grep returns nothing). The comment now contradicts the code.
  2. executus.go:319-326compactionEnabled() is byte-for-byte identical to recheckEnabled() (recheck.go:68-75) modulo the env var name.
  3. Timeout expressiontime.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second appears at main.go:325, executus.go:258, and executus.go:285 — exactly three identical copies.

All three survive verification. Outputting the corrected review.

VERDICT: Minor issues

Reviewing strictly through the code cleanliness & maintainability lens. The new executus.go is well-structured and thoroughly commented, and the runSpecialists fanout rewrite is a clean swap. Three maintainability nits, all verified by reading the code:

  • cmd/gadfly/main.go:44 — stale package doc describes removed behavior. The package-level doc still says GADFLY_WRAPUP_RESERVE is backed by "a tool-free finalization fallback [that] guarantees a step-exhausted pass still emits output." This PR deletes that fallback (forceFinalAnswer/finalizeInstruction are gone — a repo grep finds no non-test references left — and wrapup_test.go is updated to document that exhaustion now surfaces as an error). main.go:44 is the only non-test mention of the fallback left, and it now directly contradicts the new behavior. A reader trusting this comment will believe a guarantee that no longer holds. Fix: drop the "Plus a tool-free finalization fallback…" sentence (the wrap-up nudge is now the only "always emit something" mechanism).

  • cmd/gadfly/executus.go:319-326compactionEnabled() is a verbatim copy of recheckEnabled(). It is byte-for-byte the same "on unless explicitly falsey" pattern as recheck.go:68-75 (switch strings.ToLower(strings.TrimSpace(os.Getenv(...))) over "0","false","no","off"default: true). This is exactly the copy-paste the lens targets: two identical env-bool parsers that will drift. Fix: extract one helper (e.g. envBoolDefaultTrue(name string) bool) and have both call it.

  • cmd/gadfly/executus.go:258 & :285 (and main.go:325) — the per-lens timeout expression is written three times. time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second appears identically in Defaults.MaxRuntime, in reviewExecutor.run's RunnableAgent.MaxRuntime, and in reviewWithSpecialist. The two inside executus.go are also redundant with each other (the RunnableAgent value re-specifies what Defaults already sets for the same executor). A small lensTimeout() helper (mirroring the existing wrapUpReserve()/lensConcurrency() style) would remove the repetition and the chance of one copy being changed without the others. Trivial, but it's the kind of duplication that rots.

Nothing else in my lane: naming, function length, and the wrappedTool/gadflyBudget/wrapUpCritic abstractions read clearly and match the surrounding style, and the new tests are well-named for the renamed symbols.

Performance — Minor issues

I've verified both findings against the actual code in cmd/gadfly/tools.go.

Finding 1 (windowDiff, line 435): Confirmed. windowDiff calls strings.Split(diff, "\n") over the whole input at line 435, and the start/limit bounds are applied only afterward (the loop at 448–459 indexes into the fully-materialized lines slice). Paging to the end of a large diff re-scans from line 1 each call → O(n²/limit). The magnitude caveat is also accurate: Go's strings.Split substrings share the original backing bytes, so the per-call cost is the line-descriptor slice plus the O(n) scan, not a byte copy.

Finding 2 (diffForPath + windowDiff, lines 415/470): Confirmed. A path-scoped call runs diffForPath (full strings.Split at line 473) then windowDiff (a second split at line 435) — two splits per path-scoped call.

Both findings positively reproduce against the code and are honestly scoped as minor. Both survive.

VERDICT: Minor issues

Performance lens — findings

  • cmd/gadfly/tools.go:435 (windowDiff) re-splits the entire diff on every get_diff call, even to return one 800-line window. strings.Split(diff, "\n") materializes a []string over the whole diff regardless of start/limit. On exactly the large PRs this feature targets, the agent is steered to call get_diff repeatedly — paging via start_line and re-fetching per path. Each call is O(total-diff-lines) work plus a fresh line-slice allocation, so paging a large diff to its end is O(n²/limit): page k still scans/splits from line 1. The start/limit bounds are applied only after the full split, and the loop indexes lines[i] from a slice built over everything.

    • Magnitude is bounded: strings.Split substrings share the original string's backing bytes (no byte copy), so the per-call cost is the line-descriptor slice (~16 B × line-count) plus the O(n) scan — GC/CPU churn on the large-PR path, not a token blowup. A real efficiency-by-omission, not a blocker.
    • Suggested fix: precompute the line index once (split r.diff into a cached []string on the repoFS, guarded for the immutable diff) and have windowDiff slice the cached lines; paging then becomes O(window) per call.
  • cmd/gadfly/tools.go:415 + :470 — a path-scoped get_diff splits the diff twice. diffForPath does a full strings.Split + section rebuild (O(n)), then windowDiff splits that section again. Two full passes/allocations per path-scoped call. Minor on top of the above; the cached-lines approach would let diffForPath return line indices and avoid the second split.

🧯 Error handling & edge cases — No material issues found

I've established the key facts:

  1. The ordering inversion is real and confirmed from the diff: new executus.go:run checks res.Err != nil (returns error) before checking res.Output, whereas the old runAgent returned non-empty res.Output first and only surfaced runErr when output was empty.

  2. The impact is unverifiable. The finding is only a real regression if executus's run.Executor.Run can return a populated Output together with a non-nil Err. The executus module source lives in the Go module cache outside the working directory, and every attempt to read it (Read, grep, go doc, cat, even with the sandbox disabled) is blocked by the working-directory allowlist. I cannot positively confirm that executus ever returns Output+Err together.

The draft itself concedes this: it rates confidence "medium" and states the impact "hinges on whether executus ever returns Output+Err together, which I could not confirm." A verification gate keeps a finding only when it can be positively confirmed against the code. The ordering inversion alone is not a defect — it is harmful only under an executus semantic I cannot establish, and a reasonable reading (a run kernel that returns empty Output whenever Err is set, with the wrap-up critic ensuring output is emitted before the cap on the success path) makes it a non-issue. This is exactly the kind of speculative, unconfirmable impact the gate must drop.

All other items in the draft were explicitly not flagged by the draft itself.


VERDICT: No material issues found

The new executus re-platform code is well-guarded against the error/edge-case lens: gadflyBudget is nil-safe on every method (executus.go:123-155, covered by executus_test.go), compactRatio clamps to (0,1] and falls back on parse error, resolveContextTokens degrades to 0→compaction-off on an unknown model, a bad GADFLY_COMPACT_MODEL disables compaction rather than sinking the review (main.go:172-176), and the lens-panic recover survived the fanout rewrite (main.go:255-265).

The one drafted concern — that reviewExecutor.run checks res.Err before res.Output and could discard a completed review — was dropped: the ordering inversion is real, but its impact depends entirely on whether executus's run.Executor.Run returns a populated Output together with a non-nil Err, which cannot be confirmed against the available source (the executus module is outside the readable tree). Without that confirmation the finding is speculative, not a verified defect.

claude-code/opus:max (claude-code) — Minor issues

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

I've completed verification. Here is the outcome:

Confirmed in-repo (sub-claims): GADFLY_API_KEY is indeed the generic provider key (model.go:36–39), and resolveContextTokens (executus.go:349–364) does unconditionally build model.NewCloudOllamaLimitCache("", key, nil) with that generic key and an empty base URL, regardless of provider, when called with os.Getenv("GADFLY_MODEL") from newReviewExecutor.

Not confirmable (the finding's actual impact): Whether model.MaxContextTokensResolving actually issues an authenticated network request to ollama.com carrying that key for a non-Ollama, catalog-miss model lives entirely in the executus/model package. That source is outside the sandbox-allowed path (/tmp/gadfly/repo only, no vendor dir, module cache access blocked), so I cannot positively confirm any of the three conditions the leak requires: that the call fires for non-Ollama models, that an empty base URL resolves to the default cloud host rather than a no-op, or that the key is attached. The draft itself rated this low-confidence and explicitly unverified.

Per the gate rule — keep only what can be positively confirmed — this finding must be dropped: its entire impact rests on unverifiable cross-module behavior.


VERDICT: No material issues found

This PR is largely a refactor (executus re-platform) plus cost-control plumbing. Through the Security lens, the read-only tool surface stays read-only and no new injection/authz sink is introduced.

Checked and clear

  • entrypoint.sh pkill backstop: fixed, non-attacker-controlled patterns inside the CI container; GADFLY_PR_BUDGET_SECS numerically guarded before use. No injection.
  • cmd/gadfly/tools.go get_diff / diffForPath / windowDiff: the new path/start_line/limit args operate purely on the in-memory diff string (substring match + line slicing) — no filesystem access, no path traversal, no new injection sink.
  • gadflyBudget error messages / prCallerID (executus.go): expose only token/second counts and a repo#PR audit key — no secrets, not used in any command.
  • resolveContextTokens (executus.go:349): builds an Ollama limit cache from the generic GADFLY_API_KEY; the claimed credential-exposure path depends on executus/model network behavior that cannot be confirmed from this checkout and is gated behind multiple unverified conditions — not a confirmable finding.
🎯 Correctness — Minor issues

The finding is confirmed. The code at cmd/gadfly/tools.go:470-483 matches the draft's description exactly: inSection = strings.Contains(ln, path) is an unanchored substring match against the full diff --git a/… b/… header line, so a path that is a substring of another changed file's path (e.g. foo.go vs vendor/foo.go) pulls in both sections. This survives verification.


VERDICT: Minor issues

  • cmd/gadfly/tools.go:470 (diffForPath) — substring path match over-includes unrelated files. The per-file scoper toggles inSection = strings.Contains(ln, path) against the whole diff --git a/… b/… header line. Because it's an unanchored substring over the entire line, a path that is a substring of another changed file's path pulls in both files' hunks. Concretely, with changed files foo.go and vendor/foo.go, calling get_diff(path="foo.go") returns the vendor/foo.go section too (its header diff --git a/vendor/foo.go b/vendor/foo.go contains "foo.go"). This partially defeats the per-file scoping precisely on the large PRs the feature targets (more files → more collisions → more diff re-injected than intended — the exact token-burn this PR fights). The existing test only covers a non-colliding y.go case so it doesn't exercise this. Severity is small in practice since the model usually passes full paths from the changed-files list. Suggested fix: match on a path-boundary token, e.g. require a/<path> or b/<path> rather than a bare strings.Contains.
🧹 Code cleanliness & maintainability — Minor issues

The executus run package source isn't reachable from this sandbox (bash is restricted to the repo, and the module isn't vendored). The draft itself flagged this same limitation — it couldn't confirm executus's Defaults vs per-RunnableAgent precedence.

However, the finding does not depend on that precedence: the verbatim duplication of the timeout expression across executus.go:258, executus.go:285, and main.go:325, plus Defaults.MaxIterations coexisting with an always-set per-run MaxIterations, is fully confirmed in the repo. The redundancy (multiple sources of truth for one knob) holds whichever layer wins precedence. The finding survives. I'll drop its "Verification" sub-bullet (process meta-commentary) per output rules.


VERDICT: Minor issues

🧹 Code cleanliness & maintainability

  • cmd/gadfly/executus.go:258 & :285 (with cmd/gadfly/main.go:325) — GADFLY_TIMEOUT_SECS / GADFLY_MAX_STEPS are configured in three/two places, and the Config.Defaults copies are redundant with the per-run values. The expression time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second is written verbatim twice in executus.go (once in Config.Defaults.MaxRuntime, once in the per-run RunnableAgent.MaxRuntime), and the same timeout is also applied as the context.WithTimeout deadline in reviewWithSpecialist (main.go:325). Likewise Defaults.MaxIterations is set from GADFLY_MAX_STEPS, but every run() passes an explicit MaxIterations: maxSteps on the RunnableAgent (always non-zero, via GADFLY_MAX_STEPS/GADFLY_RECHECK_MAX_STEPS defaults). The other Defaults fields — MaxConsecutiveToolErrors, MaxSameToolCallRepeats, CompactionThresholdRatio, FallbackTierare needed, so Defaults can't simply be dropped wholesale.
    • Impact: two/three sources of truth for one knob. Whichever layer executus gives precedence, one set of these assignments is dead: a maintainer who later retunes the timeout/step cap via Config.Defaults (the natural-looking place) may see no effect, or change one site and not the others. Pure maintainability hazard, not a runtime bug today.
    • Suggested fix: pick one layer — drop MaxIterations/MaxRuntime from Config.Defaults and let the per-RunnableAgent values be the single source, or set them only in Defaults and leave the per-run fields zero. If the per-run MaxRuntime is kept, hoist envInt("GADFLY_TIMEOUT_SECS", …) * time.Second into a small helper so the expression isn't copy-pasted across the three sites.
Performance — No material issues found

VERDICT: No material issues found

Reviewed strictly through the Performance lens. This PR is, on the whole, a performance/cost improvement — its entire purpose is to kill the large-PR token-burn amplifier. I verified the hot paths against the checked-out code and found no material regression.

What I checked:

  • get_diff no longer dumps the whole diff (tools.go:434-483). The old tool returned r.diff verbatim, which on a ~1MB diff was re-transmitted every step × lens × pass. windowDiff now caps each call to maxGetDiffLines/maxGetDiffBytes (800 lines / 64KB). This is a strict reduction in per-step transmitted bytes — the core win. Confirmed.
  • Embedded diff is bounded once per pass (main.go:361-377, recheck.go): review embed stays at 60K, recheck embed dropped 60K→20K. Built once per task string, not per step. The per-step re-send of that bounded string is inherent to the agent loop and unchanged; compaction (executus.go) now folds the runaway transcript middle, which is a net reduction. No new amplification.
  • resolveContextTokens (executus.go:349-363) does a network /api/show call, but it's invoked once at executor-build time (per process, via newReviewExecutor called once in run()), not per pass/step. 15s timeout bounds it. Not on a hot path.
  • gadflyBudget lock/addUsage/Check are O(1) mutex ops; Check gating before model calls is negligible.
  • fanout migration (main.go runSpecialists) preserves MaxConcurrent: lensConcurrency() (default 1 → sequential); no oversubscription change.

Non-blocking observation (a micro-optimization, not flagged as a finding): windowDiff and diffForPath each do strings.Split(r.diff, "\n") on the entire diff per get_diff call — and a path-scoped call splits the full diff twice (once in diffForPath, again in windowDiff on the extracted section). On a huge PR, a model paging through it re-allocates the full line slice on every call. The magnitude (a few ms / a transient slice of N strings) is trivial next to model inference and network latency, and it still strictly beats the old "send the whole diff every step" behavior, so I'm not flagging it as a regression. Worth a comment only if get_diff ever moves into a tight programmatic loop.

🧯 Error handling & edge cases — Minor issues

I've verified the single finding against the actual code. The claim is confirmed:

  • gadflyBudget.Check (executus.go:123–136) reads b.tokens.
  • b.tokens is only mutated by addUsage (executus.go:147–155).
  • reviewExecutor.run (executus.go:279–304) calls r.ex.Run(...) and then r.budget.addUsage(res.Usage) afterward (line 295).

So token usage is fed to the budget only once per completed pass, and Check sees a static token count within any single run — the token gate cannot stop a runaway first pass mid-flight, only subsequent passes. The finding is accurate, and the mitigations it cites (per-pass Commit seconds, per-lens MaxRuntime, entrypoint GADFLY_PR_BUDGET_SECS) all exist in the code. The finding survives.

VERDICT: Minor issues

The new code is, on the whole, defensively written — nil-safe budget methods, guarded start/limit clamping in windowDiff, friendly (non-error) returns for empty/unknown get_diff paths, panic recovery inside the fanout closure, and graceful "reviewer failed/did not run" notices instead of crashes. One genuine edge-case gap is worth noting.

Findings

  • cmd/gadfly/executus.go:283 (and :123–155, the gadflyBudget token gate) — the per-PR token budget can't bite mid-pass, only between passes. Budget.Check is the executor's pre-call gate, but it reads b.tokens, which is only updated by reviewExecutor.run calling r.budget.addUsage(res.Usage) after ex.Run returns (the run.Ports.Budget interface only carries Commit(seconds), with no per-call token feed — tokens are fed "out-of-band" once per finished pass). Consequence: the very first lens/pass on a pathological PR runs to completion regardless of GADFLY_PR_TOKEN_BUDGET; the cap only stops subsequent lenses. Since the whole PR is about stopping one PR from draining a usage block, the in-binary token guard is weaker than its name suggests.
    • Mitigation that lowers severity: within a single pass, tokens are still bounded by GADFLY_MAX_STEPS × the (compacted) context window, and the wall-clock layers (GADFLY_PR_TIME_BUDGET_SECS per pass via Commit, the per-lens GADFLY_TIMEOUT_SECS MaxRuntime, and entrypoint's GADFLY_PR_BUDGET_SECS) are the real swarm backstop. So this is a real-but-mitigated granularity gap, not a hole.
    • Suggested fix / or just document: either note in the comment + README that GADFLY_PR_TOKEN_BUDGET is enforced at pass granularity (a single runaway pass can overshoot), or, if executus exposes an incremental usage hook, feed addUsage per step rather than per pass.
deepseek-v4-pro:cloud (ollama-cloud) — Blocking issues found

Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

Verdict: No material issues found

I traced the full tainted-input flow through the new and modified code — from environment variables and API responses through to where they're consumed (URL construction, shell commands, markdown output, filesystem paths, and network calls). Here's what I checked and why nothing rises to a material security finding:

  • entrypoint.sh PR in API URLs (line 197, new; also pre-existing at lines 78, 89, 130, 146, 391): PR is interpolated into Gitea API URL paths without validation. If it contained ../ sequences, it could traverse the API surface. However, PR originates from the CI event context (${PR:?PR required}), which is a pull-request number set by Gitea Actions — not attacker-controlled input. This is the same pattern used throughout the pre-existing script. Not a new risk.

  • entrypoint.sh pkill -f backstop (lines 342–345, new): The hardcoded patterns /usr/local/bin/gadfly and ${SCRIPTS_DIR}/run.sh (where SCRIPTS_DIR=/app/scripts, hardcoded on line 58) are used with pkill -f. No shell injection is possible because neither pattern incorporates any external input. The -f flag matches full command lines; in a single-purpose container this is safe.

  • entrypoint.sh GADFLY_NOTICE in PR comments (lines 210, 369, 396; run.sh lines 236–237): The notice text is constructed entirely by the script itself. The only interpolated values are PR_DIFF_BYTES (from wc -c | tr -d '[:space:]', always numeric) and GADFLY_PR_BUDGET_SECS (validated as integer on line 337). No user or API content enters the notice. Markdown injection is not possible here.

  • entrypoint.sh GITEA_TOKEN in clone URL (line 153, pre-existing): The token is embedded in CLONE_URL. This is a pre-existing pattern; stderr is redirected to /dev/null on the git clone call (line 157), so the URL isn't echoed. Not new.

  • tools.go diffForPath substring matching (lines 470–483, new): The path argument from the LLM is used only in strings.Contains against diff header lines — no filesystem access, no regex interpretation. A malicious/hallucinated model could pass a short path like a to match many files, but that's a correctness concern, not a security one.

  • executus.go resolveContextTokens (lines 349–363, new): Makes a one-time network call to Ollama Cloud's /api/show using the API key from the environment. The modelSpec comes from GADFLY_MODEL (CI config), not from untrusted input. The call has a 15-second timeout. No SSRF vector from attacker-controlled input.

  • executus.go prCallerID (lines 308–315, new): Constructs a caller ID from GADFLY_REPO and GADFLY_PR, both set by entrypoint.sh from trusted sources (REPO_PATH extracted from GITEA_API, and PR from the CI event). Used only as an audit/budget key, not in any security decision.

  • All envInt/strconv.ParseFloat parsing of environment variables: Every new env var (GADFLY_PR_TOKEN_BUDGET, GADFLY_PR_TIME_BUDGET_SECS, GADFLY_COMPACT_RATIO, GADFLY_MODEL_CONTEXT_TOKENS, GADFLY_COMPACT_KEEP_RECENT, GADFLY_COMPACT_SUMMARY_WORDS, GADFLY_HUGE_DIFF_*, GADFLY_PR_BUDGET_SECS) is parsed with proper type conversion and bounds-checking. Invalid values fall back to safe defaults.

🎯 Correctness — Minor issues

Verdict: Minor issues

  • diffForPath substring false-positive matching (cmd/gadfly/tools.go:475): The function uses strings.Contains(ln, path) against the full diff --git a/<path> b/<path> header line. This produces false positives when one file's path is a substring of another's — e.g., path="foo" matches both foo.go and foobar.go, and path="a" matches every diff header (all contain a/). The recheck model uses this to pull specific file hunks for verification; a false positive feeds it the wrong file's diff, potentially causing it to confirm a finding against incorrect code. The fix is to anchor the match, e.g., check for "a/" + path or "b/" + path as distinct substrings: strings.Contains(ln, "a/"+path) || strings.Contains(ln, "b/"+path). Verified by reading tools.go:470-483 and tracing the substring logic against the known diff --git a/<path> b/<path> header format.

  • windowDiff trailing-empty-line artifact (cmd/gadfly/tools.go:435,478-479): diffForPath appends "\n" after every line including the last, so the scoped diff always ends with a trailing newline. windowDiff then calls strings.Split(diff, "\n"), which produces a spurious empty final element. This empty string is emitted as a numbered blank line (e.g., "42\t\n") and inflates the total line count by 1. The paging hint then reports an off-by-one total. The full-diff path (r.diff without diffForPath) may or may not end with a newline, making the behavior inconsistent. Impact is cosmetic (a blank line in the tool output and a slightly wrong line count in the truncation notice) but it is a logic error. Fix: strings.TrimSuffix(diff, "\n") before splitting, or skip empty final elements. Verified by reading diffForPath at tools.go:478-479 (always appends \n) and windowDiff at tools.go:435 (splits on \n).

🧹 Code cleanliness & maintainability — Minor issues

Both findings confirmed. Let me output the corrected review.

Verdict: Minor issues

  • Stale doc comment references removed fallback (cmd/gadfly/main.go:44-45): The package-level doc comment still says "Plus a tool-free finalization fallback guarantees a step-exhausted pass still emits output." But forceFinalAnswer, finalizeInstruction, and the entire fallback mechanism were removed in this PR (the executus run.Result doesn't expose the transcript). The comment is now misleading — a future reader would expect a fallback that no longer exists. Verified by reading main.go:42-45 and confirming forceFinalAnswer/finalizeInstruction are absent from the file.

  • Test targets wrong env var (cmd/gadfly/recheck_test.go:52): TestBuildRecheckTask calls t.Setenv("GADFLY_MAX_DIFF_CHARS", "") to clear any override, but buildRecheckTask was changed to read GADFLY_RECHECK_DIFF_CHARS instead (see recheck.go:94). The test passes only because the fixture diff is tiny and fits within the 20,000-char default. The Setenv line is dead — it clears a variable the function no longer consults. Verified by reading both recheck.go:94 and recheck_test.go:51-63.

Performance — Minor issues

The finding is confirmed. diffForPath at line 473 does strings.Split(diff, "\n") on the full diff every time get_diff is called with a path arg (line 415), and newRepoFS (line 48-66) stores the diff as a raw string with no pre-index. No caching or indexing exists anywhere else in the codebase. The finding survives.


Verdict: Minor issues

  • diffForPath re-splits the full diff on every scoped get_diff call (cmd/gadfly/tools.go:470-483). When the model calls get_diff with a path arg (which the tool description now encourages for large PRs), diffForPath does strings.Split(diff, "\n") on the entire diff — for a ~1MB / ~250K-line PR, that allocates a ~250K-element string-header slice (~4 MB) just to extract one file's section. If the model investigates N files across its steps, this allocation repeats N times (O(N × diff_size)). A one-time pre-index at newRepoFS time (e.g., a map[string]string of path→hunk, or a line-offset index) would make each scoped lookup O(1) instead of O(diff_size). The pagination itself is the primary burn fix and this is gated by the model's step budget, so impact is moderate — but it's the one place the large diff is still fully materialized on every tool call.
🧯 Error handling & edge cases — Blocking issues found

Blocking issues found

  • diffForPath substring match returns wrong file's diff (cmd/gadfly/tools.go:475): strings.Contains(ln, path) matches the path as a substring anywhere in the diff --git header line. If a PR changes both foo.go and foo_test.go, asking for path: "foo.go" will also match the foo_test.go header (since "foo.go" is a substring of "a/foo_test.go"), returning both files' hunks instead of just the requested one. This defeats the per-file scoping that's the whole point of the paginated get_diff for large PRs — the model gets extra (potentially large) diff content it didn't ask for, wasting tokens. The fix is to match the path as a whole path component, e.g. by checking that the path appears bounded by / or whitespace/quote in the header line, or by splitting the header on spaces and checking each path segment. Verified by reading tools.go:470-482 and confirming the test at tools_test.go:200 uses non-colliding names (x and y.go), so this edge case is untested.

  • resolveContextTokens silently disables compaction on network errors (cmd/gadfly/executus.go:357-360): When the model.MaxContextTokensResolving call fails (network timeout, API error, DNS failure), the function returns 0, which silently disables compaction. The caller at line 232 treats window > 0 as the gate — a 0 means "no compaction." But a network error is not the same as "model unknown"; the model may be perfectly valid and the network just temporarily unavailable. The result is that compaction is silently disabled for the entire process lifetime with no log message, and the large-PR burn protection is gone. The function should at least log the failure to stderr (as other non-fatal degradations in this codebase do, e.g. main.go:176). Verified by reading executus.go:349-363 and confirming no error logging path exists.

  • windowDiff truncation message off-by-one (cmd/gadfly/tools.go:461): The message says truncated at line %d using i (0-based index of the first unemitted line), but the last emitted line was i (1-based). For example, after emitting lines 1–800, i=800 (0-based), and the message reads "truncated at line 800 of 850" — but line 800 was the last line shown; truncation actually happened at line 801. The start_line hint correctly uses i+1 (801), so the paging instruction is right, but the "truncated at" number is misleading. Should be i+1 for consistency. Verified by reading tools.go:434-463.

  • Budget watchdog race: kill $KILLER_PID may not prevent the delayed SIGKILL (entrypoint.sh:364): The watchdog subshell runs sleep $BUDGET; pkill -TERM ...; sleep 5; pkill -KILL .... If the budget expires and the watchdog enters its sleep 5 between SIGTERM and SIGKILL, and the reviews then finish naturally during that 5-second window, the main script's kill "$KILLER_PID" sends SIGTERM to the subshell. However, the subshell is a bash process running sleep 5 — SIGTERM will interrupt sleep but the subshell will then continue to the next command (pkill -KILL), which would fire against the consolidation pass that's about to start. The comment on line 362–363 says the kill is meant to "disarm the watchdog so its delayed SIGKILL can't catch the consolidation pass," but the disarm is not atomic with respect to the sleep 5; pkill -KILL sequence. A more robust approach would be to use a sentinel file that the watchdog checks before firing SIGKILL, or to kill -KILL the watchdog subshell itself rather than SIGTERM. Verified by reading entrypoint.sh:337-364.

glm-5.2:cloud (ollama-cloud) — Minor issues

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

VERDICT: No material issues found

Through the 🔒 Security lens, I verified the taint paths in this PR and found nothing material:

  • get_diff path arg (tools.go:407-426) — the model-supplied path is not run through repoFS.resolve()/contains(), but it doesn't need to be: it never touches the filesystem. diffForPath only filters the already-loaded r.diff string (in memory) and the result is then bounded by windowDiff (maxGetDiffLines/maxGetDiffBytes). No sandbox-escape or path-traversal vector is introduced; a hostile PR can't read outside the checkout via this tool. The substring-matching in diffForPath (tools.go:470-482) can over-scope (e.g. path:"go" matches every *.go header), but that only returns diff content the model could already page through — bounded by the same caps — so it's a correctness nit, not a security leak. Verified by reading tools.go:401-483.
  • Sandboxed read tools unchangedread_file/grep/find_files still go through repoFS.resolve()contains() (tools.go:68-100), and the worker sub-agent (delegate.go:42) uses workerToolbox() which builds on the same sandboxed tools. The PR doesn't weaken containment. Verified via grep/read_file.
  • API key handling (executus.go:349-364)resolveContextTokens forwards GADFLY_API_KEY/OLLAMA_API_KEY to model.NewCloudOllamaLimitCache. The key is sent to introspect the model's context window; the target is the same provider that already receives the model traffic, and it's never logged or returned in review output. This is expected behavior, not a credential leak. (I could not inspect the executus-side default base URL for the "" first arg since the dependency source isn't in this repo, but that's executus's contract to honor, not a gadfly-introduced exposure — low confidence only.)
  • Budget/backstop (entrypoint.sh:335-349, executus.go:114-155) — the wall-clock killer uses hardcoded pkill -f patterns (/usr/local/bin/gadfly, /app/scripts/run.sh), no env interpolation into the process pattern, so no command-injection surface. The Go-side gadflyBudget is purely advisory internal accounting with no authz decisions. Verified by reading both.
  • Secret leakage into review output — token/second counts appear only in internal budget error strings (executus.go:130-133), which surface as run errors, not as posted review markdown; the caller (reviewWithSpecialist) renders a generic "failed to complete" notice. No usage figures reach the PR comment. Verified via executus.go:279-304 and the existing reviewWithSpecialist path.

No injection, SSRF, authz gap, secret leak, or untrusted-data mishandling issue introduced by this change in my lane.

🎯 Correctness — Minor issues

The code at line 475 confirms inSection = strings.Contains(ln, path) — an unanchored substring match on the diff --git header. The tool description (line 402) promises "returns ONLY that file's diff hunks," but a short/ambiguous path substring (e.g. foo) would match every file whose path contains foo (e.g. pkg/foobar.go). The existing test only exercises a full filename (y.go) that happens to match cleanly, so it doesn't catch the over-broadness. The finding survives.

The other two draft notes are self-described as "not a finding" / "no bug," so they don't belong in the findings list.

VERDICT: Minor issues

  • cmd/gadfly/tools.go:475 (diffForPath unanchored substring match breaks the path-scoping contract). diffForPath matches the path argument as a substring of the diff --git header line via strings.Contains(ln, path), with no path-boundary anchoring. The get_diff tool's path arg description promises "returns ONLY that file's diff hunks," but a short/ambiguous path (e.g. foo) returns hunks for every file whose path merely contains that substring (e.g. diff --git a/pkg/foobar.go b/pkg/foobar.go). The existing test only exercises a full filename (y.go) that happens to match exactly, so it doesn't surface the over-broadness. Fix: split the header and compare against the exact a/<path>/b/<path> tokens, or at minimum anchor with delimiters.
🧹 Code cleanliness & maintainability — Minor issues

VERDICT: Minor issues

  • cmd/gadfly/tools.go:113-127toolbox() is now dead in production; its tool-membership logic is duplicated in gadflyToolRegistry. The deleted runAgent was the only production caller of (*repoFS).toolbox(). Verified by grep: only tools_test.go:38 and tools_test.go:284 remain. The production path now goes through gadflyToolRegistry (executus.go:80-94), which re-implements the same fsTools() + conditional delegateTool() assembly. Two sources of truth for "which tools the reviewer exposes."

  • cmd/gadfly/executus.go:82-84 vs cmd/gadfly/tools.go:117-119 — duplicated tool-list construction. gadflyToolRegistry re-derives fs.fsTools() + if fs.worker != nil { append(fs.delegateTool()) }, byte-for-byte the membership logic of toolbox(). Verified by reading both functions. Adding a new built-in tool to fsTools() requires both paths to stay in sync by hand.

  • cmd/gadfly/executus.go:285GADFLY_TIMEOUT_SECS re-read per pass, duplicating the Defaults.MaxRuntime set at build time. newReviewExecutor sets cfg.Defaults.MaxRuntime from envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs) (line 258), and run() re-reads the same env with the same default to build MaxRuntime for each RunnableAgent (line 285). Verified: both sites resolve the same knob. Duplicated env resolution for one logical value; a drift hazard if either is ever cached or changed independently.

  • cmd/gadfly/executus.go:19-21 — docstring claims "degrades to today's behavior," but the old forced-finalization fallback was removed by this PR. The file-level comment says "Everything degrades to today's behavior when unconfigured," yet the prior transcript-based forceFinalAnswer/finalizeInstruction path was deleted (confirmed in the diff and the renamed TestReviewExecutor_ExhaustionWithoutAnswerIsError test comment). On empty-output exhaustion the new code returns an error where the old code forced a final answer, so it does not fully degrade to pre-PR behavior. Wording overpromises; trivial but misleading.

  • entrypoint.sh:208MAX_DIFF_CHARS env name is a shell-only alias invisible from the Go side. The downshift sets export MAX_DIFF_CHARS=... with a comment "run.sh -> GADFLY_MAX_DIFF_CHARS." Confirmed: scripts/run.sh:180 translates MAX_DIFF_CHARS to GADFLY_MAX_DIFF_CHARS when invoking the binary, and the Go tree reads only GADFLY_MAX_DIFF_CHARS (main.go:361) / GADFLY_RECHECK_DIFF_CHARS (recheck.go). MAX_DIFF_CHARS appears nowhere in Go. The indirection lives entirely in shell, documented only by the inline comment.

None are blocking; the code compiles and the structure is broadly clean. The main maintainability risk is the duplicated reviewer-tool membership logic between toolbox() and gadflyToolRegistry.

Performance — No material issues found

Verdict: No material issues found

Through the performance lens I checked the hot paths this PR touches:

  • get_diff pagination (cmd/gadfly/tools.go:434-464): windowDiff does strings.Split(diff, "\n") on every call, allocating the full line slice even when only an 800-line window is emitted. For a huge (~1 MB) diff this is O(diff) work per call rather than O(window). I considered flagging it, but it's a micro-optimization: the split is microseconds against model-call latencies of seconds, and get_diff is called a handful of times per pass, not in a tight loop. The output is correctly bounded (maxGetDiffBytes/maxGetDiffLines), so there's no unbounded growth. Not material.
  • diffForPath (tools.go:470-483): single linear scan per path-scoped call; fine.
  • Compaction / budget wiring (executus.go): resolveContextTokens runs once at executor build (memoized via the closure), addUsage/Commit are O(1) under a short mutex per pass — no per-step or per-token hot-path overhead introduced.
  • runSpecialists fan-out (main.go:247-286): concurrency semantics are unchanged (still bounded by lensConcurrency()); fanout.Run returns one result per input with no extra copying of note.

No efficiency regressions (N+1, unbounded growth, quadratic behavior, hot-loop work, or missing limits) were found in this change.

🧯 Error handling & edge cases — Minor issues

All three findings verified against the actual source.

VERDICT: Minor issues found

  • Backstop race can kill the consolidation pass (entrypoint.sh:338-364). The watchdog subshell runs sleep $BUDGET; pkill -TERM …; sleep 5; pkill -KILL -f '/usr/local/bin/gadfly'. Disarm is kill "$KILLER_PID" (line 364) after wait "${LANE_PIDS[@]}" returns (line 360). If the budget expires at the same moment the lanes finish, the subshell has already passed its second sleep and is executing pkill -KILL; killing the subshell PID does not reliably cancel an already-running pkill child. The subsequent consolidation step (line 392-393) invokes /usr/local/bin/gadfly, so a late KILL pkill can terminate it — defeating the "partial findings still posted" guarantee. Suggested fix: gate the KILL pkill on the .budget_killed sentinel, or have the disarm also pkill the subshell's children / set a flag the pkill block checks before issuing KILL.

  • diffForPath substring match over-selects files (cmd/gadfly/tools.go:470-483). Matching is strings.Contains(ln, path) on the diff --git header line, so a short path like x also matches diff --git a/xy b/xy (and ax, etc.). The tool returns an inflated/incorrect diff window for that "file" rather than the requested one, with no error. For a reviewer scoping a finding to one file this silently feeds it the wrong hunks. Suggested fix: match the path against the a//b/ path tokens exactly.

  • Startup-blocking network call for context window (cmd/gadfly/executus.go:349-364). resolveContextTokens makes a live call (up to 15s timeout via context.WithTimeout) during newReviewExecutor, called on the review startup path (main.go:178) before any lens runs. When the Ollama Cloud endpoint is unreachable/slow, every model lane eats up to 15s of dead time before compaction silently disables (returns 0). It degrades safely, but the blocking-on-startup placement is an edge case worth noting; a cached/async resolution would avoid stalling small PRs that happen to point at a flaky endpoint.

kimi-k2.6:cloud (ollama-cloud) — Minor issues

Verdict: Minor issues · ⚠️ 1/5 lens(es) errored — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

Verdict: No material issues found

Reviewed through the Security lens. I traced all new code paths that touch untrusted input (the PR diff, LLM tool arguments, environment variables) and checked for authn/authz gaps, injection vectors, SSRF, credential leakage, and unsafe data handling.

Key areas verified:

  • getDiffArgs.Path / diffForPath (cmd/gadfly/tools.go:414–482): The new Path parameter is not used for filesystem access — it is only passed to strings.Contains as a substring filter against the already-in-memory unified diff. No path traversal, directory escape, or file-read amplification is possible. The repoFS.resolve sandbox is never invoked.
  • windowDiff pagination (cmd/gadfly/tools.go:434–463): start and limit are clamped to valid ranges; negative values and overflows are handled.
  • GADFLY_NOTICE in entrypoint.sh / run.sh (entrypoint.sh:210, scripts/run.sh:233–236): PR_DIFF_BYTES is produced by wc -c | tr -d '[:space:]', which is strictly numeric. The notice is therefore not injectable with attacker-controlled content in the intended flow.
  • resolveContextTokens API key handling (cmd/gadfly/executus.go:349–363): Keys are read from env and passed to model.NewCloudOllamaLimitCache; no logging or exposure of the key.
  • Backstop pkill patterns (entrypoint.sh): The pkill -f arguments are hardcoded literals (/usr/local/bin/gadfly) or derived from hardcoded SCRIPTS_DIR; no user-controlled path components.
  • Budget / caller ID isolation (cmd/gadfly/executus.go:306–315): The prCallerID function keys the budget on repo#pr or falls back to "local". The budget struct itself is per-process, so there is no cross-PR budget leakage.

The change is clean from a security perspective.

🎯 Correctness⚠️ could not complete

⚠️ This reviewer failed to complete: agent: step 11: all chain targets failed
ollama-cloud/kimi-k2.6☁️ model returned an empty response

🧹 Code cleanliness & maintainability — Minor issues

Minor issues

  • cmd/gadfly/main.go:44 — Stale comment about removed finalization fallback. The file-header comment still says "Plus a tool-free finalization fallback guarantees a step-exhausted pass still emits output." That fallback (finalizeInstruction + forceFinalAnswer) was explicitly dropped in this PR (the executus re-platform removes the transcript-based forced-finalization path, confirmed by the updated wrapup_test.go comment and the deletion of both the const and function from main.go). The comment now misleads readers about behavior that no longer exists. Fix: Remove that sentence from the comment, or rephrase to say the wrap-up critic is the sole mechanism.

  • cmd/gadfly/recheck_test.go:52 — Test sets the wrong environment variable. TestBuildRecheckTask calls t.Setenv("GADFLY_MAX_DIFF_CHARS", ""), but buildRecheckTask was changed in this PR to read GADFLY_RECHECK_DIFF_CHARS (see recheck.go:94). Because the test diff is tiny, truncation never triggers and the test still passes, but the test is now exercising a dead env var instead of the new one. Fix: Change the Setenv call to GADFLY_RECHECK_DIFF_CHARS, and add a short diff + explicit override value so the truncation path is actually covered.

Performance — No material issues found

No material issues found

The diff is entirely additive performance protections for large PRs: context compaction at a deliberately aggressive 0.45 threshold, paginated get_diff capped at 800 lines/64 KB per call, a smaller 20 k-char recheck diff embed, per-PR token/time budget gates, and fanout.Run replacing the existing semaphore-based lens fan-out with equivalent concurrency behavior. No allocations or work were added to hot paths for small PRs, and no quadratic or unbounded-growth patterns were introduced.```gadfly-findings
[]


</details>

<details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary>

**Minor issues**

- **`cmd/gadfly/tools.go:475` — `diffForPath` substring matching can return wrong file's hunks.** `diffForPath` uses `strings.Contains(ln, path)` on the `diff --git` header. If one filename is a substring of another (e.g., the model passes `bar.go` and the diff also contains `bar_test.go`), both sections match and are concatenated into the returned diff. The model may then verify findings against the wrong code. The fix is to match complete path components (e.g., require the match to be bounded by `/` or space) rather than a bare substring.

</details>

</details>

<details><summary><b>qwen3.5:397b-cloud</b> (ollama-cloud) — No material issues found</summary>

**Verdict: No material issues found · ⚠️ 3/5 lens(es) errored** — 5 reviewers: security, correctness, maintainability, performance, error-handling

<details><summary><b>🔒 Security</b> — ⚠️ could not complete</summary>

⚠️ This reviewer failed to complete: agent: step 5: context deadline exceeded

</details>

<details><summary><b>🎯 Correctness</b> — ⚠️ could not complete</summary>

⚠️ This reviewer failed to complete: agent: step 9: all chain targets failed
ollama-cloud/qwen3.5:397b-cloud: ollama ollama-cloud/qwen3.5:397b-cloud: do request: Post "https://ollama.com/api/chat": context deadline exceeded

</details>

<details><summary><b>🧹 Code cleanliness & maintainability</b> — No material issues found</summary>

## VERDICT: No material issues found

After thorough review through the **code cleanliness & maintainability** lens, I found no material issues. The code is well-structured and follows good practices:

**Positive observations (verified by reading the code):**

1. **No dead/duplicated code**: The old `runAgent`, `forceFinalAnswer`, and finalize-related code was properly removed when replaced by the executus-based `reviewExecutor.run()` (confirmed in `cmd/gadfly/main.go` and `cmd/gadfly/executus.go`).

2. **Clear naming and structure**: New types like `reviewExecutor`, `gadflyBudget`, `wrapUpCritic`, and `wrapUpHandle` have descriptive names and focused responsibilities. The `wrappedTool` adapter cleanly bridges majordomo tools to executus's interface.

3. **Consistent patterns**: The code follows existing patterns—env var resolution via `envInt()`, graceful degradation when optional features are disabled (nil budget, nil summarizer), and proper mutex usage in `gadflyBudget`.

4. **Good documentation**: Comments explain the "why" behind design choices (e.g., why `defaultCompactRatio` is 0.45 instead of executus's 0.7, why the transcript-based fallback was dropped).

5. **Tests updated appropriately**: Test functions were renamed from `TestRunAgent_*` to `TestReviewExecutor_*` and use the new `newTestReviewExecutor` helper, maintaining test coverage for the new execution path.

**Minor note (not a finding, just an observation):**
- The `diffForPath` function in `cmd/gadfly/tools.go:470` uses substring matching (`strings.Contains`) on the full `diff --git` header line. This could theoretically match unintended files if paths share substrings (e.g., searching for "test" might match both "test.go" and "latest.go"), but in practice users provide full paths and the test at `tools_test.go:197` validates the common case works correctly.

</details>

<details><summary><b>⚡ Performance</b> — No material issues found</summary>

**No material issues found**

The performance-focused changes in this PR directly address the stated problem (large-PR token burn) without introducing efficiency regressions:

1. **Paginated `get_diff`** (`cmd/gadfly/tools.go:407-483`): Correctly limits output to 800 lines / 64KB per call instead of dumping the entire diff. The `strings.Split` allocation is acceptable since the diff is already held in memory by `repoFS`, and the output caps prevent token explosion.

2. **Context compaction** (`cmd/gadfly/executus.go:240-254`): Wired with a deliberately low threshold (0.45× context window) that fires before transcripts balloon. The `resolveContextTokens` function makes exactly one network call at executor build time (cached), not per-pass.

3. **Downshift logic** (`entrypoint.sh:186-212`): Single diff size check (`wc -c`) triggers reduced work (one model, fewer lenses/steps, no recheck) for large PRs.

4. **Budget gate** (`cmd/gadfly/executus.go:108-143`): Simple mutex-protected counters; negligible overhead.

Verified by reading `cmd/gadfly/tools.go:429-483` (windowDiff/diffForPath implementation) and `cmd/gadfly/executus.go:342-364` (context token resolution). No N+1 patterns, unbounded growth, or hot-path blocking introduced.

</details>

<details><summary><b>🧯 Error handling & edge cases</b> — ⚠️ could not complete</summary>

⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
ollama-cloud/qwen3.5:397b-cloud: skipped (backed off until 15:05:25.229)

</details>

</details>

</details>

<sub>Automated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.</sub>
<!-- gadfly-consensus --> ## 🪰 Gadfly review — consensus across 6 models **Verdict: Blocking issues found** · 15 findings (6 with multi-model agreement) | | Finding | Where | Models | Lens | |--|--|--|--|--| | 🟠 | diffForPath re-splits the full diff on every scoped get_diff call, allocating O(diff_size) per invocation | `cmd/gadfly/tools.go:470` | 5/6 | correctness, error-handling, performance | | 🟠 | Package doc still advertises the tool-free finalization fallback this PR removed | `cmd/gadfly/main.go:44` | 3/6 | maintainability | | 🟡 | Per-PR token budget only updated after each pass (addUsage post-run), so Budget.Check can't stop a single runaway pass mid-flight — token guard is pass-granular, weaker than its name implies | `cmd/gadfly/executus.go:283` | 3/6 | error-handling, maintainability | | 🟠 | resolveContextTokens silently disables compaction on network errors with no log | `cmd/gadfly/executus.go:357` | 2/6 | error-handling | | 🟡 | Test sets obsolete GADFLY_MAX_DIFF_CHARS instead of GADFLY_RECHECK_DIFF_CHARS | `cmd/gadfly/recheck_test.go:52` | 2/6 | maintainability | | 🟡 | windowDiff re-splits the whole diff per get_diff call; paging a large diff is O(n^2/limit) on the large-PR hot path | `cmd/gadfly/tools.go:435` | 2/6 | correctness, performance | <details><summary>9 single-model findings (lower confidence)</summary> | | Finding | Where | Model | Lens | |--|--|--|--|--| | 🟠 | Backstop KILL pkill can race the disarm and kill the consolidation gadfly pass | `entrypoint.sh:345` | glm-5.2:cloud | error-handling | | 🟠 | Budget watchdog kill may not prevent delayed SIGKILL from catching consolidation pass | `entrypoint.sh:364` | deepseek-v4-pro:cloud | error-handling | | 🟡 | gadflyToolRegistry re-derives fsTools()+delegateTool, duplicating toolbox()'s membership logic (two sources of truth) | `cmd/gadfly/executus.go:82` | glm-5.2:cloud | maintainability | | 🟡 | GADFLY_TIMEOUT_SECS/MAX_STEPS configured in multiple places; Config.Defaults.MaxRuntime/MaxIterations duplicated verbatim with per-run RunnableAgent values | `cmd/gadfly/executus.go:258` | claude-code/opus:max | maintainability | | 🟡 | compactionEnabled() is a verbatim copy of recheckEnabled(); extract a shared env-bool helper | `cmd/gadfly/executus.go:319` | claude-code/opus | maintainability | | 🟡 | toolbox() is now dead in production (only tests call it); its tool-membership logic is duplicated in gadflyToolRegistry | `cmd/gadfly/tools.go:115` | glm-5.2:cloud | maintainability | | 🟡 | windowDiff truncation message off-by-one: says 'truncated at line N' when N was the last emitted line | `cmd/gadfly/tools.go:461` | deepseek-v4-pro:cloud | error-handling | | ⚪ | file docstring claims 'degrades to today's behavior' but the old forced-finalization fallback was removed by this PR, so empty-output exhaustion no longer matches prior behavior | `cmd/gadfly/executus.go:19` | glm-5.2:cloud | maintainability | | ⚪ | downshift exports MAX_DIFF_CHARS (not GADFLY_MAX_DIFF_CHARS); the aliasing lives only in run.sh and is invisible from the Go side | `entrypoint.sh:208` | glm-5.2:cloud | maintainability | </details> <details><summary>Per-model detail</summary> <details><summary><b>claude-code/opus</b> (claude-code) — Minor issues</summary> **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> I've established the key facts: 1. **The gadfly-side code is confirmed exactly as the draft describes** — `resolveContextTokens` (`cmd/gadfly/executus.go:349-364`) reads `GADFLY_API_KEY`/`OLLAMA_API_KEY` and passes it into `model.NewCloudOllamaLimitCache("", key, nil)` with no provider gate, on the default review path. 2. **The executus module source is genuinely inaccessible** from this sandbox — it's in the module cache outside the working directory (the `ls`/`find` calls are blocked for security), there's no `vendor/` directory, and no network. This is the same wall the original reviewer hit. The finding's entire materiality hinges on a behavior the draft explicitly **could not verify**: whether `executus/model.MaxContextTokensResolving` actually performs a keyed `/api/show` network fetch *for non-Ollama model specs*. The draft author marked this as unverifiable ("If executus only fetches when the spec resolves to an ollama-cloud model, this is benign"), assigned it **low confidence**, and the verification gate cannot positively confirm the egress mechanism against the code. The available evidence actually leans *against* the finding: the cache is named `CloudOllamaLimitCache`, the author's own contract comment scopes the fetch to "an Ollama Cloud model's limit via `/api/show`" and states it "returns 0 — disabling compaction — for an unknown model," consistent with a static-catalog-first resolver that doesn't blindly POST the key for arbitrary specs. Per the verification discipline — keep only what I can positively confirm — this finding must be dropped. --- ## VERDICT: No material issues found **Security review (🔒) — re-platform onto executus + large-PR cost controls** I traced the attacker-controlled input — the PR diff under review — through the new code and found no injection/path/SSRF sink: `diffForPath`/`windowDiff` only substring-match and slice the in-memory diff string and return it to the model; `get_diff`'s new `path`/`start_line`/`limit` args never touch the filesystem (`cmd/gadfly/tools.go:411-464`). The `pkill -f` backstop matches fixed, operator-controlled path patterns, not anything from the diff. Budget/critic code handles no untrusted data. The budget caps, caller-id derivation (`prCallerID`, numeric PR + `owner/repo`), and env parsing introduce no new trust-boundary issues, and no secrets are written to logs/error strings. The one drafted concern (a possible egress of a non-Ollama provider's secret to `api.ollama.com` via `resolveContextTokens`) depends entirely on `executus/model.MaxContextTokensResolving` performing a keyed network fetch for non-Ollama model specs — behavior that cannot be confirmed against the code from this checkout and which the surrounding contract comments suggest does not occur. Unconfirmable, so dropped. </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> Confirmed. Line 475 uses `strings.Contains(ln, path)`, an unanchored substring match against the `diff --git` header. A `path` arg that is a substring of another changed file's header (e.g. `api.go` ⊂ `internal/api.go`, or any short path that appears in another file's a/b sides) pulls that file's hunks in too. Impact is as described — extra context only, never a missing/wrong hunk — and the `_test` infix spares the common `tools.go`/`tools_test.go` collision. The finding holds. ## VERDICT: Minor issues - **`diffForPath` substring matching over-includes files** (`cmd/gadfly/tools.go:475`). The section selector is `strings.Contains(headerLine, path)`. The arg is documented as a full path (`e.g. pkg/foo/bar.go`), but substring matching means a `path` that is a substring of another changed file's `diff --git` header pulls in *that* file's hunks too. Concrete case: a PR touching both `api.go` and `internal/api.go`, with `get_diff(path="api.go")`, returns *both* sections. Impact is mild — extra context, never a missing/wrong hunk — but it dilutes the per-file scoping that exists precisely to cut tokens on huge PRs (the PR's whole motivation). The common test-file collision (`tools.go` vs `tools_test.go`) is *not* hit thanks to the `_test` infix, so this is low-frequency. Suggested fix: anchor on the header tokens, e.g. match `" a/"+path` or `" b/"+path` (still covers renames), instead of a bare `Contains`. Severity small. </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> All three findings are confirmed against the actual code: 1. **main.go:44-45** — the doc sentence about the "tool-free finalization fallback" survives, and `forceFinalAnswer`/`finalizeInstruction` are entirely gone from non-test code (grep returns nothing). The comment now contradicts the code. 2. **executus.go:319-326** — `compactionEnabled()` is byte-for-byte identical to `recheckEnabled()` (recheck.go:68-75) modulo the env var name. 3. **Timeout expression** — `time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second` appears at main.go:325, executus.go:258, and executus.go:285 — exactly three identical copies. All three survive verification. Outputting the corrected review. ## VERDICT: Minor issues Reviewing strictly through the **code cleanliness & maintainability** lens. The new `executus.go` is well-structured and thoroughly commented, and the `runSpecialists` fanout rewrite is a clean swap. Three maintainability nits, all verified by reading the code: - **`cmd/gadfly/main.go:44` — stale package doc describes removed behavior.** The package-level doc still says `GADFLY_WRAPUP_RESERVE` is backed by "*a tool-free finalization fallback [that] guarantees a step-exhausted pass still emits output.*" This PR **deletes** that fallback (`forceFinalAnswer`/`finalizeInstruction` are gone — a repo grep finds no non-test references left — and `wrapup_test.go` is updated to document that exhaustion now surfaces as an error). main.go:44 is the only non-test mention of the fallback left, and it now directly contradicts the new behavior. A reader trusting this comment will believe a guarantee that no longer holds. Fix: drop the "Plus a tool-free finalization fallback…" sentence (the wrap-up nudge is now the only "always emit something" mechanism). - **`cmd/gadfly/executus.go:319-326` — `compactionEnabled()` is a verbatim copy of `recheckEnabled()`.** It is byte-for-byte the same "on unless explicitly falsey" pattern as `recheck.go:68-75` (`switch strings.ToLower(strings.TrimSpace(os.Getenv(...)))` over `"0","false","no","off"` → `default: true`). This is exactly the copy-paste the lens targets: two identical env-bool parsers that will drift. Fix: extract one helper (e.g. `envBoolDefaultTrue(name string) bool`) and have both call it. - **`cmd/gadfly/executus.go:258` & `:285` (and `main.go:325`) — the per-lens timeout expression is written three times.** `time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second` appears identically in `Defaults.MaxRuntime`, in `reviewExecutor.run`'s `RunnableAgent.MaxRuntime`, and in `reviewWithSpecialist`. The two inside `executus.go` are also redundant with each other (the `RunnableAgent` value re-specifies what `Defaults` already sets for the same executor). A small `lensTimeout()` helper (mirroring the existing `wrapUpReserve()`/`lensConcurrency()` style) would remove the repetition and the chance of one copy being changed without the others. Trivial, but it's the kind of duplication that rots. Nothing else in my lane: naming, function length, and the `wrappedTool`/`gadflyBudget`/`wrapUpCritic` abstractions read clearly and match the surrounding style, and the new tests are well-named for the renamed symbols. </details> <details><summary><b>⚡ Performance</b> — Minor issues</summary> I've verified both findings against the actual code in `cmd/gadfly/tools.go`. **Finding 1 (`windowDiff`, line 435):** Confirmed. `windowDiff` calls `strings.Split(diff, "\n")` over the whole input at line 435, and the `start`/`limit` bounds are applied only afterward (the loop at 448–459 indexes into the fully-materialized `lines` slice). Paging to the end of a large diff re-scans from line 1 each call → O(n²/limit). The magnitude caveat is also accurate: Go's `strings.Split` substrings share the original backing bytes, so the per-call cost is the line-descriptor slice plus the O(n) scan, not a byte copy. **Finding 2 (`diffForPath` + `windowDiff`, lines 415/470):** Confirmed. A path-scoped call runs `diffForPath` (full `strings.Split` at line 473) then `windowDiff` (a second split at line 435) — two splits per path-scoped call. Both findings positively reproduce against the code and are honestly scoped as minor. Both survive. ## VERDICT: Minor issues **Performance lens — findings** - **`cmd/gadfly/tools.go:435` (`windowDiff`) re-splits the entire diff on every `get_diff` call, even to return one 800-line window.** `strings.Split(diff, "\n")` materializes a `[]string` over the *whole* diff regardless of `start`/`limit`. On exactly the large PRs this feature targets, the agent is steered to call `get_diff` repeatedly — paging via `start_line` and re-fetching per `path`. Each call is O(total-diff-lines) work plus a fresh line-slice allocation, so **paging a large diff to its end is O(n²/limit)**: page *k* still scans/splits from line 1. The `start`/`limit` bounds are applied only *after* the full split, and the loop indexes `lines[i]` from a slice built over everything. - Magnitude is bounded: `strings.Split` substrings share the original string's backing bytes (no byte copy), so the per-call cost is the line-descriptor slice (~16 B × line-count) plus the O(n) scan — GC/CPU churn on the large-PR path, not a token blowup. A real efficiency-by-omission, not a blocker. - Suggested fix: precompute the line index once (split `r.diff` into a cached `[]string` on the `repoFS`, guarded for the immutable diff) and have `windowDiff` slice the cached lines; paging then becomes O(window) per call. - **`cmd/gadfly/tools.go:415` + `:470` — a path-scoped `get_diff` splits the diff twice.** `diffForPath` does a full `strings.Split` + section rebuild (O(n)), then `windowDiff` splits that section again. Two full passes/allocations per path-scoped call. Minor on top of the above; the cached-lines approach would let `diffForPath` return line indices and avoid the second split. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> I've established the key facts: 1. **The ordering inversion is real and confirmed** from the diff: new `executus.go:run` checks `res.Err != nil` (returns error) *before* checking `res.Output`, whereas the old `runAgent` returned non-empty `res.Output` first and only surfaced `runErr` when output was empty. 2. **The impact is unverifiable.** The finding is only a real regression *if* executus's `run.Executor.Run` can return a populated `Output` together with a non-nil `Err`. The executus module source lives in the Go module cache outside the working directory, and every attempt to read it (Read, grep, `go doc`, `cat`, even with the sandbox disabled) is blocked by the working-directory allowlist. I cannot positively confirm that executus ever returns `Output`+`Err` together. The draft itself concedes this: it rates confidence "medium" and states the impact "hinges on whether executus ever returns Output+Err together, which I could not confirm." A verification gate keeps a finding only when it can be positively confirmed against the code. The ordering inversion alone is not a defect — it is harmful *only* under an executus semantic I cannot establish, and a reasonable reading (a run kernel that returns empty `Output` whenever `Err` is set, with the wrap-up critic ensuring output is emitted before the cap on the success path) makes it a non-issue. This is exactly the kind of speculative, unconfirmable impact the gate must drop. All other items in the draft were explicitly *not* flagged by the draft itself. --- ## VERDICT: No material issues found The new executus re-platform code is well-guarded against the error/edge-case lens: `gadflyBudget` is nil-safe on every method (`executus.go:123-155`, covered by `executus_test.go`), `compactRatio` clamps to `(0,1]` and falls back on parse error, `resolveContextTokens` degrades to `0`→compaction-off on an unknown model, a bad `GADFLY_COMPACT_MODEL` disables compaction rather than sinking the review (`main.go:172-176`), and the lens-panic recover survived the fanout rewrite (`main.go:255-265`). The one drafted concern — that `reviewExecutor.run` checks `res.Err` before `res.Output` and could discard a completed review — was dropped: the ordering inversion is real, but its impact depends entirely on whether executus's `run.Executor.Run` returns a populated `Output` together with a non-nil `Err`, which cannot be confirmed against the available source (the executus module is outside the readable tree). Without that confirmation the finding is speculative, not a verified defect. </details> </details> <details><summary><b>claude-code/opus:max</b> (claude-code) — Minor issues</summary> **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> I've completed verification. Here is the outcome: **Confirmed in-repo (sub-claims):** `GADFLY_API_KEY` is indeed the generic provider key (model.go:36–39), and `resolveContextTokens` (executus.go:349–364) does unconditionally build `model.NewCloudOllamaLimitCache("", key, nil)` with that generic key and an empty base URL, regardless of provider, when called with `os.Getenv("GADFLY_MODEL")` from `newReviewExecutor`. **Not confirmable (the finding's actual impact):** Whether `model.MaxContextTokensResolving` actually issues an authenticated network request to ollama.com carrying that key for a non-Ollama, catalog-miss model lives entirely in the `executus/model` package. That source is outside the sandbox-allowed path (`/tmp/gadfly/repo` only, no vendor dir, module cache access blocked), so I cannot positively confirm any of the three conditions the leak requires: that the call fires for non-Ollama models, that an empty base URL resolves to the default cloud host rather than a no-op, or that the key is attached. The draft itself rated this low-confidence and explicitly unverified. Per the gate rule — keep only what can be positively confirmed — this finding must be dropped: its entire impact rests on unverifiable cross-module behavior. --- ## VERDICT: No material issues found This PR is largely a refactor (executus re-platform) plus cost-control plumbing. Through the **Security** lens, the read-only tool surface stays read-only and no new injection/authz sink is introduced. ### Checked and clear - **`entrypoint.sh` pkill backstop**: fixed, non-attacker-controlled patterns inside the CI container; `GADFLY_PR_BUDGET_SECS` numerically guarded before use. No injection. - **`cmd/gadfly/tools.go` get_diff / `diffForPath` / `windowDiff`**: the new `path`/`start_line`/`limit` args operate purely on the in-memory diff string (substring match + line slicing) — no filesystem access, no path traversal, no new injection sink. - **`gadflyBudget` error messages / `prCallerID`** (`executus.go`): expose only token/second counts and a repo#PR audit key — no secrets, not used in any command. - **`resolveContextTokens`** (`executus.go:349`): builds an Ollama limit cache from the generic `GADFLY_API_KEY`; the claimed credential-exposure path depends on `executus/model` network behavior that cannot be confirmed from this checkout and is gated behind multiple unverified conditions — not a confirmable finding. </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> The finding is confirmed. The code at `cmd/gadfly/tools.go:470-483` matches the draft's description exactly: `inSection = strings.Contains(ln, path)` is an unanchored substring match against the full `diff --git a/… b/…` header line, so a `path` that is a substring of another changed file's path (e.g. `foo.go` vs `vendor/foo.go`) pulls in both sections. This survives verification. --- **VERDICT: Minor issues** - **`cmd/gadfly/tools.go:470` (`diffForPath`) — substring path match over-includes unrelated files.** The per-file scoper toggles `inSection = strings.Contains(ln, path)` against the whole `diff --git a/… b/…` header line. Because it's an unanchored substring over the entire line, a `path` that is a substring of another changed file's path pulls in *both* files' hunks. Concretely, with changed files `foo.go` and `vendor/foo.go`, calling `get_diff(path="foo.go")` returns the `vendor/foo.go` section too (its header `diff --git a/vendor/foo.go b/vendor/foo.go` contains `"foo.go"`). This partially defeats the per-file scoping precisely on the large PRs the feature targets (more files → more collisions → more diff re-injected than intended — the exact token-burn this PR fights). The existing test only covers a non-colliding `y.go` case so it doesn't exercise this. Severity is small in practice since the model usually passes full paths from the changed-files list. Suggested fix: match on a path-boundary token, e.g. require ` a/<path>` or ` b/<path>` rather than a bare `strings.Contains`. </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> The executus `run` package source isn't reachable from this sandbox (bash is restricted to the repo, and the module isn't vendored). The draft itself flagged this same limitation — it couldn't confirm executus's `Defaults` vs per-`RunnableAgent` precedence. However, the finding does not depend on that precedence: the verbatim duplication of the timeout expression across `executus.go:258`, `executus.go:285`, and `main.go:325`, plus `Defaults.MaxIterations` coexisting with an always-set per-run `MaxIterations`, is fully confirmed in the repo. The redundancy (multiple sources of truth for one knob) holds whichever layer wins precedence. The finding survives. I'll drop its "Verification" sub-bullet (process meta-commentary) per output rules. --- VERDICT: Minor issues **🧹 Code cleanliness & maintainability** - `cmd/gadfly/executus.go:258` & `:285` (with `cmd/gadfly/main.go:325`) — **`GADFLY_TIMEOUT_SECS` / `GADFLY_MAX_STEPS` are configured in three/two places, and the `Config.Defaults` copies are redundant with the per-run values.** The expression `time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second` is written verbatim twice in `executus.go` (once in `Config.Defaults.MaxRuntime`, once in the per-run `RunnableAgent.MaxRuntime`), and the same timeout is *also* applied as the `context.WithTimeout` deadline in `reviewWithSpecialist` (`main.go:325`). Likewise `Defaults.MaxIterations` is set from `GADFLY_MAX_STEPS`, but every `run()` passes an explicit `MaxIterations: maxSteps` on the `RunnableAgent` (always non-zero, via `GADFLY_MAX_STEPS`/`GADFLY_RECHECK_MAX_STEPS` defaults). The other `Defaults` fields — `MaxConsecutiveToolErrors`, `MaxSameToolCallRepeats`, `CompactionThresholdRatio`, `FallbackTier` — *are* needed, so `Defaults` can't simply be dropped wholesale. - *Impact:* two/three sources of truth for one knob. Whichever layer executus gives precedence, one set of these assignments is dead: a maintainer who later retunes the timeout/step cap via `Config.Defaults` (the natural-looking place) may see no effect, or change one site and not the others. Pure maintainability hazard, not a runtime bug today. - *Suggested fix:* pick one layer — drop `MaxIterations`/`MaxRuntime` from `Config.Defaults` and let the per-`RunnableAgent` values be the single source, or set them only in `Defaults` and leave the per-run fields zero. If the per-run `MaxRuntime` is kept, hoist `envInt("GADFLY_TIMEOUT_SECS", …) * time.Second` into a small helper so the expression isn't copy-pasted across the three sites. </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> ## VERDICT: No material issues found Reviewed strictly through the **⚡ Performance** lens. This PR is, on the whole, a performance/cost *improvement* — its entire purpose is to kill the large-PR token-burn amplifier. I verified the hot paths against the checked-out code and found no material regression. What I checked: - **`get_diff` no longer dumps the whole diff** (`tools.go:434-483`). The old tool returned `r.diff` verbatim, which on a ~1MB diff was re-transmitted every step × lens × pass. `windowDiff` now caps each call to `maxGetDiffLines`/`maxGetDiffBytes` (800 lines / 64KB). This is a strict reduction in per-step transmitted bytes — the core win. Confirmed. - **Embedded diff is bounded once per pass** (`main.go:361-377`, `recheck.go`): review embed stays at 60K, recheck embed dropped 60K→20K. Built once per task string, not per step. The per-step re-send of that bounded string is inherent to the agent loop and unchanged; compaction (`executus.go`) now folds the runaway transcript middle, which is a net reduction. No new amplification. - **`resolveContextTokens`** (`executus.go:349-363`) does a network `/api/show` call, but it's invoked **once at executor-build time** (per process, via `newReviewExecutor` called once in `run()`), not per pass/step. 15s timeout bounds it. Not on a hot path. - **`gadflyBudget`** lock/`addUsage`/`Check` are O(1) mutex ops; `Check` gating before model calls is negligible. - **fanout migration** (`main.go` `runSpecialists`) preserves `MaxConcurrent: lensConcurrency()` (default 1 → sequential); no oversubscription change. Non-blocking observation (a micro-optimization, not flagged as a finding): `windowDiff` and `diffForPath` each do `strings.Split(r.diff, "\n")` on the **entire** diff per `get_diff` call — and a `path`-scoped call splits the full diff twice (once in `diffForPath`, again in `windowDiff` on the extracted section). On a huge PR, a model paging through it re-allocates the full line slice on every call. The magnitude (a few ms / a transient slice of N strings) is trivial next to model inference and network latency, and it still strictly beats the old "send the whole diff every step" behavior, so I'm not flagging it as a regression. Worth a comment only if `get_diff` ever moves into a tight programmatic loop. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> I've verified the single finding against the actual code. The claim is confirmed: - `gadflyBudget.Check` (executus.go:123–136) reads `b.tokens`. - `b.tokens` is only mutated by `addUsage` (executus.go:147–155). - `reviewExecutor.run` (executus.go:279–304) calls `r.ex.Run(...)` and *then* `r.budget.addUsage(res.Usage)` afterward (line 295). So token usage is fed to the budget only once per completed pass, and `Check` sees a static token count within any single run — the token gate cannot stop a runaway first pass mid-flight, only subsequent passes. The finding is accurate, and the mitigations it cites (per-pass `Commit` seconds, per-lens `MaxRuntime`, entrypoint `GADFLY_PR_BUDGET_SECS`) all exist in the code. The finding survives. ## VERDICT: Minor issues The new code is, on the whole, defensively written — nil-safe budget methods, guarded `start`/`limit` clamping in `windowDiff`, friendly (non-error) returns for empty/unknown `get_diff` paths, panic recovery inside the fanout closure, and graceful "reviewer failed/did not run" notices instead of crashes. One genuine edge-case gap is worth noting. ### Findings - **`cmd/gadfly/executus.go:283` (and `:123–155`, the `gadflyBudget` token gate) — the per-PR *token* budget can't bite mid-pass, only between passes.** `Budget.Check` is the executor's pre-call gate, but it reads `b.tokens`, which is only updated by `reviewExecutor.run` calling `r.budget.addUsage(res.Usage)` *after* `ex.Run` returns (the `run.Ports.Budget` interface only carries `Commit(seconds)`, with no per-call token feed — tokens are fed "out-of-band" once per finished pass). Consequence: the very first lens/pass on a pathological PR runs to completion regardless of `GADFLY_PR_TOKEN_BUDGET`; the cap only stops *subsequent* lenses. Since the whole PR is about stopping one PR from draining a usage block, the in-binary token guard is weaker than its name suggests. - *Mitigation that lowers severity:* within a single pass, tokens are still bounded by `GADFLY_MAX_STEPS` × the (compacted) context window, and the wall-clock layers (`GADFLY_PR_TIME_BUDGET_SECS` per pass via `Commit`, the per-lens `GADFLY_TIMEOUT_SECS` `MaxRuntime`, and entrypoint's `GADFLY_PR_BUDGET_SECS`) are the real swarm backstop. So this is a real-but-mitigated granularity gap, not a hole. - *Suggested fix / or just document:* either note in the comment + README that `GADFLY_PR_TOKEN_BUDGET` is enforced at *pass* granularity (a single runaway pass can overshoot), or, if executus exposes an incremental usage hook, feed `addUsage` per step rather than per pass. </details> </details> <details><summary><b>deepseek-v4-pro:cloud</b> (ollama-cloud) — Blocking issues found</summary> **Verdict: Blocking issues found** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> **Verdict: No material issues found** I traced the full tainted-input flow through the new and modified code — from environment variables and API responses through to where they're consumed (URL construction, shell commands, markdown output, filesystem paths, and network calls). Here's what I checked and why nothing rises to a material security finding: - **`entrypoint.sh` `PR` in API URLs** (line 197, new; also pre-existing at lines 78, 89, 130, 146, 391): `PR` is interpolated into Gitea API URL paths without validation. If it contained `../` sequences, it could traverse the API surface. However, `PR` originates from the CI event context (`${PR:?PR required}`), which is a pull-request number set by Gitea Actions — not attacker-controlled input. This is the same pattern used throughout the pre-existing script. Not a new risk. - **`entrypoint.sh` `pkill -f` backstop** (lines 342–345, new): The hardcoded patterns `/usr/local/bin/gadfly` and `${SCRIPTS_DIR}/run.sh` (where `SCRIPTS_DIR=/app/scripts`, hardcoded on line 58) are used with `pkill -f`. No shell injection is possible because neither pattern incorporates any external input. The `-f` flag matches full command lines; in a single-purpose container this is safe. - **`entrypoint.sh` `GADFLY_NOTICE` in PR comments** (lines 210, 369, 396; `run.sh` lines 236–237): The notice text is constructed entirely by the script itself. The only interpolated values are `PR_DIFF_BYTES` (from `wc -c | tr -d '[:space:]'`, always numeric) and `GADFLY_PR_BUDGET_SECS` (validated as integer on line 337). No user or API content enters the notice. Markdown injection is not possible here. - **`entrypoint.sh` `GITEA_TOKEN` in clone URL** (line 153, pre-existing): The token is embedded in `CLONE_URL`. This is a pre-existing pattern; stderr is redirected to `/dev/null` on the `git clone` call (line 157), so the URL isn't echoed. Not new. - **`tools.go` `diffForPath` substring matching** (lines 470–483, new): The `path` argument from the LLM is used only in `strings.Contains` against diff header lines — no filesystem access, no regex interpretation. A malicious/hallucinated model could pass a short path like `a` to match many files, but that's a correctness concern, not a security one. - **`executus.go` `resolveContextTokens`** (lines 349–363, new): Makes a one-time network call to Ollama Cloud's `/api/show` using the API key from the environment. The `modelSpec` comes from `GADFLY_MODEL` (CI config), not from untrusted input. The call has a 15-second timeout. No SSRF vector from attacker-controlled input. - **`executus.go` `prCallerID`** (lines 308–315, new): Constructs a caller ID from `GADFLY_REPO` and `GADFLY_PR`, both set by `entrypoint.sh` from trusted sources (`REPO_PATH` extracted from `GITEA_API`, and `PR` from the CI event). Used only as an audit/budget key, not in any security decision. - **All `envInt`/`strconv.ParseFloat` parsing** of environment variables: Every new env var (`GADFLY_PR_TOKEN_BUDGET`, `GADFLY_PR_TIME_BUDGET_SECS`, `GADFLY_COMPACT_RATIO`, `GADFLY_MODEL_CONTEXT_TOKENS`, `GADFLY_COMPACT_KEEP_RECENT`, `GADFLY_COMPACT_SUMMARY_WORDS`, `GADFLY_HUGE_DIFF_*`, `GADFLY_PR_BUDGET_SECS`) is parsed with proper type conversion and bounds-checking. Invalid values fall back to safe defaults. </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> **Verdict:** Minor issues - **`diffForPath` substring false-positive matching** (`cmd/gadfly/tools.go:475`): The function uses `strings.Contains(ln, path)` against the full `diff --git a/<path> b/<path>` header line. This produces false positives when one file's path is a substring of another's — e.g., `path="foo"` matches both `foo.go` and `foobar.go`, and `path="a"` matches **every** diff header (all contain `a/`). The recheck model uses this to pull specific file hunks for verification; a false positive feeds it the wrong file's diff, potentially causing it to confirm a finding against incorrect code. The fix is to anchor the match, e.g., check for `"a/" + path` or `"b/" + path` as distinct substrings: `strings.Contains(ln, "a/"+path) || strings.Contains(ln, "b/"+path)`. **Verified** by reading `tools.go:470-483` and tracing the substring logic against the known `diff --git a/<path> b/<path>` header format. - **`windowDiff` trailing-empty-line artifact** (`cmd/gadfly/tools.go:435,478-479`): `diffForPath` appends `"\n"` after every line including the last, so the scoped diff always ends with a trailing newline. `windowDiff` then calls `strings.Split(diff, "\n")`, which produces a spurious empty final element. This empty string is emitted as a numbered blank line (e.g., `"42\t\n"`) and inflates the `total` line count by 1. The paging hint then reports an off-by-one total. The full-diff path (`r.diff` without `diffForPath`) may or may not end with a newline, making the behavior inconsistent. Impact is cosmetic (a blank line in the tool output and a slightly wrong line count in the truncation notice) but it is a logic error. Fix: `strings.TrimSuffix(diff, "\n")` before splitting, or skip empty final elements. **Verified** by reading `diffForPath` at `tools.go:478-479` (always appends `\n`) and `windowDiff` at `tools.go:435` (splits on `\n`). </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> Both findings confirmed. Let me output the corrected review. **Verdict:** Minor issues - **Stale doc comment references removed fallback** (`cmd/gadfly/main.go:44-45`): The package-level doc comment still says "Plus a tool-free finalization fallback guarantees a step-exhausted pass still emits output." But `forceFinalAnswer`, `finalizeInstruction`, and the entire fallback mechanism were removed in this PR (the executus `run.Result` doesn't expose the transcript). The comment is now misleading — a future reader would expect a fallback that no longer exists. Verified by reading `main.go:42-45` and confirming `forceFinalAnswer`/`finalizeInstruction` are absent from the file. - **Test targets wrong env var** (`cmd/gadfly/recheck_test.go:52`): `TestBuildRecheckTask` calls `t.Setenv("GADFLY_MAX_DIFF_CHARS", "")` to clear any override, but `buildRecheckTask` was changed to read `GADFLY_RECHECK_DIFF_CHARS` instead (see `recheck.go:94`). The test passes only because the fixture diff is tiny and fits within the 20,000-char default. The `Setenv` line is dead — it clears a variable the function no longer consults. Verified by reading both `recheck.go:94` and `recheck_test.go:51-63`. </details> <details><summary><b>⚡ Performance</b> — Minor issues</summary> The finding is confirmed. `diffForPath` at line 473 does `strings.Split(diff, "\n")` on the full diff every time `get_diff` is called with a `path` arg (line 415), and `newRepoFS` (line 48-66) stores the diff as a raw string with no pre-index. No caching or indexing exists anywhere else in the codebase. The finding survives. --- **Verdict: Minor issues** - **`diffForPath` re-splits the full diff on every scoped `get_diff` call** (`cmd/gadfly/tools.go:470-483`). When the model calls `get_diff` with a `path` arg (which the tool description now encourages for large PRs), `diffForPath` does `strings.Split(diff, "\n")` on the *entire* diff — for a ~1MB / ~250K-line PR, that allocates a ~250K-element string-header slice (~4 MB) just to extract one file's section. If the model investigates N files across its steps, this allocation repeats N times (O(N × diff_size)). A one-time pre-index at `newRepoFS` time (e.g., a `map[string]string` of path→hunk, or a line-offset index) would make each scoped lookup O(1) instead of O(diff_size). The pagination itself is the primary burn fix and this is gated by the model's step budget, so impact is moderate — but it's the one place the large diff is still fully materialized on every tool call. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Blocking issues found</summary> **Blocking issues found** - **`diffForPath` substring match returns wrong file's diff** (`cmd/gadfly/tools.go:475`): `strings.Contains(ln, path)` matches the path as a substring anywhere in the `diff --git` header line. If a PR changes both `foo.go` and `foo_test.go`, asking for `path: "foo.go"` will also match the `foo_test.go` header (since `"foo.go"` is a substring of `"a/foo_test.go"`), returning both files' hunks instead of just the requested one. This defeats the per-file scoping that's the whole point of the paginated `get_diff` for large PRs — the model gets extra (potentially large) diff content it didn't ask for, wasting tokens. The fix is to match the path as a whole path component, e.g. by checking that the path appears bounded by `/` or whitespace/quote in the header line, or by splitting the header on spaces and checking each path segment. *Verified by reading `tools.go:470-482` and confirming the test at `tools_test.go:200` uses non-colliding names (`x` and `y.go`), so this edge case is untested.* - **`resolveContextTokens` silently disables compaction on network errors** (`cmd/gadfly/executus.go:357-360`): When the `model.MaxContextTokensResolving` call fails (network timeout, API error, DNS failure), the function returns 0, which silently disables compaction. The caller at line 232 treats `window > 0` as the gate — a 0 means "no compaction." But a network error is not the same as "model unknown"; the model may be perfectly valid and the network just temporarily unavailable. The result is that compaction is silently disabled for the entire process lifetime with no log message, and the large-PR burn protection is gone. The function should at least log the failure to stderr (as other non-fatal degradations in this codebase do, e.g. `main.go:176`). *Verified by reading `executus.go:349-363` and confirming no error logging path exists.* - **`windowDiff` truncation message off-by-one** (`cmd/gadfly/tools.go:461`): The message says `truncated at line %d` using `i` (0-based index of the first *unemitted* line), but the last *emitted* line was `i` (1-based). For example, after emitting lines 1–800, `i=800` (0-based), and the message reads "truncated at line 800 of 850" — but line 800 was the last line shown; truncation actually happened at line 801. The `start_line` hint correctly uses `i+1` (801), so the paging instruction is right, but the "truncated at" number is misleading. Should be `i+1` for consistency. *Verified by reading `tools.go:434-463`.* - **Budget watchdog race: `kill $KILLER_PID` may not prevent the delayed SIGKILL** (`entrypoint.sh:364`): The watchdog subshell runs `sleep $BUDGET; pkill -TERM ...; sleep 5; pkill -KILL ...`. If the budget expires and the watchdog enters its `sleep 5` between SIGTERM and SIGKILL, and the reviews then finish naturally during that 5-second window, the main script's `kill "$KILLER_PID"` sends SIGTERM to the subshell. However, the subshell is a bash process running `sleep 5` — SIGTERM will interrupt `sleep` but the subshell will then continue to the next command (`pkill -KILL`), which would fire against the consolidation pass that's about to start. The comment on line 362–363 says the kill is meant to "disarm the watchdog so its delayed SIGKILL can't catch the consolidation pass," but the disarm is not atomic with respect to the `sleep 5; pkill -KILL` sequence. A more robust approach would be to use a sentinel file that the watchdog checks before firing SIGKILL, or to `kill -KILL` the watchdog subshell itself rather than SIGTERM. *Verified by reading `entrypoint.sh:337-364`.* </details> </details> <details><summary><b>glm-5.2:cloud</b> (ollama-cloud) — Minor issues</summary> **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> **VERDICT: No material issues found** Through the 🔒 Security lens, I verified the taint paths in this PR and found nothing material: - **`get_diff` `path` arg (`tools.go:407-426`)** — the model-supplied `path` is *not* run through `repoFS.resolve()`/`contains()`, but it doesn't need to be: it never touches the filesystem. `diffForPath` only filters the already-loaded `r.diff` string (in memory) and the result is then bounded by `windowDiff` (`maxGetDiffLines`/`maxGetDiffBytes`). No sandbox-escape or path-traversal vector is introduced; a hostile PR can't read outside the checkout via this tool. The substring-matching in `diffForPath` (`tools.go:470-482`) can over-scope (e.g. `path:"go"` matches every `*.go` header), but that only returns diff content the model could already page through — bounded by the same caps — so it's a correctness nit, not a security leak. Verified by reading `tools.go:401-483`. - **Sandboxed read tools unchanged** — `read_file`/`grep`/`find_files` still go through `repoFS.resolve()` → `contains()` (`tools.go:68-100`), and the worker sub-agent (`delegate.go:42`) uses `workerToolbox()` which builds on the same sandboxed tools. The PR doesn't weaken containment. Verified via `grep`/`read_file`. - **API key handling (`executus.go:349-364`)** — `resolveContextTokens` forwards `GADFLY_API_KEY`/`OLLAMA_API_KEY` to `model.NewCloudOllamaLimitCache`. The key is sent to introspect the model's context window; the target is the same provider that already receives the model traffic, and it's never logged or returned in review output. This is expected behavior, not a credential leak. (I could not inspect the executus-side default base URL for the `""` first arg since the dependency source isn't in this repo, but that's executus's contract to honor, not a gadfly-introduced exposure — low confidence only.) - **Budget/backstop (`entrypoint.sh:335-349`, `executus.go:114-155`)** — the wall-clock killer uses hardcoded `pkill -f` patterns (`/usr/local/bin/gadfly`, `/app/scripts/run.sh`), no env interpolation into the process pattern, so no command-injection surface. The Go-side `gadflyBudget` is purely advisory internal accounting with no authz decisions. Verified by reading both. - **Secret leakage into review output** — token/second counts appear only in internal budget error strings (`executus.go:130-133`), which surface as run errors, not as posted review markdown; the caller (`reviewWithSpecialist`) renders a generic "failed to complete" notice. No usage figures reach the PR comment. Verified via `executus.go:279-304` and the existing `reviewWithSpecialist` path. No injection, SSRF, authz gap, secret leak, or untrusted-data mishandling issue introduced by this change in my lane. </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> The code at line 475 confirms `inSection = strings.Contains(ln, path)` — an unanchored substring match on the `diff --git` header. The tool description (line 402) promises "returns ONLY that file's diff hunks," but a short/ambiguous path substring (e.g. `foo`) would match every file whose path contains `foo` (e.g. `pkg/foobar.go`). The existing test only exercises a full filename (`y.go`) that happens to match cleanly, so it doesn't catch the over-broadness. The finding survives. The other two draft notes are self-described as "not a finding" / "no bug," so they don't belong in the findings list. **VERDICT: Minor issues** - **`cmd/gadfly/tools.go:475` (diffForPath unanchored substring match breaks the path-scoping contract).** `diffForPath` matches the `path` argument as a *substring* of the `diff --git` header line via `strings.Contains(ln, path)`, with no path-boundary anchoring. The `get_diff` tool's `path` arg description promises "returns ONLY that file's diff hunks," but a short/ambiguous path (e.g. `foo`) returns hunks for every file whose path merely *contains* that substring (e.g. `diff --git a/pkg/foobar.go b/pkg/foobar.go`). The existing test only exercises a full filename (`y.go`) that happens to match exactly, so it doesn't surface the over-broadness. Fix: split the header and compare against the exact `a/<path>`/`b/<path>` tokens, or at minimum anchor with delimiters. </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> ## VERDICT: Minor issues - **`cmd/gadfly/tools.go:113-127` — `toolbox()` is now dead in production; its tool-membership logic is duplicated in `gadflyToolRegistry`.** The deleted `runAgent` was the only production caller of `(*repoFS).toolbox()`. Verified by grep: only `tools_test.go:38` and `tools_test.go:284` remain. The production path now goes through `gadflyToolRegistry` (executus.go:80-94), which re-implements the same `fsTools()` + conditional `delegateTool()` assembly. Two sources of truth for "which tools the reviewer exposes." - **`cmd/gadfly/executus.go:82-84` vs `cmd/gadfly/tools.go:117-119` — duplicated tool-list construction.** `gadflyToolRegistry` re-derives `fs.fsTools()` + `if fs.worker != nil { append(fs.delegateTool()) }`, byte-for-byte the membership logic of `toolbox()`. Verified by reading both functions. Adding a new built-in tool to `fsTools()` requires both paths to stay in sync by hand. - **`cmd/gadfly/executus.go:285` — `GADFLY_TIMEOUT_SECS` re-read per pass, duplicating the `Defaults.MaxRuntime` set at build time.** `newReviewExecutor` sets `cfg.Defaults.MaxRuntime` from `envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)` (line 258), and `run()` re-reads the same env with the same default to build `MaxRuntime` for each `RunnableAgent` (line 285). Verified: both sites resolve the same knob. Duplicated env resolution for one logical value; a drift hazard if either is ever cached or changed independently. - **`cmd/gadfly/executus.go:19-21` — docstring claims "degrades to today's behavior," but the old forced-finalization fallback was removed by this PR.** The file-level comment says "Everything degrades to today's behavior when unconfigured," yet the prior transcript-based `forceFinalAnswer`/`finalizeInstruction` path was deleted (confirmed in the diff and the renamed `TestReviewExecutor_ExhaustionWithoutAnswerIsError` test comment). On empty-output exhaustion the new code returns an error where the old code forced a final answer, so it does not fully degrade to pre-PR behavior. Wording overpromises; trivial but misleading. - **`entrypoint.sh:208` — `MAX_DIFF_CHARS` env name is a shell-only alias invisible from the Go side.** The downshift sets `export MAX_DIFF_CHARS=...` with a comment "run.sh -> GADFLY_MAX_DIFF_CHARS." Confirmed: `scripts/run.sh:180` translates `MAX_DIFF_CHARS` to `GADFLY_MAX_DIFF_CHARS` when invoking the binary, and the Go tree reads only `GADFLY_MAX_DIFF_CHARS` (main.go:361) / `GADFLY_RECHECK_DIFF_CHARS` (recheck.go). `MAX_DIFF_CHARS` appears nowhere in Go. The indirection lives entirely in shell, documented only by the inline comment. None are blocking; the code compiles and the structure is broadly clean. The main maintainability risk is the duplicated reviewer-tool membership logic between `toolbox()` and `gadflyToolRegistry`. </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> **Verdict: No material issues found** Through the performance lens I checked the hot paths this PR touches: - **`get_diff` pagination** (`cmd/gadfly/tools.go:434-464`): `windowDiff` does `strings.Split(diff, "\n")` on every call, allocating the full line slice even when only an 800-line window is emitted. For a huge (~1 MB) diff this is O(diff) work per call rather than O(window). I considered flagging it, but it's a micro-optimization: the split is microseconds against model-call latencies of seconds, and `get_diff` is called a handful of times per pass, not in a tight loop. The output is correctly bounded (`maxGetDiffBytes`/`maxGetDiffLines`), so there's no unbounded growth. Not material. - **`diffForPath`** (`tools.go:470-483`): single linear scan per path-scoped call; fine. - **Compaction / budget wiring** (`executus.go`): `resolveContextTokens` runs once at executor build (memoized via the closure), `addUsage`/`Commit` are O(1) under a short mutex per pass — no per-step or per-token hot-path overhead introduced. - **`runSpecialists` fan-out** (`main.go:247-286`): concurrency semantics are unchanged (still bounded by `lensConcurrency()`); `fanout.Run` returns one result per input with no extra copying of note. No efficiency regressions (N+1, unbounded growth, quadratic behavior, hot-loop work, or missing limits) were found in this change. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> All three findings verified against the actual source. **VERDICT: Minor issues found** - **Backstop race can kill the consolidation pass** (`entrypoint.sh:338-364`). The watchdog subshell runs `sleep $BUDGET; pkill -TERM …; sleep 5; pkill -KILL -f '/usr/local/bin/gadfly'`. Disarm is `kill "$KILLER_PID"` (line 364) after `wait "${LANE_PIDS[@]}"` returns (line 360). If the budget expires at the same moment the lanes finish, the subshell has already passed its second `sleep` and is executing `pkill -KILL`; killing the subshell PID does not reliably cancel an already-running `pkill` child. The subsequent consolidation step (line 392-393) invokes `/usr/local/bin/gadfly`, so a late KILL pkill can terminate it — defeating the "partial findings still posted" guarantee. Suggested fix: gate the KILL pkill on the `.budget_killed` sentinel, or have the disarm also `pkill` the subshell's children / set a flag the pkill block checks before issuing KILL. - **`diffForPath` substring match over-selects files** (`cmd/gadfly/tools.go:470-483`). Matching is `strings.Contains(ln, path)` on the `diff --git` header line, so a short `path` like `x` also matches `diff --git a/xy b/xy` (and `ax`, etc.). The tool returns an inflated/incorrect diff window for that "file" rather than the requested one, with no error. For a reviewer scoping a finding to one file this silently feeds it the wrong hunks. Suggested fix: match the path against the `a/`/`b/` path tokens exactly. - **Startup-blocking network call for context window** (`cmd/gadfly/executus.go:349-364`). `resolveContextTokens` makes a live call (up to 15s timeout via `context.WithTimeout`) during `newReviewExecutor`, called on the review startup path (`main.go:178`) before any lens runs. When the Ollama Cloud endpoint is unreachable/slow, every model lane eats up to 15s of dead time before compaction silently disables (returns 0). It degrades safely, but the blocking-on-startup placement is an edge case worth noting; a cached/async resolution would avoid stalling small PRs that happen to point at a flaky endpoint. </details> </details> <details><summary><b>kimi-k2.6:cloud</b> (ollama-cloud) — Minor issues</summary> **Verdict: Minor issues · ⚠️ 1/5 lens(es) errored** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> **Verdict: No material issues found** Reviewed through the **Security** lens. I traced all new code paths that touch untrusted input (the PR diff, LLM tool arguments, environment variables) and checked for authn/authz gaps, injection vectors, SSRF, credential leakage, and unsafe data handling. Key areas verified: - **`getDiffArgs.Path` / `diffForPath`** (cmd/gadfly/tools.go:414–482): The new `Path` parameter is **not** used for filesystem access — it is only passed to `strings.Contains` as a substring filter against the already-in-memory unified diff. No path traversal, directory escape, or file-read amplification is possible. The `repoFS.resolve` sandbox is never invoked. - **`windowDiff` pagination** (cmd/gadfly/tools.go:434–463): `start` and `limit` are clamped to valid ranges; negative values and overflows are handled. - **`GADFLY_NOTICE` in entrypoint.sh / run.sh** (entrypoint.sh:210, scripts/run.sh:233–236): `PR_DIFF_BYTES` is produced by `wc -c | tr -d '[:space:]'`, which is strictly numeric. The notice is therefore not injectable with attacker-controlled content in the intended flow. - **`resolveContextTokens` API key handling** (cmd/gadfly/executus.go:349–363): Keys are read from env and passed to `model.NewCloudOllamaLimitCache`; no logging or exposure of the key. - **Backstop `pkill` patterns** (entrypoint.sh): The `pkill -f` arguments are hardcoded literals (`/usr/local/bin/gadfly`) or derived from hardcoded `SCRIPTS_DIR`; no user-controlled path components. - **Budget / caller ID isolation** (cmd/gadfly/executus.go:306–315): The `prCallerID` function keys the budget on `repo#pr` or falls back to `"local"`. The budget struct itself is per-process, so there is no cross-PR budget leakage. The change is clean from a security perspective. </details> <details><summary><b>🎯 Correctness</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 11: all chain targets failed ollama-cloud/kimi-k2.6:cloud: model returned an empty response </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> Minor issues - **`cmd/gadfly/main.go:44` — Stale comment about removed finalization fallback.** The file-header comment still says "Plus a tool-free finalization fallback guarantees a step-exhausted pass still emits output." That fallback (`finalizeInstruction` + `forceFinalAnswer`) was explicitly dropped in this PR (the executus re-platform removes the transcript-based forced-finalization path, confirmed by the updated `wrapup_test.go` comment and the deletion of both the const and function from `main.go`). The comment now misleads readers about behavior that no longer exists. **Fix:** Remove that sentence from the comment, or rephrase to say the wrap-up critic is the sole mechanism. - **`cmd/gadfly/recheck_test.go:52` — Test sets the wrong environment variable.** `TestBuildRecheckTask` calls `t.Setenv("GADFLY_MAX_DIFF_CHARS", "")`, but `buildRecheckTask` was changed in this PR to read `GADFLY_RECHECK_DIFF_CHARS` (see `recheck.go:94`). Because the test diff is tiny, truncation never triggers and the test still passes, but the test is now exercising a dead env var instead of the new one. **Fix:** Change the `Setenv` call to `GADFLY_RECHECK_DIFF_CHARS`, and add a short diff + explicit override value so the truncation path is actually covered. </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> No material issues found The diff is entirely additive performance protections for large PRs: context compaction at a deliberately aggressive 0.45 threshold, paginated `get_diff` capped at 800 lines/64 KB per call, a smaller 20 k-char recheck diff embed, per-PR token/time budget gates, and `fanout.Run` replacing the existing semaphore-based lens fan-out with equivalent concurrency behavior. No allocations or work were added to hot paths for small PRs, and no quadratic or unbounded-growth patterns were introduced.```gadfly-findings [] ``` </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> **Minor issues** - **`cmd/gadfly/tools.go:475` — `diffForPath` substring matching can return wrong file's hunks.** `diffForPath` uses `strings.Contains(ln, path)` on the `diff --git` header. If one filename is a substring of another (e.g., the model passes `bar.go` and the diff also contains `bar_test.go`), both sections match and are concatenated into the returned diff. The model may then verify findings against the wrong code. The fix is to match complete path components (e.g., require the match to be bounded by `/` or space) rather than a bare substring. </details> </details> <details><summary><b>qwen3.5:397b-cloud</b> (ollama-cloud) — No material issues found</summary> **Verdict: No material issues found · ⚠️ 3/5 lens(es) errored** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 5: context deadline exceeded </details> <details><summary><b>🎯 Correctness</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 9: all chain targets failed ollama-cloud/qwen3.5:397b-cloud: ollama ollama-cloud/qwen3.5:397b-cloud: do request: Post "https://ollama.com/api/chat": context deadline exceeded </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — No material issues found</summary> ## VERDICT: No material issues found After thorough review through the **code cleanliness & maintainability** lens, I found no material issues. The code is well-structured and follows good practices: **Positive observations (verified by reading the code):** 1. **No dead/duplicated code**: The old `runAgent`, `forceFinalAnswer`, and finalize-related code was properly removed when replaced by the executus-based `reviewExecutor.run()` (confirmed in `cmd/gadfly/main.go` and `cmd/gadfly/executus.go`). 2. **Clear naming and structure**: New types like `reviewExecutor`, `gadflyBudget`, `wrapUpCritic`, and `wrapUpHandle` have descriptive names and focused responsibilities. The `wrappedTool` adapter cleanly bridges majordomo tools to executus's interface. 3. **Consistent patterns**: The code follows existing patterns—env var resolution via `envInt()`, graceful degradation when optional features are disabled (nil budget, nil summarizer), and proper mutex usage in `gadflyBudget`. 4. **Good documentation**: Comments explain the "why" behind design choices (e.g., why `defaultCompactRatio` is 0.45 instead of executus's 0.7, why the transcript-based fallback was dropped). 5. **Tests updated appropriately**: Test functions were renamed from `TestRunAgent_*` to `TestReviewExecutor_*` and use the new `newTestReviewExecutor` helper, maintaining test coverage for the new execution path. **Minor note (not a finding, just an observation):** - The `diffForPath` function in `cmd/gadfly/tools.go:470` uses substring matching (`strings.Contains`) on the full `diff --git` header line. This could theoretically match unintended files if paths share substrings (e.g., searching for "test" might match both "test.go" and "latest.go"), but in practice users provide full paths and the test at `tools_test.go:197` validates the common case works correctly. </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> **No material issues found** The performance-focused changes in this PR directly address the stated problem (large-PR token burn) without introducing efficiency regressions: 1. **Paginated `get_diff`** (`cmd/gadfly/tools.go:407-483`): Correctly limits output to 800 lines / 64KB per call instead of dumping the entire diff. The `strings.Split` allocation is acceptable since the diff is already held in memory by `repoFS`, and the output caps prevent token explosion. 2. **Context compaction** (`cmd/gadfly/executus.go:240-254`): Wired with a deliberately low threshold (0.45× context window) that fires before transcripts balloon. The `resolveContextTokens` function makes exactly one network call at executor build time (cached), not per-pass. 3. **Downshift logic** (`entrypoint.sh:186-212`): Single diff size check (`wc -c`) triggers reduced work (one model, fewer lenses/steps, no recheck) for large PRs. 4. **Budget gate** (`cmd/gadfly/executus.go:108-143`): Simple mutex-protected counters; negligible overhead. Verified by reading `cmd/gadfly/tools.go:429-483` (windowDiff/diffForPath implementation) and `cmd/gadfly/executus.go:342-364` (context token resolution). No N+1 patterns, unbounded growth, or hot-path blocking introduced. </details> <details><summary><b>🧯 Error handling & edge cases</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed ollama-cloud/qwen3.5:397b-cloud: skipped (backed off until 15:05:25.229) </details> </details> </details> <sub>Automated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.</sub>
steve added 1 commit 2026-06-30 15:32:26 +00:00
fix: address gadfly swarm review of the executus re-platform
Build & push image / build-and-push (pull_request) Successful in 8s
f5ca813af3
Folds in the real findings the dogfood swarm raised on PR #20 (21 graded
real, 1 false positive):

- tools.go: anchor get_diff `path` matching to whole path tokens (foo.go no
  longer pulls in barfoo.go; a trailing "/" still scopes a directory); split
  the diff once + cache it; drop the spurious trailing blank line; fix the
  "truncated after line N" off-by-one wording. Share one tool-list source
  (allTools) between toolbox() and the executus registry.
- executus.go: drop the dead Config.Defaults caps (per-run RunnableAgent always
  overrides them); shared envBool/reviewTimeout helpers; resolveContextTokens
  logs a failed lookup and uses a 5s timeout (was 15s); note the budget guard
  is pass-granular (the wall-clock backstop covers mid-pass).
- main.go/recheck.go: shared envBool; fix package-doc drift (the removed
  finalization fallback, the paginated get_diff).
- entrypoint.sh/run.sh: export GADFLY_MAX_DIFF_CHARS directly (run.sh prefers
  it); guard the watchdog's delayed SIGKILL on a .disarmed marker so it can't
  catch the consolidation pass.
- tests: anchoring test; corrected obsolete env var + truncation-wording asserts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
steve merged commit ac6ce06cdd into main 2026-06-30 15:41:03 +00:00
steve deleted branch feat/executus-replatform 2026-06-30 15:41:03 +00:00
Sign in to join this conversation.
No Reviewers
No Label
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: steve/gadfly#20