feat: re-platform agentic review onto executus + large-PR cost controls #20
Reference in New Issue
Block a user
Delete Branch "feat/executus-replatform"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Why
A very large PR (mort #1367: ~1.01 MB / ~253K-token diff) burned a full 5-hour Ollama Cloud block in ~15 minutes. Root cause: the agent loop re-sends the whole transcript every step, and the diff entered the prompt three ways (review embed, the recheck embed, and
get_diffdumping it untruncated), so it was re-transmitted acrosssteps × lenses × passes × models(~30 cloud passes). Small PRs were never affected.What
Makes gadfly a real consumer of executus (its canonical light-tier host) and fixes the burn in size-gated layers — small PRs are untouched.
executus re-platform
executus/run.Executor: context compaction (executus/compact, threshold from the model's real context window viaexecutus/model), run-bounding, a per-PR budget gate (Ports.Budget), and the wrap-up nudge re-expressed as arun.Critic.executus/fanout.model.go→GADFLY_ENDPOINT_<NAME>http aliases + the claude-code engine are unaffected. No majordomo bump; the binary stays static (executus core is majordomo+stdlib only).Burn fix (three layers + the per-PR budget)
get_diff(per-filepath+start_line/limit) instead of dumping the whole diff; recheck diff embed trimmed 60k → 20k. Compaction at a deliberately lowGADFLY_COMPACT_RATIO=0.45(the embedded path rarely reaches 0.7×context).entrypoint.sh) aboveGADFLY_HUGE_DIFF_BYTES→ one cheap model, focused lenses, fewer steps, no recheck — with a self-explaining notice on the comment.GADFLY_PR_BUDGET_SECS— swarm-wide wall-clock kill (addsprocps). All advisory; CI never fails.Notes / trade-off
run.Resultexposes no transcript, so the old transcript-based forced-finalization fallback is dropped; the wrap-up critic nudge is the remaining "always emit something" mechanism. (Could be restored via a small executus enhancement.)gofmt/go vet/go test -race/staticCGO_ENABLED=0build all green (85 tests).Advisory-only invariant re-confirmed: budget rejections and the backstop kill post advisory notices; the job never fails.
🤖 Generated with Claude Code
🪰 Gadfly — live review status
6/6 reviewers finished · updated 2026-06-30 15:10:27Z
claude-code/opus· claude-code — ✅ doneclaude-code/opus:max· claude-code — ✅ donedeepseek-v4-pro:cloud· ollama-cloud — ✅ doneglm-5.2:cloud· ollama-cloud — ✅ donekimi-k2.6:cloud· ollama-cloud — ✅ doneqwen3.5:397b-cloud· ollama-cloud — ✅ doneLive status board. Findings are posted in each model's own comment. Advisory only — does not block merge.
🪰 Gadfly consensus review — 12 inline findings on changed lines. See the consensus comment for the full ranked summary.
Advisory only — does not block merge.
@@ -0,0 +16,4 @@// steer that tells a step-hungry model to stop investigating and write its// answer is now the critic seam, not a bespoke RunOption.//// Everything degrades to today's behavior when unconfigured: nil summarizer or a⚪ file docstring claims 'degrades to today's behavior' but the old forced-finalization fallback was removed by this PR, so empty-output exhaustion no longer matches prior behavior
maintainability · flagged by 1 model
cmd/gadfly/executus.go:19-21— docstring claims "degrades to today's behavior," but the old forced-finalization fallback was removed by this PR. The file-level comment says "Everything degrades to today's behavior when unconfigured," yet the prior transcript-basedforceFinalAnswer/finalizeInstructionpath was deleted (confirmed in the diff and the renamedTestReviewExecutor_ExhaustionWithoutAnswerIsErrortest comment). On empty-output exhaustion the new code returns an error where…🪰 Gadfly · advisory
@@ -0,0 +79,4 @@// returns it along with the tool names for RunnableAgent.LowLevelTools.func gadflyToolRegistry(fs *repoFS) (exectool.Registry, []string, error) {reg := exectool.NewRegistry()tools := fs.fsTools()🟡 gadflyToolRegistry re-derives fsTools()+delegateTool, duplicating toolbox()'s membership logic (two sources of truth)
maintainability · flagged by 1 model
cmd/gadfly/executus.go:82-84vscmd/gadfly/tools.go:117-119— duplicated tool-list construction.gadflyToolRegistryre-derivesfs.fsTools()+if fs.worker != nil { append(fs.delegateTool()) }, byte-for-byte the membership logic oftoolbox(). Verified by reading both functions. Adding a new built-in tool tofsTools()requires both paths to stay in sync by hand.🪰 Gadfly · advisory
@@ -0,0 +255,4 @@ContextTokens: ctxTokens,Defaults: exrun.Defaults{MaxIterations: envInt("GADFLY_MAX_STEPS", defaultMaxSteps),MaxRuntime: time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second,🟡 GADFLY_TIMEOUT_SECS/MAX_STEPS configured in multiple places; Config.Defaults.MaxRuntime/MaxIterations duplicated verbatim with per-run RunnableAgent values
maintainability · flagged by 1 model
cmd/gadfly/executus.go:258&:285(withcmd/gadfly/main.go:325) —GADFLY_TIMEOUT_SECS/GADFLY_MAX_STEPSare configured in three/two places, and theConfig.Defaultscopies are redundant with the per-run values. The expressiontime.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Secondis written verbatim twice inexecutus.go(once inConfig.Defaults.MaxRuntime, once in the per-runRunnableAgent.MaxRuntime), and the same timeout is also applied as th…🪰 Gadfly · advisory
@@ -0,0 +280,4 @@res := r.ex.Run(ctx, exrun.RunnableAgent{Name: "gadfly-review",SystemPrompt: system,ModelTier: r.modelSpec,🟡 Per-PR token budget only updated after each pass (addUsage post-run), so Budget.Check can't stop a single runaway pass mid-flight — token guard is pass-granular, weaker than its name implies
error-handling, maintainability · flagged by 3 models
cmd/gadfly/executus.go:283(and:123–155, thegadflyBudgettoken gate) — the per-PR token budget can't bite mid-pass, only between passes.Budget.Checkis the executor's pre-call gate, but it readsb.tokens, which is only updated byreviewExecutor.runcallingr.budget.addUsage(res.Usage)afterex.Runreturns (therun.Ports.Budgetinterface only carriesCommit(seconds), with no per-call token feed — tokens are fed "out-of-band" once per finished pass). Consequence: t…🪰 Gadfly · advisory
@@ -0,0 +316,4 @@// compactionEnabled reports whether context compaction should be wired. On// unless GADFLY_COMPACT is explicitly falsey.func compactionEnabled() bool {🟡 compactionEnabled() is a verbatim copy of recheckEnabled(); extract a shared env-bool helper
maintainability · flagged by 1 model
cmd/gadfly/executus.go:319-326—compactionEnabled()is a verbatim copy ofrecheckEnabled(). It is byte-for-byte the same "on unless explicitly falsey" pattern asrecheck.go:68-75(switch strings.ToLower(strings.TrimSpace(os.Getenv(...)))over"0","false","no","off"→default: true). This is exactly the copy-paste the lens targets: two identical env-bool parsers that will drift. Fix: extract one helper (e.g.envBoolDefaultTrue(name string) bool) and have both call it.🪰 Gadfly · advisory
@@ -0,0 +354,4 @@if key == "" {key = strings.TrimSpace(os.Getenv("OLLAMA_API_KEY"))}cache := model.NewCloudOllamaLimitCache("", key, nil)🟠 resolveContextTokens silently disables compaction on network errors with no log
error-handling · flagged by 2 models
🪰 Gadfly · advisory
@@ -411,0 +432,4 @@// amplifier behind the large-PR token burn. The full diff stays reachable by// paging with start_line, or scoped per file via the path arg.func windowDiff(diff, scope string, start, limit int) string {lines := strings.Split(diff, "\n")🟡 windowDiff re-splits the whole diff per get_diff call; paging a large diff is O(n^2/limit) on the large-PR hot path
correctness, performance · flagged by 2 models
cmd/gadfly/tools.go:435(windowDiff) re-splits the entire diff on everyget_diffcall, even to return one 800-line window.strings.Split(diff, "\n")materializes a[]stringover the whole diff regardless ofstart/limit. On exactly the large PRs this feature targets, the agent is steered to callget_diffrepeatedly — paging viastart_lineand re-fetching perpath. Each call is O(total-diff-lines) work plus a fresh line-slice allocation, so **paging a large diff to its…🪰 Gadfly · advisory
@@ -411,0 +458,4 @@emitted++}if i < total {fmt.Fprintf(&b, "... (%s truncated at line %d of %d; call get_diff again with start_line=%d for more, or pass a `path` to scope to one file)\n", scope, i, total, i+1)🟡 windowDiff truncation message off-by-one: says 'truncated at line N' when N was the last emitted line
error-handling · flagged by 1 model
windowDifftruncation message off-by-one (cmd/gadfly/tools.go:461): The message saystruncated at line %dusingi(0-based index of the first unemitted line), but the last emitted line wasi(1-based). For example, after emitting lines 1–800,i=800(0-based), and the message reads "truncated at line 800 of 850" — but line 800 was the last line shown; truncation actually happened at line 801. Thestart_linehint correctly usesi+1(801), so the paging instruction is righ…🪰 Gadfly · advisory
@@ -411,0 +467,4 @@// `diff --git` header that names the path through to the next `diff --git`// header (or end). Matching is a substring on the header line, so either side// (a/<path> or b/<path>) hits, including a rename's two paths.func diffForPath(diff, path string) string {🟠 diffForPath re-splits the full diff on every scoped get_diff call, allocating O(diff_size) per invocation
correctness, error-handling, performance · flagged by 5 models
The finding is confirmed. The code at
cmd/gadfly/tools.go:470-483matches the draft's description exactly:inSection = strings.Contains(ln, path)is an unanchored substring match against the fulldiff --git a/… b/…header line, so apaththat is a substring of another changed file's path (e.g.foo.govsvendor/foo.go) pulls in both sections. This survives verification.🪰 Gadfly · advisory
@@ -186,0 +205,4 @@export GADFLY_MAX_STEPS="${GADFLY_HUGE_DIFF_MAX_STEPS:-12}"export GADFLY_RECHECK_MAX_STEPS="${GADFLY_HUGE_DIFF_RECHECK_MAX_STEPS:-8}"export GADFLY_RECHECK="${GADFLY_HUGE_DIFF_RECHECK:-0}" # skip recheck on huge PRsexport MAX_DIFF_CHARS="${GADFLY_HUGE_DIFF_MAX_DIFF_CHARS:-20000}" # run.sh -> GADFLY_MAX_DIFF_CHARS⚪ downshift exports MAX_DIFF_CHARS (not GADFLY_MAX_DIFF_CHARS); the aliasing lives only in run.sh and is invisible from the Go side
maintainability · flagged by 1 model
entrypoint.sh:208—MAX_DIFF_CHARSenv name is a shell-only alias invisible from the Go side. The downshift setsexport MAX_DIFF_CHARS=...with a comment "run.sh -> GADFLY_MAX_DIFF_CHARS." Confirmed:scripts/run.sh:180translatesMAX_DIFF_CHARStoGADFLY_MAX_DIFF_CHARSwhen invoking the binary, and the Go tree reads onlyGADFLY_MAX_DIFF_CHARS(main.go:361) /GADFLY_RECHECK_DIFF_CHARS(recheck.go).MAX_DIFF_CHARSappears nowhere in Go. The indirection lives entirely in she…🪰 Gadfly · advisory
@@ -300,0 +342,4 @@pkill -TERM -f '/usr/local/bin/gadfly' 2>/dev/null || truepkill -TERM -f "${SCRIPTS_DIR}/run.sh" 2>/dev/null || truesleep 5pkill -KILL -f '/usr/local/bin/gadfly' 2>/dev/null || true🟠 Backstop KILL pkill can race the disarm and kill the consolidation gadfly pass
error-handling · flagged by 1 model
🪰 Gadfly · advisory
@@ -310,1 +361,4 @@# Reviews finished (or the watchdog killed them): disarm the watchdog so its# delayed SIGKILL can't catch the consolidation pass that runs next.if [ -n "$KILLER_PID" ]; then kill "$KILLER_PID" 2>/dev/null || true; fi🟠 Budget watchdog kill may not prevent delayed SIGKILL from catching consolidation pass
error-handling · flagged by 1 model
kill $KILLER_PIDmay not prevent the delayed SIGKILL (entrypoint.sh:364): The watchdog subshell runssleep $BUDGET; pkill -TERM ...; sleep 5; pkill -KILL .... If the budget expires and the watchdog enters itssleep 5between SIGTERM and SIGKILL, and the reviews then finish naturally during that 5-second window, the main script'skill "$KILLER_PID"sends SIGTERM to the subshell. However, the subshell is a bash process runningsleep 5— SIGTERM will interrup…🪰 Gadfly · advisory
🪰 Gadfly review — consensus across 6 models
Verdict: Blocking issues found · 15 findings (6 with multi-model agreement)
cmd/gadfly/tools.go:470cmd/gadfly/main.go:44cmd/gadfly/executus.go:283cmd/gadfly/executus.go:357cmd/gadfly/recheck_test.go:52cmd/gadfly/tools.go:4359 single-model findings (lower confidence)
entrypoint.sh:345entrypoint.sh:364cmd/gadfly/executus.go:82cmd/gadfly/executus.go:258cmd/gadfly/executus.go:319cmd/gadfly/tools.go:115cmd/gadfly/tools.go:461cmd/gadfly/executus.go:19entrypoint.sh:208Per-model detail
claude-code/opus (claude-code) — Minor issues
Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
I've established the key facts:
The gadfly-side code is confirmed exactly as the draft describes —
resolveContextTokens(cmd/gadfly/executus.go:349-364) readsGADFLY_API_KEY/OLLAMA_API_KEYand passes it intomodel.NewCloudOllamaLimitCache("", key, nil)with no provider gate, on the default review path.The executus module source is genuinely inaccessible from this sandbox — it's in the module cache outside the working directory (the
ls/findcalls are blocked for security), there's novendor/directory, and no network. This is the same wall the original reviewer hit.The finding's entire materiality hinges on a behavior the draft explicitly could not verify: whether
executus/model.MaxContextTokensResolvingactually performs a keyed/api/shownetwork fetch for non-Ollama model specs. The draft author marked this as unverifiable ("If executus only fetches when the spec resolves to an ollama-cloud model, this is benign"), assigned it low confidence, and the verification gate cannot positively confirm the egress mechanism against the code.The available evidence actually leans against the finding: the cache is named
CloudOllamaLimitCache, the author's own contract comment scopes the fetch to "an Ollama Cloud model's limit via/api/show" and states it "returns 0 — disabling compaction — for an unknown model," consistent with a static-catalog-first resolver that doesn't blindly POST the key for arbitrary specs. Per the verification discipline — keep only what I can positively confirm — this finding must be dropped.VERDICT: No material issues found
Security review (🔒) — re-platform onto executus + large-PR cost controls
I traced the attacker-controlled input — the PR diff under review — through the new code and found no injection/path/SSRF sink:
diffForPath/windowDiffonly substring-match and slice the in-memory diff string and return it to the model;get_diff's newpath/start_line/limitargs never touch the filesystem (cmd/gadfly/tools.go:411-464). Thepkill -fbackstop matches fixed, operator-controlled path patterns, not anything from the diff. Budget/critic code handles no untrusted data. The budget caps, caller-id derivation (prCallerID, numeric PR +owner/repo), and env parsing introduce no new trust-boundary issues, and no secrets are written to logs/error strings.The one drafted concern (a possible egress of a non-Ollama provider's secret to
api.ollama.comviaresolveContextTokens) depends entirely onexecutus/model.MaxContextTokensResolvingperforming a keyed network fetch for non-Ollama model specs — behavior that cannot be confirmed against the code from this checkout and which the surrounding contract comments suggest does not occur. Unconfirmable, so dropped.🎯 Correctness — Minor issues
Confirmed. Line 475 uses
strings.Contains(ln, path), an unanchored substring match against thediff --githeader. Apatharg that is a substring of another changed file's header (e.g.api.go⊂internal/api.go, or any short path that appears in another file's a/b sides) pulls that file's hunks in too. Impact is as described — extra context only, never a missing/wrong hunk — and the_testinfix spares the commontools.go/tools_test.gocollision. The finding holds.VERDICT: Minor issues
diffForPathsubstring matching over-includes files (cmd/gadfly/tools.go:475). The section selector isstrings.Contains(headerLine, path). The arg is documented as a full path (e.g. pkg/foo/bar.go), but substring matching means apaththat is a substring of another changed file'sdiff --githeader pulls in that file's hunks too. Concrete case: a PR touching bothapi.goandinternal/api.go, withget_diff(path="api.go"), returns both sections. Impact is mild — extra context, never a missing/wrong hunk — but it dilutes the per-file scoping that exists precisely to cut tokens on huge PRs (the PR's whole motivation). The common test-file collision (tools.govstools_test.go) is not hit thanks to the_testinfix, so this is low-frequency. Suggested fix: anchor on the header tokens, e.g. match" a/"+pathor" b/"+path(still covers renames), instead of a bareContains. Severity small.🧹 Code cleanliness & maintainability — Minor issues
All three findings are confirmed against the actual code:
forceFinalAnswer/finalizeInstructionare entirely gone from non-test code (grep returns nothing). The comment now contradicts the code.compactionEnabled()is byte-for-byte identical torecheckEnabled()(recheck.go:68-75) modulo the env var name.time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Secondappears at main.go:325, executus.go:258, and executus.go:285 — exactly three identical copies.All three survive verification. Outputting the corrected review.
VERDICT: Minor issues
Reviewing strictly through the code cleanliness & maintainability lens. The new
executus.gois well-structured and thoroughly commented, and therunSpecialistsfanout rewrite is a clean swap. Three maintainability nits, all verified by reading the code:cmd/gadfly/main.go:44— stale package doc describes removed behavior. The package-level doc still saysGADFLY_WRAPUP_RESERVEis backed by "a tool-free finalization fallback [that] guarantees a step-exhausted pass still emits output." This PR deletes that fallback (forceFinalAnswer/finalizeInstructionare gone — a repo grep finds no non-test references left — andwrapup_test.gois updated to document that exhaustion now surfaces as an error). main.go:44 is the only non-test mention of the fallback left, and it now directly contradicts the new behavior. A reader trusting this comment will believe a guarantee that no longer holds. Fix: drop the "Plus a tool-free finalization fallback…" sentence (the wrap-up nudge is now the only "always emit something" mechanism).cmd/gadfly/executus.go:319-326—compactionEnabled()is a verbatim copy ofrecheckEnabled(). It is byte-for-byte the same "on unless explicitly falsey" pattern asrecheck.go:68-75(switch strings.ToLower(strings.TrimSpace(os.Getenv(...)))over"0","false","no","off"→default: true). This is exactly the copy-paste the lens targets: two identical env-bool parsers that will drift. Fix: extract one helper (e.g.envBoolDefaultTrue(name string) bool) and have both call it.cmd/gadfly/executus.go:258&:285(andmain.go:325) — the per-lens timeout expression is written three times.time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Secondappears identically inDefaults.MaxRuntime, inreviewExecutor.run'sRunnableAgent.MaxRuntime, and inreviewWithSpecialist. The two insideexecutus.goare also redundant with each other (theRunnableAgentvalue re-specifies whatDefaultsalready sets for the same executor). A smalllensTimeout()helper (mirroring the existingwrapUpReserve()/lensConcurrency()style) would remove the repetition and the chance of one copy being changed without the others. Trivial, but it's the kind of duplication that rots.Nothing else in my lane: naming, function length, and the
wrappedTool/gadflyBudget/wrapUpCriticabstractions read clearly and match the surrounding style, and the new tests are well-named for the renamed symbols.⚡ Performance — Minor issues
I've verified both findings against the actual code in
cmd/gadfly/tools.go.Finding 1 (
windowDiff, line 435): Confirmed.windowDiffcallsstrings.Split(diff, "\n")over the whole input at line 435, and thestart/limitbounds are applied only afterward (the loop at 448–459 indexes into the fully-materializedlinesslice). Paging to the end of a large diff re-scans from line 1 each call → O(n²/limit). The magnitude caveat is also accurate: Go'sstrings.Splitsubstrings share the original backing bytes, so the per-call cost is the line-descriptor slice plus the O(n) scan, not a byte copy.Finding 2 (
diffForPath+windowDiff, lines 415/470): Confirmed. A path-scoped call runsdiffForPath(fullstrings.Splitat line 473) thenwindowDiff(a second split at line 435) — two splits per path-scoped call.Both findings positively reproduce against the code and are honestly scoped as minor. Both survive.
VERDICT: Minor issues
Performance lens — findings
cmd/gadfly/tools.go:435(windowDiff) re-splits the entire diff on everyget_diffcall, even to return one 800-line window.strings.Split(diff, "\n")materializes a[]stringover the whole diff regardless ofstart/limit. On exactly the large PRs this feature targets, the agent is steered to callget_diffrepeatedly — paging viastart_lineand re-fetching perpath. Each call is O(total-diff-lines) work plus a fresh line-slice allocation, so paging a large diff to its end is O(n²/limit): page k still scans/splits from line 1. Thestart/limitbounds are applied only after the full split, and the loop indexeslines[i]from a slice built over everything.strings.Splitsubstrings share the original string's backing bytes (no byte copy), so the per-call cost is the line-descriptor slice (~16 B × line-count) plus the O(n) scan — GC/CPU churn on the large-PR path, not a token blowup. A real efficiency-by-omission, not a blocker.r.diffinto a cached[]stringon therepoFS, guarded for the immutable diff) and havewindowDiffslice the cached lines; paging then becomes O(window) per call.cmd/gadfly/tools.go:415+:470— a path-scopedget_diffsplits the diff twice.diffForPathdoes a fullstrings.Split+ section rebuild (O(n)), thenwindowDiffsplits that section again. Two full passes/allocations per path-scoped call. Minor on top of the above; the cached-lines approach would letdiffForPathreturn line indices and avoid the second split.🧯 Error handling & edge cases — No material issues found
I've established the key facts:
The ordering inversion is real and confirmed from the diff: new
executus.go:runchecksres.Err != nil(returns error) before checkingres.Output, whereas the oldrunAgentreturned non-emptyres.Outputfirst and only surfacedrunErrwhen output was empty.The impact is unverifiable. The finding is only a real regression if executus's
run.Executor.Runcan return a populatedOutputtogether with a non-nilErr. The executus module source lives in the Go module cache outside the working directory, and every attempt to read it (Read, grep,go doc,cat, even with the sandbox disabled) is blocked by the working-directory allowlist. I cannot positively confirm that executus ever returnsOutput+Errtogether.The draft itself concedes this: it rates confidence "medium" and states the impact "hinges on whether executus ever returns Output+Err together, which I could not confirm." A verification gate keeps a finding only when it can be positively confirmed against the code. The ordering inversion alone is not a defect — it is harmful only under an executus semantic I cannot establish, and a reasonable reading (a run kernel that returns empty
OutputwheneverErris set, with the wrap-up critic ensuring output is emitted before the cap on the success path) makes it a non-issue. This is exactly the kind of speculative, unconfirmable impact the gate must drop.All other items in the draft were explicitly not flagged by the draft itself.
VERDICT: No material issues found
The new executus re-platform code is well-guarded against the error/edge-case lens:
gadflyBudgetis nil-safe on every method (executus.go:123-155, covered byexecutus_test.go),compactRatioclamps to(0,1]and falls back on parse error,resolveContextTokensdegrades to0→compaction-off on an unknown model, a badGADFLY_COMPACT_MODELdisables compaction rather than sinking the review (main.go:172-176), and the lens-panic recover survived the fanout rewrite (main.go:255-265).The one drafted concern — that
reviewExecutor.runchecksres.Errbeforeres.Outputand could discard a completed review — was dropped: the ordering inversion is real, but its impact depends entirely on whether executus'srun.Executor.Runreturns a populatedOutputtogether with a non-nilErr, which cannot be confirmed against the available source (the executus module is outside the readable tree). Without that confirmation the finding is speculative, not a verified defect.claude-code/opus:max (claude-code) — Minor issues
Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
I've completed verification. Here is the outcome:
Confirmed in-repo (sub-claims):
GADFLY_API_KEYis indeed the generic provider key (model.go:36–39), andresolveContextTokens(executus.go:349–364) does unconditionally buildmodel.NewCloudOllamaLimitCache("", key, nil)with that generic key and an empty base URL, regardless of provider, when called withos.Getenv("GADFLY_MODEL")fromnewReviewExecutor.Not confirmable (the finding's actual impact): Whether
model.MaxContextTokensResolvingactually issues an authenticated network request to ollama.com carrying that key for a non-Ollama, catalog-miss model lives entirely in theexecutus/modelpackage. That source is outside the sandbox-allowed path (/tmp/gadfly/repoonly, no vendor dir, module cache access blocked), so I cannot positively confirm any of the three conditions the leak requires: that the call fires for non-Ollama models, that an empty base URL resolves to the default cloud host rather than a no-op, or that the key is attached. The draft itself rated this low-confidence and explicitly unverified.Per the gate rule — keep only what can be positively confirmed — this finding must be dropped: its entire impact rests on unverifiable cross-module behavior.
VERDICT: No material issues found
This PR is largely a refactor (executus re-platform) plus cost-control plumbing. Through the Security lens, the read-only tool surface stays read-only and no new injection/authz sink is introduced.
Checked and clear
entrypoint.shpkill backstop: fixed, non-attacker-controlled patterns inside the CI container;GADFLY_PR_BUDGET_SECSnumerically guarded before use. No injection.cmd/gadfly/tools.goget_diff /diffForPath/windowDiff: the newpath/start_line/limitargs operate purely on the in-memory diff string (substring match + line slicing) — no filesystem access, no path traversal, no new injection sink.gadflyBudgeterror messages /prCallerID(executus.go): expose only token/second counts and a repo#PR audit key — no secrets, not used in any command.resolveContextTokens(executus.go:349): builds an Ollama limit cache from the genericGADFLY_API_KEY; the claimed credential-exposure path depends onexecutus/modelnetwork behavior that cannot be confirmed from this checkout and is gated behind multiple unverified conditions — not a confirmable finding.🎯 Correctness — Minor issues
The finding is confirmed. The code at
cmd/gadfly/tools.go:470-483matches the draft's description exactly:inSection = strings.Contains(ln, path)is an unanchored substring match against the fulldiff --git a/… b/…header line, so apaththat is a substring of another changed file's path (e.g.foo.govsvendor/foo.go) pulls in both sections. This survives verification.VERDICT: Minor issues
cmd/gadfly/tools.go:470(diffForPath) — substring path match over-includes unrelated files. The per-file scoper togglesinSection = strings.Contains(ln, path)against the wholediff --git a/… b/…header line. Because it's an unanchored substring over the entire line, apaththat is a substring of another changed file's path pulls in both files' hunks. Concretely, with changed filesfoo.goandvendor/foo.go, callingget_diff(path="foo.go")returns thevendor/foo.gosection too (its headerdiff --git a/vendor/foo.go b/vendor/foo.gocontains"foo.go"). This partially defeats the per-file scoping precisely on the large PRs the feature targets (more files → more collisions → more diff re-injected than intended — the exact token-burn this PR fights). The existing test only covers a non-collidingy.gocase so it doesn't exercise this. Severity is small in practice since the model usually passes full paths from the changed-files list. Suggested fix: match on a path-boundary token, e.g. requirea/<path>orb/<path>rather than a barestrings.Contains.🧹 Code cleanliness & maintainability — Minor issues
The executus
runpackage source isn't reachable from this sandbox (bash is restricted to the repo, and the module isn't vendored). The draft itself flagged this same limitation — it couldn't confirm executus'sDefaultsvs per-RunnableAgentprecedence.However, the finding does not depend on that precedence: the verbatim duplication of the timeout expression across
executus.go:258,executus.go:285, andmain.go:325, plusDefaults.MaxIterationscoexisting with an always-set per-runMaxIterations, is fully confirmed in the repo. The redundancy (multiple sources of truth for one knob) holds whichever layer wins precedence. The finding survives. I'll drop its "Verification" sub-bullet (process meta-commentary) per output rules.VERDICT: Minor issues
🧹 Code cleanliness & maintainability
cmd/gadfly/executus.go:258&:285(withcmd/gadfly/main.go:325) —GADFLY_TIMEOUT_SECS/GADFLY_MAX_STEPSare configured in three/two places, and theConfig.Defaultscopies are redundant with the per-run values. The expressiontime.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Secondis written verbatim twice inexecutus.go(once inConfig.Defaults.MaxRuntime, once in the per-runRunnableAgent.MaxRuntime), and the same timeout is also applied as thecontext.WithTimeoutdeadline inreviewWithSpecialist(main.go:325). LikewiseDefaults.MaxIterationsis set fromGADFLY_MAX_STEPS, but everyrun()passes an explicitMaxIterations: maxStepson theRunnableAgent(always non-zero, viaGADFLY_MAX_STEPS/GADFLY_RECHECK_MAX_STEPSdefaults). The otherDefaultsfields —MaxConsecutiveToolErrors,MaxSameToolCallRepeats,CompactionThresholdRatio,FallbackTier— are needed, soDefaultscan't simply be dropped wholesale.Config.Defaults(the natural-looking place) may see no effect, or change one site and not the others. Pure maintainability hazard, not a runtime bug today.MaxIterations/MaxRuntimefromConfig.Defaultsand let the per-RunnableAgentvalues be the single source, or set them only inDefaultsand leave the per-run fields zero. If the per-runMaxRuntimeis kept, hoistenvInt("GADFLY_TIMEOUT_SECS", …) * time.Secondinto a small helper so the expression isn't copy-pasted across the three sites.⚡ Performance — No material issues found
VERDICT: No material issues found
Reviewed strictly through the ⚡ Performance lens. This PR is, on the whole, a performance/cost improvement — its entire purpose is to kill the large-PR token-burn amplifier. I verified the hot paths against the checked-out code and found no material regression.
What I checked:
get_diffno longer dumps the whole diff (tools.go:434-483). The old tool returnedr.diffverbatim, which on a ~1MB diff was re-transmitted every step × lens × pass.windowDiffnow caps each call tomaxGetDiffLines/maxGetDiffBytes(800 lines / 64KB). This is a strict reduction in per-step transmitted bytes — the core win. Confirmed.main.go:361-377,recheck.go): review embed stays at 60K, recheck embed dropped 60K→20K. Built once per task string, not per step. The per-step re-send of that bounded string is inherent to the agent loop and unchanged; compaction (executus.go) now folds the runaway transcript middle, which is a net reduction. No new amplification.resolveContextTokens(executus.go:349-363) does a network/api/showcall, but it's invoked once at executor-build time (per process, vianewReviewExecutorcalled once inrun()), not per pass/step. 15s timeout bounds it. Not on a hot path.gadflyBudgetlock/addUsage/Checkare O(1) mutex ops;Checkgating before model calls is negligible.main.gorunSpecialists) preservesMaxConcurrent: lensConcurrency()(default 1 → sequential); no oversubscription change.Non-blocking observation (a micro-optimization, not flagged as a finding):
windowDiffanddiffForPatheach dostrings.Split(r.diff, "\n")on the entire diff perget_diffcall — and apath-scoped call splits the full diff twice (once indiffForPath, again inwindowDiffon the extracted section). On a huge PR, a model paging through it re-allocates the full line slice on every call. The magnitude (a few ms / a transient slice of N strings) is trivial next to model inference and network latency, and it still strictly beats the old "send the whole diff every step" behavior, so I'm not flagging it as a regression. Worth a comment only ifget_diffever moves into a tight programmatic loop.🧯 Error handling & edge cases — Minor issues
I've verified the single finding against the actual code. The claim is confirmed:
gadflyBudget.Check(executus.go:123–136) readsb.tokens.b.tokensis only mutated byaddUsage(executus.go:147–155).reviewExecutor.run(executus.go:279–304) callsr.ex.Run(...)and thenr.budget.addUsage(res.Usage)afterward (line 295).So token usage is fed to the budget only once per completed pass, and
Checksees a static token count within any single run — the token gate cannot stop a runaway first pass mid-flight, only subsequent passes. The finding is accurate, and the mitigations it cites (per-passCommitseconds, per-lensMaxRuntime, entrypointGADFLY_PR_BUDGET_SECS) all exist in the code. The finding survives.VERDICT: Minor issues
The new code is, on the whole, defensively written — nil-safe budget methods, guarded
start/limitclamping inwindowDiff, friendly (non-error) returns for empty/unknownget_diffpaths, panic recovery inside the fanout closure, and graceful "reviewer failed/did not run" notices instead of crashes. One genuine edge-case gap is worth noting.Findings
cmd/gadfly/executus.go:283(and:123–155, thegadflyBudgettoken gate) — the per-PR token budget can't bite mid-pass, only between passes.Budget.Checkis the executor's pre-call gate, but it readsb.tokens, which is only updated byreviewExecutor.runcallingr.budget.addUsage(res.Usage)afterex.Runreturns (therun.Ports.Budgetinterface only carriesCommit(seconds), with no per-call token feed — tokens are fed "out-of-band" once per finished pass). Consequence: the very first lens/pass on a pathological PR runs to completion regardless ofGADFLY_PR_TOKEN_BUDGET; the cap only stops subsequent lenses. Since the whole PR is about stopping one PR from draining a usage block, the in-binary token guard is weaker than its name suggests.GADFLY_MAX_STEPS× the (compacted) context window, and the wall-clock layers (GADFLY_PR_TIME_BUDGET_SECSper pass viaCommit, the per-lensGADFLY_TIMEOUT_SECSMaxRuntime, and entrypoint'sGADFLY_PR_BUDGET_SECS) are the real swarm backstop. So this is a real-but-mitigated granularity gap, not a hole.GADFLY_PR_TOKEN_BUDGETis enforced at pass granularity (a single runaway pass can overshoot), or, if executus exposes an incremental usage hook, feedaddUsageper step rather than per pass.deepseek-v4-pro:cloud (ollama-cloud) — Blocking issues found
Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
Verdict: No material issues found
I traced the full tainted-input flow through the new and modified code — from environment variables and API responses through to where they're consumed (URL construction, shell commands, markdown output, filesystem paths, and network calls). Here's what I checked and why nothing rises to a material security finding:
entrypoint.shPRin API URLs (line 197, new; also pre-existing at lines 78, 89, 130, 146, 391):PRis interpolated into Gitea API URL paths without validation. If it contained../sequences, it could traverse the API surface. However,PRoriginates from the CI event context (${PR:?PR required}), which is a pull-request number set by Gitea Actions — not attacker-controlled input. This is the same pattern used throughout the pre-existing script. Not a new risk.entrypoint.shpkill -fbackstop (lines 342–345, new): The hardcoded patterns/usr/local/bin/gadflyand${SCRIPTS_DIR}/run.sh(whereSCRIPTS_DIR=/app/scripts, hardcoded on line 58) are used withpkill -f. No shell injection is possible because neither pattern incorporates any external input. The-fflag matches full command lines; in a single-purpose container this is safe.entrypoint.shGADFLY_NOTICEin PR comments (lines 210, 369, 396;run.shlines 236–237): The notice text is constructed entirely by the script itself. The only interpolated values arePR_DIFF_BYTES(fromwc -c | tr -d '[:space:]', always numeric) andGADFLY_PR_BUDGET_SECS(validated as integer on line 337). No user or API content enters the notice. Markdown injection is not possible here.entrypoint.shGITEA_TOKENin clone URL (line 153, pre-existing): The token is embedded inCLONE_URL. This is a pre-existing pattern; stderr is redirected to/dev/nullon thegit clonecall (line 157), so the URL isn't echoed. Not new.tools.godiffForPathsubstring matching (lines 470–483, new): Thepathargument from the LLM is used only instrings.Containsagainst diff header lines — no filesystem access, no regex interpretation. A malicious/hallucinated model could pass a short path likeato match many files, but that's a correctness concern, not a security one.executus.goresolveContextTokens(lines 349–363, new): Makes a one-time network call to Ollama Cloud's/api/showusing the API key from the environment. ThemodelSpeccomes fromGADFLY_MODEL(CI config), not from untrusted input. The call has a 15-second timeout. No SSRF vector from attacker-controlled input.executus.goprCallerID(lines 308–315, new): Constructs a caller ID fromGADFLY_REPOandGADFLY_PR, both set byentrypoint.shfrom trusted sources (REPO_PATHextracted fromGITEA_API, andPRfrom the CI event). Used only as an audit/budget key, not in any security decision.All
envInt/strconv.ParseFloatparsing of environment variables: Every new env var (GADFLY_PR_TOKEN_BUDGET,GADFLY_PR_TIME_BUDGET_SECS,GADFLY_COMPACT_RATIO,GADFLY_MODEL_CONTEXT_TOKENS,GADFLY_COMPACT_KEEP_RECENT,GADFLY_COMPACT_SUMMARY_WORDS,GADFLY_HUGE_DIFF_*,GADFLY_PR_BUDGET_SECS) is parsed with proper type conversion and bounds-checking. Invalid values fall back to safe defaults.🎯 Correctness — Minor issues
Verdict: Minor issues
diffForPathsubstring false-positive matching (cmd/gadfly/tools.go:475): The function usesstrings.Contains(ln, path)against the fulldiff --git a/<path> b/<path>header line. This produces false positives when one file's path is a substring of another's — e.g.,path="foo"matches bothfoo.goandfoobar.go, andpath="a"matches every diff header (all containa/). The recheck model uses this to pull specific file hunks for verification; a false positive feeds it the wrong file's diff, potentially causing it to confirm a finding against incorrect code. The fix is to anchor the match, e.g., check for"a/" + pathor"b/" + pathas distinct substrings:strings.Contains(ln, "a/"+path) || strings.Contains(ln, "b/"+path). Verified by readingtools.go:470-483and tracing the substring logic against the knowndiff --git a/<path> b/<path>header format.windowDifftrailing-empty-line artifact (cmd/gadfly/tools.go:435,478-479):diffForPathappends"\n"after every line including the last, so the scoped diff always ends with a trailing newline.windowDiffthen callsstrings.Split(diff, "\n"), which produces a spurious empty final element. This empty string is emitted as a numbered blank line (e.g.,"42\t\n") and inflates thetotalline count by 1. The paging hint then reports an off-by-one total. The full-diff path (r.diffwithoutdiffForPath) may or may not end with a newline, making the behavior inconsistent. Impact is cosmetic (a blank line in the tool output and a slightly wrong line count in the truncation notice) but it is a logic error. Fix:strings.TrimSuffix(diff, "\n")before splitting, or skip empty final elements. Verified by readingdiffForPathattools.go:478-479(always appends\n) andwindowDiffattools.go:435(splits on\n).🧹 Code cleanliness & maintainability — Minor issues
Both findings confirmed. Let me output the corrected review.
Verdict: Minor issues
Stale doc comment references removed fallback (
cmd/gadfly/main.go:44-45): The package-level doc comment still says "Plus a tool-free finalization fallback guarantees a step-exhausted pass still emits output." ButforceFinalAnswer,finalizeInstruction, and the entire fallback mechanism were removed in this PR (the executusrun.Resultdoesn't expose the transcript). The comment is now misleading — a future reader would expect a fallback that no longer exists. Verified by readingmain.go:42-45and confirmingforceFinalAnswer/finalizeInstructionare absent from the file.Test targets wrong env var (
cmd/gadfly/recheck_test.go:52):TestBuildRecheckTaskcallst.Setenv("GADFLY_MAX_DIFF_CHARS", "")to clear any override, butbuildRecheckTaskwas changed to readGADFLY_RECHECK_DIFF_CHARSinstead (seerecheck.go:94). The test passes only because the fixture diff is tiny and fits within the 20,000-char default. TheSetenvline is dead — it clears a variable the function no longer consults. Verified by reading bothrecheck.go:94andrecheck_test.go:51-63.⚡ Performance — Minor issues
The finding is confirmed.
diffForPathat line 473 doesstrings.Split(diff, "\n")on the full diff every timeget_diffis called with apatharg (line 415), andnewRepoFS(line 48-66) stores the diff as a raw string with no pre-index. No caching or indexing exists anywhere else in the codebase. The finding survives.Verdict: Minor issues
diffForPathre-splits the full diff on every scopedget_diffcall (cmd/gadfly/tools.go:470-483). When the model callsget_diffwith apatharg (which the tool description now encourages for large PRs),diffForPathdoesstrings.Split(diff, "\n")on the entire diff — for a ~1MB / ~250K-line PR, that allocates a ~250K-element string-header slice (~4 MB) just to extract one file's section. If the model investigates N files across its steps, this allocation repeats N times (O(N × diff_size)). A one-time pre-index atnewRepoFStime (e.g., amap[string]stringof path→hunk, or a line-offset index) would make each scoped lookup O(1) instead of O(diff_size). The pagination itself is the primary burn fix and this is gated by the model's step budget, so impact is moderate — but it's the one place the large diff is still fully materialized on every tool call.🧯 Error handling & edge cases — Blocking issues found
Blocking issues found
diffForPathsubstring match returns wrong file's diff (cmd/gadfly/tools.go:475):strings.Contains(ln, path)matches the path as a substring anywhere in thediff --githeader line. If a PR changes bothfoo.goandfoo_test.go, asking forpath: "foo.go"will also match thefoo_test.goheader (since"foo.go"is a substring of"a/foo_test.go"), returning both files' hunks instead of just the requested one. This defeats the per-file scoping that's the whole point of the paginatedget_difffor large PRs — the model gets extra (potentially large) diff content it didn't ask for, wasting tokens. The fix is to match the path as a whole path component, e.g. by checking that the path appears bounded by/or whitespace/quote in the header line, or by splitting the header on spaces and checking each path segment. Verified by readingtools.go:470-482and confirming the test attools_test.go:200uses non-colliding names (xandy.go), so this edge case is untested.resolveContextTokenssilently disables compaction on network errors (cmd/gadfly/executus.go:357-360): When themodel.MaxContextTokensResolvingcall fails (network timeout, API error, DNS failure), the function returns 0, which silently disables compaction. The caller at line 232 treatswindow > 0as the gate — a 0 means "no compaction." But a network error is not the same as "model unknown"; the model may be perfectly valid and the network just temporarily unavailable. The result is that compaction is silently disabled for the entire process lifetime with no log message, and the large-PR burn protection is gone. The function should at least log the failure to stderr (as other non-fatal degradations in this codebase do, e.g.main.go:176). Verified by readingexecutus.go:349-363and confirming no error logging path exists.windowDifftruncation message off-by-one (cmd/gadfly/tools.go:461): The message saystruncated at line %dusingi(0-based index of the first unemitted line), but the last emitted line wasi(1-based). For example, after emitting lines 1–800,i=800(0-based), and the message reads "truncated at line 800 of 850" — but line 800 was the last line shown; truncation actually happened at line 801. Thestart_linehint correctly usesi+1(801), so the paging instruction is right, but the "truncated at" number is misleading. Should bei+1for consistency. Verified by readingtools.go:434-463.Budget watchdog race:
kill $KILLER_PIDmay not prevent the delayed SIGKILL (entrypoint.sh:364): The watchdog subshell runssleep $BUDGET; pkill -TERM ...; sleep 5; pkill -KILL .... If the budget expires and the watchdog enters itssleep 5between SIGTERM and SIGKILL, and the reviews then finish naturally during that 5-second window, the main script'skill "$KILLER_PID"sends SIGTERM to the subshell. However, the subshell is a bash process runningsleep 5— SIGTERM will interruptsleepbut the subshell will then continue to the next command (pkill -KILL), which would fire against the consolidation pass that's about to start. The comment on line 362–363 says the kill is meant to "disarm the watchdog so its delayed SIGKILL can't catch the consolidation pass," but the disarm is not atomic with respect to thesleep 5; pkill -KILLsequence. A more robust approach would be to use a sentinel file that the watchdog checks before firing SIGKILL, or tokill -KILLthe watchdog subshell itself rather than SIGTERM. Verified by readingentrypoint.sh:337-364.glm-5.2:cloud (ollama-cloud) — Minor issues
Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
VERDICT: No material issues found
Through the 🔒 Security lens, I verified the taint paths in this PR and found nothing material:
get_diffpatharg (tools.go:407-426) — the model-suppliedpathis not run throughrepoFS.resolve()/contains(), but it doesn't need to be: it never touches the filesystem.diffForPathonly filters the already-loadedr.diffstring (in memory) and the result is then bounded bywindowDiff(maxGetDiffLines/maxGetDiffBytes). No sandbox-escape or path-traversal vector is introduced; a hostile PR can't read outside the checkout via this tool. The substring-matching indiffForPath(tools.go:470-482) can over-scope (e.g.path:"go"matches every*.goheader), but that only returns diff content the model could already page through — bounded by the same caps — so it's a correctness nit, not a security leak. Verified by readingtools.go:401-483.read_file/grep/find_filesstill go throughrepoFS.resolve()→contains()(tools.go:68-100), and the worker sub-agent (delegate.go:42) usesworkerToolbox()which builds on the same sandboxed tools. The PR doesn't weaken containment. Verified viagrep/read_file.executus.go:349-364) —resolveContextTokensforwardsGADFLY_API_KEY/OLLAMA_API_KEYtomodel.NewCloudOllamaLimitCache. The key is sent to introspect the model's context window; the target is the same provider that already receives the model traffic, and it's never logged or returned in review output. This is expected behavior, not a credential leak. (I could not inspect the executus-side default base URL for the""first arg since the dependency source isn't in this repo, but that's executus's contract to honor, not a gadfly-introduced exposure — low confidence only.)entrypoint.sh:335-349,executus.go:114-155) — the wall-clock killer uses hardcodedpkill -fpatterns (/usr/local/bin/gadfly,/app/scripts/run.sh), no env interpolation into the process pattern, so no command-injection surface. The Go-sidegadflyBudgetis purely advisory internal accounting with no authz decisions. Verified by reading both.executus.go:130-133), which surface as run errors, not as posted review markdown; the caller (reviewWithSpecialist) renders a generic "failed to complete" notice. No usage figures reach the PR comment. Verified viaexecutus.go:279-304and the existingreviewWithSpecialistpath.No injection, SSRF, authz gap, secret leak, or untrusted-data mishandling issue introduced by this change in my lane.
🎯 Correctness — Minor issues
The code at line 475 confirms
inSection = strings.Contains(ln, path)— an unanchored substring match on thediff --githeader. The tool description (line 402) promises "returns ONLY that file's diff hunks," but a short/ambiguous path substring (e.g.foo) would match every file whose path containsfoo(e.g.pkg/foobar.go). The existing test only exercises a full filename (y.go) that happens to match cleanly, so it doesn't catch the over-broadness. The finding survives.The other two draft notes are self-described as "not a finding" / "no bug," so they don't belong in the findings list.
VERDICT: Minor issues
cmd/gadfly/tools.go:475(diffForPath unanchored substring match breaks the path-scoping contract).diffForPathmatches thepathargument as a substring of thediff --githeader line viastrings.Contains(ln, path), with no path-boundary anchoring. Theget_difftool'spatharg description promises "returns ONLY that file's diff hunks," but a short/ambiguous path (e.g.foo) returns hunks for every file whose path merely contains that substring (e.g.diff --git a/pkg/foobar.go b/pkg/foobar.go). The existing test only exercises a full filename (y.go) that happens to match exactly, so it doesn't surface the over-broadness. Fix: split the header and compare against the exacta/<path>/b/<path>tokens, or at minimum anchor with delimiters.🧹 Code cleanliness & maintainability — Minor issues
VERDICT: Minor issues
cmd/gadfly/tools.go:113-127—toolbox()is now dead in production; its tool-membership logic is duplicated ingadflyToolRegistry. The deletedrunAgentwas the only production caller of(*repoFS).toolbox(). Verified by grep: onlytools_test.go:38andtools_test.go:284remain. The production path now goes throughgadflyToolRegistry(executus.go:80-94), which re-implements the samefsTools()+ conditionaldelegateTool()assembly. Two sources of truth for "which tools the reviewer exposes."cmd/gadfly/executus.go:82-84vscmd/gadfly/tools.go:117-119— duplicated tool-list construction.gadflyToolRegistryre-derivesfs.fsTools()+if fs.worker != nil { append(fs.delegateTool()) }, byte-for-byte the membership logic oftoolbox(). Verified by reading both functions. Adding a new built-in tool tofsTools()requires both paths to stay in sync by hand.cmd/gadfly/executus.go:285—GADFLY_TIMEOUT_SECSre-read per pass, duplicating theDefaults.MaxRuntimeset at build time.newReviewExecutorsetscfg.Defaults.MaxRuntimefromenvInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)(line 258), andrun()re-reads the same env with the same default to buildMaxRuntimefor eachRunnableAgent(line 285). Verified: both sites resolve the same knob. Duplicated env resolution for one logical value; a drift hazard if either is ever cached or changed independently.cmd/gadfly/executus.go:19-21— docstring claims "degrades to today's behavior," but the old forced-finalization fallback was removed by this PR. The file-level comment says "Everything degrades to today's behavior when unconfigured," yet the prior transcript-basedforceFinalAnswer/finalizeInstructionpath was deleted (confirmed in the diff and the renamedTestReviewExecutor_ExhaustionWithoutAnswerIsErrortest comment). On empty-output exhaustion the new code returns an error where the old code forced a final answer, so it does not fully degrade to pre-PR behavior. Wording overpromises; trivial but misleading.entrypoint.sh:208—MAX_DIFF_CHARSenv name is a shell-only alias invisible from the Go side. The downshift setsexport MAX_DIFF_CHARS=...with a comment "run.sh -> GADFLY_MAX_DIFF_CHARS." Confirmed:scripts/run.sh:180translatesMAX_DIFF_CHARStoGADFLY_MAX_DIFF_CHARSwhen invoking the binary, and the Go tree reads onlyGADFLY_MAX_DIFF_CHARS(main.go:361) /GADFLY_RECHECK_DIFF_CHARS(recheck.go).MAX_DIFF_CHARSappears nowhere in Go. The indirection lives entirely in shell, documented only by the inline comment.None are blocking; the code compiles and the structure is broadly clean. The main maintainability risk is the duplicated reviewer-tool membership logic between
toolbox()andgadflyToolRegistry.⚡ Performance — No material issues found
Verdict: No material issues found
Through the performance lens I checked the hot paths this PR touches:
get_diffpagination (cmd/gadfly/tools.go:434-464):windowDiffdoesstrings.Split(diff, "\n")on every call, allocating the full line slice even when only an 800-line window is emitted. For a huge (~1 MB) diff this is O(diff) work per call rather than O(window). I considered flagging it, but it's a micro-optimization: the split is microseconds against model-call latencies of seconds, andget_diffis called a handful of times per pass, not in a tight loop. The output is correctly bounded (maxGetDiffBytes/maxGetDiffLines), so there's no unbounded growth. Not material.diffForPath(tools.go:470-483): single linear scan per path-scoped call; fine.executus.go):resolveContextTokensruns once at executor build (memoized via the closure),addUsage/Commitare O(1) under a short mutex per pass — no per-step or per-token hot-path overhead introduced.runSpecialistsfan-out (main.go:247-286): concurrency semantics are unchanged (still bounded bylensConcurrency());fanout.Runreturns one result per input with no extra copying of note.No efficiency regressions (N+1, unbounded growth, quadratic behavior, hot-loop work, or missing limits) were found in this change.
🧯 Error handling & edge cases — Minor issues
All three findings verified against the actual source.
VERDICT: Minor issues found
Backstop race can kill the consolidation pass (
entrypoint.sh:338-364). The watchdog subshell runssleep $BUDGET; pkill -TERM …; sleep 5; pkill -KILL -f '/usr/local/bin/gadfly'. Disarm iskill "$KILLER_PID"(line 364) afterwait "${LANE_PIDS[@]}"returns (line 360). If the budget expires at the same moment the lanes finish, the subshell has already passed its secondsleepand is executingpkill -KILL; killing the subshell PID does not reliably cancel an already-runningpkillchild. The subsequent consolidation step (line 392-393) invokes/usr/local/bin/gadfly, so a late KILL pkill can terminate it — defeating the "partial findings still posted" guarantee. Suggested fix: gate the KILL pkill on the.budget_killedsentinel, or have the disarm alsopkillthe subshell's children / set a flag the pkill block checks before issuing KILL.diffForPathsubstring match over-selects files (cmd/gadfly/tools.go:470-483). Matching isstrings.Contains(ln, path)on thediff --githeader line, so a shortpathlikexalso matchesdiff --git a/xy b/xy(andax, etc.). The tool returns an inflated/incorrect diff window for that "file" rather than the requested one, with no error. For a reviewer scoping a finding to one file this silently feeds it the wrong hunks. Suggested fix: match the path against thea//b/path tokens exactly.Startup-blocking network call for context window (
cmd/gadfly/executus.go:349-364).resolveContextTokensmakes a live call (up to 15s timeout viacontext.WithTimeout) duringnewReviewExecutor, called on the review startup path (main.go:178) before any lens runs. When the Ollama Cloud endpoint is unreachable/slow, every model lane eats up to 15s of dead time before compaction silently disables (returns 0). It degrades safely, but the blocking-on-startup placement is an edge case worth noting; a cached/async resolution would avoid stalling small PRs that happen to point at a flaky endpoint.kimi-k2.6:cloud (ollama-cloud) — Minor issues
Verdict: Minor issues · ⚠️ 1/5 lens(es) errored — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
Verdict: No material issues found
Reviewed through the Security lens. I traced all new code paths that touch untrusted input (the PR diff, LLM tool arguments, environment variables) and checked for authn/authz gaps, injection vectors, SSRF, credential leakage, and unsafe data handling.
Key areas verified:
getDiffArgs.Path/diffForPath(cmd/gadfly/tools.go:414–482): The newPathparameter is not used for filesystem access — it is only passed tostrings.Containsas a substring filter against the already-in-memory unified diff. No path traversal, directory escape, or file-read amplification is possible. TherepoFS.resolvesandbox is never invoked.windowDiffpagination (cmd/gadfly/tools.go:434–463):startandlimitare clamped to valid ranges; negative values and overflows are handled.GADFLY_NOTICEin entrypoint.sh / run.sh (entrypoint.sh:210, scripts/run.sh:233–236):PR_DIFF_BYTESis produced bywc -c | tr -d '[:space:]', which is strictly numeric. The notice is therefore not injectable with attacker-controlled content in the intended flow.resolveContextTokensAPI key handling (cmd/gadfly/executus.go:349–363): Keys are read from env and passed tomodel.NewCloudOllamaLimitCache; no logging or exposure of the key.pkillpatterns (entrypoint.sh): Thepkill -farguments are hardcoded literals (/usr/local/bin/gadfly) or derived from hardcodedSCRIPTS_DIR; no user-controlled path components.prCallerIDfunction keys the budget onrepo#pror falls back to"local". The budget struct itself is per-process, so there is no cross-PR budget leakage.The change is clean from a security perspective.
🎯 Correctness — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 11: all chain targets failed
ollama-cloud/kimi-k2.6☁️ model returned an empty response
🧹 Code cleanliness & maintainability — Minor issues
Minor issues
cmd/gadfly/main.go:44— Stale comment about removed finalization fallback. The file-header comment still says "Plus a tool-free finalization fallback guarantees a step-exhausted pass still emits output." That fallback (finalizeInstruction+forceFinalAnswer) was explicitly dropped in this PR (the executus re-platform removes the transcript-based forced-finalization path, confirmed by the updatedwrapup_test.gocomment and the deletion of both the const and function frommain.go). The comment now misleads readers about behavior that no longer exists. Fix: Remove that sentence from the comment, or rephrase to say the wrap-up critic is the sole mechanism.cmd/gadfly/recheck_test.go:52— Test sets the wrong environment variable.TestBuildRecheckTaskcallst.Setenv("GADFLY_MAX_DIFF_CHARS", ""), butbuildRecheckTaskwas changed in this PR to readGADFLY_RECHECK_DIFF_CHARS(seerecheck.go:94). Because the test diff is tiny, truncation never triggers and the test still passes, but the test is now exercising a dead env var instead of the new one. Fix: Change theSetenvcall toGADFLY_RECHECK_DIFF_CHARS, and add a short diff + explicit override value so the truncation path is actually covered.⚡ Performance — No material issues found
No material issues found
The diff is entirely additive performance protections for large PRs: context compaction at a deliberately aggressive 0.45 threshold, paginated
get_diffcapped at 800 lines/64 KB per call, a smaller 20 k-char recheck diff embed, per-PR token/time budget gates, andfanout.Runreplacing the existing semaphore-based lens fan-out with equivalent concurrency behavior. No allocations or work were added to hot paths for small PRs, and no quadratic or unbounded-growth patterns were introduced.```gadfly-findings[]