feat: cross-model consensus consolidation (one ranked comment, not N walls) #17
Reference in New Issue
Block a user
Delete Branch "feat/consensus-consolidation"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
Phase 2 of the review-representation work. With ≥2 models, the swarm posts one consensus comment instead of a per-model comment each.
Every model writes its structured findings to a shared dir; after the swarm finishes, a consolidation pass clusters findings by location (±3 lines), counts how many models independently flagged each, escalates to the highest reported severity, and renders an agreement-ranked table:
Why
Cross-model agreement is the strongest real-vs-false-positive signal we have (the PR #16 review showed every model re-finding the same issue independently). Ranking by it surfaces the findings that matter and folds the lone nits — directly attacking both the representation noise and the false-positive rate.
How
consensus.go:modelFindingsartifact,collectFindings(shared with telemetry emit — one definition of "what a finding is"),writeFindingsOut,runConsolidate, clustering +renderConsensus. Lone low-severity findings fold away; a lone critical still surfaces; each model's full review is preserved folded inside the one comment.main.go:GADFLY_CONSOLIDATE_DIRdispatches to consolidation mode;writeFindingsOutafter a normal review.run.sh: when consolidating, write findings + skip the per-model comment (live progress still on the status board).entrypoint.sh: auto-enable for ≥2 models; run the consolidator after the barrier and upsert ONE comment; fall back to per-model comments if consolidation yields nothing — advisory invariant preserved, output is never lost.consolidateinput) + entrypoint docs updated.Controlled by
GADFLY_CONSOLIDATE:auto(default),1,0. Single-model runs unchanged.Tests
Clustering/agreement/tolerance, severity escalation, nit folding, lone-critical surfacing, write→consolidate round-trip, empty-dir error.
bash -nclean on both scripts; gofmt/vet clean.🤖 Generated with Claude Code
🪰 Gadfly — live review status
7/7 reviewers finished · updated 2026-06-28 22:51:38Z
claude-code/opus· claude-code — ✅ doneclaude-code/opus:max· claude-code — ✅ donedeepseek-v4-pro:cloud· ollama-cloud — ✅ donedevstral-2· ollama-cloud — ✅ doneglm-5.2:cloud· ollama-cloud — ✅ donekimi-k2.6:cloud· ollama-cloud — ✅ doneqwen3.5:397b-cloud· ollama-cloud — ✅ doneLive status board. Findings are posted in each model's own comment. Advisory only — does not block merge.
🪰 Gadfly review —
claude-code/opus(claude-code)Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
VERDICT: No material issues found
I traced the tainted-data path this PR introduces — PR diff → model review → structured findings → per-model JSON artifact → consolidation → single PR comment — and found no new security-relevant sink.
What I checked (by reading the actual code, not just the diff):
jq -n --arg b "$body" '{body:$b}'inupsert_comment_body(entrypoint.sh:71) and the fallback loop reads model fields withjq -r '.markdown // ""'then re-encodes via the same--argpath. Attacker-influenced model text (titles, detail, full markdown) is never interpolated into a JSON string or a shell command unquoted. Safe.findings_file_for(entrypoint.sh:209) sanitizes the model id withtr -c '[:alnum:]._-' '_', identical to the pre-existingstatus_file_for./is collapsed, so a../-style model id can't escapeFINDINGS_DIR.GADFLY_FINDINGS_OUT/GADFLY_CONSOLIDATE_DIRare operator-set (entrypoint), not PR-controlled.runConsolidateonly reads*.jsonunder the given dir andjson.Unmarshals them. Safe.mdCell(consensus.go) escapes|and collapses newlines before findings titles/lenses go into the markdown table, so a crafted finding title can't forge extra table columns/rows in the headline.modelFindingscarries only model/provider/verdict/markdown/findings; no keys or tokens are serialized to the artifact files (mode0o644, inside the container workdir).writeFindingsOutno-ops whenGADFLY_FINDINGS_OUTis empty (consensus.go:78).One pre-existing, out-of-scope note (not introduced here, not counted as a finding): the per-model full review is embedded raw inside
<details>in both the consensus body and the fallback comment, so model output containing</details>could break out of the fold and inject top-level markdown. This is identical to the existing per-model path, which already posts$REVIEWraw, so this PR does not widen exposure — and Gitea sanitizes rendered HTML. Worth a future hardening pass, but no regression here.🎯 Correctness — Minor issues
VERDICT: Minor issues
I reviewed the consensus consolidation logic through the Correctness lens, verifying cross-file symbols against the checked-out code (
consolidate.go,emit.go,findings.go,main.go,run.sh,entrypoint.sh). The clustering math, severity escalation, tolerance matching, lone-critical surfacing, and theauto/1/0resolution flow (GADFLY_CONSOLIDATEempty→autovia:-, then numericCONSOLIDATEpassed torun.sh) all check out and match the tests. ThecollectFindingsrefactor inemit.gois behavior-preserving. A few real issues:cmd/gadfly/consensus.go:88— an all-errored model is recorded as a clean participant, polluting the consensus denominator.writeFindingsOutstoresVerdict: worstVerdict(results).label(). I verifiedworstVerdict(consolidate.go:82) ignoresr.erroredand only takes the max ofr.verdict, starting fromverdictClean— so a model whose every lens errored (e.g. all timed out) yieldsverdictClean→ label"No material issues found", andcollectFindingsreturns nothing for errored lenses, so it writes a file with zero findings.runConsolidatethen counts it intotal(renderConsensus:177) and folds it intoworst. Net effect: a model that produced no actual review appears as a clean reviewer, and a real finding from the one working model in a 2-model run shows1/2(looks like weak agreement) instead of1/1. NoterenderConsolidateditself already distinguishes this case ("Review incomplete — all lenses errored",consolidate.go:107), so the storedVerdictfield is inconsistent with the model's own rendered markdown. Fix: when every lens errored, either skipwriteFindingsOutfor that model or stamp a non-clean/"incomplete" verdict and exclude such artifacts fromtotal. (Verified by readingworstVerdict,collectFindings, and thetotal/worstuse inrenderConsensus.)cmd/gadfly/consensus.go:273— tolerance window is anchored at the cluster's minimum line, so it doesn't slide; chains of nearby findings under-merge. The match test isabsInt(c.line - it.f.Line) <= lineToleranceandc.lineis pinned to the smallest line in the cluster (mergeIntoCluster:319only ever lowers it). Items are line-sorted, so findings at e.g. lines 1, 4, 7 produce two clusters (1+4 merge; 7 is 6 from the anchor and splits) even though 4 and 7 are within tolerance of each other. Since cross-model agreement is the headline signal the whole feature ranks on, this can undercount agreement for the same issue cited a few lines apart. Low severity / partly a design tradeoff (the ±3 covers the common "line or two apart" case), but flagging since it directly affects the agreement count. (Verified by tracingclusterFindings+mergeIntoCluster.)entrypoint.sh:317/scripts/run.sh:50— switching consolidation mode on/off (or model count crossing 2) on the same PR orphans the other style's comment. The consensus comment uses marker<!-- gadfly-consensus -->while per-model comments use<!-- gadfly-review:ollama:<model> -->; neither path deletes the other. A PR first reviewed with 1 model (per-model comment) then re-run with ≥2 models (consensus) leaves the stale per-model comment behind, and vice-versa. Low severity, but produces duplicate/contradictory comments. Suggest deleting the opposite-marker comment(s) when posting. (Verified markers in both scripts; confirmed no cleanup of the other marker.)I could not run
go build(privatemajordomomodule is gated in this sandbox), but I confirmed there are no duplicate definitions of the new helpers (absInt/plural/sevRank/etc.) and all referenced symbols (parseVerdict,verdictClean,renderConsolidated,extractStructuredFindingsOrScrape,modelProvider,findingfields) exist with matching signatures, so the new file should compile.🧹 Code cleanliness & maintainability — Minor issues
I've verified all three findings against
cmd/gadfly/consensus.go. Findings 1 and 3 are confirmed exactly as described. Finding 2's substance is confirmed (the anonymous struct type is duplicated and a named type would help), but its claimed count is wrong — the type is spelled out twice (map value type at:250, append element type at:259), not three times. I've corrected that detail and kept the finding.VERDICT: Minor issues
cmd/gadfly/consensus.go:164—cluster.detailis write-only dead state. The field is declared (:164), populated when a cluster is created (:285) and on severity escalation inmergeIntoCluster(:328), withoutFinding.Detailplumbed all the way into it — but it is never read or rendered. The consensus table emits onlytitle,severity,location, model count, and lenses; the per-model drill-down usesm.Markdown, not findings. Confirmed by grepping everydetail/Detailsite: the only reads are onoutFinding.Detail/finding.detail, never oncluster.detail. Suggested fix: dropcluster.detailand its two assignments (and optionally theoutFinding.Detailplumbing if the artifact has no other consumer), or actually surface it if that was the intent. Severity: trivial.cmd/gadfly/consensus.go:250-262— the anonymous struct typestruct { f outFinding; model string }is spelled out twice (map value type at:250, composite-literal element type in theappendat:259). A one-line named type (e.g.type fileFinding struct { f outFinding; model string }) declared once would make the grouping self-documenting and theappenda clean composite literal, removing the duplicate spelling that must be kept in sync by hand. Severity: trivial.cmd/gadfly/consensus.go:200-205— the multi-modelagreedcount re-loops overclustersimmediately after the partition loop at:189-196already walked the same slice and testedlen(c.models) >= 2. Minor redundancy; the count could be accumulated in the partition loop. Severity: trivial.Otherwise the new file follows the surrounding conventions well (doc comments per function, stdlib-only, helpers factored sensibly, and the
collectFindingsextraction genuinely de-duplicatesemit.go). Nothing blocking.⚡ Performance — No material issues found
I've reviewed the change through the Performance lens, verifying the clustering algorithm, the helper call paths, and the shell upsert loop against the actual code.
Review
VERDICT: No material issues found
Performance-relevant observations (all checked, none rise to a real concern):
clusterFindingsis ~O(F·C) (cmd/gadfly/consensus.go:267-293): for each finding it linearly scans every cluster created so far (including other files', filtered byc.file == file). This is quadratic in the worst case, but the operand set is tiny — total findings =models × findings-per-model, realistically a few dozen. The whole consolidation runs once, after the swarm, off the hot path; the cost is dominated entirely by LLM inference upstream. Not worth optimizing.collectFindingsnow runs twice per lens result — once inemit()and once inwriteFindingsOut()(cmd/gadfly/main.go:225thenemitat:224). Each call re-runsextractStructuredFindingsOrScrape. I readfindings.go:112— it only parses already-in-memory strings (regex/prose scrape), no I/O. Doubling a microsecond-scale string parse at end-of-run is negligible.renderConsolidatedcalled inwriteFindingsOutto populateMarkdown(consensus.go): builds markdown once per model from in-memory results. Fine.upsert_comment_bodypagination (entrypoint.sh:71-79): sequential API calls up to 10 pages × 50 comments. This mirrors the pre-existingupsert_commentpattern inrun.shand is network-bound CI glue, not a regression introduced here — and it's bounded.Nothing in the performance lens is materially wrong with this change.
🧯 Error handling & edge cases — Minor issues
I've verified both findings against the actual code. Let me confirm the mechanics:
Finding 1 — Confirmed. In
run.sh, the skip cases (no Ollama key atrun.sh:161, binary missing atrun.sh:165) setREVIEWwithout ever invoking the binary, soGADFLY_FINDINGS_OUTis never written; the crash case (run.sh:186) invokes the binary butwriteFindingsOut(main.go:229) only runs on the success path, so a non-zero exit writes no file either. In all three,run.sh:220(CONSOLIDATE=1) suppresses the per-model comment. Inentrypoint.sh, the fallback (entrypoint.sh:311-333) fires only whenCONSENSUSis entirely empty and even then iteratesfor f in "${FINDINGS_DIR}"/*.json— a model with no file is never reconstructed. Net: a failed/skipped model produces no PR signal (only a stderr line). Mechanically accurate.Finding 2 — Confirmed.
entrypoint.sh:238tr -c '[:alnum:]._-' '_'maps both:and/to_, sofoo:barandfoo/barcollide tofoo_bar.json; the laterwriteFindingsOutoverwrites, dropping a model from consensus. Mirrorsstatus_file_for(entrypoint.sh:190), so not a regression. Edge case, accurately scoped.Both survive verification; I've corrected the line anchors (main.go 230→229, entrypoint.sh 236→238).
VERDICT: Minor issues
The happy paths are well-guarded (best-effort
writeFindingsOut,|| truearound the consolidation call, junk-file skipping inrunConsolidate, nil-safe maps,line>0guards). The gaps are on the unhappy path — what happens to a model that errors or is skipped while consolidation is on.Findings
1. (medium) In consolidate mode, an errored/skipped model is silently dropped — its error notice is lost.
cmd/gadfly/main.go:229—writeFindingsOut(results)runs only at the end of the success path; if the binary returns an error,main()(main.go:118) doesos.Exit(1)and no findings file is written. The early "soft" notices inscripts/run.sh(⚠️ No Ollama Cloud key configuredatrun.sh:161,⚠️ Agentic reviewer binary not foundatrun.sh:165) setREVIEWwithout invoking the binary at all → again no findings file. Meanwhilerun.sh:220suppresses that model's per-model comment whenCONSOLIDATE=1. Net effect: in non-consolidate mode each of these produces a visible⚠️ …comment; in consolidate mode they produce silence (only a stderrmodel X failed (continuing)line, and for the skip cases not even that — run.sh exits 0). The fallback inentrypoint.sh:311only fires whenCONSENSUSis entirely empty, and even then it loopsfor f in "${FINDINGS_DIR}"/*.json— a model with no file is never reconstructed. So a swarm where the API key is misconfigured for every model, or several models crash, posts either nothing or a clean-looking consensus, and the operator gets no signal that models failed. Suggested fix: whenCONSOLIDATE=1and the binary fails or is skipped, still write a minimal artifact (or haveentrypoint.shenumerateMODEL_LISTvs. present files and surface the missing/errored ones), so failures stay visible.2. (small)
findings_file_forcan silently collide two distinct model specs onto one file, dropping one from consensus.entrypoint.sh:238—tr -c '[:alnum:]._-' '_'maps both:and/to_, so e.g.foo:bar(bare id) andfoo/bar(providerfoo) sanitize to the samefoo_bar.json; the second model'swriteFindingsOutoverwrites the first, and one model vanishes from the cluster counts/denominator with no error. This mirrors the pre-existingstatus_file_for(entrypoint.sh:190) pattern, so it's low-severity and not a regression, but consensus counting makes a silent drop more misleading here than on the status board. Edge-case; only bites unusual co-configured specs.Everything else in the error-handling lane checked out:
runConsolidatecorrectly errors on an empty dir and tolerates unreadable/unparseable/.json-less/empty-Modelfiles; clustering uses initialized maps and a smallest-line representative that can't drift past tolerance; clean models are correctly counted in theX/totaldenominator; the consolidate-mode early return inrun()sidesteps theGADFLY_REPO_DIR/DIFF_FILErequirement and the telemetry emit; no panic/index/division risks.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 17s
🪰 Gadfly review —
claude-code/opus:max(claude-code)Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — Minor issues
I've verified the finding against the actual code.
Confirmed:
consensus.go:213and:226interpolatelocation(c.file, c.line)directly into the table inside backticks, while every other cell (title,lensList,oneModel,model,provider) is routed throughmdCell(lines 213–214, 225–227, 239–240).c.filetraces to the model'sgadfly-findingsblock:outFinding.File←finding.file←sf.File, onlyTrimSpace'd (findings.go:60). The model reviews the attacker-controlled PR diff, so this value is untrusted.mdCell(consensus.go) escapes|and collapses newlines;location()does neither.|inside a code span still terminates the cell — the inconsistency is real.The finding is accurate and honestly scoped (low impact, not XSS). It survives.
VERDICT: Minor issues
The change is mostly plumbing (write per-model findings → cluster → render one comment); the untrusted-input paths are otherwise clean. One minor output-encoding gap.
Findings
cmd/gadfly/consensus.go:213/:226— attacker-influencedfileinterpolated into the consensus table without escaping.c.fileoriginates from the model'sgadfly-findingsblock (extractStructuredFindings,findings.go:60→sf.File, onlyTrimSpace'd). The model reviews the attacker-controlled PR diff and can be prompt-injected, sofileis untrusted. InrenderConsensus,title,lensList, andoneModelall pass throughmdCell(which escapes|and collapses newlines), butlocation(c.file, c.line)does not. Afilevalue containing|breaks the markdown table layout (a pipe inside a code span still terminates a GFM table cell), letting a crafted PR garble or smuggle text into the consensus comment. Impact is low — Gitea sanitizes raw HTML so this isn't XSS — but it's an inconsistency in untrusted-data handling. Fix: wrap it, e.g.mdCell(location(c.file, c.line)), or escape insidelocation.Checked and found safe (not reported):
entrypoint.sh, the fallback loop'sm/prov/mdcome from JSON viajq -rand are only passed asprintf '%s'args andjq -n --argvalues — nevereval'd.findings_file_foris called only with operator-configured model names and sanitizes viatr -c '[:alnum:]._-' '_'(slashes stripped).GADFLY_FINDINGS_OUT/GADFLY_CONSOLIDATE_DIRare operator-set, not PR-derived.upsert_comment_bodyonly curls the fixed${GITEA_API}with the existing token header.tail -c 500 consolidate.errcarries parse/IO errors only.mdCellis added hardening for title/lens/model; the embedded foldedm.Markdownis the same trust level as the pre-existing per-model comment, so no new exposure.🎯 Correctness — Minor issues
Both findings are confirmed against the actual code. Let me verify the line numbers precisely:
consensus.go:273isc.file == file && absInt(c.line-it.f.Line) <= lineTolerance— no title/lens/identity check.mergeIntoCluster(312–331) unions models/lenses and keeps only the top-severity title/detail. Confirmed.high, Blocking→high): The partition atconsensus.go:191islen(c.models) >= 2 || sevRank(c.severity) >= sevRank("high").collectFindings(67–71) fills missing severity from the lens verdict;verdict.severity()maps Blocking→"high" (consolidate.go:38–39). README/PR say "lone critical"; the only test usescritical. Confirmed.Both survive verification. Corrected line numbers below.
I've verified the dependencies and logic against the checked-out code. All helper functions (
parseVerdict,worstVerdict,modelProvider,renderConsolidated,verdict.severity()/label(),extractStructuredFindingsOrScrape) exist with matching signatures; theemit.gorefactor tocollectFindingsis behavior-preserving; and theGADFLY_FINDINGS_OUTplumbing reaches the binary on the only lane entrypoint uses. The round-trip is sound.Two semantic-correctness issues remain, both about the agreement metric — which is the entire point of this feature.
VERDICT: Minor issues
cmd/gadfly/consensus.go:273(clusterFindings / lineTolerance) — agreement count conflates distinct issues that merely sit near each other. Clustering matches on file + line within ±3 only — there is no check on title, lens, or issue identity (c.file == file && absInt(c.line-it.f.Line) <= lineToleranceis the sole predicate). So if model A flags a perf issue atfoo.go:40and model B flags an unrelated error-handling issue atfoo.go:42, they merge into one cluster reported as 2/N agreement, and only the higher-severity title is displayed while the other finding vanishes. Cross-model agreement is explicitly sold as "the strongest real-vs-false-positive signal," and it's what the table is ranked by — so a false merge both inflates the headline confidence and drops a real finding.mergeIntoCluster(consensus.go:312) unionsmodels/lensesand keeps only the top-severitytitle/detail; nothing guards that the merged findings describe the same problem. Suggested fix: also require lens match (or a title/keyword-similarity gate) before merging, or at minimum surface the distinct titles within a merged cluster rather than discarding all but one. Severity: medium.cmd/gadfly/consensus.go:191— lone-finding headline threshold ishigh, notcriticalas documented, and every finding from a Blocking lens is auto-taggedhigh. The partition islen(c.models) >= 2 || sevRank(c.severity) >= sevRank("high"). But the README ("a lone critical still surfaces") and the PR body both say the single-model carve-out is for critical. Worse,collectFindings(consensus.go:67) fills missing per-finding severity from the lens verdict, andverdict.severity()maps Blocking → "high" (consolidate.go:38-39). So every heuristic-scraped finding from any single model whose lens landed on "Blocking" getshighand surfaces as its own unfolded 🔴 row at1/N— exactly the per-model noise the folding is meant to suppress. The existing test only covers a lone critical (TestRenderConsensusHighSeverityLoneFindingStaysHeadline), so this lone-highpath is untested. Suggested fix: either change the threshold to>= sevRank("critical")to match the stated intent, or update the README/PR wording to "lone high-or-critical" — and decide intentionally whether verdict-derivedhighshould count. Severity: small.Nothing else in the correctness lens: the
emit.go→collectFindingsextraction is equivalent to the prior inline logic, severity escalation and representative-line selection (mergeIntoCluster) are correct, the empty-dir/junk-file/round-trip paths match the tests, and the entrypoint fallback preserves the advisory-never-lose-output invariant.🧹 Code cleanliness & maintainability — Minor issues
I've verified all three findings against the actual code. Here is the corrected review.
VERDICT: Minor issues
cluster.detailis dead state — collected and merged but never rendered (cmd/gadfly/consensus.go:164,:285,:327-329). Confirmed: thedetailfield is declared on theclusterstruct (:164), initialized in the cluster literal (:285), and conditionally overwritten inmergeIntoCluster("keep the highest-severity report's title/detail",:327-329), butrenderConsensus(:176-244) only ever readsc.title,c.severity,c.file,c.line,c.models, andc.lenses—c.detailis read nowhere, in neither the headline table, the folded table, nor the per-model section. So the detail-merge branch and the field carry no behavior. The siblingtitlefield is rendered, which makes the asymmetry easy to miss. Low-churn fix: dropcluster.detailand thec.detail = f.Detailline (theoutFinding.Detailartifact field is unaffected — it's still populated inwriteFindingsOutand consumed by the telemetryreportPayload). If detail was meant to surface in the comment, that's a missing feature rather than dead code, but as written it's dead.Per-model comment header/footer framing is now duplicated across scripts (
entrypoint.sh:329-330vsscripts/run.sh:223). The consolidation fallback path inentrypoint.shre-hand-writes the exact### 🪰 Gadfly review — \%s` (%s)header andAutomated adversarial review by Gadfly. Advisory only — does not block merge.footer thatrun.shalready owns (the only difference is the dropped· ⏱️ reviewed in %s). The header template now lives in three printf sites (run.sh:138placeholder,run.sh:223final,entrypoint.sh:329fallback); a future wording change has to be made in all three or the fallback comments silently diverge from the normal ones. Hard to fully DRY across two bash scripts, but worth a comment cross-referencingrun.sh` so the coupling is visible.⚡ Performance — No material issues found
I've reviewed the consolidation code paths through the performance lens. Let me confirm the data volumes are bounded by checking the finding caps and how often these paths run.
The relevant facts I verified by reading the code:
clusterFindings(consensus.go:248–308) is the only nested loop: for each finding it scans existing clusters (for ci := range clusters). That's worst-case O(T²) in total findings T. But T is tightly bounded —maxFindingsPerLens = 20(emit.go:42) caps scraped findings per lens, ~5 lenses × a handful of models ≈ low hundreds at most. Each comparison is an int subtraction + string compare. This runs once per PR, in a dedicated consolidation process. Not material.collectFindingsis now called twice per run — once inemit()(emit.go:133) and once inwriteFindingsOut()(consensus.go:92), each re-runningextractStructuredFindingsOrScrape/regex over each lens's markdown. This is duplicated parse work, but it's a one-time, end-of-run cost over small in-memory strings, dwarfed by the LLM calls that produced them. Micro-optimization at best, not a justifiable regression.upsert_comment_body/ fallback loop (entrypoint.sh) paginates comments (≤10 pages × 50) and, in the fallback path, does so once per model. Bounded by a small model count and only on the rare fallback. Not material.writeFindingsOut/runConsolidatedo one JSON marshal + one file write per model and one directory read — all linear in the (small) model/finding count, run once.There are no hot loops, no per-request N+1 patterns, no unbounded growth, and no blocking calls on a latency-sensitive path here — this is a once-per-PR CI batch tool operating on bounded data.
VERDICT: No material issues found
clusterFindings) and the doublecollectFindingscall are both bounded, run-once costs that are negligible against the LLM review work; flagging them would be micro-optimization, not a justifiable finding.🧯 Error handling & edge cases — Minor issues
Verification complete. I traced the failure path across
run.sh,main.go(writeFindingsOutruns only at the end of a successfulrun()), and theentrypoint.shconsolidation/fallback block.VERDICT: Minor issues
🧯 Error handling & edge cases
Failed-model output is silently dropped in consolidation mode, and the "fallback never loses output" claim is false for the most common failure.
entrypoint.sh:322/scripts/run.sh:138,220.When
CONSOLIDATE=1,run.shsuppresses both the placeholder comment (run.sh:138) and the final per-model comment (run.sh:220). A model only contributes to consensus by writingGADFLY_FINDINGS_OUT— andwriteFindingsOutruns only at the tail of a successfulrun()(main.go:226). Every unhappy path inrun.sh's ollama branch setsREVIEWto a⚠️ …notice but the binary writes no findings file: no Ollama key (run.sh:161), binary missing (run.sh:165), or binary exit≠0/empty output (run.sh:186). Consequences:entrypoint.sh:322to replay).runConsolidatereturns"no model findings to consolidate"(consensus.go:181) →CONSENSUSempty → fallback loop's glob"${FINDINGS_DIR}"/*.jsonmatches nothing → nothing is posted at all. Before this PR the same misconfig produced N visible⚠️ No Ollama Cloud key configured…comments telling the user exactly what to fix; now the PR gets total silence.The fallback at
entrypoint.sh:322can only replay models that produced a findings file — i.e. the ones that succeeded — which structurally excludes the failed models whose output you'd most want surfaced. The comment atentrypoint.sh:308-310("if the consolidation pass produces nothing, fall back to posting each model's review… so a consolidation hiccup never loses output") only holds for a rendering hiccup with files present, not for the run-failure cases.Suggested fix: have
run.shwrite a minimal findings artifact even on the failure paths (model + provider + the⚠️REVIEWmarkdown, emptyfindings[]) so both consolidation's per-model detail and the fallback can surface it; or, whenCONSOLIDATE=1and a model produced no findings file, fall back to posting that model's⚠️notice as its own comment. Verified by reading the three files end-to-end.Everything else in my lane checks out:
runConsolidateguards empty dir / unreadable / unparseable / empty-model files and continues (consensus.go:160-182);writeFindingsOutis a correct no-op on empty env and best-effort on marshal/write errors;absInt/oneModel/mdCell/line-0 paths don't panic;worst/parseVerdicthandle empty/unknown verdicts without misranking; andfindings_file_for/CONSOLIDATEare defined beforereview_oneis ever invoked (noset -uordering trap).Outside my lens: the headline table prints "No material issues found by consensus." even when the verdict line above it reads "Minor issues" (all findings single-model/low-sev and folded) — a possible correctness/presentation inconsistency for that lens to judge.
Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 58s
🪰 Gadfly review —
deepseek-v4-pro:cloud(ollama-cloud)Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
No material issues found.
I traced the full data flow from untrusted inputs (model output, environment variables, JSON files on disk) through to where they're consumed (shell command construction, file writes, PR comment rendering). Here's what I checked and why nothing rises to a security finding:
Path traversal in
GADFLY_FINDINGS_OUT: The filename is built byfindings_file_for()inentrypoint.sh:238, which sanitizes the model name withtr -c '[:alnum:]._-' '_'— slashes and dots are replaced, so../../../etc/passwdbecomes___________etc_passwd. The directory prefix (${WORKDIR}/findings) is hardcoded. Safe.Command injection in shell scripts: The only dynamic value passed to a command is
$FINDINGS_DIRinentrypoint.sh:314, which is${WORKDIR}/findings— a hardcoded path, not attacker-controlled. Thejq -n --arg b "$body"pattern inupsert_comment_body(entrypoint.sh:74) uses--argwhich binds the value as a JSON string with proper escaping. No shell metacharacter injection.JSON deserialization of model artifacts:
runConsolidate()(consensus.go:128-145) reads.jsonfiles from a directory only writable by the container's own processes. Go'sjson.Unmarshalinto a typed struct is safe — no code execution. TheMarkdownfield (raw LLM output) is embedded in the comment, but this is the same risk profile as the existing per-model comments.Secret leakage: The consensus comment embeds the same per-model markdown that would have been posted individually. No new secret exposure path. The
GITEA_TOKENis never placed into the prompt, diff, or any model-visible context.SSRF: No user-controlled input is used to construct outbound URLs.
GITEA_APIcomes from the workflow environment, controlled by the repo owner.Insecure defaults:
GADFLY_CONSOLIDATEdefaults toauto, which enables consolidation only for ≥2 models. The default model list is a single model, so consolidation is off by default. Safe.mdCellescaping (consensus.go:390-394): Only escapes pipes and newlines, not backticks. A model-generated title containing a backtick could break the markdown table cell's inline-code formatting. This is a rendering robustness concern, not a security vulnerability — the content is model-generated, not attacker-controlled, and Gitea's markdown renderer is the defense boundary.🎯 Correctness — Minor issues
I have read the actual code at
cmd/gadfly/consensus.go:266-293andmergeIntoClusterat lines 310-331. Let me trace the algorithm precisely with the draft's example.The code at line 273 checks:
c.lineis maintained as the smallest line in the cluster (line 319-320 inmergeIntoCluster:if f.Line > 0 && (c.line == 0 || f.Line < c.line) { c.line = f.Line }). Since items are processed in ascending line order (line 268),c.lineis always the first (minimum) line added to the cluster.With findings at lines 10, 13, 16 and
lineTolerance = 3:c.line = 10absInt(10 - 13) = 3 ≤ 3→ merges into C1,c.linestays 10absInt(10 - 16) = 6 > 3→ new cluster C2So 16 is split into its own cluster despite
|13−16| = 3 ≤ 3. The finding is confirmed against the actual code.Verdict: Minor issues
cmd/gadfly/consensus.go:273— Greedy clustering uses only the cluster's minimum line for distance checks, breaking transitive proximity. The algorithm checksabsInt(c.line - it.f.Line) <= lineTolerancewherec.lineis the smallest line in the cluster. Since findings are processed in ascending line order, a new finding can be within tolerance of the cluster's maximum member but outside tolerance of its minimum, causing it to be incorrectly split into a separate cluster. Example: findings at lines 10, 13, 16 in the same file. 10 and 13 merge (|10−13|=3 ≤ 3). 16 checks against the cluster's representative line 10: |10−16|=6 > 3, so 16 becomes its own cluster — even though |13−16|=3 ≤ 3. This deflates cross-model agreement counts (a 3-model consensus shows as 2+1), potentially burying real findings in the folded "single-model" section. The fix is to track the cluster's max line and checkit.f.Line - c.maxLine <= lineTolerance(since items are sorted, only the forward gap matters), or equivalently maintain a[min, max]span and check both ends.🧹 Code cleanliness & maintainability — Minor issues
Both findings are confirmed against the actual code:
cluster.detail: Written at lines 285 and 328, but never read anywhere —renderConsensususesc.title,c.severity,c.file,c.line,c.models,c.lensesbut neverc.detail. The per-model detail is preserved viam.Markdown.Duplicated
upsert_commentlogic:entrypoint.sh:72-90(upsert_comment_body) andscripts/run.sh:82-102(upsert_comment) share the same paginate-find-by-marker-PATCH/POST pattern, differing only in the marker parameterization and curl wrapper. A DRY violation.Verdict: Minor issues
cmd/gadfly/consensus.go:164— Dead fielddetailonclusterstruct. Thedetailfield is written in two places (line 285 during initial construction, line 328 inmergeIntoCluster) but is never read anywhere in the codebase — not inrenderConsensus, not in tests, not in any other file. The per-model detail is preserved viam.Markdownin the folded "Per-model detail" section, socluster.detailserves no purpose. It's dead code that misleads readers into thinking it's used for rendering. Remove the field and its two write sites.entrypoint.sh:72–90— Duplicatedupsert_commentlogic withscripts/run.sh:82–102. The newupsert_comment_bodyfunction inentrypoint.shis a near-identical copy ofrun.sh'supsert_comment, differing only in that the marker is a parameter instead of a hardcoded$MARKER. This is a copy-paste DRY violation introduced by this PR: any future fix to pagination, error handling, or the API call shape must be applied in two places. The two scripts already shareSCRIPTS_DIR; extracting the shared function into a sourced helper (e.g.,scripts/comment-utils.sh) would eliminate the duplication with low churn.⚡ Performance — No material issues found
Verdict: No material issues found
The consolidation path operates on trivially small datasets — single-digit models, tens of findings per model — and every operation is either O(n) or O(n²) on n < 100. All I/O is one-shot (read N small JSON files, write one comment). The
strings.Builderusage inrenderConsensusis appropriate. The shell fallback path spawns a fewjqprocesses but only on the cold error-recovery branch. Nothing here approaches a hot path or unbounded growth worth flagging.🧯 Error handling & edge cases — Minor issues
Verdict: Minor issues
cmd/gadfly/consensus.go:273— All line=0 findings in the same file merge into one cluster. InclusterFindings, findings withLine == 0(from structured blocks missing or zeroing the line field) all satisfyabsInt(0-0) <= lineTolerance, so every unlocated finding in a given file collapses into a single cluster regardless of whether they describe completely different issues. The severity escalates to the max among them and the title comes from whichever had the highest severity — distinct problems get silently fused. The heuristic scrape path (parseFindings) always produces a concrete line, so this only triggers on malformed structured blocks, but the code should handle it correctly (e.g. skip line=0 findings from proximity clustering, or give each its own cluster).entrypoint.sh:328— Fallback per-model marker hardcodesollamaprovider. In the consolidation-failure fallback path, the marker is constructed as<!-- gadfly-review:ollama:${m} -->. This matches the currentreview_onewhich always passesPROVIDER=ollamatorun.sh, so it is correct today. But theproviderfield in the JSON artifact comes frommodelProvider()(the Go binary's resolved backend), and ifreview_oneever passes a differentPROVIDER, the fallback markers would silently diverge from whatrun.shwould have used, causing duplicate comments on re-runs instead of in-place updates. Theprovvariable already extracted from JSON on line 326 should be used in the marker instead of the literalollama.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 12m 44s
🪰 Gadfly review —
devstral-2(ollama-cloud)Review incomplete — all lenses errored — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found
🎯 Correctness — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found
🧹 Code cleanliness & maintainability — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found
⚡ Performance — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found
🧯 Error handling & edge cases — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found
Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2s
🪰 Gadfly review —
glm-5.2:cloud(ollama-cloud)Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
VERDICT: No material issues found
Through the security lens I traced the untrusted-data flows this PR introduces and checked each against the actual code:
entrypoint.sh:231,238):FINDINGS_DIRis a fixed subpath ofWORKDIR(CI-controlled, not user input), created fresh withrm -rf/mkdir -p. Per-model filenames are sanitized viatr -c '[:alnum:]._-' '_', so model names with//..can't traverse. Verified atentrypoint.sh:238. Safe.consensus.go:104-130):runConsolidateonly reads*.jsonfrom the consolidate dir andjson.Unmarshals into a fixed struct; untrusted fields are only ever re-emitted as markdown text, never executed/evaluated. Safe.entrypoint.sh:69-90): the newupsert_comment_bodymirrors the existingrun.sh:82upsert (samestartswith($m)matching, same token-scoped PATCH/POST). The consensus marker is a fixed string. No new authz surface beyond what already exists. Verified by comparing both functions.consensus.go:239-241,mdCell): per-modelMarkdown/Title/Provider/Verdictare rendered into the comment.mdCellescapes pipes/newlines for table cells; the folded per-model body is inserted raw — but that is identical to the pre-existing behavior whererun.shpostsREVIEWverbatim, so this is not a regression introduced here.GADFLY_FINDINGS_OUT/GADFLY_CONSOLIDATE_DIR(consensus.go:62,83): only used to locate a local file/dir within the CI workspace; not reflected into shell/URLs. No injection vector.No authn/authz gaps, injection, SSRF, secret leakage, or unsafe-deserialization issues attributable to this change.
🎯 Correctness — No material issues found
The key finding in the draft is a false positive. In run.sh, the marker uses
${PROVIDER}, which entrypoint.sh always sets toollama(line 207). The fallback marker in entrypoint.sh:328 correctly mirrors this. Theprovvariable is the display provider (matchingMODEL_PROVIDER), used in the body header but NOT in the marker in either path.All other draft bullets are "verified correct" / non-issue observations, not actual findings.
Verdict: No material issues found
🧹 Code cleanliness & maintainability — Minor issues
All four findings confirmed against the actual source code.
Verdict: Minor issues found
cmd/gadfly/consensus.go:250-262— duplicated anonymous struct definition. ThebyFilemap value type (struct { f outFinding; model string }) is spelled out fully twice — once in themap[string][]struct{...}{}declaration (line 250) and again in the append literal (line 259). This is a noisy, readability-hostile pattern that Go's style discourages; a one-line named type (type findingRef struct { f outFinding; model string }) at file scope would collapse both spellings and read far cleaner.entrypoint.sh:72-90vsscripts/run.sh:82-102— copy-pastedupsert_comment/upsert_comment_body. The newupsert_comment_bodyis a near-verbatim clone of run.sh'supsert_comment, differing only in takingmarkeras a parameter (and dropping the$API_TIMEOUTindirection in favor of literal timeouts). The entrypoint comment even says "Mirrors run.sh's per-model upsert." Two copies of the same pagination/upsert loop now live in separate scripts with no shared source, so a fix to one (e.g. the pagination cap, error handling) won't propagate. A shared sourced helper would be the lower-churn long-term fix; at minimum a comment cross-referencing the two would flag the coupling.cmd/gadfly/consensus.go:164—cluster.detailis written but never read.mergeIntoClustermaintainsc.detail(the highest-severity report's detail, line 328), andclusterFindingsseeds it, butrenderConsensusnever emitsdetailanywhere — onlytitle,file,line,models,lenses. Grep confirms the only references are the field declaration and assignment sites. The field is dead storage carried through every cluster, adding maintenance surface (a reader assumes it surfaces somewhere). Either render it (e.g. in a folded<details>per headline row) or drop it.cmd/gadfly/consensus.go:200-205—agreedtally re-iteratesclusterswhen it could be folded into the partition above.clusterFindingsis called once (line 178), the headline/folded partition loop (lines 189-196) already walks every cluster, thenagreedis tallied by re-iteratingclusters(lines 200-205). Functionally fine, butagreedcould be counted in the same pass as the partition instead of a second full loop. Minor readability/efficiency nit, not material.⚡ Performance — No material issues found
Verdict: No material issues found
Reviewed the new consolidation path through the performance lens:
clusterFindings(consensus.go:248) is O(F²) in total findings (each item scans all existing clusters for a line-proximity match). F here isfindings × modelsfor a single PR review — tens to low hundreds at most — so the quadratic scan is immaterial. Verified the inner loop atconsensus.go:269-292.renderConsensus(consensus.go:176) builds the whole comment in onestrings.Builderwith a single pass over clusters and models — no repeated allocation, no quadratic string concatenation. Fine.runConsolidate(consensus.go:96) reads every.jsonin the dir once (os.ReadDir+ oneReadFileeach), bounded by the number of models. No unbounded growth.upsert_comment_body(entrypoint.sh:72) paginates up to 10 pages × 50 comments — bounded, and mirrors the pre-existingupsert_commentinrun.sh:82. The fallback path spawns 3jqsubprocesses per model file (entrypoint.sh:324-327), but that path only runs on consolidation failure with a handful of files, so not material.writeFindingsOut(consensus.go:80) does onejson.Marshal+ oneos.WriteFileper model run — trivial.No N+1, no unbounded growth, no hot-loop regressions. The work is all on bounded, small inputs (one PR's findings × a handful of models).
🧯 Error handling & edge cases — Minor issues
Verdict: Minor issues found
Greedy clustering can under-count agreement due to representative-line drift (
cmd/gadfly/consensus.go:271-277).clusterFindingssorts each file's items ascending by line, then greedily merges each item into the first cluster whose representative line (c.line) is withinlineTolerance(3).mergeIntoClusterkeepsc.lineat the minimum line (line 319:f.Line < c.line). So when findings span a window wider than the tolerance but chained within it, a later finding is compared only to the smallest line, not the most recently merged member. Example: lines 1, 4, 7 — 1 and 4 merge (|1−4|=3 ≤ 3, rep stays 1); then 7 is compared to rep 1 (|1−7|=6 > 3) and spawns a new cluster, even though it's only 3 lines from line 4 which is already in the first cluster. A single real issue cited across a ~7-line span gets split into two clusters, undercounting cross-model agreement and potentially demoting a headline finding to the folded section. Suggested fix: compare against the cluster's largest line (e.g. trackmaxLineper cluster and testit.f.Line - c.maxLine <= lineTolerance), or compare against the range. Verified againstclusterFindings(266–293) andmergeIntoCluster(312–331).writeFindingsOutsilently no-ops if the output directory is missing (cmd/gadfly/consensus.go:109-111).os.WriteFilefails if the parent directory doesn't exist; the error is logged to stderr and swallowed (best-effort by design). In the normal entrypoint flow this is fine —FINDINGS_DIRismkdir -p'd before the swarm runs (entrypoint.sh:240) — but ifGADFLY_FINDINGS_OUTis ever set to a path in a non-existent directory (e.g. misconfigured standalone run), that model silently contributes nothing to consolidation, and the only signal is a stderr line that may not be surfaced. Acceptable for best-effort, but the failure mode is "model vanishes from consensus" with no structured fallback. Verified against the function (80–112) and the entrypoint mkdir (240).Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 37s
🪰 Gadfly review —
kimi-k2.6:cloud(ollama-cloud)Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
Verdict: No material issues found
Reviewed the cross-model consensus consolidation changes through the 🔒 Security lens. The new code paths (
consensus.go,entrypoint.sh,run.sh) handle per-model JSON artifacts in a shared temp directory and render a single aggregated markdown comment.Key areas examined:
findings_file_for()inentrypoint.shsanitizes model names withtr -c '[:alnum:]._-' '_', preventing directory traversal via model names. Go'sos.ReadDirguaranteese.Name()is a basename, sofilepath.Join(dir, e.Name())is safe.json.Unmarshalinto a typed struct; no arbitrary code execution risk.mdCell()escapes pipes and newlines for table cells. Model-generated content (title,detail,markdown) is inserted into the consensus comment, but these are the same model outputs that were already posted directly in per-model comments. No new trust boundary is introduced.GADFLY_FINDINGS_OUTandGADFLY_CONSOLIDATE_DIRare set by the trustedentrypoint.shorchestrator; model subprocesses cannot influence them.modelFindingsstruct contains only review metadata and model output — no API keys or credentials.The aggregation logic does not create novel injection or escalation vectors. All model output remains within the existing trust model (review content generated by the model swarm itself).
🎯 Correctness — Blocking issues found
Blocking issues found
cmd/gadfly/consensus.go:88andcmd/gadfly/consolidate.go:82-93: When all lenses for a model error (timeout, panic, etc.),worstVerdictreturnsverdictCleanbecauseverdictUnknown(0) <verdictClean(1). The per-model artifact therefore storesVerdict: "No material issues found"even thoughrenderConsolidatedcorrectly renders "Review incomplete — all lenses errored" in the Markdown body. The consensus pass (renderConsensus) reads only the JSONVerdictfield and parses it back toverdictClean. If all models fail, the final comment reads "No material issues found by consensus" with zero findings — a completely misleading result that masks total system failure. Verified by readingworstVerdict(consolidate.go),writeFindingsOut(consensus.go:88), andrenderConsensus(consensus.go:180-207).Suggested fix: compute the model-level verdict in
writeFindingsOutusing the same logic asrenderConsolidated(checking iferrored == len(results)), or add anErrored/Incompletebool tomodelFindingsand propagate it to the consensus renderer so the headline can say "Review incomplete" instead of "No material issues found".🧹 Code cleanliness & maintainability — Minor issues
Verdict: Minor issues
cmd/gadfly/consensus_test.go:134—captureStdoutleaksos.Stdoutand pipe on test failure.The helper redirects
os.Stdoutto a pipe, but iffn()callst.Fatalf(or panics), the code afterfn()never runs:os.Stdoutstays redirected andw.Close()is never called, so the reader goroutine may hang. Verified by reading the test file.Fix: Use
deferto restoreos.Stdoutand close the writer immediately after redirection:cmd/gadfly/consensus.go:381—oneModelreturns a random model.Go randomizes map iteration order, so the model name displayed for a single-model folded finding is non-deterministic across runs. Re-running the same swarm can produce a different consensus comment. Verified by reading
consensus.go.Fix: Sort the map keys and return the first:
entrypoint.sh:72-90/scripts/run.sh:82-101— Comment upsert logic is copy-pasted.upsert_comment_bodyinentrypoint.shduplicates the pagination-and-upsert logic already inrun.sh'supsert_comment(the PR even notes this: "Mirrors run.sh's per-model upsert"). Any future fix to pagination limits, auth handling, or marker matching must be kept in sync manually. Verified by reading both files.Fix: Extract a shared
lib/upsert.shsourced by both scripts, or add a prominent comment in both places warning that they must stay in sync.⚡ Performance — Blocking issues found
Blocking issues found
cmd/gadfly/consensus.go:271— Quadratic clustering scans clusters from unrelated files.In
clusterFindings, theclustersslice is global across all files, yet the merge testc.file == filemeans only clusters for the current file ever match. Every finding from file N still scans every cluster created for files 1…N−1. With 7 models × 5 lenses × up to 20 findings each = 700 findings, the inner loop can execute ~490 k times instead of ~10 k if scoped per-file. The fix is to declareclustersinside thefor file, items := range byFileloop so each file clusters independently.cmd/gadfly/consensus.go:200-204— Redundant pass to count multi-model agreement.renderConsensusloops overclustersa second time to countagreed(line 200) after the first loop (line 190) already partitions every cluster intoheadline/folded. Countagreedin the first loop to avoid the extra O(n) traversal.cmd/gadfly/consensus.go:295-301— Repeated string allocations in sort comparator.sort.SliceStablecalls the closure many times; each call invokessevRank, which allocates viastrings.ToLower(strings.TrimSpace(s)). Pre-computing a(agreement, sevRank, file, line)sort key per cluster before sorting eliminates the repeated allocation and makes the comparator a pure integer comparison.🧯 Error handling & edge cases — No material issues found
No material issues found
Wait, let me re-evaluate. I found several real issues. Let me reconsider the severity:
captureStdoutpanic/deadlock - This is a test helper bug. It's a real unhandled edge case.mergeIntoClusternegative line sticky - Real edge case with malformed input.mdCell\rhandling - Minor edge case.run.shoutput loss on standalone consolidation - Edge case in standalone usage.These are all error-handling/edge-case issues. The question is whether any are "material" enough to rate "Minor issues" or "Blocking issues found."
I'll rate it "Minor issues" because:
captureStdoutpanic/deadlock is a real test bug that could hang CI.mergeIntoClusternegative line issue is a real edge case that breaks clustering for malformed input.The others are smaller nits. Let me provide a concise review.
Actually, re-reading my instructions: "Only report issues you are reasonably confident are real after checking." I'm confident about all of these.
Let me also consider: is the
captureStdoutissue material enough? It's in test code. If a test panics, the test framework already handles panics. But the additional deadlock from the helper makes it worse. The panic would propagate, but the goroutine leak and stdout redirection could affect subsequent tests or the test runner. Ingo test, each test runs in the same process, so leaking stdout redirection could affect other tests. This is material.For
mergeIntoCluster, if a model outputs"line": -1in its structured findings, the cluster gets stuck. Is this likely? No, but it's an unhandled edge case in the data pipeline.I'll report both as medium severity.
Let me also report the
mdCell\rissue as small.The entrypoint marker issue and run.sh output loss are more like usage edge cases. I'll report the marker issue since it's in the new fallback code.
Actually, let me reconsider the entrypoint marker issue. Looking at the code:
This is used when posting fallback per-model comments. The original
run.shuses:In consolidation mode,
run.shdoesn't post comments. In the fallback,entrypoint.shconstructs the marker. If a model was run with provider "antigravity", the fallback marker would still say "ollama". This means:This is a real bug in the fallback path. It's error-handling related because it's the error fallback path.
OK, I'll include it.
And for
run.shstandalone output loss: whenCONSOLIDATE=1andGADFLY_FINDINGS_OUTis empty, the model output is lost. This is an unhandled edge case for standalone usage. I'll include it as a smaller finding.Let me now write the review. Minor issues
cmd/gadfly/consensus_test.go:134—captureStdoutleaksos.Stdoutand deadlocks on panic.The helper redirects
os.Stdoutto a pipe and spawns a goroutine that reads until EOF. Iffn()panics,w.Close()is never called, the goroutine blocks forever onr.Read, and<-donedeadlocks.os.Stdoutalso stays redirected, poisoning any subsequent test output in the same process.Fix: Add a
deferbeforefn()that closes the writer and restoresos.Stdout, so the cleanup runs during panic unwinding.cmd/gadfly/consensus.go:319— Negative line numbers become sticky cluster representatives.mergeIntoClusteronly updates the representative line whenc.line == 0 || f.Line < c.line. If the first finding in a cluster has a negative line (e.g. from malformed model JSON),c.linestays negative permanently, because a later positive line fails bothc.line == 0andf.Line < c.line. This causeslocation()to drop the line entirely and breaks distance checks for all subsequent findings in that cluster.Fix: Change the guard to
c.line <= 0 || f.Line < c.lineso any positive line displaces a non-positive representative.entrypoint.sh:328— Fallback per-model marker hardcodes provider asollama.When consolidation fails and
entrypoint.shfalls back to posting individual comments, it builds the upsert marker as<!-- gadfly-review:ollama:${m} -->, ignoring the actual provider stored in the artifact. If the same model name was used with a different provider (e.g. antigravity), fallback comments collide. If a prior non-consolidated comment exists with the correct provider marker, the fallback creates a duplicate instead of updating it.Fix: Use the
provvariable already extracted from the JSON:marker="<!-- gadfly-review:${prov:-ollama}:${m} -->".cmd/gadfly/consensus.go:390—mdCellmisses carriage returns.The function collapses
\nto spaces but leaves\rintact. A model output that contains Windows-style\r\nline endings will inject raw\rcharacters into markdown table cells, which can break rendering.Fix: Also replace
\rwith a space before the\nreplacement:strings.ReplaceAll(s, "\r", " ").**`scripts/run
Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 8m 40s
🪰 Gadfly review —
qwen3.5:397b-cloud(ollama-cloud)Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — Blocking issues found
Blocking issues found
cmd/gadfly/consensus.go:239-240— XSS via unsanitized model output in consensus comment. Them.Markdownfield (raw model output, stored in the per-model JSON artifact) is embedded directly into an HTML<details>element without any HTML/Markdown sanitization. A model whose output is manipulated via prompt injection to include malicious HTML/JavaScript could execute arbitrary scripts in the context of any user viewing the PR. ThemdCellhelper (lines 390-394) only escapes pipes and newlines — it does not escape<,>,&, quotes, or backticks. Fix: Usehtml/templateor a sanitization library (e.g.,bluemonday) to sanitize all untrusted content before embedding in HTML/Markdown. (Verified: read consensus.go lines 230-244 and 388-394, confirmed no sanitization imports or calls)cmd/gadfly/consensus.go:212-214— XSS in headline table cells. Thec.title,c.file,c.line, and lens list are all rendered viamdCellwhich does not escape HTML entities. An attacker-controlled finding title or file path could inject HTML. Fix: ExtendmdCellto escape<,>,&,",'usinghtml.EscapeStringbefore Markdown escaping. (Verified: read mdCell implementation at lines 390-394)entrypoint.sh:327-330— XSS in fallback per-model comments. When consolidation fails, the fallback loop reads.markdownfrom each JSON artifact and inserts it directly into the comment body viaprintfwith no sanitization. Same attack vector as above. Fix: Sanitize themdvariable before embedding, or have the binary render the fallback output with proper escaping. (Verified: read entrypoint.sh lines 319-334)cmd/gadfly/consensus.go:86-89— Trust boundary violation:GADFLY_MODELandmodelProvider()used without validation. The model name (from environment) and provider string are stored in the JSON artifact and later rendered in the consensus comment. If an attacker can influenceGADFLY_MODEL(e.g., via workflow input), they could inject malicious content. Fix: Validate/sanitize environment inputs before storing in artifacts. (Verified: read writeFindingsOut at lines 80-112 and modelProvider at model.go:146-155)🎯 Correctness — Blocking issues found
Blocking issues found
Verified by reading
cmd/gadfly/consensus.golines 157-171, 266-293, and 310-321.cmd/gadfly/consensus.go:266-293— Clustering algorithm fails to handle transitive proximity, splitting findings that should be one clusterThe greedy clustering algorithm checks each new finding against only the representative line (
c.line, the minimum) of existing clusters at line 273:absInt(c.line-it.f.Line) <= lineTolerance. This breaks transitivity.Concrete example: Findings at lines 10, 13, and 16 in the same file (with
lineTolerance = 3):c.line = 10absInt(10-13) = 3 ≤ 3→ merges into cluster (keepsc.line = 10)absInt(10-16) = 6 > 3→ creates a new clusterResult: Two clusters
{10,13}and{16}instead of one{10,13,16}. Even though 13 and 16 are within tolerance (|13-16|=3), they never get compared because the algorithm only checks against the minimum line.Impact: Cross-model agreement counts are underreported when models cite lines that form a transitive chain but exceed tolerance from the cluster's minimum. A finding flagged by 3 models could appear as 2+1, weakening the consensus signal.
Fix: Track both
minLineandmaxLineper cluster, and check proximity to the range:🧹 Code cleanliness & maintainability — Minor issues
Verdict: Minor issues found
cmd/gadfly/consensus.go:176—renderConsensusis too long (68 lines) with multiple responsibilitiesThis function handles: partitioning findings, building the headline table, rendering folded sections, AND rendering per-model drill-down. Each section is commented, but splitting into
buildHeadlineTable,buildFoldedSection, andbuildPerModelDetailwould improve readability and testability. Verified by reading lines 176-244.cmd/gadfly/consensus.go:248—clusterFindingshas deeply nested loop logicLines 267-293 contain a triple-nested structure (range files → range items → range clusters) with a
placedflag controlling flow. This greedy clustering algorithm is hard to follow at a glance. Extracting the inner "find or create cluster" logic into a helper likefindOrCreateClusterwould clarify intent. Verified by reading lines 248-308.cmd/gadfly/consensus.go:390—mdCelldoes not escape backticksThe function escapes pipes and collapses newlines for markdown table safety, but backticks (common in code references like
auth/login.go:42) are not escaped. A title containing`code`would break table rendering. This is a maintainability debt—the next person editing table output may not realize the escaping is incomplete. Verified by reading lines 388-394.cmd/gadfly/consensus.go:396-407— Trivial helper functions add noiseplural(5 lines) andabsInt(7 lines) are trivial one-liners that could be inlined at call sites or replaced with standard patterns (if n != 1 { s += "s" }andif n < 0 { n = -n }). They're not wrong, but they increase the surface area of the file without meaningful abstraction benefit. Verified by reading lines 396-407.Naming confusion:
renderConsolidated(consolidate.go) vsrenderConsensus(consensus.go)These names are easily confused.
renderConsolidatedmerges lenses within ONE model;renderConsensusmerges findings across MULTIPLE models. The distinction matters but isn't clear from the names alone. Renaming torenderSingleModelReviewandrenderCrossModelConsensuswould be self-documenting. Verified by comparingcmd/gadfly/consolidate.go:98andcmd/gadfly/consensus.go:176.⚡ Performance — No material issues found
No material issues found
After independent verification against the actual code:
Finding 1 (Quadratic clustering): The
clusterFindingsfunction at lines 266-293 does iterate through existing clusters for each finding. However, the draft's analysis misses that findings are first grouped by file (line 250-264), so the inner loop at line 271 only scans clusters that could match. Within a file, items are sorted by line (line 268), and the line tolerance (3 lines) means clusters are geographically constrained. While worst-case theoretical complexity exists, this is a standard greedy clustering pattern and not a blocking issue for the expected input sizes.Finding 2 (lensList allocations): The
lensListfunction at line 372 does allocate a slice, but it usesmake([]string, 0, len(lenses))with pre-allocated capacity based on the map size. The number of lenses per cluster is typically very small (1-5 distinct lens names). This is standard Go idiom and does not create meaningful GC pressure.Finding 3 (Sequential fallback): The fallback loop at lines 322-332 does post sequentially. However, this is explicitly a fallback path (line 320: "consolidation produced no output; falling back") that only executes when the primary consolidation fails. Sequential behavior in a rare fallback is an acceptable trade-off for simplicity and correctness.
All three findings describe implementation choices rather than actual bugs. None prevent the feature from working correctly.
🧯 Error handling & edge cases — Blocking issues found
Blocking issues found
After reviewing the changes through the error handling & edge cases lens and verifying against the checked-out code:
cmd/gadfly/consensus.go:283— Cluster initialized with potentially empty severity: When a new cluster is created,severityis set directly fromit.f.Severity. For heuristic-scraped findings (peremit.go:64), severity may be empty. WhilemergeIntoClusterdoes escalate severity if a higher-ranked one exists, if ALL findings in a cluster have empty severity, the cluster remains with empty severity. This propagates tosevRank(returns 0) andsevIcon(returns "⚪"), which is handled but the initial assignment at line 283 should defensively handle empty severity. Verified by readingclusterFindingsandmergeIntoCluster. Suggested fix: Initialize withit.f.Severityonly if non-empty, or callmergeIntoClusterlogic inline to apply severity escalation on creation.cmd/gadfly/consensus.go:80-111—writeFindingsOutassumes parent directory exists: The function callsos.WriteFile(path, ...)without ensuring parent directories exist. IfGADFLY_FINDINGS_OUTis set to a path like/tmp/findings/model1.jsonand/tmp/findings/doesn't exist, the write fails silently (error logged to stderr). Whileentrypoint.shcreatesFINDINGS_DIRbefore invoking the binary, this function has no defensiveos.MkdirAlland would fail if called in isolation or with an unexpected path. Verified by readingwriteFindingsOutandentrypoint.sh:238. Suggested fix: Addos.MkdirAll(filepath.Dir(path), 0o755)beforeWriteFile.cmd/gadfly/consensus.go:216-218— Misleading "No material issues found" when folded findings exist: The "No material issues found by consensus" message is shown wheneverlen(headline) == 0, even iflen(folded) > 0. This could mislead readers into thinking there are no findings at all when low-severity single-model findings exist in the folded section. Verified by readingrenderConsensus. Suggested fix: Change the message to "No headline findings" or only show this message when bothheadlineandfoldedare empty.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 11m 10s