feat: cross-model consensus consolidation (one ranked comment, not N walls) #17

Merged
steve merged 2 commits from feat/consensus-consolidation into main 2026-06-28 22:56:15 +00:00
Owner

What

Phase 2 of the review-representation work. With ≥2 models, the swarm posts one consensus comment instead of a per-model comment each.

Every model writes its structured findings to a shared dir; after the swarm finishes, a consolidation pass clusters findings by location (±3 lines), counts how many models independently flagged each, escalates to the highest reported severity, and renders an agreement-ranked table:

## 🪰 Gadfly review — consensus across 7 models
**Verdict: Blocking issues found** · 9 findings (3 with multi-model agreement)

| | Finding | Where | Models | Lens |
|--|--|--|--|--|
| 🔴 | Auth bypass: token not verified | `auth/login.go:42` | 6/7 | security |

<details><summary>4 single-model findings (lower confidence)</summary> … </details>
<details><summary>Per-model detail</summary> … each model's full review, folded … </details>

Why

Cross-model agreement is the strongest real-vs-false-positive signal we have (the PR #16 review showed every model re-finding the same issue independently). Ranking by it surfaces the findings that matter and folds the lone nits — directly attacking both the representation noise and the false-positive rate.

How

  • consensus.go: modelFindings artifact, collectFindings (shared with telemetry emit — one definition of "what a finding is"), writeFindingsOut, runConsolidate, clustering + renderConsensus. Lone low-severity findings fold away; a lone critical still surfaces; each model's full review is preserved folded inside the one comment.
  • main.go: GADFLY_CONSOLIDATE_DIR dispatches to consolidation mode; writeFindingsOut after a normal review.
  • run.sh: when consolidating, write findings + skip the per-model comment (live progress still on the status board).
  • entrypoint.sh: auto-enable for ≥2 models; run the consolidator after the barrier and upsert ONE comment; fall back to per-model comments if consolidation yields nothing — advisory invariant preserved, output is never lost.
  • README + reusable workflow (consolidate input) + entrypoint docs updated.

Controlled by GADFLY_CONSOLIDATE: auto (default), 1, 0. Single-model runs unchanged.

Tests

Clustering/agreement/tolerance, severity escalation, nit folding, lone-critical surfacing, write→consolidate round-trip, empty-dir error. bash -n clean on both scripts; gofmt/vet clean.

🤖 Generated with Claude Code

## What Phase 2 of the review-representation work. With **≥2 models**, the swarm posts **one consensus comment** instead of a per-model comment each. Every model writes its structured findings to a shared dir; after the swarm finishes, a consolidation pass clusters findings by location (±3 lines), counts **how many models independently flagged each**, escalates to the highest reported severity, and renders an **agreement-ranked** table: ``` ## 🪰 Gadfly review — consensus across 7 models **Verdict: Blocking issues found** · 9 findings (3 with multi-model agreement) | | Finding | Where | Models | Lens | |--|--|--|--|--| | 🔴 | Auth bypass: token not verified | `auth/login.go:42` | 6/7 | security | <details><summary>4 single-model findings (lower confidence)</summary> … </details> <details><summary>Per-model detail</summary> … each model's full review, folded … </details> ``` ## Why Cross-model agreement is the strongest real-vs-false-positive signal we have (the [PR #16 review](../pulls/16) showed every model re-finding the same issue independently). Ranking by it surfaces the findings that matter and folds the lone nits — directly attacking both the representation noise *and* the false-positive rate. ## How - **`consensus.go`**: `modelFindings` artifact, `collectFindings` (shared with telemetry emit — one definition of "what a finding is"), `writeFindingsOut`, `runConsolidate`, clustering + `renderConsensus`. Lone low-severity findings fold away; a lone *critical* still surfaces; each model's full review is preserved folded inside the one comment. - **`main.go`**: `GADFLY_CONSOLIDATE_DIR` dispatches to consolidation mode; `writeFindingsOut` after a normal review. - **`run.sh`**: when consolidating, write findings + skip the per-model comment (live progress still on the status board). - **`entrypoint.sh`**: auto-enable for ≥2 models; run the consolidator after the barrier and upsert ONE comment; **fall back to per-model comments if consolidation yields nothing** — advisory invariant preserved, output is never lost. - README + reusable workflow (`consolidate` input) + entrypoint docs updated. Controlled by `GADFLY_CONSOLIDATE`: `auto` (default), `1`, `0`. Single-model runs unchanged. ## Tests Clustering/agreement/tolerance, severity escalation, nit folding, lone-critical surfacing, write→consolidate round-trip, empty-dir error. `bash -n` clean on both scripts; gofmt/vet clean. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
steve added 1 commit 2026-06-28 22:35:38 +00:00
feat: cross-model consensus consolidation (one ranked comment, not N walls)
Build & push image / build-and-push (pull_request) Successful in 4s
Adversarial Review (Gadfly) / review (pull_request) Successful in 16m16s
052833d830
With >=2 models the swarm now posts ONE consensus comment instead of a
per-model comment each. Every model writes its structured findings to a shared
dir (GADFLY_FINDINGS_OUT); after the swarm finishes, a consolidation pass
(the binary in GADFLY_CONSOLIDATE_DIR mode) clusters findings by location
(±3 lines), counts how many models independently flagged each, escalates to the
highest reported severity, and renders an agreement-ranked table — cross-model
agreement being the strongest real-vs-false-positive signal we have.

- consensus.go: modelFindings artifact, collectFindings (shared with emit),
  writeFindingsOut, runConsolidate, clustering + renderConsensus. Lone
  low-severity findings fold away; a lone critical still surfaces; each model's
  full review is preserved folded inside the one comment.
- main.go: GADFLY_CONSOLIDATE_DIR dispatches to consolidation; writeFindingsOut
  after a normal review.
- emit.go: reuse collectFindings (one definition of "what a finding is").
- run.sh: when GADFLY_CONSOLIDATE=1, write findings + skip the per-model comment
  (live progress still on the status board).
- entrypoint.sh: auto-enable for >=2 models; run the consolidator after the
  barrier and upsert ONE consensus comment; FALL BACK to per-model comments if
  consolidation yields nothing (advisory invariant: never lose output).
- README + reusable workflow (`consolidate` input) + entrypoint docs updated.

Tests: clustering/agreement/tolerance, severity escalation, nit folding, lone
critical, write→consolidate round-trip, empty-dir error. gofmt/vet clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

🪰 Gadfly — live review status

7/7 reviewers finished · updated 2026-06-28 22:51:38Z

claude-code/opus · claude-code — done

  • security — No material issues found
  • correctness — Minor issues
  • maintainability — Minor issues
  • performance — No material issues found
  • error-handling — Minor issues

claude-code/opus:max · claude-code — done

  • security — Minor issues
  • correctness — Minor issues
  • maintainability — Minor issues
  • performance — No material issues found
  • error-handling — Minor issues

deepseek-v4-pro:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — Minor issues
  • maintainability — Minor issues
  • performance — No material issues found
  • error-handling — Minor issues

devstral-2 · ollama-cloud — done

  • ⚠️ security — could not complete
  • ⚠️ correctness — could not complete
  • ⚠️ maintainability — could not complete
  • ⚠️ performance — could not complete
  • ⚠️ error-handling — could not complete

glm-5.2:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — No material issues found
  • maintainability — Minor issues
  • performance — No material issues found
  • error-handling — Minor issues

kimi-k2.6:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — Blocking issues found
  • maintainability — Minor issues
  • performance — Blocking issues found
  • error-handling — No material issues found

qwen3.5:397b-cloud · ollama-cloud — done

  • security — Blocking issues found
  • correctness — Blocking issues found
  • maintainability — Minor issues
  • performance — No material issues found
  • error-handling — Blocking issues found

Live status board. Findings are posted in each model's own comment. Advisory only — does not block merge.

<!-- gadfly-status-board --> ## 🪰 Gadfly — live review status 7/7 reviewers finished · updated 2026-06-28 22:51:38Z #### `claude-code/opus` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `claude-code/opus:max` · claude-code — ✅ done - ✅ **security** — Minor issues - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `deepseek-v4-pro:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `devstral-2` · ollama-cloud — ✅ done - ⚠️ **security** — could not complete - ⚠️ **correctness** — could not complete - ⚠️ **maintainability** — could not complete - ⚠️ **performance** — could not complete - ⚠️ **error-handling** — could not complete #### `glm-5.2:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `kimi-k2.6:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Blocking issues found - ✅ **maintainability** — Minor issues - ✅ **performance** — Blocking issues found - ✅ **error-handling** — No material issues found #### `qwen3.5:397b-cloud` · ollama-cloud — ✅ done - ✅ **security** — Blocking issues found - ✅ **correctness** — Blocking issues found - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Blocking issues found <sub>Live status board. Findings are posted in each model's own comment. Advisory only — does not block merge.</sub>

🪰 Gadfly review — claude-code/opus (claude-code)

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

VERDICT: No material issues found

I traced the tainted-data path this PR introduces — PR diff → model review → structured findings → per-model JSON artifact → consolidation → single PR comment — and found no new security-relevant sink.

What I checked (by reading the actual code, not just the diff):

  • JSON/command injection into the Gitea API — every comment payload is built with jq -n --arg b "$body" '{body:$b}' in upsert_comment_body (entrypoint.sh:71) and the fallback loop reads model fields with jq -r '.markdown // ""' then re-encodes via the same --arg path. Attacker-influenced model text (titles, detail, full markdown) is never interpolated into a JSON string or a shell command unquoted. Safe.
  • Path traversal on the findings artifactfindings_file_for (entrypoint.sh:209) sanitizes the model id with tr -c '[:alnum:]._-' '_', identical to the pre-existing status_file_for. / is collapsed, so a ../-style model id can't escape FINDINGS_DIR. GADFLY_FINDINGS_OUT/GADFLY_CONSOLIDATE_DIR are operator-set (entrypoint), not PR-controlled. runConsolidate only reads *.json under the given dir and json.Unmarshals them. Safe.
  • Table-cell breakoutmdCell (consensus.go) escapes | and collapses newlines before findings titles/lenses go into the markdown table, so a crafted finding title can't forge extra table columns/rows in the headline.
  • Secret/credential leakagemodelFindings carries only model/provider/verdict/markdown/findings; no keys or tokens are serialized to the artifact files (mode 0o644, inside the container workdir). writeFindingsOut no-ops when GADFLY_FINDINGS_OUT is empty (consensus.go:78).
  • Authz/trigger surface — unchanged; the consolidation step runs after the existing gating, adds no new entry point.

One pre-existing, out-of-scope note (not introduced here, not counted as a finding): the per-model full review is embedded raw inside <details> in both the consensus body and the fallback comment, so model output containing </details> could break out of the fold and inject top-level markdown. This is identical to the existing per-model path, which already posts $REVIEW raw, so this PR does not widen exposure — and Gitea sanitizes rendered HTML. Worth a future hardening pass, but no regression here.

🎯 Correctness — Minor issues

VERDICT: Minor issues

I reviewed the consensus consolidation logic through the Correctness lens, verifying cross-file symbols against the checked-out code (consolidate.go, emit.go, findings.go, main.go, run.sh, entrypoint.sh). The clustering math, severity escalation, tolerance matching, lone-critical surfacing, and the auto/1/0 resolution flow (GADFLY_CONSOLIDATE empty→auto via :-, then numeric CONSOLIDATE passed to run.sh) all check out and match the tests. The collectFindings refactor in emit.go is behavior-preserving. A few real issues:

  • cmd/gadfly/consensus.go:88 — an all-errored model is recorded as a clean participant, polluting the consensus denominator. writeFindingsOut stores Verdict: worstVerdict(results).label(). I verified worstVerdict (consolidate.go:82) ignores r.errored and only takes the max of r.verdict, starting from verdictClean — so a model whose every lens errored (e.g. all timed out) yields verdictClean → label "No material issues found", and collectFindings returns nothing for errored lenses, so it writes a file with zero findings. runConsolidate then counts it in total (renderConsensus:177) and folds it into worst. Net effect: a model that produced no actual review appears as a clean reviewer, and a real finding from the one working model in a 2-model run shows 1/2 (looks like weak agreement) instead of 1/1. Note renderConsolidated itself already distinguishes this case ("Review incomplete — all lenses errored", consolidate.go:107), so the stored Verdict field is inconsistent with the model's own rendered markdown. Fix: when every lens errored, either skip writeFindingsOut for that model or stamp a non-clean/"incomplete" verdict and exclude such artifacts from total. (Verified by reading worstVerdict, collectFindings, and the total/worst use in renderConsensus.)

  • cmd/gadfly/consensus.go:273 — tolerance window is anchored at the cluster's minimum line, so it doesn't slide; chains of nearby findings under-merge. The match test is absInt(c.line - it.f.Line) <= lineTolerance and c.line is pinned to the smallest line in the cluster (mergeIntoCluster:319 only ever lowers it). Items are line-sorted, so findings at e.g. lines 1, 4, 7 produce two clusters (1+4 merge; 7 is 6 from the anchor and splits) even though 4 and 7 are within tolerance of each other. Since cross-model agreement is the headline signal the whole feature ranks on, this can undercount agreement for the same issue cited a few lines apart. Low severity / partly a design tradeoff (the ±3 covers the common "line or two apart" case), but flagging since it directly affects the agreement count. (Verified by tracing clusterFindings + mergeIntoCluster.)

  • entrypoint.sh:317 / scripts/run.sh:50 — switching consolidation mode on/off (or model count crossing 2) on the same PR orphans the other style's comment. The consensus comment uses marker <!-- gadfly-consensus --> while per-model comments use <!-- gadfly-review:ollama:<model> -->; neither path deletes the other. A PR first reviewed with 1 model (per-model comment) then re-run with ≥2 models (consensus) leaves the stale per-model comment behind, and vice-versa. Low severity, but produces duplicate/contradictory comments. Suggest deleting the opposite-marker comment(s) when posting. (Verified markers in both scripts; confirmed no cleanup of the other marker.)

I could not run go build (private majordomo module is gated in this sandbox), but I confirmed there are no duplicate definitions of the new helpers (absInt/plural/sevRank/etc.) and all referenced symbols (parseVerdict, verdictClean, renderConsolidated, extractStructuredFindingsOrScrape, modelProvider, finding fields) exist with matching signatures, so the new file should compile.

🧹 Code cleanliness & maintainability — Minor issues

I've verified all three findings against cmd/gadfly/consensus.go. Findings 1 and 3 are confirmed exactly as described. Finding 2's substance is confirmed (the anonymous struct type is duplicated and a named type would help), but its claimed count is wrong — the type is spelled out twice (map value type at :250, append element type at :259), not three times. I've corrected that detail and kept the finding.

VERDICT: Minor issues

  • cmd/gadfly/consensus.go:164cluster.detail is write-only dead state. The field is declared (:164), populated when a cluster is created (:285) and on severity escalation in mergeIntoCluster (:328), with outFinding.Detail plumbed all the way into it — but it is never read or rendered. The consensus table emits only title, severity, location, model count, and lenses; the per-model drill-down uses m.Markdown, not findings. Confirmed by grepping every detail/Detail site: the only reads are on outFinding.Detail / finding.detail, never on cluster.detail. Suggested fix: drop cluster.detail and its two assignments (and optionally the outFinding.Detail plumbing if the artifact has no other consumer), or actually surface it if that was the intent. Severity: trivial.

  • cmd/gadfly/consensus.go:250-262 — the anonymous struct type struct { f outFinding; model string } is spelled out twice (map value type at :250, composite-literal element type in the append at :259). A one-line named type (e.g. type fileFinding struct { f outFinding; model string }) declared once would make the grouping self-documenting and the append a clean composite literal, removing the duplicate spelling that must be kept in sync by hand. Severity: trivial.

  • cmd/gadfly/consensus.go:200-205 — the multi-model agreed count re-loops over clusters immediately after the partition loop at :189-196 already walked the same slice and tested len(c.models) >= 2. Minor redundancy; the count could be accumulated in the partition loop. Severity: trivial.

Otherwise the new file follows the surrounding conventions well (doc comments per function, stdlib-only, helpers factored sensibly, and the collectFindings extraction genuinely de-duplicates emit.go). Nothing blocking.

Performance — No material issues found

I've reviewed the change through the Performance lens, verifying the clustering algorithm, the helper call paths, and the shell upsert loop against the actual code.

Review

VERDICT: No material issues found

Performance-relevant observations (all checked, none rise to a real concern):

  • clusterFindings is ~O(F·C) (cmd/gadfly/consensus.go:267-293): for each finding it linearly scans every cluster created so far (including other files', filtered by c.file == file). This is quadratic in the worst case, but the operand set is tiny — total findings = models × findings-per-model, realistically a few dozen. The whole consolidation runs once, after the swarm, off the hot path; the cost is dominated entirely by LLM inference upstream. Not worth optimizing.
  • collectFindings now runs twice per lens result — once in emit() and once in writeFindingsOut() (cmd/gadfly/main.go:225 then emit at :224). Each call re-runs extractStructuredFindingsOrScrape. I read findings.go:112 — it only parses already-in-memory strings (regex/prose scrape), no I/O. Doubling a microsecond-scale string parse at end-of-run is negligible.
  • renderConsolidated called in writeFindingsOut to populate Markdown (consensus.go): builds markdown once per model from in-memory results. Fine.
  • upsert_comment_body pagination (entrypoint.sh:71-79): sequential API calls up to 10 pages × 50 comments. This mirrors the pre-existing upsert_comment pattern in run.sh and is network-bound CI glue, not a regression introduced here — and it's bounded.
  • No unbounded growth, no N+1 against a DB, no work added inside a genuine hot loop. The per-model JSON artifacts are written once and read once.

Nothing in the performance lens is materially wrong with this change.

🧯 Error handling & edge cases — Minor issues

I've verified both findings against the actual code. Let me confirm the mechanics:

Finding 1 — Confirmed. In run.sh, the skip cases (no Ollama key at run.sh:161, binary missing at run.sh:165) set REVIEW without ever invoking the binary, so GADFLY_FINDINGS_OUT is never written; the crash case (run.sh:186) invokes the binary but writeFindingsOut (main.go:229) only runs on the success path, so a non-zero exit writes no file either. In all three, run.sh:220 (CONSOLIDATE=1) suppresses the per-model comment. In entrypoint.sh, the fallback (entrypoint.sh:311-333) fires only when CONSENSUS is entirely empty and even then iterates for f in "${FINDINGS_DIR}"/*.json — a model with no file is never reconstructed. Net: a failed/skipped model produces no PR signal (only a stderr line). Mechanically accurate.

Finding 2 — Confirmed. entrypoint.sh:238 tr -c '[:alnum:]._-' '_' maps both : and / to _, so foo:bar and foo/bar collide to foo_bar.json; the later writeFindingsOut overwrites, dropping a model from consensus. Mirrors status_file_for (entrypoint.sh:190), so not a regression. Edge case, accurately scoped.

Both survive verification; I've corrected the line anchors (main.go 230→229, entrypoint.sh 236→238).

VERDICT: Minor issues

The happy paths are well-guarded (best-effort writeFindingsOut, || true around the consolidation call, junk-file skipping in runConsolidate, nil-safe maps, line>0 guards). The gaps are on the unhappy path — what happens to a model that errors or is skipped while consolidation is on.

Findings

1. (medium) In consolidate mode, an errored/skipped model is silently dropped — its error notice is lost.
cmd/gadfly/main.go:229writeFindingsOut(results) runs only at the end of the success path; if the binary returns an error, main() (main.go:118) does os.Exit(1) and no findings file is written. The early "soft" notices in scripts/run.sh (⚠️ No Ollama Cloud key configured at run.sh:161, ⚠️ Agentic reviewer binary not found at run.sh:165) set REVIEW without invoking the binary at all → again no findings file. Meanwhile run.sh:220 suppresses that model's per-model comment when CONSOLIDATE=1. Net effect: in non-consolidate mode each of these produces a visible ⚠️ … comment; in consolidate mode they produce silence (only a stderr model X failed (continuing) line, and for the skip cases not even that — run.sh exits 0). The fallback in entrypoint.sh:311 only fires when CONSENSUS is entirely empty, and even then it loops for f in "${FINDINGS_DIR}"/*.json — a model with no file is never reconstructed. So a swarm where the API key is misconfigured for every model, or several models crash, posts either nothing or a clean-looking consensus, and the operator gets no signal that models failed. Suggested fix: when CONSOLIDATE=1 and the binary fails or is skipped, still write a minimal artifact (or have entrypoint.sh enumerate MODEL_LIST vs. present files and surface the missing/errored ones), so failures stay visible.

2. (small) findings_file_for can silently collide two distinct model specs onto one file, dropping one from consensus.
entrypoint.sh:238tr -c '[:alnum:]._-' '_' maps both : and / to _, so e.g. foo:bar (bare id) and foo/bar (provider foo) sanitize to the same foo_bar.json; the second model's writeFindingsOut overwrites the first, and one model vanishes from the cluster counts/denominator with no error. This mirrors the pre-existing status_file_for (entrypoint.sh:190) pattern, so it's low-severity and not a regression, but consensus counting makes a silent drop more misleading here than on the status board. Edge-case; only bites unusual co-configured specs.

Everything else in the error-handling lane checked out: runConsolidate correctly errors on an empty dir and tolerates unreadable/unparseable/.json-less/empty-Model files; clustering uses initialized maps and a smallest-line representative that can't drift past tolerance; clean models are correctly counted in the X/total denominator; the consolidate-mode early return in run() sidesteps the GADFLY_REPO_DIR/DIFF_FILE requirement and the telemetry emit; no panic/index/division risks.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 17s

<!-- gadfly-review:ollama:claude-code/opus --> ### 🪰 Gadfly review — `claude-code/opus` (claude-code) **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> ## VERDICT: No material issues found I traced the tainted-data path this PR introduces — PR diff → model review → structured findings → per-model JSON artifact → consolidation → single PR comment — and found no new security-relevant sink. What I checked (by reading the actual code, not just the diff): - **JSON/command injection into the Gitea API** — every comment payload is built with `jq -n --arg b "$body" '{body:$b}'` in `upsert_comment_body` (`entrypoint.sh:71`) and the fallback loop reads model fields with `jq -r '.markdown // ""'` then re-encodes via the same `--arg` path. Attacker-influenced model text (titles, detail, full markdown) is never interpolated into a JSON string or a shell command unquoted. Safe. - **Path traversal on the findings artifact** — `findings_file_for` (`entrypoint.sh:209`) sanitizes the model id with `tr -c '[:alnum:]._-' '_'`, identical to the pre-existing `status_file_for`. `/` is collapsed, so a `../`-style model id can't escape `FINDINGS_DIR`. `GADFLY_FINDINGS_OUT`/`GADFLY_CONSOLIDATE_DIR` are operator-set (entrypoint), not PR-controlled. `runConsolidate` only reads `*.json` under the given dir and `json.Unmarshal`s them. Safe. - **Table-cell breakout** — `mdCell` (`consensus.go`) escapes `|` and collapses newlines before findings titles/lenses go into the markdown table, so a crafted finding title can't forge extra table columns/rows in the headline. - **Secret/credential leakage** — `modelFindings` carries only model/provider/verdict/markdown/findings; no keys or tokens are serialized to the artifact files (mode `0o644`, inside the container workdir). `writeFindingsOut` no-ops when `GADFLY_FINDINGS_OUT` is empty (`consensus.go:78`). - **Authz/trigger surface** — unchanged; the consolidation step runs after the existing gating, adds no new entry point. One pre-existing, out-of-scope note (not introduced here, not counted as a finding): the per-model full review is embedded raw inside `<details>` in both the consensus body and the fallback comment, so model output containing `</details>` could break out of the fold and inject top-level markdown. This is identical to the existing per-model path, which already posts `$REVIEW` raw, so this PR does not widen exposure — and Gitea sanitizes rendered HTML. Worth a future hardening pass, but no regression here. </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> ## VERDICT: Minor issues I reviewed the consensus consolidation logic through the **Correctness** lens, verifying cross-file symbols against the checked-out code (`consolidate.go`, `emit.go`, `findings.go`, `main.go`, `run.sh`, `entrypoint.sh`). The clustering math, severity escalation, tolerance matching, lone-critical surfacing, and the `auto`/`1`/`0` resolution flow (`GADFLY_CONSOLIDATE` empty→`auto` via `:-`, then numeric `CONSOLIDATE` passed to `run.sh`) all check out and match the tests. The `collectFindings` refactor in `emit.go` is behavior-preserving. A few real issues: - **`cmd/gadfly/consensus.go:88` — an all-errored model is recorded as a clean participant, polluting the consensus denominator.** `writeFindingsOut` stores `Verdict: worstVerdict(results).label()`. I verified `worstVerdict` (`consolidate.go:82`) ignores `r.errored` and only takes the max of `r.verdict`, starting from `verdictClean` — so a model whose **every** lens errored (e.g. all timed out) yields `verdictClean` → label `"No material issues found"`, and `collectFindings` returns nothing for errored lenses, so it writes a file with zero findings. `runConsolidate` then counts it in `total` (`renderConsensus:177`) and folds it into `worst`. Net effect: a model that produced *no actual review* appears as a clean reviewer, and a real finding from the one working model in a 2-model run shows `1/2` (looks like weak agreement) instead of `1/1`. Note `renderConsolidated` itself already distinguishes this case ("Review incomplete — all lenses errored", `consolidate.go:107`), so the stored `Verdict` field is inconsistent with the model's own rendered markdown. Fix: when every lens errored, either skip `writeFindingsOut` for that model or stamp a non-clean/"incomplete" verdict and exclude such artifacts from `total`. (Verified by reading `worstVerdict`, `collectFindings`, and the `total`/`worst` use in `renderConsensus`.) - **`cmd/gadfly/consensus.go:273` — tolerance window is anchored at the cluster's minimum line, so it doesn't slide; chains of nearby findings under-merge.** The match test is `absInt(c.line - it.f.Line) <= lineTolerance` and `c.line` is pinned to the smallest line in the cluster (`mergeIntoCluster:319` only ever lowers it). Items are line-sorted, so findings at e.g. lines 1, 4, 7 produce two clusters (1+4 merge; 7 is 6 from the anchor and splits) even though 4 and 7 are within tolerance of each other. Since cross-model agreement is the headline signal the whole feature ranks on, this can undercount agreement for the same issue cited a few lines apart. Low severity / partly a design tradeoff (the ±3 covers the common "line or two apart" case), but flagging since it directly affects the agreement count. (Verified by tracing `clusterFindings` + `mergeIntoCluster`.) - **`entrypoint.sh:317` / `scripts/run.sh:50` — switching consolidation mode on/off (or model count crossing 2) on the same PR orphans the other style's comment.** The consensus comment uses marker `<!-- gadfly-consensus -->` while per-model comments use `<!-- gadfly-review:ollama:<model> -->`; neither path deletes the other. A PR first reviewed with 1 model (per-model comment) then re-run with ≥2 models (consensus) leaves the stale per-model comment behind, and vice-versa. Low severity, but produces duplicate/contradictory comments. Suggest deleting the opposite-marker comment(s) when posting. (Verified markers in both scripts; confirmed no cleanup of the other marker.) I could not run `go build` (private `majordomo` module is gated in this sandbox), but I confirmed there are no duplicate definitions of the new helpers (`absInt`/`plural`/`sevRank`/etc.) and all referenced symbols (`parseVerdict`, `verdictClean`, `renderConsolidated`, `extractStructuredFindingsOrScrape`, `modelProvider`, `finding` fields) exist with matching signatures, so the new file should compile. </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> I've verified all three findings against `cmd/gadfly/consensus.go`. Findings 1 and 3 are confirmed exactly as described. Finding 2's substance is confirmed (the anonymous struct type is duplicated and a named type would help), but its claimed count is wrong — the type is spelled out **twice** (map value type at `:250`, append element type at `:259`), not three times. I've corrected that detail and kept the finding. ## VERDICT: Minor issues - **`cmd/gadfly/consensus.go:164` — `cluster.detail` is write-only dead state.** The field is declared (`:164`), populated when a cluster is created (`:285`) and on severity escalation in `mergeIntoCluster` (`:328`), with `outFinding.Detail` plumbed all the way into it — but it is **never read or rendered**. The consensus table emits only `title`, `severity`, `location`, model count, and lenses; the per-model drill-down uses `m.Markdown`, not findings. Confirmed by grepping every `detail`/`Detail` site: the only reads are on `outFinding.Detail` / `finding.detail`, never on `cluster.detail`. Suggested fix: drop `cluster.detail` and its two assignments (and optionally the `outFinding.Detail` plumbing if the artifact has no other consumer), or actually surface it if that was the intent. Severity: trivial. - **`cmd/gadfly/consensus.go:250-262` — the anonymous struct type `struct { f outFinding; model string }` is spelled out twice** (map value type at `:250`, composite-literal element type in the `append` at `:259`). A one-line named type (e.g. `type fileFinding struct { f outFinding; model string }`) declared once would make the grouping self-documenting and the `append` a clean composite literal, removing the duplicate spelling that must be kept in sync by hand. Severity: trivial. - **`cmd/gadfly/consensus.go:200-205` — the multi-model `agreed` count re-loops over `clusters`** immediately after the partition loop at `:189-196` already walked the same slice and tested `len(c.models) >= 2`. Minor redundancy; the count could be accumulated in the partition loop. Severity: trivial. Otherwise the new file follows the surrounding conventions well (doc comments per function, stdlib-only, helpers factored sensibly, and the `collectFindings` extraction genuinely de-duplicates `emit.go`). Nothing blocking. </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> I've reviewed the change through the Performance lens, verifying the clustering algorithm, the helper call paths, and the shell upsert loop against the actual code. ## Review **VERDICT: No material issues found** Performance-relevant observations (all checked, none rise to a real concern): - **`clusterFindings` is ~O(F·C)** (`cmd/gadfly/consensus.go:267-293`): for each finding it linearly scans every cluster created so far (including other files', filtered by `c.file == file`). This is quadratic in the worst case, but the operand set is tiny — total findings = `models × findings-per-model`, realistically a few dozen. The whole consolidation runs once, after the swarm, off the hot path; the cost is dominated entirely by LLM inference upstream. Not worth optimizing. - **`collectFindings` now runs twice per lens result** — once in `emit()` and once in `writeFindingsOut()` (`cmd/gadfly/main.go:225` then `emit` at `:224`). Each call re-runs `extractStructuredFindingsOrScrape`. I read `findings.go:112` — it only parses already-in-memory strings (regex/prose scrape), no I/O. Doubling a microsecond-scale string parse at end-of-run is negligible. - **`renderConsolidated` called in `writeFindingsOut`** to populate `Markdown` (`consensus.go`): builds markdown once per model from in-memory results. Fine. - **`upsert_comment_body` pagination** (`entrypoint.sh:71-79`): sequential API calls up to 10 pages × 50 comments. This mirrors the pre-existing `upsert_comment` pattern in `run.sh` and is network-bound CI glue, not a regression introduced here — and it's bounded. - No unbounded growth, no N+1 against a DB, no work added inside a genuine hot loop. The per-model JSON artifacts are written once and read once. Nothing in the performance lens is materially wrong with this change. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> I've verified both findings against the actual code. Let me confirm the mechanics: **Finding 1** — Confirmed. In `run.sh`, the skip cases (no Ollama key at `run.sh:161`, binary missing at `run.sh:165`) set `REVIEW` without ever invoking the binary, so `GADFLY_FINDINGS_OUT` is never written; the crash case (`run.sh:186`) invokes the binary but `writeFindingsOut` (`main.go:229`) only runs on the success path, so a non-zero exit writes no file either. In all three, `run.sh:220` (`CONSOLIDATE=1`) suppresses the per-model comment. In `entrypoint.sh`, the fallback (`entrypoint.sh:311-333`) fires only when `CONSENSUS` is *entirely* empty and even then iterates `for f in "${FINDINGS_DIR}"/*.json` — a model with no file is never reconstructed. Net: a failed/skipped model produces no PR signal (only a stderr line). Mechanically accurate. **Finding 2** — Confirmed. `entrypoint.sh:238` `tr -c '[:alnum:]._-' '_'` maps both `:` and `/` to `_`, so `foo:bar` and `foo/bar` collide to `foo_bar.json`; the later `writeFindingsOut` overwrites, dropping a model from consensus. Mirrors `status_file_for` (`entrypoint.sh:190`), so not a regression. Edge case, accurately scoped. Both survive verification; I've corrected the line anchors (main.go 230→229, entrypoint.sh 236→238). ## VERDICT: Minor issues The happy paths are well-guarded (best-effort `writeFindingsOut`, `|| true` around the consolidation call, junk-file skipping in `runConsolidate`, nil-safe maps, `line>0` guards). The gaps are on the *unhappy* path — what happens to a model that errors or is skipped while consolidation is on. ### Findings **1. (medium) In consolidate mode, an errored/skipped model is silently dropped — its error notice is lost.** `cmd/gadfly/main.go:229` — `writeFindingsOut(results)` runs only at the *end* of the success path; if the binary returns an error, `main()` (`main.go:118`) does `os.Exit(1)` and no findings file is written. The early "soft" notices in `scripts/run.sh` (`⚠️ No Ollama Cloud key configured` at `run.sh:161`, `⚠️ Agentic reviewer binary not found` at `run.sh:165`) set `REVIEW` *without invoking the binary at all* → again no findings file. Meanwhile `run.sh:220` suppresses that model's per-model comment when `CONSOLIDATE=1`. Net effect: in non-consolidate mode each of these produces a visible `⚠️ …` comment; in consolidate mode they produce **silence** (only a stderr `model X failed (continuing)` line, and for the skip cases not even that — run.sh exits 0). The fallback in `entrypoint.sh:311` only fires when `CONSENSUS` is *entirely* empty, and even then it loops `for f in "${FINDINGS_DIR}"/*.json` — a model with no file is never reconstructed. So a swarm where the API key is misconfigured for every model, or several models crash, posts either nothing or a clean-looking consensus, and the operator gets no signal that models failed. Suggested fix: when `CONSOLIDATE=1` and the binary fails or is skipped, still write a minimal artifact (or have `entrypoint.sh` enumerate `MODEL_LIST` vs. present files and surface the missing/errored ones), so failures stay visible. **2. (small) `findings_file_for` can silently collide two distinct model specs onto one file, dropping one from consensus.** `entrypoint.sh:238` — `tr -c '[:alnum:]._-' '_'` maps both `:` and `/` to `_`, so e.g. `foo:bar` (bare id) and `foo/bar` (provider `foo`) sanitize to the same `foo_bar.json`; the second model's `writeFindingsOut` overwrites the first, and one model vanishes from the cluster counts/denominator with no error. This mirrors the pre-existing `status_file_for` (`entrypoint.sh:190`) pattern, so it's low-severity and not a regression, but consensus *counting* makes a silent drop more misleading here than on the status board. Edge-case; only bites unusual co-configured specs. Everything else in the error-handling lane checked out: `runConsolidate` correctly errors on an empty dir and tolerates unreadable/unparseable/`.json`-less/empty-`Model` files; clustering uses initialized maps and a smallest-line representative that can't drift past tolerance; clean models are correctly counted in the `X/total` denominator; the consolidate-mode early return in `run()` sidesteps the `GADFLY_REPO_DIR/DIFF_FILE` requirement and the telemetry emit; no panic/index/division risks. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 17s</sub>

🪰 Gadfly review — claude-code/opus:max (claude-code)

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — Minor issues

I've verified the finding against the actual code.

Confirmed:

  • consensus.go:213 and :226 interpolate location(c.file, c.line) directly into the table inside backticks, while every other cell (title, lensList, oneModel, model, provider) is routed through mdCell (lines 213–214, 225–227, 239–240).
  • c.file traces to the model's gadfly-findings block: outFinding.Filefinding.filesf.File, only TrimSpace'd (findings.go:60). The model reviews the attacker-controlled PR diff, so this value is untrusted.
  • mdCell (consensus.go) escapes | and collapses newlines; location() does neither.
  • The GFM table-cell split happens before inline parsing, so an unescaped | inside a code span still terminates the cell — the inconsistency is real.

The finding is accurate and honestly scoped (low impact, not XSS). It survives.

VERDICT: Minor issues

The change is mostly plumbing (write per-model findings → cluster → render one comment); the untrusted-input paths are otherwise clean. One minor output-encoding gap.

Findings

  • cmd/gadfly/consensus.go:213 / :226 — attacker-influenced file interpolated into the consensus table without escaping. c.file originates from the model's gadfly-findings block (extractStructuredFindings, findings.go:60sf.File, only TrimSpace'd). The model reviews the attacker-controlled PR diff and can be prompt-injected, so file is untrusted. In renderConsensus, title, lensList, and oneModel all pass through mdCell (which escapes | and collapses newlines), but location(c.file, c.line) does not. A file value containing | breaks the markdown table layout (a pipe inside a code span still terminates a GFM table cell), letting a crafted PR garble or smuggle text into the consensus comment. Impact is low — Gitea sanitizes raw HTML so this isn't XSS — but it's an inconsistency in untrusted-data handling. Fix: wrap it, e.g. mdCell(location(c.file, c.line)), or escape inside location.

Checked and found safe (not reported):

  • No command injection / path traversal. In entrypoint.sh, the fallback loop's m/prov/md come from JSON via jq -r and are only passed as printf '%s' args and jq -n --arg values — never eval'd. findings_file_for is called only with operator-configured model names and sanitizes via tr -c '[:alnum:]._-' '_' (slashes stripped). GADFLY_FINDINGS_OUT/GADFLY_CONSOLIDATE_DIR are operator-set, not PR-derived.
  • No SSRF / new network sinks. upsert_comment_body only curls the fixed ${GITEA_API} with the existing token header.
  • No secret leakage. tail -c 500 consolidate.err carries parse/IO errors only.
  • mdCell is added hardening for title/lens/model; the embedded folded m.Markdown is the same trust level as the pre-existing per-model comment, so no new exposure.
🎯 Correctness — Minor issues

Both findings are confirmed against the actual code. Let me verify the line numbers precisely:

  • Finding 1 (location-only clustering): The merge predicate at consensus.go:273 is c.file == file && absInt(c.line-it.f.Line) <= lineTolerance — no title/lens/identity check. mergeIntoCluster (312–331) unions models/lenses and keeps only the top-severity title/detail. Confirmed.
  • Finding 2 (threshold high, Blocking→high): The partition at consensus.go:191 is len(c.models) >= 2 || sevRank(c.severity) >= sevRank("high"). collectFindings (67–71) fills missing severity from the lens verdict; verdict.severity() maps Blocking→"high" (consolidate.go:38–39). README/PR say "lone critical"; the only test uses critical. Confirmed.

Both survive verification. Corrected line numbers below.


I've verified the dependencies and logic against the checked-out code. All helper functions (parseVerdict, worstVerdict, modelProvider, renderConsolidated, verdict.severity()/label(), extractStructuredFindingsOrScrape) exist with matching signatures; the emit.go refactor to collectFindings is behavior-preserving; and the GADFLY_FINDINGS_OUT plumbing reaches the binary on the only lane entrypoint uses. The round-trip is sound.

Two semantic-correctness issues remain, both about the agreement metric — which is the entire point of this feature.

VERDICT: Minor issues

  • cmd/gadfly/consensus.go:273 (clusterFindings / lineTolerance) — agreement count conflates distinct issues that merely sit near each other. Clustering matches on file + line within ±3 only — there is no check on title, lens, or issue identity (c.file == file && absInt(c.line-it.f.Line) <= lineTolerance is the sole predicate). So if model A flags a perf issue at foo.go:40 and model B flags an unrelated error-handling issue at foo.go:42, they merge into one cluster reported as 2/N agreement, and only the higher-severity title is displayed while the other finding vanishes. Cross-model agreement is explicitly sold as "the strongest real-vs-false-positive signal," and it's what the table is ranked by — so a false merge both inflates the headline confidence and drops a real finding. mergeIntoCluster (consensus.go:312) unions models/lenses and keeps only the top-severity title/detail; nothing guards that the merged findings describe the same problem. Suggested fix: also require lens match (or a title/keyword-similarity gate) before merging, or at minimum surface the distinct titles within a merged cluster rather than discarding all but one. Severity: medium.

  • cmd/gadfly/consensus.go:191 — lone-finding headline threshold is high, not critical as documented, and every finding from a Blocking lens is auto-tagged high. The partition is len(c.models) >= 2 || sevRank(c.severity) >= sevRank("high"). But the README ("a lone critical still surfaces") and the PR body both say the single-model carve-out is for critical. Worse, collectFindings (consensus.go:67) fills missing per-finding severity from the lens verdict, and verdict.severity() maps Blocking → "high" (consolidate.go:38-39). So every heuristic-scraped finding from any single model whose lens landed on "Blocking" gets high and surfaces as its own unfolded 🔴 row at 1/N — exactly the per-model noise the folding is meant to suppress. The existing test only covers a lone critical (TestRenderConsensusHighSeverityLoneFindingStaysHeadline), so this lone-high path is untested. Suggested fix: either change the threshold to >= sevRank("critical") to match the stated intent, or update the README/PR wording to "lone high-or-critical" — and decide intentionally whether verdict-derived high should count. Severity: small.

Nothing else in the correctness lens: the emit.gocollectFindings extraction is equivalent to the prior inline logic, severity escalation and representative-line selection (mergeIntoCluster) are correct, the empty-dir/junk-file/round-trip paths match the tests, and the entrypoint fallback preserves the advisory-never-lose-output invariant.

🧹 Code cleanliness & maintainability — Minor issues

I've verified all three findings against the actual code. Here is the corrected review.

VERDICT: Minor issues

  • cluster.detail is dead state — collected and merged but never rendered (cmd/gadfly/consensus.go:164, :285, :327-329). Confirmed: the detail field is declared on the cluster struct (:164), initialized in the cluster literal (:285), and conditionally overwritten in mergeIntoCluster ("keep the highest-severity report's title/detail", :327-329), but renderConsensus (:176-244) only ever reads c.title, c.severity, c.file, c.line, c.models, and c.lensesc.detail is read nowhere, in neither the headline table, the folded table, nor the per-model section. So the detail-merge branch and the field carry no behavior. The sibling title field is rendered, which makes the asymmetry easy to miss. Low-churn fix: drop cluster.detail and the c.detail = f.Detail line (the outFinding.Detail artifact field is unaffected — it's still populated in writeFindingsOut and consumed by the telemetry reportPayload). If detail was meant to surface in the comment, that's a missing feature rather than dead code, but as written it's dead.

  • Per-model comment header/footer framing is now duplicated across scripts (entrypoint.sh:329-330 vs scripts/run.sh:223). The consolidation fallback path in entrypoint.sh re-hand-writes the exact ### 🪰 Gadfly review — \%s` (%s)header andAutomated adversarial review by Gadfly. Advisory only — does not block merge.footer thatrun.shalready owns (the only difference is the dropped· ⏱️ reviewed in %s). The header template now lives in three printf sites (run.sh:138placeholder,run.sh:223final,entrypoint.sh:329fallback); a future wording change has to be made in all three or the fallback comments silently diverge from the normal ones. Hard to fully DRY across two bash scripts, but worth a comment cross-referencingrun.sh` so the coupling is visible.

Performance — No material issues found

I've reviewed the consolidation code paths through the performance lens. Let me confirm the data volumes are bounded by checking the finding caps and how often these paths run.

The relevant facts I verified by reading the code:

  • clusterFindings (consensus.go:248–308) is the only nested loop: for each finding it scans existing clusters (for ci := range clusters). That's worst-case O(T²) in total findings T. But T is tightly bounded — maxFindingsPerLens = 20 (emit.go:42) caps scraped findings per lens, ~5 lenses × a handful of models ≈ low hundreds at most. Each comparison is an int subtraction + string compare. This runs once per PR, in a dedicated consolidation process. Not material.
  • collectFindings is now called twice per run — once in emit() (emit.go:133) and once in writeFindingsOut() (consensus.go:92), each re-running extractStructuredFindingsOrScrape/regex over each lens's markdown. This is duplicated parse work, but it's a one-time, end-of-run cost over small in-memory strings, dwarfed by the LLM calls that produced them. Micro-optimization at best, not a justifiable regression.
  • upsert_comment_body / fallback loop (entrypoint.sh) paginates comments (≤10 pages × 50) and, in the fallback path, does so once per model. Bounded by a small model count and only on the rare fallback. Not material.
  • writeFindingsOut / runConsolidate do one JSON marshal + one file write per model and one directory read — all linear in the (small) model/finding count, run once.

There are no hot loops, no per-request N+1 patterns, no unbounded growth, and no blocking calls on a latency-sensitive path here — this is a once-per-PR CI batch tool operating on bounded data.

VERDICT: No material issues found

  • Nothing in the performance lens rises to a real regression. The one nested loop (clusterFindings) and the double collectFindings call are both bounded, run-once costs that are negligible against the LLM review work; flagging them would be micro-optimization, not a justifiable finding.
🧯 Error handling & edge cases — Minor issues

Verification complete. I traced the failure path across run.sh, main.go (writeFindingsOut runs only at the end of a successful run()), and the entrypoint.sh consolidation/fallback block.

VERDICT: Minor issues

🧯 Error handling & edge cases

  • Failed-model output is silently dropped in consolidation mode, and the "fallback never loses output" claim is false for the most common failure. entrypoint.sh:322 / scripts/run.sh:138,220.
    When CONSOLIDATE=1, run.sh suppresses both the placeholder comment (run.sh:138) and the final per-model comment (run.sh:220). A model only contributes to consensus by writing GADFLY_FINDINGS_OUT — and writeFindingsOut runs only at the tail of a successful run() (main.go:226). Every unhappy path in run.sh's ollama branch sets REVIEW to a ⚠️ … notice but the binary writes no findings file: no Ollama key (run.sh:161), binary missing (run.sh:165), or binary exit≠0/empty output (run.sh:186). Consequences:

    • Partial failure (some models succeed): consolidation succeeds and posts a consensus comment, but the failed models' error notices vanish entirely — not suppressed-but-recoverable, just gone (no file for the fallback at entrypoint.sh:322 to replay).
    • Total failure (a shared misconfig — e.g. a missing key affecting all models, a very common case): zero findings files → runConsolidate returns "no model findings to consolidate" (consensus.go:181) → CONSENSUS empty → fallback loop's glob "${FINDINGS_DIR}"/*.json matches nothing → nothing is posted at all. Before this PR the same misconfig produced N visible ⚠️ No Ollama Cloud key configured… comments telling the user exactly what to fix; now the PR gets total silence.

    The fallback at entrypoint.sh:322 can only replay models that produced a findings file — i.e. the ones that succeeded — which structurally excludes the failed models whose output you'd most want surfaced. The comment at entrypoint.sh:308-310 ("if the consolidation pass produces nothing, fall back to posting each model's review… so a consolidation hiccup never loses output") only holds for a rendering hiccup with files present, not for the run-failure cases.
    Suggested fix: have run.sh write a minimal findings artifact even on the failure paths (model + provider + the ⚠️ REVIEW markdown, empty findings[]) so both consolidation's per-model detail and the fallback can surface it; or, when CONSOLIDATE=1 and a model produced no findings file, fall back to posting that model's ⚠️ notice as its own comment. Verified by reading the three files end-to-end.

Everything else in my lane checks out: runConsolidate guards empty dir / unreadable / unparseable / empty-model files and continues (consensus.go:160-182); writeFindingsOut is a correct no-op on empty env and best-effort on marshal/write errors; absInt/oneModel/mdCell/line-0 paths don't panic; worst/parseVerdict handle empty/unknown verdicts without misranking; and findings_file_for/CONSOLIDATE are defined before review_one is ever invoked (no set -u ordering trap).

Outside my lens: the headline table prints "No material issues found by consensus." even when the verdict line above it reads "Minor issues" (all findings single-model/low-sev and folded) — a possible correctness/presentation inconsistency for that lens to judge.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 58s

<!-- gadfly-review:ollama:claude-code/opus:max --> ### 🪰 Gadfly review — `claude-code/opus:max` (claude-code) **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — Minor issues</summary> I've verified the finding against the actual code. **Confirmed:** - `consensus.go:213` and `:226` interpolate `location(c.file, c.line)` directly into the table inside backticks, while every other cell (`title`, `lensList`, `oneModel`, `model`, `provider`) is routed through `mdCell` (lines 213–214, 225–227, 239–240). - `c.file` traces to the model's `gadfly-findings` block: `outFinding.File` ← `finding.file` ← `sf.File`, only `TrimSpace`'d (`findings.go:60`). The model reviews the attacker-controlled PR diff, so this value is untrusted. - `mdCell` (consensus.go) escapes `|` and collapses newlines; `location()` does neither. - The GFM table-cell split happens before inline parsing, so an unescaped `|` inside a code span still terminates the cell — the inconsistency is real. The finding is accurate and honestly scoped (low impact, not XSS). It survives. ## VERDICT: Minor issues The change is mostly plumbing (write per-model findings → cluster → render one comment); the untrusted-input paths are otherwise clean. One minor output-encoding gap. **Findings** - **`cmd/gadfly/consensus.go:213` / `:226` — attacker-influenced `file` interpolated into the consensus table without escaping.** `c.file` originates from the model's `gadfly-findings` block (`extractStructuredFindings`, `findings.go:60` → `sf.File`, only `TrimSpace`'d). The model reviews the attacker-controlled PR diff and can be prompt-injected, so `file` is untrusted. In `renderConsensus`, `title`, `lensList`, and `oneModel` all pass through `mdCell` (which escapes `|` and collapses newlines), but `location(c.file, c.line)` does **not**. A `file` value containing `|` breaks the markdown table layout (a pipe inside a code span still terminates a GFM table cell), letting a crafted PR garble or smuggle text into the consensus comment. Impact is low — Gitea sanitizes raw HTML so this isn't XSS — but it's an inconsistency in untrusted-data handling. Fix: wrap it, e.g. `mdCell(location(c.file, c.line))`, or escape inside `location`. **Checked and found safe (not reported):** - *No command injection / path traversal.* In `entrypoint.sh`, the fallback loop's `m`/`prov`/`md` come from JSON via `jq -r` and are only passed as `printf '%s'` args and `jq -n --arg` values — never `eval`'d. `findings_file_for` is called only with operator-configured model names and sanitizes via `tr -c '[:alnum:]._-' '_'` (slashes stripped). `GADFLY_FINDINGS_OUT`/`GADFLY_CONSOLIDATE_DIR` are operator-set, not PR-derived. - *No SSRF / new network sinks.* `upsert_comment_body` only curls the fixed `${GITEA_API}` with the existing token header. - *No secret leakage.* `tail -c 500 consolidate.err` carries parse/IO errors only. - *`mdCell` is added hardening* for title/lens/model; the embedded folded `m.Markdown` is the same trust level as the pre-existing per-model comment, so no new exposure. </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> Both findings are confirmed against the actual code. Let me verify the line numbers precisely: - **Finding 1** (location-only clustering): The merge predicate at `consensus.go:273` is `c.file == file && absInt(c.line-it.f.Line) <= lineTolerance` — no title/lens/identity check. `mergeIntoCluster` (312–331) unions models/lenses and keeps only the top-severity title/detail. Confirmed. - **Finding 2** (threshold `high`, Blocking→high): The partition at `consensus.go:191` is `len(c.models) >= 2 || sevRank(c.severity) >= sevRank("high")`. `collectFindings` (67–71) fills missing severity from the lens verdict; `verdict.severity()` maps Blocking→"high" (consolidate.go:38–39). README/PR say "lone *critical*"; the only test uses `critical`. Confirmed. Both survive verification. Corrected line numbers below. --- I've verified the dependencies and logic against the checked-out code. All helper functions (`parseVerdict`, `worstVerdict`, `modelProvider`, `renderConsolidated`, `verdict.severity()/label()`, `extractStructuredFindingsOrScrape`) exist with matching signatures; the `emit.go` refactor to `collectFindings` is behavior-preserving; and the `GADFLY_FINDINGS_OUT` plumbing reaches the binary on the only lane entrypoint uses. The round-trip is sound. Two semantic-correctness issues remain, both about the **agreement metric** — which is the entire point of this feature. ## VERDICT: Minor issues - **`cmd/gadfly/consensus.go:273` (clusterFindings / lineTolerance) — agreement count conflates distinct issues that merely sit near each other.** Clustering matches on file + line within ±3 *only* — there is no check on title, lens, or issue identity (`c.file == file && absInt(c.line-it.f.Line) <= lineTolerance` is the sole predicate). So if model A flags a perf issue at `foo.go:40` and model B flags an unrelated error-handling issue at `foo.go:42`, they merge into one cluster reported as **2/N agreement**, and only the higher-severity title is displayed while the other finding vanishes. Cross-model agreement is explicitly sold as "the strongest real-vs-false-positive signal," and it's what the table is *ranked* by — so a false merge both inflates the headline confidence and drops a real finding. `mergeIntoCluster` (consensus.go:312) unions `models`/`lenses` and keeps only the top-severity `title`/`detail`; nothing guards that the merged findings describe the same problem. Suggested fix: also require lens match (or a title/keyword-similarity gate) before merging, or at minimum surface the distinct titles within a merged cluster rather than discarding all but one. Severity: medium. - **`cmd/gadfly/consensus.go:191` — lone-finding headline threshold is `high`, not `critical` as documented, and every finding from a Blocking lens is auto-tagged `high`.** The partition is `len(c.models) >= 2 || sevRank(c.severity) >= sevRank("high")`. But the README ("a lone *critical* still surfaces") and the PR body both say the single-model carve-out is for *critical*. Worse, `collectFindings` (consensus.go:67) fills missing per-finding severity from the lens verdict, and `verdict.severity()` maps **Blocking → "high"** (consolidate.go:38-39). So every heuristic-scraped finding from any single model whose lens landed on "Blocking" gets `high` and surfaces as its own unfolded 🔴 row at `1/N` — exactly the per-model noise the folding is meant to suppress. The existing test only covers a lone *critical* (`TestRenderConsensusHighSeverityLoneFindingStaysHeadline`), so this lone-`high` path is untested. Suggested fix: either change the threshold to `>= sevRank("critical")` to match the stated intent, or update the README/PR wording to "lone high-or-critical" — and decide intentionally whether verdict-derived `high` should count. Severity: small. Nothing else in the correctness lens: the `emit.go` → `collectFindings` extraction is equivalent to the prior inline logic, severity escalation and representative-line selection (`mergeIntoCluster`) are correct, the empty-dir/junk-file/round-trip paths match the tests, and the entrypoint fallback preserves the advisory-never-lose-output invariant. </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> I've verified all three findings against the actual code. Here is the corrected review. ## VERDICT: Minor issues - **`cluster.detail` is dead state — collected and merged but never rendered** (`cmd/gadfly/consensus.go:164`, `:285`, `:327-329`). Confirmed: the `detail` field is declared on the `cluster` struct (`:164`), initialized in the cluster literal (`:285`), and conditionally overwritten in `mergeIntoCluster` ("keep the highest-severity report's title/detail", `:327-329`), but `renderConsensus` (`:176-244`) only ever reads `c.title`, `c.severity`, `c.file`, `c.line`, `c.models`, and `c.lenses` — `c.detail` is read nowhere, in neither the headline table, the folded table, nor the per-model section. So the detail-merge branch and the field carry no behavior. The sibling `title` field *is* rendered, which makes the asymmetry easy to miss. Low-churn fix: drop `cluster.detail` and the `c.detail = f.Detail` line (the `outFinding.Detail` artifact field is unaffected — it's still populated in `writeFindingsOut` and consumed by the telemetry `reportPayload`). If detail was meant to surface in the comment, that's a missing feature rather than dead code, but as written it's dead. - **Per-model comment header/footer framing is now duplicated across scripts** (`entrypoint.sh:329-330` vs `scripts/run.sh:223`). The consolidation fallback path in `entrypoint.sh` re-hand-writes the exact `### 🪰 Gadfly review — \`%s\` (%s)` header and `<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge.</sub>` footer that `run.sh` already owns (the only difference is the dropped `· ⏱️ reviewed in %s`). The header template now lives in three printf sites (`run.sh:138` placeholder, `run.sh:223` final, `entrypoint.sh:329` fallback); a future wording change has to be made in all three or the fallback comments silently diverge from the normal ones. Hard to fully DRY across two bash scripts, but worth a comment cross-referencing `run.sh` so the coupling is visible. </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> I've reviewed the consolidation code paths through the performance lens. Let me confirm the data volumes are bounded by checking the finding caps and how often these paths run. The relevant facts I verified by reading the code: - **`clusterFindings` (consensus.go:248–308)** is the only nested loop: for each finding it scans existing clusters (`for ci := range clusters`). That's worst-case O(T²) in total findings T. But T is tightly bounded — `maxFindingsPerLens = 20` (emit.go:42) caps scraped findings per lens, ~5 lenses × a handful of models ≈ low hundreds at most. Each comparison is an int subtraction + string compare. This runs **once per PR**, in a dedicated consolidation process. Not material. - **`collectFindings` is now called twice per run** — once in `emit()` (emit.go:133) and once in `writeFindingsOut()` (consensus.go:92), each re-running `extractStructuredFindingsOrScrape`/regex over each lens's markdown. This is duplicated parse work, but it's a one-time, end-of-run cost over small in-memory strings, dwarfed by the LLM calls that produced them. Micro-optimization at best, not a justifiable regression. - **`upsert_comment_body` / fallback loop (entrypoint.sh)** paginates comments (≤10 pages × 50) and, in the fallback path, does so once per model. Bounded by a small model count and only on the rare fallback. Not material. - **`writeFindingsOut` / `runConsolidate`** do one JSON marshal + one file write per model and one directory read — all linear in the (small) model/finding count, run once. There are no hot loops, no per-request N+1 patterns, no unbounded growth, and no blocking calls on a latency-sensitive path here — this is a once-per-PR CI batch tool operating on bounded data. VERDICT: No material issues found - Nothing in the performance lens rises to a real regression. The one nested loop (`clusterFindings`) and the double `collectFindings` call are both bounded, run-once costs that are negligible against the LLM review work; flagging them would be micro-optimization, not a justifiable finding. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> Verification complete. I traced the failure path across `run.sh`, `main.go` (`writeFindingsOut` runs only at the end of a successful `run()`), and the `entrypoint.sh` consolidation/fallback block. ## VERDICT: Minor issues **🧯 Error handling & edge cases** - **Failed-model output is silently dropped in consolidation mode, and the "fallback never loses output" claim is false for the most common failure.** `entrypoint.sh:322` / `scripts/run.sh:138,220`. When `CONSOLIDATE=1`, `run.sh` suppresses *both* the placeholder comment (`run.sh:138`) and the final per-model comment (`run.sh:220`). A model only contributes to consensus by writing `GADFLY_FINDINGS_OUT` — and `writeFindingsOut` runs **only at the tail of a successful `run()`** (`main.go:226`). Every unhappy path in `run.sh`'s ollama branch sets `REVIEW` to a `⚠️ …` notice but the binary writes **no findings file**: no Ollama key (`run.sh:161`), binary missing (`run.sh:165`), or binary exit≠0/empty output (`run.sh:186`). Consequences: - *Partial failure* (some models succeed): consolidation succeeds and posts a consensus comment, but the failed models' error notices vanish entirely — not suppressed-but-recoverable, just gone (no file for the fallback at `entrypoint.sh:322` to replay). - *Total failure* (a shared misconfig — e.g. a missing key affecting all models, a very common case): zero findings files → `runConsolidate` returns `"no model findings to consolidate"` (`consensus.go:181`) → `CONSENSUS` empty → fallback loop's glob `"${FINDINGS_DIR}"/*.json` matches nothing → **nothing is posted at all.** Before this PR the same misconfig produced N visible `⚠️ No Ollama Cloud key configured…` comments telling the user exactly what to fix; now the PR gets total silence. The fallback at `entrypoint.sh:322` can only replay models that produced a findings file — i.e. the ones that *succeeded* — which structurally excludes the failed models whose output you'd most want surfaced. The comment at `entrypoint.sh:308-310` ("if the consolidation pass produces nothing, fall back to posting each model's review… so a consolidation hiccup never loses output") only holds for a *rendering* hiccup with files present, not for the run-failure cases. Suggested fix: have `run.sh` write a minimal findings artifact even on the failure paths (model + provider + the `⚠️` `REVIEW` markdown, empty `findings[]`) so both consolidation's per-model detail and the fallback can surface it; or, when `CONSOLIDATE=1` and a model produced no findings file, fall back to posting that model's `⚠️` notice as its own comment. Verified by reading the three files end-to-end. Everything else in my lane checks out: `runConsolidate` guards empty dir / unreadable / unparseable / empty-model files and continues (`consensus.go:160-182`); `writeFindingsOut` is a correct no-op on empty env and best-effort on marshal/write errors; `absInt`/`oneModel`/`mdCell`/line-0 paths don't panic; `worst`/`parseVerdict` handle empty/unknown verdicts without misranking; and `findings_file_for`/`CONSOLIDATE` are defined before `review_one` is ever invoked (no `set -u` ordering trap). Outside my lens: the headline table prints "No material issues found by consensus." even when the verdict line above it reads "Minor issues" (all findings single-model/low-sev and folded) — a possible correctness/presentation inconsistency for that lens to judge. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 58s</sub>

🪰 Gadfly review — deepseek-v4-pro:cloud (ollama-cloud)

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

No material issues found.

I traced the full data flow from untrusted inputs (model output, environment variables, JSON files on disk) through to where they're consumed (shell command construction, file writes, PR comment rendering). Here's what I checked and why nothing rises to a security finding:

  • Path traversal in GADFLY_FINDINGS_OUT: The filename is built by findings_file_for() in entrypoint.sh:238, which sanitizes the model name with tr -c '[:alnum:]._-' '_' — slashes and dots are replaced, so ../../../etc/passwd becomes ___________etc_passwd. The directory prefix (${WORKDIR}/findings) is hardcoded. Safe.

  • Command injection in shell scripts: The only dynamic value passed to a command is $FINDINGS_DIR in entrypoint.sh:314, which is ${WORKDIR}/findings — a hardcoded path, not attacker-controlled. The jq -n --arg b "$body" pattern in upsert_comment_body (entrypoint.sh:74) uses --arg which binds the value as a JSON string with proper escaping. No shell metacharacter injection.

  • JSON deserialization of model artifacts: runConsolidate() (consensus.go:128-145) reads .json files from a directory only writable by the container's own processes. Go's json.Unmarshal into a typed struct is safe — no code execution. The Markdown field (raw LLM output) is embedded in the comment, but this is the same risk profile as the existing per-model comments.

  • Secret leakage: The consensus comment embeds the same per-model markdown that would have been posted individually. No new secret exposure path. The GITEA_TOKEN is never placed into the prompt, diff, or any model-visible context.

  • SSRF: No user-controlled input is used to construct outbound URLs. GITEA_API comes from the workflow environment, controlled by the repo owner.

  • Insecure defaults: GADFLY_CONSOLIDATE defaults to auto, which enables consolidation only for ≥2 models. The default model list is a single model, so consolidation is off by default. Safe.

  • mdCell escaping (consensus.go:390-394): Only escapes pipes and newlines, not backticks. A model-generated title containing a backtick could break the markdown table cell's inline-code formatting. This is a rendering robustness concern, not a security vulnerability — the content is model-generated, not attacker-controlled, and Gitea's markdown renderer is the defense boundary.

🎯 Correctness — Minor issues

I have read the actual code at cmd/gadfly/consensus.go:266-293 and mergeIntoCluster at lines 310-331. Let me trace the algorithm precisely with the draft's example.

The code at line 273 checks:

if c.file == file && absInt(c.line-it.f.Line) <= lineTolerance {

c.line is maintained as the smallest line in the cluster (line 319-320 in mergeIntoCluster: if f.Line > 0 && (c.line == 0 || f.Line < c.line) { c.line = f.Line }). Since items are processed in ascending line order (line 268), c.line is always the first (minimum) line added to the cluster.

With findings at lines 10, 13, 16 and lineTolerance = 3:

  1. Line 10 → new cluster C1, c.line = 10
  2. Line 13 → absInt(10 - 13) = 3 ≤ 3 → merges into C1, c.line stays 10
  3. Line 16 → absInt(10 - 16) = 6 > 3 → new cluster C2

So 16 is split into its own cluster despite |13−16| = 3 ≤ 3. The finding is confirmed against the actual code.


Verdict: Minor issues

  • cmd/gadfly/consensus.go:273 — Greedy clustering uses only the cluster's minimum line for distance checks, breaking transitive proximity. The algorithm checks absInt(c.line - it.f.Line) <= lineTolerance where c.line is the smallest line in the cluster. Since findings are processed in ascending line order, a new finding can be within tolerance of the cluster's maximum member but outside tolerance of its minimum, causing it to be incorrectly split into a separate cluster. Example: findings at lines 10, 13, 16 in the same file. 10 and 13 merge (|10−13|=3 ≤ 3). 16 checks against the cluster's representative line 10: |10−16|=6 > 3, so 16 becomes its own cluster — even though |13−16|=3 ≤ 3. This deflates cross-model agreement counts (a 3-model consensus shows as 2+1), potentially burying real findings in the folded "single-model" section. The fix is to track the cluster's max line and check it.f.Line - c.maxLine <= lineTolerance (since items are sorted, only the forward gap matters), or equivalently maintain a [min, max] span and check both ends.
🧹 Code cleanliness & maintainability — Minor issues

Both findings are confirmed against the actual code:

  1. cluster.detail: Written at lines 285 and 328, but never read anywhere — renderConsensus uses c.title, c.severity, c.file, c.line, c.models, c.lenses but never c.detail. The per-model detail is preserved via m.Markdown.

  2. Duplicated upsert_comment logic: entrypoint.sh:72-90 (upsert_comment_body) and scripts/run.sh:82-102 (upsert_comment) share the same paginate-find-by-marker-PATCH/POST pattern, differing only in the marker parameterization and curl wrapper. A DRY violation.


Verdict: Minor issues

  • cmd/gadfly/consensus.go:164 — Dead field detail on cluster struct. The detail field is written in two places (line 285 during initial construction, line 328 in mergeIntoCluster) but is never read anywhere in the codebase — not in renderConsensus, not in tests, not in any other file. The per-model detail is preserved via m.Markdown in the folded "Per-model detail" section, so cluster.detail serves no purpose. It's dead code that misleads readers into thinking it's used for rendering. Remove the field and its two write sites.

  • entrypoint.sh:72–90 — Duplicated upsert_comment logic with scripts/run.sh:82–102. The new upsert_comment_body function in entrypoint.sh is a near-identical copy of run.sh's upsert_comment, differing only in that the marker is a parameter instead of a hardcoded $MARKER. This is a copy-paste DRY violation introduced by this PR: any future fix to pagination, error handling, or the API call shape must be applied in two places. The two scripts already share SCRIPTS_DIR; extracting the shared function into a sourced helper (e.g., scripts/comment-utils.sh) would eliminate the duplication with low churn.

Performance — No material issues found

Verdict: No material issues found

The consolidation path operates on trivially small datasets — single-digit models, tens of findings per model — and every operation is either O(n) or O(n²) on n < 100. All I/O is one-shot (read N small JSON files, write one comment). The strings.Builder usage in renderConsensus is appropriate. The shell fallback path spawns a few jq processes but only on the cold error-recovery branch. Nothing here approaches a hot path or unbounded growth worth flagging.

🧯 Error handling & edge cases — Minor issues

Verdict: Minor issues

  • cmd/gadfly/consensus.go:273 — All line=0 findings in the same file merge into one cluster. In clusterFindings, findings with Line == 0 (from structured blocks missing or zeroing the line field) all satisfy absInt(0-0) <= lineTolerance, so every unlocated finding in a given file collapses into a single cluster regardless of whether they describe completely different issues. The severity escalates to the max among them and the title comes from whichever had the highest severity — distinct problems get silently fused. The heuristic scrape path (parseFindings) always produces a concrete line, so this only triggers on malformed structured blocks, but the code should handle it correctly (e.g. skip line=0 findings from proximity clustering, or give each its own cluster).

  • entrypoint.sh:328 — Fallback per-model marker hardcodes ollama provider. In the consolidation-failure fallback path, the marker is constructed as <!-- gadfly-review:ollama:${m} -->. This matches the current review_one which always passes PROVIDER=ollama to run.sh, so it is correct today. But the provider field in the JSON artifact comes from modelProvider() (the Go binary's resolved backend), and if review_one ever passes a different PROVIDER, the fallback markers would silently diverge from what run.sh would have used, causing duplicate comments on re-runs instead of in-place updates. The prov variable already extracted from JSON on line 326 should be used in the marker instead of the literal ollama.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 12m 44s

<!-- gadfly-review:ollama:deepseek-v4-pro:cloud --> ### 🪰 Gadfly review — `deepseek-v4-pro:cloud` (ollama-cloud) **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> No material issues found. I traced the full data flow from untrusted inputs (model output, environment variables, JSON files on disk) through to where they're consumed (shell command construction, file writes, PR comment rendering). Here's what I checked and why nothing rises to a security finding: - **Path traversal in `GADFLY_FINDINGS_OUT`**: The filename is built by `findings_file_for()` in `entrypoint.sh:238`, which sanitizes the model name with `tr -c '[:alnum:]._-' '_'` — slashes and dots are replaced, so `../../../etc/passwd` becomes `___________etc_passwd`. The directory prefix (`${WORKDIR}/findings`) is hardcoded. Safe. - **Command injection in shell scripts**: The only dynamic value passed to a command is `$FINDINGS_DIR` in `entrypoint.sh:314`, which is `${WORKDIR}/findings` — a hardcoded path, not attacker-controlled. The `jq -n --arg b "$body"` pattern in `upsert_comment_body` (`entrypoint.sh:74`) uses `--arg` which binds the value as a JSON string with proper escaping. No shell metacharacter injection. - **JSON deserialization of model artifacts**: `runConsolidate()` (`consensus.go:128-145`) reads `.json` files from a directory only writable by the container's own processes. Go's `json.Unmarshal` into a typed struct is safe — no code execution. The `Markdown` field (raw LLM output) is embedded in the comment, but this is the same risk profile as the existing per-model comments. - **Secret leakage**: The consensus comment embeds the same per-model markdown that would have been posted individually. No new secret exposure path. The `GITEA_TOKEN` is never placed into the prompt, diff, or any model-visible context. - **SSRF**: No user-controlled input is used to construct outbound URLs. `GITEA_API` comes from the workflow environment, controlled by the repo owner. - **Insecure defaults**: `GADFLY_CONSOLIDATE` defaults to `auto`, which enables consolidation only for ≥2 models. The default model list is a single model, so consolidation is off by default. Safe. - **`mdCell` escaping** (`consensus.go:390-394`): Only escapes pipes and newlines, not backticks. A model-generated title containing a backtick could break the markdown table cell's inline-code formatting. This is a rendering robustness concern, not a security vulnerability — the content is model-generated, not attacker-controlled, and Gitea's markdown renderer is the defense boundary. </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> I have read the actual code at `cmd/gadfly/consensus.go:266-293` and `mergeIntoCluster` at lines 310-331. Let me trace the algorithm precisely with the draft's example. The code at line 273 checks: ```go if c.file == file && absInt(c.line-it.f.Line) <= lineTolerance { ``` `c.line` is maintained as the *smallest* line in the cluster (line 319-320 in `mergeIntoCluster`: `if f.Line > 0 && (c.line == 0 || f.Line < c.line) { c.line = f.Line }`). Since items are processed in ascending line order (line 268), `c.line` is always the first (minimum) line added to the cluster. With findings at lines 10, 13, 16 and `lineTolerance = 3`: 1. Line 10 → new cluster C1, `c.line = 10` 2. Line 13 → `absInt(10 - 13) = 3 ≤ 3` → merges into C1, `c.line` stays 10 3. Line 16 → `absInt(10 - 16) = 6 > 3` → new cluster C2 So 16 is split into its own cluster despite `|13−16| = 3 ≤ 3`. The finding is **confirmed** against the actual code. --- **Verdict: Minor issues** - **`cmd/gadfly/consensus.go:273` — Greedy clustering uses only the cluster's minimum line for distance checks, breaking transitive proximity.** The algorithm checks `absInt(c.line - it.f.Line) <= lineTolerance` where `c.line` is the *smallest* line in the cluster. Since findings are processed in ascending line order, a new finding can be within tolerance of the cluster's *maximum* member but outside tolerance of its minimum, causing it to be incorrectly split into a separate cluster. Example: findings at lines 10, 13, 16 in the same file. 10 and 13 merge (|10−13|=3 ≤ 3). 16 checks against the cluster's representative line 10: |10−16|=6 > 3, so 16 becomes its own cluster — even though |13−16|=3 ≤ 3. This deflates cross-model agreement counts (a 3-model consensus shows as 2+1), potentially burying real findings in the folded "single-model" section. The fix is to track the cluster's max line and check `it.f.Line - c.maxLine <= lineTolerance` (since items are sorted, only the forward gap matters), or equivalently maintain a `[min, max]` span and check both ends. </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> Both findings are confirmed against the actual code: 1. **`cluster.detail`**: Written at lines 285 and 328, but never read anywhere — `renderConsensus` uses `c.title`, `c.severity`, `c.file`, `c.line`, `c.models`, `c.lenses` but never `c.detail`. The per-model detail is preserved via `m.Markdown`. 2. **Duplicated `upsert_comment` logic**: `entrypoint.sh:72-90` (`upsert_comment_body`) and `scripts/run.sh:82-102` (`upsert_comment`) share the same paginate-find-by-marker-PATCH/POST pattern, differing only in the marker parameterization and curl wrapper. A DRY violation. --- **Verdict: Minor issues** - **`cmd/gadfly/consensus.go:164` — Dead field `detail` on `cluster` struct.** The `detail` field is written in two places (line 285 during initial construction, line 328 in `mergeIntoCluster`) but is never read anywhere in the codebase — not in `renderConsensus`, not in tests, not in any other file. The per-model detail is preserved via `m.Markdown` in the folded "Per-model detail" section, so `cluster.detail` serves no purpose. It's dead code that misleads readers into thinking it's used for rendering. Remove the field and its two write sites. - **`entrypoint.sh:72–90` — Duplicated `upsert_comment` logic with `scripts/run.sh:82–102`.** The new `upsert_comment_body` function in `entrypoint.sh` is a near-identical copy of `run.sh`'s `upsert_comment`, differing only in that the marker is a parameter instead of a hardcoded `$MARKER`. This is a copy-paste DRY violation introduced by this PR: any future fix to pagination, error handling, or the API call shape must be applied in two places. The two scripts already share `SCRIPTS_DIR`; extracting the shared function into a sourced helper (e.g., `scripts/comment-utils.sh`) would eliminate the duplication with low churn. </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> **Verdict: No material issues found** The consolidation path operates on trivially small datasets — single-digit models, tens of findings per model — and every operation is either O(n) or O(n²) on n < 100. All I/O is one-shot (read N small JSON files, write one comment). The `strings.Builder` usage in `renderConsensus` is appropriate. The shell fallback path spawns a few `jq` processes but only on the cold error-recovery branch. Nothing here approaches a hot path or unbounded growth worth flagging. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> ## Verdict: Minor issues - **`cmd/gadfly/consensus.go:273` — All line=0 findings in the same file merge into one cluster.** In `clusterFindings`, findings with `Line == 0` (from structured blocks missing or zeroing the line field) all satisfy `absInt(0-0) <= lineTolerance`, so every unlocated finding in a given file collapses into a single cluster regardless of whether they describe completely different issues. The severity escalates to the max among them and the title comes from whichever had the highest severity — distinct problems get silently fused. The heuristic scrape path (`parseFindings`) always produces a concrete line, so this only triggers on malformed structured blocks, but the code should handle it correctly (e.g. skip line=0 findings from proximity clustering, or give each its own cluster). - **`entrypoint.sh:328` — Fallback per-model marker hardcodes `ollama` provider.** In the consolidation-failure fallback path, the marker is constructed as `<!-- gadfly-review:ollama:${m} -->`. This matches the current `review_one` which always passes `PROVIDER=ollama` to `run.sh`, so it is correct *today*. But the `provider` field in the JSON artifact comes from `modelProvider()` (the Go binary's resolved backend), and if `review_one` ever passes a different `PROVIDER`, the fallback markers would silently diverge from what `run.sh` would have used, causing duplicate comments on re-runs instead of in-place updates. The `prov` variable already extracted from JSON on line 326 should be used in the marker instead of the literal `ollama`. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 12m 44s</sub>

🪰 Gadfly review — devstral-2 (ollama-cloud)

Review incomplete — all lenses errored — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security⚠️ could not complete

⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found

🎯 Correctness⚠️ could not complete

⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found

🧹 Code cleanliness & maintainability⚠️ could not complete

⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found

Performance⚠️ could not complete

⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found

🧯 Error handling & edge cases⚠️ could not complete

⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2s

<!-- gadfly-review:ollama:devstral-2 --> ### 🪰 Gadfly review — `devstral-2` (ollama-cloud) **Review incomplete — all lenses errored** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found </details> <details><summary><b>🎯 Correctness</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found </details> <details><summary><b>⚡ Performance</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found </details> <details><summary><b>🧯 Error handling & edge cases</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed ollama-cloud/devstral-2: ollama-cloud/devstral-2: HTTP 404: model 'devstral-2' not found </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2s</sub>

🪰 Gadfly review — glm-5.2:cloud (ollama-cloud)

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

VERDICT: No material issues found

Through the security lens I traced the untrusted-data flows this PR introduces and checked each against the actual code:

  • Findings artifact dir / filenames (entrypoint.sh:231,238): FINDINGS_DIR is a fixed subpath of WORKDIR (CI-controlled, not user input), created fresh with rm -rf/mkdir -p. Per-model filenames are sanitized via tr -c '[:alnum:]._-' '_', so model names with //.. can't traverse. Verified at entrypoint.sh:238. Safe.
  • Consolidation reads (consensus.go:104-130): runConsolidate only reads *.json from the consolidate dir and json.Unmarshals into a fixed struct; untrusted fields are only ever re-emitted as markdown text, never executed/evaluated. Safe.
  • Comment upsert by marker (entrypoint.sh:69-90): the new upsert_comment_body mirrors the existing run.sh:82 upsert (same startswith($m) matching, same token-scoped PATCH/POST). The consensus marker is a fixed string. No new authz surface beyond what already exists. Verified by comparing both functions.
  • Untrusted model output into a PR comment (consensus.go:239-241, mdCell): per-model Markdown/Title/Provider/Verdict are rendered into the comment. mdCell escapes pipes/newlines for table cells; the folded per-model body is inserted raw — but that is identical to the pre-existing behavior where run.sh posts REVIEW verbatim, so this is not a regression introduced here.
  • GADFLY_FINDINGS_OUT / GADFLY_CONSOLIDATE_DIR (consensus.go:62,83): only used to locate a local file/dir within the CI workspace; not reflected into shell/URLs. No injection vector.

No authn/authz gaps, injection, SSRF, secret leakage, or unsafe-deserialization issues attributable to this change.

🎯 Correctness — No material issues found

The key finding in the draft is a false positive. In run.sh, the marker uses ${PROVIDER}, which entrypoint.sh always sets to ollama (line 207). The fallback marker in entrypoint.sh:328 correctly mirrors this. The prov variable is the display provider (matching MODEL_PROVIDER), used in the body header but NOT in the marker in either path.

All other draft bullets are "verified correct" / non-issue observations, not actual findings.

Verdict: No material issues found

🧹 Code cleanliness & maintainability — Minor issues

All four findings confirmed against the actual source code.

Verdict: Minor issues found

  • cmd/gadfly/consensus.go:250-262 — duplicated anonymous struct definition. The byFile map value type (struct { f outFinding; model string }) is spelled out fully twice — once in the map[string][]struct{...}{} declaration (line 250) and again in the append literal (line 259). This is a noisy, readability-hostile pattern that Go's style discourages; a one-line named type (type findingRef struct { f outFinding; model string }) at file scope would collapse both spellings and read far cleaner.

  • entrypoint.sh:72-90 vs scripts/run.sh:82-102 — copy-pasted upsert_comment / upsert_comment_body. The new upsert_comment_body is a near-verbatim clone of run.sh's upsert_comment, differing only in taking marker as a parameter (and dropping the $API_TIMEOUT indirection in favor of literal timeouts). The entrypoint comment even says "Mirrors run.sh's per-model upsert." Two copies of the same pagination/upsert loop now live in separate scripts with no shared source, so a fix to one (e.g. the pagination cap, error handling) won't propagate. A shared sourced helper would be the lower-churn long-term fix; at minimum a comment cross-referencing the two would flag the coupling.

  • cmd/gadfly/consensus.go:164cluster.detail is written but never read. mergeIntoCluster maintains c.detail (the highest-severity report's detail, line 328), and clusterFindings seeds it, but renderConsensus never emits detail anywhere — only title, file, line, models, lenses. Grep confirms the only references are the field declaration and assignment sites. The field is dead storage carried through every cluster, adding maintenance surface (a reader assumes it surfaces somewhere). Either render it (e.g. in a folded <details> per headline row) or drop it.

  • cmd/gadfly/consensus.go:200-205agreed tally re-iterates clusters when it could be folded into the partition above. clusterFindings is called once (line 178), the headline/folded partition loop (lines 189-196) already walks every cluster, then agreed is tallied by re-iterating clusters (lines 200-205). Functionally fine, but agreed could be counted in the same pass as the partition instead of a second full loop. Minor readability/efficiency nit, not material.

Performance — No material issues found

Verdict: No material issues found

Reviewed the new consolidation path through the performance lens:

  • clusterFindings (consensus.go:248) is O(F²) in total findings (each item scans all existing clusters for a line-proximity match). F here is findings × models for a single PR review — tens to low hundreds at most — so the quadratic scan is immaterial. Verified the inner loop at consensus.go:269-292.
  • renderConsensus (consensus.go:176) builds the whole comment in one strings.Builder with a single pass over clusters and models — no repeated allocation, no quadratic string concatenation. Fine.
  • runConsolidate (consensus.go:96) reads every .json in the dir once (os.ReadDir + one ReadFile each), bounded by the number of models. No unbounded growth.
  • upsert_comment_body (entrypoint.sh:72) paginates up to 10 pages × 50 comments — bounded, and mirrors the pre-existing upsert_comment in run.sh:82. The fallback path spawns 3 jq subprocesses per model file (entrypoint.sh:324-327), but that path only runs on consolidation failure with a handful of files, so not material.
  • writeFindingsOut (consensus.go:80) does one json.Marshal + one os.WriteFile per model run — trivial.

No N+1, no unbounded growth, no hot-loop regressions. The work is all on bounded, small inputs (one PR's findings × a handful of models).

🧯 Error handling & edge cases — Minor issues

Verdict: Minor issues found

  • Greedy clustering can under-count agreement due to representative-line drift (cmd/gadfly/consensus.go:271-277). clusterFindings sorts each file's items ascending by line, then greedily merges each item into the first cluster whose representative line (c.line) is within lineTolerance (3). mergeIntoCluster keeps c.line at the minimum line (line 319: f.Line < c.line). So when findings span a window wider than the tolerance but chained within it, a later finding is compared only to the smallest line, not the most recently merged member. Example: lines 1, 4, 7 — 1 and 4 merge (|1−4|=3 ≤ 3, rep stays 1); then 7 is compared to rep 1 (|1−7|=6 > 3) and spawns a new cluster, even though it's only 3 lines from line 4 which is already in the first cluster. A single real issue cited across a ~7-line span gets split into two clusters, undercounting cross-model agreement and potentially demoting a headline finding to the folded section. Suggested fix: compare against the cluster's largest line (e.g. track maxLine per cluster and test it.f.Line - c.maxLine <= lineTolerance), or compare against the range. Verified against clusterFindings (266–293) and mergeIntoCluster (312–331).

  • writeFindingsOut silently no-ops if the output directory is missing (cmd/gadfly/consensus.go:109-111). os.WriteFile fails if the parent directory doesn't exist; the error is logged to stderr and swallowed (best-effort by design). In the normal entrypoint flow this is fine — FINDINGS_DIR is mkdir -p'd before the swarm runs (entrypoint.sh:240) — but if GADFLY_FINDINGS_OUT is ever set to a path in a non-existent directory (e.g. misconfigured standalone run), that model silently contributes nothing to consolidation, and the only signal is a stderr line that may not be surfaced. Acceptable for best-effort, but the failure mode is "model vanishes from consensus" with no structured fallback. Verified against the function (80–112) and the entrypoint mkdir (240).

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 37s

<!-- gadfly-review:ollama:glm-5.2:cloud --> ### 🪰 Gadfly review — `glm-5.2:cloud` (ollama-cloud) **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> **VERDICT: No material issues found** Through the security lens I traced the untrusted-data flows this PR introduces and checked each against the actual code: - **Findings artifact dir / filenames** (`entrypoint.sh:231,238`): `FINDINGS_DIR` is a fixed subpath of `WORKDIR` (CI-controlled, not user input), created fresh with `rm -rf`/`mkdir -p`. Per-model filenames are sanitized via `tr -c '[:alnum:]._-' '_'`, so model names with `/`/`..` can't traverse. Verified at `entrypoint.sh:238`. Safe. - **Consolidation reads** (`consensus.go:104-130`): `runConsolidate` only reads `*.json` from the consolidate dir and `json.Unmarshal`s into a fixed struct; untrusted fields are only ever re-emitted as markdown text, never executed/evaluated. Safe. - **Comment upsert by marker** (`entrypoint.sh:69-90`): the new `upsert_comment_body` mirrors the existing `run.sh:82` upsert (same `startswith($m)` matching, same token-scoped PATCH/POST). The consensus marker is a fixed string. No new authz surface beyond what already exists. Verified by comparing both functions. - **Untrusted model output into a PR comment** (`consensus.go:239-241`, `mdCell`): per-model `Markdown`/`Title`/`Provider`/`Verdict` are rendered into the comment. `mdCell` escapes pipes/newlines for table cells; the folded per-model body is inserted raw — but that is identical to the pre-existing behavior where `run.sh` posts `REVIEW` verbatim, so this is not a regression introduced here. - **`GADFLY_FINDINGS_OUT` / `GADFLY_CONSOLIDATE_DIR`** (`consensus.go:62,83`): only used to locate a local file/dir within the CI workspace; not reflected into shell/URLs. No injection vector. No authn/authz gaps, injection, SSRF, secret leakage, or unsafe-deserialization issues attributable to this change. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> The key finding in the draft is a false positive. In run.sh, the marker uses `${PROVIDER}`, which entrypoint.sh always sets to `ollama` (line 207). The fallback marker in entrypoint.sh:328 correctly mirrors this. The `prov` variable is the display provider (matching `MODEL_PROVIDER`), used in the body header but NOT in the marker in either path. All other draft bullets are "verified correct" / non-issue observations, not actual findings. **Verdict: No material issues found** </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> All four findings confirmed against the actual source code. **Verdict: Minor issues found** - **`cmd/gadfly/consensus.go:250-262` — duplicated anonymous struct definition.** The `byFile` map value type (`struct { f outFinding; model string }`) is spelled out fully twice — once in the `map[string][]struct{...}{}` declaration (line 250) and again in the append literal (line 259). This is a noisy, readability-hostile pattern that Go's style discourages; a one-line named type (`type findingRef struct { f outFinding; model string }`) at file scope would collapse both spellings and read far cleaner. - **`entrypoint.sh:72-90` vs `scripts/run.sh:82-102` — copy-pasted `upsert_comment` / `upsert_comment_body`.** The new `upsert_comment_body` is a near-verbatim clone of run.sh's `upsert_comment`, differing only in taking `marker` as a parameter (and dropping the `$API_TIMEOUT` indirection in favor of literal timeouts). The entrypoint comment even says "Mirrors run.sh's per-model upsert." Two copies of the same pagination/upsert loop now live in separate scripts with no shared source, so a fix to one (e.g. the pagination cap, error handling) won't propagate. A shared sourced helper would be the lower-churn long-term fix; at minimum a comment cross-referencing the two would flag the coupling. - **`cmd/gadfly/consensus.go:164` — `cluster.detail` is written but never read.** `mergeIntoCluster` maintains `c.detail` (the highest-severity report's detail, line 328), and `clusterFindings` seeds it, but `renderConsensus` never emits `detail` anywhere — only `title`, `file`, `line`, `models`, `lenses`. Grep confirms the only references are the field declaration and assignment sites. The field is dead storage carried through every cluster, adding maintenance surface (a reader assumes it surfaces somewhere). Either render it (e.g. in a folded `<details>` per headline row) or drop it. - **`cmd/gadfly/consensus.go:200-205` — `agreed` tally re-iterates `clusters` when it could be folded into the partition above.** `clusterFindings` is called once (line 178), the headline/folded partition loop (lines 189-196) already walks every cluster, then `agreed` is tallied by re-iterating `clusters` (lines 200-205). Functionally fine, but `agreed` could be counted in the same pass as the partition instead of a second full loop. Minor readability/efficiency nit, not material. </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> **Verdict: No material issues found** Reviewed the new consolidation path through the performance lens: - **`clusterFindings` (`consensus.go:248`)** is O(F²) in total findings (each item scans all existing clusters for a line-proximity match). F here is `findings × models` for a single PR review — tens to low hundreds at most — so the quadratic scan is immaterial. Verified the inner loop at `consensus.go:269-292`. - **`renderConsensus` (`consensus.go:176`)** builds the whole comment in one `strings.Builder` with a single pass over clusters and models — no repeated allocation, no quadratic string concatenation. Fine. - **`runConsolidate` (`consensus.go:96`)** reads every `.json` in the dir once (`os.ReadDir` + one `ReadFile` each), bounded by the number of models. No unbounded growth. - **`upsert_comment_body` (`entrypoint.sh:72`)** paginates up to 10 pages × 50 comments — bounded, and mirrors the pre-existing `upsert_comment` in `run.sh:82`. The fallback path spawns 3 `jq` subprocesses per model file (`entrypoint.sh:324-327`), but that path only runs on consolidation failure with a handful of files, so not material. - **`writeFindingsOut` (`consensus.go:80`)** does one `json.Marshal` + one `os.WriteFile` per model run — trivial. No N+1, no unbounded growth, no hot-loop regressions. The work is all on bounded, small inputs (one PR's findings × a handful of models). </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> **Verdict: Minor issues found** - **Greedy clustering can under-count agreement due to representative-line drift** (`cmd/gadfly/consensus.go:271-277`). `clusterFindings` sorts each file's items ascending by line, then greedily merges each item into the first cluster whose representative line (`c.line`) is within `lineTolerance` (3). `mergeIntoCluster` keeps `c.line` at the *minimum* line (line 319: `f.Line < c.line`). So when findings span a window wider than the tolerance but chained within it, a later finding is compared only to the smallest line, not the most recently merged member. Example: lines 1, 4, 7 — 1 and 4 merge (|1−4|=3 ≤ 3, rep stays 1); then 7 is compared to rep 1 (|1−7|=6 > 3) and spawns a *new* cluster, even though it's only 3 lines from line 4 which is already in the first cluster. A single real issue cited across a ~7-line span gets split into two clusters, undercounting cross-model agreement and potentially demoting a headline finding to the folded section. Suggested fix: compare against the cluster's largest line (e.g. track `maxLine` per cluster and test `it.f.Line - c.maxLine <= lineTolerance`), or compare against the range. Verified against `clusterFindings` (266–293) and `mergeIntoCluster` (312–331). - **`writeFindingsOut` silently no-ops if the output directory is missing** (`cmd/gadfly/consensus.go:109-111`). `os.WriteFile` fails if the parent directory doesn't exist; the error is logged to stderr and swallowed (best-effort by design). In the normal entrypoint flow this is fine — `FINDINGS_DIR` is `mkdir -p`'d before the swarm runs (`entrypoint.sh:240`) — but if `GADFLY_FINDINGS_OUT` is ever set to a path in a non-existent directory (e.g. misconfigured standalone run), that model silently contributes nothing to consolidation, and the only signal is a stderr line that may not be surfaced. Acceptable for best-effort, but the failure mode is "model vanishes from consensus" with no structured fallback. Verified against the function (80–112) and the entrypoint mkdir (240). </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 37s</sub>

🪰 Gadfly review — kimi-k2.6:cloud (ollama-cloud)

Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

Verdict: No material issues found

Reviewed the cross-model consensus consolidation changes through the 🔒 Security lens. The new code paths (consensus.go, entrypoint.sh, run.sh) handle per-model JSON artifacts in a shared temp directory and render a single aggregated markdown comment.

Key areas examined:

  • File I/O paths: findings_file_for() in entrypoint.sh sanitizes model names with tr -c '[:alnum:]._-' '_', preventing directory traversal via model names. Go's os.ReadDir guarantees e.Name() is a basename, so filepath.Join(dir, e.Name()) is safe.
  • JSON parsing: Standard Go json.Unmarshal into a typed struct; no arbitrary code execution risk.
  • Markdown construction: mdCell() escapes pipes and newlines for table cells. Model-generated content (title, detail, markdown) is inserted into the consensus comment, but these are the same model outputs that were already posted directly in per-model comments. No new trust boundary is introduced.
  • Environment variables: GADFLY_FINDINGS_OUT and GADFLY_CONSOLIDATE_DIR are set by the trusted entrypoint.sh orchestrator; model subprocesses cannot influence them.
  • No secrets in artifacts: The modelFindings struct contains only review metadata and model output — no API keys or credentials.

The aggregation logic does not create novel injection or escalation vectors. All model output remains within the existing trust model (review content generated by the model swarm itself).

🎯 Correctness — Blocking issues found

Blocking issues found

  • cmd/gadfly/consensus.go:88 and cmd/gadfly/consolidate.go:82-93: When all lenses for a model error (timeout, panic, etc.), worstVerdict returns verdictClean because verdictUnknown (0) < verdictClean (1). The per-model artifact therefore stores Verdict: "No material issues found" even though renderConsolidated correctly renders "Review incomplete — all lenses errored" in the Markdown body. The consensus pass (renderConsensus) reads only the JSON Verdict field and parses it back to verdictClean. If all models fail, the final comment reads "No material issues found by consensus" with zero findings — a completely misleading result that masks total system failure. Verified by reading worstVerdict (consolidate.go), writeFindingsOut (consensus.go:88), and renderConsensus (consensus.go:180-207).

    Suggested fix: compute the model-level verdict in writeFindingsOut using the same logic as renderConsolidated (checking if errored == len(results)), or add an Errored/Incomplete bool to modelFindings and propagate it to the consensus renderer so the headline can say "Review incomplete" instead of "No material issues found".

🧹 Code cleanliness & maintainability — Minor issues

Verdict: Minor issues

  • cmd/gadfly/consensus_test.go:134captureStdout leaks os.Stdout and pipe on test failure.
    The helper redirects os.Stdout to a pipe, but if fn() calls t.Fatalf (or panics), the code after fn() never runs: os.Stdout stays redirected and w.Close() is never called, so the reader goroutine may hang. Verified by reading the test file.
    Fix: Use defer to restore os.Stdout and close the writer immediately after redirection:

    os.Stdout = w
    defer func() { w.Close(); os.Stdout = orig }()
    
  • cmd/gadfly/consensus.go:381oneModel returns a random model.
    Go randomizes map iteration order, so the model name displayed for a single-model folded finding is non-deterministic across runs. Re-running the same swarm can produce a different consensus comment. Verified by reading consensus.go.
    Fix: Sort the map keys and return the first:

    func oneModel(models map[string]bool) string {
        out := make([]string, 0, len(models))
        for m := range models { out = append(out, m) }
        sort.Strings(out)
        if len(out) > 0 { return out[0] }
        return ""
    }
    
  • entrypoint.sh:72-90 / scripts/run.sh:82-101 — Comment upsert logic is copy-pasted.
    upsert_comment_body in entrypoint.sh duplicates the pagination-and-upsert logic already in run.sh's upsert_comment (the PR even notes this: "Mirrors run.sh's per-model upsert"). Any future fix to pagination limits, auth handling, or marker matching must be kept in sync manually. Verified by reading both files.
    Fix: Extract a shared lib/upsert.sh sourced by both scripts, or add a prominent comment in both places warning that they must stay in sync.

Performance — Blocking issues found

Blocking issues found

  • cmd/gadfly/consensus.go:271Quadratic clustering scans clusters from unrelated files.
    In clusterFindings, the clusters slice is global across all files, yet the merge test c.file == file means only clusters for the current file ever match. Every finding from file N still scans every cluster created for files 1…N−1. With 7 models × 5 lenses × up to 20 findings each = 700 findings, the inner loop can execute ~490 k times instead of ~10 k if scoped per-file. The fix is to declare clusters inside the for file, items := range byFile loop so each file clusters independently.

  • cmd/gadfly/consensus.go:200-204Redundant pass to count multi-model agreement.
    renderConsensus loops over clusters a second time to count agreed (line 200) after the first loop (line 190) already partitions every cluster into headline/folded. Count agreed in the first loop to avoid the extra O(n) traversal.

  • cmd/gadfly/consensus.go:295-301Repeated string allocations in sort comparator.
    sort.SliceStable calls the closure many times; each call invokes sevRank, which allocates via strings.ToLower(strings.TrimSpace(s)). Pre-computing a (agreement, sevRank, file, line) sort key per cluster before sorting eliminates the repeated allocation and makes the comparator a pure integer comparison.

🧯 Error handling & edge cases — No material issues found

No material issues found

Wait, let me re-evaluate. I found several real issues. Let me reconsider the severity:

  1. captureStdout panic/deadlock - This is a test helper bug. It's a real unhandled edge case.
  2. mergeIntoCluster negative line sticky - Real edge case with malformed input.
  3. mdCell \r handling - Minor edge case.
  4. Entrypoint fallback marker hardcodes provider - Edge case in fallback.
  5. run.sh output loss on standalone consolidation - Edge case in standalone usage.

These are all error-handling/edge-case issues. The question is whether any are "material" enough to rate "Minor issues" or "Blocking issues found."

I'll rate it "Minor issues" because:

  • The captureStdout panic/deadlock is a real test bug that could hang CI.
  • The mergeIntoCluster negative line issue is a real edge case that breaks clustering for malformed input.

The others are smaller nits. Let me provide a concise review.

Actually, re-reading my instructions: "Only report issues you are reasonably confident are real after checking." I'm confident about all of these.

Let me also consider: is the captureStdout issue material enough? It's in test code. If a test panics, the test framework already handles panics. But the additional deadlock from the helper makes it worse. The panic would propagate, but the goroutine leak and stdout redirection could affect subsequent tests or the test runner. In go test, each test runs in the same process, so leaking stdout redirection could affect other tests. This is material.

For mergeIntoCluster, if a model outputs "line": -1 in its structured findings, the cluster gets stuck. Is this likely? No, but it's an unhandled edge case in the data pipeline.

I'll report both as medium severity.

Let me also report the mdCell \r issue as small.

The entrypoint marker issue and run.sh output loss are more like usage edge cases. I'll report the marker issue since it's in the new fallback code.

Actually, let me reconsider the entrypoint marker issue. Looking at the code:

marker="<!-- gadfly-review:ollama:${m} -->"

This is used when posting fallback per-model comments. The original run.sh uses:

MARKER="<!-- gadfly-review:${PROVIDER}:${MODEL} -->"

In consolidation mode, run.sh doesn't post comments. In the fallback, entrypoint.sh constructs the marker. If a model was run with provider "antigravity", the fallback marker would still say "ollama". This means:

  1. If there was a previous non-consolidated comment with the correct marker, the fallback won't update it (different marker).
  2. If two models have the same name but different providers, their fallback comments would have the same marker and collide.

This is a real bug in the fallback path. It's error-handling related because it's the error fallback path.

OK, I'll include it.

And for run.sh standalone output loss: when CONSOLIDATE=1 and GADFLY_FINDINGS_OUT is empty, the model output is lost. This is an unhandled edge case for standalone usage. I'll include it as a smaller finding.

Let me now write the review. Minor issues

  • cmd/gadfly/consensus_test.go:134captureStdout leaks os.Stdout and deadlocks on panic.
    The helper redirects os.Stdout to a pipe and spawns a goroutine that reads until EOF. If fn() panics, w.Close() is never called, the goroutine blocks forever on r.Read, and <-done deadlocks. os.Stdout also stays redirected, poisoning any subsequent test output in the same process.
    Fix: Add a defer before fn() that closes the writer and restores os.Stdout, so the cleanup runs during panic unwinding.

  • cmd/gadfly/consensus.go:319 — Negative line numbers become sticky cluster representatives.
    mergeIntoCluster only updates the representative line when c.line == 0 || f.Line < c.line. If the first finding in a cluster has a negative line (e.g. from malformed model JSON), c.line stays negative permanently, because a later positive line fails both c.line == 0 and f.Line < c.line. This causes location() to drop the line entirely and breaks distance checks for all subsequent findings in that cluster.
    Fix: Change the guard to c.line <= 0 || f.Line < c.line so any positive line displaces a non-positive representative.

  • entrypoint.sh:328 — Fallback per-model marker hardcodes provider as ollama.
    When consolidation fails and entrypoint.sh falls back to posting individual comments, it builds the upsert marker as <!-- gadfly-review:ollama:${m} -->, ignoring the actual provider stored in the artifact. If the same model name was used with a different provider (e.g. antigravity), fallback comments collide. If a prior non-consolidated comment exists with the correct provider marker, the fallback creates a duplicate instead of updating it.
    Fix: Use the prov variable already extracted from the JSON: marker="<!-- gadfly-review:${prov:-ollama}:${m} -->".

  • cmd/gadfly/consensus.go:390mdCell misses carriage returns.
    The function collapses \n to spaces but leaves \r intact. A model output that contains Windows-style \r\n line endings will inject raw \r characters into markdown table cells, which can break rendering.
    Fix: Also replace \r with a space before the \n replacement: strings.ReplaceAll(s, "\r", " ").

  • **`scripts/run

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 8m 40s

<!-- gadfly-review:ollama:kimi-k2.6:cloud --> ### 🪰 Gadfly review — `kimi-k2.6:cloud` (ollama-cloud) **Verdict: Blocking issues found** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> **Verdict: No material issues found** Reviewed the cross-model consensus consolidation changes through the 🔒 Security lens. The new code paths (`consensus.go`, `entrypoint.sh`, `run.sh`) handle per-model JSON artifacts in a shared temp directory and render a single aggregated markdown comment. Key areas examined: - **File I/O paths**: `findings_file_for()` in `entrypoint.sh` sanitizes model names with `tr -c '[:alnum:]._-' '_'`, preventing directory traversal via model names. Go's `os.ReadDir` guarantees `e.Name()` is a basename, so `filepath.Join(dir, e.Name())` is safe. - **JSON parsing**: Standard Go `json.Unmarshal` into a typed struct; no arbitrary code execution risk. - **Markdown construction**: `mdCell()` escapes pipes and newlines for table cells. Model-generated content (`title`, `detail`, `markdown`) is inserted into the consensus comment, but these are the same model outputs that were already posted directly in per-model comments. No new trust boundary is introduced. - **Environment variables**: `GADFLY_FINDINGS_OUT` and `GADFLY_CONSOLIDATE_DIR` are set by the trusted `entrypoint.sh` orchestrator; model subprocesses cannot influence them. - **No secrets in artifacts**: The `modelFindings` struct contains only review metadata and model output — no API keys or credentials. The aggregation logic does not create novel injection or escalation vectors. All model output remains within the existing trust model (review content generated by the model swarm itself). </details> <details><summary><b>🎯 Correctness</b> — Blocking issues found</summary> **Blocking issues found** - **`cmd/gadfly/consensus.go:88`** and **`cmd/gadfly/consolidate.go:82-93`**: When all lenses for a model error (timeout, panic, etc.), `worstVerdict` returns `verdictClean` because `verdictUnknown` (0) < `verdictClean` (1). The per-model artifact therefore stores `Verdict: "No material issues found"` even though `renderConsolidated` correctly renders "Review incomplete — all lenses errored" in the Markdown body. The consensus pass (`renderConsensus`) reads only the JSON `Verdict` field and parses it back to `verdictClean`. If **all** models fail, the final comment reads **"No material issues found by consensus"** with zero findings — a completely misleading result that masks total system failure. **Verified** by reading `worstVerdict` (consolidate.go), `writeFindingsOut` (consensus.go:88), and `renderConsensus` (consensus.go:180-207). **Suggested fix**: compute the model-level verdict in `writeFindingsOut` using the same logic as `renderConsolidated` (checking if `errored == len(results)`), or add an `Errored`/`Incomplete` bool to `modelFindings` and propagate it to the consensus renderer so the headline can say "Review incomplete" instead of "No material issues found". </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> **Verdict: Minor issues** - **`cmd/gadfly/consensus_test.go:134` — `captureStdout` leaks `os.Stdout` and pipe on test failure.** The helper redirects `os.Stdout` to a pipe, but if `fn()` calls `t.Fatalf` (or panics), the code after `fn()` never runs: `os.Stdout` stays redirected and `w.Close()` is never called, so the reader goroutine may hang. Verified by reading the test file. **Fix:** Use `defer` to restore `os.Stdout` and close the writer immediately after redirection: ```go os.Stdout = w defer func() { w.Close(); os.Stdout = orig }() ``` - **`cmd/gadfly/consensus.go:381` — `oneModel` returns a random model.** Go randomizes map iteration order, so the model name displayed for a single-model folded finding is non-deterministic across runs. Re-running the same swarm can produce a different consensus comment. Verified by reading `consensus.go`. **Fix:** Sort the map keys and return the first: ```go func oneModel(models map[string]bool) string { out := make([]string, 0, len(models)) for m := range models { out = append(out, m) } sort.Strings(out) if len(out) > 0 { return out[0] } return "" } ``` - **`entrypoint.sh:72-90` / `scripts/run.sh:82-101` — Comment upsert logic is copy-pasted.** `upsert_comment_body` in `entrypoint.sh` duplicates the pagination-and-upsert logic already in `run.sh`'s `upsert_comment` (the PR even notes this: "Mirrors run.sh's per-model upsert"). Any future fix to pagination limits, auth handling, or marker matching must be kept in sync manually. Verified by reading both files. **Fix:** Extract a shared `lib/upsert.sh` sourced by both scripts, or add a prominent comment in both places warning that they must stay in sync. </details> <details><summary><b>⚡ Performance</b> — Blocking issues found</summary> **Blocking issues found** - `cmd/gadfly/consensus.go:271` — **Quadratic clustering scans clusters from unrelated files.** In `clusterFindings`, the `clusters` slice is global across all files, yet the merge test `c.file == file` means only clusters for the current file ever match. Every finding from file N still scans every cluster created for files 1…N−1. With 7 models × 5 lenses × up to 20 findings each = 700 findings, the inner loop can execute ~490 k times instead of ~10 k if scoped per-file. The fix is to declare `clusters` inside the `for file, items := range byFile` loop so each file clusters independently. - `cmd/gadfly/consensus.go:200-204` — **Redundant pass to count multi-model agreement.** `renderConsensus` loops over `clusters` a second time to count `agreed` (line 200) after the first loop (line 190) already partitions every cluster into `headline`/`folded`. Count `agreed` in the first loop to avoid the extra O(n) traversal. - `cmd/gadfly/consensus.go:295-301` — **Repeated string allocations in sort comparator.** `sort.SliceStable` calls the closure many times; each call invokes `sevRank`, which allocates via `strings.ToLower(strings.TrimSpace(s))`. Pre-computing a `(agreement, sevRank, file, line)` sort key per cluster before sorting eliminates the repeated allocation and makes the comparator a pure integer comparison. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> No material issues found Wait, let me re-evaluate. I found several real issues. Let me reconsider the severity: 1. `captureStdout` panic/deadlock - This is a test helper bug. It's a real unhandled edge case. 2. `mergeIntoCluster` negative line sticky - Real edge case with malformed input. 3. `mdCell` `\r` handling - Minor edge case. 4. Entrypoint fallback marker hardcodes provider - Edge case in fallback. 5. `run.sh` output loss on standalone consolidation - Edge case in standalone usage. These are all error-handling/edge-case issues. The question is whether any are "material" enough to rate "Minor issues" or "Blocking issues found." I'll rate it "Minor issues" because: - The `captureStdout` panic/deadlock is a real test bug that could hang CI. - The `mergeIntoCluster` negative line issue is a real edge case that breaks clustering for malformed input. The others are smaller nits. Let me provide a concise review. Actually, re-reading my instructions: "Only report issues you are reasonably confident are real after checking." I'm confident about all of these. Let me also consider: is the `captureStdout` issue material enough? It's in test code. If a test panics, the test framework already handles panics. But the additional deadlock from the helper makes it worse. The panic would propagate, but the goroutine leak and stdout redirection could affect subsequent tests or the test runner. In `go test`, each test runs in the same process, so leaking stdout redirection could affect other tests. This is material. For `mergeIntoCluster`, if a model outputs `"line": -1` in its structured findings, the cluster gets stuck. Is this likely? No, but it's an unhandled edge case in the data pipeline. I'll report both as medium severity. Let me also report the `mdCell` `\r` issue as small. The entrypoint marker issue and run.sh output loss are more like usage edge cases. I'll report the marker issue since it's in the new fallback code. Actually, let me reconsider the entrypoint marker issue. Looking at the code: ```bash marker="<!-- gadfly-review:ollama:${m} -->" ``` This is used when posting fallback per-model comments. The original `run.sh` uses: ```bash MARKER="<!-- gadfly-review:${PROVIDER}:${MODEL} -->" ``` In consolidation mode, `run.sh` doesn't post comments. In the fallback, `entrypoint.sh` constructs the marker. If a model was run with provider "antigravity", the fallback marker would still say "ollama". This means: 1. If there was a previous non-consolidated comment with the correct marker, the fallback won't update it (different marker). 2. If two models have the same name but different providers, their fallback comments would have the same marker and collide. This is a real bug in the fallback path. It's error-handling related because it's the error fallback path. OK, I'll include it. And for `run.sh` standalone output loss: when `CONSOLIDATE=1` and `GADFLY_FINDINGS_OUT` is empty, the model output is lost. This is an unhandled edge case for standalone usage. I'll include it as a smaller finding. Let me now write the review. </think> Minor issues * **`cmd/gadfly/consensus_test.go:134` — `captureStdout` leaks `os.Stdout` and deadlocks on panic.** The helper redirects `os.Stdout` to a pipe and spawns a goroutine that reads until EOF. If `fn()` panics, `w.Close()` is never called, the goroutine blocks forever on `r.Read`, and `<-done` deadlocks. `os.Stdout` also stays redirected, poisoning any subsequent test output in the same process. **Fix:** Add a `defer` before `fn()` that closes the writer and restores `os.Stdout`, so the cleanup runs during panic unwinding. * **`cmd/gadfly/consensus.go:319` — Negative line numbers become sticky cluster representatives.** `mergeIntoCluster` only updates the representative line when `c.line == 0 || f.Line < c.line`. If the first finding in a cluster has a negative line (e.g. from malformed model JSON), `c.line` stays negative permanently, because a later positive line fails both `c.line == 0` and `f.Line < c.line`. This causes `location()` to drop the line entirely and breaks distance checks for all subsequent findings in that cluster. **Fix:** Change the guard to `c.line <= 0 || f.Line < c.line` so any positive line displaces a non-positive representative. * **`entrypoint.sh:328` — Fallback per-model marker hardcodes provider as `ollama`.** When consolidation fails and `entrypoint.sh` falls back to posting individual comments, it builds the upsert marker as `<!-- gadfly-review:ollama:${m} -->`, ignoring the actual provider stored in the artifact. If the same model name was used with a different provider (e.g. antigravity), fallback comments collide. If a prior non-consolidated comment exists with the correct provider marker, the fallback creates a duplicate instead of updating it. **Fix:** Use the `prov` variable already extracted from the JSON: `marker="<!-- gadfly-review:${prov:-ollama}:${m} -->"`. * **`cmd/gadfly/consensus.go:390` — `mdCell` misses carriage returns.** The function collapses `\n` to spaces but leaves `\r` intact. A model output that contains Windows-style `\r\n` line endings will inject raw `\r` characters into markdown table cells, which can break rendering. **Fix:** Also replace `\r` with a space before the `\n` replacement: `strings.ReplaceAll(s, "\r", " ")`. * **`scripts/run </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 8m 40s</sub>

🪰 Gadfly review — qwen3.5:397b-cloud (ollama-cloud)

Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — Blocking issues found

Blocking issues found

  • cmd/gadfly/consensus.go:239-240XSS via unsanitized model output in consensus comment. The m.Markdown field (raw model output, stored in the per-model JSON artifact) is embedded directly into an HTML <details> element without any HTML/Markdown sanitization. A model whose output is manipulated via prompt injection to include malicious HTML/JavaScript could execute arbitrary scripts in the context of any user viewing the PR. The mdCell helper (lines 390-394) only escapes pipes and newlines — it does not escape <, >, &, quotes, or backticks. Fix: Use html/template or a sanitization library (e.g., bluemonday) to sanitize all untrusted content before embedding in HTML/Markdown. (Verified: read consensus.go lines 230-244 and 388-394, confirmed no sanitization imports or calls)

  • cmd/gadfly/consensus.go:212-214XSS in headline table cells. The c.title, c.file, c.line, and lens list are all rendered via mdCell which does not escape HTML entities. An attacker-controlled finding title or file path could inject HTML. Fix: Extend mdCell to escape <, >, &, ", ' using html.EscapeString before Markdown escaping. (Verified: read mdCell implementation at lines 390-394)

  • entrypoint.sh:327-330XSS in fallback per-model comments. When consolidation fails, the fallback loop reads .markdown from each JSON artifact and inserts it directly into the comment body via printf with no sanitization. Same attack vector as above. Fix: Sanitize the md variable before embedding, or have the binary render the fallback output with proper escaping. (Verified: read entrypoint.sh lines 319-334)

  • cmd/gadfly/consensus.go:86-89Trust boundary violation: GADFLY_MODEL and modelProvider() used without validation. The model name (from environment) and provider string are stored in the JSON artifact and later rendered in the consensus comment. If an attacker can influence GADFLY_MODEL (e.g., via workflow input), they could inject malicious content. Fix: Validate/sanitize environment inputs before storing in artifacts. (Verified: read writeFindingsOut at lines 80-112 and modelProvider at model.go:146-155)

🎯 Correctness — Blocking issues found

Blocking issues found

Verified by reading cmd/gadfly/consensus.go lines 157-171, 266-293, and 310-321.

  • cmd/gadfly/consensus.go:266-293 — Clustering algorithm fails to handle transitive proximity, splitting findings that should be one cluster

    The greedy clustering algorithm checks each new finding against only the representative line (c.line, the minimum) of existing clusters at line 273: absInt(c.line-it.f.Line) <= lineTolerance. This breaks transitivity.

    Concrete example: Findings at lines 10, 13, and 16 in the same file (with lineTolerance = 3):

    1. Line 10: Creates cluster with c.line = 10
    2. Line 13: absInt(10-13) = 3 ≤ 3 → merges into cluster (keeps c.line = 10)
    3. Line 16: absInt(10-16) = 6 > 3creates a new cluster

    Result: Two clusters {10,13} and {16} instead of one {10,13,16}. Even though 13 and 16 are within tolerance (|13-16|=3), they never get compared because the algorithm only checks against the minimum line.

    Impact: Cross-model agreement counts are underreported when models cite lines that form a transitive chain but exceed tolerance from the cluster's minimum. A finding flagged by 3 models could appear as 2+1, weakening the consensus signal.

    Fix: Track both minLine and maxLine per cluster, and check proximity to the range:

    type cluster struct {
        // ... existing fields ...
        minLine, maxLine int  // track range, not just representative
    }
    // In clusterFindings loop:
    if c.file == file && (absInt(c.minLine-it.f.Line) <= lineTolerance || absInt(c.maxLine-it.f.Line) <= lineTolerance) {
        // merge
    }
    // In mergeIntoCluster:
    if f.Line > 0 {
        if c.minLine == 0 || f.Line < c.minLine { c.minLine = f.Line }
        if f.Line > c.maxLine { c.maxLine = f.Line }
    }
    

🧹 Code cleanliness & maintainability — Minor issues

Verdict: Minor issues found

  • cmd/gadfly/consensus.go:176renderConsensus is too long (68 lines) with multiple responsibilities
    This function handles: partitioning findings, building the headline table, rendering folded sections, AND rendering per-model drill-down. Each section is commented, but splitting into buildHeadlineTable, buildFoldedSection, and buildPerModelDetail would improve readability and testability. Verified by reading lines 176-244.

  • cmd/gadfly/consensus.go:248clusterFindings has deeply nested loop logic
    Lines 267-293 contain a triple-nested structure (range files → range items → range clusters) with a placed flag controlling flow. This greedy clustering algorithm is hard to follow at a glance. Extracting the inner "find or create cluster" logic into a helper like findOrCreateCluster would clarify intent. Verified by reading lines 248-308.

  • cmd/gadfly/consensus.go:390mdCell does not escape backticks
    The function escapes pipes and collapses newlines for markdown table safety, but backticks (common in code references like auth/login.go:42) are not escaped. A title containing `code` would break table rendering. This is a maintainability debt—the next person editing table output may not realize the escaping is incomplete. Verified by reading lines 388-394.

  • cmd/gadfly/consensus.go:396-407 — Trivial helper functions add noise
    plural (5 lines) and absInt (7 lines) are trivial one-liners that could be inlined at call sites or replaced with standard patterns (if n != 1 { s += "s" } and if n < 0 { n = -n }). They're not wrong, but they increase the surface area of the file without meaningful abstraction benefit. Verified by reading lines 396-407.

  • Naming confusion: renderConsolidated (consolidate.go) vs renderConsensus (consensus.go)
    These names are easily confused. renderConsolidated merges lenses within ONE model; renderConsensus merges findings across MULTIPLE models. The distinction matters but isn't clear from the names alone. Renaming to renderSingleModelReview and renderCrossModelConsensus would be self-documenting. Verified by comparing cmd/gadfly/consolidate.go:98 and cmd/gadfly/consensus.go:176.

Performance — No material issues found

No material issues found

After independent verification against the actual code:

  • Finding 1 (Quadratic clustering): The clusterFindings function at lines 266-293 does iterate through existing clusters for each finding. However, the draft's analysis misses that findings are first grouped by file (line 250-264), so the inner loop at line 271 only scans clusters that could match. Within a file, items are sorted by line (line 268), and the line tolerance (3 lines) means clusters are geographically constrained. While worst-case theoretical complexity exists, this is a standard greedy clustering pattern and not a blocking issue for the expected input sizes.

  • Finding 2 (lensList allocations): The lensList function at line 372 does allocate a slice, but it uses make([]string, 0, len(lenses)) with pre-allocated capacity based on the map size. The number of lenses per cluster is typically very small (1-5 distinct lens names). This is standard Go idiom and does not create meaningful GC pressure.

  • Finding 3 (Sequential fallback): The fallback loop at lines 322-332 does post sequentially. However, this is explicitly a fallback path (line 320: "consolidation produced no output; falling back") that only executes when the primary consolidation fails. Sequential behavior in a rare fallback is an acceptable trade-off for simplicity and correctness.

All three findings describe implementation choices rather than actual bugs. None prevent the feature from working correctly.

🧯 Error handling & edge cases — Blocking issues found

Blocking issues found

After reviewing the changes through the error handling & edge cases lens and verifying against the checked-out code:

  • cmd/gadfly/consensus.go:283 — Cluster initialized with potentially empty severity: When a new cluster is created, severity is set directly from it.f.Severity. For heuristic-scraped findings (per emit.go:64), severity may be empty. While mergeIntoCluster does escalate severity if a higher-ranked one exists, if ALL findings in a cluster have empty severity, the cluster remains with empty severity. This propagates to sevRank (returns 0) and sevIcon (returns ""), which is handled but the initial assignment at line 283 should defensively handle empty severity. Verified by reading clusterFindings and mergeIntoCluster. Suggested fix: Initialize with it.f.Severity only if non-empty, or call mergeIntoCluster logic inline to apply severity escalation on creation.

  • cmd/gadfly/consensus.go:80-111writeFindingsOut assumes parent directory exists: The function calls os.WriteFile(path, ...) without ensuring parent directories exist. If GADFLY_FINDINGS_OUT is set to a path like /tmp/findings/model1.json and /tmp/findings/ doesn't exist, the write fails silently (error logged to stderr). While entrypoint.sh creates FINDINGS_DIR before invoking the binary, this function has no defensive os.MkdirAll and would fail if called in isolation or with an unexpected path. Verified by reading writeFindingsOut and entrypoint.sh:238. Suggested fix: Add os.MkdirAll(filepath.Dir(path), 0o755) before WriteFile.

  • cmd/gadfly/consensus.go:216-218 — Misleading "No material issues found" when folded findings exist: The "No material issues found by consensus" message is shown whenever len(headline) == 0, even if len(folded) > 0. This could mislead readers into thinking there are no findings at all when low-severity single-model findings exist in the folded section. Verified by reading renderConsensus. Suggested fix: Change the message to "No headline findings" or only show this message when both headline and folded are empty.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 11m 10s

<!-- gadfly-review:ollama:qwen3.5:397b-cloud --> ### 🪰 Gadfly review — `qwen3.5:397b-cloud` (ollama-cloud) **Verdict: Blocking issues found** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — Blocking issues found</summary> **Blocking issues found** - **`cmd/gadfly/consensus.go:239-240`** — **XSS via unsanitized model output in consensus comment**. The `m.Markdown` field (raw model output, stored in the per-model JSON artifact) is embedded directly into an HTML `<details>` element without any HTML/Markdown sanitization. A model whose output is manipulated via prompt injection to include malicious HTML/JavaScript could execute arbitrary scripts in the context of any user viewing the PR. The `mdCell` helper (lines 390-394) only escapes pipes and newlines — it does not escape `<`, `>`, `&`, quotes, or backticks. **Fix:** Use `html/template` or a sanitization library (e.g., `bluemonday`) to sanitize all untrusted content before embedding in HTML/Markdown. *(Verified: read consensus.go lines 230-244 and 388-394, confirmed no sanitization imports or calls)* - **`cmd/gadfly/consensus.go:212-214`** — **XSS in headline table cells**. The `c.title`, `c.file`, `c.line`, and lens list are all rendered via `mdCell` which does not escape HTML entities. An attacker-controlled finding title or file path could inject HTML. **Fix:** Extend `mdCell` to escape `<`, `>`, `&`, `"`, `'` using `html.EscapeString` before Markdown escaping. *(Verified: read mdCell implementation at lines 390-394)* - **`entrypoint.sh:327-330`** — **XSS in fallback per-model comments**. When consolidation fails, the fallback loop reads `.markdown` from each JSON artifact and inserts it directly into the comment body via `printf` with no sanitization. Same attack vector as above. **Fix:** Sanitize the `md` variable before embedding, or have the binary render the fallback output with proper escaping. *(Verified: read entrypoint.sh lines 319-334)* - **`cmd/gadfly/consensus.go:86-89`** — **Trust boundary violation: `GADFLY_MODEL` and `modelProvider()` used without validation**. The model name (from environment) and provider string are stored in the JSON artifact and later rendered in the consensus comment. If an attacker can influence `GADFLY_MODEL` (e.g., via workflow input), they could inject malicious content. **Fix:** Validate/sanitize environment inputs before storing in artifacts. *(Verified: read writeFindingsOut at lines 80-112 and modelProvider at model.go:146-155)* </details> <details><summary><b>🎯 Correctness</b> — Blocking issues found</summary> ## Blocking issues found **Verified by reading `cmd/gadfly/consensus.go` lines 157-171, 266-293, and 310-321.** - **`cmd/gadfly/consensus.go:266-293` — Clustering algorithm fails to handle transitive proximity, splitting findings that should be one cluster** The greedy clustering algorithm checks each new finding against only the **representative line** (`c.line`, the minimum) of existing clusters at line 273: `absInt(c.line-it.f.Line) <= lineTolerance`. This breaks transitivity. **Concrete example:** Findings at lines 10, 13, and 16 in the same file (with `lineTolerance = 3`): 1. Line 10: Creates cluster with `c.line = 10` 2. Line 13: `absInt(10-13) = 3 ≤ 3` → merges into cluster (keeps `c.line = 10`) 3. Line 16: `absInt(10-16) = 6 > 3` → **creates a new cluster** Result: Two clusters `{10,13}` and `{16}` instead of one `{10,13,16}`. Even though 13 and 16 are within tolerance (`|13-16|=3`), they never get compared because the algorithm only checks against the minimum line. **Impact:** Cross-model agreement counts are underreported when models cite lines that form a transitive chain but exceed tolerance from the cluster's minimum. A finding flagged by 3 models could appear as 2+1, weakening the consensus signal. **Fix:** Track both `minLine` and `maxLine` per cluster, and check proximity to the range: ```go type cluster struct { // ... existing fields ... minLine, maxLine int // track range, not just representative } // In clusterFindings loop: if c.file == file && (absInt(c.minLine-it.f.Line) <= lineTolerance || absInt(c.maxLine-it.f.Line) <= lineTolerance) { // merge } // In mergeIntoCluster: if f.Line > 0 { if c.minLine == 0 || f.Line < c.minLine { c.minLine = f.Line } if f.Line > c.maxLine { c.maxLine = f.Line } } ``` --- </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> ## Verdict: Minor issues found - **`cmd/gadfly/consensus.go:176` — `renderConsensus` is too long (68 lines) with multiple responsibilities** This function handles: partitioning findings, building the headline table, rendering folded sections, AND rendering per-model drill-down. Each section is commented, but splitting into `buildHeadlineTable`, `buildFoldedSection`, and `buildPerModelDetail` would improve readability and testability. **Verified by reading lines 176-244.** - **`cmd/gadfly/consensus.go:248` — `clusterFindings` has deeply nested loop logic** Lines 267-293 contain a triple-nested structure (range files → range items → range clusters) with a `placed` flag controlling flow. This greedy clustering algorithm is hard to follow at a glance. Extracting the inner "find or create cluster" logic into a helper like `findOrCreateCluster` would clarify intent. **Verified by reading lines 248-308.** - **`cmd/gadfly/consensus.go:390` — `mdCell` does not escape backticks** The function escapes pipes and collapses newlines for markdown table safety, but backticks (common in code references like `auth/login.go:42`) are not escaped. A title containing `` `code` `` would break table rendering. This is a maintainability debt—the next person editing table output may not realize the escaping is incomplete. **Verified by reading lines 388-394.** - **`cmd/gadfly/consensus.go:396-407` — Trivial helper functions add noise** `plural` (5 lines) and `absInt` (7 lines) are trivial one-liners that could be inlined at call sites or replaced with standard patterns (`if n != 1 { s += "s" }` and `if n < 0 { n = -n }`). They're not wrong, but they increase the surface area of the file without meaningful abstraction benefit. **Verified by reading lines 396-407.** - **Naming confusion: `renderConsolidated` (consolidate.go) vs `renderConsensus` (consensus.go)** These names are easily confused. `renderConsolidated` merges lenses within ONE model; `renderConsensus` merges findings across MULTIPLE models. The distinction matters but isn't clear from the names alone. Renaming to `renderSingleModelReview` and `renderCrossModelConsensus` would be self-documenting. **Verified by comparing `cmd/gadfly/consolidate.go:98` and `cmd/gadfly/consensus.go:176`.** </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> **No material issues found** After independent verification against the actual code: - **Finding 1 (Quadratic clustering)**: The `clusterFindings` function at lines 266-293 does iterate through existing clusters for each finding. However, the draft's analysis misses that findings are **first grouped by file** (line 250-264), so the inner loop at line 271 only scans clusters that could match. Within a file, items are sorted by line (line 268), and the line tolerance (3 lines) means clusters are geographically constrained. While worst-case theoretical complexity exists, this is a standard greedy clustering pattern and not a blocking issue for the expected input sizes. - **Finding 2 (lensList allocations)**: The `lensList` function at line 372 does allocate a slice, but it uses `make([]string, 0, len(lenses))` with pre-allocated capacity based on the map size. The number of lenses per cluster is typically very small (1-5 distinct lens names). This is standard Go idiom and does not create meaningful GC pressure. - **Finding 3 (Sequential fallback)**: The fallback loop at lines 322-332 does post sequentially. However, this is explicitly a **fallback path** (line 320: "consolidation produced no output; falling back") that only executes when the primary consolidation fails. Sequential behavior in a rare fallback is an acceptable trade-off for simplicity and correctness. All three findings describe implementation choices rather than actual bugs. None prevent the feature from working correctly. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Blocking issues found</summary> ## Blocking issues found After reviewing the changes through the error handling & edge cases lens and verifying against the checked-out code: - **`cmd/gadfly/consensus.go:283` — Cluster initialized with potentially empty severity**: When a new cluster is created, `severity` is set directly from `it.f.Severity`. For heuristic-scraped findings (per `emit.go:64`), severity may be empty. While `mergeIntoCluster` does escalate severity if a higher-ranked one exists, if ALL findings in a cluster have empty severity, the cluster remains with empty severity. This propagates to `sevRank` (returns 0) and `sevIcon` (returns "⚪"), which is handled but the initial assignment at line 283 should defensively handle empty severity. **Verified by reading `clusterFindings` and `mergeIntoCluster`**. Suggested fix: Initialize with `it.f.Severity` only if non-empty, or call `mergeIntoCluster` logic inline to apply severity escalation on creation. - **`cmd/gadfly/consensus.go:80-111` — `writeFindingsOut` assumes parent directory exists**: The function calls `os.WriteFile(path, ...)` without ensuring parent directories exist. If `GADFLY_FINDINGS_OUT` is set to a path like `/tmp/findings/model1.json` and `/tmp/findings/` doesn't exist, the write fails silently (error logged to stderr). While `entrypoint.sh` creates `FINDINGS_DIR` before invoking the binary, this function has no defensive `os.MkdirAll` and would fail if called in isolation or with an unexpected path. **Verified by reading `writeFindingsOut` and `entrypoint.sh:238`**. Suggested fix: Add `os.MkdirAll(filepath.Dir(path), 0o755)` before `WriteFile`. - **`cmd/gadfly/consensus.go:216-218` — Misleading "No material issues found" when folded findings exist**: The "No material issues found by consensus" message is shown whenever `len(headline) == 0`, even if `len(folded) > 0`. This could mislead readers into thinking there are no findings at all when low-severity single-model findings exist in the folded section. **Verified by reading `renderConsensus`**. Suggested fix: Change the message to "No headline findings" or only show this message when both `headline` and `folded` are empty. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 11m 10s</sub>
steve added 1 commit 2026-06-28 22:56:07 +00:00
fix: address swarm review of consensus consolidation (PR #17)
Build & push image / build-and-push (pull_request) Successful in 6s
58c46db695
From PR #17's own adversarial review (now running on the Phase-1 image — note
the much cleaner, less-fragmented findings). Graded in gadfly-reports.

- Never lose output (advisory invariant): run.sh writes a stub findings file
  when a model fails/skips under consolidation, so a failed model still shows
  (as failed) and an all-models-fail run still posts a comment.
- Errored models excluded from the agreement denominator (modelFindings.Errored
  + allErrored); shown folded as "reviewer failed", not counted as clean.
- Sliding cluster window: span [line,maxLine] widened by tolerance on both
  edges, so chains of nearby findings merge instead of splitting (fixes
  agreement under-counting). Clustering is now per-file (no global O(n²) scan).
- Lone-finding headline threshold raised high -> critical (matches the docs);
  only consensus or a lone critical earns the headline, the rest fold.
- findings_file_for appends a cksum so foo:bar vs foo/bar can't collide.
- mdCell escapes the location cell + neutralizes backticks (table-injection);
  "No material issues" no longer shown when folded findings exist.
- Removed dead cluster.detail; named findingRef type; single partition/agreed
  pass; writeFindingsOut MkdirAll; captureStdout defer-restore + close.

Tests added: sliding-window chain merge, errored-denominator exclusion, lone
HIGH folds. gofmt/vet/bash -n clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
steve merged commit 88f74aa768 into main 2026-06-28 22:56:15 +00:00
steve deleted branch feat/consensus-consolidation 2026-06-28 22:56:15 +00:00
Sign in to join this conversation.
No Reviewers
No Label
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: steve/gadfly#17