feat: structured findings contract (machine-readable gadfly-findings block) #16

2026-06-28T22:03:15Z

steve commented

2026-06-28 22:03:15 +00:00

What

Phase 1 of the review-representation work: give each lens a machine-readable findings contract so downstream tooling stops scraping prose with a path:line regex.

Each lens now appends a fenced ```gadfly-findings JSON array after its prose review:

[{"file": "model.go", "line": 184, "severity": "high", "confidence": "high", "title": "Unauthenticated endpoint"}]

Why

Telemetry now carries per-finding severity (critical/high/medium/small/trivial), not just the lens-level verdict — the gadfly-reports store and grading get real signal.
Gives consensus consolidation (Phase 2) and inline PR reviews (Phase 3) a clean contract to cluster/anchor on instead of fragile markdown.

How

system-prompt.txt + recheck prompt: ask for / regenerate the block (recheck lists only survivors).
findings.go: extractStructuredFindings + stripFindingsBlock, severity normalization, prose-paragraph detail enrichment.
emit.go: prefer the structured block; gracefully fall back to the heuristic scrape when a model doesn't emit a parseable one (stays provider-agnostic — weak models still contribute).
consolidate.go: strip the block from the rendered comment (tooling-only, never shown).

No new env vars / config; the displayed comment is unchanged except the JSON block is hidden. New findings_test.go covers extraction, empty-array, malformed-fallback, quoted line numbers, stripping, and severity normalization.

Advisory invariant untouched — pure producer-side change.

🤖 Generated with Claude Code

## What Phase 1 of the review-representation work: give each lens a **machine-readable findings contract** so downstream tooling stops scraping prose with a `path:line` regex. Each lens now appends a fenced ` ```gadfly-findings ` JSON array after its prose review: ```json [{"file": "model.go", "line": 184, "severity": "high", "confidence": "high", "title": "Unauthenticated endpoint"}] ``` ## Why - Telemetry now carries **per-finding severity** (critical/high/medium/small/trivial), not just the lens-level verdict — the `gadfly-reports` store and grading get real signal. - Gives consensus consolidation (Phase 2) and inline PR reviews (Phase 3) a clean contract to cluster/anchor on instead of fragile markdown. ## How - `system-prompt.txt` + recheck prompt: ask for / regenerate the block (recheck lists only survivors). - `findings.go`: `extractStructuredFindings` + `stripFindingsBlock`, severity normalization, prose-paragraph detail enrichment. - `emit.go`: prefer the structured block; **gracefully fall back** to the heuristic scrape when a model doesn't emit a parseable one (stays provider-agnostic — weak models still contribute). - `consolidate.go`: strip the block from the rendered comment (tooling-only, never shown). No new env vars / config; the displayed comment is unchanged except the JSON block is hidden. New `findings_test.go` covers extraction, empty-array, malformed-fallback, quoted line numbers, stripping, and severity normalization. Advisory invariant untouched — pure producer-side change. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

steve added 2 commits 2026-06-28 22:03:15 +00:00

docs: correct examples/reusable.yml pin guidance (runners cache @v1; prefer @sha)

Adversarial Review (Gadfly) / review (pull_request) Successful in 3m4s

Details

6e87a3e73f

The @v1 comment claimed it auto-updates on releases, but long-lived act_runners
cache the reusable by ref so a moved tag isn't re-fetched. Recommend an
immutable @<sha>; routine tuning rides owner variables.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat: structured findings contract (machine-readable gadfly-findings block)

Build & push image / build-and-push (pull_request) Successful in 4s

Details

Adversarial Review (Gadfly) / review (pull_request) Successful in 44m27s

Details

9bcd7b9696

Each lens now appends a fenced ```gadfly-findings JSON array of its findings
({file, line, severity, confidence, title}) after the prose review. The binary
parses it exactly instead of scraping prose with a path:line regex, so telemetry
carries PER-FINDING severity (not just the lens-level verdict) and downstream
consolidation gets a clean contract to cluster on.

- system-prompt.txt + recheck prompt: ask for / regenerate the block.
- findings.go: extractStructuredFindings + stripFindingsBlock, with severity
  normalization and prose-paragraph detail enrichment.
- emit.go: prefer the structured block; gracefully fall back to the heuristic
  scrape when a model doesn't emit a parseable one (provider-agnostic, robust).
- consolidate.go: strip the block from the rendered comment (tooling-only).

This is the enabling refactor for consensus consolidation + inline reviews.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gitea-actions bot commented

2026-06-28 22:03:18 +00:00

🪰 Gadfly — live review status

7/7 reviewers finished · updated 2026-06-28 22:47:42Z

`claude-code/opus` · claude-code — ✅ done

✅ security — No material issues found
✅ correctness — Blocking issues found
✅ maintainability — Minor issues
✅ performance — No material issues found
✅ error-handling — Minor issues

`claude-code/opus:max` · claude-code — ✅ done

✅ security — No material issues found
✅ correctness — Minor issues
✅ maintainability — Minor issues
✅ performance — No material issues found
✅ error-handling — Minor issues

`claude-code/sonnet` · claude-code — ✅ done

✅ security — No material issues found
✅ correctness — Minor issues
✅ maintainability — Minor issues
✅ performance — No material issues found
✅ error-handling — No material issues found

`deepseek-v4-pro:cloud` · ollama-cloud — ✅ done

✅ security — No material issues found
✅ correctness — Blocking issues found
✅ maintainability — Minor issues
✅ performance — No material issues found
✅ error-handling — Minor issues

`glm-5.2:cloud` · ollama-cloud — ✅ done

✅ security — No material issues found
✅ correctness — Minor issues
✅ maintainability — Minor issues
✅ performance — No material issues found
✅ error-handling — Minor issues

`minimax-m3:cloud` · ollama-cloud — ✅ done

✅ security — No material issues found
✅ correctness — Minor issues
✅ maintainability — Minor issues
✅ performance — Minor issues
✅ error-handling — Minor issues

`ragnaros/qwen3.6-27b` · ragnaros — ✅ done

✅ security — No material issues found
⚠️ correctness — could not complete
✅ maintainability — Minor issues
✅ performance — No material issues found
⚠️ error-handling — could not complete

_{Live status board. Findings are posted in each model's own comment. Advisory only — does not block merge.}

## 🪰 Gadfly — live review status 7/7 reviewers finished · updated 2026-06-28 22:47:42Z #### `claude-code/opus` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Blocking issues found - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `claude-code/opus:max` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `claude-code/sonnet` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — No material issues found #### `deepseek-v4-pro:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Blocking issues found - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `glm-5.2:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `minimax-m3:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — Minor issues - ✅ **error-handling** — Minor issues #### `ragnaros/qwen3.6-27b` · ragnaros — ✅ done - ✅ **security** — No material issues found - ⚠️ **correctness** — could not complete - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ⚠️ **error-handling** — could not complete Live status board. Findings are posted in each model's own comment. Advisory only — does not block merge.

gitea-actions bot commented

2026-06-28 22:03:18 +00:00

🪰 Gadfly review — `glm-5.2:cloud` (ollama-cloud)

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

I verified the full data flow for the security lens. The newly parsed untrusted model fields (file, line, severity, confidence, title, detail) are never used to open files, build URLs, or interpolate into commands — they only flow into reportPayload and are JSON-marshaled into an advisory telemetry POST to an operator-configured GADFLY_FINDINGS_URL. json.Unmarshal into a concrete struct with json.Number is safe (no unsafe deserialization, no float-precision pitfalls). stripFindingsBlock only removes markdown lines; it doesn't evaluate anything.

VERDICT: No material issues found

Verified via cmd/gadfly/findings.go:44-86 and cmd/gadfly/emit.go:140-163: the untrusted structured fields are consumed only as map keys (prose[file+":"+...]) and as reportPayload values marshaled by encoding/json (emit.go:150-162). No path/command/URL sink is reached by tainted model output.
postJSON (emit.go:381-404) sets only fixed headers (Content-Type, conditional Authorization from env, not from model output) and POSTs to GADFLY_FINDINGS_URL (operator env). No SSRF from model data.
The new detail borrowing (findings.go:65-68) reuses the pre-existing pathLineRe heuristic over already-stripped prose; it introduces no new disclosure surface beyond the existing prose-scrape path (the model could already exfiltrate PR content into prose that gets POSTed — this PR neither widens nor narrows that).
normalizeSeverity (findings.go:164-179) lowercases unknown severities and passes them through; this is a grading-integrity concern (a model could self-report misleading severities), but it is an inherent advisory/telemetry trust property, not an injection or authz issue, and is outside the security lens's "real" threats.

Nothing in my lane to block on.

🎯 Correctness — Minor issues

All three findings verified against the actual source. Keeping them.

VERDICT: Minor issues found

cmd/gadfly/findings.go:65-80 — structured findings can be emitted with empty title and empty detail. When a model omits title in its JSON object, the code falls back to truncate(detail, 120) (line 70-71); when detail is also absent it falls back to an exact prose[file+":"+ln] key lookup (line 66-67). If that key misses — e.g. JSON says "file": "model.go", "line": 184 but the prose only references run/executor.go:166, or the model gives a title-less/detail-less object whose file:line isn't cited verbatim in prose — both title and detail resolve to "". The finding is still appended (line 73) and sent to the store as a reportPayload with empty Title/Detail (verified at emit.go:150-162). The heuristic path (parseFindings, deriveTitle/paragraphTitle at emit.go:224-297) generally yields a non-empty title because it scrapes the nearest heading/paragraph. Suggested fix: skip a structured finding when both title and detail are empty (it carries only a location + severity and no human signal), or fall back to a nearest-paragraph scrape rather than an exact file:line key lookup.
cmd/gadfly/findings.go:170-172 — "minor" normalizes to "medium", collapsing two distinct severity tiers. The canonical set is critical/high/medium/small/trivial. "minor" is mapped into the "medium" bucket alongside "moderate" (line 170), while "low" maps to "small" (line 172). A model writing "minor" (a low-severity nit) is silently inflated to medium in telemetry. This is a judgment call rather than a hard bug, and the test at findings_test.go codifies "minor": "medium" — worth a deliberate decision (move "minor" to the "small" case, or keep with a justifying comment).
cmd/gadfly/emit.go:144-149 (pre-existing, not a regression) — heuristic-fallback severity uses full verdict phrases, not canonical severity words. When the structured block is absent, lensSev := r.verdict.label() (line 144) yields "Blocking issues found" / "Minor issues" / "No material issues found" / "Reviewed" (verified at consolidate.go:18-29), so raw_severity sent to the store is a full sentence, not critical/high/.... The old code did the same r.verdict.label(), so this is unchanged by the PR — flagged only so the claim that "telemetry now carries per-finding severity" is understood to hold only on the structured path, not the fallback path.

🧹 Code cleanliness & maintainability — Minor issues

All four draft findings verified against the actual source. Confirming each:

findings.go:96 & :126 — Confirmed verbatim: both use strings.HasPrefix(t, "```") && strings.Contains(t, findingsFence), a substring match that would match ```gadfly-findings-extra. The predicate is duplicated. Keep.
findings.go:146-159 vs emit.go:194-217 — Confirmed: proseParagraphs re-runs the pathLineRe.FindAllStringSubmatchIndex → strings.Count(...,"\n") → paragraphAt(lines, li) pipeline already in parseFindings, differing only in producing a dedup map vs a slice. Keep.
findings_test.go:102-111 — Confirmed: the file imports only testing and defines a hand-rolled contains + containsFence wrapper; the rest of the suite (e.g. engine_test.go:101, recheck_test.go:55) uses strings.Contains directly. Keep.
findings.go:164-179 — Confirmed: normalizeSeverity maps "minor"→medium but "minor-issue"→small, and the default branch passes unknown tokens through lowercased (explicitly asserted by the test at findings_test.go:93: "weird"→"weird"), so non-canonical severity words leak into reportPayload.RawSeverity. The comment at lines 162-163 documents this deliberately, but it does leave the contract leaky relative to the canonical set. Keep as minor.

VERDICT: Minor issues

cmd/gadfly/findings.go:96 & :126 — fragile info-string detection via strings.Contains. Both findingsBlock and stripFindingsBlock identify an opening fence with strings.HasPrefix(t, "```") && strings.Contains(t, findingsFence). This matches any fence whose info string merely contains the substring gadfly-findings — e.g. ```gadfly-findings-extra or a legitimately tagged prose block whose info string happens to mention the token. The two functions also duplicate this predicate verbatim rather than sharing one helper. Cleanliness/robustness fix: extract a single isFindingsOpen(line) bool (trim, check HasPrefix("```") and that the info-string token equals findingsFence) and call it from both places; it removes the copy-paste and tightens the match.
cmd/gadfly/findings.go:146-159 — proseParagraphs duplicates the scan logic already in parseFindings (emit.go:194-217). The new helper re-implements the pathLineRe.FindAllStringSubmatchIndex → line-index-via-strings.Count → paragraphAt(lines, li) pipeline that parseFindings already uses, differing only in that it emits a dedup map keyed by file:line and operates on strip-block'd text. This is exactly the kind of cross-file duplication the surrounding code avoids (paragraphAt/truncate/maxFindingsPerLens/pathLineRe are all shared). A shared locateProseRefs(out string) map[string]string (or having parseFindings consume it) would keep the two extraction paths from drifting; right now a change to how refs are located must be made in two places.
cmd/gadfly/findings_test.go:102-111 — hand-rolled contains/containsFence reinvent strings.Contains. The rest of the suite uses strings.Contains directly (engine_test.go:101, recheck_test.go:55, tools_test.go:89). The new file imports only testing, then defines a 7-line contains plus a containsFence wrapper that just calls contains(s, findingsFence). This is needless churn/inconsistency: use strings.Contains(stripped, findingsFence) and strings.Contains(stripped, "Unauthenticated endpoint") (adding the strings import) to match established test style.
cmd/gadfly/findings.go:164-179 — normalizeSeverity pass-through lets non-canonical words reach RawSeverity. The canonical set is critical/high/medium/small/trivial, yet an unrecognized token (e.g. "weird") is returned lowercased and lands in reportPayload.RawSeverity as a non-canonical word — explicitly asserted by the test at findings_test.go:93 ("weird" → "weird"). The map also splits synonyms across bands inconsistently ("minor"→medium but "minor-issue"→small). The comment at lines 162-163 documents the pass-through as intentional, but it still leaves the contract leaky for downstream tooling that expects the canonical set. Either clamp unknowns to a documented default or note the deliberate escape hatch; either way the current behavior defeats normalization for off-vocabulary severities.

No dead code or unused symbols found; truncate, paragraphAt, maxFindingsPerLens, and pathLineRe all resolve to existing definitions in emit.go.

⚡ Performance — No material issues found

Verdict: No material issues found

Through the performance lens, I examined the new findings.go, its integration in emit.go, and the consolidate.go change. Here's what I verified:

proseParagraphs is called unconditionally (findings.go:54) even when every structured finding already carries a detail field from the JSON. It does a full stripFindingsBlock (line split + join), a full pathLineRe.FindAllStringSubmatchIndex(prose, -1) with no match cap (unlike parseFindings which breaks at maxFindingsPerLens=20), and an O(n) strings.Count(prose[:loc[0]], "\n") per match (findings.go:155). This is the most wasteful new work, but it is bounded by lens output size (token-limited, typically a few KB) and the number of path:line references in prose (dozens, not thousands). Not material.
Multiple full-output splits occur per structured lens: findingsBlock splits once (findings.go:92), proseParagraphs→stripFindingsBlock splits again (findings.go:147), then proseParagraphs splits a third time (findings.go:148). All bounded by output size; these are small strings, not a hot path.
emit does not double-parse: extractStructuredFindings and parseFindings are mutually exclusive (the latter only runs on fallback, emit.go:141-143).
consolidate.go adds one stripFindingsBlock pass per lens (consolidate.go:105) — a single linear scan, no regression.

None of these rise to a material efficiency regression. The quadratic strings.Count-per-match pattern is pre-existing in parseFindings and bounded by maxFindingsPerLens; the new occurrence in proseParagraphs is bounded by output size. The only thing I'd flag as a minor nit (not blocking): proseParagraphs could be deferred/lazy so it's skipped when all findings supply their own detail, but the current cost is negligible for bounded lens outputs.

🧯 Error handling & edge cases — Minor issues

VERDICT: Minor issues

cmd/gadfly/findings.go:115-133 — stripFindingsBlock silently drops all content after an unterminated gadfly-findings fence. When a model emits an opening ```gadfly-findings line but the output is truncated before the closing fence (a realistic failure when a model hits its output-token limit right at the block), skipping flips to true and never resets. Every subsequent line — including legitimate prose findings — is discarded from the rendered comment in consolidate.go:105. By contrast, findingsBlock (used by the telemetry path in emit.go) correctly returns ok=false on an unterminated block, so the emit path degrades gracefully to the heuristic scrape; only the human-visible comment loses data. Verified by reading both functions: stripFindingsBlock has no EOF-while-skipping guard, and the test suite (findings_test.go) has no test for the unterminated case. Suggested fix: when the loop ends with skipping == true, stop skipping at EOF without dropping buffered content (or re-emit the open-fence line so downstream body == "" detection still works).
**cmd/gadfly/findings.go:96,126 — opening-fence detection uses strings.Containsrather than exact info-string match.** A line like `` ```gadfly-findings-extra `` (info string with a suffix) would be treated as an opening fence by bothfindingsBlockandstripFindingsBlock. Low probability, but the substring match is loose; not blocking, just noting the match is Contains`, not exact info-string equality.
cmd/gadfly/findings.go:44-86 — an empty structured array on a non-clean lens permanently suppresses the heuristic fallback. extractStructuredFindings returns ok=true for [], so emit.go:141 never falls back to parseFindings. If a model's prose describes real findings but its JSON block is erroneously [] (model error, not code error), telemetry records zero findings for that lens with no recovery path. This is documented as intentional ("an empty array is a valid 'nothing found'"), so it's not a code bug, but it's an unrecoverable data-loss edge case with no sanity cross-check against the prose. No fix required if the contract is trusted; noting for completeness.

_{Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 33s}

### 🪰 Gadfly review — `glm-5.2:cloud` (ollama-cloud) **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary>🔒 Security — No material issues found</summary> I verified the full data flow for the security lens. The newly parsed untrusted model fields (`file`, `line`, `severity`, `confidence`, `title`, `detail`) are never used to open files, build URLs, or interpolate into commands — they only flow into `reportPayload` and are JSON-marshaled into an advisory telemetry POST to an operator-configured `GADFLY_FINDINGS_URL`. `json.Unmarshal` into a concrete struct with `json.Number` is safe (no unsafe deserialization, no float-precision pitfalls). `stripFindingsBlock` only removes markdown lines; it doesn't evaluate anything. VERDICT: No material issues found - Verified via `cmd/gadfly/findings.go:44-86` and `cmd/gadfly/emit.go:140-163`: the untrusted structured fields are consumed only as map keys (`prose[file+":"+...]`) and as `reportPayload` values marshaled by `encoding/json` (`emit.go:150-162`). No path/command/URL sink is reached by tainted model output. - `postJSON` (`emit.go:381-404`) sets only fixed headers (`Content-Type`, conditional `Authorization` from env, not from model output) and POSTs to `GADFLY_FINDINGS_URL` (operator env). No SSRF from model data. - The new `detail` borrowing (`findings.go:65-68`) reuses the pre-existing `pathLineRe` heuristic over already-stripped prose; it introduces no new disclosure surface beyond the existing prose-scrape path (the model could already exfiltrate PR content into prose that gets POSTed — this PR neither widens nor narrows that). - `normalizeSeverity` (`findings.go:164-179`) lowercases unknown severities and passes them through; this is a grading-integrity concern (a model could self-report misleading severities), but it is an inherent advisory/telemetry trust property, not an injection or authz issue, and is outside the security lens's "real" threats. Nothing in my lane to block on. </details> <details><summary>🎯 Correctness — Minor issues</summary> All three findings verified against the actual source. Keeping them. **VERDICT: Minor issues found** - **`cmd/gadfly/findings.go:65-80` — structured findings can be emitted with empty title and empty detail.** When a model omits `title` in its JSON object, the code falls back to `truncate(detail, 120)` (line 70-71); when `detail` is also absent it falls back to an exact `prose[file+":"+ln]` key lookup (line 66-67). If that key misses — e.g. JSON says `"file": "model.go", "line": 184` but the prose only references `run/executor.go:166`, or the model gives a title-less/detail-less object whose `file:line` isn't cited verbatim in prose — both `title` and `detail` resolve to `""`. The finding is still appended (line 73) and sent to the store as a `reportPayload` with empty `Title`/`Detail` (verified at `emit.go:150-162`). The heuristic path (`parseFindings`, `deriveTitle`/`paragraphTitle` at `emit.go:224-297`) generally yields a non-empty title because it scrapes the nearest heading/paragraph. Suggested fix: skip a structured finding when both `title` and `detail` are empty (it carries only a location + severity and no human signal), or fall back to a nearest-paragraph scrape rather than an exact `file:line` key lookup. - **`cmd/gadfly/findings.go:170-172` — `"minor"` normalizes to `"medium"`, collapsing two distinct severity tiers.** The canonical set is `critical/high/medium/small/trivial`. `"minor"` is mapped into the `"medium"` bucket alongside `"moderate"` (line 170), while `"low"` maps to `"small"` (line 172). A model writing `"minor"` (a low-severity nit) is silently inflated to `medium` in telemetry. This is a judgment call rather than a hard bug, and the test at `findings_test.go` codifies `"minor": "medium"` — worth a deliberate decision (move `"minor"` to the `"small"` case, or keep with a justifying comment). - **`cmd/gadfly/emit.go:144-149` (pre-existing, not a regression) — heuristic-fallback severity uses full verdict phrases, not canonical severity words.** When the structured block is absent, `lensSev := r.verdict.label()` (line 144) yields `"Blocking issues found"` / `"Minor issues"` / `"No material issues found"` / `"Reviewed"` (verified at `consolidate.go:18-29`), so `raw_severity` sent to the store is a full sentence, not `critical/high/...`. The old code did the same `r.verdict.label()`, so this is unchanged by the PR — flagged only so the claim that "telemetry now carries per-finding severity" is understood to hold only on the structured path, not the fallback path. </details> <details><summary>🧹 Code cleanliness & maintainability — Minor issues</summary> All four draft findings verified against the actual source. Confirming each: 1. **`findings.go:96` & `:126`** — Confirmed verbatim: both use `strings.HasPrefix(t, "```") && strings.Contains(t, findingsFence)`, a substring match that would match ` ```gadfly-findings-extra `. The predicate is duplicated. Keep. 2. **`findings.go:146-159` vs `emit.go:194-217`** — Confirmed: `proseParagraphs` re-runs the `pathLineRe.FindAllStringSubmatchIndex` → `strings.Count(...,"\n")` → `paragraphAt(lines, li)` pipeline already in `parseFindings`, differing only in producing a dedup map vs a slice. Keep. 3. **`findings_test.go:102-111`** — Confirmed: the file imports only `testing` and defines a hand-rolled `contains` + `containsFence` wrapper; the rest of the suite (e.g. `engine_test.go:101`, `recheck_test.go:55`) uses `strings.Contains` directly. Keep. 4. **`findings.go:164-179`** — Confirmed: `normalizeSeverity` maps `"minor"`→`medium` but `"minor-issue"`→`small`, and the `default` branch passes unknown tokens through lowercased (explicitly asserted by the test at `findings_test.go:93`: `"weird"→"weird"`), so non-canonical severity words leak into `reportPayload.RawSeverity`. The comment at lines 162-163 documents this deliberately, but it does leave the contract leaky relative to the canonical set. Keep as minor. --- VERDICT: Minor issues - **`cmd/gadfly/findings.go:96` & `:126` — fragile info-string detection via `strings.Contains`.** Both `findingsBlock` and `stripFindingsBlock` identify an opening fence with `strings.HasPrefix(t, "```") && strings.Contains(t, findingsFence)`. This matches *any* fence whose info string merely contains the substring `gadfly-findings` — e.g. ` ```gadfly-findings-extra ` or a legitimately tagged prose block whose info string happens to mention the token. The two functions also duplicate this predicate verbatim rather than sharing one helper. Cleanliness/robustness fix: extract a single `isFindingsOpen(line) bool` (trim, check `HasPrefix("```")` and that the info-string token equals `findingsFence`) and call it from both places; it removes the copy-paste and tightens the match. - **`cmd/gadfly/findings.go:146-159` — `proseParagraphs` duplicates the scan logic already in `parseFindings` (`emit.go:194-217`).** The new helper re-implements the `pathLineRe.FindAllStringSubmatchIndex` → line-index-via-`strings.Count` → `paragraphAt(lines, li)` pipeline that `parseFindings` already uses, differing only in that it emits a dedup map keyed by `file:line` and operates on strip-block'd text. This is exactly the kind of cross-file duplication the surrounding code avoids (`paragraphAt`/`truncate`/`maxFindingsPerLens`/`pathLineRe` are all shared). A shared `locateProseRefs(out string) map[string]string` (or having `parseFindings` consume it) would keep the two extraction paths from drifting; right now a change to how refs are located must be made in two places. - **`cmd/gadfly/findings_test.go:102-111` — hand-rolled `contains`/`containsFence` reinvent `strings.Contains`.** The rest of the suite uses `strings.Contains` directly (`engine_test.go:101`, `recheck_test.go:55`, `tools_test.go:89`). The new file imports only `testing`, then defines a 7-line `contains` plus a `containsFence` wrapper that just calls `contains(s, findingsFence)`. This is needless churn/inconsistency: use `strings.Contains(stripped, findingsFence)` and `strings.Contains(stripped, "Unauthenticated endpoint")` (adding the `strings` import) to match established test style. - **`cmd/gadfly/findings.go:164-179` — `normalizeSeverity` pass-through lets non-canonical words reach `RawSeverity`.** The canonical set is `critical/high/medium/small/trivial`, yet an unrecognized token (e.g. `"weird"`) is returned lowercased and lands in `reportPayload.RawSeverity` as a non-canonical word — explicitly asserted by the test at `findings_test.go:93` (`"weird" → "weird"`). The map also splits synonyms across bands inconsistently (`"minor"→medium` but `"minor-issue"→small`). The comment at lines 162-163 documents the pass-through as intentional, but it still leaves the contract leaky for downstream tooling that expects the canonical set. Either clamp unknowns to a documented default or note the deliberate escape hatch; either way the current behavior defeats normalization for off-vocabulary severities. No dead code or unused symbols found; `truncate`, `paragraphAt`, `maxFindingsPerLens`, and `pathLineRe` all resolve to existing definitions in `emit.go`. </details> <details><summary>⚡ Performance — No material issues found</summary> **Verdict: No material issues found** Through the performance lens, I examined the new `findings.go`, its integration in `emit.go`, and the `consolidate.go` change. Here's what I verified: - **`proseParagraphs` is called unconditionally** (`findings.go:54`) even when every structured finding already carries a `detail` field from the JSON. It does a full `stripFindingsBlock` (line split + join), a full `pathLineRe.FindAllStringSubmatchIndex(prose, -1)` with **no match cap** (unlike `parseFindings` which breaks at `maxFindingsPerLens=20`), and an O(n) `strings.Count(prose[:loc[0]], "\n")` per match (`findings.go:155`). This is the most wasteful new work, but it is bounded by lens output size (token-limited, typically a few KB) and the number of `path:line` references in prose (dozens, not thousands). Not material. - **Multiple full-output splits** occur per structured lens: `findingsBlock` splits once (`findings.go:92`), `proseParagraphs`→`stripFindingsBlock` splits again (`findings.go:147`), then `proseParagraphs` splits a third time (`findings.go:148`). All bounded by output size; these are small strings, not a hot path. - **`emit` does not double-parse**: `extractStructuredFindings` and `parseFindings` are mutually exclusive (the latter only runs on fallback, `emit.go:141-143`). - **`consolidate.go`** adds one `stripFindingsBlock` pass per lens (`consolidate.go:105`) — a single linear scan, no regression. None of these rise to a material efficiency regression. The quadratic `strings.Count`-per-match pattern is pre-existing in `parseFindings` and bounded by `maxFindingsPerLens`; the new occurrence in `proseParagraphs` is bounded by output size. The only thing I'd flag as a minor nit (not blocking): `proseParagraphs` could be deferred/lazy so it's skipped when all findings supply their own `detail`, but the current cost is negligible for bounded lens outputs. </details> <details><summary>🧯 Error handling & edge cases — Minor issues</summary> **VERDICT: Minor issues** - **`cmd/gadfly/findings.go:115-133` — `stripFindingsBlock` silently drops all content after an unterminated `gadfly-findings` fence.** When a model emits an opening `` ```gadfly-findings `` line but the output is truncated before the closing fence (a realistic failure when a model hits its output-token limit right at the block), `skipping` flips to `true` and never resets. Every subsequent line — including legitimate prose findings — is discarded from the rendered comment in `consolidate.go:105`. By contrast, `findingsBlock` (used by the telemetry path in `emit.go`) correctly returns `ok=false` on an unterminated block, so the emit path degrades gracefully to the heuristic scrape; only the human-visible comment loses data. Verified by reading both functions: `stripFindingsBlock` has no EOF-while-skipping guard, and the test suite (`findings_test.go`) has no test for the unterminated case. Suggested fix: when the loop ends with `skipping == true`, stop skipping at EOF without dropping buffered content (or re-emit the open-fence line so downstream `body == ""` detection still works). - **`cmd/gadfly/findings.go:96,126 — opening-fence detection uses `strings.Contains` rather than exact info-string match.** A line like `` ```gadfly-findings-extra `` (info string with a suffix) would be treated as an opening fence by both `findingsBlock` and `stripFindingsBlock`. Low probability, but the substring match is loose; not blocking, just noting the match is `Contains`, not exact info-string equality. - **`cmd/gadfly/findings.go:44-86` — an empty structured array on a non-clean lens permanently suppresses the heuristic fallback.** `extractStructuredFindings` returns `ok=true` for `[]`, so `emit.go:141` never falls back to `parseFindings`. If a model's prose describes real findings but its JSON block is erroneously `[]` (model error, not code error), telemetry records zero findings for that lens with no recovery path. This is documented as intentional ("an empty array is a valid 'nothing found'"), so it's not a code bug, but it's an unrecoverable data-loss edge case with no sanity cross-check against the prose. No fix required if the contract is trusted; noting for completeness. </details> Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 33s

gitea-actions bot commented

2026-06-28 22:03:18 +00:00

🪰 Gadfly review — `ragnaros/qwen3.6-27b` (ragnaros)

Verdict: Minor issues · ⚠️ 2/5 lens(es) errored — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

Verdict: No material issues found

Reviewed findings.go, emit.go, consolidate.go, recheck.go, and system-prompt.txt through the security lens. Here's what I verified:

JSON parsing of untrusted LLM output (findings.go:50): json.Unmarshal into a flat struct of simple types (string, json.Number) is safe in Go — no code execution, no deserialization gadgets. json.Number stores the raw text, so integer overflow on line is not possible.
SSRF surface (emit.go:102, emit.go:166): GADFLY_FINDINGS_URL is operator-controlled via environment variable, not derived from user/LLM input. The http.Client has a 10s timeout. Standard telemetry pattern, no SSRF vector.
Credential handling (emit.go:167, emit.go:391-393): GADFLY_FINDINGS_TOKEN is read from env and sent as a Bearer header. Not logged, not echoed to stdout. Correct.
Path traversal: File names from LLM output are used only as JSON field values and map keys — never for file system access.
Prompt injection: The system prompt (scripts/system-prompt.txt) and recheck prompt (recheck.go) are static templates with no interpolation of untrusted data.
Findings block stripping (findings.go:115-133): The stripFindingsBlock/isFenceClose logic correctly handles multi-backtick fences and avoids leaking the JSON block into rendered comments.
Confidence normalization (findings.go:79): confidence is lowercased and trimmed but not validated against the whitelist (high/medium/low). An LLM could emit arbitrary strings. This is a data quality gap, not a security vulnerability — the downstream store simply receives whatever it gets.

No injection, authz, credential leakage, SSRF, or unsafe deserialization issues identified in this change.

🎯 Correctness — ⚠️ could not complete

⚠️ This reviewer failed to complete: agent: step 6: context deadline exceeded

🧹 Code cleanliness & maintainability — Minor issues

Verdict: Minor issues found

cmd/gadfly/findings_test.go:104–111 — Custom contains function reinvents strings.Contains. The test file defines a manual byte-level substring search when the standard library provides strings.Contains(s, sub) directly. This adds maintenance burden (reviewers will wonder why it's custom) and hides an import. Replace with strings.Contains(s, sub) and add "strings" to the import.
cmd/gadfly/findings_test.go:102 — containsFence one-liner obscures intent. containsFence(s string) bool { return contains(s, findingsFence) } is a one-line wrapper around what is, after the above fix, just strings.Contains. Consider inlining the call at its single usage site (TestStripFindingsBlock, line 70) rather than adding a named helper for one call.
cmd/gadfly/findings.go:81–83 vs cmd/gadfly/emit.go:214–216 — maxFindingsPerLens cap duplicated across two extraction paths. Both extractStructuredFindings and parseFindings independently cap at maxFindingsPerLens with identical if len(findings) >= maxFindingsPerLens { break }. This is harmless duplication (different code paths), but a shared post-processing helper like capFindings(findings []finding) []finding would eliminate the repetition and make the cap policy a single point of change.

⚡ Performance — No material issues found

Verdict: No material issues found (Performance lens)

The change operates at a scale where performance concerns are immaterial:

Per-invocation scope: Each new function (extractStructuredFindings, stripFindingsBlock, proseParagraphs) runs once per lens per PR review — not in a hot loop. With ~5–10 lenses and LLM outputs of a few KB each, the total work is trivial.
Redundant splits: extractStructuredFindings splits the output text into lines up to 3 times (findingsBlock → stripFindingsBlock → proseParagraphs). This could be refactored to pass pre-split lines, but at the scale of ~2–4 KB strings the allocation cost is negligible.
Quadratic strings.Count: proseParagraphs (line 155) does strings.Count(prose[:loc[0]], "\n") per regex match, which is O(N×M). This is capped at maxFindingsPerLens (20) and M is the full prose text length (~KB). Pre-existing code in parseFindings (emit.go:207) uses the identical pattern, so there's no regression.
proseParagraphs called unconditionally: Even when all structured findings carry their own Detail, the full prose scan runs. Could be lazy, but again — immaterial at this scale.

No unbounded growth, no N+1 queries, no blocking calls on hot paths, and no quadratic behavior that exceeds the previous code's characteristics.

🧯 Error handling & edge cases — ⚠️ could not complete

⚠️ This reviewer failed to complete: agent: step 7: context deadline exceeded

_{Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 44m 13s}

### 🪰 Gadfly review — `ragnaros/qwen3.6-27b` (ragnaros) **Verdict: Minor issues · ⚠️ 2/5 lens(es) errored** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary>🔒 Security — No material issues found</summary> **Verdict: No material issues found** Reviewed `findings.go`, `emit.go`, `consolidate.go`, `recheck.go`, and `system-prompt.txt` through the security lens. Here's what I verified: - **JSON parsing of untrusted LLM output** (`findings.go:50`): `json.Unmarshal` into a flat struct of simple types (`string`, `json.Number`) is safe in Go — no code execution, no deserialization gadgets. `json.Number` stores the raw text, so integer overflow on `line` is not possible. - **SSRF surface** (`emit.go:102`, `emit.go:166`): `GADFLY_FINDINGS_URL` is operator-controlled via environment variable, not derived from user/LLM input. The `http.Client` has a 10s timeout. Standard telemetry pattern, no SSRF vector. - **Credential handling** (`emit.go:167`, `emit.go:391-393`): `GADFLY_FINDINGS_TOKEN` is read from env and sent as a Bearer header. Not logged, not echoed to stdout. Correct. - **Path traversal**: File names from LLM output are used only as JSON field values and map keys — never for file system access. - **Prompt injection**: The system prompt (`scripts/system-prompt.txt`) and recheck prompt (`recheck.go`) are static templates with no interpolation of untrusted data. - **Findings block stripping** (`findings.go:115-133`): The `stripFindingsBlock`/`isFenceClose` logic correctly handles multi-backtick fences and avoids leaking the JSON block into rendered comments. - **Confidence normalization** (`findings.go:79`): `confidence` is lowercased and trimmed but not validated against the whitelist (`high`/`medium`/`low`). An LLM could emit arbitrary strings. This is a **data quality gap**, not a security vulnerability — the downstream store simply receives whatever it gets. No injection, authz, credential leakage, SSRF, or unsafe deserialization issues identified in this change. </details> <details><summary>🎯 Correctness — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 6: context deadline exceeded </details> <details><summary>🧹 Code cleanliness & maintainability — Minor issues</summary> **Verdict: Minor issues found** - **`cmd/gadfly/findings_test.go:104–111` — Custom `contains` function reinvents `strings.Contains`.** The test file defines a manual byte-level substring search when the standard library provides `strings.Contains(s, sub)` directly. This adds maintenance burden (reviewers will wonder why it's custom) and hides an import. Replace with `strings.Contains(s, sub)` and add `"strings"` to the import. - **`cmd/gadfly/findings_test.go:102` — `containsFence` one-liner obscures intent.** `containsFence(s string) bool { return contains(s, findingsFence) }` is a one-line wrapper around what is, after the above fix, just `strings.Contains`. Consider inlining the call at its single usage site (`TestStripFindingsBlock`, line 70) rather than adding a named helper for one call. - **`cmd/gadfly/findings.go:81–83` vs `cmd/gadfly/emit.go:214–216` — `maxFindingsPerLens` cap duplicated across two extraction paths.** Both `extractStructuredFindings` and `parseFindings` independently cap at `maxFindingsPerLens` with identical `if len(findings) >= maxFindingsPerLens { break }`. This is harmless duplication (different code paths), but a shared post-processing helper like `capFindings(findings []finding) []finding` would eliminate the repetition and make the cap policy a single point of change. </details> <details><summary>⚡ Performance — No material issues found</summary> **Verdict: No material issues found** (Performance lens) The change operates at a scale where performance concerns are immaterial: - **Per-invocation scope**: Each new function (`extractStructuredFindings`, `stripFindingsBlock`, `proseParagraphs`) runs once per lens per PR review — not in a hot loop. With ~5–10 lenses and LLM outputs of a few KB each, the total work is trivial. - **Redundant splits**: `extractStructuredFindings` splits the output text into lines up to 3 times (`findingsBlock` → `stripFindingsBlock` → `proseParagraphs`). This could be refactored to pass pre-split lines, but at the scale of ~2–4 KB strings the allocation cost is negligible. - **Quadratic `strings.Count`**: `proseParagraphs` (line 155) does `strings.Count(prose[:loc[0]], "\n")` per regex match, which is O(N×M). This is capped at `maxFindingsPerLens` (20) and M is the full prose text length (~KB). Pre-existing code in `parseFindings` (emit.go:207) uses the identical pattern, so there's no regression. - **`proseParagraphs` called unconditionally**: Even when all structured findings carry their own `Detail`, the full prose scan runs. Could be lazy, but again — immaterial at this scale. No unbounded growth, no N+1 queries, no blocking calls on hot paths, and no quadratic behavior that exceeds the previous code's characteristics. </details> <details><summary>🧯 Error handling & edge cases — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 7: context deadline exceeded </details> Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 44m 13s

gitea-actions bot commented

2026-06-28 22:03:18 +00:00

🪰 Gadfly review — `claude-code/opus` (claude-code)

Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

VERDICT: No material issues found

I reviewed this change strictly through the 🔒 Security lens, tracing how the newly-parsed data flows from untrusted source (LLM lens output, itself influenced by the attacker-controlled PR diff) to its sinks.

What I verified:

Sink trace — no dangerous use of parsed fields. extractStructuredFindings produces finding{file,line,title,detail,severity,confidence}. I grepped every consumer of those fields (emit.go:140, consolidate.go:105) and confirmed they reach only two places:
- emit.go:150-162 → a reportPayload that is sent via postJSON using json.Marshal (emit.go:382). Encoding is structured, so model-supplied strings (file, title, detail) cannot break out of the JSON body or inject headers. The Bearer token comes from GADFLY_FINDINGS_TOKEN env (emit.go:167) — unchanged, not leaked.
- consolidate.go:105 → stripFindingsBlock removes the block before render, so the JSON is never emitted into the comment. No new markdown/XSS surface.
- I confirmed via grep that the only os.Open/os.ReadFile/exec sites (tools.go, engine.go, specialists.go, main.go) do not consume any structured-finding field — so the model-supplied file string is never used as a filesystem path. No path traversal.
Deserialization is safe. json.Unmarshal into a typed []structuredFinding with no custom UnmarshalJSON and Line json.Number (findings.go:49-52) — standard stdlib decoding, no gadget/polymorphic risk. Malformed JSON returns ok=false and falls back gracefully.
No SSRF / new outbound targets. The POST URL is still derived solely from GADFLY_FINDINGS_URL (emit.go:102); the structured block introduces no model-controlled URL.
Insecure-default / authz posture unchanged. Telemetry stays opt-in (off unless GADFLY_FINDINGS_URL set), advisory, and the verdictClean guard (emit.go:134) still gates emission regardless of which extractor runs.

Minor note (not a finding, flagging for completeness): unlike the heuristic path (constrained by pathLineRe), the structured file field accepts any non-empty string and is not length-capped (findings.go:57-58; only title/detail are truncated). maxFindingsPerLens still caps count at 20, and the value is JSON-encoded, so this is at most a telemetry-data-quality concern for the downstream store — not an injection or sink vulnerability in this codebase. No action needed within Gadfly's security boundary.

Nothing material in my lane.

🎯 Correctness — Blocking issues found

I've verified all three findings against the actual source in findings.go, emit.go, and consolidate.go. All three are factually accurate against the code:

normalizeSeverity (findings.go:170-172): confirmed "minor" → medium while "minor-issue" → small and "low" → small, so "minor" outranks "low" and splits on hyphenation.
emit.go:144-148: confirmed the heuristic fallback writes r.verdict.label() ("Blocking issues found"/"Minor issues") into RawSeverity, vs canonical words from the structured path.
emit.go:140-143 + findings.go:44-52: confirmed [] makes extractStructuredFindings return ok=true, suppressing the parseFindings fallback, reachable only for non-clean verdicts.

All three survive verification.

VERDICT: Minor issues

normalizeSeverity maps "minor" → medium, which both over-inflates and contradicts the table's own neighbors (cmd/gadfly/findings.go:170-173). The canonical ladder is critical > high > medium > small > trivial. "minor" is colloquially low-priority, yet it's mapped to the middle rung (medium), while the near-identical "minor-issue" maps to small (one rung lower) and "nit"/"low" map to trivial/small. So the same concept ("minor") lands on two different severities depending on hyphenation, and "minor" outranks "low" — internally inconsistent. Verified by reading the switch: case "medium", "moderate", "minor" vs case "small", "low", "minor-issue". This only bites when a model ignores the 5-word contract and emits "minor", but the telemetry severity will then be wrong/inconsistent. Suggested fix: map "minor" to small (alongside "minor-issue"), or drop it from the synonym set so it falls through as the raw word rather than being silently promoted to medium.
raw_severity now carries two incompatible vocabularies depending on which path produced the finding (cmd/gadfly/emit.go:144-149). Structured findings emit canonical severities (critical/high/.../trivial); the heuristic fallback emits lensSev = r.verdict.label(), i.e. the literal verdict strings "Blocking issues found" / "Minor issues". So any model that doesn't emit a parseable block still writes verdict-label strings into the same RawSeverity field. Verified via verdict.label() at consolidate.go:18-29. The fallback vocabulary is pre-existing, but mixing it with the new canonical set in one field means downstream grading must special-case both. Worth at least normalizing the fallback to a canonical bucket (e.g. blocking→high/critical, minor→small) so the field has one vocabulary.
An empty gadfly-findings array suppresses the heuristic fallback even when the prose clearly lists findings (cmd/gadfly/emit.go:140-143, findings.go:44-85). extractStructuredFindings returns ok=true for [], so parseFindings is never consulted. This is reached only for non-clean verdicts (the verdictClean guard at emit.go:134 already continues). So a model that writes a "Blocking issues found" prose review with path:line findings but emits a [] block (e.g. a recheck that regenerates the block incorrectly) will contribute zero findings to telemetry while the human comment still shows the issues. This is the documented "empty array = nothing found" design, but it trusts the JSON block over the verdict; a defensive option is to fall back to the scrape when the array is empty and the verdict is non-clean. Low likelihood, real impact on the telemetry signal this PR exists to improve.

🧹 Code cleanliness & maintainability — Minor issues

I've verified each finding against the actual source. Let me record my conclusions.

Findings 1–4 are confirmed by reading the real code:

proseParagraphs (findings.go:146-159) and parseFindings (emit.go:194-217) genuinely share the same regex-scan / dedupe / paragraphAt idiom.
The emit.go:13-17 header still describes heuristic-only extraction, which emit() (emit.go:137-143) now contradicts by preferring the structured block.
findings_test.go:104-111 hand-rolls contains while importing only testing.
The open-fence predicate is duplicated at findings.go:96 and :126.

Finding 5 (the key-mismatch aside) I'm dropping: detail is explicitly an optional field, a key mismatch only leaves it empty (the same as when the model omits it), and the claim rests on speculation about model output formatting ("likely a full path", "often a bare") that I cannot positively confirm. It describes tolerated degradation, not a defect.

VERDICT: Minor issues

Reviewing only through the code cleanliness & maintainability lens.

Findings

Copy-pasted scan loop: proseParagraphs duplicates the core of parseFindings — cmd/gadfly/findings.go:146 vs cmd/gadfly/emit.go:194. Both iterate pathLineRe.FindAllStringSubmatchIndex(...), build the same file + ":" + line key, dedupe on first occurrence, recompute the match's line index with strings.Count(...[:loc[0]], "\n"), and call paragraphAt(lines, li). Only the per-match action differs (build a finding vs map key→paragraph). A future change to the path:line anchoring or paragraph logic must now be made in two places. Suggested fix: extract a helper that yields (file, lineStr, lineIndex) tuples and have both call sites use it.
Stale file-level doc comment in emit.go — cmd/gadfly/emit.go:13-17. The header still states findings are extracted "heuristically … This is best-effort signal … not a structured contract the reviewer guarantees." After this PR, emit() prefers the structured gadfly-findings contract and only falls back to the heuristic (emit.go:137-143). The package's introductory comment now contradicts the behavior. Suggested fix: update the comment to describe the structured-block-first / heuristic-fallback flow and cross-reference findings.go.
Test reimplements strings.Contains — cmd/gadfly/findings_test.go:104-111. The hand-rolled contains (and containsFence wrapper) duplicate stdlib strings.Contains purely to avoid importing strings. The rest of the package uses strings freely, so this diverges from the surrounding style for no benefit. Suggested fix: import "strings" and call strings.Contains directly.

Minor / optional

Repeated open-fence predicate — the expression strings.HasPrefix(t, "```") && strings.Contains(t, findingsFence) appears in both findingsBlock (findings.go:96) and stripFindingsBlock (findings.go:126). Pairs naturally with the existing isFenceClose helper; consider an isFindingsFenceOpen(line string) bool for symmetry and single-point maintenance. Low churn, optional.

⚡ Performance — No material issues found

I've verified the full call graph through the performance lens. Here's my review.

VERDICT: No material issues found

I reviewed the new structured-findings path specifically for efficiency regressions (allocations, hot loops, N+1, unbounded growth) and verified call frequency by reading emit.go, findings.go, consolidate.go, and main.go:212.

Findings:

No hot path involved. extractStructuredFindings (emit.go:140) and stripFindingsBlock (consolidate.go:105) each run exactly once per specialist result — and results is lenses × models (single/low-double digits). Both run at the very end of a review, after minutes of LLM inference per lens. Nothing here is in a loop over diff hunks, files, or findings-at-scale.
Telemetry stays at two POSTs total, not per-finding. Verified emit (emit.go:166-176): findings are accumulated into one reports slice and POSTed in a single /reports call, plus one /runs call. The new per-finding severity work adds only field copies, no extra requests. Output is already capped at maxFindingsPerLens = 20 (emit.go:81, :214), so payload growth is bounded.
The redundant work I found is immaterial. proseParagraphs (findings.go:147) calls stripFindingsBlock(out) and then extractStructuredFindings's caller path effectively walks out a few times (findingsBlock split, stripFindingsBlock split, a regex pass), and the loop at findings.go:155 does strings.Count(prose[:loc[0]], "\n") per match → O(n·m). But n is a few-KB review string and m is bounded by the count of path:line references — this executes once at process end and is dwarfed by the surrounding network/LLM cost. Flagging it would be a micro-optimization nit, which the lens explicitly excludes.
No unbounded growth. proseParagraphs' map is keyed by file:line with first-wins dedup (findings.go:152); the findings slice is pre-sized (make([]finding, 0, len(raw)), findings.go:55) and capped at 20.

The change is efficient relative to its actual call frequency. Nothing in my lane warrants a fix.

🧯 Error handling & edge cases — Minor issues

All findings in the draft check out against the actual code. Confirmed: the verdictClean skip at emit.go:134, the empty-array → structured=true no-fallback path (extractStructuredFindings returns ([]finding{}, true) for []), strconv.Atoi("-5") succeeding into ln with no guard, the strings.Contains(t, findingsFence) fence match, and that confidence (findings.go:79) is never read by reportPayload (emit.go:83-95) or anywhere else.

VERDICT: Minor issues

Reviewing only through the error handling & edge cases lens. The code is, on the whole, defensively written — graceful fallback on missing/malformed blocks, bounds-checked paragraphAt/truncate, json.Number tolerating quoted line numbers, unterminated-block handling. Two real edge-case gaps:

emit.go:140-148 — an empty [] block on a non-clean lens silently drops all telemetry, defeating the fallback safety net. r.verdict comes from parseVerdict(r.out) scraping the prose ("blocking issue" / "minor issue" — consolidate.go:33-53), while fs/structured come from the JSON block. These are independent sources and can disagree. If a model writes "Blocking issues found" in prose (verdict = non-clean, so it passes the verdictClean skip at line 134) but emits an empty gadfly-findings array, then structured == true, fs is empty, and the if !structured fallback to parseFindings is not taken. Result: a lens that self-reports issues contributes zero findings to the store — whereas before this PR, parseFindings would have recovered them from the prose path:line references. The empty-array case is explicitly treated as ok=true (findings.go:36-39 + TestExtractStructuredFindingsEmptyArray), so the heuristic is suppressed exactly when the verdict insists there's something to report. Suggested fix: when structured && len(fs) == 0 && r.verdict != verdictClean, fall back to parseFindings (treat a non-clean verdict with an empty block as "model under-reported", not "nothing found").
findings.go:61-63 / 69-72 — no validation of the parsed line number; negative/zero lines pass through. strconv.Atoi("-5") succeeds, so a malformed/hallucinated "line": -5 becomes ln = -5 and is emitted verbatim (and used as a map key for prose-detail lookup at line 67). The heuristic path can't produce this because pathLineRe only matches \d+. Low impact today, but Phase 3 inline-PR-review anchoring will use these as coordinates, where a negative/zero line is bad. Suggested fix: clamp/guard ln < 0 (or drop the line, keeping the finding with line = 0 as "file-level"), consistent with how the regex path can never go negative.

Minor / not worth changing, noted for completeness:

findingsBlock/stripFindingsBlock match the fence with strings.Contains(t, findingsFence) (findings.go:96, 126), so any fence whose info-string merely contains gadfly-findings (e.g. a model quoting the contract example from system-prompt.txt inside a ``` block) would be parsed/stripped. Unlikely in practice; flagging only because it's the kind of input a weak model could produce.
Outside my lens: confidence is extracted and normalized (findings.go:79) but reportPayload (emit.go:83-95) has no confidence field and nothing else reads the struct field, so it's silently discarded at emit time; the per-finding confidence the PR adds never reaches the store.

_{Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 3m 18s}

### 🪰 Gadfly review — `claude-code/opus` (claude-code) **Verdict: Blocking issues found** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary>🔒 Security — No material issues found</summary> ## VERDICT: No material issues found I reviewed this change strictly through the **🔒 Security** lens, tracing how the newly-parsed data flows from untrusted source (LLM lens output, itself influenced by the attacker-controlled PR diff) to its sinks. **What I verified:** - **Sink trace — no dangerous use of parsed fields.** `extractStructuredFindings` produces `finding{file,line,title,detail,severity,confidence}`. I grepped every consumer of those fields (`emit.go:140`, `consolidate.go:105`) and confirmed they reach only two places: - `emit.go:150-162` → a `reportPayload` that is sent via `postJSON` using `json.Marshal` (`emit.go:382`). Encoding is structured, so model-supplied strings (`file`, `title`, `detail`) cannot break out of the JSON body or inject headers. The Bearer token comes from `GADFLY_FINDINGS_TOKEN` env (`emit.go:167`) — unchanged, not leaked. - `consolidate.go:105` → `stripFindingsBlock` *removes* the block before render, so the JSON is never emitted into the comment. No new markdown/XSS surface. - I confirmed via grep that the only `os.Open`/`os.ReadFile`/`exec` sites (`tools.go`, `engine.go`, `specialists.go`, `main.go`) do **not** consume any structured-finding field — so the model-supplied `file` string is **never** used as a filesystem path. No path traversal. - **Deserialization is safe.** `json.Unmarshal` into a typed `[]structuredFinding` with no custom `UnmarshalJSON` and `Line json.Number` (`findings.go:49-52`) — standard stdlib decoding, no gadget/polymorphic risk. Malformed JSON returns `ok=false` and falls back gracefully. - **No SSRF / new outbound targets.** The POST URL is still derived solely from `GADFLY_FINDINGS_URL` (`emit.go:102`); the structured block introduces no model-controlled URL. - **Insecure-default / authz posture unchanged.** Telemetry stays opt-in (off unless `GADFLY_FINDINGS_URL` set), advisory, and the `verdictClean` guard (`emit.go:134`) still gates emission regardless of which extractor runs. **Minor note (not a finding, flagging for completeness):** unlike the heuristic path (constrained by `pathLineRe`), the structured `file` field accepts any non-empty string and is **not** length-capped (`findings.go:57-58`; only `title`/`detail` are truncated). `maxFindingsPerLens` still caps count at 20, and the value is JSON-encoded, so this is at most a telemetry-data-quality concern for the downstream store — not an injection or sink vulnerability in this codebase. No action needed within Gadfly's security boundary. Nothing material in my lane. </details> <details><summary>🎯 Correctness — Blocking issues found</summary> I've verified all three findings against the actual source in `findings.go`, `emit.go`, and `consolidate.go`. All three are factually accurate against the code: 1. `normalizeSeverity` (findings.go:170-172): confirmed `"minor"` → `medium` while `"minor-issue"` → `small` and `"low"` → `small`, so "minor" outranks "low" and splits on hyphenation. 2. emit.go:144-148: confirmed the heuristic fallback writes `r.verdict.label()` ("Blocking issues found"/"Minor issues") into `RawSeverity`, vs canonical words from the structured path. 3. emit.go:140-143 + findings.go:44-52: confirmed `[]` makes `extractStructuredFindings` return `ok=true`, suppressing the `parseFindings` fallback, reachable only for non-clean verdicts. All three survive verification. --- **VERDICT: Minor issues** - **`normalizeSeverity` maps "minor" → `medium`, which both over-inflates and contradicts the table's own neighbors** (`cmd/gadfly/findings.go:170-173`). The canonical ladder is `critical > high > medium > small > trivial`. "minor" is colloquially low-priority, yet it's mapped to the *middle* rung (`medium`), while the near-identical `"minor-issue"` maps to `small` (one rung lower) and `"nit"`/`"low"` map to `trivial`/`small`. So the same concept ("minor") lands on two different severities depending on hyphenation, and "minor" outranks "low" — internally inconsistent. Verified by reading the switch: `case "medium", "moderate", "minor"` vs `case "small", "low", "minor-issue"`. This only bites when a model ignores the 5-word contract and emits "minor", but the telemetry severity will then be wrong/inconsistent. Suggested fix: map `"minor"` to `small` (alongside `"minor-issue"`), or drop it from the synonym set so it falls through as the raw word rather than being silently promoted to `medium`. - **`raw_severity` now carries two incompatible vocabularies depending on which path produced the finding** (`cmd/gadfly/emit.go:144-149`). Structured findings emit canonical severities (`critical/high/.../trivial`); the heuristic fallback emits `lensSev = r.verdict.label()`, i.e. the literal verdict strings `"Blocking issues found"` / `"Minor issues"`. So any model that doesn't emit a parseable block still writes verdict-label strings into the same `RawSeverity` field. Verified via `verdict.label()` at `consolidate.go:18-29`. The fallback vocabulary is pre-existing, but mixing it with the new canonical set in one field means downstream grading must special-case both. Worth at least normalizing the fallback to a canonical bucket (e.g. blocking→`high`/`critical`, minor→`small`) so the field has one vocabulary. - **An empty `gadfly-findings` array suppresses the heuristic fallback even when the prose clearly lists findings** (`cmd/gadfly/emit.go:140-143`, `findings.go:44-85`). `extractStructuredFindings` returns `ok=true` for `[]`, so `parseFindings` is never consulted. This is reached only for non-clean verdicts (the `verdictClean` guard at `emit.go:134` already `continue`s). So a model that writes a "Blocking issues found" prose review with `path:line` findings but emits a `[]` block (e.g. a recheck that regenerates the block incorrectly) will contribute **zero** findings to telemetry while the human comment still shows the issues. This is the documented "empty array = nothing found" design, but it trusts the JSON block over the verdict; a defensive option is to fall back to the scrape when the array is empty *and* the verdict is non-clean. Low likelihood, real impact on the telemetry signal this PR exists to improve. </details> <details><summary>🧹 Code cleanliness & maintainability — Minor issues</summary> I've verified each finding against the actual source. Let me record my conclusions. Findings 1–4 are confirmed by reading the real code: - `proseParagraphs` (findings.go:146-159) and `parseFindings` (emit.go:194-217) genuinely share the same regex-scan / dedupe / `paragraphAt` idiom. - The emit.go:13-17 header still describes heuristic-only extraction, which `emit()` (emit.go:137-143) now contradicts by preferring the structured block. - findings_test.go:104-111 hand-rolls `contains` while importing only `testing`. - The open-fence predicate is duplicated at findings.go:96 and :126. Finding 5 (the key-mismatch aside) I'm dropping: `detail` is explicitly an optional field, a key mismatch only leaves it empty (the same as when the model omits it), and the claim rests on speculation about model output formatting ("likely a full path", "often a bare") that I cannot positively confirm. It describes tolerated degradation, not a defect. --- **VERDICT: Minor issues** Reviewing only through the **code cleanliness & maintainability** lens. ### Findings - **Copy-pasted scan loop: `proseParagraphs` duplicates the core of `parseFindings`** — `cmd/gadfly/findings.go:146` vs `cmd/gadfly/emit.go:194`. Both iterate `pathLineRe.FindAllStringSubmatchIndex(...)`, build the same `file + ":" + line` key, dedupe on first occurrence, recompute the match's line index with `strings.Count(...[:loc[0]], "\n")`, and call `paragraphAt(lines, li)`. Only the per-match action differs (build a `finding` vs map key→paragraph). A future change to the path:line anchoring or paragraph logic must now be made in two places. Suggested fix: extract a helper that yields `(file, lineStr, lineIndex)` tuples and have both call sites use it. - **Stale file-level doc comment in `emit.go`** — `cmd/gadfly/emit.go:13-17`. The header still states findings are extracted "heuristically … This is best-effort signal … *not a structured contract the reviewer guarantees.*" After this PR, `emit()` *prefers* the structured `gadfly-findings` contract and only falls back to the heuristic (`emit.go:137-143`). The package's introductory comment now contradicts the behavior. Suggested fix: update the comment to describe the structured-block-first / heuristic-fallback flow and cross-reference `findings.go`. - **Test reimplements `strings.Contains`** — `cmd/gadfly/findings_test.go:104-111`. The hand-rolled `contains` (and `containsFence` wrapper) duplicate stdlib `strings.Contains` purely to avoid importing `strings`. The rest of the package uses `strings` freely, so this diverges from the surrounding style for no benefit. Suggested fix: `import "strings"` and call `strings.Contains` directly. ### Minor / optional - **Repeated open-fence predicate** — the expression `strings.HasPrefix(t, "```") && strings.Contains(t, findingsFence)` appears in both `findingsBlock` (`findings.go:96`) and `stripFindingsBlock` (`findings.go:126`). Pairs naturally with the existing `isFenceClose` helper; consider an `isFindingsFenceOpen(line string) bool` for symmetry and single-point maintenance. Low churn, optional. </details> <details><summary>⚡ Performance — No material issues found</summary> I've verified the full call graph through the performance lens. Here's my review. --- **VERDICT: No material issues found** I reviewed the new structured-findings path specifically for efficiency regressions (allocations, hot loops, N+1, unbounded growth) and verified call frequency by reading `emit.go`, `findings.go`, `consolidate.go`, and `main.go:212`. Findings: - **No hot path involved.** `extractStructuredFindings` (`emit.go:140`) and `stripFindingsBlock` (`consolidate.go:105`) each run exactly **once per specialist result** — and `results` is `lenses × models` (single/low-double digits). Both run at the very end of a review, after minutes of LLM inference per lens. Nothing here is in a loop over diff hunks, files, or findings-at-scale. - **Telemetry stays at two POSTs total, not per-finding.** Verified `emit` (`emit.go:166-176`): findings are accumulated into one `reports` slice and POSTed in a single `/reports` call, plus one `/runs` call. The new per-finding severity work adds only field copies, no extra requests. Output is already capped at `maxFindingsPerLens = 20` (`emit.go:81`, `:214`), so payload growth is bounded. - **The redundant work I found is immaterial.** `proseParagraphs` (`findings.go:147`) calls `stripFindingsBlock(out)` and then `extractStructuredFindings`'s caller path effectively walks `out` a few times (findingsBlock split, stripFindingsBlock split, a regex pass), and the loop at `findings.go:155` does `strings.Count(prose[:loc[0]], "\n")` per match → O(n·m). But `n` is a few-KB review string and `m` is bounded by the count of `path:line` references — this executes once at process end and is dwarfed by the surrounding network/LLM cost. Flagging it would be a micro-optimization nit, which the lens explicitly excludes. - **No unbounded growth.** `proseParagraphs`' map is keyed by `file:line` with first-wins dedup (`findings.go:152`); the findings slice is pre-sized (`make([]finding, 0, len(raw))`, `findings.go:55`) and capped at 20. The change is efficient relative to its actual call frequency. Nothing in my lane warrants a fix. </details> <details><summary>🧯 Error handling & edge cases — Minor issues</summary> All findings in the draft check out against the actual code. Confirmed: the `verdictClean` skip at emit.go:134, the empty-array → `structured=true` no-fallback path (extractStructuredFindings returns `([]finding{}, true)` for `[]`), `strconv.Atoi("-5")` succeeding into `ln` with no guard, the `strings.Contains(t, findingsFence)` fence match, and that `confidence` (findings.go:79) is never read by `reportPayload` (emit.go:83-95) or anywhere else. ## VERDICT: Minor issues Reviewing only through the **error handling & edge cases** lens. The code is, on the whole, defensively written — graceful fallback on missing/malformed blocks, bounds-checked `paragraphAt`/`truncate`, `json.Number` tolerating quoted line numbers, unterminated-block handling. Two real edge-case gaps: - **`emit.go:140-148` — an empty `[]` block on a non-clean lens silently drops all telemetry, defeating the fallback safety net.** `r.verdict` comes from `parseVerdict(r.out)` scraping the *prose* ("blocking issue" / "minor issue" — consolidate.go:33-53), while `fs`/`structured` come from the *JSON block*. These are independent sources and can disagree. If a model writes "Blocking issues found" in prose (verdict = non-clean, so it passes the `verdictClean` skip at line 134) but emits an empty `gadfly-findings` array, then `structured == true`, `fs` is empty, and the `if !structured` fallback to `parseFindings` is **not** taken. Result: a lens that self-reports issues contributes **zero** findings to the store — whereas before this PR, `parseFindings` would have recovered them from the prose `path:line` references. The empty-array case is explicitly treated as `ok=true` (findings.go:36-39 + `TestExtractStructuredFindingsEmptyArray`), so the heuristic is suppressed exactly when the verdict insists there's something to report. Suggested fix: when `structured && len(fs) == 0 && r.verdict != verdictClean`, fall back to `parseFindings` (treat a non-clean verdict with an empty block as "model under-reported", not "nothing found"). - **`findings.go:61-63 / 69-72` — no validation of the parsed line number; negative/zero lines pass through.** `strconv.Atoi("-5")` succeeds, so a malformed/hallucinated `"line": -5` becomes `ln = -5` and is emitted verbatim (and used as a map key for prose-detail lookup at line 67). The heuristic path can't produce this because `pathLineRe` only matches `\d+`. Low impact today, but Phase 3 inline-PR-review anchoring will use these as coordinates, where a negative/zero line is bad. Suggested fix: clamp/guard `ln < 0` (or drop the line, keeping the finding with `line = 0` as "file-level"), consistent with how the regex path can never go negative. Minor / not worth changing, noted for completeness: - `findingsBlock`/`stripFindingsBlock` match the fence with `strings.Contains(t, findingsFence)` (findings.go:96, 126), so any fence whose info-string merely *contains* `gadfly-findings` (e.g. a model quoting the contract example from system-prompt.txt inside a ``` block) would be parsed/stripped. Unlikely in practice; flagging only because it's the kind of input a weak model could produce. - **Outside my lens:** `confidence` is extracted and normalized (findings.go:79) but `reportPayload` (emit.go:83-95) has no confidence field and nothing else reads the struct field, so it's silently discarded at emit time; the per-finding confidence the PR adds never reaches the store. </details> Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 3m 18s

gitea-actions bot commented

2026-06-28 22:03:18 +00:00

🪰 Gadfly review — `minimax-m3:cloud` (ollama-cloud)

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

VERDICT: No material issues found

Security assessment of this PR (structured findings contract):

JSON parsing of untrusted model output (findings.go:50, findings.go:25): json.Unmarshal into a struct of string/json.Number fields. No interface{}, no executable payload, no prototype-pollution equivalent in Go. A malformed block cleanly returns ok=false and the caller falls back to the heuristic (emit.go:140-143). Safe.
Path-injection / SSRF in telemetry POST: GADFLY_FINDINGS_URL is operator-controlled (pre-existing) and is the only outbound target. The new f.file / f.title / f.detail fields are JSON-marshaled via Go's stdlib encoder into reportPayload (emit.go:150-162), so no string ever lands raw inside HTTP headers or a URL path. No new SSRF surface introduced by this PR.
Regex DoS: pathLineRe (emit.go:47) is [A-Za-z0-9_./-]+\.[A-Za-z0-9]+:\d+ — single linear pass, no nested quantifiers or alternation. No ReDoS.
Resource bounds on attacker-influenced data: maxFindingsPerLens = 20 (pre-existing, emit.go:39) and truncate(detail, 500) (findings.go:77, emit.go:336) cap what a hostile/poisoned model output can contribute to the POST payload. The map built by proseParagraphs is bounded by the regex match count on the model's output, which is already bounded by the model's own token limit. No unbounded memory growth.
Information disclosure: The prose detail field is borrowed from the model's own markdown (findings.go:67) — it is content the model already produced and the heuristic path was already posting the same paragraph text via parseFindings (emit.go:212). The structured path does not widen the disclosure surface; it just carries severity/confidence per-finding instead of using the lens-level verdict.
Authn on telemetry: Bearer token still attached via postJSON (emit.go:391-393); unchanged. The new code does not touch credential handling.
Prompt-injection resilience: A malicious PR could attempt to make the model emit a gadfly-findings block referencing attacker-chosen paths. f.file is a string that ends up only in the JSON body — never in a filesystem path, shell command, URL host, or template. No code in the new diff interprets it as anything other than data. f.line is parsed as int and likewise only used as a number. No injection sink is reachable.
stripFindingsBlock correctness for the rendered comment (consolidate.go:105, findings.go:115-133): The fenced block is removed before being rendered to the PR comment, so a poisoned-looking JSON block (e.g. one that looks like a fake "verdict" or tries to manipulate downstream tooling) is invisible to humans and never reaches the rendered markdown. Good.

Nothing in this PR introduces an authn/authz gap, injection sink, SSRF, unsafe deserialization, secret leakage, missing input validation, or insecure default.

🎯 Correctness — Minor issues

VERDICT: Minor issues

Severity synonym table is internally inconsistent — cmd/gadfly/findings.go:170-172. "minor" maps to "medium" (line 170) but the hyphenated "minor-issue" maps to "small" (line 172). A model that emits "severity":"minor" gets "medium", while "minor-issue" gets "small" — opposite ends of the scale for the same semantic intent. TestNormalizeSeverity at findings_test.go:90 explicitly pins "minor"→"medium", so this is a deliberate-but-questionable choice that's now baked in. Impact: per-finding severity bucket is silently inconsistent.
Finding with no usable title or detail is still emitted — cmd/gadfly/findings.go:65-80. The only filter on the structured path is file == "" (line 58). When the JSON omits both title and detail and the prose has no matching path:line, detail stays "", then title = truncate("", 120) == "", and the finding is appended at line 73 with empty title and detail. Telemetry then POSTs a reportPayload with empty Title and Detail; the store will store it. Impact: phantom findings in the grading store with no human-readable anchor. Fix: drop the finding (mirror the file-empty branch at line 58) or require non-empty title before emitting.
ln=0 fallthrough produces a phantom key — cmd/gadfly/findings.go:61-67. When sf.Line doesn't parse (e.g. "line":"abc"), ln stays 0 (line 62 silently swallows the error). The prose-lookup key at line 67 becomes "file.go:0", which never matches prose, and the finding is emitted with line=0. Impact: a finding pointing at line 0 — not a real source line — is recorded. Fix: continue when Line fails to parse, mirroring the file-empty branch.
findingsBlock / stripFindingsBlock do substring matching on the info-string — cmd/gadfly/findings.go:96 and cmd/gadfly/findings.go:126. strings.Contains(t, findingsFence) matches any fence whose info-string merely contains "gadfly-findings", e.g. ```gadfly-findings-v2 or ```some-gadfly-findings-thing. Wider match than the contract documents. Fix: require the info-string to equal findingsFence (modulo leading whitespace), or at least be a token boundary.
isFenceClose accepts 4+ backticks as a close fence even after a 3-backtick open — cmd/gadfly/findings.go:137-140. strings.TrimLeft(t, "")` strips all leading backticks, so "````" passes as a valid close. CommonMark requires the close fence to match the open fence in length, so a 4-backtick close after a 3-backtick open is not a valid close and the block should be treated as unterminated. Low-impact for real models but worth tightening if this is ever exposed to adversarial input.

🧹 Code cleanliness & maintainability — Minor issues

VERDICT: Minor issues

Findings:

cmd/gadfly/findings_test.go:104-110 — contains reinvents strings.Contains. The new test file defines its own substring helper (contains) that duplicates Go stdlib strings.Contains. The codebase already uses strings.Contains in 20+ callsites (including findings.go itself at lines 96 and 126), so this is gratuitous. Suggested fix: replace contains(s, sub) with strings.Contains(s, sub) and add "strings" to the existing import "testing" block. The package isn't currently imported in this test file, so the import block does need to grow — that's a one-line edit.
cmd/gadfly/findings_test.go:102 — containsFence is a single-use one-liner. containsFence is only called once (line 70, inside TestStripFindingsBlock). Even if you keep contains for some reason, containsFence adds no value beyond strings.Contains(s, findingsFence) — which is what the call site would become after the fix above. Inlining it would remove the helper entirely.
cmd/gadfly/findings.go:131 — stripFindingsBlock's docstring claims "verbatim" but the function trims trailing newlines. The function comment says "Other content is preserved verbatim," but return strings.TrimRight(strings.Join(kept, "\n"), "\n") strips trailing newlines from the joined output unconditionally. TestStripFindingsBlockNoBlock (line 81) passes only because its input happens to have no trailing newline; for inputs that do, the output is shorter than the input would suggest. This is fine, but the comment should say so (e.g., "...preserved verbatim except trailing newlines are normalized away"). Low-priority doc nit.
cmd/gadfly/findings.go:115-133 — stripFindingsBlock duplicates fence-scanning logic in a structurally different style from findingsBlock. Verified: findingsBlock (lines 91-110) scans with two separate loops and returns the inner text; stripFindingsBlock scans with a single skipping boolean and keeps-or-drops. Both functions answer "where are the findings-block fences?", but any tightening of the fence grammar (e.g., requiring exactly `` ```gadfly-findings ```` with no trailing whitespace, or supporting nested fences) would need to be made in two places. A shared []fenceRange scan that returns start/end line indexes, used by both functions, would prevent drift. Low priority — not a bug today.
**cmd/gadfly/recheck.go:52-56 — + "" + concatenation in a raw string reads awkwardly.** The prompt-building lines splice literal backticks into backtick-delimited raw strings via + "" + . This is the standard idiom for embedding a backtick in a Go raw string (raw strings can't contain backticks), so it works — but the first + "" + at line 52 is a sharp edge for anyone unfamiliar with the trick. Adding a one-line comment at that site (" + "" + splices a literal backtick into the raw string") would help future readers. Same pattern appears in scripts/run.sh, so this isn't a new pattern in the codebase, just a new instance.
examples/reusable.yml:14-21 and :53-54 — documentation churn bundled with the structured-findings contract change. Verified by reading the diff: the reusable.yml comment rewrite (changing @v1 pinning guidance to talk about act_runner ref caching) has nothing to do with the new `` ```gadfly-findings ```` contract in cmd/gadfly/findings.go, cmd/gadfly/emit.go, cmd/gadfly/consolidate.go, or scripts/system-prompt.txt. Both edits are fine on their merits, but mixing them in one PR makes bisecting either change harder. Process/cleanliness, not a code defect — would suggest splitting.

[
  {"file": "cmd/gadfly/findings_test.go", "line": 104, "severity": "trivial", "confidence": "high", "title": "local `contains` helper duplicates strings.Contains"},
  {"file": "cmd/gadfly/findings_test.go", "line": 102, "severity": "trivial", "confidence": "high", "title": "single-use `containsFence` helper should be inlined"},
  {"file": "cmd/gadfly/findings.go", "line": 131, "severity": "trivial", "confidence": "medium", "title": "stripFindingsBlock docstring says 'verbatim' but TrimRight strips trailing newlines"},
  {"file": "cmd/gadfly/findings.go", "line": 115, "severity": "small", "confidence": "medium", "title": "findingsBlock and stripFindingsBlock duplicate fence-scanning logic in two different styles"},
  {"file": "cmd/gadfly/recheck.go", "line": 52, "severity": "trivial", "confidence": "high", "title": "` + \"`\" + ` raw-string splice for literal backtick would benefit from a one-line comment"},
  {"file": "examples/reusable.yml", "line": 14, "severity": "trivial", "confidence": "high", "title": "unrelated docs churn (act_runner pinning guidance) bundled with the findings contract change"}
]

⚡ Performance — Minor issues

VERDICT: Minor issues

Findings (through the Performance lens):

cmd/gadfly/findings.go:155 — avoidable O(N·M) per-lens scan in proseParagraphs. The line li := strings.Count(prose[:loc[0]], "\n") is invoked inside the pathLineRe match loop. Each call re-scans from the start of the prose to the current match byte offset, so the total work is O(N · M) where N = prose bytes and M = number of path:line matches. A single forward pass that tracks the running newline count (advance on each match) brings this to O(N). Verified against findings.go:146–158.
cmd/gadfly/findings.go:54,147 — proseParagraphs (and therefore stripFindingsBlock + the full regex sweep) runs unconditionally, even when every structured finding already carries a detail. extractStructuredFindings parses the JSON first, then builds the prose map before iterating raw. When every sf.Detail is non-empty the prose[file+":"+ln] lookup on line 67 is never hit, yet stripFindingsBlock and pathLineRe.FindAllStringSubmatchIndex over the whole output already ran. On a long adversarial-review output this is pure waste on the per-lens hot path. Lazily build the map on first miss, or skip the call when sf.Detail != "" for every element of raw.
cmd/gadfly/findings.go:73–83 — work done for a finding that will be dropped at the cap. The full finding struct (including truncate(detail, 500) on the possibly-borrowed prose) is built before the len(findings) >= maxFindingsPerLens check. Cheap individually, but combined with finding #2 it means the cap doesn't short-circuit the prose/enrichment work; reordering to check the cap first would let the hot loop exit earlier.
cmd/gadfly/consolidate.go:105 + cmd/gadfly/findings.go:147 — stripFindingsBlock(r.out) runs twice for the same lens output. renderConsolidated strips for the rendered comment, and the subsequent emit → extractStructuredFindings → proseParagraphs strips again on the same string. A single pre-computed stripped view (e.g. computed in specialistResult once and reused) would halve the fence-line scan. Minor in absolute terms; flagging only because it lands on the per-lens hot path and is trivially cacheable.

Nothing in this review touches correctness, severity semantics, or the prose-rendering path — those are out of my lane and left to other reviewers.

[
  {"file": "cmd/gadfly/findings.go", "line": 155, "severity": "small", "confidence": "high", "title": "O(N*M) newline scan in proseParagraphs"},
  {"file": "cmd/gadfly/findings.go", "line": 54, "severity": "small", "confidence": "high", "title": "proseParagraphs built unconditionally even when all details present"},
  {"file": "cmd/gadfly/findings.go", "line": 73, "severity": "trivial", "confidence": "medium", "title": "cap check runs after per-finding struct build"},
  {"file": "cmd/gadfly/consolidate.go", "line": 105, "severity": "trivial", "confidence": "high", "title": "stripFindingsBlock runs twice per lens output"}
]

🧯 Error handling & edge cases — Minor issues

VERDICT: Minor issues

Findings through the error-handling / edge-case lens (all verified by reading the diff and the surrounding code in cmd/gadfly/):

cmd/gadfly/emit.go:144 — heuristic fallback writes a non-bucket severity. r.verdict.label() returns human phrases ("No material issues found", "Minor issues", "Blocking issues found", "Reviewed") — see consolidate.go:18-29. These get written verbatim into RawSeverity whenever f.severity == "". The structured path normalizes via normalizeSeverity into the canonical critical/high/medium/small/trivial set, but the fallback path bypasses that, so a lens that omits the new block emits "Minor issues" into a field whose contract is the 5-bucket set. Suggested fix: map verdictBlocking→"critical", verdictMinor→"small", verdictClean→"trivial" (clean lenses are skipped above so the last is academic), verdictUnknown→"" (or whatever the store treats as unknown) instead of calling .label().
**cmd/gadfly/findings.go:115 — unterminated \``gadfly-findingsblock silently swallows the rest of the comment.**findingsBlockcorrectly returnsok=falsefor an unterminated block, butstripFindingsBlockjust setsskipping = trueand never resets it without seeing anisFenceClose. The model is asked to terminate the block, but findings_test.goshowsstripFindingsBlockisn't even exercised for the unterminated case. A model that opens the fence and forgets the closing fence will have its entire tail (verdict line, sign-off, anything after) silently dropped from the rendered consolidated comment. Suggested fix: in theskippingbranch, ifisFenceClose(ln)is false AND we hit EOF insideskipping, also drop the original opening fence line so the prose around it isn't accidentally kept; or cap skippingto the same logicfindingsBlock` uses so an unterminated opener is left in place rather than eating the rest.
cmd/gadfly/findings.go:44 — silent finding loss on a structured block whose entries all lack file. Every entry that fails the file == "" guard is continue'd, but the function still returns ok=true. The caller in emit.go:141-142 then never falls back to the heuristic scrape, so any path:line finding the model wrote in prose is lost. This contradicts the PR's "graceful fallback" framing — the fallback is only taken when the JSON is unparseable, not when every JSON entry is empty/invalid. Suggested fix: if the raw slice was non-empty but every entry was dropped, return ok=false so the caller runs parseFindings.
cmd/gadfly/findings.go:62 — line of null or a negative number silently becomes 0. json.Number.String() for null is the string "null", strconv.Atoi("null") errors, so ln stays at the zero value 0. Same for non-numeric strings. A 0 line is then written to reports and looks like a real hit at line 0. Suggested fix: on err != nil, either continue (drop the entry) or return an error so the caller falls back; at minimum, reject ln <= 0 before accepting the finding.

Verified by reading findings.go, emit.go, consolidate.go, and findings_test.go directly; no claims rely on the diff alone.

_{Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 25s}

### 🪰 Gadfly review — `minimax-m3:cloud` (ollama-cloud) **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary>🔒 Security — No material issues found</summary> ## VERDICT: No material issues found Security assessment of this PR (structured findings contract): - **JSON parsing of untrusted model output** (`findings.go:50`, `findings.go:25`): `json.Unmarshal` into a struct of `string`/`json.Number` fields. No `interface{}`, no executable payload, no prototype-pollution equivalent in Go. A malformed block cleanly returns `ok=false` and the caller falls back to the heuristic (`emit.go:140-143`). Safe. - **Path-injection / SSRF in telemetry POST**: `GADFLY_FINDINGS_URL` is operator-controlled (pre-existing) and is the only outbound target. The new `f.file` / `f.title` / `f.detail` fields are JSON-marshaled via Go's stdlib encoder into `reportPayload` (`emit.go:150-162`), so no string ever lands raw inside HTTP headers or a URL path. No new SSRF surface introduced by this PR. - **Regex DoS**: `pathLineRe` (`emit.go:47`) is `[A-Za-z0-9_./-]+\.[A-Za-z0-9]+:\d+` — single linear pass, no nested quantifiers or alternation. No ReDoS. - **Resource bounds on attacker-influenced data**: `maxFindingsPerLens = 20` (pre-existing, `emit.go:39`) and `truncate(detail, 500)` (`findings.go:77`, `emit.go:336`) cap what a hostile/poisoned model output can contribute to the POST payload. The map built by `proseParagraphs` is bounded by the regex match count on the model's output, which is already bounded by the model's own token limit. No unbounded memory growth. - **Information disclosure**: The prose `detail` field is borrowed from the model's own markdown (`findings.go:67`) — it is content the model already produced and the heuristic path was already posting the same paragraph text via `parseFindings` (`emit.go:212`). The structured path does not widen the disclosure surface; it just carries `severity`/`confidence` per-finding instead of using the lens-level verdict. - **Authn on telemetry**: Bearer token still attached via `postJSON` (`emit.go:391-393`); unchanged. The new code does not touch credential handling. - **Prompt-injection resilience**: A malicious PR could attempt to make the model emit a `gadfly-findings` block referencing attacker-chosen paths. `f.file` is a string that ends up only in the JSON body — never in a filesystem path, shell command, URL host, or template. No code in the new diff interprets it as anything other than data. `f.line` is parsed as `int` and likewise only used as a number. No injection sink is reachable. - **`stripFindingsBlock` correctness for the rendered comment** (`consolidate.go:105`, `findings.go:115-133`): The fenced block is removed before being rendered to the PR comment, so a poisoned-looking JSON block (e.g. one that looks like a fake "verdict" or tries to manipulate downstream tooling) is invisible to humans and never reaches the rendered markdown. Good. Nothing in this PR introduces an authn/authz gap, injection sink, SSRF, unsafe deserialization, secret leakage, missing input validation, or insecure default. </details> <details><summary>🎯 Correctness — Minor issues</summary> ## VERDICT: Minor issues - **Severity synonym table is internally inconsistent** — `cmd/gadfly/findings.go:170-172`. `"minor"` maps to `"medium"` (line 170) but the hyphenated `"minor-issue"` maps to `"small"` (line 172). A model that emits `"severity":"minor"` gets `"medium"`, while `"minor-issue"` gets `"small"` — opposite ends of the scale for the same semantic intent. `TestNormalizeSeverity` at `findings_test.go:90` explicitly pins `"minor"→"medium"`, so this is a deliberate-but-questionable choice that's now baked in. Impact: per-finding severity bucket is silently inconsistent. - **Finding with no usable title or detail is still emitted** — `cmd/gadfly/findings.go:65-80`. The only filter on the structured path is `file == ""` (line 58). When the JSON omits both `title` and `detail` and the prose has no matching `path:line`, `detail` stays `""`, then `title = truncate("", 120) == ""`, and the finding is appended at line 73 with empty title and detail. Telemetry then POSTs a `reportPayload` with empty `Title` and `Detail`; the store will store it. Impact: phantom findings in the grading store with no human-readable anchor. Fix: drop the finding (mirror the file-empty branch at line 58) or require non-empty title before emitting. - **`ln=0` fallthrough produces a phantom key** — `cmd/gadfly/findings.go:61-67`. When `sf.Line` doesn't parse (e.g. `"line":"abc"`), `ln` stays `0` (line 62 silently swallows the error). The prose-lookup key at line 67 becomes `"file.go:0"`, which never matches prose, and the finding is emitted with `line=0`. Impact: a finding pointing at line 0 — not a real source line — is recorded. Fix: `continue` when `Line` fails to parse, mirroring the file-empty branch. - **`findingsBlock` / `stripFindingsBlock` do substring matching on the info-string** — `cmd/gadfly/findings.go:96` and `cmd/gadfly/findings.go:126`. `strings.Contains(t, findingsFence)` matches any fence whose info-string merely *contains* `"gadfly-findings"`, e.g. ```` ```gadfly-findings-v2 ```` or ```` ```some-gadfly-findings-thing ````. Wider match than the contract documents. Fix: require the info-string to *equal* `findingsFence` (modulo leading whitespace), or at least be a token boundary. - **`isFenceClose` accepts 4+ backticks as a close fence even after a 3-backtick open** — `cmd/gadfly/findings.go:137-140`. `strings.TrimLeft(t, "`")` strips *all* leading backticks, so "````" passes as a valid close. CommonMark requires the close fence to match the open fence in length, so a 4-backtick close after a 3-backtick open is not a valid close and the block should be treated as unterminated. Low-impact for real models but worth tightening if this is ever exposed to adversarial input. </details> <details><summary>🧹 Code cleanliness & maintainability — Minor issues</summary> **VERDICT: Minor issues** Findings: - **`cmd/gadfly/findings_test.go:104-110` — `contains` reinvents `strings.Contains`.** The new test file defines its own substring helper (`contains`) that duplicates Go stdlib `strings.Contains`. The codebase already uses `strings.Contains` in 20+ callsites (including `findings.go` itself at lines 96 and 126), so this is gratuitous. Suggested fix: replace `contains(s, sub)` with `strings.Contains(s, sub)` and add `"strings"` to the existing `import "testing"` block. The package isn't currently imported in this test file, so the import block does need to grow — that's a one-line edit. - **`cmd/gadfly/findings_test.go:102` — `containsFence` is a single-use one-liner.** `containsFence` is only called once (line 70, inside `TestStripFindingsBlock`). Even if you keep `contains` for some reason, `containsFence` adds no value beyond `strings.Contains(s, findingsFence)` — which is what the call site would become after the fix above. Inlining it would remove the helper entirely. - **`cmd/gadfly/findings.go:131` — `stripFindingsBlock`'s docstring claims "verbatim" but the function trims trailing newlines.** The function comment says "Other content is preserved verbatim," but `return strings.TrimRight(strings.Join(kept, "\n"), "\n")` strips trailing newlines from the joined output unconditionally. `TestStripFindingsBlockNoBlock` (line 81) passes only because its input happens to have no trailing newline; for inputs that do, the output is shorter than the input would suggest. This is fine, but the comment should say so (e.g., "...preserved verbatim except trailing newlines are normalized away"). Low-priority doc nit. - **`cmd/gadfly/findings.go:115-133` — `stripFindingsBlock` duplicates fence-scanning logic in a structurally different style from `findingsBlock`.** Verified: `findingsBlock` (lines 91-110) scans with two separate loops and returns the inner text; `stripFindingsBlock` scans with a single `skipping` boolean and keeps-or-drops. Both functions answer "where are the findings-block fences?", but any tightening of the fence grammar (e.g., requiring exactly `` ```gadfly-findings ```` with no trailing whitespace, or supporting nested fences) would need to be made in two places. A shared `[]fenceRange` scan that returns start/end line indexes, used by both functions, would prevent drift. Low priority — not a bug today. - **`cmd/gadfly/recheck.go:52-56` — ` + "`" + ` concatenation in a raw string reads awkwardly.** The prompt-building lines splice literal backticks into backtick-delimited raw strings via ` + "`" + `. This is the standard idiom for embedding a backtick in a Go raw string (raw strings can't contain backticks), so it works — but the first ` + "`" + ` at line 52 is a sharp edge for anyone unfamiliar with the trick. Adding a one-line comment at that site ("` + "`" + ` splices a literal backtick into the raw string") would help future readers. Same pattern appears in `scripts/run.sh`, so this isn't a new pattern in the codebase, just a new instance. - **`examples/reusable.yml:14-21` and `:53-54` — documentation churn bundled with the structured-findings contract change.** Verified by reading the diff: the reusable.yml comment rewrite (changing `@v1` pinning guidance to talk about `act_runner` ref caching) has nothing to do with the new `` ```gadfly-findings ```` contract in `cmd/gadfly/findings.go`, `cmd/gadfly/emit.go`, `cmd/gadfly/consolidate.go`, or `scripts/system-prompt.txt`. Both edits are fine on their merits, but mixing them in one PR makes bisecting either change harder. Process/cleanliness, not a code defect — would suggest splitting. ```gadfly-findings [ {"file": "cmd/gadfly/findings_test.go", "line": 104, "severity": "trivial", "confidence": "high", "title": "local `contains` helper duplicates strings.Contains"}, {"file": "cmd/gadfly/findings_test.go", "line": 102, "severity": "trivial", "confidence": "high", "title": "single-use `containsFence` helper should be inlined"}, {"file": "cmd/gadfly/findings.go", "line": 131, "severity": "trivial", "confidence": "medium", "title": "stripFindingsBlock docstring says 'verbatim' but TrimRight strips trailing newlines"}, {"file": "cmd/gadfly/findings.go", "line": 115, "severity": "small", "confidence": "medium", "title": "findingsBlock and stripFindingsBlock duplicate fence-scanning logic in two different styles"}, {"file": "cmd/gadfly/recheck.go", "line": 52, "severity": "trivial", "confidence": "high", "title": "` + \"`\" + ` raw-string splice for literal backtick would benefit from a one-line comment"}, {"file": "examples/reusable.yml", "line": 14, "severity": "trivial", "confidence": "high", "title": "unrelated docs churn (act_runner pinning guidance) bundled with the findings contract change"} ] ``` </details> <details><summary>⚡ Performance — Minor issues</summary> ## VERDICT: Minor issues Findings (through the **Performance** lens): - **`cmd/gadfly/findings.go:155` — avoidable O(N·M) per-lens scan in `proseParagraphs`.** The line `li := strings.Count(prose[:loc[0]], "\n")` is invoked inside the `pathLineRe` match loop. Each call re-scans from the start of the prose to the current match byte offset, so the total work is O(N · M) where N = prose bytes and M = number of `path:line` matches. A single forward pass that tracks the running newline count (advance on each match) brings this to O(N). Verified against `findings.go:146–158`. - **`cmd/gadfly/findings.go:54,147` — `proseParagraphs` (and therefore `stripFindingsBlock` + the full regex sweep) runs unconditionally, even when every structured finding already carries a `detail`.** `extractStructuredFindings` parses the JSON first, then builds the prose map *before* iterating `raw`. When every `sf.Detail` is non-empty the `prose[file+":"+ln]` lookup on line 67 is never hit, yet `stripFindingsBlock` and `pathLineRe.FindAllStringSubmatchIndex` over the whole output already ran. On a long adversarial-review output this is pure waste on the per-lens hot path. Lazily build the map on first miss, or skip the call when `sf.Detail != ""` for every element of `raw`. - **`cmd/gadfly/findings.go:73–83` — work done for a finding that will be dropped at the cap.** The full finding struct (including `truncate(detail, 500)` on the possibly-borrowed prose) is built *before* the `len(findings) >= maxFindingsPerLens` check. Cheap individually, but combined with finding #2 it means the cap doesn't short-circuit the prose/enrichment work; reordering to check the cap first would let the hot loop exit earlier. - **`cmd/gadfly/consolidate.go:105` + `cmd/gadfly/findings.go:147` — `stripFindingsBlock(r.out)` runs twice for the same lens output.** `renderConsolidated` strips for the rendered comment, and the subsequent `emit` → `extractStructuredFindings` → `proseParagraphs` strips again on the same string. A single pre-computed stripped view (e.g. computed in `specialistResult` once and reused) would halve the fence-line scan. Minor in absolute terms; flagging only because it lands on the per-lens hot path and is trivially cacheable. Nothing in this review touches correctness, severity semantics, or the prose-rendering path — those are out of my lane and left to other reviewers. ```gadfly-findings [ {"file": "cmd/gadfly/findings.go", "line": 155, "severity": "small", "confidence": "high", "title": "O(N*M) newline scan in proseParagraphs"}, {"file": "cmd/gadfly/findings.go", "line": 54, "severity": "small", "confidence": "high", "title": "proseParagraphs built unconditionally even when all details present"}, {"file": "cmd/gadfly/findings.go", "line": 73, "severity": "trivial", "confidence": "medium", "title": "cap check runs after per-finding struct build"}, {"file": "cmd/gadfly/consolidate.go", "line": 105, "severity": "trivial", "confidence": "high", "title": "stripFindingsBlock runs twice per lens output"} ] ``` </details> <details><summary>🧯 Error handling & edge cases — Minor issues</summary> **VERDICT: Minor issues** Findings through the error-handling / edge-case lens (all verified by reading the diff and the surrounding code in `cmd/gadfly/`): - **`cmd/gadfly/emit.go:144` — heuristic fallback writes a non-bucket severity.** `r.verdict.label()` returns human phrases (`"No material issues found"`, `"Minor issues"`, `"Blocking issues found"`, `"Reviewed"`) — see `consolidate.go:18-29`. These get written verbatim into `RawSeverity` whenever `f.severity == ""`. The structured path normalizes via `normalizeSeverity` into the canonical `critical/high/medium/small/trivial` set, but the fallback path bypasses that, so a lens that omits the new block emits `"Minor issues"` into a field whose contract is the 5-bucket set. Suggested fix: map `verdictBlocking→"critical"`, `verdictMinor→"small"`, `verdictClean→"trivial"` (clean lenses are skipped above so the last is academic), `verdictUnknown→""` (or whatever the store treats as unknown) instead of calling `.label()`. - **`cmd/gadfly/findings.go:115` — unterminated `\`\`\`gadfly-findings` block silently swallows the rest of the comment.** `findingsBlock` correctly returns `ok=false` for an unterminated block, but `stripFindingsBlock` just sets `skipping = true` and never resets it without seeing an `isFenceClose`. The model is asked to terminate the block, but `findings_test.go` shows `stripFindingsBlock` isn't even exercised for the unterminated case. A model that opens the fence and forgets the closing fence will have its entire tail (verdict line, sign-off, anything after) silently dropped from the rendered consolidated comment. Suggested fix: in the `skipping` branch, if `isFenceClose(ln)` is false AND we hit EOF inside `skipping`, also drop the original opening fence line so the prose around it isn't accidentally kept; or cap `skipping` to the same logic `findingsBlock` uses so an unterminated opener is left in place rather than eating the rest. - **`cmd/gadfly/findings.go:44` — silent finding loss on a structured block whose entries all lack `file`.** Every entry that fails the `file == ""` guard is `continue`'d, but the function still returns `ok=true`. The caller in `emit.go:141-142` then never falls back to the heuristic scrape, so any `path:line` finding the model wrote in prose is lost. This contradicts the PR's "graceful fallback" framing — the fallback is only taken when the JSON is unparseable, not when every JSON entry is empty/invalid. Suggested fix: if the raw slice was non-empty but every entry was dropped, return `ok=false` so the caller runs `parseFindings`. - **`cmd/gadfly/findings.go:62` — `line` of `null` or a negative number silently becomes 0.** `json.Number.String()` for `null` is the string `"null"`, `strconv.Atoi("null")` errors, so `ln` stays at the zero value `0`. Same for non-numeric strings. A `0` line is then written to `reports` and looks like a real hit at line 0. Suggested fix: on `err != nil`, either `continue` (drop the entry) or return an error so the caller falls back; at minimum, reject `ln <= 0` before accepting the finding. Verified by reading `findings.go`, `emit.go`, `consolidate.go`, and `findings_test.go` directly; no claims rely on the diff alone. </details> Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 25s

gitea-actions bot commented

2026-06-28 22:03:18 +00:00

🪰 Gadfly review — `claude-code/sonnet` (claude-code)

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

Verdict: No material issues found

I traced the full data flow introduced by this PR through a security lens.

Untrusted model output → telemetry POST

extractStructuredFindings parses LLM-generated JSON and feeds file, title, detail, severity, confidence into reportPayload, which is POSTed to the telemetry API (emit.go:150-162). I checked whether any of these values reach gadfly's own filesystem or a shell:

Filesystem: All file I/O in tools.go goes through r.resolve() (sandboxed to repo root via filepath.Clean + prefix check). The finding.file field from structured output is never passed to r.resolve() — it only populates reportPayload.File and is used as a map key in proseParagraphs. No path traversal is possible from this PR's code.
Shell: No shell execution anywhere in the codebase.
JSON injection into the HTTP POST: Go's encoding/json escapes special characters; all string fields are serialized safely.
Telemetry payload size: title is truncated to 120 runes, detail to 500 runes before transmission (findings.go:71,77).

The file field already flowed to telemetry from the pre-existing parseFindings heuristic scrape; this PR adds a more structured path for the same data without widening the attack surface meaningfully.

stripFindingsBlock / content injection into PR comment

The PR comment renders stripFindingsBlock(r.out) (consolidate.go:105), not any parsed finding field — so model-controlled title/detail values never reach the rendered Gitea comment. isFenceClose does its own TrimSpace regardless of whether it receives the raw or trimmed line, so the strip logic is consistent.

Prompt injection in recheck (recheck.go:97, buildRecheckTask): raw first-pass output is embedded unsanitized into the recheck prompt. This pre-dates the PR and is not introduced or materially worsened here (the gadfly-findings block simply rides along in what was already an unescaped draft embedding).

Nothing in my security lane introduced by this change.

🎯 Correctness — Minor issues

The finding is confirmed. The code at line 170 places "minor" in the "medium" case, while "minor-issue" at line 172 maps to "small" — the inconsistency is real. The test at line 90 asserts "minor": "medium", locking in the current behavior.

Minor issues

`normalizeSeverity`: "minor" maps to "medium" — severity inflation

cmd/gadfly/findings.go:170

case "medium", "moderate", "minor":
    return "medium"
case "small", "low", "minor-issue":
    return "small"

"minor" is a ubiquitous code-review adjective meaning "small, non-blocking issue." A model that uses "minor" to label a nit gets that finding posted to the telemetry store as severity=medium — one tier higher than intended.

The inconsistency is visible within the same function: "minor-issue" → "small" but "minor" → "medium". The hyphenated form lands at the correct tier; the natural word lands one step up. Since the PR's stated goal is "the gadfly-reports store and grading get real signal," an incorrect normalization directly undermines that goal.

Fix: move "minor" to the "small" case:

case "small", "low", "minor-issue", "minor":
    return "small"

The test at findings_test.go:90 asserts the current (wrong) mapping "minor": "medium" — that assertion should flip to "minor": "small" alongside the code fix.

🧹 Code cleanliness & maintainability — Minor issues

All four findings check out against the actual files. Here is the verified review:

Minor issues

emit.go:13–17 — package comment now contradicts the code. The doc block says "Findings are extracted heuristically from each lens's markdown … not a structured contract the reviewer guarantees." This PR makes the structured block the preferred path (emit.go:140–143) and the prose scrape a fallback; the heuristic-only description and the "not a structured contract" claim are both now wrong and will mislead future readers.
findings.go:164–179 — normalizeSeverity evaluates strings.ToLower(strings.TrimSpace(s)) twice. Go evaluates the switch expression once, but the default branch at line 177 re-evaluates it independently with return strings.ToLower(strings.TrimSpace(s)). The fix is one line: lower := strings.ToLower(strings.TrimSpace(s)) before the switch, and return lower in the default.
findings.go:96 and findings.go:126 — duplicated fence-open predicate. The expression strings.HasPrefix(t, "```") && strings.Contains(t, findingsFence) is copy-pasted verbatim in both findingsBlock and stripFindingsBlock. The fence-close check is already factored into isFenceClose; the fence-open deserves the same treatment. If the info-string ever changes it is a two-site edit.
findings_test.go:102–111 — hand-rolled contains instead of strings.Contains. The test file avoids importing "strings" by defining its own byte-by-byte substring search and a containsFence wrapper. Adding "strings" and using strings.Contains directly is idiomatic Go and removes ~8 lines of noise.

⚡ Performance — No material issues found

No material issues found

The new code (findings.go, emit.go changes) touches only the end-of-run telemetry path (emit() is called once, after all lens results are collected, before process exit). Performance concerns here have negligible real-world impact.

Verified findings:

proseParagraphs line-counting loop (findings.go:154): strings.Count(prose[:loc[0]], "\n") rescans the string from byte 0 for every regex match — technically O(N×M) where N=prose length and M=path:line match count. I confirmed this is the actual code. However, extractStructuredFindings is invoked at most once per non-clean lens in emit() (see emit.go:140), emit() itself is called once per run (main.go:216), and lens output has no enforced size cap but is bounded in practice by LLM output. For typical inputs (a few KB of prose, ≤20 path:line matches), this is microseconds. Not material.
Repeated strings.Split per lens: findingsBlock, stripFindingsBlock (inside proseParagraphs), and proseParagraphs itself each call strings.Split(out, "\n") on the same string — three allocations per lens. consolidate.go adds a fourth via stripFindingsBlock(r.out). All four happen once per lens outside any review loop. Inconsequential.
proseParagraphs called on empty arrays (findings.go:54): the map is built even when raw is empty, so the work is discarded unused. Wasted O(prose) work per lens that emits []. Trivial given call frequency.

None of these cross the bar for a real performance issue — they are off the hot path entirely.

🧯 Error handling & edge cases — No material issues found

No material issues found

I verified the following paths carefully:

findingsBlock — unterminated fence (findings.go:105–110): returns ("", false) when no close fence is found, so extractStructuredFindings returns (nil, false) and the caller falls back to the heuristic scrape. Correct.

Empty body inside a gadfly-findings fence: start+1 == j produces body="", json.Unmarshal([]byte(""), …) errors, returns (nil, false). Falls back correctly.

json.Number with string-encoded line (findings.go:63): json.Number's underlying kind is reflect.String, so "line":"42" (JSON string) decodes to json.Number("42"), and sf.Line.String() → "42" → strconv.Atoi → 42. TestExtractStructuredFindingsStringLine covers this.

json.Number zero value / null / float: json.Number("") or json.Number("1.5") → strconv.Atoi errors → ln=0. Prose-detail lookup for file:0 misses, detail stays "". Finding still emitted. No panic.

isFenceClose correctness (findings.go:138–140): strings.TrimLeft(t, "``") trims all leading backticks; a line with only backticks (3+) correctly returns true; any info string after the backticks prevents a false match. Handles ≥3 backtick open/close pairs correctly.

stripFindingsBlock unterminated block (findings.go:116–133): if no close fence appears, skipping stays true through end-of-input — all content after the opening fence is stripped from the rendered comment, silently. In practice the system prompt instructs "Keep the block last", so there should be no content after it. This is consistent with findingsBlock returning (nil, false) for the same input, so the telemetry path correctly falls back to heuristic scraping on the original r.out. Not a real-world risk given the contract, but worth being aware of if a model truncates mid-block.

paragraphAt bounds (emit.go:321): guards li < 0 || li >= len(lines). Byte-index arithmetic in proseParagraphs is safe because pathLineRe only matches ASCII.

maxFindingsPerLens cap: applied in extractStructuredFindings (findings.go:82) as well as the existing heuristic path — no payload explosion possible.

Empty title+detail degenerate finding: if both sf.Title and sf.Detail are empty and no prose match is found, title="" is submitted to telemetry. Not a crash; the store receives a finding with an empty title. The if file == "" guard (findings.go:58) at least ensures location is present. Benign but worth noting as a telemetry hygiene point.

No panics, no swallowed errors on unhappy paths, no missing cleanup. The graceful-degradation design (ok=false → heuristic scrape) is sound.

_{Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 7m 21s}

### 🪰 Gadfly review — `claude-code/sonnet` (claude-code) **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary>🔒 Security — No material issues found</summary> **Verdict: No material issues found** I traced the full data flow introduced by this PR through a security lens. **Untrusted model output → telemetry POST** `extractStructuredFindings` parses LLM-generated JSON and feeds `file`, `title`, `detail`, `severity`, `confidence` into `reportPayload`, which is POSTed to the telemetry API (`emit.go:150-162`). I checked whether any of these values reach gadfly's own filesystem or a shell: - **Filesystem**: All file I/O in `tools.go` goes through `r.resolve()` (sandboxed to repo root via `filepath.Clean` + prefix check). The `finding.file` field from structured output is never passed to `r.resolve()` — it only populates `reportPayload.File` and is used as a map key in `proseParagraphs`. No path traversal is possible from this PR's code. - **Shell**: No shell execution anywhere in the codebase. - **JSON injection into the HTTP POST**: Go's `encoding/json` escapes special characters; all string fields are serialized safely. - **Telemetry payload size**: `title` is truncated to 120 runes, `detail` to 500 runes before transmission (`findings.go:71,77`). The `file` field already flowed to telemetry from the pre-existing `parseFindings` heuristic scrape; this PR adds a more structured path for the same data without widening the attack surface meaningfully. **`stripFindingsBlock` / content injection into PR comment** The PR comment renders `stripFindingsBlock(r.out)` (`consolidate.go:105`), not any parsed `finding` field — so model-controlled `title`/`detail` values never reach the rendered Gitea comment. `isFenceClose` does its own `TrimSpace` regardless of whether it receives the raw or trimmed line, so the strip logic is consistent. **Prompt injection in recheck** (`recheck.go:97`, `buildRecheckTask`): raw first-pass output is embedded unsanitized into the recheck prompt. This pre-dates the PR and is not introduced or materially worsened here (the `gadfly-findings` block simply rides along in what was already an unescaped draft embedding). Nothing in my security lane introduced by this change. </details> <details><summary>🎯 Correctness — Minor issues</summary> The finding is confirmed. The code at line 170 places `"minor"` in the `"medium"` case, while `"minor-issue"` at line 172 maps to `"small"` — the inconsistency is real. The test at line 90 asserts `"minor": "medium"`, locking in the current behavior. **Minor issues** --- ### `normalizeSeverity`: "minor" maps to "medium" — severity inflation **`cmd/gadfly/findings.go:170`** ```go case "medium", "moderate", "minor": return "medium" case "small", "low", "minor-issue": return "small" ``` "minor" is a ubiquitous code-review adjective meaning "small, non-blocking issue." A model that uses "minor" to label a nit gets that finding posted to the telemetry store as `severity=medium` — one tier higher than intended. The inconsistency is visible within the same function: **"minor-issue" → "small"** but **"minor" → "medium"**. The hyphenated form lands at the correct tier; the natural word lands one step up. Since the PR's stated goal is "the gadfly-reports store and grading get real signal," an incorrect normalization directly undermines that goal. Fix: move `"minor"` to the `"small"` case: ```go case "small", "low", "minor-issue", "minor": return "small" ``` The test at `findings_test.go:90` asserts the current (wrong) mapping `"minor": "medium"` — that assertion should flip to `"minor": "small"` alongside the code fix. </details> <details><summary>🧹 Code cleanliness & maintainability — Minor issues</summary> All four findings check out against the actual files. Here is the verified review: --- **Minor issues** - **`emit.go:13–17` — package comment now contradicts the code.** The doc block says "Findings are extracted heuristically from each lens's markdown … not a structured contract the reviewer guarantees." This PR makes the structured block the *preferred* path (`emit.go:140–143`) and the prose scrape a *fallback*; the heuristic-only description and the "not a structured contract" claim are both now wrong and will mislead future readers. - **`findings.go:164–179` — `normalizeSeverity` evaluates `strings.ToLower(strings.TrimSpace(s))` twice.** Go evaluates the switch expression once, but the `default` branch at line 177 re-evaluates it independently with `return strings.ToLower(strings.TrimSpace(s))`. The fix is one line: `lower := strings.ToLower(strings.TrimSpace(s))` before the switch, and `return lower` in the default. - **`findings.go:96` and `findings.go:126` — duplicated fence-open predicate.** The expression `strings.HasPrefix(t, "```") && strings.Contains(t, findingsFence)` is copy-pasted verbatim in both `findingsBlock` and `stripFindingsBlock`. The fence-close check is already factored into `isFenceClose`; the fence-open deserves the same treatment. If the info-string ever changes it is a two-site edit. - **`findings_test.go:102–111` — hand-rolled `contains` instead of `strings.Contains`.** The test file avoids importing `"strings"` by defining its own byte-by-byte substring search and a `containsFence` wrapper. Adding `"strings"` and using `strings.Contains` directly is idiomatic Go and removes ~8 lines of noise. </details> <details><summary>⚡ Performance — No material issues found</summary> **No material issues found** The new code (`findings.go`, `emit.go` changes) touches only the end-of-run telemetry path (`emit()` is called once, after all lens results are collected, before process exit). Performance concerns here have negligible real-world impact. Verified findings: - **`proseParagraphs` line-counting loop** (`findings.go:154`): `strings.Count(prose[:loc[0]], "\n")` rescans the string from byte 0 for every regex match — technically O(N×M) where N=prose length and M=path:line match count. I confirmed this is the actual code. However, `extractStructuredFindings` is invoked at most once per non-clean lens in `emit()` (see `emit.go:140`), `emit()` itself is called once per run (`main.go:216`), and lens output has no enforced size cap but is bounded in practice by LLM output. For typical inputs (a few KB of prose, ≤20 path:line matches), this is microseconds. Not material. - **Repeated `strings.Split` per lens**: `findingsBlock`, `stripFindingsBlock` (inside `proseParagraphs`), and `proseParagraphs` itself each call `strings.Split(out, "\n")` on the same string — three allocations per lens. `consolidate.go` adds a fourth via `stripFindingsBlock(r.out)`. All four happen once per lens outside any review loop. Inconsequential. - **`proseParagraphs` called on empty arrays** (`findings.go:54`): the map is built even when `raw` is empty, so the work is discarded unused. Wasted O(prose) work per lens that emits `[]`. Trivial given call frequency. None of these cross the bar for a real performance issue — they are off the hot path entirely. </details> <details><summary>🧯 Error handling & edge cases — No material issues found</summary> **No material issues found** I verified the following paths carefully: **`findingsBlock` — unterminated fence** (`findings.go:105–110`): returns `("", false)` when no close fence is found, so `extractStructuredFindings` returns `(nil, false)` and the caller falls back to the heuristic scrape. Correct. **Empty body inside a `gadfly-findings` fence**: `start+1 == j` produces `body=""`, `json.Unmarshal([]byte(""), …)` errors, returns `(nil, false)`. Falls back correctly. **`json.Number` with string-encoded line** (`findings.go:63`): `json.Number`'s underlying kind is `reflect.String`, so `"line":"42"` (JSON string) decodes to `json.Number("42")`, and `sf.Line.String()` → `"42"` → `strconv.Atoi` → 42. `TestExtractStructuredFindingsStringLine` covers this. **`json.Number` zero value / null / float**: `json.Number("")` or `json.Number("1.5")` → `strconv.Atoi` errors → `ln=0`. Prose-detail lookup for `file:0` misses, detail stays `""`. Finding still emitted. No panic. **`isFenceClose` correctness** (`findings.go:138–140`): `strings.TrimLeft(t, "``")` trims all leading backticks; a line with only backticks (3+) correctly returns true; any info string after the backticks prevents a false match. Handles ≥3 backtick open/close pairs correctly. **`stripFindingsBlock` unterminated block** (`findings.go:116–133`): if no close fence appears, `skipping` stays true through end-of-input — all content after the opening fence is stripped from the rendered comment, silently. In practice the system prompt instructs "Keep the block last", so there should be no content after it. This is consistent with `findingsBlock` returning `(nil, false)` for the same input, so the telemetry path correctly falls back to heuristic scraping on the original `r.out`. Not a real-world risk given the contract, but worth being aware of if a model truncates mid-block. **`paragraphAt` bounds** (`emit.go:321`): guards `li < 0 || li >= len(lines)`. Byte-index arithmetic in `proseParagraphs` is safe because `pathLineRe` only matches ASCII. **`maxFindingsPerLens` cap**: applied in `extractStructuredFindings` (`findings.go:82`) as well as the existing heuristic path — no payload explosion possible. **Empty title+detail degenerate finding**: if both `sf.Title` and `sf.Detail` are empty and no prose match is found, `title=""` is submitted to telemetry. Not a crash; the store receives a finding with an empty title. The `if file == ""` guard (`findings.go:58`) at least ensures location is present. Benign but worth noting as a telemetry hygiene point. No panics, no swallowed errors on unhappy paths, no missing cleanup. The graceful-degradation design (`ok=false` → heuristic scrape) is sound. </details> Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 7m 21s

gitea-actions bot commented

2026-06-28 22:03:19 +00:00

🪰 Gadfly review — `claude-code/opus:max` (claude-code)

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

I've traced the taint path through the new code. Here's my security review.

VERDICT: No material issues found

I reviewed this change specifically for the security-relevant question it raises: structured findings are parsed from LLM output, which is influenced by the (potentially attacker-authored) PR diff under review, and that data flows into a telemetry POST. I traced that path end-to-end.

Taint sink is safely encoded. extractStructuredFindings (cmd/gadfly/findings.go:44) populates file/line/title/detail/severity/confidence straight from model-controlled JSON, and those land in reportPayload and are sent via postJSON (cmd/gadfly/emit.go:150, :381). postJSON uses json.Marshal, which escapes control characters/quotes, so model-supplied strings cannot break out of the JSON body — no injection into the telemetry call. Verified by reading postJSON.
No SSRF / URL tainting. The POST target is base+"/runs" / base+"/reports" where base comes only from GADFLY_FINDINGS_URL (operator env), never from the model. The model-controlled file field is payload data, not part of the URL or any path. Verified at emit.go:102 and :169-173.
No new auth/secret exposure. The Bearer token comes from GADFLY_FINDINGS_TOKEN (env) as before; nothing in this diff logs it or routes secrets into the new fields. detail/title can carry repo prose, but that was already true of the pre-existing parseFindings scrape — no new leakage surface.
stripFindingsBlock is strip-only. An unterminated or malicious ```gadfly-findings block makes it drop content, never reveal more, so hiding the tooling block can't be subverted into leaking it (findings.go:115). The consolidate.go change only removes the block from the rendered comment — a mild positive.

One thing I considered and rejected as a finding: unlike the regex-extracted path (constrained to [A-Za-z0-9_./-]), the structured file field is unvalidated/unbounded model text. Within this binary that's harmless (JSON-escaped on egress, count capped at maxFindingsPerLens). It could matter only if the downstream gadfly-reports store renders/uses file unsafely — but that store is out of this repo's scope, so it's not a security defect in this change.

Nothing actionable in the security lens.

🎯 Correctness — Minor issues

All four findings verify against the actual code:

Heterogeneous raw_severity — confirmed: emit.go:146-148 writes f.severity (canonical critical/high/medium/small/trivial) for structured findings but lensSev = r.verdict.label() for heuristic ones, and label() (consolidate.go:18-29) returns the full sentences "Minor issues" / "Blocking issues found". Two vocabularies into one RawSeverity field (emit.go:93).
normalizeSeverity("minor") → "medium" — confirmed at findings.go:170, while "low"/"minor-issue" route to small.
confidence collected but dropped — confirmed: set at findings.go:79, but reportPayload (emit.go:83-95) has no confidence field; grep shows it is never read into any payload.
Empty block suppresses heuristic — confirmed: extractStructuredFindings returns ok=true for [] (tested), so emit.go:141 skips the prose fallback.

VERDICT: Minor issues

Correctness lens. The change compiles (all referenced helpers resolve, struct fields are additive/keyed) and the advisory invariant is untouched. The issues below are semantic/data-correctness in the telemetry the PR exists to feed.

Heterogeneous raw_severity vocabulary — emit.go:144-160 + findings.go:164. The two code paths write incompatible value spaces into the same raw_severity field. A model that emits a structured block produces canonical severities (critical/high/medium/small/trivial); a model that doesn't falls back to lensSev = r.verdict.label(), which is "Minor issues" / "Blocking issues found". Before this PR every finding's raw_severity was a verdict label — one consistent vocabulary. Now the field mixes per-finding severities and verdict-label sentences depending on which model emitted the block. The PR's stated goal is "grading gets real signal" from per-finding severity; any downstream bucketing keyed on severity must now special-case both vocabularies, or it silently mis-buckets the fallback rows. Suggest mapping the verdict to a canonical severity in the fallback (e.g. blocking→high/critical, minor→small) so the column has one vocabulary.
normalizeSeverity("minor") → "medium" — findings.go:170. Against the declared 5-level scale (critical > high > medium > small > trivial): bare "minor" is colloquially a low severity, and the table itself routes "minor-issue"→small and "low"→small. Mapping bare "minor" to the middle of the scale over-escalates and is internally inconsistent. It also contradicts Gadfly's own verdict vocabulary, where "Minor issues" is the lowest non-clean verdict — so this inflates the telemetry severity of exactly the off-spec/weak models the synonym table exists to serve. Suggest "minor"→small. (Deliberate per findings_test.go, so confirm intent — but the mapping looks wrong.)
confidence is parsed, normalized, then dropped — findings.go:79 populates finding.confidence, but reportPayload (emit.go:83-95) has no confidence field and nothing else reads it. So the per-finding confidence the system-prompt and recheck now demand never reaches the store in this PR. The PR body frames confidence as signal for telemetry/grading; here it reaches nothing. Likely intentional staging for a later phase — worth a one-line confirmation, otherwise it's collected-but-discarded. (Borderline maintainability.)
(Lower confidence) Empty structured block suppresses the heuristic — emit.go:140-143. extractStructuredFindings returns ok=true for a parseable but empty [], so the fallback scrape never runs. If a non-compliant model writes prose findings + a non-clean verdict but emits [] (or an under-populated array), that lens records zero /reports rows where the pre-PR prose scrape would have captured them — and parseVerdict still reads Blocking/Minor from the prose, so /runs and /reports disagree. Gadfly explicitly targets weak/provider-agnostic models, so this mismatch is plausible. Consider falling back (or merging) when the block is empty but the verdict is non-clean.

No blocking correctness defects: parsing, fence stripping (both fences removed; unterminated block correctly yields ok=false), json.Number line handling, and dedup-free structured emission are all behaviorally sound.

🧹 Code cleanliness & maintainability — Minor issues

All three findings verify against the actual code:

confidence write-only: confirmed — findings.go:79 sets it, emit.go:62 declares it, but reportPayload (emit.go:83-95) has no confidence field and the grep shows no downstream reader.
Duplicated open-fence predicate: confirmed at findings.go:96 and findings.go:126, with isFenceClose helper at line 137.
Test reinvents strings.Contains: confirmed at findings_test.go:104-111.

All findings survive verification.

Verified against the checked-out code. My lens is code cleanliness & maintainability only.

VERDICT: Minor issues

finding.confidence is parsed, normalized, tested — but never consumed (write-only field). cmd/gadfly/findings.go:79 populates confidence, and emit.go:62 declares the struct field, but no consumer reads it: reportPayload (emit.go:83-95) has no confidence field, and a grep for confidence across non-test cmd/gadfly shows only the prompt text, the JSON tag, and the assignment — nothing downstream. So the recheck prompt regenerates it and the parser carefully lowercases it, then it's dropped at the /reports boundary. Either wire it into reportPayload in this change, or add a one-line comment that it's collected ahead of a later consumer — otherwise the next maintainer can't tell whether the omission is a bug.
Duplicated open-fence predicate. The exact expression strings.HasPrefix(t, "```") && strings.Contains(t, findingsFence) appears in both findingsBlock (findings.go:96) and stripFindingsBlock (findings.go:126). There's already an isFenceClose helper (findings.go:137) for the close side; an isFindingsFenceOpen(line) sibling would remove the copy-paste and keep the two detectors from drifting. Low-churn, in keeping with the existing helper pattern.
Test reinvents strings.Contains. findings_test.go hand-rolls a substring loop in contains (findings_test.go:104-111, wrapped by containsFence) instead of importing strings. The rest of the package uses strings freely; the manual loop is just more code to read with no benefit. Replace with strings.Contains. Minor.

⚡ Performance — No material issues found

I've verified the call paths. My performance review:

VERDICT: No material issues found

I reviewed this change through the Performance lens and traced every call site of the new code.

What I checked:

extractStructuredFindings, proseParagraphs, findingsBlock, stripFindingsBlock all operate on a single lens's review-markdown string (a few KB), and every call site runs once per lens at the end of the run — never inside a hot loop, retry, or per-source-file iteration:
- emit.go:140 — once per non-clean lens in the post-review telemetry pass.
- consolidate.go:105 — once per lens at render time.
- main.go:212 — renderConsolidated called once.
No redundant work on the fallback path: when there's no block, findingsBlock returns early (findings.go:45-48) before proseParagraphs runs; when the block is malformed JSON, it returns after json.Unmarshal (findings.go:50-52) — again before the prose work. So a model that ignores the contract pays only one strings.Split, then falls through to the pre-existing parseFindings. No double-scan.
Growth is bounded: the extraction loop caps at maxFindingsPerLens (20) (findings.go:81-83), matching the existing heuristic cap, so a pathological lens output can't blow up allocations or the /reports payload.
No new blocking calls on a hot path: the HTTP POSTs and their findingsHTTPTimeout are unchanged; the diff adds only in-memory string parsing before them.

Considered and dismissed (not material):

proseParagraphs (findings.go:155) uses strings.Count(prose[:loc[0]], "\n") per regex match — technically O(n·m). But it's bounded by review-string length, runs once per lens, and is the same pattern already in parseFindings (emit.go:207), so it's not a regression.
r.out is line-split up to three times per lens on the structured path (findingsBlock, then stripFindingsBlock + a re-split inside proseParagraphs). On inputs this small and called this rarely, deduping the split would be a micro-optimization, not a real win — not worth flagging.

Nothing in my lane is materially wrong here.

🧯 Error handling & edge cases — Minor issues

All three findings are confirmed against the actual code:

emit.go:140-143 — confirmed: extractStructuredFindings returns ok=true once the JSON parses (findings.go:85), and elements with a blank file are continued (findings.go:57-60), so an empty-after-filter result yields (fs=[], structured=true), and the if !structured fallback at emit.go:141 is skipped.
findings.go:115-133 — confirmed: stripFindingsBlock only clears skipping on isFenceClose, dropping all lines to EOF on an unterminated fence, while findingsBlock returns ok=false (findings.go:109) for the same input — a real asymmetry, and the stripped output feeds the rendered comment at consolidate.go:105.
findings.go:56-85 — confirmed: the structured loop has no seen map, unlike parseFindings (emit.go:192,201-205).

VERDICT: Minor issues

All findings below are within the error-handling / edge-cases lens on the new findings.go and its emit.go integration. This is advisory telemetry plumbing — emit only logs to stderr, never touches stdout, and never calls os.Exit (verified, emit.go:97-177), so nothing here can crash the process or affect a review/CI. Nothing is blocking, but two unhappy-path behaviors are worth fixing.

emit.go:140-143 — a structurally-valid-but-empty block suppresses the heuristic fallback, silently dropping findings. extractStructuredFindings returns ok=true whenever the JSON array parses (findings.go:85), even if every element is then discarded. In the loop at findings.go:56-60, any element with a blank file is continued. So a weak model that emits [{"severity":"high","title":"..."}] (valid JSON, no file) yields fs=[] and structured=true — the if !structured fallback to parseFindings (emit.go:141) never runs, even though the prose almost certainly contains scrapable path:line refs the heuristic would have recovered. Net effect: a non-clean lens contributes zero per-finding telemetry, undercutting the PR's "weak models still contribute" goal for exactly the malformed case it's meant to tolerate. Suggest: when structured && len(fs)==0 on a non-clean verdict, fall back to parseFindings (the truly-empty [] "nothing found" case is already filtered by the verdictClean guard at emit.go:134, so this wouldn't reintroduce noise on clean lenses).
findings.go:115-133 — an unterminated gadfly-findings fence swallows all following content from the rendered comment. In stripFindingsBlock, once skipping is set on the opening fence it is only cleared by an isFenceClose. If the model omits the closing ``` (or the block isn't last), every subsequent line is dropped from the human comment at consolidate.go:105. Note the asymmetry: findingsBlock treats the same unterminated input as no block (ok=false, findings.go:109), so the two functions disagree about where the block ends. Low probability, but a silent content-swallow on malformed input. Suggest: if no close fence is found, keep the lines (mirror findingsBlock's "unterminated ⇒ no block" stance) rather than discarding to EOF.
Minor — findings.go:56-85 does not dedupe by (file,line). The heuristic parseFindings dedupes via its seen map (emit.go:192,201-205); the structured path does not, so a model that repeats a finding in the JSON array emits duplicate /reports rows. Telemetry data-quality only.

Non-issues checked and cleared: no panic paths — strconv.Atoi on a blank/garbage/null line is error-guarded to ln=0 (findings.go:61-63); the quoted-"42" case works because json.Number accepts a numeric string literal; truncate's rune-slicing and all prose[...] indexing (indices come from FindAllStringSubmatchIndex over the same string) are in-bounds; malformed JSON correctly returns ok=false for clean fallback (findings.go:50-52).

_{Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 3m 55s}

### 🪰 Gadfly review — `claude-code/opus:max` (claude-code) **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary>🔒 Security — No material issues found</summary> I've traced the taint path through the new code. Here's my security review. --- **VERDICT: No material issues found** I reviewed this change specifically for the security-relevant question it raises: structured findings are parsed from **LLM output**, which is influenced by the (potentially attacker-authored) PR diff under review, and that data flows into a telemetry POST. I traced that path end-to-end. - **Taint sink is safely encoded.** `extractStructuredFindings` (`cmd/gadfly/findings.go:44`) populates `file`/`line`/`title`/`detail`/`severity`/`confidence` straight from model-controlled JSON, and those land in `reportPayload` and are sent via `postJSON` (`cmd/gadfly/emit.go:150`, `:381`). `postJSON` uses `json.Marshal`, which escapes control characters/quotes, so model-supplied strings cannot break out of the JSON body — no injection into the telemetry call. Verified by reading `postJSON`. - **No SSRF / URL tainting.** The POST target is `base+"/runs"` / `base+"/reports"` where `base` comes only from `GADFLY_FINDINGS_URL` (operator env), never from the model. The model-controlled `file` field is payload data, not part of the URL or any path. Verified at `emit.go:102` and `:169-173`. - **No new auth/secret exposure.** The Bearer token comes from `GADFLY_FINDINGS_TOKEN` (env) as before; nothing in this diff logs it or routes secrets into the new fields. `detail`/`title` can carry repo prose, but that was already true of the pre-existing `parseFindings` scrape — no *new* leakage surface. - **`stripFindingsBlock` is strip-only.** An unterminated or malicious ` ```gadfly-findings ` block makes it drop content, never reveal more, so hiding the tooling block can't be subverted into leaking it (`findings.go:115`). The consolidate.go change only removes the block from the rendered comment — a mild positive. One thing I considered and rejected as a finding: unlike the regex-extracted path (constrained to `[A-Za-z0-9_./-]`), the structured `file` field is unvalidated/unbounded model text. Within this binary that's harmless (JSON-escaped on egress, count capped at `maxFindingsPerLens`). It could matter only if the downstream `gadfly-reports` store renders/uses `file` unsafely — but that store is out of this repo's scope, so it's not a security defect in this change. Nothing actionable in the security lens. </details> <details><summary>🎯 Correctness — Minor issues</summary> All four findings verify against the actual code: 1. **Heterogeneous `raw_severity`** — confirmed: `emit.go:146-148` writes `f.severity` (canonical `critical/high/medium/small/trivial`) for structured findings but `lensSev = r.verdict.label()` for heuristic ones, and `label()` (`consolidate.go:18-29`) returns the full sentences `"Minor issues"` / `"Blocking issues found"`. Two vocabularies into one `RawSeverity` field (`emit.go:93`). 2. **`normalizeSeverity("minor") → "medium"`** — confirmed at `findings.go:170`, while `"low"`/`"minor-issue"` route to `small`. 3. **`confidence` collected but dropped** — confirmed: set at `findings.go:79`, but `reportPayload` (`emit.go:83-95`) has no confidence field; grep shows it is never read into any payload. 4. **Empty block suppresses heuristic** — confirmed: `extractStructuredFindings` returns `ok=true` for `[]` (tested), so `emit.go:141` skips the prose fallback. --- **VERDICT: Minor issues** Correctness lens. The change compiles (all referenced helpers resolve, struct fields are additive/keyed) and the advisory invariant is untouched. The issues below are semantic/data-correctness in the telemetry the PR exists to feed. - **Heterogeneous `raw_severity` vocabulary** — `emit.go:144-160` + `findings.go:164`. The two code paths write *incompatible value spaces* into the same `raw_severity` field. A model that emits a structured block produces canonical severities (`critical/high/medium/small/trivial`); a model that doesn't falls back to `lensSev = r.verdict.label()`, which is `"Minor issues"` / `"Blocking issues found"`. Before this PR *every* finding's `raw_severity` was a verdict label — one consistent vocabulary. Now the field mixes per-finding severities and verdict-label sentences depending on which model emitted the block. The PR's stated goal is "grading gets real signal" from per-finding severity; any downstream bucketing keyed on severity must now special-case both vocabularies, or it silently mis-buckets the fallback rows. Suggest mapping the verdict to a canonical severity in the fallback (e.g. blocking→high/critical, minor→small) so the column has one vocabulary. - **`normalizeSeverity("minor") → "medium"`** — `findings.go:170`. Against the declared 5-level scale (`critical > high > medium > small > trivial`): bare "minor" is colloquially a *low* severity, and the table itself routes `"minor-issue"`→`small` and `"low"`→`small`. Mapping bare `"minor"` to the **middle** of the scale over-escalates and is internally inconsistent. It also contradicts Gadfly's own verdict vocabulary, where `"Minor issues"` is the *lowest* non-clean verdict — so this inflates the telemetry severity of exactly the off-spec/weak models the synonym table exists to serve. Suggest `"minor"`→`small`. (Deliberate per `findings_test.go`, so confirm intent — but the mapping looks wrong.) - **`confidence` is parsed, normalized, then dropped** — `findings.go:79` populates `finding.confidence`, but `reportPayload` (`emit.go:83-95`) has **no** confidence field and nothing else reads it. So the per-finding confidence the system-prompt and recheck now demand never reaches the store in this PR. The PR body frames confidence as signal for telemetry/grading; here it reaches nothing. Likely intentional staging for a later phase — worth a one-line confirmation, otherwise it's collected-but-discarded. (Borderline maintainability.) - **(Lower confidence) Empty structured block suppresses the heuristic** — `emit.go:140-143`. `extractStructuredFindings` returns `ok=true` for a parseable but empty `[]`, so the fallback scrape never runs. If a non-compliant model writes prose findings + a non-clean verdict but emits `[]` (or an under-populated array), that lens records **zero** `/reports` rows where the pre-PR prose scrape would have captured them — and `parseVerdict` still reads `Blocking`/`Minor` from the prose, so `/runs` and `/reports` disagree. Gadfly explicitly targets weak/provider-agnostic models, so this mismatch is plausible. Consider falling back (or merging) when the block is empty but the verdict is non-clean. No blocking correctness defects: parsing, fence stripping (both fences removed; unterminated block correctly yields `ok=false`), `json.Number` line handling, and dedup-free structured emission are all behaviorally sound. </details> <details><summary>🧹 Code cleanliness & maintainability — Minor issues</summary> All three findings verify against the actual code: 1. **`confidence` write-only**: confirmed — `findings.go:79` sets it, `emit.go:62` declares it, but `reportPayload` (`emit.go:83-95`) has no confidence field and the grep shows no downstream reader. 2. **Duplicated open-fence predicate**: confirmed at `findings.go:96` and `findings.go:126`, with `isFenceClose` helper at line 137. 3. **Test reinvents `strings.Contains`**: confirmed at `findings_test.go:104-111`. All findings survive verification. --- Verified against the checked-out code. My lens is code cleanliness & maintainability only. ## VERDICT: Minor issues - **`finding.confidence` is parsed, normalized, tested — but never consumed (write-only field).** `cmd/gadfly/findings.go:79` populates `confidence`, and `emit.go:62` declares the struct field, but no consumer reads it: `reportPayload` (`emit.go:83-95`) has no confidence field, and a grep for `confidence` across non-test `cmd/gadfly` shows only the prompt text, the JSON tag, and the assignment — nothing downstream. So the recheck prompt regenerates it and the parser carefully lowercases it, then it's dropped at the `/reports` boundary. Either wire it into `reportPayload` in this change, or add a one-line comment that it's collected ahead of a later consumer — otherwise the next maintainer can't tell whether the omission is a bug. - **Duplicated open-fence predicate.** The exact expression `strings.HasPrefix(t, "```") && strings.Contains(t, findingsFence)` appears in both `findingsBlock` (`findings.go:96`) and `stripFindingsBlock` (`findings.go:126`). There's already an `isFenceClose` helper (`findings.go:137`) for the close side; an `isFindingsFenceOpen(line)` sibling would remove the copy-paste and keep the two detectors from drifting. Low-churn, in keeping with the existing helper pattern. - **Test reinvents `strings.Contains`.** `findings_test.go` hand-rolls a substring loop in `contains` (`findings_test.go:104-111`, wrapped by `containsFence`) instead of importing `strings`. The rest of the package uses `strings` freely; the manual loop is just more code to read with no benefit. Replace with `strings.Contains`. Minor. </details> <details><summary>⚡ Performance — No material issues found</summary> I've verified the call paths. My performance review: ## VERDICT: No material issues found I reviewed this change through the **Performance** lens and traced every call site of the new code. **What I checked:** - `extractStructuredFindings`, `proseParagraphs`, `findingsBlock`, `stripFindingsBlock` all operate on a single lens's review-markdown string (a few KB), and every call site runs **once per lens at the end of the run** — never inside a hot loop, retry, or per-source-file iteration: - `emit.go:140` — once per non-clean lens in the post-review telemetry pass. - `consolidate.go:105` — once per lens at render time. - `main.go:212` — `renderConsolidated` called once. - **No redundant work on the fallback path:** when there's no block, `findingsBlock` returns early (`findings.go:45-48`) before `proseParagraphs` runs; when the block is malformed JSON, it returns after `json.Unmarshal` (`findings.go:50-52`) — again before the prose work. So a model that ignores the contract pays only one `strings.Split`, then falls through to the pre-existing `parseFindings`. No double-scan. - **Growth is bounded:** the extraction loop caps at `maxFindingsPerLens` (20) (`findings.go:81-83`), matching the existing heuristic cap, so a pathological lens output can't blow up allocations or the `/reports` payload. - **No new blocking calls on a hot path:** the HTTP POSTs and their `findingsHTTPTimeout` are unchanged; the diff adds only in-memory string parsing before them. **Considered and dismissed (not material):** - `proseParagraphs` (`findings.go:155`) uses `strings.Count(prose[:loc[0]], "\n")` per regex match — technically O(n·m). But it's bounded by review-string length, runs once per lens, and is the *same* pattern already in `parseFindings` (`emit.go:207`), so it's not a regression. - `r.out` is line-split up to three times per lens on the structured path (`findingsBlock`, then `stripFindingsBlock` + a re-split inside `proseParagraphs`). On inputs this small and called this rarely, deduping the split would be a micro-optimization, not a real win — not worth flagging. Nothing in my lane is materially wrong here. </details> <details><summary>🧯 Error handling & edge cases — Minor issues</summary> All three findings are confirmed against the actual code: 1. **emit.go:140-143** — confirmed: `extractStructuredFindings` returns `ok=true` once the JSON parses (findings.go:85), and elements with a blank `file` are `continue`d (findings.go:57-60), so an empty-after-filter result yields `(fs=[], structured=true)`, and the `if !structured` fallback at emit.go:141 is skipped. 2. **findings.go:115-133** — confirmed: `stripFindingsBlock` only clears `skipping` on `isFenceClose`, dropping all lines to EOF on an unterminated fence, while `findingsBlock` returns `ok=false` (findings.go:109) for the same input — a real asymmetry, and the stripped output feeds the rendered comment at consolidate.go:105. 3. **findings.go:56-85** — confirmed: the structured loop has no `seen` map, unlike `parseFindings` (emit.go:192,201-205). --- VERDICT: **Minor issues** All findings below are within the error-handling / edge-cases lens on the new `findings.go` and its `emit.go` integration. This is advisory telemetry plumbing — `emit` only logs to stderr, never touches stdout, and never calls `os.Exit` (verified, emit.go:97-177), so nothing here can crash the process or affect a review/CI. Nothing is blocking, but two unhappy-path behaviors are worth fixing. - **`emit.go:140-143` — a structurally-valid-but-empty block suppresses the heuristic fallback, silently dropping findings.** `extractStructuredFindings` returns `ok=true` whenever the JSON array *parses* (findings.go:85), even if every element is then discarded. In the loop at `findings.go:56-60`, any element with a blank `file` is `continue`d. So a weak model that emits `[{"severity":"high","title":"..."}]` (valid JSON, no `file`) yields `fs=[]` **and** `structured=true` — the `if !structured` fallback to `parseFindings` (emit.go:141) never runs, even though the prose almost certainly contains scrapable `path:line` refs the heuristic would have recovered. Net effect: a non-clean lens contributes *zero* per-finding telemetry, undercutting the PR's "weak models still contribute" goal for exactly the malformed case it's meant to tolerate. Suggest: when `structured && len(fs)==0` on a non-clean verdict, fall back to `parseFindings` (the truly-empty `[]` "nothing found" case is already filtered by the `verdictClean` guard at emit.go:134, so this wouldn't reintroduce noise on clean lenses). - **`findings.go:115-133` — an unterminated `gadfly-findings` fence swallows all following content from the rendered comment.** In `stripFindingsBlock`, once `skipping` is set on the opening fence it is only cleared by an `isFenceClose`. If the model omits the closing ``` (or the block isn't last), every subsequent line is dropped from the human comment at `consolidate.go:105`. Note the asymmetry: `findingsBlock` treats the same unterminated input as *no block* (`ok=false`, findings.go:109), so the two functions disagree about where the block ends. Low probability, but a silent content-swallow on malformed input. Suggest: if no close fence is found, keep the lines (mirror `findingsBlock`'s "unterminated ⇒ no block" stance) rather than discarding to EOF. - **Minor — `findings.go:56-85` does not dedupe by `(file,line)`.** The heuristic `parseFindings` dedupes via its `seen` map (emit.go:192,201-205); the structured path does not, so a model that repeats a finding in the JSON array emits duplicate `/reports` rows. Telemetry data-quality only. Non-issues checked and cleared: no panic paths — `strconv.Atoi` on a blank/garbage/`null` `line` is error-guarded to `ln=0` (findings.go:61-63); the quoted-`"42"` case works because `json.Number` accepts a numeric string literal; `truncate`'s rune-slicing and all `prose[...]` indexing (indices come from `FindAllStringSubmatchIndex` over the same string) are in-bounds; malformed JSON correctly returns `ok=false` for clean fallback (findings.go:50-52). </details> Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 3m 55s

gitea-actions bot commented

2026-06-28 22:03:19 +00:00

🪰 Gadfly review — `deepseek-v4-pro:cloud` (ollama-cloud)

Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

Verdict: No material issues found

I traced every path where model-controlled data enters the new findings.go pipeline and where it exits. Here's what I verified:

file field (model-controlled): Used only in reportPayload.File → JSON-marshalled → POSTed to the telemetry store. Never passed to os.Open, exec.Command, filepath.Join, or any filesystem/command API. Verified by grepping all uses of .file across cmd/gadfly/ — the sole consumer is emit.go:154 setting the telemetry struct field.
detail/title/severity/confidence fields: Same path — JSON-marshalled into the telemetry POST body. severity passes through normalizeSeverity (whitelist-like mapping with unknown→lowercase fallthrough), confidence is lowercased, detail/title are truncated. No template rendering, no HTML construction, no shell interpolation.
json.Unmarshal into typed []structuredFinding: Safe. No interface{} or map[string]any deserialization that could enable type-confusion attacks. Malformed JSON → ok=false → graceful fallback to the existing heuristic scrape.
stripFindingsBlock in consolidate.go: The JSON block is stripped before the comment is rendered. Even if a model embeds something malicious in the JSON, it never reaches the PR comment. The prose outside the block is rendered as-is, which is the same trust model as before this PR (markdown rendered by the hosting platform).
findingsBlock fence parsing: The isFenceClose check (strings.TrimLeft(t, "") == ""`) correctly distinguishes a real closing fence from backticks embedded inside JSON string values. A multi-line JSON string containing a bare ``` would cause parse failure → fallback, not injection.
No new secrets/env vars/config: The change adds no new credential surface. GADFLY_FINDINGS_TOKEN and GADFLY_FINDINGS_URL are pre-existing.
recheck.go prompt injection: The recheck prompt now instructs the model to regenerate the gadfly-findings block. This is a prompt template change — the model sees it, not untrusted user input. No injection vector.

🎯 Correctness — Blocking issues found

I've confirmed the finding against the actual code. The normalizeSeverity function at lines 170-173 indeed maps "minor" to "medium" while mapping "minor-issue" to "small" — an inconsistency since both are synonyms for the same severity tier. The test at line 90 encodes the current mapping. The finding is real, but it's not a blocking issue — it's a minor inconsistency in synonym handling that only affects models that deviate from the prompt's canonical set.

Minor issues

cmd/gadfly/findings.go:170-173 — normalizeSeverity maps "minor" to "medium" but "minor-issue" to "small" (verified by reading the function). These are the same semantic concept — "minor" means "lesser in importance" and is synonymous with "small"/"low", not "medium"/"moderate". A model that writes "severity": "minor" gets its finding classified as the middle tier, while "severity": "minor-issue" lands at the second-lowest tier. This inflates severity for any model that uses the bare word "minor" (which is the more natural choice — the prompt says critical/high/medium/small/trivial, so a model reaching for a synonym is more likely to write "minor" than "minor-issue"). Fix: move "minor" into the "small" case:
```
case "small", "low", "minor", "minor-issue":
    return "small"
case "medium", "moderate":
    return "medium"
```
The test at findings_test.go:90 encodes the current (wrong) mapping "minor": "medium" and would need the corresponding update.

🧹 Code cleanliness & maintainability — Minor issues

All three findings are confirmed against the actual code. Let me verify the line numbers and details precisely:

normalizeSeverity: Lines 170-171 map "minor" → "medium", lines 172-173 map "minor-issue" → "small". Confirmed.
Hand-rolled contains: findings_test.go line 3 imports only "testing", and lines 104-111 define a manual contains loop. Confirmed.
Empty title on unparseable line: Lines 61-64 leave ln at 0 when strconv.Atoi fails; lines 65-68 look up prose["file:0"] which returns ""; lines 69-72 set title = truncate("", 120) which returns "" (verified truncate at emit.go:370-376). Confirmed.

All three findings survive verification. The verdict remains "Minor issues."

Minor issues

cmd/gadfly/findings.go:170-173 — normalizeSeverity maps "minor" → "medium" but "minor-issue" → "small". These are near-synonyms that land on different canonical severities. A model that writes "minor" gets medium; one that writes "minor-issue" gets small. This is confusing for anyone maintaining the mapping table and risks inconsistent telemetry when different models use slightly different wording. Pick one bucket (likely "small" for both, since "minor" in common usage means low severity) or at least add a comment explaining the distinction.
cmd/gadfly/findings_test.go:104-111 — hand-rolled contains reimplements strings.Contains. The test file avoids importing "strings" and instead writes a manual substring loop. In production code this would be a clear "use the stdlib" nit; in a test file it's minor but still adds ~8 lines of unnecessary logic that a future reader has to mentally verify. Just import "strings" and call strings.Contains.
cmd/gadfly/findings.go:61-71 — empty title when line fails to parse. If a model emits an unparseable line value (e.g. "line": "n/a"), ln stays 0, the prose lookup for "file:0" returns "", and if the model also omitted title, the finding gets an empty title (truncate("", 120) = ""). An empty-title finding in telemetry is confusing. Consider either skipping findings where ln == 0 after a failed parse (since the location is unusable), or falling back to a sentinel like "(unparseable line)".

⚡ Performance — No material issues found

No material issues found

The new code runs exclusively on the cold path — renderConsolidated and emit are each called exactly once at the end of a review run (main.go:212 and main.go:216). Lens outputs are typically a few KB of markdown, and the number of lenses is small (3–5). Within that context, I spot-checked for the usual suspects and found nothing material:

No N+1 / hot-loop allocations: extractStructuredFindings, stripFindingsBlock, proseParagraphs, and findingsBlock all operate on a single lens output string. The for _, r := range results loops in emit and renderConsolidated iterate over a handful of results, each capped at maxFindingsPerLens (20). No network calls, no DB queries, no channel operations inside these functions.
No unbounded growth: reports in emit grows to at most len(results) × 20 entries — trivial. renderConsolidated uses a strings.Builder — efficient.
Minor redundancies exist but are negligible at this scale: proseParagraphs calls stripFindingsBlock (which strings.Splits the output) and then strings.Splits the result again. proseParagraphs is always built even when every structured finding already carries its own detail. findingsBlock also strings.Splits the same output that stripFindingsBlock already split in renderConsolidated. On a 2–5 KB string processed once per run, none of this registers.
truncate converts to []rune (line 371–372 of emit.go), which allocates. Called at most 2× per finding (title + detail), max 20 findings per lens — again, cold path, immaterial.

No changes warranted from a performance standpoint.

🧯 Error handling & edge cases — Minor issues

Now let me verify each finding against the actual code.

Finding 1: stripFindingsBlock (lines 115-133) — I can see the state machine: when skipping is true and the loop ends without hitting isFenceClose, every line from the opening fence onward is discarded. findingsBlock (line 109) correctly returns false for unterminated blocks, but stripFindingsBlock has no equivalent guard. Confirmed.

Finding 2: Line 78-79 — severity goes through normalizeSeverity (which maps synonyms), but confidence only gets strings.ToLower(strings.TrimSpace(...)). No normalizeConfidence function exists. Confirmed.

Finding 3: Line 61-64 — when sf.Line can't parse, ln stays 0 and the finding is kept. In parseFindings (emit.go:197-199), unparseable line numbers cause continue (the finding is dropped). Confirmed — there is an inconsistency.

All three findings survive. However, none of these are "blocking" — they are edge cases (unclosed fence, model ignoring the confidence vocabulary, unparseable line numbers) that degrade gracefully in practice. The verdict should be "Minor issues".

Minor issues

stripFindingsBlock silently drops all content after an unclosed gadfly-findings fence — findings.go:115-133. If a model emits the opening fence but forgets the closing ``` (e.g., truncated output, model confusion), skipping stays true forever and every subsequent line is discarded from the rendered comment. findingsBlock (line 109) correctly returns false for unterminated blocks, but stripFindingsBlock has no equivalent guard — it just loses the prose. Since this function feeds the human-visible PR comment (consolidate.go:105), a single missing backtick trio means the reader never sees anything after the fence. Fix: treat EOF as an implicit fence close — when the loop ends with skipping == true, append the remaining lines from the fence onward (or at minimum the lines after the fence) to kept.
No confidence normalization, unlike severity — findings.go:78-79. normalizeSeverity canonicalizes severity synonyms (nit→trivial, blocker→critical, etc.), but confidence is only lowercased with strings.ToLower(strings.TrimSpace(sf.Confidence)). A model that writes "confidence": "very sure" or "confidence": "definitely" passes through as-is instead of being mapped to high/medium/low. The telemetry store gets uncontrolled vocabulary in the confidence field while severity is clean. Fix: add a normalizeConfidence function (or at minimum validate against the allowed set and default to "low" for unrecognized values).
Structured findings with unparseable line are kept at line 0; heuristic path drops them — findings.go:61-64 vs emit.go:197-199. When sf.Line can't parse as an integer, extractStructuredFindings keeps the finding with ln=0 and emits it to telemetry. But parseFindings (the heuristic fallback) does if err != nil { continue } — it drops findings with bad line numbers entirely. This inconsistency means the structured path can emit line:0 entries that the heuristic path would have filtered out. Fix: either continue (skip) when sf.Line is unparseable, matching parseFindings behavior, or require a valid line number.

_{Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5m 19s}

### 🪰 Gadfly review — `deepseek-v4-pro:cloud` (ollama-cloud) **Verdict: Blocking issues found** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary>🔒 Security — No material issues found</summary> ## Verdict: No material issues found I traced every path where model-controlled data enters the new `findings.go` pipeline and where it exits. Here's what I verified: - **`file` field (model-controlled):** Used only in `reportPayload.File` → JSON-marshalled → POSTed to the telemetry store. Never passed to `os.Open`, `exec.Command`, `filepath.Join`, or any filesystem/command API. Verified by grepping all uses of `.file` across `cmd/gadfly/` — the sole consumer is `emit.go:154` setting the telemetry struct field. - **`detail`/`title`/`severity`/`confidence` fields:** Same path — JSON-marshalled into the telemetry POST body. `severity` passes through `normalizeSeverity` (whitelist-like mapping with unknown→lowercase fallthrough), `confidence` is lowercased, `detail`/`title` are truncated. No template rendering, no HTML construction, no shell interpolation. - **`json.Unmarshal` into typed `[]structuredFinding`:** Safe. No `interface{}` or `map[string]any` deserialization that could enable type-confusion attacks. Malformed JSON → `ok=false` → graceful fallback to the existing heuristic scrape. - **`stripFindingsBlock` in `consolidate.go`:** The JSON block is stripped before the comment is rendered. Even if a model embeds something malicious in the JSON, it never reaches the PR comment. The prose outside the block is rendered as-is, which is the same trust model as before this PR (markdown rendered by the hosting platform). - **`findingsBlock` fence parsing:** The `isFenceClose` check (`strings.TrimLeft(t, "`") == ""`) correctly distinguishes a real closing fence from backticks embedded inside JSON string values. A multi-line JSON string containing a bare `` ``` `` would cause parse failure → fallback, not injection. - **No new secrets/env vars/config:** The change adds no new credential surface. `GADFLY_FINDINGS_TOKEN` and `GADFLY_FINDINGS_URL` are pre-existing. - **`recheck.go` prompt injection:** The recheck prompt now instructs the model to regenerate the `gadfly-findings` block. This is a prompt template change — the model sees it, not untrusted user input. No injection vector. </details> <details><summary>🎯 Correctness — Blocking issues found</summary> I've confirmed the finding against the actual code. The `normalizeSeverity` function at lines 170-173 indeed maps `"minor"` to `"medium"` while mapping `"minor-issue"` to `"small"` — an inconsistency since both are synonyms for the same severity tier. The test at line 90 encodes the current mapping. The finding is real, but it's not a blocking issue — it's a minor inconsistency in synonym handling that only affects models that deviate from the prompt's canonical set. **Minor issues** - **`cmd/gadfly/findings.go:170-173` — `normalizeSeverity` maps `"minor"` to `"medium"` but `"minor-issue"` to `"small"`** (verified by reading the function). These are the same semantic concept — "minor" means "lesser in importance" and is synonymous with "small"/"low", not "medium"/"moderate". A model that writes `"severity": "minor"` gets its finding classified as the middle tier, while `"severity": "minor-issue"` lands at the second-lowest tier. This inflates severity for any model that uses the bare word "minor" (which is the more natural choice — the prompt says `critical/high/medium/small/trivial`, so a model reaching for a synonym is more likely to write `"minor"` than `"minor-issue"`). Fix: move `"minor"` into the `"small"` case: ```go case "small", "low", "minor", "minor-issue": return "small" case "medium", "moderate": return "medium" ``` The test at `findings_test.go:90` encodes the current (wrong) mapping `"minor": "medium"` and would need the corresponding update. </details> <details><summary>🧹 Code cleanliness & maintainability — Minor issues</summary> All three findings are confirmed against the actual code. Let me verify the line numbers and details precisely: 1. **`normalizeSeverity`**: Lines 170-171 map `"minor"` → `"medium"`, lines 172-173 map `"minor-issue"` → `"small"`. Confirmed. 2. **Hand-rolled `contains`**: `findings_test.go` line 3 imports only `"testing"`, and lines 104-111 define a manual `contains` loop. Confirmed. 3. **Empty title on unparseable line**: Lines 61-64 leave `ln` at 0 when `strconv.Atoi` fails; lines 65-68 look up `prose["file:0"]` which returns `""`; lines 69-72 set `title = truncate("", 120)` which returns `""` (verified `truncate` at `emit.go:370-376`). Confirmed. All three findings survive verification. The verdict remains "Minor issues." --- **Minor issues** - **`cmd/gadfly/findings.go:170-173` — `normalizeSeverity` maps `"minor"` → `"medium"` but `"minor-issue"` → `"small"`.** These are near-synonyms that land on different canonical severities. A model that writes `"minor"` gets medium; one that writes `"minor-issue"` gets small. This is confusing for anyone maintaining the mapping table and risks inconsistent telemetry when different models use slightly different wording. Pick one bucket (likely `"small"` for both, since "minor" in common usage means low severity) or at least add a comment explaining the distinction. - **`cmd/gadfly/findings_test.go:104-111` — hand-rolled `contains` reimplements `strings.Contains`.** The test file avoids importing `"strings"` and instead writes a manual substring loop. In production code this would be a clear "use the stdlib" nit; in a test file it's minor but still adds ~8 lines of unnecessary logic that a future reader has to mentally verify. Just `import "strings"` and call `strings.Contains`. - **`cmd/gadfly/findings.go:61-71` — empty title when `line` fails to parse.** If a model emits an unparseable `line` value (e.g. `"line": "n/a"`), `ln` stays 0, the prose lookup for `"file:0"` returns `""`, and if the model also omitted `title`, the finding gets an empty title (`truncate("", 120)` = `""`). An empty-title finding in telemetry is confusing. Consider either skipping findings where `ln == 0` after a failed parse (since the location is unusable), or falling back to a sentinel like `"(unparseable line)"`. </details> <details><summary>⚡ Performance — No material issues found</summary> **No material issues found** The new code runs exclusively on the cold path — `renderConsolidated` and `emit` are each called exactly once at the end of a review run (`main.go:212` and `main.go:216`). Lens outputs are typically a few KB of markdown, and the number of lenses is small (3–5). Within that context, I spot-checked for the usual suspects and found nothing material: - **No N+1 / hot-loop allocations**: `extractStructuredFindings`, `stripFindingsBlock`, `proseParagraphs`, and `findingsBlock` all operate on a single lens output string. The `for _, r := range results` loops in `emit` and `renderConsolidated` iterate over a handful of results, each capped at `maxFindingsPerLens` (20). No network calls, no DB queries, no channel operations inside these functions. - **No unbounded growth**: `reports` in `emit` grows to at most `len(results) × 20` entries — trivial. `renderConsolidated` uses a `strings.Builder` — efficient. - **Minor redundancies exist but are negligible at this scale**: `proseParagraphs` calls `stripFindingsBlock` (which `strings.Split`s the output) and then `strings.Split`s the result again. `proseParagraphs` is always built even when every structured finding already carries its own `detail`. `findingsBlock` also `strings.Split`s the same output that `stripFindingsBlock` already split in `renderConsolidated`. On a 2–5 KB string processed once per run, none of this registers. - **`truncate` converts to `[]rune`** (line 371–372 of emit.go), which allocates. Called at most 2× per finding (title + detail), max 20 findings per lens — again, cold path, immaterial. No changes warranted from a performance standpoint. </details> <details><summary>🧯 Error handling & edge cases — Minor issues</summary> Now let me verify each finding against the actual code. **Finding 1**: `stripFindingsBlock` (lines 115-133) — I can see the state machine: when `skipping` is true and the loop ends without hitting `isFenceClose`, every line from the opening fence onward is discarded. `findingsBlock` (line 109) correctly returns `false` for unterminated blocks, but `stripFindingsBlock` has no equivalent guard. **Confirmed.** **Finding 2**: Line 78-79 — `severity` goes through `normalizeSeverity` (which maps synonyms), but `confidence` only gets `strings.ToLower(strings.TrimSpace(...))`. No `normalizeConfidence` function exists. **Confirmed.** **Finding 3**: Line 61-64 — when `sf.Line` can't parse, `ln` stays 0 and the finding is kept. In `parseFindings` (emit.go:197-199), unparseable line numbers cause `continue` (the finding is dropped). **Confirmed** — there is an inconsistency. All three findings survive. However, none of these are "blocking" — they are edge cases (unclosed fence, model ignoring the confidence vocabulary, unparseable line numbers) that degrade gracefully in practice. The verdict should be "Minor issues". --- **Minor issues** - **`stripFindingsBlock` silently drops all content after an unclosed `gadfly-findings` fence** — `findings.go:115-133`. If a model emits the opening fence but forgets the closing ` ``` ` (e.g., truncated output, model confusion), `skipping` stays `true` forever and every subsequent line is discarded from the rendered comment. `findingsBlock` (line 109) correctly returns `false` for unterminated blocks, but `stripFindingsBlock` has no equivalent guard — it just loses the prose. Since this function feeds the human-visible PR comment (`consolidate.go:105`), a single missing backtick trio means the reader never sees anything after the fence. **Fix:** treat EOF as an implicit fence close — when the loop ends with `skipping == true`, append the remaining lines from the fence onward (or at minimum the lines after the fence) to `kept`. - **No confidence normalization, unlike severity** — `findings.go:78-79`. `normalizeSeverity` canonicalizes severity synonyms (`nit`→`trivial`, `blocker`→`critical`, etc.), but confidence is only lowercased with `strings.ToLower(strings.TrimSpace(sf.Confidence))`. A model that writes `"confidence": "very sure"` or `"confidence": "definitely"` passes through as-is instead of being mapped to `high`/`medium`/`low`. The telemetry store gets uncontrolled vocabulary in the confidence field while severity is clean. **Fix:** add a `normalizeConfidence` function (or at minimum validate against the allowed set and default to `"low"` for unrecognized values). - **Structured findings with unparseable `line` are kept at line 0; heuristic path drops them** — `findings.go:61-64` vs `emit.go:197-199`. When `sf.Line` can't parse as an integer, `extractStructuredFindings` keeps the finding with `ln=0` and emits it to telemetry. But `parseFindings` (the heuristic fallback) does `if err != nil { continue }` — it drops findings with bad line numbers entirely. This inconsistency means the structured path can emit `line:0` entries that the heuristic path would have filtered out. **Fix:** either `continue` (skip) when `sf.Line` is unparseable, matching `parseFindings` behavior, or require a valid line number. </details> Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5m 19s

steve referenced this issue from a commit

2026-06-28 22:22:45 +00:00

fix: address swarm review of the structured-findings contract

steve added 1 commit 2026-06-28 22:22:45 +00:00

fix: address swarm review of the structured-findings contract

Build & push image / build-and-push (pull_request) Successful in 5s

Details

66682aa1d5

From PR #16's own adversarial review (graded in gadfly-reports):

- Empty/empty-result fallback: emit now scrapes the prose when the structured
  block is absent, unterminated, malformed, OR parses to zero usable findings
  (e.g. an empty [] emitted alongside real prose) — the safety net is no longer
  defeated by a contradictory empty block. (extractStructuredFindingsOrScrape)
- Unterminated fence no longer swallows the comment: extract + strip share
  findingsSpan, which only recognizes a TERMINATED block; an unterminated fence
  is left in place.
- confidence is now consumed: normalized (normalizeConfidence) and emitted to
  the telemetry store (reportPayload.confidence), not write-only.
- Canonical raw_severity everywhere: heuristic findings derive a canonical word
  from the verdict (verdict.severity) instead of the full verdict phrase.
- normalizeSeverity table made consistent ("minor"/"low" -> small).
- Exact info-string match (isFindingsOpen) instead of substring; shared
  fenceInfo predicate removes the duplicated open/close fence logic.
- Dedupe structured findings by file:line; never emit an empty title (fall back
  to detail, then the location); validate the parsed line number.
- Skip the prose scan when findings already carry detail; refresh stale
  emit.go doc comments.

Tests added for the empty-[] fallback, unterminated-fence strip, confidence
normalization, and verdict.severity. All green, gofmt clean, vet quiet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

steve merged commit 53971603d3 into main

2026-06-28 22:23:02 +00:00

steve deleted branch feat/structured-findings

2026-06-28 22:23:02 +00:00

Sign in to join this conversation.

2 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: steve/gadfly#16

feat: structured findings contract (machine-readable gadfly-findings block) #16

What

Why

How

🪰 Gadfly — live review status

claude-code/opus · claude-code — ✅ done

claude-code/opus:max · claude-code — ✅ done

claude-code/sonnet · claude-code — ✅ done

deepseek-v4-pro:cloud · ollama-cloud — ✅ done

glm-5.2:cloud · ollama-cloud — ✅ done

minimax-m3:cloud · ollama-cloud — ✅ done

ragnaros/qwen3.6-27b · ragnaros — ✅ done

🪰 Gadfly review — glm-5.2:cloud (ollama-cloud)

🪰 Gadfly review — ragnaros/qwen3.6-27b (ragnaros)

🪰 Gadfly review — claude-code/opus (claude-code)

VERDICT: No material issues found

Findings

Minor / optional

VERDICT: Minor issues

🪰 Gadfly review — minimax-m3:cloud (ollama-cloud)

VERDICT: No material issues found

VERDICT: Minor issues

VERDICT: Minor issues

🪰 Gadfly review — claude-code/sonnet (claude-code)

normalizeSeverity: "minor" maps to "medium" — severity inflation

🪰 Gadfly review — claude-code/opus:max (claude-code)

VERDICT: Minor issues

VERDICT: No material issues found

🪰 Gadfly review — deepseek-v4-pro:cloud (ollama-cloud)

Verdict: No material issues found

`claude-code/opus` · claude-code — ✅ done

`claude-code/opus:max` · claude-code — ✅ done

`claude-code/sonnet` · claude-code — ✅ done

`deepseek-v4-pro:cloud` · ollama-cloud — ✅ done

`glm-5.2:cloud` · ollama-cloud — ✅ done

`minimax-m3:cloud` · ollama-cloud — ✅ done

`ragnaros/qwen3.6-27b` · ragnaros — ✅ done

🪰 Gadfly review — `glm-5.2:cloud` (ollama-cloud)

🪰 Gadfly review — `ragnaros/qwen3.6-27b` (ragnaros)

🪰 Gadfly review — `claude-code/opus` (claude-code)

🪰 Gadfly review — `minimax-m3:cloud` (ollama-cloud)

🪰 Gadfly review — `claude-code/sonnet` (claude-code)

`normalizeSeverity`: "minor" maps to "medium" — severity inflation

🪰 Gadfly review — `claude-code/opus:max` (claude-code)

🪰 Gadfly review — `deepseek-v4-pro:cloud` (ollama-cloud)