feat: structured findings contract (machine-readable gadfly-findings block) #16
Reference in New Issue
Block a user
Delete Branch "feat/structured-findings"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
Phase 1 of the review-representation work: give each lens a machine-readable findings contract so downstream tooling stops scraping prose with a
path:lineregex.Each lens now appends a fenced
```gadfly-findingsJSON array after its prose review:Why
gadfly-reportsstore and grading get real signal.How
system-prompt.txt+ recheck prompt: ask for / regenerate the block (recheck lists only survivors).findings.go:extractStructuredFindings+stripFindingsBlock, severity normalization, prose-paragraph detail enrichment.emit.go: prefer the structured block; gracefully fall back to the heuristic scrape when a model doesn't emit a parseable one (stays provider-agnostic — weak models still contribute).consolidate.go: strip the block from the rendered comment (tooling-only, never shown).No new env vars / config; the displayed comment is unchanged except the JSON block is hidden. New
findings_test.gocovers extraction, empty-array, malformed-fallback, quoted line numbers, stripping, and severity normalization.Advisory invariant untouched — pure producer-side change.
🤖 Generated with Claude Code
Each lens now appends a fenced ```gadfly-findings JSON array of its findings ({file, line, severity, confidence, title}) after the prose review. The binary parses it exactly instead of scraping prose with a path:line regex, so telemetry carries PER-FINDING severity (not just the lens-level verdict) and downstream consolidation gets a clean contract to cluster on. - system-prompt.txt + recheck prompt: ask for / regenerate the block. - findings.go: extractStructuredFindings + stripFindingsBlock, with severity normalization and prose-paragraph detail enrichment. - emit.go: prefer the structured block; gracefully fall back to the heuristic scrape when a model doesn't emit a parseable one (provider-agnostic, robust). - consolidate.go: strip the block from the rendered comment (tooling-only). This is the enabling refactor for consensus consolidation + inline reviews. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>🪰 Gadfly — live review status
7/7 reviewers finished · updated 2026-06-28 22:47:42Z
claude-code/opus· claude-code — ✅ doneclaude-code/opus:max· claude-code — ✅ doneclaude-code/sonnet· claude-code — ✅ donedeepseek-v4-pro:cloud· ollama-cloud — ✅ doneglm-5.2:cloud· ollama-cloud — ✅ doneminimax-m3:cloud· ollama-cloud — ✅ doneragnaros/qwen3.6-27b· ragnaros — ✅ doneLive status board. Findings are posted in each model's own comment. Advisory only — does not block merge.
🪰 Gadfly review —
glm-5.2:cloud(ollama-cloud)Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
I verified the full data flow for the security lens. The newly parsed untrusted model fields (
file,line,severity,confidence,title,detail) are never used to open files, build URLs, or interpolate into commands — they only flow intoreportPayloadand are JSON-marshaled into an advisory telemetry POST to an operator-configuredGADFLY_FINDINGS_URL.json.Unmarshalinto a concrete struct withjson.Numberis safe (no unsafe deserialization, no float-precision pitfalls).stripFindingsBlockonly removes markdown lines; it doesn't evaluate anything.VERDICT: No material issues found
cmd/gadfly/findings.go:44-86andcmd/gadfly/emit.go:140-163: the untrusted structured fields are consumed only as map keys (prose[file+":"+...]) and asreportPayloadvalues marshaled byencoding/json(emit.go:150-162). No path/command/URL sink is reached by tainted model output.postJSON(emit.go:381-404) sets only fixed headers (Content-Type, conditionalAuthorizationfrom env, not from model output) and POSTs toGADFLY_FINDINGS_URL(operator env). No SSRF from model data.detailborrowing (findings.go:65-68) reuses the pre-existingpathLineReheuristic over already-stripped prose; it introduces no new disclosure surface beyond the existing prose-scrape path (the model could already exfiltrate PR content into prose that gets POSTed — this PR neither widens nor narrows that).normalizeSeverity(findings.go:164-179) lowercases unknown severities and passes them through; this is a grading-integrity concern (a model could self-report misleading severities), but it is an inherent advisory/telemetry trust property, not an injection or authz issue, and is outside the security lens's "real" threats.Nothing in my lane to block on.
🎯 Correctness — Minor issues
All three findings verified against the actual source. Keeping them.
VERDICT: Minor issues found
cmd/gadfly/findings.go:65-80— structured findings can be emitted with empty title and empty detail. When a model omitstitlein its JSON object, the code falls back totruncate(detail, 120)(line 70-71); whendetailis also absent it falls back to an exactprose[file+":"+ln]key lookup (line 66-67). If that key misses — e.g. JSON says"file": "model.go", "line": 184but the prose only referencesrun/executor.go:166, or the model gives a title-less/detail-less object whosefile:lineisn't cited verbatim in prose — bothtitleanddetailresolve to"". The finding is still appended (line 73) and sent to the store as areportPayloadwith emptyTitle/Detail(verified atemit.go:150-162). The heuristic path (parseFindings,deriveTitle/paragraphTitleatemit.go:224-297) generally yields a non-empty title because it scrapes the nearest heading/paragraph. Suggested fix: skip a structured finding when bothtitleanddetailare empty (it carries only a location + severity and no human signal), or fall back to a nearest-paragraph scrape rather than an exactfile:linekey lookup.cmd/gadfly/findings.go:170-172—"minor"normalizes to"medium", collapsing two distinct severity tiers. The canonical set iscritical/high/medium/small/trivial."minor"is mapped into the"medium"bucket alongside"moderate"(line 170), while"low"maps to"small"(line 172). A model writing"minor"(a low-severity nit) is silently inflated tomediumin telemetry. This is a judgment call rather than a hard bug, and the test atfindings_test.gocodifies"minor": "medium"— worth a deliberate decision (move"minor"to the"small"case, or keep with a justifying comment).cmd/gadfly/emit.go:144-149(pre-existing, not a regression) — heuristic-fallback severity uses full verdict phrases, not canonical severity words. When the structured block is absent,lensSev := r.verdict.label()(line 144) yields"Blocking issues found"/"Minor issues"/"No material issues found"/"Reviewed"(verified atconsolidate.go:18-29), soraw_severitysent to the store is a full sentence, notcritical/high/.... The old code did the samer.verdict.label(), so this is unchanged by the PR — flagged only so the claim that "telemetry now carries per-finding severity" is understood to hold only on the structured path, not the fallback path.🧹 Code cleanliness & maintainability — Minor issues
All four draft findings verified against the actual source. Confirming each:
findings.go:96&:126— Confirmed verbatim: both usestrings.HasPrefix(t, "```") && strings.Contains(t, findingsFence), a substring match that would match```gadfly-findings-extra. The predicate is duplicated. Keep.findings.go:146-159vsemit.go:194-217— Confirmed:proseParagraphsre-runs thepathLineRe.FindAllStringSubmatchIndex→strings.Count(...,"\n")→paragraphAt(lines, li)pipeline already inparseFindings, differing only in producing a dedup map vs a slice. Keep.findings_test.go:102-111— Confirmed: the file imports onlytestingand defines a hand-rolledcontains+containsFencewrapper; the rest of the suite (e.g.engine_test.go:101,recheck_test.go:55) usesstrings.Containsdirectly. Keep.findings.go:164-179— Confirmed:normalizeSeveritymaps"minor"→mediumbut"minor-issue"→small, and thedefaultbranch passes unknown tokens through lowercased (explicitly asserted by the test atfindings_test.go:93:"weird"→"weird"), so non-canonical severity words leak intoreportPayload.RawSeverity. The comment at lines 162-163 documents this deliberately, but it does leave the contract leaky relative to the canonical set. Keep as minor.VERDICT: Minor issues
cmd/gadfly/findings.go:96&:126— fragile info-string detection viastrings.Contains. BothfindingsBlockandstripFindingsBlockidentify an opening fence withstrings.HasPrefix(t, "```") && strings.Contains(t, findingsFence). This matches any fence whose info string merely contains the substringgadfly-findings— e.g.```gadfly-findings-extraor a legitimately tagged prose block whose info string happens to mention the token. The two functions also duplicate this predicate verbatim rather than sharing one helper. Cleanliness/robustness fix: extract a singleisFindingsOpen(line) bool(trim, checkHasPrefix("```")and that the info-string token equalsfindingsFence) and call it from both places; it removes the copy-paste and tightens the match.cmd/gadfly/findings.go:146-159—proseParagraphsduplicates the scan logic already inparseFindings(emit.go:194-217). The new helper re-implements thepathLineRe.FindAllStringSubmatchIndex→ line-index-via-strings.Count→paragraphAt(lines, li)pipeline thatparseFindingsalready uses, differing only in that it emits a dedup map keyed byfile:lineand operates on strip-block'd text. This is exactly the kind of cross-file duplication the surrounding code avoids (paragraphAt/truncate/maxFindingsPerLens/pathLineReare all shared). A sharedlocateProseRefs(out string) map[string]string(or havingparseFindingsconsume it) would keep the two extraction paths from drifting; right now a change to how refs are located must be made in two places.cmd/gadfly/findings_test.go:102-111— hand-rolledcontains/containsFencereinventstrings.Contains. The rest of the suite usesstrings.Containsdirectly (engine_test.go:101,recheck_test.go:55,tools_test.go:89). The new file imports onlytesting, then defines a 7-linecontainsplus acontainsFencewrapper that just callscontains(s, findingsFence). This is needless churn/inconsistency: usestrings.Contains(stripped, findingsFence)andstrings.Contains(stripped, "Unauthenticated endpoint")(adding thestringsimport) to match established test style.cmd/gadfly/findings.go:164-179—normalizeSeveritypass-through lets non-canonical words reachRawSeverity. The canonical set iscritical/high/medium/small/trivial, yet an unrecognized token (e.g."weird") is returned lowercased and lands inreportPayload.RawSeverityas a non-canonical word — explicitly asserted by the test atfindings_test.go:93("weird" → "weird"). The map also splits synonyms across bands inconsistently ("minor"→mediumbut"minor-issue"→small). The comment at lines 162-163 documents the pass-through as intentional, but it still leaves the contract leaky for downstream tooling that expects the canonical set. Either clamp unknowns to a documented default or note the deliberate escape hatch; either way the current behavior defeats normalization for off-vocabulary severities.No dead code or unused symbols found;
truncate,paragraphAt,maxFindingsPerLens, andpathLineReall resolve to existing definitions inemit.go.⚡ Performance — No material issues found
Verdict: No material issues found
Through the performance lens, I examined the new
findings.go, its integration inemit.go, and theconsolidate.gochange. Here's what I verified:proseParagraphsis called unconditionally (findings.go:54) even when every structured finding already carries adetailfield from the JSON. It does a fullstripFindingsBlock(line split + join), a fullpathLineRe.FindAllStringSubmatchIndex(prose, -1)with no match cap (unlikeparseFindingswhich breaks atmaxFindingsPerLens=20), and an O(n)strings.Count(prose[:loc[0]], "\n")per match (findings.go:155). This is the most wasteful new work, but it is bounded by lens output size (token-limited, typically a few KB) and the number ofpath:linereferences in prose (dozens, not thousands). Not material.Multiple full-output splits occur per structured lens:
findingsBlocksplits once (findings.go:92),proseParagraphs→stripFindingsBlocksplits again (findings.go:147), thenproseParagraphssplits a third time (findings.go:148). All bounded by output size; these are small strings, not a hot path.emitdoes not double-parse:extractStructuredFindingsandparseFindingsare mutually exclusive (the latter only runs on fallback,emit.go:141-143).consolidate.goadds onestripFindingsBlockpass per lens (consolidate.go:105) — a single linear scan, no regression.None of these rise to a material efficiency regression. The quadratic
strings.Count-per-match pattern is pre-existing inparseFindingsand bounded bymaxFindingsPerLens; the new occurrence inproseParagraphsis bounded by output size. The only thing I'd flag as a minor nit (not blocking):proseParagraphscould be deferred/lazy so it's skipped when all findings supply their owndetail, but the current cost is negligible for bounded lens outputs.🧯 Error handling & edge cases — Minor issues
VERDICT: Minor issues
cmd/gadfly/findings.go:115-133—stripFindingsBlocksilently drops all content after an unterminatedgadfly-findingsfence. When a model emits an opening```gadfly-findingsline but the output is truncated before the closing fence (a realistic failure when a model hits its output-token limit right at the block),skippingflips totrueand never resets. Every subsequent line — including legitimate prose findings — is discarded from the rendered comment inconsolidate.go:105. By contrast,findingsBlock(used by the telemetry path inemit.go) correctly returnsok=falseon an unterminated block, so the emit path degrades gracefully to the heuristic scrape; only the human-visible comment loses data. Verified by reading both functions:stripFindingsBlockhas no EOF-while-skipping guard, and the test suite (findings_test.go) has no test for the unterminated case. Suggested fix: when the loop ends withskipping == true, stop skipping at EOF without dropping buffered content (or re-emit the open-fence line so downstreambody == ""detection still works).**
cmd/gadfly/findings.go:96,126 — opening-fence detection usesstrings.Containsrather than exact info-string match.** A line like `` ```gadfly-findings-extra `` (info string with a suffix) would be treated as an opening fence by bothfindingsBlockandstripFindingsBlock. Low probability, but the substring match is loose; not blocking, just noting the match isContains`, not exact info-string equality.cmd/gadfly/findings.go:44-86— an empty structured array on a non-clean lens permanently suppresses the heuristic fallback.extractStructuredFindingsreturnsok=truefor[], soemit.go:141never falls back toparseFindings. If a model's prose describes real findings but its JSON block is erroneously[](model error, not code error), telemetry records zero findings for that lens with no recovery path. This is documented as intentional ("an empty array is a valid 'nothing found'"), so it's not a code bug, but it's an unrecoverable data-loss edge case with no sanity cross-check against the prose. No fix required if the contract is trusted; noting for completeness.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 33s
🪰 Gadfly review —
ragnaros/qwen3.6-27b(ragnaros)Verdict: Minor issues · ⚠️ 2/5 lens(es) errored — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
Verdict: No material issues found
Reviewed
findings.go,emit.go,consolidate.go,recheck.go, andsystem-prompt.txtthrough the security lens. Here's what I verified:JSON parsing of untrusted LLM output (
findings.go:50):json.Unmarshalinto a flat struct of simple types (string,json.Number) is safe in Go — no code execution, no deserialization gadgets.json.Numberstores the raw text, so integer overflow onlineis not possible.SSRF surface (
emit.go:102,emit.go:166):GADFLY_FINDINGS_URLis operator-controlled via environment variable, not derived from user/LLM input. Thehttp.Clienthas a 10s timeout. Standard telemetry pattern, no SSRF vector.Credential handling (
emit.go:167,emit.go:391-393):GADFLY_FINDINGS_TOKENis read from env and sent as a Bearer header. Not logged, not echoed to stdout. Correct.Path traversal: File names from LLM output are used only as JSON field values and map keys — never for file system access.
Prompt injection: The system prompt (
scripts/system-prompt.txt) and recheck prompt (recheck.go) are static templates with no interpolation of untrusted data.Findings block stripping (
findings.go:115-133): ThestripFindingsBlock/isFenceCloselogic correctly handles multi-backtick fences and avoids leaking the JSON block into rendered comments.Confidence normalization (
findings.go:79):confidenceis lowercased and trimmed but not validated against the whitelist (high/medium/low). An LLM could emit arbitrary strings. This is a data quality gap, not a security vulnerability — the downstream store simply receives whatever it gets.No injection, authz, credential leakage, SSRF, or unsafe deserialization issues identified in this change.
🎯 Correctness — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 6: context deadline exceeded
🧹 Code cleanliness & maintainability — Minor issues
Verdict: Minor issues found
cmd/gadfly/findings_test.go:104–111— Customcontainsfunction reinventsstrings.Contains. The test file defines a manual byte-level substring search when the standard library providesstrings.Contains(s, sub)directly. This adds maintenance burden (reviewers will wonder why it's custom) and hides an import. Replace withstrings.Contains(s, sub)and add"strings"to the import.cmd/gadfly/findings_test.go:102—containsFenceone-liner obscures intent.containsFence(s string) bool { return contains(s, findingsFence) }is a one-line wrapper around what is, after the above fix, juststrings.Contains. Consider inlining the call at its single usage site (TestStripFindingsBlock, line 70) rather than adding a named helper for one call.cmd/gadfly/findings.go:81–83vscmd/gadfly/emit.go:214–216—maxFindingsPerLenscap duplicated across two extraction paths. BothextractStructuredFindingsandparseFindingsindependently cap atmaxFindingsPerLenswith identicalif len(findings) >= maxFindingsPerLens { break }. This is harmless duplication (different code paths), but a shared post-processing helper likecapFindings(findings []finding) []findingwould eliminate the repetition and make the cap policy a single point of change.⚡ Performance — No material issues found
Verdict: No material issues found (Performance lens)
The change operates at a scale where performance concerns are immaterial:
extractStructuredFindings,stripFindingsBlock,proseParagraphs) runs once per lens per PR review — not in a hot loop. With ~5–10 lenses and LLM outputs of a few KB each, the total work is trivial.extractStructuredFindingssplits the output text into lines up to 3 times (findingsBlock→stripFindingsBlock→proseParagraphs). This could be refactored to pass pre-split lines, but at the scale of ~2–4 KB strings the allocation cost is negligible.strings.Count:proseParagraphs(line 155) doesstrings.Count(prose[:loc[0]], "\n")per regex match, which is O(N×M). This is capped atmaxFindingsPerLens(20) and M is the full prose text length (~KB). Pre-existing code inparseFindings(emit.go:207) uses the identical pattern, so there's no regression.proseParagraphscalled unconditionally: Even when all structured findings carry their ownDetail, the full prose scan runs. Could be lazy, but again — immaterial at this scale.No unbounded growth, no N+1 queries, no blocking calls on hot paths, and no quadratic behavior that exceeds the previous code's characteristics.
🧯 Error handling & edge cases — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 7: context deadline exceeded
Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 44m 13s
🪰 Gadfly review —
claude-code/opus(claude-code)Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
VERDICT: No material issues found
I reviewed this change strictly through the 🔒 Security lens, tracing how the newly-parsed data flows from untrusted source (LLM lens output, itself influenced by the attacker-controlled PR diff) to its sinks.
What I verified:
Sink trace — no dangerous use of parsed fields.
extractStructuredFindingsproducesfinding{file,line,title,detail,severity,confidence}. I grepped every consumer of those fields (emit.go:140,consolidate.go:105) and confirmed they reach only two places:emit.go:150-162→ areportPayloadthat is sent viapostJSONusingjson.Marshal(emit.go:382). Encoding is structured, so model-supplied strings (file,title,detail) cannot break out of the JSON body or inject headers. The Bearer token comes fromGADFLY_FINDINGS_TOKENenv (emit.go:167) — unchanged, not leaked.consolidate.go:105→stripFindingsBlockremoves the block before render, so the JSON is never emitted into the comment. No new markdown/XSS surface.os.Open/os.ReadFile/execsites (tools.go,engine.go,specialists.go,main.go) do not consume any structured-finding field — so the model-suppliedfilestring is never used as a filesystem path. No path traversal.Deserialization is safe.
json.Unmarshalinto a typed[]structuredFindingwith no customUnmarshalJSONandLine json.Number(findings.go:49-52) — standard stdlib decoding, no gadget/polymorphic risk. Malformed JSON returnsok=falseand falls back gracefully.No SSRF / new outbound targets. The POST URL is still derived solely from
GADFLY_FINDINGS_URL(emit.go:102); the structured block introduces no model-controlled URL.Insecure-default / authz posture unchanged. Telemetry stays opt-in (off unless
GADFLY_FINDINGS_URLset), advisory, and theverdictCleanguard (emit.go:134) still gates emission regardless of which extractor runs.Minor note (not a finding, flagging for completeness): unlike the heuristic path (constrained by
pathLineRe), the structuredfilefield accepts any non-empty string and is not length-capped (findings.go:57-58; onlytitle/detailare truncated).maxFindingsPerLensstill caps count at 20, and the value is JSON-encoded, so this is at most a telemetry-data-quality concern for the downstream store — not an injection or sink vulnerability in this codebase. No action needed within Gadfly's security boundary.Nothing material in my lane.
🎯 Correctness — Blocking issues found
I've verified all three findings against the actual source in
findings.go,emit.go, andconsolidate.go. All three are factually accurate against the code:normalizeSeverity(findings.go:170-172): confirmed"minor"→mediumwhile"minor-issue"→smalland"low"→small, so "minor" outranks "low" and splits on hyphenation.r.verdict.label()("Blocking issues found"/"Minor issues") intoRawSeverity, vs canonical words from the structured path.[]makesextractStructuredFindingsreturnok=true, suppressing theparseFindingsfallback, reachable only for non-clean verdicts.All three survive verification.
VERDICT: Minor issues
normalizeSeveritymaps "minor" →medium, which both over-inflates and contradicts the table's own neighbors (cmd/gadfly/findings.go:170-173). The canonical ladder iscritical > high > medium > small > trivial. "minor" is colloquially low-priority, yet it's mapped to the middle rung (medium), while the near-identical"minor-issue"maps tosmall(one rung lower) and"nit"/"low"map totrivial/small. So the same concept ("minor") lands on two different severities depending on hyphenation, and "minor" outranks "low" — internally inconsistent. Verified by reading the switch:case "medium", "moderate", "minor"vscase "small", "low", "minor-issue". This only bites when a model ignores the 5-word contract and emits "minor", but the telemetry severity will then be wrong/inconsistent. Suggested fix: map"minor"tosmall(alongside"minor-issue"), or drop it from the synonym set so it falls through as the raw word rather than being silently promoted tomedium.raw_severitynow carries two incompatible vocabularies depending on which path produced the finding (cmd/gadfly/emit.go:144-149). Structured findings emit canonical severities (critical/high/.../trivial); the heuristic fallback emitslensSev = r.verdict.label(), i.e. the literal verdict strings"Blocking issues found"/"Minor issues". So any model that doesn't emit a parseable block still writes verdict-label strings into the sameRawSeverityfield. Verified viaverdict.label()atconsolidate.go:18-29. The fallback vocabulary is pre-existing, but mixing it with the new canonical set in one field means downstream grading must special-case both. Worth at least normalizing the fallback to a canonical bucket (e.g. blocking→high/critical, minor→small) so the field has one vocabulary.An empty
gadfly-findingsarray suppresses the heuristic fallback even when the prose clearly lists findings (cmd/gadfly/emit.go:140-143,findings.go:44-85).extractStructuredFindingsreturnsok=truefor[], soparseFindingsis never consulted. This is reached only for non-clean verdicts (theverdictCleanguard atemit.go:134alreadycontinues). So a model that writes a "Blocking issues found" prose review withpath:linefindings but emits a[]block (e.g. a recheck that regenerates the block incorrectly) will contribute zero findings to telemetry while the human comment still shows the issues. This is the documented "empty array = nothing found" design, but it trusts the JSON block over the verdict; a defensive option is to fall back to the scrape when the array is empty and the verdict is non-clean. Low likelihood, real impact on the telemetry signal this PR exists to improve.🧹 Code cleanliness & maintainability — Minor issues
I've verified each finding against the actual source. Let me record my conclusions.
Findings 1–4 are confirmed by reading the real code:
proseParagraphs(findings.go:146-159) andparseFindings(emit.go:194-217) genuinely share the same regex-scan / dedupe /paragraphAtidiom.emit()(emit.go:137-143) now contradicts by preferring the structured block.containswhile importing onlytesting.Finding 5 (the key-mismatch aside) I'm dropping:
detailis explicitly an optional field, a key mismatch only leaves it empty (the same as when the model omits it), and the claim rests on speculation about model output formatting ("likely a full path", "often a bare") that I cannot positively confirm. It describes tolerated degradation, not a defect.VERDICT: Minor issues
Reviewing only through the code cleanliness & maintainability lens.
Findings
Copy-pasted scan loop:
proseParagraphsduplicates the core ofparseFindings—cmd/gadfly/findings.go:146vscmd/gadfly/emit.go:194. Both iteratepathLineRe.FindAllStringSubmatchIndex(...), build the samefile + ":" + linekey, dedupe on first occurrence, recompute the match's line index withstrings.Count(...[:loc[0]], "\n"), and callparagraphAt(lines, li). Only the per-match action differs (build afindingvs map key→paragraph). A future change to the path:line anchoring or paragraph logic must now be made in two places. Suggested fix: extract a helper that yields(file, lineStr, lineIndex)tuples and have both call sites use it.Stale file-level doc comment in
emit.go—cmd/gadfly/emit.go:13-17. The header still states findings are extracted "heuristically … This is best-effort signal … not a structured contract the reviewer guarantees." After this PR,emit()prefers the structuredgadfly-findingscontract and only falls back to the heuristic (emit.go:137-143). The package's introductory comment now contradicts the behavior. Suggested fix: update the comment to describe the structured-block-first / heuristic-fallback flow and cross-referencefindings.go.Test reimplements
strings.Contains—cmd/gadfly/findings_test.go:104-111. The hand-rolledcontains(andcontainsFencewrapper) duplicate stdlibstrings.Containspurely to avoid importingstrings. The rest of the package usesstringsfreely, so this diverges from the surrounding style for no benefit. Suggested fix:import "strings"and callstrings.Containsdirectly.Minor / optional
strings.HasPrefix(t, "```") && strings.Contains(t, findingsFence)appears in bothfindingsBlock(findings.go:96) andstripFindingsBlock(findings.go:126). Pairs naturally with the existingisFenceClosehelper; consider anisFindingsFenceOpen(line string) boolfor symmetry and single-point maintenance. Low churn, optional.⚡ Performance — No material issues found
I've verified the full call graph through the performance lens. Here's my review.
VERDICT: No material issues found
I reviewed the new structured-findings path specifically for efficiency regressions (allocations, hot loops, N+1, unbounded growth) and verified call frequency by reading
emit.go,findings.go,consolidate.go, andmain.go:212.Findings:
No hot path involved.
extractStructuredFindings(emit.go:140) andstripFindingsBlock(consolidate.go:105) each run exactly once per specialist result — andresultsislenses × models(single/low-double digits). Both run at the very end of a review, after minutes of LLM inference per lens. Nothing here is in a loop over diff hunks, files, or findings-at-scale.Telemetry stays at two POSTs total, not per-finding. Verified
emit(emit.go:166-176): findings are accumulated into onereportsslice and POSTed in a single/reportscall, plus one/runscall. The new per-finding severity work adds only field copies, no extra requests. Output is already capped atmaxFindingsPerLens = 20(emit.go:81,:214), so payload growth is bounded.The redundant work I found is immaterial.
proseParagraphs(findings.go:147) callsstripFindingsBlock(out)and thenextractStructuredFindings's caller path effectively walksouta few times (findingsBlock split, stripFindingsBlock split, a regex pass), and the loop atfindings.go:155doesstrings.Count(prose[:loc[0]], "\n")per match → O(n·m). Butnis a few-KB review string andmis bounded by the count ofpath:linereferences — this executes once at process end and is dwarfed by the surrounding network/LLM cost. Flagging it would be a micro-optimization nit, which the lens explicitly excludes.No unbounded growth.
proseParagraphs' map is keyed byfile:linewith first-wins dedup (findings.go:152); the findings slice is pre-sized (make([]finding, 0, len(raw)),findings.go:55) and capped at 20.The change is efficient relative to its actual call frequency. Nothing in my lane warrants a fix.
🧯 Error handling & edge cases — Minor issues
All findings in the draft check out against the actual code. Confirmed: the
verdictCleanskip at emit.go:134, the empty-array →structured=trueno-fallback path (extractStructuredFindings returns([]finding{}, true)for[]),strconv.Atoi("-5")succeeding intolnwith no guard, thestrings.Contains(t, findingsFence)fence match, and thatconfidence(findings.go:79) is never read byreportPayload(emit.go:83-95) or anywhere else.VERDICT: Minor issues
Reviewing only through the error handling & edge cases lens. The code is, on the whole, defensively written — graceful fallback on missing/malformed blocks, bounds-checked
paragraphAt/truncate,json.Numbertolerating quoted line numbers, unterminated-block handling. Two real edge-case gaps:emit.go:140-148— an empty[]block on a non-clean lens silently drops all telemetry, defeating the fallback safety net.r.verdictcomes fromparseVerdict(r.out)scraping the prose ("blocking issue" / "minor issue" — consolidate.go:33-53), whilefs/structuredcome from the JSON block. These are independent sources and can disagree. If a model writes "Blocking issues found" in prose (verdict = non-clean, so it passes theverdictCleanskip at line 134) but emits an emptygadfly-findingsarray, thenstructured == true,fsis empty, and theif !structuredfallback toparseFindingsis not taken. Result: a lens that self-reports issues contributes zero findings to the store — whereas before this PR,parseFindingswould have recovered them from the prosepath:linereferences. The empty-array case is explicitly treated asok=true(findings.go:36-39 +TestExtractStructuredFindingsEmptyArray), so the heuristic is suppressed exactly when the verdict insists there's something to report. Suggested fix: whenstructured && len(fs) == 0 && r.verdict != verdictClean, fall back toparseFindings(treat a non-clean verdict with an empty block as "model under-reported", not "nothing found").findings.go:61-63 / 69-72— no validation of the parsed line number; negative/zero lines pass through.strconv.Atoi("-5")succeeds, so a malformed/hallucinated"line": -5becomesln = -5and is emitted verbatim (and used as a map key for prose-detail lookup at line 67). The heuristic path can't produce this becausepathLineReonly matches\d+. Low impact today, but Phase 3 inline-PR-review anchoring will use these as coordinates, where a negative/zero line is bad. Suggested fix: clamp/guardln < 0(or drop the line, keeping the finding withline = 0as "file-level"), consistent with how the regex path can never go negative.Minor / not worth changing, noted for completeness:
findingsBlock/stripFindingsBlockmatch the fence withstrings.Contains(t, findingsFence)(findings.go:96, 126), so any fence whose info-string merely containsgadfly-findings(e.g. a model quoting the contract example from system-prompt.txt inside a ``` block) would be parsed/stripped. Unlikely in practice; flagging only because it's the kind of input a weak model could produce.Outside my lens:
confidenceis extracted and normalized (findings.go:79) butreportPayload(emit.go:83-95) has no confidence field and nothing else reads the struct field, so it's silently discarded at emit time; the per-finding confidence the PR adds never reaches the store.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 3m 18s
🪰 Gadfly review —
minimax-m3:cloud(ollama-cloud)Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
VERDICT: No material issues found
Security assessment of this PR (structured findings contract):
JSON parsing of untrusted model output (
findings.go:50,findings.go:25):json.Unmarshalinto a struct ofstring/json.Numberfields. Nointerface{}, no executable payload, no prototype-pollution equivalent in Go. A malformed block cleanly returnsok=falseand the caller falls back to the heuristic (emit.go:140-143). Safe.Path-injection / SSRF in telemetry POST:
GADFLY_FINDINGS_URLis operator-controlled (pre-existing) and is the only outbound target. The newf.file/f.title/f.detailfields are JSON-marshaled via Go's stdlib encoder intoreportPayload(emit.go:150-162), so no string ever lands raw inside HTTP headers or a URL path. No new SSRF surface introduced by this PR.Regex DoS:
pathLineRe(emit.go:47) is[A-Za-z0-9_./-]+\.[A-Za-z0-9]+:\d+— single linear pass, no nested quantifiers or alternation. No ReDoS.Resource bounds on attacker-influenced data:
maxFindingsPerLens = 20(pre-existing,emit.go:39) andtruncate(detail, 500)(findings.go:77,emit.go:336) cap what a hostile/poisoned model output can contribute to the POST payload. The map built byproseParagraphsis bounded by the regex match count on the model's output, which is already bounded by the model's own token limit. No unbounded memory growth.Information disclosure: The prose
detailfield is borrowed from the model's own markdown (findings.go:67) — it is content the model already produced and the heuristic path was already posting the same paragraph text viaparseFindings(emit.go:212). The structured path does not widen the disclosure surface; it just carriesseverity/confidenceper-finding instead of using the lens-level verdict.Authn on telemetry: Bearer token still attached via
postJSON(emit.go:391-393); unchanged. The new code does not touch credential handling.Prompt-injection resilience: A malicious PR could attempt to make the model emit a
gadfly-findingsblock referencing attacker-chosen paths.f.fileis a string that ends up only in the JSON body — never in a filesystem path, shell command, URL host, or template. No code in the new diff interprets it as anything other than data.f.lineis parsed asintand likewise only used as a number. No injection sink is reachable.stripFindingsBlockcorrectness for the rendered comment (consolidate.go:105,findings.go:115-133): The fenced block is removed before being rendered to the PR comment, so a poisoned-looking JSON block (e.g. one that looks like a fake "verdict" or tries to manipulate downstream tooling) is invisible to humans and never reaches the rendered markdown. Good.Nothing in this PR introduces an authn/authz gap, injection sink, SSRF, unsafe deserialization, secret leakage, missing input validation, or insecure default.
🎯 Correctness — Minor issues
VERDICT: Minor issues
Severity synonym table is internally inconsistent —
cmd/gadfly/findings.go:170-172."minor"maps to"medium"(line 170) but the hyphenated"minor-issue"maps to"small"(line 172). A model that emits"severity":"minor"gets"medium", while"minor-issue"gets"small"— opposite ends of the scale for the same semantic intent.TestNormalizeSeverityatfindings_test.go:90explicitly pins"minor"→"medium", so this is a deliberate-but-questionable choice that's now baked in. Impact: per-finding severity bucket is silently inconsistent.Finding with no usable title or detail is still emitted —
cmd/gadfly/findings.go:65-80. The only filter on the structured path isfile == ""(line 58). When the JSON omits bothtitleanddetailand the prose has no matchingpath:line,detailstays"", thentitle = truncate("", 120) == "", and the finding is appended at line 73 with empty title and detail. Telemetry then POSTs areportPayloadwith emptyTitleandDetail; the store will store it. Impact: phantom findings in the grading store with no human-readable anchor. Fix: drop the finding (mirror the file-empty branch at line 58) or require non-empty title before emitting.ln=0fallthrough produces a phantom key —cmd/gadfly/findings.go:61-67. Whensf.Linedoesn't parse (e.g."line":"abc"),lnstays0(line 62 silently swallows the error). The prose-lookup key at line 67 becomes"file.go:0", which never matches prose, and the finding is emitted withline=0. Impact: a finding pointing at line 0 — not a real source line — is recorded. Fix:continuewhenLinefails to parse, mirroring the file-empty branch.findingsBlock/stripFindingsBlockdo substring matching on the info-string —cmd/gadfly/findings.go:96andcmd/gadfly/findings.go:126.strings.Contains(t, findingsFence)matches any fence whose info-string merely contains"gadfly-findings", e.g.```gadfly-findings-v2or```some-gadfly-findings-thing. Wider match than the contract documents. Fix: require the info-string to equalfindingsFence(modulo leading whitespace), or at least be a token boundary.isFenceCloseaccepts 4+ backticks as a close fence even after a 3-backtick open —cmd/gadfly/findings.go:137-140.strings.TrimLeft(t, "")` strips all leading backticks, so "````" passes as a valid close. CommonMark requires the close fence to match the open fence in length, so a 4-backtick close after a 3-backtick open is not a valid close and the block should be treated as unterminated. Low-impact for real models but worth tightening if this is ever exposed to adversarial input.🧹 Code cleanliness & maintainability — Minor issues
VERDICT: Minor issues
Findings:
cmd/gadfly/findings_test.go:104-110—containsreinventsstrings.Contains. The new test file defines its own substring helper (contains) that duplicates Go stdlibstrings.Contains. The codebase already usesstrings.Containsin 20+ callsites (includingfindings.goitself at lines 96 and 126), so this is gratuitous. Suggested fix: replacecontains(s, sub)withstrings.Contains(s, sub)and add"strings"to the existingimport "testing"block. The package isn't currently imported in this test file, so the import block does need to grow — that's a one-line edit.cmd/gadfly/findings_test.go:102—containsFenceis a single-use one-liner.containsFenceis only called once (line 70, insideTestStripFindingsBlock). Even if you keepcontainsfor some reason,containsFenceadds no value beyondstrings.Contains(s, findingsFence)— which is what the call site would become after the fix above. Inlining it would remove the helper entirely.cmd/gadfly/findings.go:131—stripFindingsBlock's docstring claims "verbatim" but the function trims trailing newlines. The function comment says "Other content is preserved verbatim," butreturn strings.TrimRight(strings.Join(kept, "\n"), "\n")strips trailing newlines from the joined output unconditionally.TestStripFindingsBlockNoBlock(line 81) passes only because its input happens to have no trailing newline; for inputs that do, the output is shorter than the input would suggest. This is fine, but the comment should say so (e.g., "...preserved verbatim except trailing newlines are normalized away"). Low-priority doc nit.cmd/gadfly/findings.go:115-133—stripFindingsBlockduplicates fence-scanning logic in a structurally different style fromfindingsBlock. Verified:findingsBlock(lines 91-110) scans with two separate loops and returns the inner text;stripFindingsBlockscans with a singleskippingboolean and keeps-or-drops. Both functions answer "where are the findings-block fences?", but any tightening of the fence grammar (e.g., requiring exactly `` ```gadfly-findings ```` with no trailing whitespace, or supporting nested fences) would need to be made in two places. A shared[]fenceRangescan that returns start/end line indexes, used by both functions, would prevent drift. Low priority — not a bug today.**
cmd/gadfly/recheck.go:52-56—+ "" +concatenation in a raw string reads awkwardly.** The prompt-building lines splice literal backticks into backtick-delimited raw strings via+ "" +. This is the standard idiom for embedding a backtick in a Go raw string (raw strings can't contain backticks), so it works — but the first+ "" +at line 52 is a sharp edge for anyone unfamiliar with the trick. Adding a one-line comment at that site ("+ "" +splices a literal backtick into the raw string") would help future readers. Same pattern appears inscripts/run.sh, so this isn't a new pattern in the codebase, just a new instance.examples/reusable.yml:14-21and:53-54— documentation churn bundled with the structured-findings contract change. Verified by reading the diff: the reusable.yml comment rewrite (changing@v1pinning guidance to talk aboutact_runnerref caching) has nothing to do with the new `` ```gadfly-findings ```` contract incmd/gadfly/findings.go,cmd/gadfly/emit.go,cmd/gadfly/consolidate.go, orscripts/system-prompt.txt. Both edits are fine on their merits, but mixing them in one PR makes bisecting either change harder. Process/cleanliness, not a code defect — would suggest splitting.⚡ Performance — Minor issues
VERDICT: Minor issues
Findings (through the Performance lens):
cmd/gadfly/findings.go:155— avoidable O(N·M) per-lens scan inproseParagraphs. The lineli := strings.Count(prose[:loc[0]], "\n")is invoked inside thepathLineRematch loop. Each call re-scans from the start of the prose to the current match byte offset, so the total work is O(N · M) where N = prose bytes and M = number ofpath:linematches. A single forward pass that tracks the running newline count (advance on each match) brings this to O(N). Verified againstfindings.go:146–158.cmd/gadfly/findings.go:54,147—proseParagraphs(and thereforestripFindingsBlock+ the full regex sweep) runs unconditionally, even when every structured finding already carries adetail.extractStructuredFindingsparses the JSON first, then builds the prose map before iteratingraw. When everysf.Detailis non-empty theprose[file+":"+ln]lookup on line 67 is never hit, yetstripFindingsBlockandpathLineRe.FindAllStringSubmatchIndexover the whole output already ran. On a long adversarial-review output this is pure waste on the per-lens hot path. Lazily build the map on first miss, or skip the call whensf.Detail != ""for every element ofraw.cmd/gadfly/findings.go:73–83— work done for a finding that will be dropped at the cap. The full finding struct (includingtruncate(detail, 500)on the possibly-borrowed prose) is built before thelen(findings) >= maxFindingsPerLenscheck. Cheap individually, but combined with finding #2 it means the cap doesn't short-circuit the prose/enrichment work; reordering to check the cap first would let the hot loop exit earlier.cmd/gadfly/consolidate.go:105+cmd/gadfly/findings.go:147—stripFindingsBlock(r.out)runs twice for the same lens output.renderConsolidatedstrips for the rendered comment, and the subsequentemit→extractStructuredFindings→proseParagraphsstrips again on the same string. A single pre-computed stripped view (e.g. computed inspecialistResultonce and reused) would halve the fence-line scan. Minor in absolute terms; flagging only because it lands on the per-lens hot path and is trivially cacheable.Nothing in this review touches correctness, severity semantics, or the prose-rendering path — those are out of my lane and left to other reviewers.
🧯 Error handling & edge cases — Minor issues
VERDICT: Minor issues
Findings through the error-handling / edge-case lens (all verified by reading the diff and the surrounding code in
cmd/gadfly/):cmd/gadfly/emit.go:144— heuristic fallback writes a non-bucket severity.r.verdict.label()returns human phrases ("No material issues found","Minor issues","Blocking issues found","Reviewed") — seeconsolidate.go:18-29. These get written verbatim intoRawSeveritywheneverf.severity == "". The structured path normalizes vianormalizeSeverityinto the canonicalcritical/high/medium/small/trivialset, but the fallback path bypasses that, so a lens that omits the new block emits"Minor issues"into a field whose contract is the 5-bucket set. Suggested fix: mapverdictBlocking→"critical",verdictMinor→"small",verdictClean→"trivial"(clean lenses are skipped above so the last is academic),verdictUnknown→""(or whatever the store treats as unknown) instead of calling.label().**
cmd/gadfly/findings.go:115— unterminated\``gadfly-findingsblock silently swallows the rest of the comment.**findingsBlockcorrectly returnsok=falsefor an unterminated block, butstripFindingsBlockjust setsskipping = trueand never resets it without seeing anisFenceClose. The model is asked to terminate the block, butfindings_test.goshowsstripFindingsBlockisn't even exercised for the unterminated case. A model that opens the fence and forgets the closing fence will have its entire tail (verdict line, sign-off, anything after) silently dropped from the rendered consolidated comment. Suggested fix: in theskippingbranch, ifisFenceClose(ln)is false AND we hit EOF insideskipping, also drop the original opening fence line so the prose around it isn't accidentally kept; or capskippingto the same logicfindingsBlock` uses so an unterminated opener is left in place rather than eating the rest.cmd/gadfly/findings.go:44— silent finding loss on a structured block whose entries all lackfile. Every entry that fails thefile == ""guard iscontinue'd, but the function still returnsok=true. The caller inemit.go:141-142then never falls back to the heuristic scrape, so anypath:linefinding the model wrote in prose is lost. This contradicts the PR's "graceful fallback" framing — the fallback is only taken when the JSON is unparseable, not when every JSON entry is empty/invalid. Suggested fix: if the raw slice was non-empty but every entry was dropped, returnok=falseso the caller runsparseFindings.cmd/gadfly/findings.go:62—lineofnullor a negative number silently becomes 0.json.Number.String()fornullis the string"null",strconv.Atoi("null")errors, solnstays at the zero value0. Same for non-numeric strings. A0line is then written toreportsand looks like a real hit at line 0. Suggested fix: onerr != nil, eithercontinue(drop the entry) or return an error so the caller falls back; at minimum, rejectln <= 0before accepting the finding.Verified by reading
findings.go,emit.go,consolidate.go, andfindings_test.godirectly; no claims rely on the diff alone.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 25s
🪰 Gadfly review —
claude-code/sonnet(claude-code)Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
Verdict: No material issues found
I traced the full data flow introduced by this PR through a security lens.
Untrusted model output → telemetry POST
extractStructuredFindingsparses LLM-generated JSON and feedsfile,title,detail,severity,confidenceintoreportPayload, which is POSTed to the telemetry API (emit.go:150-162). I checked whether any of these values reach gadfly's own filesystem or a shell:tools.gogoes throughr.resolve()(sandboxed to repo root viafilepath.Clean+ prefix check). Thefinding.filefield from structured output is never passed tor.resolve()— it only populatesreportPayload.Fileand is used as a map key inproseParagraphs. No path traversal is possible from this PR's code.encoding/jsonescapes special characters; all string fields are serialized safely.titleis truncated to 120 runes,detailto 500 runes before transmission (findings.go:71,77).The
filefield already flowed to telemetry from the pre-existingparseFindingsheuristic scrape; this PR adds a more structured path for the same data without widening the attack surface meaningfully.stripFindingsBlock/ content injection into PR commentThe PR comment renders
stripFindingsBlock(r.out)(consolidate.go:105), not any parsedfindingfield — so model-controlledtitle/detailvalues never reach the rendered Gitea comment.isFenceClosedoes its ownTrimSpaceregardless of whether it receives the raw or trimmed line, so the strip logic is consistent.Prompt injection in recheck (
recheck.go:97,buildRecheckTask): raw first-pass output is embedded unsanitized into the recheck prompt. This pre-dates the PR and is not introduced or materially worsened here (thegadfly-findingsblock simply rides along in what was already an unescaped draft embedding).Nothing in my security lane introduced by this change.
🎯 Correctness — Minor issues
The finding is confirmed. The code at line 170 places
"minor"in the"medium"case, while"minor-issue"at line 172 maps to"small"— the inconsistency is real. The test at line 90 asserts"minor": "medium", locking in the current behavior.Minor issues
normalizeSeverity: "minor" maps to "medium" — severity inflationcmd/gadfly/findings.go:170"minor" is a ubiquitous code-review adjective meaning "small, non-blocking issue." A model that uses "minor" to label a nit gets that finding posted to the telemetry store as
severity=medium— one tier higher than intended.The inconsistency is visible within the same function: "minor-issue" → "small" but "minor" → "medium". The hyphenated form lands at the correct tier; the natural word lands one step up. Since the PR's stated goal is "the gadfly-reports store and grading get real signal," an incorrect normalization directly undermines that goal.
Fix: move
"minor"to the"small"case:The test at
findings_test.go:90asserts the current (wrong) mapping"minor": "medium"— that assertion should flip to"minor": "small"alongside the code fix.🧹 Code cleanliness & maintainability — Minor issues
All four findings check out against the actual files. Here is the verified review:
Minor issues
emit.go:13–17— package comment now contradicts the code. The doc block says "Findings are extracted heuristically from each lens's markdown … not a structured contract the reviewer guarantees." This PR makes the structured block the preferred path (emit.go:140–143) and the prose scrape a fallback; the heuristic-only description and the "not a structured contract" claim are both now wrong and will mislead future readers.findings.go:164–179—normalizeSeverityevaluatesstrings.ToLower(strings.TrimSpace(s))twice. Go evaluates the switch expression once, but thedefaultbranch at line 177 re-evaluates it independently withreturn strings.ToLower(strings.TrimSpace(s)). The fix is one line:lower := strings.ToLower(strings.TrimSpace(s))before the switch, andreturn lowerin the default.findings.go:96andfindings.go:126— duplicated fence-open predicate. The expressionstrings.HasPrefix(t, "```") && strings.Contains(t, findingsFence)is copy-pasted verbatim in bothfindingsBlockandstripFindingsBlock. The fence-close check is already factored intoisFenceClose; the fence-open deserves the same treatment. If the info-string ever changes it is a two-site edit.findings_test.go:102–111— hand-rolledcontainsinstead ofstrings.Contains. The test file avoids importing"strings"by defining its own byte-by-byte substring search and acontainsFencewrapper. Adding"strings"and usingstrings.Containsdirectly is idiomatic Go and removes ~8 lines of noise.⚡ Performance — No material issues found
No material issues found
The new code (
findings.go,emit.gochanges) touches only the end-of-run telemetry path (emit()is called once, after all lens results are collected, before process exit). Performance concerns here have negligible real-world impact.Verified findings:
proseParagraphsline-counting loop (findings.go:154):strings.Count(prose[:loc[0]], "\n")rescans the string from byte 0 for every regex match — technically O(N×M) where N=prose length and M=path:line match count. I confirmed this is the actual code. However,extractStructuredFindingsis invoked at most once per non-clean lens inemit()(seeemit.go:140),emit()itself is called once per run (main.go:216), and lens output has no enforced size cap but is bounded in practice by LLM output. For typical inputs (a few KB of prose, ≤20 path:line matches), this is microseconds. Not material.Repeated
strings.Splitper lens:findingsBlock,stripFindingsBlock(insideproseParagraphs), andproseParagraphsitself each callstrings.Split(out, "\n")on the same string — three allocations per lens.consolidate.goadds a fourth viastripFindingsBlock(r.out). All four happen once per lens outside any review loop. Inconsequential.proseParagraphscalled on empty arrays (findings.go:54): the map is built even whenrawis empty, so the work is discarded unused. Wasted O(prose) work per lens that emits[]. Trivial given call frequency.None of these cross the bar for a real performance issue — they are off the hot path entirely.
🧯 Error handling & edge cases — No material issues found
No material issues found
I verified the following paths carefully:
findingsBlock— unterminated fence (findings.go:105–110): returns("", false)when no close fence is found, soextractStructuredFindingsreturns(nil, false)and the caller falls back to the heuristic scrape. Correct.Empty body inside a
gadfly-findingsfence:start+1 == jproducesbody="",json.Unmarshal([]byte(""), …)errors, returns(nil, false). Falls back correctly.json.Numberwith string-encoded line (findings.go:63):json.Number's underlying kind isreflect.String, so"line":"42"(JSON string) decodes tojson.Number("42"), andsf.Line.String()→"42"→strconv.Atoi→ 42.TestExtractStructuredFindingsStringLinecovers this.json.Numberzero value / null / float:json.Number("")orjson.Number("1.5")→strconv.Atoierrors →ln=0. Prose-detail lookup forfile:0misses, detail stays"". Finding still emitted. No panic.isFenceClosecorrectness (findings.go:138–140):strings.TrimLeft(t, "``")trims all leading backticks; a line with only backticks (3+) correctly returns true; any info string after the backticks prevents a false match. Handles ≥3 backtick open/close pairs correctly.stripFindingsBlockunterminated block (findings.go:116–133): if no close fence appears,skippingstays true through end-of-input — all content after the opening fence is stripped from the rendered comment, silently. In practice the system prompt instructs "Keep the block last", so there should be no content after it. This is consistent withfindingsBlockreturning(nil, false)for the same input, so the telemetry path correctly falls back to heuristic scraping on the originalr.out. Not a real-world risk given the contract, but worth being aware of if a model truncates mid-block.paragraphAtbounds (emit.go:321): guardsli < 0 || li >= len(lines). Byte-index arithmetic inproseParagraphsis safe becausepathLineReonly matches ASCII.maxFindingsPerLenscap: applied inextractStructuredFindings(findings.go:82) as well as the existing heuristic path — no payload explosion possible.Empty title+detail degenerate finding: if both
sf.Titleandsf.Detailare empty and no prose match is found,title=""is submitted to telemetry. Not a crash; the store receives a finding with an empty title. Theif file == ""guard (findings.go:58) at least ensures location is present. Benign but worth noting as a telemetry hygiene point.No panics, no swallowed errors on unhappy paths, no missing cleanup. The graceful-degradation design (
ok=false→ heuristic scrape) is sound.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 7m 21s
🪰 Gadfly review —
claude-code/opus:max(claude-code)Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
I've traced the taint path through the new code. Here's my security review.
VERDICT: No material issues found
I reviewed this change specifically for the security-relevant question it raises: structured findings are parsed from LLM output, which is influenced by the (potentially attacker-authored) PR diff under review, and that data flows into a telemetry POST. I traced that path end-to-end.
extractStructuredFindings(cmd/gadfly/findings.go:44) populatesfile/line/title/detail/severity/confidencestraight from model-controlled JSON, and those land inreportPayloadand are sent viapostJSON(cmd/gadfly/emit.go:150,:381).postJSONusesjson.Marshal, which escapes control characters/quotes, so model-supplied strings cannot break out of the JSON body — no injection into the telemetry call. Verified by readingpostJSON.base+"/runs"/base+"/reports"wherebasecomes only fromGADFLY_FINDINGS_URL(operator env), never from the model. The model-controlledfilefield is payload data, not part of the URL or any path. Verified atemit.go:102and:169-173.GADFLY_FINDINGS_TOKEN(env) as before; nothing in this diff logs it or routes secrets into the new fields.detail/titlecan carry repo prose, but that was already true of the pre-existingparseFindingsscrape — no new leakage surface.stripFindingsBlockis strip-only. An unterminated or malicious```gadfly-findingsblock makes it drop content, never reveal more, so hiding the tooling block can't be subverted into leaking it (findings.go:115). The consolidate.go change only removes the block from the rendered comment — a mild positive.One thing I considered and rejected as a finding: unlike the regex-extracted path (constrained to
[A-Za-z0-9_./-]), the structuredfilefield is unvalidated/unbounded model text. Within this binary that's harmless (JSON-escaped on egress, count capped atmaxFindingsPerLens). It could matter only if the downstreamgadfly-reportsstore renders/usesfileunsafely — but that store is out of this repo's scope, so it's not a security defect in this change.Nothing actionable in the security lens.
🎯 Correctness — Minor issues
All four findings verify against the actual code:
raw_severity— confirmed:emit.go:146-148writesf.severity(canonicalcritical/high/medium/small/trivial) for structured findings butlensSev = r.verdict.label()for heuristic ones, andlabel()(consolidate.go:18-29) returns the full sentences"Minor issues"/"Blocking issues found". Two vocabularies into oneRawSeverityfield (emit.go:93).normalizeSeverity("minor") → "medium"— confirmed atfindings.go:170, while"low"/"minor-issue"route tosmall.confidencecollected but dropped — confirmed: set atfindings.go:79, butreportPayload(emit.go:83-95) has no confidence field; grep shows it is never read into any payload.extractStructuredFindingsreturnsok=truefor[](tested), soemit.go:141skips the prose fallback.VERDICT: Minor issues
Correctness lens. The change compiles (all referenced helpers resolve, struct fields are additive/keyed) and the advisory invariant is untouched. The issues below are semantic/data-correctness in the telemetry the PR exists to feed.
Heterogeneous
raw_severityvocabulary —emit.go:144-160+findings.go:164. The two code paths write incompatible value spaces into the sameraw_severityfield. A model that emits a structured block produces canonical severities (critical/high/medium/small/trivial); a model that doesn't falls back tolensSev = r.verdict.label(), which is"Minor issues"/"Blocking issues found". Before this PR every finding'sraw_severitywas a verdict label — one consistent vocabulary. Now the field mixes per-finding severities and verdict-label sentences depending on which model emitted the block. The PR's stated goal is "grading gets real signal" from per-finding severity; any downstream bucketing keyed on severity must now special-case both vocabularies, or it silently mis-buckets the fallback rows. Suggest mapping the verdict to a canonical severity in the fallback (e.g. blocking→high/critical, minor→small) so the column has one vocabulary.normalizeSeverity("minor") → "medium"—findings.go:170. Against the declared 5-level scale (critical > high > medium > small > trivial): bare "minor" is colloquially a low severity, and the table itself routes"minor-issue"→smalland"low"→small. Mapping bare"minor"to the middle of the scale over-escalates and is internally inconsistent. It also contradicts Gadfly's own verdict vocabulary, where"Minor issues"is the lowest non-clean verdict — so this inflates the telemetry severity of exactly the off-spec/weak models the synonym table exists to serve. Suggest"minor"→small. (Deliberate perfindings_test.go, so confirm intent — but the mapping looks wrong.)confidenceis parsed, normalized, then dropped —findings.go:79populatesfinding.confidence, butreportPayload(emit.go:83-95) has no confidence field and nothing else reads it. So the per-finding confidence the system-prompt and recheck now demand never reaches the store in this PR. The PR body frames confidence as signal for telemetry/grading; here it reaches nothing. Likely intentional staging for a later phase — worth a one-line confirmation, otherwise it's collected-but-discarded. (Borderline maintainability.)(Lower confidence) Empty structured block suppresses the heuristic —
emit.go:140-143.extractStructuredFindingsreturnsok=truefor a parseable but empty[], so the fallback scrape never runs. If a non-compliant model writes prose findings + a non-clean verdict but emits[](or an under-populated array), that lens records zero/reportsrows where the pre-PR prose scrape would have captured them — andparseVerdictstill readsBlocking/Minorfrom the prose, so/runsand/reportsdisagree. Gadfly explicitly targets weak/provider-agnostic models, so this mismatch is plausible. Consider falling back (or merging) when the block is empty but the verdict is non-clean.No blocking correctness defects: parsing, fence stripping (both fences removed; unterminated block correctly yields
ok=false),json.Numberline handling, and dedup-free structured emission are all behaviorally sound.🧹 Code cleanliness & maintainability — Minor issues
All three findings verify against the actual code:
confidencewrite-only: confirmed —findings.go:79sets it,emit.go:62declares it, butreportPayload(emit.go:83-95) has no confidence field and the grep shows no downstream reader.findings.go:96andfindings.go:126, withisFenceClosehelper at line 137.strings.Contains: confirmed atfindings_test.go:104-111.All findings survive verification.
Verified against the checked-out code. My lens is code cleanliness & maintainability only.
VERDICT: Minor issues
finding.confidenceis parsed, normalized, tested — but never consumed (write-only field).cmd/gadfly/findings.go:79populatesconfidence, andemit.go:62declares the struct field, but no consumer reads it:reportPayload(emit.go:83-95) has no confidence field, and a grep forconfidenceacross non-testcmd/gadflyshows only the prompt text, the JSON tag, and the assignment — nothing downstream. So the recheck prompt regenerates it and the parser carefully lowercases it, then it's dropped at the/reportsboundary. Either wire it intoreportPayloadin this change, or add a one-line comment that it's collected ahead of a later consumer — otherwise the next maintainer can't tell whether the omission is a bug.Duplicated open-fence predicate. The exact expression
strings.HasPrefix(t, "```") && strings.Contains(t, findingsFence)appears in bothfindingsBlock(findings.go:96) andstripFindingsBlock(findings.go:126). There's already anisFenceClosehelper (findings.go:137) for the close side; anisFindingsFenceOpen(line)sibling would remove the copy-paste and keep the two detectors from drifting. Low-churn, in keeping with the existing helper pattern.Test reinvents
strings.Contains.findings_test.gohand-rolls a substring loop incontains(findings_test.go:104-111, wrapped bycontainsFence) instead of importingstrings. The rest of the package usesstringsfreely; the manual loop is just more code to read with no benefit. Replace withstrings.Contains. Minor.⚡ Performance — No material issues found
I've verified the call paths. My performance review:
VERDICT: No material issues found
I reviewed this change through the Performance lens and traced every call site of the new code.
What I checked:
extractStructuredFindings,proseParagraphs,findingsBlock,stripFindingsBlockall operate on a single lens's review-markdown string (a few KB), and every call site runs once per lens at the end of the run — never inside a hot loop, retry, or per-source-file iteration:emit.go:140— once per non-clean lens in the post-review telemetry pass.consolidate.go:105— once per lens at render time.main.go:212—renderConsolidatedcalled once.findingsBlockreturns early (findings.go:45-48) beforeproseParagraphsruns; when the block is malformed JSON, it returns afterjson.Unmarshal(findings.go:50-52) — again before the prose work. So a model that ignores the contract pays only onestrings.Split, then falls through to the pre-existingparseFindings. No double-scan.maxFindingsPerLens(20) (findings.go:81-83), matching the existing heuristic cap, so a pathological lens output can't blow up allocations or the/reportspayload.findingsHTTPTimeoutare unchanged; the diff adds only in-memory string parsing before them.Considered and dismissed (not material):
proseParagraphs(findings.go:155) usesstrings.Count(prose[:loc[0]], "\n")per regex match — technically O(n·m). But it's bounded by review-string length, runs once per lens, and is the same pattern already inparseFindings(emit.go:207), so it's not a regression.r.outis line-split up to three times per lens on the structured path (findingsBlock, thenstripFindingsBlock+ a re-split insideproseParagraphs). On inputs this small and called this rarely, deduping the split would be a micro-optimization, not a real win — not worth flagging.Nothing in my lane is materially wrong here.
🧯 Error handling & edge cases — Minor issues
All three findings are confirmed against the actual code:
extractStructuredFindingsreturnsok=trueonce the JSON parses (findings.go:85), and elements with a blankfilearecontinued (findings.go:57-60), so an empty-after-filter result yields(fs=[], structured=true), and theif !structuredfallback at emit.go:141 is skipped.stripFindingsBlockonly clearsskippingonisFenceClose, dropping all lines to EOF on an unterminated fence, whilefindingsBlockreturnsok=false(findings.go:109) for the same input — a real asymmetry, and the stripped output feeds the rendered comment at consolidate.go:105.seenmap, unlikeparseFindings(emit.go:192,201-205).VERDICT: Minor issues
All findings below are within the error-handling / edge-cases lens on the new
findings.goand itsemit.gointegration. This is advisory telemetry plumbing —emitonly logs to stderr, never touches stdout, and never callsos.Exit(verified, emit.go:97-177), so nothing here can crash the process or affect a review/CI. Nothing is blocking, but two unhappy-path behaviors are worth fixing.emit.go:140-143— a structurally-valid-but-empty block suppresses the heuristic fallback, silently dropping findings.extractStructuredFindingsreturnsok=truewhenever the JSON array parses (findings.go:85), even if every element is then discarded. In the loop atfindings.go:56-60, any element with a blankfileiscontinued. So a weak model that emits[{"severity":"high","title":"..."}](valid JSON, nofile) yieldsfs=[]andstructured=true— theif !structuredfallback toparseFindings(emit.go:141) never runs, even though the prose almost certainly contains scrapablepath:linerefs the heuristic would have recovered. Net effect: a non-clean lens contributes zero per-finding telemetry, undercutting the PR's "weak models still contribute" goal for exactly the malformed case it's meant to tolerate. Suggest: whenstructured && len(fs)==0on a non-clean verdict, fall back toparseFindings(the truly-empty[]"nothing found" case is already filtered by theverdictCleanguard at emit.go:134, so this wouldn't reintroduce noise on clean lenses).findings.go:115-133— an unterminatedgadfly-findingsfence swallows all following content from the rendered comment. InstripFindingsBlock, onceskippingis set on the opening fence it is only cleared by anisFenceClose. If the model omits the closing ``` (or the block isn't last), every subsequent line is dropped from the human comment atconsolidate.go:105. Note the asymmetry:findingsBlocktreats the same unterminated input as no block (ok=false, findings.go:109), so the two functions disagree about where the block ends. Low probability, but a silent content-swallow on malformed input. Suggest: if no close fence is found, keep the lines (mirrorfindingsBlock's "unterminated ⇒ no block" stance) rather than discarding to EOF.Minor —
findings.go:56-85does not dedupe by(file,line). The heuristicparseFindingsdedupes via itsseenmap (emit.go:192,201-205); the structured path does not, so a model that repeats a finding in the JSON array emits duplicate/reportsrows. Telemetry data-quality only.Non-issues checked and cleared: no panic paths —
strconv.Atoion a blank/garbage/nulllineis error-guarded toln=0(findings.go:61-63); the quoted-"42"case works becausejson.Numberaccepts a numeric string literal;truncate's rune-slicing and allprose[...]indexing (indices come fromFindAllStringSubmatchIndexover the same string) are in-bounds; malformed JSON correctly returnsok=falsefor clean fallback (findings.go:50-52).Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 3m 55s
🪰 Gadfly review —
deepseek-v4-pro:cloud(ollama-cloud)Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
Verdict: No material issues found
I traced every path where model-controlled data enters the new
findings.gopipeline and where it exits. Here's what I verified:filefield (model-controlled): Used only inreportPayload.File→ JSON-marshalled → POSTed to the telemetry store. Never passed toos.Open,exec.Command,filepath.Join, or any filesystem/command API. Verified by grepping all uses of.fileacrosscmd/gadfly/— the sole consumer isemit.go:154setting the telemetry struct field.detail/title/severity/confidencefields: Same path — JSON-marshalled into the telemetry POST body.severitypasses throughnormalizeSeverity(whitelist-like mapping with unknown→lowercase fallthrough),confidenceis lowercased,detail/titleare truncated. No template rendering, no HTML construction, no shell interpolation.json.Unmarshalinto typed[]structuredFinding: Safe. Nointerface{}ormap[string]anydeserialization that could enable type-confusion attacks. Malformed JSON →ok=false→ graceful fallback to the existing heuristic scrape.stripFindingsBlockinconsolidate.go: The JSON block is stripped before the comment is rendered. Even if a model embeds something malicious in the JSON, it never reaches the PR comment. The prose outside the block is rendered as-is, which is the same trust model as before this PR (markdown rendered by the hosting platform).findingsBlockfence parsing: TheisFenceClosecheck (strings.TrimLeft(t, "") == ""`) correctly distinguishes a real closing fence from backticks embedded inside JSON string values. A multi-line JSON string containing a bare```would cause parse failure → fallback, not injection.No new secrets/env vars/config: The change adds no new credential surface.
GADFLY_FINDINGS_TOKENandGADFLY_FINDINGS_URLare pre-existing.recheck.goprompt injection: The recheck prompt now instructs the model to regenerate thegadfly-findingsblock. This is a prompt template change — the model sees it, not untrusted user input. No injection vector.🎯 Correctness — Blocking issues found
I've confirmed the finding against the actual code. The
normalizeSeverityfunction at lines 170-173 indeed maps"minor"to"medium"while mapping"minor-issue"to"small"— an inconsistency since both are synonyms for the same severity tier. The test at line 90 encodes the current mapping. The finding is real, but it's not a blocking issue — it's a minor inconsistency in synonym handling that only affects models that deviate from the prompt's canonical set.Minor issues
cmd/gadfly/findings.go:170-173—normalizeSeveritymaps"minor"to"medium"but"minor-issue"to"small"(verified by reading the function). These are the same semantic concept — "minor" means "lesser in importance" and is synonymous with "small"/"low", not "medium"/"moderate". A model that writes"severity": "minor"gets its finding classified as the middle tier, while"severity": "minor-issue"lands at the second-lowest tier. This inflates severity for any model that uses the bare word "minor" (which is the more natural choice — the prompt sayscritical/high/medium/small/trivial, so a model reaching for a synonym is more likely to write"minor"than"minor-issue"). Fix: move"minor"into the"small"case:The test at
findings_test.go:90encodes the current (wrong) mapping"minor": "medium"and would need the corresponding update.🧹 Code cleanliness & maintainability — Minor issues
All three findings are confirmed against the actual code. Let me verify the line numbers and details precisely:
normalizeSeverity: Lines 170-171 map"minor"→"medium", lines 172-173 map"minor-issue"→"small". Confirmed.Hand-rolled
contains:findings_test.goline 3 imports only"testing", and lines 104-111 define a manualcontainsloop. Confirmed.Empty title on unparseable line: Lines 61-64 leave
lnat 0 whenstrconv.Atoifails; lines 65-68 look upprose["file:0"]which returns""; lines 69-72 settitle = truncate("", 120)which returns""(verifiedtruncateatemit.go:370-376). Confirmed.All three findings survive verification. The verdict remains "Minor issues."
Minor issues
cmd/gadfly/findings.go:170-173—normalizeSeveritymaps"minor"→"medium"but"minor-issue"→"small". These are near-synonyms that land on different canonical severities. A model that writes"minor"gets medium; one that writes"minor-issue"gets small. This is confusing for anyone maintaining the mapping table and risks inconsistent telemetry when different models use slightly different wording. Pick one bucket (likely"small"for both, since "minor" in common usage means low severity) or at least add a comment explaining the distinction.cmd/gadfly/findings_test.go:104-111— hand-rolledcontainsreimplementsstrings.Contains. The test file avoids importing"strings"and instead writes a manual substring loop. In production code this would be a clear "use the stdlib" nit; in a test file it's minor but still adds ~8 lines of unnecessary logic that a future reader has to mentally verify. Justimport "strings"and callstrings.Contains.cmd/gadfly/findings.go:61-71— empty title whenlinefails to parse. If a model emits an unparseablelinevalue (e.g."line": "n/a"),lnstays 0, the prose lookup for"file:0"returns"", and if the model also omittedtitle, the finding gets an empty title (truncate("", 120)=""). An empty-title finding in telemetry is confusing. Consider either skipping findings whereln == 0after a failed parse (since the location is unusable), or falling back to a sentinel like"(unparseable line)".⚡ Performance — No material issues found
No material issues found
The new code runs exclusively on the cold path —
renderConsolidatedandemitare each called exactly once at the end of a review run (main.go:212andmain.go:216). Lens outputs are typically a few KB of markdown, and the number of lenses is small (3–5). Within that context, I spot-checked for the usual suspects and found nothing material:No N+1 / hot-loop allocations:
extractStructuredFindings,stripFindingsBlock,proseParagraphs, andfindingsBlockall operate on a single lens output string. Thefor _, r := range resultsloops inemitandrenderConsolidatediterate over a handful of results, each capped atmaxFindingsPerLens(20). No network calls, no DB queries, no channel operations inside these functions.No unbounded growth:
reportsinemitgrows to at mostlen(results) × 20entries — trivial.renderConsolidateduses astrings.Builder— efficient.Minor redundancies exist but are negligible at this scale:
proseParagraphscallsstripFindingsBlock(whichstrings.Splits the output) and thenstrings.Splits the result again.proseParagraphsis always built even when every structured finding already carries its owndetail.findingsBlockalsostrings.Splits the same output thatstripFindingsBlockalready split inrenderConsolidated. On a 2–5 KB string processed once per run, none of this registers.truncateconverts to[]rune(line 371–372 of emit.go), which allocates. Called at most 2× per finding (title + detail), max 20 findings per lens — again, cold path, immaterial.No changes warranted from a performance standpoint.
🧯 Error handling & edge cases — Minor issues
Now let me verify each finding against the actual code.
Finding 1:
stripFindingsBlock(lines 115-133) — I can see the state machine: whenskippingis true and the loop ends without hittingisFenceClose, every line from the opening fence onward is discarded.findingsBlock(line 109) correctly returnsfalsefor unterminated blocks, butstripFindingsBlockhas no equivalent guard. Confirmed.Finding 2: Line 78-79 —
severitygoes throughnormalizeSeverity(which maps synonyms), butconfidenceonly getsstrings.ToLower(strings.TrimSpace(...)). NonormalizeConfidencefunction exists. Confirmed.Finding 3: Line 61-64 — when
sf.Linecan't parse,lnstays 0 and the finding is kept. InparseFindings(emit.go:197-199), unparseable line numbers causecontinue(the finding is dropped). Confirmed — there is an inconsistency.All three findings survive. However, none of these are "blocking" — they are edge cases (unclosed fence, model ignoring the confidence vocabulary, unparseable line numbers) that degrade gracefully in practice. The verdict should be "Minor issues".
Minor issues
stripFindingsBlocksilently drops all content after an unclosedgadfly-findingsfence —findings.go:115-133. If a model emits the opening fence but forgets the closing```(e.g., truncated output, model confusion),skippingstaystrueforever and every subsequent line is discarded from the rendered comment.findingsBlock(line 109) correctly returnsfalsefor unterminated blocks, butstripFindingsBlockhas no equivalent guard — it just loses the prose. Since this function feeds the human-visible PR comment (consolidate.go:105), a single missing backtick trio means the reader never sees anything after the fence. Fix: treat EOF as an implicit fence close — when the loop ends withskipping == true, append the remaining lines from the fence onward (or at minimum the lines after the fence) tokept.No confidence normalization, unlike severity —
findings.go:78-79.normalizeSeveritycanonicalizes severity synonyms (nit→trivial,blocker→critical, etc.), but confidence is only lowercased withstrings.ToLower(strings.TrimSpace(sf.Confidence)). A model that writes"confidence": "very sure"or"confidence": "definitely"passes through as-is instead of being mapped tohigh/medium/low. The telemetry store gets uncontrolled vocabulary in the confidence field while severity is clean. Fix: add anormalizeConfidencefunction (or at minimum validate against the allowed set and default to"low"for unrecognized values).Structured findings with unparseable
lineare kept at line 0; heuristic path drops them —findings.go:61-64vsemit.go:197-199. Whensf.Linecan't parse as an integer,extractStructuredFindingskeeps the finding withln=0and emits it to telemetry. ButparseFindings(the heuristic fallback) doesif err != nil { continue }— it drops findings with bad line numbers entirely. This inconsistency means the structured path can emitline:0entries that the heuristic path would have filtered out. Fix: eithercontinue(skip) whensf.Lineis unparseable, matchingparseFindingsbehavior, or require a valid line number.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5m 19s