P5: light-tier canary — gadfly-shaped reviewer on executus core #7

2026-06-27T04:22:40Z

steve commented

2026-06-27 04:22:40 +00:00

Phase 5 — the light-tier canary. examples/reviewer is a gadfly-shaped adversarial PR reviewer built on the executus core only (no batteries, no DB, no host), proving the core is sufficient for a static-binary light host.

What it exercises (all core)

Concern	core piece
env-driven model fleet + tier overrides	`config.Env` + `model.Configure`
tier resolution + failover	`model.ParseModelForContext`
N models × M lenses swarm, per-provider caps	`fanout.Run` (`PerKey`)
structured findings per cell	`model.GenerateWith[T]`
one verdict-led report section per model	`Consolidate` (local)

Verification

Hermetic test runs the full 2×3 swarm against majordomo's fake provider and asserts the consolidated verdicts (no network).
New CI step: go list -deps ./examples/reviewer must pull in zero batteries — the light-tier invariant (gadfly's go.sum stays free of gorm/redis/discordgo/sqlite). Verified locally: 0.

Scope note

This is the canary — it proves the architecture. Migrating the live gadfly repo onto executus core is a deliberate follow-up (kept separate so the active reviewer that's reviewing these very PRs isn't destabilized mid-extraction).

Stacks on the now-merged P0–P4. Next: P6 (mort integration).

🤖 Generated with Claude Code

Phase 5 — the **light-tier canary**. `examples/reviewer` is a gadfly-shaped adversarial PR reviewer built on the executus **core only** (no batteries, no DB, no host), proving the core is sufficient for a static-binary light host. ## What it exercises (all core) | Concern | core piece | |---|---| | env-driven model fleet + tier overrides | `config.Env` + `model.Configure` | | tier resolution + failover | `model.ParseModelForContext` | | N models × M lenses swarm, per-provider caps | `fanout.Run` (`PerKey`) | | structured findings per cell | `model.GenerateWith[T]` | | one verdict-led report section per model | `Consolidate` (local) | ## Verification - **Hermetic test** runs the full 2×3 swarm against majordomo's fake provider and asserts the consolidated verdicts (no network). - **New CI step**: `go list -deps ./examples/reviewer` must pull in **zero batteries** — the light-tier invariant (gadfly's `go.sum` stays free of gorm/redis/discordgo/sqlite). Verified locally: 0. ## Scope note This is the **canary** — it proves the architecture. Migrating the *live* gadfly repo onto executus core is a deliberate follow-up (kept separate so the active reviewer that's reviewing these very PRs isn't destabilized mid-extraction). Stacks on the now-merged P0–P4. Next: P6 (mort integration). 🤖 Generated with [Claude Code](https://claude.com/claude-code)

steve added 1 commit 2026-06-27 04:22:40 +00:00

P5: light-tier canary — gadfly-shaped reviewer on executus core

executus CI / test (pull_request) Failing after 1m5s

Details

Adversarial Review (Gadfly) / review (pull_request) Successful in 8m18s

Details

ea9475da54

examples/reviewer proves the core is sufficient for a static-binary light host
(gadfly's shape) with NO batteries:
- config.Env + model.Configure  -> env-driven model fleet + tier overrides
- model.ParseModelForContext    -> tier resolution + failover
- fanout.Run (PerKey caps)      -> N models x M lenses swarm, per-provider bound
- model.GenerateWith[T]         -> structured findings per (model, lens) cell
- Consolidate                   -> one verdict-led report section per model

Hermetic test runs the full 2x3 swarm against majordomo's fake provider and
asserts the consolidated verdicts. A go list -deps CI check asserts the canary
imports ZERO batteries (the light-tier invariant) — gadfly's go.sum stays free
of gorm/redis/discordgo/sqlite. README + docs updated.

This is the canary; migrating the LIVE gadfly repo onto executus core is a
follow-up (kept separate to not destabilize the active reviewer).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gitea-actions bot commented

2026-06-27 04:22:46 +00:00

🪰 Gadfly review — `minimax-m3:cloud` (ollama-cloud)

Verdict: Minor issues — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

Only in README/comments — not in any actual code path. No secrets logged.

Now I'll finalize.

VERDICT: No material issues found

Through the security lens, the canary is clean. The untrusted-by-design surfaces (the diff, env-driven model specs) flow into places that don't create security issues:

examples/reviewer/main.go:46 — io.ReadAll(os.Stdin) and the -diff flag read attacker-controlled text. Verified the diff is placed in the user message (reviewer.go:113, llm.RoleUser), not the system prompt. The system prompt is a const baseSystemPrompt plus a fixed lens focus string (reviewer.go:90,111) — both fully controlled by the program. This is the correct placement for untrusted content against prompt injection; an attacker can't escape into system instructions.
examples/reviewer/main.go:57 — model.ParseModelForContext(ctx, spec) resolves env-configured model specs through the standard executus core path (model/llms.go:327). No shell-out, no URL construction from the spec. Verified by reading the resolver.
examples/reviewer/main.go:60 — the only error-path print is the resolved spec, not the API key; verified no ANTHROPIC_API_KEY / token / secret reference in any code path of the canary (only in README examples).
No SQL, command, template, or path injection surface — no database/sql, os/exec, text/template, or filesystem ops from untrusted input.
No SSRF — no URL is constructed from user input; outbound calls are only to provider SDKs selected by env-driven specs.
No unsafe deserialization — the only untrusted decode is the typed GenerateWith[lensReport] path (majordomo's structured output, not arbitrary JSON into interface{}).
The hermetic test (reviewer_test.go:8) uses majordomo/provider/fake — no network in tests, no live secrets.
The CI gate (.gitea/workflows/ci.yml:107-116) explicitly forbids the canary from pulling in batteries/DB drivers — a positive security/footprint invariant.

Verified by reading the three new Go files, model/call.go (the GenerateWith path), model/llms.go:327 (the resolver), and grepping the canary for any reference to credentials or shelling out.

Outside my lens: nothing to add.

🎯 Correctness — Minor issues

VERDICT: Minor issues

examples/reviewer/reviewer.go:171-176 — verdict-threshold bug. When every cell for a model errored, the inner loop only bumps errored and continues, so worst stays at -1 and errored > 0. The verdict branch falls through to "no issues found" and only the trailing (⚠ N lens(es) errored) suffix is added; the headline ## m — no issues found is silently misleading for an all-errored model. Confirmed: TestConsolidateVerdicts only exercises critical+errored, which already takes the SevHigh branch, so the all-errored regression is uncovered. Fix by special-casing errored > 0 && worst < 0 (e.g. "review failed") or suppressing the verdict when every cell errored.
examples/reviewer/main.go:81-101 — perKeyCaps collapses tier-only specs onto one key. providerOf returns "tier" for any bare tier name (line 88), and the defaults fast and thinking are bare names that both resolve to anthropic/*. They end up sharing the "tier" bucket in the PerKey map, so fanout.PerKey throttles two distinct (resolved-provider-distinct) models together — weakening the documented "per-provider cap" guarantee. Confirmed against main.go and the default tier table at lines 33-36. Minor because the README only promises per-provider caps when specs are explicit, but worth noting for tier-only fleets.
examples/reviewer/reviewer.go:40-53 — SevTrivial maps to rank 0 (the default arm), so a model whose only findings are trivial will still bump worst to 0 and take the worst >= 0 → "minor issues" branch at line 175. The reviewer correctly notes this matches the rank table and is not a defect; it is just a threshold edge (a trivial-only model reports "minor issues") — kept for completeness.
examples/reviewer/reviewer.go:92-95 (doc comment on Review) — the comment claims "fanout captures per-item errors", but the closure at lines 108-118 always returns (lr, nil) and embeds the real error in LensResult.Err. The swarm's per-item error capture (results[i].Err = err) is never exercised here; per-item errors are captured by the closure into the value, not by fanout itself. Confirmed: LensResult.Err is the sole carrier, downstream is fine, but the comment is misleading and should be reworded (e.g. "the closure embeds per-cell errors in LensResult.Err").

🧯 Error handling & edge cases — Minor issues

VERDICT: Minor issues

Findings (lens: error handling & edge cases):

examples/reviewer/main.go:46 — swallowed stdin read error. b, _ := io.ReadAll(os.Stdin) discards the error. On EIO / truncated pipe diff silently becomes empty, the next branch then prints "no diff (pass -diff or pipe one on stdin)" and exits 2, masking the real I/O error. Surface it: if err != nil { fmt.Fprintln(os.Stderr, "reviewer: read stdin:", err); os.Exit(2) }.
examples/reviewer/main.go:45-48 — TTY stdin hangs the canary. When -diff is empty and os.Stdin is a TTY (no pipe), io.ReadAll(os.Stdin) blocks forever waiting for EOF; there is no os.Stdin.Stat() check or deadline. Add a TTY/non-pipe guard before reading, e.g. if fi, _ := os.Stdin.Stat(); fi.Mode()&os.ModeCharDevice != 0 { ... } followed by the existing "no diff" exit.
examples/reviewer/reviewer.go:119-126 (paired with Consolidate at 142-145) — swarm-level cancellation is silently dropped. When a cell is skipped because of ctx.Err(), Review emits LensResult{Err: r.Err} with empty Model/Lens; Consolidate then filters every r.Model == "" result out of the report. Net effect: a mid-swarm cancel leaves no signal in the output that cells were skipped. Consolidate should surface a "swarm aborted" header (or summary line) for the dropped cells.
examples/reviewer/reviewer.go:113 — no per-cell timeout; cell ctx is context.Background(). main.go:54 passes context.Background() into Review, so a hung model call (network black hole, provider stall) holds a goroutine + a fanout slot indefinitely with no deadline. Wire a per-cell context.WithTimeout derived from the configured MaxConcurrent / -timeout, or document that callers must supply a deadline.
examples/reviewer/reviewer.go:40-53 + 165-167 — unvalidated Severity collapses to "trivial". The JSON-schema enum is advisory; a non-conforming model reply or typo'd fake-provider response yields an unknown Severity, severityRank returns 0 for the default branch, and the finding is then either ranked as the lowest severity or excluded from the "blocking" threshold check. Either reject unknown severities at the Consolidate boundary (log + drop) or bucket them as a distinct rank/verdict.
examples/reviewer/reviewer.go:157-180 — all-lenses-errored model is still labelled "no issues found". When every lens for a model errors, worst stays at -1, all stays empty, and the verdict stays "no issues found"; the (⚠ N lens(es) errored) suffix is appended but the leading verdict is misleading. When len(rs) == errored (i.e. zero successful cells for a model), downgrade to a distinct verdict such as "incomplete" so callers aren't told a model is clean when it produced no data.
examples/reviewer/reviewer_test.go:43-44 — swarm-error path is unverified. TestReviewSwarm only exercises the all-success case; the r.Err != nil cell branch from Review (lines 121-123) and the Consolidate errored-header path are only probed by TestConsolidateVerdicts with hand-built LensResult{Err: …} values, which bypasses fanout.Run's actual ctx-cancel injection. Not a blocker, but the seam this lens cares about is uncovered.

Outside my lens: nothing material.

Verified by reading: examples/reviewer/{main.go,reviewer.go,reviewer_test.go}, fanout/fanout.go, config/config.go:72-95, model/llms.go:324-334, model/call.go:101, model/tiers.go:128-134. Signatures match call sites.

_{Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 1m 57s}

### 🪰 Gadfly review — `minimax-m3:cloud` (ollama-cloud) **Verdict: Minor issues** — 3 reviewers: security, correctness, error-handling <details><summary>🔒 Security — No material issues found</summary> Only in README/comments — not in any actual code path. No secrets logged. Now I'll finalize. ## VERDICT: No material issues found Through the security lens, the canary is clean. The untrusted-by-design surfaces (the diff, env-driven model specs) flow into places that don't create security issues: - `examples/reviewer/main.go:46` — `io.ReadAll(os.Stdin)` and the `-diff` flag read attacker-controlled text. Verified the diff is placed in the **user message** (`reviewer.go:113`, `llm.RoleUser`), not the system prompt. The system prompt is a `const baseSystemPrompt` plus a fixed lens focus string (`reviewer.go:90,111`) — both fully controlled by the program. This is the correct placement for untrusted content against prompt injection; an attacker can't escape into system instructions. - `examples/reviewer/main.go:57` — `model.ParseModelForContext(ctx, spec)` resolves env-configured model specs through the standard executus core path (`model/llms.go:327`). No shell-out, no URL construction from the spec. Verified by reading the resolver. - `examples/reviewer/main.go:60` — the only error-path print is the resolved spec, not the API key; verified no `ANTHROPIC_API_KEY` / token / secret reference in any code path of the canary (only in README examples). - No SQL, command, template, or path injection surface — no `database/sql`, `os/exec`, `text/template`, or filesystem ops from untrusted input. - No SSRF — no URL is constructed from user input; outbound calls are only to provider SDKs selected by env-driven specs. - No unsafe deserialization — the only untrusted decode is the typed `GenerateWith[lensReport]` path (majordomo's structured output, not arbitrary JSON into `interface{}`). - The hermetic test (`reviewer_test.go:8`) uses `majordomo/provider/fake` — no network in tests, no live secrets. - The CI gate (`.gitea/workflows/ci.yml:107-116`) explicitly forbids the canary from pulling in batteries/DB drivers — a positive security/footprint invariant. Verified by reading the three new Go files, `model/call.go` (the `GenerateWith` path), `model/llms.go:327` (the resolver), and grepping the canary for any reference to credentials or shelling out. Outside my lens: nothing to add. </details> <details><summary>🎯 Correctness — Minor issues</summary> VERDICT: **Minor issues** - `examples/reviewer/reviewer.go:171-176` — verdict-threshold bug. When every cell for a model errored, the inner loop only bumps `errored` and `continue`s, so `worst` stays at `-1` and `errored > 0`. The verdict branch falls through to `"no issues found"` and only the trailing `(⚠ N lens(es) errored)` suffix is added; the headline `## m — no issues found` is silently misleading for an all-errored model. Confirmed: `TestConsolidateVerdicts` only exercises critical+errored, which already takes the `SevHigh` branch, so the all-errored regression is uncovered. Fix by special-casing `errored > 0 && worst < 0` (e.g. `"review failed"`) or suppressing the verdict when every cell errored. - `examples/reviewer/main.go:81-101` — `perKeyCaps` collapses tier-only specs onto one key. `providerOf` returns `"tier"` for any bare tier name (line 88), and the defaults `fast` and `thinking` are bare names that both resolve to `anthropic/*`. They end up sharing the `"tier"` bucket in the `PerKey` map, so `fanout.PerKey` throttles two distinct (resolved-provider-distinct) models together — weakening the documented "per-provider cap" guarantee. Confirmed against `main.go` and the default tier table at lines 33-36. Minor because the README only promises per-provider caps when specs are explicit, but worth noting for tier-only fleets. - `examples/reviewer/reviewer.go:40-53` — `SevTrivial` maps to rank `0` (the `default` arm), so a model whose only findings are trivial will still bump `worst` to `0` and take the `worst >= 0 → "minor issues"` branch at line 175. The reviewer correctly notes this matches the rank table and is not a defect; it is just a threshold edge (a trivial-only model reports "minor issues") — kept for completeness. - `examples/reviewer/reviewer.go:92-95` (doc comment on `Review`) — the comment claims "fanout captures per-item errors", but the closure at lines 108-118 always returns `(lr, nil)` and embeds the real error in `LensResult.Err`. The swarm's per-item error capture (`results[i].Err = err`) is never exercised here; per-item errors are captured by *the closure* into the value, not by fanout itself. Confirmed: `LensResult.Err` is the sole carrier, downstream is fine, but the comment is misleading and should be reworded (e.g. "the closure embeds per-cell errors in `LensResult.Err`"). </details> <details><summary>🧯 Error handling & edge cases — Minor issues</summary> ## VERDICT: Minor issues Findings (lens: error handling & edge cases): - **`examples/reviewer/main.go:46` — swallowed stdin read error.** `b, _ := io.ReadAll(os.Stdin)` discards the error. On `EIO` / truncated pipe `diff` silently becomes empty, the next branch then prints "no diff (pass -diff or pipe one on stdin)" and exits 2, masking the real I/O error. Surface it: `if err != nil { fmt.Fprintln(os.Stderr, "reviewer: read stdin:", err); os.Exit(2) }`. - **`examples/reviewer/main.go:45-48` — TTY stdin hangs the canary.** When `-diff` is empty and `os.Stdin` is a TTY (no pipe), `io.ReadAll(os.Stdin)` blocks forever waiting for EOF; there is no `os.Stdin.Stat()` check or deadline. Add a TTY/non-pipe guard before reading, e.g. `if fi, _ := os.Stdin.Stat(); fi.Mode()&os.ModeCharDevice != 0 { ... }` followed by the existing "no diff" exit. - **`examples/reviewer/reviewer.go:119-126` (paired with `Consolidate` at 142-145) — swarm-level cancellation is silently dropped.** When a cell is skipped because of `ctx.Err()`, `Review` emits `LensResult{Err: r.Err}` with empty `Model`/`Lens`; `Consolidate` then filters every `r.Model == ""` result out of the report. Net effect: a mid-swarm cancel leaves no signal in the output that cells were skipped. `Consolidate` should surface a "swarm aborted" header (or summary line) for the dropped cells. - **`examples/reviewer/reviewer.go:113` — no per-cell timeout; cell ctx is `context.Background()`.** `main.go:54` passes `context.Background()` into `Review`, so a hung model call (network black hole, provider stall) holds a goroutine + a fanout slot indefinitely with no deadline. Wire a per-cell `context.WithTimeout` derived from the configured `MaxConcurrent` / `-timeout`, or document that callers must supply a deadline. - **`examples/reviewer/reviewer.go:40-53` + `165-167` — unvalidated `Severity` collapses to "trivial".** The JSON-schema `enum` is advisory; a non-conforming model reply or typo'd fake-provider response yields an unknown `Severity`, `severityRank` returns 0 for the `default` branch, and the finding is then either ranked as the lowest severity or excluded from the "blocking" threshold check. Either reject unknown severities at the `Consolidate` boundary (log + drop) or bucket them as a distinct rank/verdict. - **`examples/reviewer/reviewer.go:157-180` — all-lenses-errored model is still labelled "no issues found".** When every lens for a model errors, `worst` stays at `-1`, `all` stays empty, and the verdict stays `"no issues found"`; the `(⚠ N lens(es) errored)` suffix is appended but the leading verdict is misleading. When `len(rs) == errored` (i.e. zero successful cells for a model), downgrade to a distinct verdict such as "incomplete" so callers aren't told a model is clean when it produced no data. - **`examples/reviewer/reviewer_test.go:43-44` — swarm-error path is unverified.** `TestReviewSwarm` only exercises the all-success case; the `r.Err != nil` cell branch from `Review` (lines 121-123) and the `Consolidate` errored-header path are only probed by `TestConsolidateVerdicts` with hand-built `LensResult{Err: …}` values, which bypasses `fanout.Run`'s actual ctx-cancel injection. Not a blocker, but the seam this lens cares about is uncovered. Outside my lens: nothing material. Verified by reading: `examples/reviewer/{main.go,reviewer.go,reviewer_test.go}`, `fanout/fanout.go`, `config/config.go:72-95`, `model/llms.go:324-334`, `model/call.go:101`, `model/tiers.go:128-134`. Signatures match call sites. </details> Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 1m 57s

gitea-actions bot commented

2026-06-27 04:24:44 +00:00

🪰 Gadfly review — `deepseek-v4-flash:cloud` (ollama-cloud)

Verdict: Minor issues — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

🔒 Security — Verdict: No material issues found

After tracing the full data flow from input to output, I find no material security issues introduced by this PR. The change is a CLI canary tool that:

Reads a diff from stdin or a -diff flag
Sends it to an LLM (via the majordomo provider layer) for adversarial review
Prints the consolidated findings

What I checked:

Injection paths: The diff is embedded verbatim into an LLM user message (reviewer.go:113). This is the intended design — the diff is the content under review. Prompt injection via a crafted diff is a known property of LLM-based tools, not a code defect in this PR. The system prompt instructs the model to return only JSON matching the schema, which is the standard mitigation for this class of tool.
API key handling: The ANTHROPIC_API_KEY is read from the environment by the majordomo anthropic provider (model/llms.go:147-149). This is the standard, expected pattern for CLI tools. No key is hardcoded, logged, or exposed in the output.
Input validation: The diff is read with io.ReadAll(os.Stdin) (main.go:46). There is no size limit, which could allow a very large diff to consume memory — but this is a robustness/availability concern, not a security vulnerability in a CLI tool the user runs locally.
Deserialization: The LLM response is parsed as JSON into a Go struct via model.GenerateWith[lensReport] (reviewer.go:114). This uses majordomo's controlled structured-output path, not json.Unmarshal on arbitrary user data.
No SQL, no command execution, no SSRF, no authz gaps: The tool has no database, no shell execution, and makes outbound HTTP calls only to the configured LLM provider (controlled by the user's environment variables, not by the diff content).

The security properties of this tool are inherent to its design as an LLM-based code reviewer — the PR does not introduce any new or fixable security defect.

🎯 Correctness — No material issues found

Verdict: No material issues found

After tracing the full logic paths through the actual checked-out code, I find no correctness bugs in this change.

Key paths verified:

main.go:57 — ParseModelForContext returns (context.Context, majordomo.Model, error); the context is correctly discarded with _ (it's only needed for usage tracking, which the canary doesn't use).
reviewer.go:108-118 — The fanout worker always returns (lr, nil) to fanout.Run, embedding the actual model error in LensResult.Err. The post-loop at lines 120-126 correctly distinguishes swarm-level cancellation (r.Err != nil) from per-cell values (which may carry their own Err). No error is silently lost.
reviewer.go:139-191 — Consolidate verdict logic: worst starts at -1, maps to "no issues found"; ranks 0-2 → "minor issues"; ranks 3-4 → "blocking issues found". This matches the severity constants and the test expectations in reviewer_test.go:86-101.
reviewer.go:182-184 — sort.SliceStable with descending severity rank is correct; stable sort preserves input order for equal severities.
main.go:92-100 — perKeyCaps correctly returns nil when perProvider <= 0, which fanout treats as "uncapped per-key" (beyond MaxConcurrent).
main.go:84-89 — providerOf returns "tier" for bare tier names, so all bare-tier models share the same PerKey bucket — intentional and correct for the canary's concurrency model.

The model.Configure(cfg, defaults, 0) call at main.go:33 passes ttl=0; without reading the full Configure implementation I cannot confirm whether 0 means "no caching" or "use default", but for a single-shot CLI tool this has no observable effect on correctness either way.

🧯 Error handling & edge cases — Minor issues

Both findings are confirmed against the actual code. Here is the corrected review:

🧯 Error handling & edge cases — VERDICT: Minor issues

Findings

main.go:46 — io.ReadAll error silently swallowed.
The call b, _ := io.ReadAll(os.Stdin) discards the error. If reading stdin fails (broken pipe, filesystem error, etc.), the program falls through to the empty-diff check and exits with the misleading message "reviewer: no diff" instead of surfacing the actual I/O failure.
Fix: check the error and print it to stderr before exiting, e.g.:
```
b, err := io.ReadAll(os.Stdin)
if err != nil {
    fmt.Fprintf(os.Stderr, "reviewer: reading stdin: %v\n", err)
    os.Exit(2)
}
```
reviewer.go:119-127 + reviewer.go:143-145 — Swarm-level errors (context cancellation, etc.) silently dropped from the report.
When fanout.Run returns a Result with a non-nil Err (e.g. context deadline exceeded or cancellation), the post-processing loop at lines 120-123 creates a LensResult{Err: r.Err} with empty Model and Lens fields. Consolidate then skips every result whose Model == "" (line 143). The user therefore sees no indication that the swarm was interrupted or that results are incomplete — partial output from completed cells is presented as if it were the full picture.
Fix: either (a) include a catch-all section in the report for un-attributable errors, or (b) propagate the context error into the report header. At minimum, log the error to stderr so operators know the swarm didn't finish.

_{Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 39s}

### 🪰 Gadfly review — `deepseek-v4-flash:cloud` (ollama-cloud) **Verdict: Minor issues** — 3 reviewers: security, correctness, error-handling <details><summary>🔒 Security — No material issues found</summary> ## 🔒 Security — Verdict: No material issues found After tracing the full data flow from input to output, I find no material security issues introduced by this PR. The change is a CLI canary tool that: 1. Reads a diff from stdin or a `-diff` flag 2. Sends it to an LLM (via the majordomo provider layer) for adversarial review 3. Prints the consolidated findings **What I checked:** - **Injection paths**: The diff is embedded verbatim into an LLM user message (`reviewer.go:113`). This is the intended design — the diff *is* the content under review. Prompt injection via a crafted diff is a known property of LLM-based tools, not a code defect in this PR. The system prompt instructs the model to return only JSON matching the schema, which is the standard mitigation for this class of tool. - **API key handling**: The `ANTHROPIC_API_KEY` is read from the environment by the majordomo anthropic provider (`model/llms.go:147-149`). This is the standard, expected pattern for CLI tools. No key is hardcoded, logged, or exposed in the output. - **Input validation**: The diff is read with `io.ReadAll(os.Stdin)` (`main.go:46`). There is no size limit, which could allow a very large diff to consume memory — but this is a robustness/availability concern, not a security vulnerability in a CLI tool the user runs locally. - **Deserialization**: The LLM response is parsed as JSON into a Go struct via `model.GenerateWith[lensReport]` (`reviewer.go:114`). This uses majordomo's controlled structured-output path, not `json.Unmarshal` on arbitrary user data. - **No SQL, no command execution, no SSRF, no authz gaps**: The tool has no database, no shell execution, and makes outbound HTTP calls only to the configured LLM provider (controlled by the user's environment variables, not by the diff content). The security properties of this tool are inherent to its design as an LLM-based code reviewer — the PR does not introduce any new or fixable security defect. </details> <details><summary>🎯 Correctness — No material issues found</summary> ## Verdict: No material issues found After tracing the full logic paths through the actual checked-out code, I find no correctness bugs in this change. Key paths verified: - **`main.go:57`** — `ParseModelForContext` returns `(context.Context, majordomo.Model, error)`; the context is correctly discarded with `_` (it's only needed for usage tracking, which the canary doesn't use). - **`reviewer.go:108-118`** — The fanout worker always returns `(lr, nil)` to `fanout.Run`, embedding the actual model error in `LensResult.Err`. The post-loop at lines 120-126 correctly distinguishes swarm-level cancellation (`r.Err != nil`) from per-cell values (which may carry their own `Err`). No error is silently lost. - **`reviewer.go:139-191`** — `Consolidate` verdict logic: `worst` starts at -1, maps to "no issues found"; ranks 0-2 → "minor issues"; ranks 3-4 → "blocking issues found". This matches the severity constants and the test expectations in `reviewer_test.go:86-101`. - **`reviewer.go:182-184`** — `sort.SliceStable` with descending severity rank is correct; stable sort preserves input order for equal severities. - **`main.go:92-100`** — `perKeyCaps` correctly returns `nil` when `perProvider <= 0`, which fanout treats as "uncapped per-key" (beyond `MaxConcurrent`). - **`main.go:84-89`** — `providerOf` returns `"tier"` for bare tier names, so all bare-tier models share the same PerKey bucket — intentional and correct for the canary's concurrency model. The `model.Configure(cfg, defaults, 0)` call at `main.go:33` passes `ttl=0`; without reading the full `Configure` implementation I cannot confirm whether `0` means "no caching" or "use default", but for a single-shot CLI tool this has no observable effect on correctness either way. </details> <details><summary>🧯 Error handling & edge cases — Minor issues</summary> Both findings are confirmed against the actual code. Here is the corrected review: ## 🧯 Error handling & edge cases — VERDICT: **Minor issues** ### Findings - **`main.go:46` — `io.ReadAll` error silently swallowed.** The call `b, _ := io.ReadAll(os.Stdin)` discards the error. If reading stdin fails (broken pipe, filesystem error, etc.), the program falls through to the empty-diff check and exits with the misleading message `"reviewer: no diff"` instead of surfacing the actual I/O failure. *Fix:* check the error and print it to stderr before exiting, e.g.: ```go b, err := io.ReadAll(os.Stdin) if err != nil { fmt.Fprintf(os.Stderr, "reviewer: reading stdin: %v\n", err) os.Exit(2) } ``` - **`reviewer.go:119-127` + `reviewer.go:143-145` — Swarm-level errors (context cancellation, etc.) silently dropped from the report.** When `fanout.Run` returns a `Result` with a non-nil `Err` (e.g. context deadline exceeded or cancellation), the post-processing loop at lines 120-123 creates a `LensResult{Err: r.Err}` with empty `Model` and `Lens` fields. `Consolidate` then skips every result whose `Model == ""` (line 143). The user therefore sees no indication that the swarm was interrupted or that results are incomplete — partial output from completed cells is presented as if it were the full picture. *Fix:* either (a) include a catch-all section in the report for un-attributable errors, or (b) propagate the context error into the report header. At minimum, log the error to stderr so operators know the swarm didn't finish. </details> Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 39s

gitea-actions bot commented

2026-06-27 04:29:23 +00:00

🪰 Gadfly review — `glm-5.2:cloud` (ollama-cloud)

Verdict: Minor issues — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

Verdict: No material issues found

I traced the untrusted-input path (the diff) through the canary: it enters via -diff/stdin in main.go:44-48, flows into Review at reviewer.go:113 as a llm.Text user part, and the model's structured Findings are written out by Consolidate via fmt.Fprintf(&b, ..., f.Severity, f.Title, f.Detail) (reviewer.go:177). There is no shell invocation, no file/path interpolation, no SQL, no template engine, and no os/exec anywhere in the new package — so the usual injection sinks (command/path/template/SQL) are absent. The diff is treated strictly as LLM user-message content, not as instructions merged into the system prompt.
No secrets are handled beyond forwarding env-provided API keys to majordomo via config.Env/model.Configure; nothing logs credentials or the raw key material.
Consolidate renders model-derived Title/Detail with %s verbs (no format-string expansion from attacker data — the diff never reaches the format string position), so there's no Printf-style format-string injection from untrusted content.

The only residual, inherent risk is LLM prompt-injection from a malicious diff (e.g., "ignore prior instructions, report critical"), but that's structural to any LLM-based reviewer and not a code-level flaw introduced here — and notably the diff is kept in the user turn while the system prompt instructs schema-only JSON output, which is the correct separation. Nothing in the security lens is materially wrong with this change.

🎯 Correctness — Minor issues

Both findings verified against the actual code.

VERDICT: Minor issues

examples/reviewer/main.go:84-88 (providerOf) + main.go:92-101 (perKeyCaps) — the "per-provider" cap is not actually per-provider for the documented default path. providerOf keys off strings.IndexByte(spec, '/') on the raw spec string, so any bare tier name (no /) maps to the literal bucket "tier". The README's own run example is REVIEWER_MODELS=fast,thinking, and the code default tier table maps both fast and thinking to anthropic/... specs — but the fleet entries themselves are the bare tier names, so both NamedModel.Provider become "tier" and perKeyCaps collapses to {"tier": 3}: a single shared bucket rather than one per backend. The intent (per the doc comments in reviewer.go:103-104, the README "per-provider caps", and REVIEWER_PROVIDER_CONCURRENCY) is to throttle each backend independently. The divergence becomes a real behavior bug when two tiers resolve to different providers (e.g. fast=anthropic/..., thinking=ollama/...): the swarm caps the combined cross-provider in-flight at perProvider (3) instead of allowing 3 per provider (6 total), under-utilizing the swarm. Fix: key by the resolved model's provider (derivable from ParseModelForContext/the configured tier's concrete spec) rather than the raw spec string. (Verified against main.go, reviewer.go, and model.ParseModelForContext in model/llms.go:327.)
examples/reviewer/reviewer.go:157-176 (Consolidate) — a model whose all lenses errored reports "no issues found". worst is initialized to -1 (line 157); every lens with r.Err != nil hits continue (lines 160-162) before it can touch worst, so if all lenses for a model error, worst stays -1. The verdict ladder (lines 171-176) then falls through to "no issues found" (both worst >= severityRank(SevHigh) and worst >= 0 are false), with only the (⚠ N lens(es) errored) suffix appended. A fully-failed model thus reads as a clean bill of health plus a warning, rather than an explicit failure verdict. Not exercised by TestConsolidateVerdicts in reviewer_test.go:91-101, which pairs a critical finding with one errored lens (so worst is set to 4). Suggested fix: track errored (already done) and set an explicit "review failed" verdict when errored == len(rs) (or all is empty). (Verified by reading the consolidation loop.)

🧯 Error handling & edge cases — Minor issues

VERDICT: Minor issues

examples/reviewer/main.go:46 — io.ReadAll error swallowed. b, _ := io.ReadAll(os.Stdin) discards the read error. On a real read failure (closed pipe, I/O error), b may be empty/partial and the program falls through to the "no diff" exit(2) path, masking the actual cause as a user-facing "you didn't pass a diff" message. Suggested fix: capture and report the error distinctly before the empty check.
examples/reviewer/reviewer.go:119-126 + 139-145 — ctx-cancelled cells silently vanish from the report. When fanout.Run fails to start a cell because the context is cancelled (semaphore-acquire path in fanout.go:67-90), that Result carries Err=ctx.Err() and a zero-value LensResult{} (empty Model). Review converts it to LensResult{Err: r.Err} with an empty Model (line 121-122), and Consolidate skips any result with r.Model == "" (line 142-144). Net effect: a cancelled swarm produces a report that simply omits the never-started cells with no indication the run was incomplete — the user gets a normal-looking (possibly empty) report. Suggested fix: surface a top-level "swarm incomplete" note when any result has Err != nil && Model == "".
examples/reviewer/reviewer.go:171-180 — "no issues found" header for a fully-errored model is misleading. If every lens for a model errored (worst stays -1), the verdict reads "no issues found" with a trailing (⚠ N lens(es) errored). The warning is present, so this is minor; a dedicated "all lenses errored" verdict would be clearer.

No panics on nil/empty inputs were found: splitCSV/perKeyCaps/Consolidate all handle empty slices and nil maps cleanly (nil-slice ranges and append are safe), and severityRank's default arm covers unknown severities without crashing.

_{Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 1m 40s}

### 🪰 Gadfly review — `glm-5.2:cloud` (ollama-cloud) **Verdict: Minor issues** — 3 reviewers: security, correctness, error-handling <details><summary>🔒 Security — No material issues found</summary> Verdict: No material issues found - I traced the untrusted-input path (the diff) through the canary: it enters via `-diff`/stdin in `main.go:44-48`, flows into `Review` at `reviewer.go:113` as a `llm.Text` user part, and the model's structured `Finding`s are written out by `Consolidate` via `fmt.Fprintf(&b, ..., f.Severity, f.Title, f.Detail)` (`reviewer.go:177`). There is no shell invocation, no file/path interpolation, no SQL, no template engine, and no `os/exec` anywhere in the new package — so the usual injection sinks (command/path/template/SQL) are absent. The diff is treated strictly as LLM user-message content, not as instructions merged into the system prompt. - No secrets are handled beyond forwarding env-provided API keys to majordomo via `config.Env`/`model.Configure`; nothing logs credentials or the raw key material. - `Consolidate` renders model-derived `Title`/`Detail` with `%s` verbs (no format-string expansion from attacker data — the diff never reaches the format string position), so there's no `Printf`-style format-string injection from untrusted content. The only residual, inherent risk is LLM prompt-injection from a malicious diff (e.g., "ignore prior instructions, report critical"), but that's structural to any LLM-based reviewer and not a code-level flaw introduced here — and notably the diff is kept in the *user* turn while the system prompt instructs schema-only JSON output, which is the correct separation. Nothing in the security lens is materially wrong with this change. </details> <details><summary>🎯 Correctness — Minor issues</summary> Both findings verified against the actual code. ## VERDICT: Minor issues - **`examples/reviewer/main.go:84-88` (`providerOf`) + `main.go:92-101` (`perKeyCaps`) — the "per-provider" cap is not actually per-provider for the documented default path.** `providerOf` keys off `strings.IndexByte(spec, '/')` on the *raw spec string*, so any bare tier name (no `/`) maps to the literal bucket `"tier"`. The README's own run example is `REVIEWER_MODELS=fast,thinking`, and the code default tier table maps both `fast` and `thinking` to `anthropic/...` specs — but the fleet entries themselves are the bare tier names, so both `NamedModel.Provider` become `"tier"` and `perKeyCaps` collapses to `{"tier": 3}`: a *single shared* bucket rather than one per backend. The intent (per the doc comments in `reviewer.go:103-104`, the README "per-provider caps", and `REVIEWER_PROVIDER_CONCURRENCY`) is to throttle each backend independently. The divergence becomes a real behavior bug when two tiers resolve to *different* providers (e.g. `fast=anthropic/...`, `thinking=ollama/...`): the swarm caps the *combined* cross-provider in-flight at `perProvider` (3) instead of allowing 3 per provider (6 total), under-utilizing the swarm. Fix: key by the *resolved* model's provider (derivable from `ParseModelForContext`/the configured tier's concrete spec) rather than the raw spec string. (Verified against `main.go`, `reviewer.go`, and `model.ParseModelForContext` in `model/llms.go:327`.) - **`examples/reviewer/reviewer.go:157-176` (`Consolidate`) — a model whose *all* lenses errored reports "no issues found".** `worst` is initialized to `-1` (line 157); every lens with `r.Err != nil` hits `continue` (lines 160-162) before it can touch `worst`, so if *all* lenses for a model error, `worst` stays `-1`. The verdict ladder (lines 171-176) then falls through to `"no issues found"` (both `worst >= severityRank(SevHigh)` and `worst >= 0` are false), with only the `(⚠ N lens(es) errored)` suffix appended. A fully-failed model thus reads as a clean bill of health plus a warning, rather than an explicit failure verdict. Not exercised by `TestConsolidateVerdicts` in `reviewer_test.go:91-101`, which pairs a critical finding with one errored lens (so `worst` is set to 4). Suggested fix: track `errored` (already done) and set an explicit "review failed" verdict when `errored == len(rs)` (or `all` is empty). (Verified by reading the consolidation loop.) </details> <details><summary>🧯 Error handling & edge cases — Minor issues</summary> **VERDICT: Minor issues** - **`examples/reviewer/main.go:46` — `io.ReadAll` error swallowed.** `b, _ := io.ReadAll(os.Stdin)` discards the read error. On a real read failure (closed pipe, I/O error), `b` may be empty/partial and the program falls through to the "no diff" `exit(2)` path, masking the actual cause as a user-facing "you didn't pass a diff" message. Suggested fix: capture and report the error distinctly before the empty check. - **`examples/reviewer/reviewer.go:119-126` + `139-145` — ctx-cancelled cells silently vanish from the report.** When `fanout.Run` fails to start a cell because the context is cancelled (semaphore-acquire path in `fanout.go:67-90`), that `Result` carries `Err=ctx.Err()` and a zero-value `LensResult{}` (empty `Model`). `Review` converts it to `LensResult{Err: r.Err}` with an empty `Model` (line 121-122), and `Consolidate` skips any result with `r.Model == ""` (line 142-144). Net effect: a cancelled swarm produces a report that simply omits the never-started cells with no indication the run was incomplete — the user gets a normal-looking (possibly empty) report. Suggested fix: surface a top-level "swarm incomplete" note when any result has `Err != nil && Model == ""`. - **`examples/reviewer/reviewer.go:171-180` — "no issues found" header for a fully-errored model is misleading.** If every lens for a model errored (`worst` stays `-1`), the verdict reads `"no issues found"` with a trailing `(⚠ N lens(es) errored)`. The warning is present, so this is minor; a dedicated `"all lenses errored"` verdict would be clearer. No panics on nil/empty inputs were found: `splitCSV`/`perKeyCaps`/`Consolidate` all handle empty slices and nil maps cleanly (nil-slice ranges and `append` are safe), and `severityRank`'s `default` arm covers unknown severities without crashing. </details> Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 1m 40s

steve added 1 commit 2026-06-27 04:34:03 +00:00

fix: address verified gadfly P5 findings (canary robustness)

executus CI / test (pull_request) Failing after 3s

Details

executus CI / test (push) Failing after 1m9s

Details

a87e7d2c72

All 3 cloud models converged (all "minor" — example code, no blocking):
- Consolidate: a model whose every lens errored now reads "review incomplete",
  not a misleading "no issues found" (all 3 models). + test.
- Consolidate: swarm-cancelled (unattributed) cells now surface a "swarm
  cancelled — N cell(s) did not run" banner instead of vanishing (all 3). + test.
- main: io.ReadAll(os.Stdin) error is surfaced (all 3); a TTY stdin no longer
  hangs forever (TTY guard, minimax).
- providerOf: a bare tier name now keys its own PerKey bucket instead of all
  bare tiers collapsing onto "tier" (minimax, glm-5.2) — distinct tiers throttle
  independently.
- Review doc reworded (the closure, not fanout, carries per-cell errors).

Left as documented example-scope behavior: no per-cell timeout (caller supplies
ctx), unknown-severity → lowest rank (no crash).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

steve closed this pull request

2026-06-27 04:34:09 +00:00

steve deleted branch phase-5-canary

2026-06-27 04:34:09 +00:00

gitea-actions bot referenced this pull request

2026-06-28 17:47:55 +00:00

feat(run): InputFileStager seam — stage non-image attachments into the prompt #18

executus CI / test (pull_request) Failing after 3s

Details

executus CI / test (push) Failing after 1m9s

Details

Pull request closed

This pull request cannot be reopened because the branch was deleted.

Sign in to join this conversation.

2 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: steve/executus#7

P5: light-tier canary — gadfly-shaped reviewer on executus core #7

What it exercises (all core)

Verification

Scope note

🪰 Gadfly review — minimax-m3:cloud (ollama-cloud)

VERDICT: No material issues found

VERDICT: Minor issues

🪰 Gadfly review — deepseek-v4-flash:cloud (ollama-cloud)

🔒 Security — Verdict: No material issues found

Verdict: No material issues found

🧯 Error handling & edge cases — VERDICT: Minor issues

Findings

🪰 Gadfly review — glm-5.2:cloud (ollama-cloud)

VERDICT: Minor issues

Pull request closed

🪰 Gadfly review — `minimax-m3:cloud` (ollama-cloud)

🪰 Gadfly review — `deepseek-v4-flash:cloud` (ollama-cloud)

🪰 Gadfly review — `glm-5.2:cloud` (ollama-cloud)