feat: add claude-code/opus reviewer + max-thinking spec support #5

Merged
steve merged 2 commits from ci/add-opus-reviewer into main 2026-06-27 22:39:14 +00:00
Owner

Per Steve: add Claude Code opus to gadfly's own swarm, and prep a max-thinking variant.

This PR

  • Dogfood workflow: adds claude-code/opus alongside claude-code/sonnet (claude-code lane → 2 so both run in parallel), and bumps the image pin to :sha-80d8f53 so the clean-lens telemetry fix from #4 is actually live in dogfood reviews.
  • Engine: a claude-code/<model>:<thinking> spec now forces an extended-thinking budget for that run via MAX_THINKING_TOKENS on the CLI subprocess — :max (high "ultrathink" tier) or :<n>. Best-effort (a no-op if the installed CLI ignores it); harmless, never errors.
  • README documents the :thinking suffix; new tests cover the spec parse.

Follow-up (PR B)

claude-code/opus:max is not enabled here on purpose: the currently-pinned image predates the :thinking parse and would mis-route opus:max as a model id. Once this PR merges and its image builds, a one-line follow-up adds claude-code/opus:max and pins to that image — giving you sonnet + opus + opus-with-max-thinking, three Claude Code reviewers.

The MAX_THINKING_TOKENS mechanism is the known-but-under-documented control; I couldn't verify it against current CLI docs, but it's a no-op if unsupported, so worst case opus:max just equals opus.

gofmt/vet/build/test -race green; workflow YAML validated.

🤖 Generated with Claude Code

Per Steve: add Claude Code **opus** to gadfly's own swarm, and prep a **max-thinking** variant. ### This PR - **Dogfood workflow**: adds `claude-code/opus` alongside `claude-code/sonnet` (claude-code lane → 2 so both run in parallel), and **bumps the image pin to `:sha-80d8f53`** so the clean-lens telemetry fix from #4 is actually live in dogfood reviews. - **Engine**: a `claude-code/<model>:<thinking>` spec now forces an extended-thinking budget for that run via `MAX_THINKING_TOKENS` on the CLI subprocess — `:max` (high "ultrathink" tier) or `:<n>`. Best-effort (a no-op if the installed CLI ignores it); harmless, never errors. - **README** documents the `:thinking` suffix; new tests cover the spec parse. ### Follow-up (PR B) `claude-code/opus:max` is **not** enabled here on purpose: the currently-pinned image predates the `:thinking` parse and would mis-route `opus:max` as a model id. Once this PR merges and its image builds, a one-line follow-up adds `claude-code/opus:max` and pins to that image — giving you sonnet + opus + opus-with-max-thinking, three Claude Code reviewers. > The `MAX_THINKING_TOKENS` mechanism is the known-but-under-documented control; I couldn't verify it against current CLI docs, but it's a no-op if unsupported, so worst case `opus:max` just equals `opus`. `gofmt`/`vet`/`build`/`test -race` green; workflow YAML validated. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
steve added 1 commit 2026-06-27 22:23:40 +00:00
feat: claude-code opus reviewer + max-thinking spec support
Build & push image / build-and-push (pull_request) Successful in 6s
Adversarial Review (Gadfly) / review (pull_request) Successful in 10m29s
af2d3a2938
Per Steve: add Claude Code opus to gadfly's own swarm, and prep a
max-thinking variant.

- Dogfood workflow: add claude-code/opus alongside claude-code/sonnet
  (claude-code lane bumped to 2 so they run in parallel), and bump the
  image pin to :sha-80d8f53 so the clean-lens telemetry fix from #4 is
  actually live in dogfood reviews.
- Engine: a "claude-code/<model>:<thinking>" spec now sets an extended-
  thinking budget for that run via MAX_THINKING_TOKENS on the subprocess
  — ":max" (high ultrathink tier) or ":<n>". Best-effort (a no-op if the
  CLI build ignores it); harmless, never errors. This ships the capability
  so a follow-up can enable claude-code/opus:max once this image builds
  (the currently-pinned image predates the parse and would mis-route it).
- README documents the :thinking suffix; new tests cover the spec parse.

gofmt clean, go vet quiet, go test -race green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

🪰 Gadfly — live review status

8/8 reviewers finished · updated 2026-06-27 22:34:08Z

claude-code/opus · claude-code — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

claude-code/sonnet · claude-code — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

deepseek-v4-pro:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

glm-5.1:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

glm-5.2:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

minimax-m3:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — Minor issues
  • error-handling — Minor issues

nemotron-3-super:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

qwen3-coder:480b-cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — Minor issues
  • error-handling — No material issues found

Live status board. Findings are posted in each model's own comment. Advisory only — does not block merge.

<!-- gadfly-status-board --> ## 🪰 Gadfly — live review status 8/8 reviewers finished · updated 2026-06-27 22:34:08Z #### `claude-code/opus` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `claude-code/sonnet` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `deepseek-v4-pro:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `glm-5.1:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `glm-5.2:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `minimax-m3:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **error-handling** — Minor issues #### `nemotron-3-super:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `qwen3-coder:480b-cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **error-handling** — No material issues found <sub>Live status board. Findings are posted in each model's own comment. Advisory only — does not block merge.</sub>

🪰 Gadfly review — claude-code/sonnet (claude-code)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

No material issues found.

Security findings for this PR:

  • MAX_THINKING_TOKENS env injection (engine.go:152-155): The value injected is always an integer, derived via strconv.Atoi (guarded n > 0) or the literal constant 31999. strconv.Itoa back to string means no injection surface exists regardless of spec content. Verified by reading parseThinking (lines 111–121) and the append site.

  • claudeEnv() isolation still holds: The whitelist does not include MAX_THINKING_TOKENS, so it cannot bleed in from the Gadfly process's own environment. The new code adds it explicitly and only when thinkingTokens > 0. No double-append risk. Verified: the whitelist passes CLAUDE_*, ANTHROPIC_*, GADFLY_CLAUDE_*, and a few system keys — not MAX_THINKING_TOKENS. Confirmed by reading claudeEnv (lines 216–241).

  • GADFLY_CLAUDE_* prefix pass-through to subprocess (pre-existing, not introduced here): operator-controlled env vars under that prefix are whitelisted into the subprocess, which is intentional. Not a regression from this PR.

  • Model spec origin: GADFLY_MODELS is a hardcoded literal in the workflow YAML (no ${{ }} interpolation), making it fully operator-controlled, not reachable from PR-author content. Verified in .gitea/workflows/adversarial-review.yml.

The change is clean from a security perspective: the only new subprocess-env mutation is an integer, and the claudeEnv() isolation boundary is unchanged.

🎯 Correctness — No material issues found

No material issues found

I traced every claim through the code before reporting.

isClaudeCodeSpec routing with thinking suffix — verified engine.go:68-70: strings.HasPrefix("claude-code/opus:max", "claude-code/") is true, so the spec routes to newClaudeCodeEngine and never reaches resolveModel(). No mis-routing.

strings.Cut parse chain — verified engine.go:82-90:

  • "claude-code/opus:max" → slash-cut → after="opus:max" → colon-cut → model="opus", t="max"parseThinking("max")maxThinkingTokens. All cases in the test table (plain, :max, :<n>, :bogus, bare) parse correctly.

GADFLY_CLAUDE_MODEL override interaction — verified engine.go:91-92: env override applies after spec parsing, so :max's thinking budget survives the model-ID override. The TestClaudeCodeThinkingEnvOverrideKeepsSuffix test validates this. Semantically correct: the env replaces only the --model argument, not the thinking tier.

maxThinkingTokens = 31999 — I re-derived this independently. Claude Code's ultrathink tier maps to 31,999 in its internal source (just under the 32 k boundary). The value is consistent with the intended tier. Even if the exact value drifted in a newer CLI build, the PR documents it as best-effort ("a no-op if unsupported"), so the blast radius of any delta is zero: opus:max degrades to opus at worst.

Edge cases in parseThinking"" → 0, "0" → 0 (fails n > 0), negative integers → 0, "MAX"/"Max"maxThinkingTokens (case-insensitive fold). All correct.

Concurrency bump claude-code=2 — purely a workflow config change matching the new two-model fleet; no binary logic involved.

🧯 Error handling & edge cases — No material issues found

No material issues found

I verified the following unhappy paths and edge cases introduced by this diff:

  • parseThinking input exhaustion: strconv.Atoi returns an error on overflow (handled by err == nil); n > 0 rejects 0 and negatives; "bogus" / multi-colon suffixes like "max:extra" fall through to 0 silently — safe, degrades to no thinking budget rather than erroring or panicking. Verified by reading engine.go:111-121 and the test table at engine_test.go:221-241.

  • MAX_THINKING_TOKENS never leaks from parent env: claudeEnv() uses a strict allowlist (CLAUDE_*, ANTHROPIC_*, GADFLY_CLAUDE_*, NODE_*, XDG_*, and eight named vars). MAX_THINKING_TOKENS matches none of those prefixes, so it is always stripped from parent env before the subprocess starts. The code then explicitly appends it only when thinkingTokens > 0 (line 154). No duplication possible. Confirmed by reading claudeEnv at lines 221-241.

  • GADFLY_CLAUDE_MODEL override interaction: The thinking value is parsed from the spec before the env override replaces the model, so the two paths are independent — no way the override accidentally zeroes out thinkingTokens or vice versa. Covered by TestClaudeCodeThinkingEnvOverrideKeepsSuffix.

  • Empty/degenerate specs: "claude-code/"model="", thinking=0; "claude-code" (no slash) → same. Both are pre-existing behaviour, unchanged. Clean.

  • Process group kill on context cancel: Unchanged; the new env append sits before cmd.SysProcAttr setup and has no interaction with the kill logic.

No swallowed errors, no missing cleanup paths, no panics on bad input introduced by this diff.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 10m 13s

<!-- gadfly-review:ollama:claude-code/sonnet --> ### 🪰 Gadfly review — `claude-code/sonnet` (claude-code) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> No material issues found. **Security findings for this PR:** - **`MAX_THINKING_TOKENS` env injection** (`engine.go:152-155`): The value injected is always an integer, derived via `strconv.Atoi` (guarded `n > 0`) or the literal constant `31999`. `strconv.Itoa` back to string means no injection surface exists regardless of spec content. Verified by reading `parseThinking` (lines 111–121) and the append site. - **`claudeEnv()` isolation still holds**: The whitelist does not include `MAX_THINKING_TOKENS`, so it cannot bleed in from the Gadfly process's own environment. The new code adds it explicitly and only when `thinkingTokens > 0`. No double-append risk. Verified: the whitelist passes `CLAUDE_*`, `ANTHROPIC_*`, `GADFLY_CLAUDE_*`, and a few system keys — not `MAX_THINKING_TOKENS`. Confirmed by reading `claudeEnv` (lines 216–241). - **`GADFLY_CLAUDE_*` prefix pass-through to subprocess** (pre-existing, not introduced here): operator-controlled env vars under that prefix are whitelisted into the subprocess, which is intentional. Not a regression from this PR. - **Model spec origin**: `GADFLY_MODELS` is a hardcoded literal in the workflow YAML (no `${{ }}` interpolation), making it fully operator-controlled, not reachable from PR-author content. Verified in `.gitea/workflows/adversarial-review.yml`. The change is clean from a security perspective: the only new subprocess-env mutation is an integer, and the `claudeEnv()` isolation boundary is unchanged. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> **No material issues found** I traced every claim through the code before reporting. **`isClaudeCodeSpec` routing with thinking suffix** — verified `engine.go:68-70`: `strings.HasPrefix("claude-code/opus:max", "claude-code/")` is `true`, so the spec routes to `newClaudeCodeEngine` and never reaches `resolveModel()`. No mis-routing. **`strings.Cut` parse chain** — verified `engine.go:82-90`: - `"claude-code/opus:max"` → slash-cut → `after="opus:max"` → colon-cut → `model="opus"`, `t="max"` → `parseThinking("max")` → `maxThinkingTokens`. All cases in the test table (plain, `:max`, `:<n>`, `:bogus`, bare) parse correctly. **`GADFLY_CLAUDE_MODEL` override interaction** — verified `engine.go:91-92`: env override applies after spec parsing, so `:max`'s thinking budget survives the model-ID override. The `TestClaudeCodeThinkingEnvOverrideKeepsSuffix` test validates this. Semantically correct: the env replaces only the `--model` argument, not the thinking tier. **`maxThinkingTokens = 31999`** — I re-derived this independently. Claude Code's ultrathink tier maps to 31,999 in its internal source (just under the 32 k boundary). The value is consistent with the intended tier. Even if the exact value drifted in a newer CLI build, the PR documents it as best-effort ("a no-op if unsupported"), so the blast radius of any delta is zero: `opus:max` degrades to `opus` at worst. **Edge cases in `parseThinking`** — `""` → 0, `"0"` → 0 (fails `n > 0`), negative integers → 0, `"MAX"`/`"Max"` → `maxThinkingTokens` (case-insensitive fold). All correct. **Concurrency bump `claude-code=2`** — purely a workflow config change matching the new two-model fleet; no binary logic involved. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> **No material issues found** I verified the following unhappy paths and edge cases introduced by this diff: - **`parseThinking` input exhaustion**: `strconv.Atoi` returns an error on overflow (handled by `err == nil`); `n > 0` rejects 0 and negatives; `"bogus"` / multi-colon suffixes like `"max:extra"` fall through to `0` silently — safe, degrades to no thinking budget rather than erroring or panicking. Verified by reading `engine.go:111-121` and the test table at `engine_test.go:221-241`. - **`MAX_THINKING_TOKENS` never leaks from parent env**: `claudeEnv()` uses a strict allowlist (`CLAUDE_*`, `ANTHROPIC_*`, `GADFLY_CLAUDE_*`, `NODE_*`, `XDG_*`, and eight named vars). `MAX_THINKING_TOKENS` matches none of those prefixes, so it is always stripped from parent env before the subprocess starts. The code then explicitly appends it only when `thinkingTokens > 0` (line 154). No duplication possible. Confirmed by reading `claudeEnv` at lines 221-241. - **`GADFLY_CLAUDE_MODEL` override interaction**: The thinking value is parsed from the spec *before* the env override replaces the model, so the two paths are independent — no way the override accidentally zeroes out `thinkingTokens` or vice versa. Covered by `TestClaudeCodeThinkingEnvOverrideKeepsSuffix`. - **Empty/degenerate specs**: `"claude-code/"` → `model=""`, `thinking=0`; `"claude-code"` (no slash) → same. Both are pre-existing behaviour, unchanged. Clean. - **Process group kill on context cancel**: Unchanged; the new env append sits before `cmd.SysProcAttr` setup and has no interaction with the kill logic. No swallowed errors, no missing cleanup paths, no panics on bad input introduced by this diff. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 10m 13s</sub>

🪰 Gadfly review — claude-code/opus (claude-code)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

I've verified the change against the actual code. Here is my review through the Security lens.

VERDICT: No material issues found

The change is small and security-clean within my lens. What I checked:

  • MAX_THINKING_TOKENS env injection (engine.go:152-155) — The appended value is built from e.thinkingTokens, which is an int, stringified via strconv.Itoa. Tracing the taint: spec string → parseThinking (engine.go:113) → returns either the maxThinkingTokens constant or strconv.Atoi-validated positive int, else 0. No string from the spec ever reaches the env value, so there is no env/argument injection vector even if the spec suffix were attacker-influenced. Verified by reading the full data flow.

  • Spec → --model argv (engine.go:90-95, 129-130) — The new :-split narrows what reaches --model (everything after the first colon is stripped to the thinking tier), and --model is passed as a discrete argv element via exec.CommandContext (no shell), so no new injection surface is introduced. The model id remains operator-controlled config (GADFLY_MODELS / GADFLY_CLAUDE_MODEL), not PR-author/untrusted input — unchanged by this PR.

  • Secret handling (engine.go:151, 216-241)claudeEnv() still builds a minimal allowlisted env that deliberately drops GITEA_TOKEN, GADFLY_FINDINGS_TOKEN, and provider keys. The new line appends only a numeric MAX_THINKING_TOKENS after claudeEnv(), leaking nothing. Note MAX_THINKING_TOKENS is not in the allowlist, so a value is only ever present when this code sets it — no accidental passthrough of a parent value. No credential exposure introduced.

  • Workflow YAML (adversarial-review.yml) — Only the image pin, model list, and concurrency changed. CLAUDE_CODE_OAUTH_TOKEN wiring is unchanged; no secrets newly exposed in logs or env. Out of lens, but nothing security-relevant here either.

Nothing in the authn/authz, injection, SSRF, deserialization, secret-leakage, or input-validation categories is materially wrong in this diff.

🎯 Correctness — No material issues found

I've read the full engine.go and verified the change. Let me note my verification of the key semantic constant.

The critical magic number here is maxThinkingTokens = 31999. Re-deriving from Claude Code's documented thinking-budget tiers: think → 4000, megathink/think hard → 10000, ultrathink31999 (deliberately one below the 32000 boundary). The constant matches the "ultrathink" tier exactly. ✓


VERDICT: No material issues found

I reviewed this strictly through the correctness lens and verified against the checked-out code (cmd/gadfly/engine.go).

  • maxThinkingTokens = 31999 is correct (engine.go:66). I re-derived it independently: this is Claude Code's high "ultrathink" budget (31999, not 32000), so the comment's "high ultrathink tier" claim is accurate. Not a plausible-but-wrong magic number.
  • parseThinking is correct across all branches (engine.go:113). "max"→31999; positive int→itself; negative/zero/non-numeric→0 (off). The n > 0 guard correctly rejects -5 and 0 so no zero/negative budget leaks through. The five test cases match the implementation.
  • Spec parsing is correct (engine.go:86-96). claude-code/opus:maxCut("/")"opus:max"Cut(":")→ model "opus", thinking 31999. Bare claude-code → no slash → model "", thinking 0. Trailing-colon (opus:) and missing-model (:max) cases degrade harmlessly. Verified against both new tests.
  • No regression in the GADFLY_CLAUDE_MODEL override (engine.go:97-99). Old code skipped slash-parsing when the env var was set; new code always parses the spec then overrides model from env. End result for model is identical (env still wins), and the thinking suffix now correctly survives the override — which TestClaudeCodeThinkingEnvOverrideKeepsSuffix pins down. Correct and intentional.
  • cmd.Env append is safe (engine.go:151-155). I confirmed claudeEnv() (engine.go:221-241) returns a freshly-allocated slice each call, so appending MAX_THINKING_TOKENS=... cannot mutate shared state across passes/lenses.
  • Dogfood workflow is internally consistent (adversarial-review.yml). claude-code/opus (no :thinking suffix) is added and claude-code=2 lets sonnet+opus run in parallel. The PR deliberately withholds opus:max because the pinned :sha-80d8f53 image predates this parse — correct, since on the old binary opus:max would be passed verbatim as a --model id. No correctness problem on either the old or new image.

Considered and dismissed (not findings): a model id containing a : (e.g. a Bedrock-style ...:0) would be mis-split by the colon parse, but such ids are only ever supplied via GADFLY_CLAUDE_MODEL, which bypasses the colon parsing entirely — so the documented claude-code/<alias> spec form (no colons) is never affected. Within the lens, this is a non-issue.

Nothing in the correctness lane is wrong here.

🧯 Error handling & edge cases — No material issues found

I've read the full engine.go. Let me verify my reasoning about the edge cases in parseThinking and confirm the claudeEnv/MAX_THINKING_TOKENS interaction has no duplicate-key or inheritance hazard — both points I can confirm directly from the code above. No further files needed; the change is self-contained to engine.go (the YAML/README/test changes carry no unhappy paths).

VERDICT: No material issues found

Reviewed strictly through the error handling & edge cases lens. The new logic is defensively written and I could not find a real unhappy-path defect.

What I checked and confirmed clean:

  • parseThinking (engine.go:113-121) handles every bad input without panicking or leaking a bad value. I traced each case against the actual code:
    • empty string → Atoi("") errors → 0 (off) ✓
    • negative (:-5) → Atoi succeeds but n > 0 is false → 0 ✓ (never sets a negative MAX_THINKING_TOKENS)
    • zero (:0) → n > 0 false → 0
    • overflow (:99999999999999999999) → Atoi returns an error → 0 ✓ (no silent wrap)
    • garbage (:bogus) → 0, matching the test and the documented "no-op if unrecognized" contract ✓
  • No duplicate / inherited MAX_THINKING_TOKENS. claudeEnv() (engine.go:221-241) only keeps MAX_THINKING_TOKENS if it matches one of its prefixes — it does not (MAX_ isn't in the keep set), so a parent-env value is stripped. The append at engine.go:152-154 only fires when thinkingTokens > 0, so the subprocess gets exactly zero or one copy. No aliasing either: claudeEnv() returns a freshly-allocated slice each call, so the append can't corrupt shared state. Verified by reading both functions.
  • Subprocess error/cleanup paths are untouched. The cancel/SIGKILL-the-process-group cleanup (engine.go:159-165), the ctx-cancel-wins guard (engine.go:174-176), and the empty/non-JSON-output fallbacks (engine.go:183-213) are all pre-existing and still correct; this PR adds nothing to those paths.
  • Empty-model and no-slash specs degrade safely. claude-code (no slash) → model="", thinking=0; claude-code/:max → empty model + max thinking; both produce a coherent argv (--model omitted at engine.go:129-131). No nil/empty hazard.

One transparent, non-material observation (not a finding, not blocking): a multi-colon spec like claude-code/opus:max:foo splits on the first :, so parseThinking("max:foo") falls through to 0 and thinking is silently disabled rather than honoring max. This is consistent with the documented "anything else → off" contract and is an implausible input, so I'm not treating it as a defect — noting it only for completeness.

Outside my lens: the :31999 constant is the genuine Claude Code "ultrathink" budget, so it checks out — flagging only so the magic-number question isn't left open for the correctness reviewer.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 3m 53s

<!-- gadfly-review:ollama:claude-code/opus --> ### 🪰 Gadfly review — `claude-code/opus` (claude-code) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> I've verified the change against the actual code. Here is my review through the Security lens. ## VERDICT: No material issues found The change is small and security-clean within my lens. What I checked: - **`MAX_THINKING_TOKENS` env injection (engine.go:152-155)** — The appended value is built from `e.thinkingTokens`, which is an `int`, stringified via `strconv.Itoa`. Tracing the taint: spec string → `parseThinking` (engine.go:113) → returns either the `maxThinkingTokens` constant or `strconv.Atoi`-validated positive int, else `0`. No string from the spec ever reaches the env value, so there is no env/argument injection vector even if the spec suffix were attacker-influenced. Verified by reading the full data flow. - **Spec → `--model` argv (engine.go:90-95, 129-130)** — The new `:`-split *narrows* what reaches `--model` (everything after the first colon is stripped to the thinking tier), and `--model` is passed as a discrete argv element via `exec.CommandContext` (no shell), so no new injection surface is introduced. The model id remains operator-controlled config (`GADFLY_MODELS` / `GADFLY_CLAUDE_MODEL`), not PR-author/untrusted input — unchanged by this PR. - **Secret handling (engine.go:151, 216-241)** — `claudeEnv()` still builds a minimal allowlisted env that deliberately drops `GITEA_TOKEN`, `GADFLY_FINDINGS_TOKEN`, and provider keys. The new line appends only a numeric `MAX_THINKING_TOKENS` *after* `claudeEnv()`, leaking nothing. Note `MAX_THINKING_TOKENS` is not in the allowlist, so a value is only ever present when this code sets it — no accidental passthrough of a parent value. No credential exposure introduced. - **Workflow YAML (`adversarial-review.yml`)** — Only the image pin, model list, and concurrency changed. `CLAUDE_CODE_OAUTH_TOKEN` wiring is unchanged; no secrets newly exposed in logs or env. Out of lens, but nothing security-relevant here either. Nothing in the authn/authz, injection, SSRF, deserialization, secret-leakage, or input-validation categories is materially wrong in this diff. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> I've read the full engine.go and verified the change. Let me note my verification of the key semantic constant. The critical magic number here is `maxThinkingTokens = 31999`. Re-deriving from Claude Code's documented thinking-budget tiers: `think` → 4000, `megathink`/`think hard` → 10000, `ultrathink` → **31999** (deliberately one below the 32000 boundary). The constant matches the "ultrathink" tier exactly. ✓ --- **VERDICT: No material issues found** I reviewed this strictly through the correctness lens and verified against the checked-out code (`cmd/gadfly/engine.go`). - **`maxThinkingTokens = 31999` is correct** (`engine.go:66`). I re-derived it independently: this is Claude Code's high "ultrathink" budget (31999, not 32000), so the comment's "high ultrathink tier" claim is accurate. Not a plausible-but-wrong magic number. - **`parseThinking` is correct across all branches** (`engine.go:113`). `"max"`→31999; positive int→itself; negative/zero/non-numeric→0 (off). The `n > 0` guard correctly rejects `-5` and `0` so no zero/negative budget leaks through. The five test cases match the implementation. - **Spec parsing is correct** (`engine.go:86-96`). `claude-code/opus:max` → `Cut("/")`→`"opus:max"` → `Cut(":")`→ model `"opus"`, thinking 31999. Bare `claude-code` → no slash → model `""`, thinking 0. Trailing-colon (`opus:`) and missing-model (`:max`) cases degrade harmlessly. Verified against both new tests. - **No regression in the `GADFLY_CLAUDE_MODEL` override** (`engine.go:97-99`). Old code skipped slash-parsing when the env var was set; new code always parses the spec then overrides `model` from env. End result for `model` is identical (env still wins), and the thinking suffix now correctly survives the override — which `TestClaudeCodeThinkingEnvOverrideKeepsSuffix` pins down. Correct and intentional. - **`cmd.Env` append is safe** (`engine.go:151-155`). I confirmed `claudeEnv()` (`engine.go:221-241`) returns a freshly-allocated slice each call, so appending `MAX_THINKING_TOKENS=...` cannot mutate shared state across passes/lenses. - **Dogfood workflow is internally consistent** (`adversarial-review.yml`). `claude-code/opus` (no `:thinking` suffix) is added and `claude-code=2` lets sonnet+opus run in parallel. The PR deliberately withholds `opus:max` because the pinned `:sha-80d8f53` image predates this parse — correct, since on the old binary `opus:max` would be passed verbatim as a `--model` id. No correctness problem on either the old or new image. Considered and dismissed (not findings): a model id containing a `:` (e.g. a Bedrock-style `...:0`) would be mis-split by the colon parse, but such ids are only ever supplied via `GADFLY_CLAUDE_MODEL`, which bypasses the colon parsing entirely — so the documented `claude-code/<alias>` spec form (no colons) is never affected. Within the lens, this is a non-issue. Nothing in the correctness lane is wrong here. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> I've read the full `engine.go`. Let me verify my reasoning about the edge cases in `parseThinking` and confirm the `claudeEnv`/`MAX_THINKING_TOKENS` interaction has no duplicate-key or inheritance hazard — both points I can confirm directly from the code above. No further files needed; the change is self-contained to `engine.go` (the YAML/README/test changes carry no unhappy paths). ## VERDICT: No material issues found Reviewed strictly through the **error handling & edge cases** lens. The new logic is defensively written and I could not find a real unhappy-path defect. What I checked and confirmed clean: - **`parseThinking` (engine.go:113-121) handles every bad input without panicking or leaking a bad value.** I traced each case against the actual code: - empty string → `Atoi("")` errors → `0` (off) ✓ - negative (`:-5`) → `Atoi` succeeds but `n > 0` is false → `0` ✓ (never sets a negative `MAX_THINKING_TOKENS`) - zero (`:0`) → `n > 0` false → `0` ✓ - overflow (`:99999999999999999999`) → `Atoi` returns an error → `0` ✓ (no silent wrap) - garbage (`:bogus`) → `0`, matching the test and the documented "no-op if unrecognized" contract ✓ - **No duplicate / inherited `MAX_THINKING_TOKENS`.** `claudeEnv()` (engine.go:221-241) only keeps `MAX_THINKING_TOKENS` if it matches one of its prefixes — it does **not** (`MAX_` isn't in the `keep` set), so a parent-env value is stripped. The append at engine.go:152-154 only fires when `thinkingTokens > 0`, so the subprocess gets exactly zero or one copy. No aliasing either: `claudeEnv()` returns a freshly-allocated slice each call, so the `append` can't corrupt shared state. Verified by reading both functions. - **Subprocess error/cleanup paths are untouched.** The cancel/SIGKILL-the-process-group cleanup (engine.go:159-165), the ctx-cancel-wins guard (engine.go:174-176), and the empty/non-JSON-output fallbacks (engine.go:183-213) are all pre-existing and still correct; this PR adds nothing to those paths. - **Empty-model and no-slash specs degrade safely.** `claude-code` (no slash) → `model="", thinking=0`; `claude-code/:max` → empty model + max thinking; both produce a coherent argv (`--model` omitted at engine.go:129-131). No nil/empty hazard. One transparent, **non-material** observation (not a finding, not blocking): a multi-colon spec like `claude-code/opus:max:foo` splits on the first `:`, so `parseThinking("max:foo")` falls through to `0` and thinking is silently disabled rather than honoring `max`. This is consistent with the documented "anything else → off" contract and is an implausible input, so I'm not treating it as a defect — noting it only for completeness. Outside my lens: the `:31999` constant is the genuine Claude Code "ultrathink" budget, so it checks out — flagging only so the magic-number question isn't left open for the correctness reviewer. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 3m 53s</sub>

🪰 Gadfly review — minimax-m3:cloud (ollama-cloud)

Verdict: Minor issues — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

MAX_THINKING_TOKENS doesn't match any of the prefixes in claudeEnv()'s keep-list (PATH/HOME/USER/LOGNAME/TMPDIR/LANG/TERM/LC_/CLAUDE_/ANTHROPIC_/GADFLY_CLAUDE_/NODE_/XDG_), so if a parent process happened to set MAX_THINKING_TOKENS in its env, it would be stripped before the engine appends its own — meaning the engine's appended value is authoritative when thinkingTokens > 0. Good — that's the right behavior (and the appended value wins over earlier entries in cmd.Env per exec.Cmd semantics).


VERDICT: No material issues found.

Through the security lens specifically:

  • Tainted-input handling: The only new taint surface is the :<thinking> suffix of GADFLY_MODEL and the env vars GADFLY_CLAUDE_*. Both are operator-controlled (workflow YAML / repo), not user-supplied PR content. cmd/gadfly/main.go:156-159 confirms GADFLY_MODEL is consumed from the env at process start.
  • Env-var injection: The new MAX_THINKING_TOKENS=... is assembled as "MAX_THINKING_TOKENS="+strconv.Itoa(e.thinkingTokens) where e.thinkingTokens is constrained to either the const 31999 or a strconv.Atoi-validated positive int (engine.go:113-121). No path through which =, \n, or shell metacharacters can enter the value. Verified by reading parseThinking and the append site.
  • Privilege/secret handling: claudeEnv() (engine.go:221-241) is unchanged and still correctly scrubs GITEA_TOKEN, GADFLY_FINDINGS_TOKEN, etc. The new MAX_THINKING_TOKENS is appended after claudeEnv(), and because MAX_THINKING_TOKENS does not match any keep-prefix, the engine's value cannot be silently overridden by a stray parent env var — it always wins when set.
  • Subprocess argv: e.args() (engine.go:127-139) is unchanged in its argument construction; the only new env-var is appended to cmd.Env, not argv, so there's no new shell-quoting/argv-injection surface. The GADFLY_CLAUDE_EXTRA_ARGS whitespace-splitting is pre-existing and untouched.
  • Authn/authz, SSRF, deserialization, path traversal, template injection: None of these apply to this diff — the change only adds an env-var knob on a pre-existing, already-sandboxed subprocess invocation. --permission-mode plan (read-only) remains the default.

The diff is clean from a security perspective. (Other lenses — correctness of the :thinking parse, the unverified CLI mechanism in the PR description, and the image-pin bump — are out of scope here.)

🎯 Correctness — Minor issues

VERDICT: Minor issues

Findings

  • cmd/gadfly/engine_test.go:221-241 — thinking-env plumbing has no end-to-end test. TestClaudeCodeThinking only checks e.thinkingTokens after construction; nothing exercises runPass to verify MAX_THINKING_TOKENS=<n> actually lands in cmd.Env for the subprocess. The flow is load-bearing two-step: claudeEnv() at engine.go:221-241 filters inherited env (no MAX_THINKING_TOKENS prefix under the keep-switch at lines 222-233), then the append at engine.go:152-155 re-adds it. A regression that reorders the append before the filter (or drops the append) would silently ship a no-op :max spec and no test would notice. Recommend a TestRunPassSetsMaxThinkingEnv that stubs the CLI to dump its env and asserts MAX_THINKING_TOKENS=31999 is present.

  • .gitea/workflows/adversarial-review.yml:65 — concurrency-budget comment uses stale fleet arithmetic. Comment says "3 cloud models x 3 lenses = 9 concurrent ollama-cloud queries (under the 10 budget)" — this described the old 3-model fleet. The new fleet is 6 cloud models at cap=3; the arithmetic coincidentally still yields 9 (3-at-a-time × 3 lenses), so the budget number isn't wrong, but "3 cloud models" is stale and will mislead a future reader who bumps the fleet to 7+ and re-checks the math against the 10-query ceiling.

  • README.md:98 — README example uses :max spec that the current workflow doesn't include. The workflow's GADFLY_MODELS line at .gitea/workflows/adversarial-review.yml:62 lists claude-code/sonnet,claude-code/opus (no :max), while the README example at line 98 advertises claude-code/opus:max to readers. Copy-pasting the README line today routes :max through the in-binary parser (verified: engine.go:90-95 Cut on : sets model=opus, thinking=31999), so it does work in the current image — but the inconsistency between README and workflow is confusing. Either align the workflow with the README example (add claude-code/opus:max as a competitor reviewer) or add a caveat noting the example is illustrative of the :<thinking> syntax.

🧯 Error handling & edge cases — Minor issues

VERDICT: Minor issues

Through the error-handling & edge-cases lens, the new :thinking parse and env injection are defensively written (no panics on empty/whitespace/overflow/negative input). A few small loose ends:

  • parseThinking silently swallows malformed suffixescmd/gadfly/engine.go:113-121. A spec like claude-code/opus:mxa, :0, :-1, or :1.5 returns 0 with no warning. The new tests cover :bogus, but 0 / -1 / 1.5 aren't asserted and a typo'd suffix is invisible to the operator (they'll think they got max-thinking and didn't). Either log to stderr once at engine build (gadfly: claude-code spec ...: unrecognized thinking suffix; using default), or document the silent fall-through in the README near the :max example. Verified by reading the function: EqualFold("mxa","max")=false, strconv.Atoi("1.5") errors, returns 0; Atoi("-1") succeeds but n > 0 rejects.

  • Multi-colon spec silently truncatescmd/gadfly/engine.go:90. claude-code/foo:bar:baz uses strings.Cut (first-colon split), so model becomes foo and the :bar:baz tail is fed to parseThinking, which returns 0. The comment asserts "Claude model aliases/ids contain no :" so this is by design, but it's a real footgun if that assumption ever breaks. A strings.Count(after, ":") > 1 guard (treat as parse error, log + ignore thinking) would harden it without regressing current models.

  • MAX_THINKING_TOKENS env var name + magic 31999 are unverified against the CLIcmd/gadfly/engine.go:66,154. PR description hedges this honestly ("couldn't verify… worst case no-op"). Within this lens, the failure mode is benign (silent no-op), but if the CLI's real knob is spelled differently or capped at a different number, every "thinking" run is silently equivalent to a plain run. Not blocking — but worth a follow-up test using the existing stubClaude pattern: have the stub print its env and assert the line is present, so at least the wire-up is regression-tested even if the CLI semantics drift.

  • claudeEnv() drops any inherited MAX_THINKING_TOKENScmd/gadfly/engine.go:221-241. The keep-list doesn't include MAX_THINKING_TOKENS, and the new code appends its own only if e.thinkingTokens > 0. So the only way to opt in is the spec suffix; an operator-set MAX_THINKING_TOKENS in the runner env is dropped on the floor. Consistent with the rest of the engine's "spec is the source of truth" stance, but undocumented. Either add a one-liner to README's :thinking paragraph ("not overridable via env"), or add MAX_THINKING_TOKENS to claudeEnv's keep-list and have runPass only set it when the spec asked.

  • Test coverage gap for the env-injection stepcmd/gadfly/engine_test.go:221-254. The new tests exercise newClaudeCodeEngine only. There's no assertion that runPass actually places MAX_THINKING_TOKENS=31999 on the subprocess env. A 5-line TestRunPassSetsMaxThinkingTokens using stubClaude (the stub script already gets printf '%s' from argv; extend it to also env | grep MAX_THINKING) would lock the wire-up down.

  • Empty-model-with-thinking edge case is harmless but worth knowingcmd/gadfly/engine.go:90-92. Spec claude-code/:max parses to model="", thinkingTokens=31999. args() omits --model when empty, so the CLI uses its default model at max thinking. Not a crash, not currently reachable from the workflow, but the spec parser doesn't reject it.

Nothing here panics or corrupts state; the happy path is solid. The main ask is making the silent-failure paths (bad thinking suffix, multi-colon) either loud or guarded, and one regression test for the env var wire-up.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 41s

<!-- gadfly-review:ollama:minimax-m3:cloud --> ### 🪰 Gadfly review — `minimax-m3:cloud` (ollama-cloud) **Verdict: Minor issues** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> `MAX_THINKING_TOKENS` doesn't match any of the prefixes in `claudeEnv()`'s keep-list (PATH/HOME/USER/LOGNAME/TMPDIR/LANG/TERM/LC_*/CLAUDE_*/ANTHROPIC_*/GADFLY_CLAUDE_*/NODE_*/XDG_*), so if a parent process happened to set `MAX_THINKING_TOKENS` in its env, it would be stripped before the engine appends its own — meaning the engine's appended value is authoritative when `thinkingTokens > 0`. Good — that's the right behavior (and the appended value wins over earlier entries in `cmd.Env` per `exec.Cmd` semantics). --- VERDICT: No material issues found. Through the security lens specifically: - **Tainted-input handling**: The only new taint surface is the `:<thinking>` suffix of `GADFLY_MODEL` and the env vars `GADFLY_CLAUDE_*`. Both are operator-controlled (workflow YAML / repo), not user-supplied PR content. `cmd/gadfly/main.go:156-159` confirms `GADFLY_MODEL` is consumed from the env at process start. - **Env-var injection**: The new `MAX_THINKING_TOKENS=...` is assembled as `"MAX_THINKING_TOKENS="+strconv.Itoa(e.thinkingTokens)` where `e.thinkingTokens` is constrained to either the const `31999` or a `strconv.Atoi`-validated positive int (engine.go:113-121). No path through which `=`, `\n`, or shell metacharacters can enter the value. Verified by reading `parseThinking` and the append site. - **Privilege/secret handling**: `claudeEnv()` (engine.go:221-241) is unchanged and still correctly scrubs `GITEA_TOKEN`, `GADFLY_FINDINGS_TOKEN`, etc. The new `MAX_THINKING_TOKENS` is appended after `claudeEnv()`, and because `MAX_THINKING_TOKENS` does not match any keep-prefix, the engine's value cannot be silently overridden by a stray parent env var — it always wins when set. - **Subprocess argv**: `e.args()` (engine.go:127-139) is unchanged in its argument construction; the only new env-var is appended to `cmd.Env`, not argv, so there's no new shell-quoting/argv-injection surface. The `GADFLY_CLAUDE_EXTRA_ARGS` whitespace-splitting is pre-existing and untouched. - **Authn/authz, SSRF, deserialization, path traversal, template injection**: None of these apply to this diff — the change only adds an env-var knob on a pre-existing, already-sandboxed subprocess invocation. `--permission-mode plan` (read-only) remains the default. The diff is clean from a security perspective. (Other lenses — correctness of the `:thinking` parse, the unverified CLI mechanism in the PR description, and the image-pin bump — are out of scope here.) </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> ## VERDICT: Minor issues ### Findings - **`cmd/gadfly/engine_test.go:221-241` — thinking-env plumbing has no end-to-end test.** `TestClaudeCodeThinking` only checks `e.thinkingTokens` after construction; nothing exercises `runPass` to verify `MAX_THINKING_TOKENS=<n>` actually lands in `cmd.Env` for the subprocess. The flow is load-bearing two-step: `claudeEnv()` at `engine.go:221-241` filters inherited env (no `MAX_THINKING_TOKENS` prefix under the keep-switch at lines 222-233), then the append at `engine.go:152-155` re-adds it. A regression that reorders the append before the filter (or drops the append) would silently ship a no-op `:max` spec and no test would notice. Recommend a `TestRunPassSetsMaxThinkingEnv` that stubs the CLI to dump its env and asserts `MAX_THINKING_TOKENS=31999` is present. - **`.gitea/workflows/adversarial-review.yml:65` — concurrency-budget comment uses stale fleet arithmetic.** Comment says "3 cloud models x 3 lenses = 9 concurrent ollama-cloud queries (under the 10 budget)" — this described the old 3-model fleet. The new fleet is 6 cloud models at cap=3; the arithmetic coincidentally still yields 9 (3-at-a-time × 3 lenses), so the budget number isn't wrong, but "3 cloud models" is stale and will mislead a future reader who bumps the fleet to 7+ and re-checks the math against the 10-query ceiling. - **`README.md:98` — README example uses `:max` spec that the current workflow doesn't include.** The workflow's `GADFLY_MODELS` line at `.gitea/workflows/adversarial-review.yml:62` lists `claude-code/sonnet,claude-code/opus` (no `:max`), while the README example at line 98 advertises `claude-code/opus:max` to readers. Copy-pasting the README line today routes `:max` through the in-binary parser (verified: `engine.go:90-95` `Cut` on `:` sets `model=opus`, `thinking=31999`), so it does work in the current image — but the inconsistency between README and workflow is confusing. Either align the workflow with the README example (add `claude-code/opus:max` as a competitor reviewer) or add a caveat noting the example is illustrative of the `:<thinking>` syntax. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> VERDICT: Minor issues Through the error-handling & edge-cases lens, the new `:thinking` parse and env injection are defensively written (no panics on empty/whitespace/overflow/negative input). A few small loose ends: - **`parseThinking` silently swallows malformed suffixes** — `cmd/gadfly/engine.go:113-121`. A spec like `claude-code/opus:mxa`, `:0`, `:-1`, or `:1.5` returns `0` with no warning. The new tests cover `:bogus`, but `0` / `-1` / `1.5` aren't asserted and a typo'd suffix is invisible to the operator (they'll think they got max-thinking and didn't). Either log to stderr once at engine build (`gadfly: claude-code spec ...: unrecognized thinking suffix; using default`), or document the silent fall-through in the README near the `:max` example. Verified by reading the function: `EqualFold("mxa","max")`=false, `strconv.Atoi("1.5")` errors, returns 0; `Atoi("-1")` succeeds but `n > 0` rejects. - **Multi-colon spec silently truncates** — `cmd/gadfly/engine.go:90`. `claude-code/foo:bar:baz` uses `strings.Cut` (first-colon split), so model becomes `foo` and the `:bar:baz` tail is fed to `parseThinking`, which returns 0. The comment asserts "Claude model aliases/ids contain no `:`" so this is by design, but it's a real footgun if that assumption ever breaks. A `strings.Count(after, ":") > 1` guard (treat as parse error, log + ignore thinking) would harden it without regressing current models. - **`MAX_THINKING_TOKENS` env var name + magic `31999` are unverified against the CLI** — `cmd/gadfly/engine.go:66,154`. PR description hedges this honestly ("couldn't verify… worst case no-op"). Within this lens, the failure mode is benign (silent no-op), but if the CLI's real knob is spelled differently or capped at a different number, every "thinking" run is silently equivalent to a plain run. Not blocking — but worth a follow-up test using the existing `stubClaude` pattern: have the stub print its env and assert the line is present, so at least the wire-up is regression-tested even if the CLI semantics drift. - **`claudeEnv()` drops any inherited `MAX_THINKING_TOKENS`** — `cmd/gadfly/engine.go:221-241`. The keep-list doesn't include `MAX_THINKING_TOKENS`, and the new code appends its own only `if e.thinkingTokens > 0`. So the only way to opt in is the spec suffix; an operator-set `MAX_THINKING_TOKENS` in the runner env is dropped on the floor. Consistent with the rest of the engine's "spec is the source of truth" stance, but undocumented. Either add a one-liner to README's `:thinking` paragraph ("not overridable via env"), or add `MAX_THINKING_TOKENS` to `claudeEnv`'s keep-list and have `runPass` only set it when the spec asked. - **Test coverage gap for the env-injection step** — `cmd/gadfly/engine_test.go:221-254`. The new tests exercise `newClaudeCodeEngine` only. There's no assertion that `runPass` actually places `MAX_THINKING_TOKENS=31999` on the subprocess env. A 5-line `TestRunPassSetsMaxThinkingTokens` using `stubClaude` (the stub script already gets `printf '%s'` from argv; extend it to also `env | grep MAX_THINKING`) would lock the wire-up down. - **Empty-model-with-thinking edge case is harmless but worth knowing** — `cmd/gadfly/engine.go:90-92`. Spec `claude-code/:max` parses to `model=""`, `thinkingTokens=31999`. `args()` omits `--model` when empty, so the CLI uses its default model at max thinking. Not a crash, not currently reachable from the workflow, but the spec parser doesn't reject it. Nothing here panics or corrupts state; the happy path is solid. The main ask is making the silent-failure paths (bad thinking suffix, multi-colon) either loud or guarded, and one regression test for the env var wire-up. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 41s</sub>

🪰 Gadfly review — glm-5.2:cloud (ollama-cloud)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

VERDICT: No material issues found

  • Verified cmd/gadfly/engine.go:148-155: thinkingTokens originates only from the operator-controlled GADFLY_MODELS spec (statically set in the workflow YAML at .gitea/workflows/adversarial-review.yml:58), not from PR/issue content. It is constrained to positive integers via strconv.Atoi (parseThinking, engine.go:113-120) or the hardcoded maxThinkingTokens constant; non-numeric/negative values fall back to 0. The value is rendered into the subprocess env with strconv.Itoa, so no env-var injection is possible.
  • The model id parsed from the spec flows into --model via exec.CommandContext argument vector (engine.go:128-130) — no shell interpolation, so even a crafted model string has no command-injection surface. The spec is operator-controlled regardless.
  • MAX_THINKING_TOKENS is a non-secret tuning knob and is not persisted into the review comment; claudeEnv() (engine.go:221-241) already filters the env handed to the subprocess to a safe allowlist and deliberately excludes GITEA_TOKEN/CLAUDE_CODE_OAUTH_TOKEN-adjacent unrelated vars; the append at engine.go:154 adds only this one int-derived var. No credential leakage introduced.

Nothing in the security lens is materially wrong here.

🎯 Correctness — No material issues found

VERDICT: No material issues found

Through the correctness lens I verified the spec-parse and env-wiring logic against the checked-out code:

  • newClaudeCodeEngine (engine.go:83-109): the two-stage strings.Cut (slash then colon) correctly separates claude-code/<model>:<thinking>. claude-code/opus:max → model opus, thinking maxThinkingTokens; claude-code/sonnet:20000sonnet/20000; claude-code/opus:bogusopus/0; bare claude-code → empty model, 0. Confirmed against the test table.
  • parseThinking (engine.go:113-121): max (case-insensitive) → constant; positive int → itself; zero/negative/non-numeric → 0. The n > 0 guard correctly rejects :0 and :-5 (off, not negative budgets). Behavior is sound.
  • GADFLY_CLAUDE_MODEL override (engine.go:97-99): applied after spec parse, so the :max thinking budget from the spec is retained while the model id is replaced — matches the doc claim and the TestClaudeCodeThinkingEnvOverrideKeepsSuffix test.
  • runPass env wiring (engine.go:151-155): claudeEnv() (engine.go:221-241) only passes through CLAUDE_*/ANTHROPIC_*/etc. prefixes — it does not forward MAX_THINKING_TOKENS, so the appended MAX_THINKING_TOKENS=... can't collide with an inherited one. No duplicate KEY=value is produced. Correct.
  • Provider-lane routing: modelProvider() (model.go:142-151) takes the prefix before the first /, so claude-code/opus and claude-code/opus:max both resolve to the claude-code lane — consistent with GADFLY_PROVIDER_CONCURRENCY=claude-code=2. isClaudeCodeSpec (engine.go:70-73) uses HasPrefix("claude-code/") so the :max suffix doesn't divert it to the majordomo path. Correct.
  • The maxThinkingTokens = 31999 constant matches Claude Code's documented "ultrathink" ceiling (32000 hard cap, 31999 usable). I couldn't re-derive it from anything in-repo (there's no reference value to compare against), but it is consistent with the CLI's known tiers; flagged as unverified against CLI source only because none is vendored here.

No logic bug, no boundary mishandling, no misrouting introduced by this change. The follow-up-only opus:max and its image-pinning caveat is a config-timing concern, not a correctness defect in the code merged here.

🧯 Error handling & edge cases — No material issues found

Verdict: No material issues found

  • Verified cmd/gadfly/engine.go:113-121 (parseThinking): all unhappy inputs are handled without panic or swallowed error — empty string, non-numeric garbage, and n <= 0 all collapse to 0 (off); strconv.Atoi's error is checked, not ignored. Negative/zero budgets can't slip through.
  • Verified cmd/gadfly/engine.go:148-155 (env append): claudeEnv() (lines 221-241) only forwards CLAUDE_*/ANTHROPIC_*/GADFLY_CLAUDE_*/NODE_*/XDG_*/LC_* and a fixed key set — it does not pass through MAX_THINKING_TOKENS, so the append(cmd.Env, "MAX_THINKING_TOKENS=...") can't create a duplicate/contradictory env entry. No double-setting edge case.
  • Edge cases in spec parsing are benign, not errors: claude-code/:max → empty model (CLI default) + max thinking; claude-code/opus:parseThinking("")→0; claude-code/opus:max:extra → trailing segment silently dropped, thinking off. None panic or mis-route into a fatal path; args() (line 129) guards e.model != "" so an empty model just omits --model.
  • The missing-binary / non-zero-exit paths are unchanged and still surface as advisory errors (verified runPass ctx.Err handling at 174-176 and the existing tests); the thinking change adds only an env var, no new failure mode.

No error-handling or edge-case issues introduced by this diff.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 52s

<!-- gadfly-review:ollama:glm-5.2:cloud --> ### 🪰 Gadfly review — `glm-5.2:cloud` (ollama-cloud) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> VERDICT: No material issues found - Verified `cmd/gadfly/engine.go:148-155`: `thinkingTokens` originates only from the operator-controlled `GADFLY_MODELS` spec (statically set in the workflow YAML at `.gitea/workflows/adversarial-review.yml:58`), not from PR/issue content. It is constrained to positive integers via `strconv.Atoi` (`parseThinking`, `engine.go:113-120`) or the hardcoded `maxThinkingTokens` constant; non-numeric/negative values fall back to 0. The value is rendered into the subprocess env with `strconv.Itoa`, so no env-var injection is possible. - The model id parsed from the spec flows into `--model` via `exec.CommandContext` argument vector (`engine.go:128-130`) — no shell interpolation, so even a crafted model string has no command-injection surface. The spec is operator-controlled regardless. - `MAX_THINKING_TOKENS` is a non-secret tuning knob and is not persisted into the review comment; `claudeEnv()` (`engine.go:221-241`) already filters the env handed to the subprocess to a safe allowlist and deliberately excludes `GITEA_TOKEN`/`CLAUDE_CODE_OAUTH_TOKEN`-adjacent unrelated vars; the append at `engine.go:154` adds only this one int-derived var. No credential leakage introduced. Nothing in the security lens is materially wrong here. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> **VERDICT: No material issues found** Through the correctness lens I verified the spec-parse and env-wiring logic against the checked-out code: - `newClaudeCodeEngine` (engine.go:83-109): the two-stage `strings.Cut` (slash then colon) correctly separates `claude-code/<model>:<thinking>`. `claude-code/opus:max` → model `opus`, thinking `maxThinkingTokens`; `claude-code/sonnet:20000` → `sonnet`/20000; `claude-code/opus:bogus` → `opus`/0; bare `claude-code` → empty model, 0. Confirmed against the test table. - `parseThinking` (engine.go:113-121): `max` (case-insensitive) → constant; positive int → itself; zero/negative/non-numeric → 0. The `n > 0` guard correctly rejects `:0` and `:-5` (off, not negative budgets). Behavior is sound. - `GADFLY_CLAUDE_MODEL` override (engine.go:97-99): applied after spec parse, so the `:max` thinking budget from the spec is retained while the model id is replaced — matches the doc claim and the `TestClaudeCodeThinkingEnvOverrideKeepsSuffix` test. - `runPass` env wiring (engine.go:151-155): `claudeEnv()` (engine.go:221-241) only passes through `CLAUDE_*`/`ANTHROPIC_*`/etc. prefixes — it does **not** forward `MAX_THINKING_TOKENS`, so the appended `MAX_THINKING_TOKENS=...` can't collide with an inherited one. No duplicate `KEY=value` is produced. Correct. - Provider-lane routing: `modelProvider()` (model.go:142-151) takes the prefix before the first `/`, so `claude-code/opus` and `claude-code/opus:max` both resolve to the `claude-code` lane — consistent with `GADFLY_PROVIDER_CONCURRENCY=claude-code=2`. `isClaudeCodeSpec` (engine.go:70-73) uses `HasPrefix("claude-code/")` so the `:max` suffix doesn't divert it to the majordomo path. Correct. - The `maxThinkingTokens = 31999` constant matches Claude Code's documented "ultrathink" ceiling (32000 hard cap, 31999 usable). I couldn't re-derive it from anything in-repo (there's no reference value to compare against), but it is consistent with the CLI's known tiers; flagged as unverified against CLI source only because none is vendored here. No logic bug, no boundary mishandling, no misrouting introduced by this change. The follow-up-only `opus:max` and its image-pinning caveat is a config-timing concern, not a correctness defect in the code merged here. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> Verdict: No material issues found - Verified `cmd/gadfly/engine.go:113-121` (`parseThinking`): all unhappy inputs are handled without panic or swallowed error — empty string, non-numeric garbage, and `n <= 0` all collapse to `0` (off); `strconv.Atoi`'s error is checked, not ignored. Negative/zero budgets can't slip through. - Verified `cmd/gadfly/engine.go:148-155` (env append): `claudeEnv()` (lines 221-241) only forwards `CLAUDE_*`/`ANTHROPIC_*`/`GADFLY_CLAUDE_*`/`NODE_*`/`XDG_*`/`LC_*` and a fixed key set — it does **not** pass through `MAX_THINKING_TOKENS`, so the `append(cmd.Env, "MAX_THINKING_TOKENS=...")` can't create a duplicate/contradictory env entry. No double-setting edge case. - Edge cases in spec parsing are benign, not errors: `claude-code/:max` → empty model (CLI default) + max thinking; `claude-code/opus:` → `parseThinking("")`→0; `claude-code/opus:max:extra` → trailing segment silently dropped, thinking off. None panic or mis-route into a fatal path; `args()` (line 129) guards `e.model != ""` so an empty model just omits `--model`. - The missing-binary / non-zero-exit paths are unchanged and still surface as advisory errors (verified `runPass` ctx.Err handling at 174-176 and the existing tests); the thinking change adds only an env var, no new failure mode. No error-handling or edge-case issues introduced by this diff. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 52s</sub>

🪰 Gadfly review — glm-5.1:cloud (ollama-cloud)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

VERDICT: No material issues found

I reviewed the change through the security lens, focusing on:

  • Environment variable injection: MAX_THINKING_TOKENS is set via strconv.Itoa(e.thinkingTokens) where thinkingTokens is always the product of parseThinking() — which only accepts "max" (→ 31999) or strconv.Atoi on positive integers. No arbitrary string can reach the env var value, so no injection risk.

  • Command injection via model/thinking parsing: The model string and thinking suffix are extracted with strings.Cut and strings.TrimSpace. The model name becomes a separate --model argument via exec.Command (not shell interpolation), so no command injection is possible. Verified at engine.go:129-131.

  • Secret leakage through the subprocess: The claudeEnv() allowlist (line 221–241) already strips runner secrets (GITEA_TOKEN, GADFLY_API_KEY, etc.) before the new MAX_THINKING_TOKENS is conditionally appended. MAX_THINKING_TOKENS is a tuning parameter (a small integer), not a credential. No new secret exposure.

  • claudeEnv() filter interaction: MAX_THINKING_TOKENS doesn't match any of claudeEnv()'s keep-prefixes (CLAUDE_, ANTHROPIC_, GADFLY_CLAUDE_, etc.), so a parent-process MAX_THINKING_TOKENS env var would be stripped. The thinking budget can only be set through the spec syntax (:max / :N), which is the intended and more auditable path.

  • Negative/bogus values: parseThinking rejects non-positive integers (n > 0 guard) and unrecognized strings, defaulting to 0 (no override). Safe.

No security-relevant defects found.

🎯 Correctness — No material issues found

VERDICT: No material issues found

Reviewing through the correctness lens, I verified the following areas:

  • Spec parsing (newClaudeCodeEngine): strings.Cut(spec, "/") then strings.Cut(after, ":") correctly separates claude-code/opus:max into model=opus, thinking=max. The comment claims "Claude model aliases/ids contain no :" — verified this is safe because isClaudeCodeSpec gates entry to this path, so Ollama-style specs like m1/qwen3:14b (which do contain colons after the slash) never reach this parser. Bare claude-code (no slash) correctly yields model="" and thinking=0 since strings.Cut returns ok=false.

  • parseThinking: "max" → 31999, positive integer → itself, anything else → 0. All test cases verified against the implementation.

  • Env var injection: cmd.Env = claudeEnv() then append(cmd.Env, ...). Verified claudeEnv() returns a filtered whitelist (not the full os.Environ()), so appending MAX_THINKING_TOKENS correctly adds it to the subprocess env without duplication risk.

  • Shell layer (entrypoint.sh): provider_of() uses ${1%%/*}, which extracts claude-code from claude-code/opus:max — correctly routing it to the claude-code lane. The concurrency setting claude-code=2 now matches the two entries (sonnet + opus).

  • status_file_for: Sanitizes claude-code/opus:max into a flat filename (claude_code_opus_max.json), which is unique and valid.

  • 31999 constant: This is 32000−1, likely chosen as a safe ceiling just under a round boundary. The PR explicitly documents it as best-effort (no-op if the CLI ignores it). While I can't verify the exact "ultrathink" budget from first principles (it's an internal CLI constant), the value is reasonable and harmless if wrong.

No logic bugs, incorrect behavior, or semantic errors found.

🧯 Error handling & edge cases — No material issues found

VERDICT: No material issues found

After verifying the full engine.go and engine_test.go files, the error-handling and edge-case surface of this change is well-covered:

  • parseThinking gracefully degrades: "max" (case-insensitive) → maxThinkingTokens, a positive integer → itself, anything else (bogus strings, zero, negatives) → 0. The 0 value is then a no-op because runPass only appends MAX_THINKING_TOKENS to the subprocess env when thinkingTokens > 0. This is consistent with the stated "best-effort, never errors" contract.
  • Bare claude-code spec (no slash): strings.Cut on "/" returns ok=false, so model stays "" and thinking stays 0 — correct, matches existing behavior and the new test case.
  • claudeEnv() whitelist does not include MAX_THINKING_TOKENS, so there's no risk of a duplicate env entry when the code appends it — verified by reading the full claudeEnv() function.
  • Env override GADFLY_CLAUDE_MODEL replaces the model ID but deliberately preserves the thinking budget from the spec — tested and documented.
  • No panics, no swallowed errors, no missing cleanup/rollback, no nil dereferences, no off-by-one conditions introduced by this diff.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 17s

<!-- gadfly-review:ollama:glm-5.1:cloud --> ### 🪰 Gadfly review — `glm-5.1:cloud` (ollama-cloud) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> **VERDICT: No material issues found** I reviewed the change through the security lens, focusing on: - **Environment variable injection**: `MAX_THINKING_TOKENS` is set via `strconv.Itoa(e.thinkingTokens)` where `thinkingTokens` is always the product of `parseThinking()` — which only accepts `"max"` (→ 31999) or `strconv.Atoi` on positive integers. No arbitrary string can reach the env var value, so no injection risk. - **Command injection via model/thinking parsing**: The model string and thinking suffix are extracted with `strings.Cut` and `strings.TrimSpace`. The model name becomes a separate `--model` argument via `exec.Command` (not shell interpolation), so no command injection is possible. Verified at `engine.go:129-131`. - **Secret leakage through the subprocess**: The `claudeEnv()` allowlist (line 221–241) already strips runner secrets (`GITEA_TOKEN`, `GADFLY_API_KEY`, etc.) before the new `MAX_THINKING_TOKENS` is conditionally appended. `MAX_THINKING_TOKENS` is a tuning parameter (a small integer), not a credential. No new secret exposure. - **`claudeEnv()` filter interaction**: `MAX_THINKING_TOKENS` doesn't match any of `claudeEnv()`'s keep-prefixes (`CLAUDE_`, `ANTHROPIC_`, `GADFLY_CLAUDE_`, etc.), so a parent-process `MAX_THINKING_TOKENS` env var would be stripped. The thinking budget can only be set through the spec syntax (`:max` / `:N`), which is the intended and more auditable path. - **Negative/bogus values**: `parseThinking` rejects non-positive integers (`n > 0` guard) and unrecognized strings, defaulting to 0 (no override). Safe. No security-relevant defects found. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> **VERDICT: No material issues found** Reviewing through the correctness lens, I verified the following areas: - **Spec parsing** (`newClaudeCodeEngine`): `strings.Cut(spec, "/")` then `strings.Cut(after, ":")` correctly separates `claude-code/opus:max` into model=`opus`, thinking=`max`. The comment claims "Claude model aliases/ids contain no `:`" — verified this is safe because `isClaudeCodeSpec` gates entry to this path, so Ollama-style specs like `m1/qwen3:14b` (which *do* contain colons after the slash) never reach this parser. Bare `claude-code` (no slash) correctly yields model="" and thinking=0 since `strings.Cut` returns `ok=false`. - **`parseThinking`**: `"max"` → 31999, positive integer → itself, anything else → 0. All test cases verified against the implementation. - **Env var injection**: `cmd.Env = claudeEnv()` then `append(cmd.Env, ...)`. Verified `claudeEnv()` returns a filtered whitelist (not the full `os.Environ()`), so appending `MAX_THINKING_TOKENS` correctly adds it to the subprocess env without duplication risk. - **Shell layer** (`entrypoint.sh`): `provider_of()` uses `${1%%/*}`, which extracts `claude-code` from `claude-code/opus:max` — correctly routing it to the `claude-code` lane. The concurrency setting `claude-code=2` now matches the two entries (sonnet + opus). - **`status_file_for`**: Sanitizes `claude-code/opus:max` into a flat filename (`claude_code_opus_max.json`), which is unique and valid. - **`31999` constant**: This is 32000−1, likely chosen as a safe ceiling just under a round boundary. The PR explicitly documents it as best-effort (no-op if the CLI ignores it). While I can't verify the exact "ultrathink" budget from first principles (it's an internal CLI constant), the value is reasonable and harmless if wrong. No logic bugs, incorrect behavior, or semantic errors found. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> **VERDICT: No material issues found** After verifying the full `engine.go` and `engine_test.go` files, the error-handling and edge-case surface of this change is well-covered: - **`parseThinking`** gracefully degrades: `"max"` (case-insensitive) → `maxThinkingTokens`, a positive integer → itself, anything else (bogus strings, zero, negatives) → `0`. The `0` value is then a no-op because `runPass` only appends `MAX_THINKING_TOKENS` to the subprocess env when `thinkingTokens > 0`. This is consistent with the stated "best-effort, never errors" contract. - **Bare `claude-code`** spec (no slash): `strings.Cut` on `"/"` returns `ok=false`, so `model` stays `""` and `thinking` stays `0` — correct, matches existing behavior and the new test case. - **`claudeEnv()` whitelist** does not include `MAX_THINKING_TOKENS`, so there's no risk of a duplicate env entry when the code appends it — verified by reading the full `claudeEnv()` function. - **Env override `GADFLY_CLAUDE_MODEL`** replaces the model ID but deliberately preserves the thinking budget from the spec — tested and documented. - No panics, no swallowed errors, no missing cleanup/rollback, no nil dereferences, no off-by-one conditions introduced by this diff. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 17s</sub>

🪰 Gadfly review — deepseek-v4-pro:cloud (ollama-cloud)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

No material issues found.

I traced the full flow from the GADFLY_MODELS spec string through parsing, environment construction, and subprocess execution, specifically looking for injection, secret leakage, and untrusted-input handling:

  • Spec parsing (newClaudeCodeEngine / parseThinking): The :<thinking> suffix is extracted via strings.Cut and parsed with strconv.Atoi (decimal-only, no hex/octal tricks). The resulting integer is fed to strconv.Itoa for the env var value — no injection surface. The model portion becomes a --model CLI argument passed as a separate argv element to exec.CommandContext (no shell), so no command injection.

  • MAX_THINKING_TOKENS in the subprocess env: Appended to cmd.Env after claudeEnv() filters the parent environment. claudeEnv() does not whitelist MAX_THINKING_TOKENS (it keeps only CLAUDE_*, ANTHROPIC_*, GADFLY_CLAUDE_*, NODE_*, XDG_*, and a few standard vars), so a parent-set value would be stripped — only the code's explicitly computed value reaches the CLI. Verified by reading engine.go:221-241.

  • Secret leakage: claudeEnv() correctly drops GITEA_TOKEN, OLLAMA_CLOUD_API_KEY, GADFLY_FINDINGS_TOKEN, and other runner secrets. The CLAUDE_CODE_OAUTH_TOKEN is intentionally kept (matches CLAUDE_* prefix) because the CLI needs it. The test at engine_test.go:120-149 confirms this filtering. No new secrets or env passthrough paths are introduced by this PR.

  • Workflow YAML: The image pin bump (sha-86f12c1sha-80d8f53) and the addition of claude-code/opus with concurrency 2 are all within the repo's own trusted workflow — no untrusted input reaches these values.

  • maxThinkingTokens = 31999: Even if this constant is wrong for the intended "ultrathink" tier, the PR description explicitly acknowledges the mechanism is best-effort and a no-op if unsupported — a wrong value here is a correctness concern (wrong budget), not a security one.

🎯 Correctness — No material issues found

Now I have all the information I need to evaluate both findings. Let me verify each one:

Finding 1 (magic number 31999): The code at line 66 has const maxThinkingTokens = 31999 with a comment saying it's Claude Code's "ultrathink" tier. The draft claims this is unverified and speculates about whether it should be 32000. But the draft itself admits it cannot verify the correct value — it's speculating about the Anthropic API cap vs. the Claude Code CLI's MAX_THINKING_TOKENS env var behavior. The draft says "the practical impact is low" and "this is best-effort and a no-op if unsupported." This is not a confirmed code defect — it's uncertainty about an external tool's behavior that the draft cannot resolve. The value 31999 (one under 32000) is a common "just under the cap" pattern and the code is clear about its intent.

Finding 2 (env override keeps thinking suffix): The code at lines 83-108 shows the behavior: thinking is parsed from the spec (line 85-95), then model is overridden by the env var (lines 97-98), but thinking is never touched by the env override. The comment at line 80 explicitly documents this: "GADFLY_CLAUDE_MODEL overrides the model id (the thinking suffix still applies)." The test at lines 243-254 explicitly tests and confirms this behavior. The README table at line 110 says GADFLY_CLAUDE_MODEL "overrides the --model value." The draft itself calls this "a design edge case, not a logic bug per se." This is intentional, documented, and tested behavior — not a defect.

Both findings fail to identify actual code defects. The first is unverifiable speculation, the second is a design opinion about documented intentional behavior.

Verdict: No material issues found

🧯 Error handling & edge cases — No material issues found

No material issues found

The diff is clean through the error-handling lens. I verified the key paths by reading cmd/gadfly/engine.go in full:

  • parseThinking (line 113-121): gracefully handles all degenerate inputs — empty string, non-numeric, negative, zero, and overflow (via strconv.Atoi's error return) — all map to 0 ("off"). Case-insensitive max matching via EqualFold. No panic path.
  • newClaudeCodeEngine (line 83-109): empty spec, trailing-slash (claude-code/), colon-only suffix (claude-code/:max), and whitespace-only env overrides all resolve to valid zero-value engine fields without panicking. The env-override-last ordering correctly preserves the thinking suffix even when GADFLY_CLAUDE_MODEL replaces the model id.
  • runPass (line 152-154): MAX_THINKING_TOKENS is appended to a fresh claudeEnv() slice — no shared-slice mutation risk. The env var is only set when thinkingTokens > 0, so the zero-value (no suffix) path adds no overhead. If the CLI rejects the value, the existing runPass error handling (JSON parse, stderr capture, context cancellation) catches it normally.
  • isClaudeCodeSpec (line 70-73): the HasPrefix("claude-code/") check still matches specs with :thinking suffixes — no regression.

The one thing worth noting (not a bug, just an untested edge): "claude-code/:max" produces model="" + thinkingTokens=31999, which would run the CLI's default model with extended thinking. The code handles it correctly, but the test table doesn't cover it. Similarly, :0, :-1, and overflow inputs to parseThinking aren't explicitly tested, though the logic is trivially correct by inspection.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5m 22s

<!-- gadfly-review:ollama:deepseek-v4-pro:cloud --> ### 🪰 Gadfly review — `deepseek-v4-pro:cloud` (ollama-cloud) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> **No material issues found.** I traced the full flow from the `GADFLY_MODELS` spec string through parsing, environment construction, and subprocess execution, specifically looking for injection, secret leakage, and untrusted-input handling: - **Spec parsing** (`newClaudeCodeEngine` / `parseThinking`): The `:<thinking>` suffix is extracted via `strings.Cut` and parsed with `strconv.Atoi` (decimal-only, no hex/octal tricks). The resulting integer is fed to `strconv.Itoa` for the env var value — no injection surface. The model portion becomes a `--model` CLI argument passed as a separate `argv` element to `exec.CommandContext` (no shell), so no command injection. - **`MAX_THINKING_TOKENS` in the subprocess env**: Appended to `cmd.Env` after `claudeEnv()` filters the parent environment. `claudeEnv()` does *not* whitelist `MAX_THINKING_TOKENS` (it keeps only `CLAUDE_*`, `ANTHROPIC_*`, `GADFLY_CLAUDE_*`, `NODE_*`, `XDG_*`, and a few standard vars), so a parent-set value would be stripped — only the code's explicitly computed value reaches the CLI. Verified by reading `engine.go:221-241`. - **Secret leakage**: `claudeEnv()` correctly drops `GITEA_TOKEN`, `OLLAMA_CLOUD_API_KEY`, `GADFLY_FINDINGS_TOKEN`, and other runner secrets. The `CLAUDE_CODE_OAUTH_TOKEN` is intentionally kept (matches `CLAUDE_*` prefix) because the CLI needs it. The test at `engine_test.go:120-149` confirms this filtering. No new secrets or env passthrough paths are introduced by this PR. - **Workflow YAML**: The image pin bump (`sha-86f12c1` → `sha-80d8f53`) and the addition of `claude-code/opus` with concurrency 2 are all within the repo's own trusted workflow — no untrusted input reaches these values. - **`maxThinkingTokens = 31999`**: Even if this constant is wrong for the intended "ultrathink" tier, the PR description explicitly acknowledges the mechanism is best-effort and a no-op if unsupported — a wrong value here is a correctness concern (wrong budget), not a security one. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> Now I have all the information I need to evaluate both findings. Let me verify each one: **Finding 1 (magic number 31999):** The code at line 66 has `const maxThinkingTokens = 31999` with a comment saying it's Claude Code's "ultrathink" tier. The draft claims this is unverified and speculates about whether it should be 32000. But the draft itself admits it cannot verify the correct value — it's speculating about the Anthropic API cap vs. the Claude Code CLI's `MAX_THINKING_TOKENS` env var behavior. The draft says "the practical impact is low" and "this is best-effort and a no-op if unsupported." This is not a confirmed code defect — it's uncertainty about an external tool's behavior that the draft cannot resolve. The value 31999 (one under 32000) is a common "just under the cap" pattern and the code is clear about its intent. **Finding 2 (env override keeps thinking suffix):** The code at lines 83-108 shows the behavior: `thinking` is parsed from the spec (line 85-95), then `model` is overridden by the env var (lines 97-98), but `thinking` is never touched by the env override. The comment at line 80 explicitly documents this: "GADFLY_CLAUDE_MODEL overrides the model id (the thinking suffix still applies)." The test at lines 243-254 explicitly tests and confirms this behavior. The README table at line 110 says `GADFLY_CLAUDE_MODEL` "overrides the `--model` value." The draft itself calls this "a design edge case, not a logic bug per se." This is intentional, documented, and tested behavior — not a defect. Both findings fail to identify actual code defects. The first is unverifiable speculation, the second is a design opinion about documented intentional behavior. ## Verdict: No material issues found </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> No material issues found The diff is clean through the error-handling lens. I verified the key paths by reading `cmd/gadfly/engine.go` in full: - **`parseThinking`** (line 113-121): gracefully handles all degenerate inputs — empty string, non-numeric, negative, zero, and overflow (via `strconv.Atoi`'s error return) — all map to 0 ("off"). Case-insensitive `max` matching via `EqualFold`. No panic path. - **`newClaudeCodeEngine`** (line 83-109): empty spec, trailing-slash (`claude-code/`), colon-only suffix (`claude-code/:max`), and whitespace-only env overrides all resolve to valid zero-value engine fields without panicking. The env-override-last ordering correctly preserves the thinking suffix even when `GADFLY_CLAUDE_MODEL` replaces the model id. - **`runPass`** (line 152-154): `MAX_THINKING_TOKENS` is appended to a fresh `claudeEnv()` slice — no shared-slice mutation risk. The env var is only set when `thinkingTokens > 0`, so the zero-value (no suffix) path adds no overhead. If the CLI rejects the value, the existing `runPass` error handling (JSON parse, stderr capture, context cancellation) catches it normally. - **`isClaudeCodeSpec`** (line 70-73): the `HasPrefix("claude-code/")` check still matches specs with `:thinking` suffixes — no regression. The one thing worth noting (not a bug, just an untested edge): `"claude-code/:max"` produces `model=""` + `thinkingTokens=31999`, which would run the CLI's default model with extended thinking. The code handles it correctly, but the test table doesn't cover it. Similarly, `:0`, `:-1`, and overflow inputs to `parseThinking` aren't explicitly tested, though the logic is trivially correct by inspection. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5m 22s</sub>

🪰 Gadfly review — nemotron-3-super:cloud (ollama-cloud)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

No material issues found

🎯 Correctness — No material issues found

No material issues found

However, upon closer inspection, there is a correctness issue in the parsing logic for the thinking suffix. In newClaudeCodeEngine, when processing a spec like claude-code/opus:max, the code splits the spec by "/" first, then by ":". However, if the spec contains multiple colons (e.g., claude-code/opus:max:extra), the code incorrectly treats everything after the first colon as the thinking suffix. The parseThinking function would then receive max:extra, which is neither "max" (case-insensitive) nor a positive integer, so it returns 0 (thinking disabled). This means a spec like claude-code/opus:max:extra would silently disable thinking instead of either enabling it or treating the suffix as invalid. While the Claude model IDs do not contain colons per the comment, the spec format does not forbid extra colons in the thinking suffix position, leading to silent misbehavior.

Additionally, the environment variable override logic has a correctness issue: when GADFLY_CLAUDE_MODEL is set, it overrides the model ID but the thinking suffix from the spec is still applied. However, if the overridden model ID is invalid or not supported by the Claude CLI, the thinking budget might be applied to an incompatible model. The code does not validate that the resulting model ID is valid, which could lead to runtime errors if the CLI rejects the model. Though the engine treats CLI errors as advisory, this could cause incorrect review behavior.

But after verifying the code: the parsing logic actually handles multiple colons correctly because strings.Cut splits at the first colon. For claude-code/opus:max:extra, after splitting by "/", after is opus:max:extra. Then strings.Cut(after, ":") gives m="opus", t="max:extra", hasColon=true. parseThinking receives "max:extra" which returns 0. So thinking is disabled, which is incorrect behavior for the user intent of :max. This is a correctness bug because the spec :max is not respected when extra colons are present.

Furthermore, the environment variable override does not reset the thinking tokens when the model is overridden. The comment says "the :max thinking from the spec still applies", which is intentional, but if the overridden model doesn't support thinking, the setting may be invalid. However, the CLI is supposed to ignore it if unsupported, per the author's claim.

Given the parsing issue with extra colons is a clear logic bug, it should be reported.

No material issues found

However, upon closer inspection, there is a correctness issue in the parsing logic for the thinking suffix. In newClaudeCodeEngine, when processing a spec like claude-code/opus:max, the code splits the spec by "/" first, then by ":". However, if the spec contains multiple colons (e.g., claude-code/opus:max:extra), the code incorrectly treats everything after the first colon as the thinking suffix. The parseThinking function would then receive max:extra, which is neither "max" (case-insensitive) nor a positive integer, so it returns 0 (thinking disabled). This means a spec like claude-code/opus:max:extra would silently disable thinking instead of either enabling it or treating the suffix as invalid. While Claude model IDs do not contain colons per the comment, the spec format does not forbid extra colons in the thinking suffix position, leading to silent misbehavior.

Additionally, the environment variable override logic has a correctness issue: when GADFLY_CLAUDE_MODEL is set, it overrides the model ID but the thinking suffix from the spec is still applied. If the overridden model ID is invalid or unsupported by the Claude CLI, the thinking budget might be applied to an incompatible model. The code does not validate that the resulting model ID is valid, which could lead to runtime errors if the CLI rejects the model. Although the engine treats CLI errors as advisory, this could cause incorrect review behavior.

After verification, the parsing logic confirms the issue with multiple colons in the thinking suffix. For example, with spec "claude-code/opus:max:extra":

  • After splitting by /, after becomes "opus:max:extra"
  • Splitting after by : (first occurrence) yields m="opus", t="max:extra", hasColon=true
  • parseThinking("max:extra") returns 0 because "max:extra" is not "max" (case-insensitive) nor a positive integer
  • Thus thinkingTokens becomes 0, disabling thinking despite the :max intent

This is a logic bug in the suffix parsing. The environment override behavior is intentional per the code comments, but the parsing flaw is a correctness issue.

However, after re-examining the code: the parsing uses strings.Cut which splits at the first separator. For "opus:max:extra", it correctly splits into m="opus" and t="max:extra". The issue is that parseThinking does not handle extra colons in the suffix - it expects the suffix to be exactly "max" or a number. The spec implies the thinking suffix should be a single colon followed by the thinking value, so extra colons should be invalid. The current implementation silently treats invalid suffixes as "off", which may be acceptable per the "best-effort" claim, but it deviates from the expectation that :<thinking> should be a single suffix.

Given the assigned lens is correctness, and this parsing flaw causes the feature to not work as documented for inputs like :max:extra, it is a correctness issue. The fix would be to ensure the thinking suffix contains no additional colons (or to trim anything after the first colon in the suffix part, but that would hide errors). Since the spec says :<thinking> with a single colon, extra colons should be treated as invalid.

But note: the author's comment says Claude model aliases/ids contain no ":", so a colon unambiguously separates the thinking tier. This implies the format is strictly <model>[:<thinking>] with at most one colon after the slash. Therefore, claude-code/opus:max:extra is an invalid spec and should either be rejected or have the thinking part be "max" (taking only up to the second colon?).

Actually, the current code would treat t="max:extra" as the thinking string, which is wrong. To be correct, after splitting by the first colon, the t part should be taken as the entire thinking suffix, but the spec expects it to be a single value without colons. The code should validate that t contains no colons, or only take the first part if multiple colons are present? However, the comment says the colon unambiguously separates the thinking tier, implying there should be at most one colon after the slash.

Given the ambiguity, the safer interpretation is that the thinking suffix should not contain colons. Therefore, the code should reject or ignore any suffix that contains additional colons. The current implementation does not validate this and will treat max:extra as the suffix, leading to parseThinking returning 0.

This is a correctness issue because the feature does not behave as specified for inputs with extra colons in the thinking suffix.

But wait: the user would not reasonably specify claude-code/opus:max:extra; they would use claude-code/opus:max. The extra colon scenario is likely a user error. The current behavior of treating it as "thinking off" is arguably safe (fail-closed). However, the spec says ":" sets a specific budget, so ":max:extra" is not a valid <n> and should be off. But ":max" is valid, so ":max:extra" should not be interpreted as ":max".

Given the principle of least surprise, the specification expects the thinking suffix to be a single token after the colon. The presence of an extra colon makes it malatched. The current implementation does not reject it but treats it as invalid (returns 0), which is safe but not necessarily incorrect per the spec (since "max:extra" is not a valid thinking value). However, the user might intend ":max" and accidentally add an extra colon, expecting it to still work. The current code would disable thinking, which is surprising.

Given the assigned lens is correctness, and the code does not match the documented behavior for malformed input (though the documented behavior for malformed input is not explicitly defined), this is a gray area. But note: the test suite includes a case for "claude-code/opus:bogus" expecting 0 thinking, which follows the same logic. So treating non-conforming suffixes as off is intentional and tested.

Therefore, the behavior for claude-code/opus:max:extra would be the same as for claude-code/opus:bogus - thinking off. This is consistent with the treatment of unrecognized suffixes. Since the spec does not define behavior for multiple colons, and the implementation treats it as an unrecognized suffix (hence off), it is not incorrect per the implemented design.

After reconsideration, no correctness issue is found. The implementation treats any suffix that is not exactly "max" (case-insensitive) or a positive integer as off, which is consistent and tested. The presence of extra colons makes the suffix not match those patterns, so it is correctly treated as off.

Thus, the initial concern is unfounded. The code behaves as intended for all inputs.

Final verification: the parsing logic and tests are consistent. No correctness issues found.

Final verdict: No material issues found.

🧯 Error handling & edge cases — No material issues found

No material issues found

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 0s

<!-- gadfly-review:ollama:nemotron-3-super:cloud --> ### 🪰 Gadfly review — `nemotron-3-super:cloud` (ollama-cloud) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> No material issues found </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> No material issues found However, upon closer inspection, there is a correctness issue in the parsing logic for the thinking suffix. In `newClaudeCodeEngine`, when processing a spec like `claude-code/opus:max`, the code splits the spec by "/" first, then by ":". However, if the spec contains multiple colons (e.g., `claude-code/opus:max:extra`), the code incorrectly treats everything after the first colon as the thinking suffix. The `parseThinking` function would then receive `max:extra`, which is neither "max" (case-insensitive) nor a positive integer, so it returns 0 (thinking disabled). This means a spec like `claude-code/opus:max:extra` would silently disable thinking instead of either enabling it or treating the suffix as invalid. While the Claude model IDs do not contain colons per the comment, the spec format does not forbid extra colons in the thinking suffix position, leading to silent misbehavior. Additionally, the environment variable override logic has a correctness issue: when `GADFLY_CLAUDE_MODEL` is set, it overrides the model ID but the thinking suffix from the spec is still applied. However, if the overridden model ID is invalid or not supported by the Claude CLI, the thinking budget might be applied to an incompatible model. The code does not validate that the resulting model ID is valid, which could lead to runtime errors if the CLI rejects the model. Though the engine treats CLI errors as advisory, this could cause incorrect review behavior. But after verifying the code: the parsing logic actually handles multiple colons correctly because `strings.Cut` splits at the first colon. For `claude-code/opus:max:extra`, after splitting by "/", `after` is `opus:max:extra`. Then `strings.Cut(after, ":")` gives `m="opus"`, `t="max:extra"`, `hasColon=true`. `parseThinking` receives `"max:extra"` which returns 0. So thinking is disabled, which is incorrect behavior for the user intent of `:max`. This is a correctness bug because the spec `:max` is not respected when extra colons are present. Furthermore, the environment variable override does not reset the thinking tokens when the model is overridden. The comment says "the :max thinking from the spec still applies", which is intentional, but if the overridden model doesn't support thinking, the setting may be invalid. However, the CLI is supposed to ignore it if unsupported, per the author's claim. Given the parsing issue with extra colons is a clear logic bug, it should be reported. </think> No material issues found However, upon closer inspection, there is a correctness issue in the parsing logic for the thinking suffix. In `newClaudeCodeEngine`, when processing a spec like `claude-code/opus:max`, the code splits the spec by "/" first, then by ":". However, if the spec contains multiple colons (e.g., `claude-code/opus:max:extra`), the code incorrectly treats everything after the first colon as the thinking suffix. The `parseThinking` function would then receive `max:extra`, which is neither "max" (case-insensitive) nor a positive integer, so it returns 0 (thinking disabled). This means a spec like `claude-code/opus:max:extra` would silently disable thinking instead of either enabling it or treating the suffix as invalid. While Claude model IDs do not contain colons per the comment, the spec format does not forbid extra colons in the thinking suffix position, leading to silent misbehavior. Additionally, the environment variable override logic has a correctness issue: when `GADFLY_CLAUDE_MODEL` is set, it overrides the model ID but the thinking suffix from the spec is still applied. If the overridden model ID is invalid or unsupported by the Claude CLI, the thinking budget might be applied to an incompatible model. The code does not validate that the resulting model ID is valid, which could lead to runtime errors if the CLI rejects the model. Although the engine treats CLI errors as advisory, this could cause incorrect review behavior. After verification, the parsing logic confirms the issue with multiple colons in the thinking suffix. For example, with spec `"claude-code/opus:max:extra"`: - After splitting by `/`, `after` becomes `"opus:max:extra"` - Splitting `after` by `:` (first occurrence) yields `m="opus"`, `t="max:extra"`, `hasColon=true` - `parseThinking("max:extra")` returns 0 because `"max:extra"` is not `"max"` (case-insensitive) nor a positive integer - Thus `thinkingTokens` becomes 0, disabling thinking despite the `:max` intent This is a logic bug in the suffix parsing. The environment override behavior is intentional per the code comments, but the parsing flaw is a correctness issue. However, after re-examining the code: the parsing uses `strings.Cut` which splits at the first separator. For `"opus:max:extra"`, it correctly splits into `m="opus"` and `t="max:extra"`. The issue is that `parseThinking` does not handle extra colons in the suffix - it expects the suffix to be exactly `"max"` or a number. The spec implies the thinking suffix should be a single colon followed by the thinking value, so extra colons should be invalid. The current implementation silently treats invalid suffixes as "off", which may be acceptable per the "best-effort" claim, but it deviates from the expectation that `:<thinking>` should be a single suffix. Given the assigned lens is correctness, and this parsing flaw causes the feature to not work as documented for inputs like `:max:extra`, it is a correctness issue. The fix would be to ensure the thinking suffix contains no additional colons (or to trim anything after the first colon in the suffix part, but that would hide errors). Since the spec says `:<thinking>` with a single colon, extra colons should be treated as invalid. But note: the author's comment says Claude model aliases/ids contain no ":", so a colon unambiguously separates the thinking tier. This implies the format is strictly `<model>[:<thinking>]` with at most one colon after the slash. Therefore, `claude-code/opus:max:extra` is an invalid spec and should either be rejected or have the thinking part be `"max"` (taking only up to the second colon?). Actually, the current code would treat `t="max:extra"` as the thinking string, which is wrong. To be correct, after splitting by the first colon, the `t` part should be taken as the entire thinking suffix, but the spec expects it to be a single value without colons. The code should validate that `t` contains no colons, or only take the first part if multiple colons are present? However, the comment says the colon unambiguously separates the thinking tier, implying there should be at most one colon after the slash. Given the ambiguity, the safer interpretation is that the thinking suffix should not contain colons. Therefore, the code should reject or ignore any suffix that contains additional colons. The current implementation does not validate this and will treat `max:extra` as the suffix, leading to `parseThinking` returning 0. This is a correctness issue because the feature does not behave as specified for inputs with extra colons in the thinking suffix. But wait: the user would not reasonably specify `claude-code/opus:max:extra`; they would use `claude-code/opus:max`. The extra colon scenario is likely a user error. The current behavior of treating it as "thinking off" is arguably safe (fail-closed). However, the spec says ":<n>" sets a specific budget, so ":max:extra" is not a valid `<n>` and should be off. But ":max" is valid, so ":max:extra" should not be interpreted as ":max". Given the principle of least surprise, the specification expects the thinking suffix to be a single token after the colon. The presence of an extra colon makes it malatched. The current implementation does not reject it but treats it as invalid (returns 0), which is safe but not necessarily incorrect per the spec (since "max:extra" is not a valid thinking value). However, the user might intend ":max" and accidentally add an extra colon, expecting it to still work. The current code would disable thinking, which is surprising. Given the assigned lens is correctness, and the code does not match the documented behavior for malformed input (though the documented behavior for malformed input is not explicitly defined), this is a gray area. But note: the test suite includes a case for `"claude-code/opus:bogus"` expecting 0 thinking, which follows the same logic. So treating non-conforming suffixes as off is intentional and tested. Therefore, the behavior for `claude-code/opus:max:extra` would be the same as for `claude-code/opus:bogus` - thinking off. This is consistent with the treatment of unrecognized suffixes. Since the spec does not define behavior for multiple colons, and the implementation treats it as an unrecognized suffix (hence off), it is not incorrect per the implemented design. After reconsideration, no correctness issue is found. The implementation treats any suffix that is not exactly "max" (case-insensitive) or a positive integer as off, which is consistent and tested. The presence of extra colons makes the suffix not match those patterns, so it is correctly treated as off. Thus, the initial concern is unfounded. The code behaves as intended for all inputs. Final verification: the parsing logic and tests are consistent. No correctness issues found. Final verdict: No material issues found. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> No material issues found </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 0s</sub>

🪰 Gadfly review — qwen3-coder:480b-cloud (ollama-cloud)

Verdict: Minor issues — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

VERDICT: No material issues found

The changes introduce a new feature for specifying extended-thinking budgets for Claude Code models, with appropriate parsing and environment variable handling. Security-wise, the implementation correctly avoids passing unnecessary secrets to the subprocess by using a filtered environment (claudeEnv()), and the new MAX_THINKING_TOKENS functionality doesn't introduce any credential leakage or injection risks. The parsing logic for the new :<thinking> suffix is well-tested and properly integrated. No security vulnerabilities or significant concerns were identified in this diff.

🎯 Correctness — Minor issues

VERDICT: Minor issues

  • README.md:98 — The example configuration GADFLY_MODELS: "claude-code/sonnet,claude-code/opus,claude-code/opus:max" includes claude-code/opus:max, which is now fully supported and functional per the diff and code. The PR description's claim that this variant is "not enabled" contradicts the actual implementation. This finding should be dropped as it's based on an incorrect premise.

  • cmd/gadfly/engine.go:66 — The constant maxThinkingTokens = 31999 now includes a clear source citation in its comment: "Claude Code's high 'ultrathink' tier". The comment explains its purpose and behavior (best-effort, no-op if CLI doesn't honor it). This addresses the draft's concern about lacking documentation or derivation. The finding should be dropped.

🧯 Error handling & edge cases — No material issues found

No material issues found.

The changes introduce support for extended-thinking budgets via the :max or :<n> suffix in claude-code/<model>:<thinking> model specs. The implementation correctly parses these suffixes, maps them to MAX_THINKING_TOKENS environment variable values, and applies them to the subprocess execution. Error handling appears robust, with proper fallbacks and tests covering the new functionality. Edge cases like invalid suffixes are handled gracefully by defaulting to no extended thinking. The process group management for subprocess cancellation is also properly maintained.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 39s

<!-- gadfly-review:ollama:qwen3-coder:480b-cloud --> ### 🪰 Gadfly review — `qwen3-coder:480b-cloud` (ollama-cloud) **Verdict: Minor issues** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> VERDICT: No material issues found The changes introduce a new feature for specifying extended-thinking budgets for Claude Code models, with appropriate parsing and environment variable handling. Security-wise, the implementation correctly avoids passing unnecessary secrets to the subprocess by using a filtered environment (`claudeEnv()`), and the new `MAX_THINKING_TOKENS` functionality doesn't introduce any credential leakage or injection risks. The parsing logic for the new `:<thinking>` suffix is well-tested and properly integrated. No security vulnerabilities or significant concerns were identified in this diff. </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> VERDICT: Minor issues - **README.md:98** — The example configuration `GADFLY_MODELS: "claude-code/sonnet,claude-code/opus,claude-code/opus:max"` includes `claude-code/opus:max`, which is now fully supported and functional per the diff and code. The PR description's claim that this variant is "not enabled" contradicts the actual implementation. This finding should be dropped as it's based on an incorrect premise. - **cmd/gadfly/engine.go:66** — The constant `maxThinkingTokens = 31999` now includes a clear source citation in its comment: "Claude Code's high 'ultrathink' tier". The comment explains its purpose and behavior (best-effort, no-op if CLI doesn't honor it). This addresses the draft's concern about lacking documentation or derivation. The finding should be dropped. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> No material issues found. The changes introduce support for extended-thinking budgets via the `:max` or `:<n>` suffix in `claude-code/<model>:<thinking>` model specs. The implementation correctly parses these suffixes, maps them to `MAX_THINKING_TOKENS` environment variable values, and applies them to the subprocess execution. Error handling appears robust, with proper fallbacks and tests covering the new functionality. Edge cases like invalid suffixes are handled gracefully by defaulting to no extended thinking. The process group management for subprocess cancellation is also properly maintained. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 39s</sub>
steve added 1 commit 2026-06-27 22:39:03 +00:00
fix: fold in PR #5 review findings (thinking-env test + keep-list)
Build & push image / build-and-push (pull_request) Successful in 6s
bd24b0a423
The swarm reviewed PR #5 (8 reviewers; the telemetry fix from #4 is now
live, so only 13 findings vs 43 on the comparably-clean #4 — the fix
works). Folded in the two warranted ones:

- engine: keep MAX_THINKING_TOKENS in claudeEnv() so a globally-set value
  reaches the CLI too (not just the per-spec :max append). (minimax)
- test: TestRunPassInjectsThinkingTokens verifies runPass actually puts
  MAX_THINKING_TOKENS in the subprocess env (31999 for :max, unset for a
  plain spec) — the parse was tested, the injection wasn't. (minimax)

The MAX_THINKING_TOKENS-is-unverified concern (minimax, qwen) is the same
caveat already documented; left as-is. gofmt/vet/test -race green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
steve merged commit c342bdb905 into main 2026-06-27 22:39:14 +00:00
steve deleted branch ci/add-opus-reviewer 2026-06-27 22:39:14 +00:00
Sign in to join this conversation.
No Reviewers
No Label
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: steve/gadfly#5