ci: enable claude-code/opus:max (max-thinking) reviewer #6

Merged
steve merged 1 commits from ci/opus-max-thinking into main 2026-06-27 22:49:49 +00:00
Owner

Follow-up to #5 — the max-thinking opus you asked for.

Now that :sha-c342bdb (which has the :thinking spec parse from #5) is built, this adds claude-code/opus:max to the dogfood swarm and pins to that image. The Claude Code lineup is now sonnet + opus + opus-with-max-thinking, so the scoreboard can compare the same model at two thinking depths.

Max thinking is applied via MAX_THINKING_TOKENS=31999 on the CLI subprocess (best-effort — a no-op if the installed CLI build ignores it, in which case opus:max simply behaves like opus).

No code change — workflow only. This PR's own review will be the first to exercise claude-code/opus:max end-to-end. YAML validated.

🤖 Generated with Claude Code

Follow-up to #5 — the max-thinking opus you asked for. Now that `:sha-c342bdb` (which has the `:thinking` spec parse from #5) is built, this adds **`claude-code/opus:max`** to the dogfood swarm and pins to that image. The Claude Code lineup is now **sonnet + opus + opus-with-max-thinking**, so the scoreboard can compare the same model at two thinking depths. Max thinking is applied via `MAX_THINKING_TOKENS=31999` on the CLI subprocess (best-effort — a no-op if the installed CLI build ignores it, in which case `opus:max` simply behaves like `opus`). No code change — workflow only. This PR's own review will be the first to exercise `claude-code/opus:max` end-to-end. YAML validated. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
steve added 1 commit 2026-06-27 22:40:55 +00:00
ci: enable claude-code/opus:max (max-thinking) reviewer
Build & push image / build-and-push (pull_request) Successful in 7s
Adversarial Review (Gadfly) / review (pull_request) Successful in 7m1s
36f8ef0ec7
Follow-up to #5: now that :sha-c342bdb (which has the :thinking spec
parse) is built, add claude-code/opus:max to the dogfood swarm and pin to
that image. The Claude Code lineup is now sonnet + opus + opus-with-max-
thinking, so the scoreboard can compare the same model at two thinking
depths. Best-effort max thinking via MAX_THINKING_TOKENS (no-op if the
CLI build ignores it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

🪰 Gadfly — live review status

9/9 reviewers finished · updated 2026-06-27 22:47:56Z

claude-code/opus · claude-code — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

claude-code/opus:max · claude-code — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

claude-code/sonnet · claude-code — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

deepseek-v4-pro:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

glm-5.1:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

glm-5.2:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

minimax-m3:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

nemotron-3-super:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — Reviewed

qwen3-coder:480b-cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — No material issues found
  • error-handling — No material issues found

Live status board. Findings are posted in each model's own comment. Advisory only — does not block merge.

<!-- gadfly-status-board --> ## 🪰 Gadfly — live review status 9/9 reviewers finished · updated 2026-06-27 22:47:56Z #### `claude-code/opus` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `claude-code/opus:max` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `claude-code/sonnet` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `deepseek-v4-pro:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `glm-5.1:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `glm-5.2:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `minimax-m3:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found #### `nemotron-3-super:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — Reviewed #### `qwen3-coder:480b-cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **error-handling** — No material issues found <sub>Live status board. Findings are posted in each model's own comment. Advisory only — does not block merge.</sub>

🪰 Gadfly review — glm-5.2:cloud (ollama-cloud)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

VERDICT: No material issues found

  • The diff is workflow-only. From the security lens I verified the full input-to-execution path for the new claude-code/opus:max model spec:
    • GADFLY_MODELS is parsed by newClaudeCodeEngine (cmd/gadfly/engine.go:83-109). The opus:max token splits on / then :model="opus", thinking=parseThinking("max")=31999. No shell is involved; model is passed to the CLI purely as an exec.CommandContext argv element (--model, e.model at engine.go:129-130), so the spec string is not a command-injection vector.
    • MAX_THINKING_TOKENS=31999 is injected into the subprocess env only at engine.go:152-154, after claudeEnv() builds the minimal allowlist. I confirmed claudeEnv() (engine.go:221-241) deliberately drops GITEA_TOKEN, GADFLY_FINDINGS_TOKEN, and OLLAMA_CLOUD_API_KEY and only keeps CLAUDE_*/ANTHROPIC_*/GADFLY_CLAUDE_*/locale/Node/XDG plus MAX_THINKING_TOKENS. No new secret exposure is introduced by adding the :max lane; the OAuth token that is passed is the intended CLI auth.
    • Image pin moved from :sha-80d8f53 to :sha-c342bdb — both immutable sha tags from the same private registry; no downgrade to :latest and no new trust boundary.
  • No authn/authz change: the job-level if: actor allow-list (steve/fizi/dazed) and GADFLY_ALLOWED_USERS mirror are untouched, so the comment-trigger secret-bearing-run guard still applies to the new lane.
  • I checked the unit tests (cmd/gadfly/engine_test.go:221-281) which assert :max31999 and that the env is only set for :max, not plain specs — consistent with the security-relevant env handling above.

Nothing in my lane is materially wrong.

🎯 Correctness — No material issues found

I verified the full pipeline this PR touches: the workflow YAML, the engine spec parser (cmd/gadfly/engine.go), the entrypoint lane/concurrency logic (entrypoint.sh), and the per-model lens fan-out (cmd/gadfly/main.go). Tracing claude-code/opus:max end-to-end:

  • Spec parse (engine.go:83-121): strings.Cut(after, ":") → model=opus, parseThinking("max")maxThinkingTokens (31999). Correct.
  • 31999 value: this is Claude Code's ultrathink tier budget (think=4000 / megathink=10000 / ultrathink=31999). The constant matches the documented tier — not a plausible-looking magic number, the real one.
  • Env injection (engine.go:148-155): thinkingTokens > 0 → appends MAX_THINKING_TOKENS=31999. The workflow env block does not set MAX_THINKING_TOKENS, and claudeEnv() only keeps it if present in os.Environ(), so the subprocess receives exactly one MAX_THINKING_TOKENS=31999 entry — no duplicate. Correct for this configuration.
  • Lane routing (entrypoint.sh:159): provider_of "claude-code/opus:max"claude-code (first path segment). It joins the claude-code lane (cap 2), alongside sonnet and opus — 3 models, 2 at a time. Correct.
  • Status-board uniqueness (entrypoint.sh:165): status_file_for sanitizes via tr -c '[:alnum:]._-' '_', so claude-code/opus:maxclaude-code_opus_max, distinct from claude-code_opus. No collision on the live board.
  • Lens concurrency (main.go:281-286): GADFLY_PROVIDER_LENS_CONCURRENCY only overrides ollama-cloud; claude-code falls back to default 1, so each claude-code model runs its 3 lenses serially. 2 models at a time × 1 lens = 2 concurrent CLI subprocesses — within the lane's intent.

Verdict: No material issues found.

  • The one thing worth noting (not introduced by this diff, and not triggered by this workflow config): runPass appends MAX_THINKING_TOKENS=… after claudeEnv(), which already keeps MAX_THINKING_TOKENS if it exists in the parent env. If MAX_THINKING_TOKENS were ever exported into the runner env, the subprocess would get two entries (first wins in glibc getenv), silently overriding the :max budget with the parent value. This PR's config does not set it, so it's a latent engine edge case, not a defect of this change. Verified by reading engine.go:148-155 and engine.go:221-240.
🧯 Error handling & edge cases — No material issues found

I verified the YAML change against the code that consumes it. This is a workflow-only diff: it adds claude-code/opus:max to GADFLY_MODELS and bumps the pinned image tag (which already carries the :max/MAX_THINKING_TOKENS parsing logic from the prior PR).

Checked the edge cases in my lane:

  • :max spec parsing (cmd/gadfly/engine.go:83-121): strings.Cut on : correctly splits opus:max into model=opus, thinking=max; parseThinking("max") returns 31999. No panic/nil path on malformed suffix — unknown values fall through to 0 (no-op), which is exactly the "best-effort, degrades to opus" behavior described.
  • Env injection (engine.go:152-154): MAX_THINKING_TOKENS only appended when thinkingTokens > 0; empty/zero never leaks. No swallowed error here.
  • Provider lane (entrypoint.sh:159): provider_of splits on first /claude-code for claude-code/opus:max, so it lands in the claude-code=2 concurrency lane as intended. No empty-provider edge case.
  • Status file uniqueness (entrypoint.sh:165): status_file_for sanitizes :_, so claude-code/opus (claude-code_opus) and claude-code/opus:max (claude-code_opus_max) get distinct files — no collision that would make the live board conflate the two runs.
  • Run failure handling (entrypoint.sh:183): a failed model logs and continues; adding a third claude-code entry doesn't change that, and with cap=2 the third queues rather than erroring.

No ignored errors, missing cleanup, panics on bad input, or boundary issues introduced by this change.

VERDICT: No material issues found.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 1m 2s

<!-- gadfly-review:ollama:glm-5.2:cloud --> ### 🪰 Gadfly review — `glm-5.2:cloud` (ollama-cloud) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> VERDICT: No material issues found - The diff is workflow-only. From the security lens I verified the full input-to-execution path for the new `claude-code/opus:max` model spec: - `GADFLY_MODELS` is parsed by `newClaudeCodeEngine` (`cmd/gadfly/engine.go:83-109`). The `opus:max` token splits on `/` then `:` → `model="opus"`, `thinking=parseThinking("max")=31999`. No shell is involved; `model` is passed to the CLI purely as an `exec.CommandContext` argv element (`--model`, `e.model` at `engine.go:129-130`), so the spec string is not a command-injection vector. - `MAX_THINKING_TOKENS=31999` is injected into the subprocess env only at `engine.go:152-154`, after `claudeEnv()` builds the minimal allowlist. I confirmed `claudeEnv()` (`engine.go:221-241`) deliberately drops `GITEA_TOKEN`, `GADFLY_FINDINGS_TOKEN`, and `OLLAMA_CLOUD_API_KEY` and only keeps `CLAUDE_*`/`ANTHROPIC_*`/`GADFLY_CLAUDE_*`/locale/Node/XDG plus `MAX_THINKING_TOKENS`. No new secret exposure is introduced by adding the `:max` lane; the OAuth token that *is* passed is the intended CLI auth. - Image pin moved from `:sha-80d8f53` to `:sha-c342bdb` — both immutable sha tags from the same private registry; no downgrade to `:latest` and no new trust boundary. - No authn/authz change: the job-level `if:` actor allow-list (`steve`/`fizi`/`dazed`) and `GADFLY_ALLOWED_USERS` mirror are untouched, so the comment-trigger secret-bearing-run guard still applies to the new lane. - I checked the unit tests (`cmd/gadfly/engine_test.go:221-281`) which assert `:max` → `31999` and that the env is only set for `:max`, not plain specs — consistent with the security-relevant env handling above. Nothing in my lane is materially wrong. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> I verified the full pipeline this PR touches: the workflow YAML, the engine spec parser (`cmd/gadfly/engine.go`), the entrypoint lane/concurrency logic (`entrypoint.sh`), and the per-model lens fan-out (`cmd/gadfly/main.go`). Tracing `claude-code/opus:max` end-to-end: - **Spec parse** (`engine.go:83-121`): `strings.Cut(after, ":")` → model=`opus`, `parseThinking("max")` → `maxThinkingTokens` (31999). Correct. - **31999 value**: this is Claude Code's `ultrathink` tier budget (think=4000 / megathink=10000 / ultrathink=31999). The constant matches the documented tier — not a plausible-looking magic number, the real one. - **Env injection** (`engine.go:148-155`): `thinkingTokens > 0` → appends `MAX_THINKING_TOKENS=31999`. The workflow env block does **not** set `MAX_THINKING_TOKENS`, and `claudeEnv()` only keeps it if present in `os.Environ()`, so the subprocess receives exactly one `MAX_THINKING_TOKENS=31999` entry — no duplicate. Correct for this configuration. - **Lane routing** (`entrypoint.sh:159`): `provider_of "claude-code/opus:max"` → `claude-code` (first path segment). It joins the `claude-code` lane (cap 2), alongside `sonnet` and `opus` — 3 models, 2 at a time. Correct. - **Status-board uniqueness** (`entrypoint.sh:165`): `status_file_for` sanitizes via `tr -c '[:alnum:]._-' '_'`, so `claude-code/opus:max` → `claude-code_opus_max`, distinct from `claude-code_opus`. No collision on the live board. - **Lens concurrency** (`main.go:281-286`): `GADFLY_PROVIDER_LENS_CONCURRENCY` only overrides `ollama-cloud`; `claude-code` falls back to default 1, so each claude-code model runs its 3 lenses serially. 2 models at a time × 1 lens = 2 concurrent CLI subprocesses — within the lane's intent. **Verdict: No material issues found.** - The one thing worth noting (not introduced by this diff, and not triggered by this workflow config): `runPass` appends `MAX_THINKING_TOKENS=…` *after* `claudeEnv()`, which already keeps `MAX_THINKING_TOKENS` if it exists in the parent env. If `MAX_THINKING_TOKENS` were ever exported into the runner env, the subprocess would get two entries (first wins in glibc `getenv`), silently overriding the `:max` budget with the parent value. This PR's config does not set it, so it's a latent engine edge case, not a defect of this change. Verified by reading `engine.go:148-155` and `engine.go:221-240`. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> I verified the YAML change against the code that consumes it. This is a workflow-only diff: it adds `claude-code/opus:max` to `GADFLY_MODELS` and bumps the pinned image tag (which already carries the `:max`/`MAX_THINKING_TOKENS` parsing logic from the prior PR). Checked the edge cases in my lane: - **`:max` spec parsing** (`cmd/gadfly/engine.go:83-121`): `strings.Cut` on `:` correctly splits `opus:max` into model=`opus`, thinking=`max`; `parseThinking("max")` returns `31999`. No panic/nil path on malformed suffix — unknown values fall through to `0` (no-op), which is exactly the "best-effort, degrades to opus" behavior described. - **Env injection** (`engine.go:152-154`): `MAX_THINKING_TOKENS` only appended when `thinkingTokens > 0`; empty/zero never leaks. No swallowed error here. - **Provider lane** (`entrypoint.sh:159`): `provider_of` splits on first `/` → `claude-code` for `claude-code/opus:max`, so it lands in the `claude-code=2` concurrency lane as intended. No empty-provider edge case. - **Status file uniqueness** (`entrypoint.sh:165`): `status_file_for` sanitizes `:`→`_`, so `claude-code/opus` (`claude-code_opus`) and `claude-code/opus:max` (`claude-code_opus_max`) get distinct files — no collision that would make the live board conflate the two runs. - **Run failure handling** (`entrypoint.sh:183`): a failed model logs and continues; adding a third claude-code entry doesn't change that, and with cap=2 the third queues rather than erroring. No ignored errors, missing cleanup, panics on bad input, or boundary issues introduced by this change. VERDICT: No material issues found. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 1m 2s</sub>

🪰 Gadfly review — glm-5.1:cloud (ollama-cloud)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

VERDICT: No material issues found

Reviewed the diff through the security lens:

  • Secret isolation intact: The claudeCodeEngine subprocess gets its env from claudeEnv() (engine.go:221), which whitelists only PATH/HOME/… and CLAUDE_*/ANTHROPIC_*/GADFLY_CLAUDE_* prefixes — deliberately dropping GITEA_TOKEN, GADFLY_FINDINGS_TOKEN, and provider keys. The new MAX_THINKING_TOKENS value is appended after that whitelist (engine.go:154), never from the runner's env, so no secret leakage path is introduced.

  • No injection surface: The :max suffix is parsed by parseThinking into a compile-time constant (31999, engine.go:66), then strconv.Itoa'd into an env var. There is no user-controlled string that reaches a shell, CLI flag, or template.

  • Permission boundary unchanged: The Claude CLI still runs under --permission-mode plan (read-only sandbox, engine.go:104). The :max spec only affects a thinking-budget env var — it does not grant additional capabilities to the subprocess.

  • Image pinning: Both old and new tags use immutable :sha- digests, not :latest, preventing supply-chain substitution between runs.

  • CLAUDE_CODE_OAUTH_TOKEN remains a GitHub secret referenced via ${{ secrets.* }}, never echoed or logged.

🎯 Correctness — No material issues found

VERDICT: No material issues found

The diff adds claude-code/opus:max to the GADFLY_MODELS list and pins the Docker image to sha-c342bdb (which includes the :thinking spec parsing from the prior PR). I verified the following correctness concerns:

  • :max suffix parsing: cmd/gadfly/engine.go:78-121parseThinking("max") correctly returns maxThinkingTokens (31999), and newClaudeCodeEngine("claude-code/opus:max", ...) correctly splits into model="opus", thinkingTokens=31999. Tests at engine_test.go:221-278 confirm this end-to-end, including that MAX_THINKING_TOKENS=31999 is injected into the subprocess env for :max specs.

  • Provider routing: modelProvider() (model.go:142-151) uses strings.Cut(model, "/") to extract the provider lane, yielding "claude-code" from "claude-code/opus:max" — matching the claude-code=2 concurrency key. Correct.

  • Concurrency: With 3 claude-code models (sonnet, opus, opus:max) and claude-code=2 concurrency, two run at a time and the third queues. No overflow or misconfiguration.

  • The 31999 value: Intentionally just under a 32k boundary; consistent across code (const maxThinkingTokens = 31999), tests (MTT=31999), and PR description. Internally consistent.

No logic bugs or semantic correctness issues introduced.

🧯 Error handling & edge cases — No material issues found

VERDICT: No material issues found

I reviewed the diff through the error-handling & edge-case lens. The change adds claude-code/opus:max to the model fleet and pins the image to sha-c342bdb. I verified the full parsing and execution path:

  • newClaudeCodeEngine("claude-code/opus:max", …): strings.Cut(after, ":") correctly splits into model=opus, thinking suffix=max. parseThinking("max") returns maxThinkingTokens (31999). Verified in cmd/gadfly/engine.go:83–121 and confirmed by TestClaudeCodeThinking in engine_test.go.

  • runPass env injection: MAX_THINKING_TOKENS=31999 is appended to the subprocess env only when thinkingTokens > 0. claudeEnv() explicitly keeps MAX_THINKING_TOKENS if already in the parent env (line 224), and since the append happens after claudeEnv(), the value is always present for :max specs. No env leak path.

  • Graceful degradation: If the CLI build ignores MAX_THINKING_TOKENS, the env var is simply unused — the subprocess runs normally. This is the documented "no-op" behavior. No error is swallowed.

  • Entrypoint model normalization: provider_of("claude-code/opus:max") extracts claude-code (first /-segment), matching the claude-code=2 concurrency lane correctly. The : in the spec becomes _ in the status filename via tr -c '[:alnum:]._-' '_' — unique and safe.

  • Negative/invalid thinking values: parseThinking returns 0 for any non-integer, non-"max" string, and for n ≤ 0 from strconv.Atoi. No overflow or injection risk.

  • Process cleanup on timeout/cancel: cmd.Cancel kills the entire process group with SIGKILL. The cancelled-context check returns an error, never partial output. Verified in engine.go:148–176.

  • Failed model runs: entrypoint.sh:183 logs the failure and continues (|| log "model $1 failed (continuing)"), so one model's crash doesn't abort the swarm.

No unhandled errors, missing defers, nil-unsafety, or edge-case gaps found in this change.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 59s

<!-- gadfly-review:ollama:glm-5.1:cloud --> ### 🪰 Gadfly review — `glm-5.1:cloud` (ollama-cloud) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> **VERDICT: No material issues found** Reviewed the diff through the security lens: - **Secret isolation intact**: The `claudeCodeEngine` subprocess gets its env from `claudeEnv()` (engine.go:221), which whitelists only `PATH/HOME/…` and `CLAUDE_*`/`ANTHROPIC_*`/`GADFLY_CLAUDE_*` prefixes — deliberately dropping `GITEA_TOKEN`, `GADFLY_FINDINGS_TOKEN`, and provider keys. The new `MAX_THINKING_TOKENS` value is appended *after* that whitelist (engine.go:154), never from the runner's env, so no secret leakage path is introduced. - **No injection surface**: The `:max` suffix is parsed by `parseThinking` into a compile-time constant (`31999`, engine.go:66), then `strconv.Itoa`'d into an env var. There is no user-controlled string that reaches a shell, CLI flag, or template. - **Permission boundary unchanged**: The Claude CLI still runs under `--permission-mode plan` (read-only sandbox, engine.go:104). The `:max` spec only affects a thinking-budget env var — it does not grant additional capabilities to the subprocess. - **Image pinning**: Both old and new tags use immutable `:sha-` digests, not `:latest`, preventing supply-chain substitution between runs. - **CLAUDE_CODE_OAUTH_TOKEN** remains a GitHub secret referenced via `${{ secrets.* }}`, never echoed or logged. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> **VERDICT: No material issues found** The diff adds `claude-code/opus:max` to the `GADFLY_MODELS` list and pins the Docker image to `sha-c342bdb` (which includes the `:thinking` spec parsing from the prior PR). I verified the following correctness concerns: - **`:max` suffix parsing**: `cmd/gadfly/engine.go:78-121` — `parseThinking("max")` correctly returns `maxThinkingTokens` (31999), and `newClaudeCodeEngine("claude-code/opus:max", ...)` correctly splits into `model="opus"`, `thinkingTokens=31999`. Tests at `engine_test.go:221-278` confirm this end-to-end, including that `MAX_THINKING_TOKENS=31999` is injected into the subprocess env for `:max` specs. - **Provider routing**: `modelProvider()` (`model.go:142-151`) uses `strings.Cut(model, "/")` to extract the provider lane, yielding `"claude-code"` from `"claude-code/opus:max"` — matching the `claude-code=2` concurrency key. Correct. - **Concurrency**: With 3 claude-code models (sonnet, opus, opus:max) and `claude-code=2` concurrency, two run at a time and the third queues. No overflow or misconfiguration. - **The 31999 value**: Intentionally just under a 32k boundary; consistent across code (`const maxThinkingTokens = 31999`), tests (`MTT=31999`), and PR description. Internally consistent. No logic bugs or semantic correctness issues introduced. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> **VERDICT: No material issues found** I reviewed the diff through the error-handling & edge-case lens. The change adds `claude-code/opus:max` to the model fleet and pins the image to `sha-c342bdb`. I verified the full parsing and execution path: - **`newClaudeCodeEngine("claude-code/opus:max", …)`**: `strings.Cut(after, ":")` correctly splits into model=`opus`, thinking suffix=`max`. `parseThinking("max")` returns `maxThinkingTokens` (31999). Verified in `cmd/gadfly/engine.go:83–121` and confirmed by `TestClaudeCodeThinking` in `engine_test.go`. - **`runPass` env injection**: `MAX_THINKING_TOKENS=31999` is appended to the subprocess env only when `thinkingTokens > 0`. `claudeEnv()` explicitly keeps `MAX_THINKING_TOKENS` if already in the parent env (line 224), and since the append happens *after* `claudeEnv()`, the value is always present for `:max` specs. No env leak path. - **Graceful degradation**: If the CLI build ignores `MAX_THINKING_TOKENS`, the env var is simply unused — the subprocess runs normally. This is the documented "no-op" behavior. No error is swallowed. - **Entrypoint model normalization**: `provider_of("claude-code/opus:max")` extracts `claude-code` (first `/`-segment), matching the `claude-code=2` concurrency lane correctly. The `:` in the spec becomes `_` in the status filename via `tr -c '[:alnum:]._-' '_'` — unique and safe. - **Negative/invalid thinking values**: `parseThinking` returns 0 for any non-integer, non-"max" string, and for `n ≤ 0` from `strconv.Atoi`. No overflow or injection risk. - **Process cleanup on timeout/cancel**: `cmd.Cancel` kills the entire process group with `SIGKILL`. The cancelled-context check returns an error, never partial output. Verified in `engine.go:148–176`. - **Failed model runs**: `entrypoint.sh:183` logs the failure and continues (`|| log "model $1 failed (continuing)"`), so one model's crash doesn't abort the swarm. No unhandled errors, missing defers, nil-unsafety, or edge-case gaps found in this change. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 59s</sub>

🪰 Gadfly review — claude-code/sonnet (claude-code)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

No material issues found

  • Image SHA pin (sha-c342bdb): Pinning to an immutable SHA is the right security practice here. The CLAUDE.md explicitly warns that :latest is cached by act_runner and may silently run a stale image; a SHA pin avoids that and provides integrity.

  • claude-code/opus:max model name handling (verified via cmd/gadfly/engine.go): The :max suffix is parsed by parseThinking() and converted to the integer 31999 via strconv.Atoi — never shell-interpolated. The model name itself is passed as a separate argv element to exec.CommandContext, not concatenated into a shell string, so there is no command-injection surface.

  • MAX_THINKING_TOKENS env var (verified via engine.go:152-155): Set on the subprocess via cmd.Env = append(…, "MAX_THINKING_TOKENS="+strconv.Itoa(e.thinkingTokens)). Value originates from the hardcoded constant 31999 in engine.go, not from any untrusted input.

  • Secret exposure to subprocess (verified via engine.go's claudeEnv() allowlist): GITEA_TOKEN, GADFLY_FINDINGS_TOKEN, and other provider credentials are explicitly excluded from the Claude Code subprocess environment. CLAUDE_CODE_OAUTH_TOKEN reaches the subprocess via the CLAUDE_* wildcard — this is intentional and unchanged from the existing sonnet/opus runners.

  • GADFLY_MODELS value: Hardcoded in the workflow YAML, not derived from PR content or any user-controlled input, so no injection vector from that surface.

Nothing in this diff opens a new security surface relative to the existing claude-code/sonnet and claude-code/opus entries that are already present.

🎯 Correctness — No material issues found

Based on my investigation, I've verified the key correctness properties of this change.


Verdict: No material issues found

What I verified:

  • claude-code/opus:max parses correctly. engine.go:83–109 (newClaudeCodeEngine) splits on the first / to extract the provider (claude-code) then splits the remainder on : to separate the model id (opus) from the thinking tier (max). parseThinking("max") maps that to maxThinkingTokens = 31999 and injects it as MAX_THINKING_TOKENS into the subprocess env. The spec is unambiguous because Claude model IDs contain no colons. Verified by reading engine.go.

  • MAX_THINKING_TOKENS=31999. This is a pre-existing constant from PR #5, not introduced here. The PR description correctly characterises it as best-effort: if the CLI subprocess ignores the env var, opus:max degrades to plain opus. Verified by reading engine.go:66.

  • Concurrency with three claude-code models is fine. claude-code=2 means the third model queues behind whichever of the first two finishes; all three still run. The cap was already set before this PR and only controls peak parallelism, not total execution. Verified by reading entrypoint.sh.

  • Fleet comment is accurate. The updated comment ("6 cloud (3 at a time) + Claude Code (sonnet, opus, and opus with max extended thinking)") correctly describes the new model list of 6 ollama-cloud models + 3 claude-code models. Verified against the GADFLY_MODELS value in the diff.

🧯 Error handling & edge cases — No material issues found

No material issues found

The diff touches only workflow YAML: a new image tag and appending claude-code/opus:max to GADFLY_MODELS. I verified the error-handling paths through the actual code.

  • :max suffix parsingparseThinking() in engine.go:111-121 correctly maps "max" → 31999 tokens. Invalid suffixes degrade silently to 0 (thinking disabled), which is tested and intentional (engine_test.go:231). The new spec is well-formed and hits the happy path.

  • MAX_THINKING_TOKENS injectionengine.go:148-155 guards injection behind thinkingTokens > 0, so it's only set for :max. The PR description's "best-effort / no-op if CLI ignores it" matches the code exactly — the subprocess treats the env var as optional. No error path here.

  • Concurrency: claude-code=2 with 3 claude-code modelsrun_lane (entrypoint.sh:206-218) handles this correctly. With cap=2 and 3 models, model 3 queues: it's launched only after wait -n frees a slot. Each lane runs in its own subshell (line 250), so wait -n inside the claude-code lane cannot accidentally reap processes from the ollama-cloud lane. The wait -n 2>/dev/null || wait fallback degrades gracefully to sequential execution if wait -n is unavailable. No overflow, no starvation, no missing wait.

  • Status board pre-seedingentrypoint.sh:233-236 pre-seeds all models in MODEL_LIST at startup (including the new third claude-code entry), so the board correctly tracks all 3 from t=0, even while the 3rd is queued.

No new error paths are introduced. All relevant edge cases are handled by existing, tested code.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5m 32s

<!-- gadfly-review:ollama:claude-code/sonnet --> ### 🪰 Gadfly review — `claude-code/sonnet` (claude-code) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> **No material issues found** - **Image SHA pin** (`sha-c342bdb`): Pinning to an immutable SHA is the right security practice here. The CLAUDE.md explicitly warns that `:latest` is cached by act_runner and may silently run a stale image; a SHA pin avoids that and provides integrity. - **`claude-code/opus:max` model name handling** (verified via `cmd/gadfly/engine.go`): The `:max` suffix is parsed by `parseThinking()` and converted to the integer `31999` via `strconv.Atoi` — never shell-interpolated. The model name itself is passed as a separate argv element to `exec.CommandContext`, not concatenated into a shell string, so there is no command-injection surface. - **`MAX_THINKING_TOKENS` env var** (verified via `engine.go:152-155`): Set on the subprocess via `cmd.Env = append(…, "MAX_THINKING_TOKENS="+strconv.Itoa(e.thinkingTokens))`. Value originates from the hardcoded constant `31999` in engine.go, not from any untrusted input. - **Secret exposure to subprocess** (verified via `engine.go`'s `claudeEnv()` allowlist): `GITEA_TOKEN`, `GADFLY_FINDINGS_TOKEN`, and other provider credentials are explicitly excluded from the Claude Code subprocess environment. `CLAUDE_CODE_OAUTH_TOKEN` reaches the subprocess via the `CLAUDE_*` wildcard — this is intentional and unchanged from the existing sonnet/opus runners. - **`GADFLY_MODELS` value**: Hardcoded in the workflow YAML, not derived from PR content or any user-controlled input, so no injection vector from that surface. Nothing in this diff opens a new security surface relative to the existing `claude-code/sonnet` and `claude-code/opus` entries that are already present. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> Based on my investigation, I've verified the key correctness properties of this change. --- **Verdict: No material issues found** **What I verified:** - **`claude-code/opus:max` parses correctly.** `engine.go:83–109` (`newClaudeCodeEngine`) splits on the first `/` to extract the provider (`claude-code`) then splits the remainder on `:` to separate the model id (`opus`) from the thinking tier (`max`). `parseThinking("max")` maps that to `maxThinkingTokens = 31999` and injects it as `MAX_THINKING_TOKENS` into the subprocess env. The spec is unambiguous because Claude model IDs contain no colons. *Verified by reading `engine.go`.* - **`MAX_THINKING_TOKENS=31999`.** This is a pre-existing constant from PR #5, not introduced here. The PR description correctly characterises it as best-effort: if the CLI subprocess ignores the env var, `opus:max` degrades to plain `opus`. *Verified by reading `engine.go:66`.* - **Concurrency with three claude-code models is fine.** `claude-code=2` means the third model queues behind whichever of the first two finishes; all three still run. The cap was already set before this PR and only controls peak parallelism, not total execution. *Verified by reading `entrypoint.sh`.* - **Fleet comment is accurate.** The updated comment ("6 cloud (3 at a time) + Claude Code (sonnet, opus, and opus with max extended thinking)") correctly describes the new model list of 6 ollama-cloud models + 3 claude-code models. *Verified against the `GADFLY_MODELS` value in the diff.* </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> **No material issues found** The diff touches only workflow YAML: a new image tag and appending `claude-code/opus:max` to `GADFLY_MODELS`. I verified the error-handling paths through the actual code. - **`:max` suffix parsing** — `parseThinking()` in `engine.go:111-121` correctly maps `"max"` → 31999 tokens. Invalid suffixes degrade silently to 0 (thinking disabled), which is tested and intentional (`engine_test.go:231`). The new spec is well-formed and hits the happy path. - **`MAX_THINKING_TOKENS` injection** — `engine.go:148-155` guards injection behind `thinkingTokens > 0`, so it's only set for `:max`. The PR description's "best-effort / no-op if CLI ignores it" matches the code exactly — the subprocess treats the env var as optional. No error path here. - **Concurrency: `claude-code=2` with 3 claude-code models** — `run_lane` (`entrypoint.sh:206-218`) handles this correctly. With `cap=2` and 3 models, model 3 queues: it's launched only after `wait -n` frees a slot. Each lane runs in its own subshell (line 250), so `wait -n` inside the claude-code lane cannot accidentally reap processes from the ollama-cloud lane. The `wait -n 2>/dev/null || wait` fallback degrades gracefully to sequential execution if `wait -n` is unavailable. No overflow, no starvation, no missing `wait`. - **Status board pre-seeding** — `entrypoint.sh:233-236` pre-seeds *all* models in `MODEL_LIST` at startup (including the new third claude-code entry), so the board correctly tracks all 3 from t=0, even while the 3rd is queued. No new error paths are introduced. All relevant edge cases are handled by existing, tested code. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5m 32s</sub>

🪰 Gadfly review — minimax-m3:cloud (ollama-cloud)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

🔒 Security — No material issues found.

Verified the change against the security lens:

  • Image bump sha-80d8f53 → sha-c342bdb: same pinned-SHA pattern, same internal registry, same permissions: contents: read / issues: write / pull-requests: write block, same secrets. No new trust boundary.
  • claude-code/opus:max model spec (adversarial-review.yml:63): parsed by newClaudeCodeEngine in cmd/gadfly/engine.go:83-109. Traced through:
    • provider_of("claude-code/opus:max") → provider claude-code (read via ${1%%/*} at entrypoint.sh:159). No path/colon ambiguity.
    • strings.Cut("opus:max", ":") → model="opus", thinking=parseThinking("max")=31999 (engine.go:113-121). The literal "max" never reaches the model — only an integer token budget does, so no spec-string injection vector.
    • CLI invocation: claude -p ... --model opus --permission-mode plan + MAX_THINKING_TOKENS=31999. --permission-mode plan (read-only) is unchanged (engine.go:104); no write tools granted to the new reviewer.
  • Subprocess env sanitization (engine.go:221-241, claudeEnv): GITEA_TOKEN, GADFLY_FINDINGS_TOKEN, OLLAMA_CLOUD_API_KEY, and other runner secrets are still dropped. The keep-list (PATH, HOME, *CLAUDE_*, *ANTHROPIC_*, *GADFLY_CLAUDE_*, *NODE_*, *XDG_*, …) is unchanged. MAX_THINKING_TOKENS is allowed through (already on the keep list) AND force-appended with strconv.Itoa(e.thinkingTokens) (engine.go:152-155), so a leaked-from-runner value is overridden deterministically — no env-injection risk.
  • Auth: the new spec reuses the existing CLAUDE_CODE_OAUTH_TOKEN; no new secret, no new credential surface.
  • Trigger gating: the issue_comment if: allow-list (steve|fizi|dazed) is unchanged. Untrusted commenters still can't fire a secret-bearing run.
  • Concurrency cap (claude-code=2): bounds in-flight subprocesses; no DoS surface introduced.

The :max spec only adjusts an integer thinking-budget env var on a read-only Claude Code subprocess under the same minimal env as the existing claude-code/opus. Nothing in my lens is materially wrong.

🎯 Correctness — No material issues found

VERDICT: No material issues found

Reviewed through the Correctness lens. I verified the diff end-to-end against the checked-out code (cmd/gadfly/engine.go, cmd/gadfly/engine_test.go, cmd/gadfly/main.go, cmd/gadfly/lens_concurrency_test.go, cmd/gadfly/model.go, entrypoint.sh).

What I confirmed works correctly:

  • :max spec parsingengine.go:83-109 cuts on / then :; claude-code/opus:maxmodel="opus", thinkingTokens=31999. parseThinking("max") at engine.go:113-121 maps to the maxThinkingTokens const (31999), a positive int to itself, anything else to 0 (off). Covered by TestClaudeCodeThinking (engine_test.go:221-241) and TestRunPassInjectsThinkingTokens (engine_test.go:259-281).
  • Subprocess env injectionengine.go:152-155 appends MAX_THINKING_TOKENS=31999 to the CLI subprocess only when thinkingTokens>0. The CLI env allowlist at engine.go:224 explicitly keeps MAX_THINKING_TOKENS, so it survives claudeEnv()'s strict filtering.
  • Provider lane resolutionentrypoint.sh:159's provider_of extracts claude-code as the lane for claude-code/opus:max (matches on */*, takes ${1%%/*}).
  • Concurrency mathGADFLY_PROVIDER_CONCURRENCY: "ollama-cloud=3,claude-code=2" means the three claude-code entries (sonnet, opus, opus:max) cap at 2 in-flight; lenses default to serial (1) inside the claude-code lane since it's absent from GADFLY_PROVIDER_LENS_CONCURRENCY. Peak: 2 claude subprocesses + 9 ollama-cloud queries ≤ the 10-comment budget called out at adversarial-review.yml:66. Job-level timeout-minutes: 90 (5400 s) comfortably covers worst-case serialization.
  • Image bump:sha-c342bdb is asserted (in the PR description) to carry the :thinking spec parse from #5; the constant maxThinkingTokens = 31999 is defined in that same engine.go file and is what gets injected — verified end-to-end by TestRunPassInjectsThinkingTokens.
  • Fallback honesty:max on an older CLI that ignores MAX_THINKING_TOKENS simply degrades to plain claude-code/opus, as documented. No silent corruption.

The :max magic number 31999 is the Claude Code CLI's documented "ultrathink" ceiling, not a re-derivable constant — it's the vendor's value, applied as a best-effort env override. No bug.

No logic errors, no domain-incorrect constants, no behavioral regressions introduced by this PR through the correctness lens.

🧯 Error handling & edge cases — No material issues found

Nothing in my lens specifically. The diff is a YAML-only change to a Gitea workflow that adds a new GADFLY_MODELS entry. The engine code clearly supports the :max spec (mapped to 31999), the env sanitization preserves MAX_THINKING_TOKENS, and timeout/concurrency envelopes still fit.

VERDICT: No material issues found

Nothing in my error-handling / edge-cases lens is materially wrong with this diff:

  • :max parsing: Verified at cmd/gadfly/engine.go:111-121 (parseThinking) and the constant at cmd/gadfly/engine.go:66 (maxThinkingTokens = 31999) — exactly the value the PR description claims gets injected.
  • Env preservation: claudeEnv() at cmd/gadfly/engine.go:224 explicitly keeps MAX_THINKING_TOKENS through its allow-list, so the subprocess will see it (and not have it silently dropped).
  • Unrecognized suffix fallback: parseThinking returns 0 for any garbage (typos like claude-code/opus:Maxx or negative ints), which claudeCodeEngine.runPass then treats as "no thinking override" — cmd.Env simply doesn't get MAX_THINKING_TOKENS appended (engine.go:152 guards on > 0). This is a safe degradation, not a swallowed error, and the PR's own spec is correct.
  • Concurrency / timeout envelope: claude-code=2 means sonnet + opus + opus:max (3 entries) queues 1 — well within GADFLY_TIMEOUT_SECS=600 × 3 lenses and the 90-minute job ceiling. No overflow concern introduced by adding the third claude-code lane.
  • Process-group cleanup: Already handled at engine.go:159-165 (Setpgid + group SIGKILL on cancel), so a long-running :max run that gets timed out can't orphan Node children. Pre-existing, not regressed by this diff.

The "best-effort no-op if the installed CLI build ignores it" caveat in the PR description is honest: engine.go:65 calls it out, and the claude-code/opus lane has always had this posture for MAX_THINKING_TOKENS.

No findings to report in this lane.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 1m 22s

<!-- gadfly-review:ollama:minimax-m3:cloud --> ### 🪰 Gadfly review — `minimax-m3:cloud` (ollama-cloud) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> 🔒 **Security — No material issues found.** Verified the change against the security lens: - **Image bump** `sha-80d8f53 → sha-c342bdb`: same pinned-SHA pattern, same internal registry, same `permissions: contents: read / issues: write / pull-requests: write` block, same secrets. No new trust boundary. - **`claude-code/opus:max` model spec** (`adversarial-review.yml:63`): parsed by `newClaudeCodeEngine` in `cmd/gadfly/engine.go:83-109`. Traced through: - `provider_of("claude-code/opus:max")` → provider `claude-code` (read via `${1%%/*}` at `entrypoint.sh:159`). No path/colon ambiguity. - `strings.Cut("opus:max", ":")` → model="opus", thinking=`parseThinking("max")`=31999 (`engine.go:113-121`). The literal `"max"` never reaches the model — only an integer token budget does, so no spec-string injection vector. - CLI invocation: `claude -p ... --model opus --permission-mode plan` + `MAX_THINKING_TOKENS=31999`. `--permission-mode plan` (read-only) is unchanged (`engine.go:104`); no write tools granted to the new reviewer. - **Subprocess env sanitization** (`engine.go:221-241`, `claudeEnv`): `GITEA_TOKEN`, `GADFLY_FINDINGS_TOKEN`, `OLLAMA_CLOUD_API_KEY`, and other runner secrets are still dropped. The keep-list (`PATH`, `HOME`, `*CLAUDE_*`, `*ANTHROPIC_*`, `*GADFLY_CLAUDE_*`, `*NODE_*`, `*XDG_*`, …) is unchanged. `MAX_THINKING_TOKENS` is allowed through (already on the keep list) AND force-appended with `strconv.Itoa(e.thinkingTokens)` (`engine.go:152-155`), so a leaked-from-runner value is overridden deterministically — no env-injection risk. - **Auth**: the new spec reuses the existing `CLAUDE_CODE_OAUTH_TOKEN`; no new secret, no new credential surface. - **Trigger gating**: the `issue_comment` `if:` allow-list (`steve|fizi|dazed`) is unchanged. Untrusted commenters still can't fire a secret-bearing run. - **Concurrency cap** (`claude-code=2`): bounds in-flight subprocesses; no DoS surface introduced. The `:max` spec only adjusts an integer thinking-budget env var on a read-only Claude Code subprocess under the same minimal env as the existing `claude-code/opus`. Nothing in my lens is materially wrong. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> ## VERDICT: No material issues found Reviewed through the **Correctness** lens. I verified the diff end-to-end against the checked-out code (`cmd/gadfly/engine.go`, `cmd/gadfly/engine_test.go`, `cmd/gadfly/main.go`, `cmd/gadfly/lens_concurrency_test.go`, `cmd/gadfly/model.go`, `entrypoint.sh`). What I confirmed works correctly: - **`:max` spec parsing** — `engine.go:83-109` cuts on `/` then `:`; `claude-code/opus:max` → `model="opus"`, `thinkingTokens=31999`. `parseThinking("max")` at `engine.go:113-121` maps to the `maxThinkingTokens` const (31999), a positive int to itself, anything else to 0 (off). Covered by `TestClaudeCodeThinking` (`engine_test.go:221-241`) and `TestRunPassInjectsThinkingTokens` (`engine_test.go:259-281`). - **Subprocess env injection** — `engine.go:152-155` appends `MAX_THINKING_TOKENS=31999` to the CLI subprocess only when `thinkingTokens>0`. The CLI env allowlist at `engine.go:224` explicitly keeps `MAX_THINKING_TOKENS`, so it survives `claudeEnv()`'s strict filtering. - **Provider lane resolution** — `entrypoint.sh:159`'s `provider_of` extracts `claude-code` as the lane for `claude-code/opus:max` (matches on `*/*`, takes `${1%%/*}`). - **Concurrency math** — `GADFLY_PROVIDER_CONCURRENCY: "ollama-cloud=3,claude-code=2"` means the three claude-code entries (sonnet, opus, opus:max) cap at 2 in-flight; lenses default to serial (1) inside the claude-code lane since it's absent from `GADFLY_PROVIDER_LENS_CONCURRENCY`. Peak: 2 claude subprocesses + 9 ollama-cloud queries ≤ the 10-comment budget called out at `adversarial-review.yml:66`. Job-level `timeout-minutes: 90` (5400 s) comfortably covers worst-case serialization. - **Image bump** — `:sha-c342bdb` is asserted (in the PR description) to carry the `:thinking` spec parse from #5; the constant `maxThinkingTokens = 31999` is defined in that same engine.go file and is what gets injected — verified end-to-end by `TestRunPassInjectsThinkingTokens`. - **Fallback honesty** — `:max` on an older CLI that ignores `MAX_THINKING_TOKENS` simply degrades to plain `claude-code/opus`, as documented. No silent corruption. The `:max` magic number 31999 is the Claude Code CLI's documented "ultrathink" ceiling, not a re-derivable constant — it's the vendor's value, applied as a best-effort env override. No bug. No logic errors, no domain-incorrect constants, no behavioral regressions introduced by this PR through the correctness lens. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> Nothing in my lens specifically. The diff is a YAML-only change to a Gitea workflow that adds a new GADFLY_MODELS entry. The engine code clearly supports the `:max` spec (mapped to `31999`), the env sanitization preserves `MAX_THINKING_TOKENS`, and timeout/concurrency envelopes still fit. VERDICT: No material issues found Nothing in my error-handling / edge-cases lens is materially wrong with this diff: - **`:max` parsing**: Verified at `cmd/gadfly/engine.go:111-121` (`parseThinking`) and the constant at `cmd/gadfly/engine.go:66` (`maxThinkingTokens = 31999`) — exactly the value the PR description claims gets injected. - **Env preservation**: `claudeEnv()` at `cmd/gadfly/engine.go:224` explicitly keeps `MAX_THINKING_TOKENS` through its allow-list, so the subprocess will see it (and not have it silently dropped). - **Unrecognized suffix fallback**: `parseThinking` returns `0` for any garbage (typos like `claude-code/opus:Maxx` or negative ints), which `claudeCodeEngine.runPass` then treats as "no thinking override" — `cmd.Env` simply doesn't get `MAX_THINKING_TOKENS` appended (`engine.go:152` guards on `> 0`). This is a safe degradation, not a swallowed error, and the PR's own spec is correct. - **Concurrency / timeout envelope**: `claude-code=2` means sonnet + opus + opus:max (3 entries) queues 1 — well within `GADFLY_TIMEOUT_SECS=600` × 3 lenses and the 90-minute job ceiling. No overflow concern introduced by adding the third claude-code lane. - **Process-group cleanup**: Already handled at `engine.go:159-165` (`Setpgid` + group `SIGKILL` on cancel), so a long-running `:max` run that gets timed out can't orphan Node children. Pre-existing, not regressed by this diff. The "best-effort no-op if the installed CLI build ignores it" caveat in the PR description is honest: `engine.go:65` calls it out, and the `claude-code/opus` lane has always had this posture for `MAX_THINKING_TOKENS`. No findings to report in this lane. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 1m 22s</sub>

🪰 Gadfly review — claude-code/opus (claude-code)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

I've verified the relevant code path. My review is complete.

VERDICT: No material issues found

This PR is a workflow-config-only change: it bumps the dogfood image tag (sha-80d8f53sha-c342bdb), appends claude-code/opus:max to GADFLY_MODELS, and edits comments. I traced the security-relevant path for the new :max spec:

  • No injection from the model spec. claude-code/opus:max flows into newClaudeCodeEngineparseThinking("max")maxThinkingTokens (31999), which is applied as an env var (engine.go:152-154), and the CLI is invoked via exec.CommandContext with an argv slice (engine.go:127-139,149) — no shell, so no command-injection surface. The constant 31999 is hardcoded, not derived from the spec. Verified by reading engine.go.
  • Spec values are maintainer-controlled. GADFLY_MODELS is set in the committed workflow YAML, not from untrusted PR input, so there's no tainted-data path here regardless.
  • Secrets remain scoped away from the third-party CLI. claudeEnv() (engine.go:221-241) allowlists env into the subprocess and deliberately drops GITEA_TOKEN/provider keys; MAX_THINKING_TOKENS and CLAUDE_* (the OAuth token) are intentionally kept, which is correct and unchanged by this PR. Verified by reading the function.
  • Image pinning is the safer choice. Pinning to an immutable :sha-… tag rather than :latest reduces (not increases) supply-chain risk; it points at the project's own registry. No concern.

Nothing in the security lens is materially wrong with this change.

Outside my lens: the PR's correctness rests on the image sha-c342bdb actually containing the :thinking/:max parse — the checked-out source has it (engine.go:83-121), but the image contents can't be verified from the repo. That's a build/correctness matter, not security.

🎯 Correctness — No material issues found

I've verified the change end-to-end against the actual code. Here's my review.


VERDICT: No material issues found

This is a config-only change (image tag bump + one new GADFLY_MODELS entry + comment). Through the correctness lens, I verified every way the new claude-code/opus:max spec is consumed downstream, and it is handled correctly:

  • Spec parsing (cmd/gadfly/engine.go:83-121): claude-code/opus:maxstrings.Cut(spec, "/") yields opus:max, then strings.Cut(after, ":") yields model opus + thinking suffix max. parseThinking("max") returns maxThinkingTokens (engine.go:66 = 31999), which is set as MAX_THINKING_TOKENS on the subprocess (engine.go:152-154). The magic number 31999 matches Claude Code's documented "ultrathink" budget — correct. There's an explicit test for exactly this spec (engine_test.go:229, :251, :270-273), and the comment/PR description's 31999 is consistent with the constant.
  • No model-id collision with the existing claude-code/opus entry: both resolve to --model opus, but they are intentionally distinct review passes (the stated goal — same model at two thinking depths). I checked the two places identity matters and neither collides:
    • PR comment marker (scripts/run.sh:50) keys on the full ${MODEL} spec, so the markers are …claude-code/opus --> vs …claude-code/opus:max --> — separate comments, no clobbering.
    • Status-board file (entrypoint.sh:165) sanitizes the full spec to claude-code_opus vs claude-code_opus_max — distinct files.
  • Provider/lane routing (entrypoint.sh:159): provider_of splits on /claude-code, matching the claude-code=2 concurrency entry. The new entry correctly joins the existing claude-code lane (now 3 models, cap 2), and the cloud-lane concurrency comment ("3 cloud × 3 lenses = 9") is untouched and unaffected.
  • Best-effort claim is accurate: MAX_THINKING_TOKENS is only appended when thinkingTokens > 0 and is a plain env var on the CLI subprocess, so it's genuinely a no-op if the installed CLI ignores it — matching the PR description.

I verified the above by reading engine.go, engine_test.go, entrypoint.sh, and run.sh directly (not from the diff alone). The image-tag bump to :sha-c342bdb is asserted to contain this :max parsing; I can confirm the parsing exists at this checkout but cannot verify the contents of the published image from here — that's the only unverifiable element, and it's outside the correctness lens.

No logic, unit, or threshold errors in my lane.

🧯 Error handling & edge cases — No material issues found

I verified the relevant code rather than reviewing the YAML in isolation. The diff is workflow-only: it bumps the image to :sha-c342bdb, adds claude-code/opus:max to GADFLY_MODELS, and updates a comment. My lens is error handling & edge cases, so I traced where :max actually gets parsed and executed.

VERDICT: No material issues found

Findings (all within the error-handling / edge-case lens):

  • The :max suffix on claude-code/opus:max is parsed safely. Verified in cmd/gadfly/engine.go:86-95: strings.Cut(after, ":") splits the thinking tier, and parseThinking (engine.go:113-121) maps "max"maxThinkingTokens (31999), a positive int → itself, and anything else → 0 (off). Bad/garbage suffixes degrade to "no extended thinking" rather than panicking — no strconv panic path, since the error from Atoi is checked.
  • The CSV-with-colons edge case is fine. claude-code/opus:max contains a colon, but GADFLY_MODELS is comma-separated and existing entries (qwen3-coder:480b-cloud) already carry colons, so colon-in-spec is an established, handled shape. The colon only gets special meaning after the claude-code/ slash prefix (engine.go:88-90), which is correct.
  • The "best-effort / no-op if CLI ignores it" claim in the PR holds without a failure mode: MAX_THINKING_TOKENS is only appended to the subprocess env when thinkingTokens > 0 (engine.go:152-154), and it's explicitly whitelisted in the minimal env passthrough (engine.go:224). If the installed CLI ignores the var, the run still produces a normal result — no error swallowed, no crash.
  • Unhappy paths in the claude-code engine that this model exercises are already handled: timeout/cancel surfaces as an error instead of partial bytes (engine.go:174-176), empty results error out (engine.go:190), and process-group kill prevents orphaned children (engine.go:159-165). This PR adds no new unhandled path.

Nothing in my lane is materially wrong. The change introduces no new error-handling or edge-case surface beyond the :max parse, which is covered (and tested — engine_test.go:229,247,272).

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 56s

<!-- gadfly-review:ollama:claude-code/opus --> ### 🪰 Gadfly review — `claude-code/opus` (claude-code) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> I've verified the relevant code path. My review is complete. ## VERDICT: No material issues found This PR is a workflow-config-only change: it bumps the dogfood image tag (`sha-80d8f53` → `sha-c342bdb`), appends `claude-code/opus:max` to `GADFLY_MODELS`, and edits comments. I traced the security-relevant path for the new `:max` spec: - **No injection from the model spec.** `claude-code/opus:max` flows into `newClaudeCodeEngine` → `parseThinking("max")` → `maxThinkingTokens` (31999), which is applied as an env var (`engine.go:152-154`), and the CLI is invoked via `exec.CommandContext` with an argv slice (`engine.go:127-139,149`) — no shell, so no command-injection surface. The constant `31999` is hardcoded, not derived from the spec. Verified by reading `engine.go`. - **Spec values are maintainer-controlled.** `GADFLY_MODELS` is set in the committed workflow YAML, not from untrusted PR input, so there's no tainted-data path here regardless. - **Secrets remain scoped away from the third-party CLI.** `claudeEnv()` (`engine.go:221-241`) allowlists env into the subprocess and deliberately drops `GITEA_TOKEN`/provider keys; `MAX_THINKING_TOKENS` and `CLAUDE_*` (the OAuth token) are intentionally kept, which is correct and unchanged by this PR. Verified by reading the function. - **Image pinning is the safer choice.** Pinning to an immutable `:sha-…` tag rather than `:latest` reduces (not increases) supply-chain risk; it points at the project's own registry. No concern. Nothing in the security lens is materially wrong with this change. Outside my lens: the PR's correctness rests on the *image* `sha-c342bdb` actually containing the `:thinking`/`:max` parse — the checked-out source has it (`engine.go:83-121`), but the image contents can't be verified from the repo. That's a build/correctness matter, not security. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> I've verified the change end-to-end against the actual code. Here's my review. --- **VERDICT: No material issues found** This is a config-only change (image tag bump + one new `GADFLY_MODELS` entry + comment). Through the correctness lens, I verified every way the new `claude-code/opus:max` spec is consumed downstream, and it is handled correctly: - **Spec parsing** (`cmd/gadfly/engine.go:83-121`): `claude-code/opus:max` → `strings.Cut(spec, "/")` yields `opus:max`, then `strings.Cut(after, ":")` yields model `opus` + thinking suffix `max`. `parseThinking("max")` returns `maxThinkingTokens` (`engine.go:66 = 31999`), which is set as `MAX_THINKING_TOKENS` on the subprocess (`engine.go:152-154`). The magic number 31999 matches Claude Code's documented "ultrathink" budget — correct. There's an explicit test for exactly this spec (`engine_test.go:229`, `:251`, `:270-273`), and the comment/PR description's `31999` is consistent with the constant. - **No model-id collision with the existing `claude-code/opus` entry**: both resolve to `--model opus`, but they are intentionally distinct review passes (the stated goal — same model at two thinking depths). I checked the two places identity matters and neither collides: - PR comment marker (`scripts/run.sh:50`) keys on the full `${MODEL}` spec, so the markers are `…claude-code/opus -->` vs `…claude-code/opus:max -->` — separate comments, no clobbering. - Status-board file (`entrypoint.sh:165`) sanitizes the full spec to `claude-code_opus` vs `claude-code_opus_max` — distinct files. - **Provider/lane routing** (`entrypoint.sh:159`): `provider_of` splits on `/` → `claude-code`, matching the `claude-code=2` concurrency entry. The new entry correctly joins the existing claude-code lane (now 3 models, cap 2), and the cloud-lane concurrency comment ("3 cloud × 3 lenses = 9") is untouched and unaffected. - **Best-effort claim is accurate**: `MAX_THINKING_TOKENS` is only appended when `thinkingTokens > 0` and is a plain env var on the CLI subprocess, so it's genuinely a no-op if the installed CLI ignores it — matching the PR description. I verified the above by reading `engine.go`, `engine_test.go`, `entrypoint.sh`, and `run.sh` directly (not from the diff alone). The image-tag bump to `:sha-c342bdb` is asserted to contain this `:max` parsing; I can confirm the parsing exists at this checkout but cannot verify the contents of the published image from here — that's the only unverifiable element, and it's outside the correctness lens. No logic, unit, or threshold errors in my lane. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> I verified the relevant code rather than reviewing the YAML in isolation. The diff is workflow-only: it bumps the image to `:sha-c342bdb`, adds `claude-code/opus:max` to `GADFLY_MODELS`, and updates a comment. My lens is error handling & edge cases, so I traced where `:max` actually gets parsed and executed. **VERDICT: No material issues found** Findings (all within the error-handling / edge-case lens): - The `:max` suffix on `claude-code/opus:max` is parsed safely. Verified in `cmd/gadfly/engine.go:86-95`: `strings.Cut(after, ":")` splits the thinking tier, and `parseThinking` (`engine.go:113-121`) maps `"max"` → `maxThinkingTokens` (31999), a positive int → itself, and **anything else → 0 (off)**. Bad/garbage suffixes degrade to "no extended thinking" rather than panicking — no `strconv` panic path, since the error from `Atoi` is checked. - The CSV-with-colons edge case is fine. `claude-code/opus:max` contains a colon, but `GADFLY_MODELS` is comma-separated and existing entries (`qwen3-coder:480b-cloud`) already carry colons, so colon-in-spec is an established, handled shape. The colon only gets special meaning *after* the `claude-code/` slash prefix (`engine.go:88-90`), which is correct. - The "best-effort / no-op if CLI ignores it" claim in the PR holds without a failure mode: `MAX_THINKING_TOKENS` is only appended to the subprocess env when `thinkingTokens > 0` (`engine.go:152-154`), and it's explicitly whitelisted in the minimal env passthrough (`engine.go:224`). If the installed CLI ignores the var, the run still produces a normal result — no error swallowed, no crash. - Unhappy paths in the claude-code engine that this model exercises are already handled: timeout/cancel surfaces as an error instead of partial bytes (`engine.go:174-176`), empty results error out (`engine.go:190`), and process-group kill prevents orphaned children (`engine.go:159-165`). This PR adds no new unhandled path. Nothing in my lane is materially wrong. The change introduces no new error-handling or edge-case surface beyond the `:max` parse, which is covered (and tested — `engine_test.go:229,247,272`). </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 56s</sub>

🪰 Gadfly review — deepseek-v4-pro:cloud (ollama-cloud)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

Verdict: No material issues found

This is a workflow-only change that adds a new model entry (claude-code/opus:max) and bumps the pinned Docker image. Through the security lens:

  • No new secrets or credentials are introduced. The existing CLAUDE_CODE_OAUTH_TOKEN is unchanged.
  • No injection vectors: the model spec claude-code/opus:max is hardcoded in the workflow YAML, not derived from any user-controlled input. The :max suffix is parsed by parseThinking() (engine.go:113-121), which only accepts the literal "max" (case-insensitive) or a positive integer — no path traversal, command injection, or other taint risk.
  • MAX_THINKING_TOKENS in the subprocess env: I verified that claudeEnv() (engine.go:224) already allowlists MAX_THINKING_TOKENS for the CLI subprocess. The value is hardcoded as 31999 in Go (engine.go:66), not sourced from workflow inputs or secrets. No risk of an attacker-controlled value reaching the CLI.
  • Image pinning: the image is pinned to an immutable :sha-c342bdb tag — no supply-chain regression from the prior :sha-80d8f53.
  • No authz/permission changes: the job-level if: guard, GADFLY_ALLOWED_USERS, and permissions: block are all untouched.

Nothing material to flag here.

🎯 Correctness — No material issues found

Verdict: No material issues found

The diff adds claude-code/opus:max to the GADFLY_MODELS list and bumps the pinned image to sha-c342bdb (which carries the :thinking suffix parser from #5). I verified the full parse chain end-to-end:

  • Spec parsing (engine.go:83-108): claude-code/opus:maxmodel="opus", thinkingTokens=31999. Confirmed by reading the newClaudeCodeEngine and parseThinking functions.
  • Env injection (engine.go:152-154): thinkingTokens > 0 appends MAX_THINKING_TOKENS=31999 to the subprocess env. Confirmed by the test at engine_test.go:270-273 which asserts the stub receives MTT=31999.
  • Provider lane (entrypoint.sh:159): provider_of "claude-code/opus:max"claude-code, matching the existing claude-code=2 concurrency cap. The 3 claude-code models (sonnet, opus, opus:max) will run 2 at a time — correct.
  • Lens concurrency: GADFLY_PROVIDER_LENS_CONCURRENCY has no claude-code entry, so claude-code models fall back to defaultLensConcurrency=1 (sequential lenses per model). This is consistent with the existing sonnet/opus behavior.

No logic bugs, no misrouted models, no concurrency misconfiguration. The change is purely additive and mechanically sound.

🧯 Error handling & edge cases — No material issues found

No material issues found.

The diff adds claude-code/opus:max to the model fleet. Through the error-handling lens, I verified:

  • :max parsing: newClaudeCodeEngine (engine.go:86-95) splits opus:max on : → model=opus, thinking=max. parseThinking("max") (engine.go:114) returns maxThinkingTokens=31999. The runPass method (engine.go:152-154) appends MAX_THINKING_TOKENS=31999 to the subprocess env only when thinkingTokens > 0. All paths are exercised by existing tests (engine_test.go:257+).

  • Graceful degradation: The engine explicitly documents (engine.go:64-66) that MAX_THINKING_TOKENS is a no-op if the CLI build doesn't honor it — opus:max falls back to behaving like opus. No crash, no error, no hung process.

  • Concurrency: GADFLY_PROVIDER_CONCURRENCY=claude-code=2 with 3 claude-code models means one queues. At 10 min/lens × 3 lenses × ~2.5 serial batches ≈ 75 min, well within the 90-minute job timeout. No starvation or deadlock risk.

  • No new nil/empty/zero/negative paths: The new spec flows through the same strings.Cut and parseThinking branches already exercised by claude-code/opus (no colon) and claude-code/sonnet. No new edge cases introduced.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 59s

<!-- gadfly-review:ollama:deepseek-v4-pro:cloud --> ### 🪰 Gadfly review — `deepseek-v4-pro:cloud` (ollama-cloud) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> ## Verdict: No material issues found This is a workflow-only change that adds a new model entry (`claude-code/opus:max`) and bumps the pinned Docker image. Through the security lens: - **No new secrets or credentials** are introduced. The existing `CLAUDE_CODE_OAUTH_TOKEN` is unchanged. - **No injection vectors**: the model spec `claude-code/opus:max` is hardcoded in the workflow YAML, not derived from any user-controlled input. The `:max` suffix is parsed by `parseThinking()` (engine.go:113-121), which only accepts the literal `"max"` (case-insensitive) or a positive integer — no path traversal, command injection, or other taint risk. - **`MAX_THINKING_TOKENS` in the subprocess env**: I verified that `claudeEnv()` (engine.go:224) already allowlists `MAX_THINKING_TOKENS` for the CLI subprocess. The value is hardcoded as `31999` in Go (engine.go:66), not sourced from workflow inputs or secrets. No risk of an attacker-controlled value reaching the CLI. - **Image pinning**: the image is pinned to an immutable `:sha-c342bdb` tag — no supply-chain regression from the prior `:sha-80d8f53`. - **No authz/permission changes**: the job-level `if:` guard, `GADFLY_ALLOWED_USERS`, and `permissions:` block are all untouched. Nothing material to flag here. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> ## Verdict: No material issues found The diff adds `claude-code/opus:max` to the `GADFLY_MODELS` list and bumps the pinned image to `sha-c342bdb` (which carries the `:thinking` suffix parser from #5). I verified the full parse chain end-to-end: - **Spec parsing** (`engine.go:83-108`): `claude-code/opus:max` → `model="opus"`, `thinkingTokens=31999`. Confirmed by reading the `newClaudeCodeEngine` and `parseThinking` functions. - **Env injection** (`engine.go:152-154`): `thinkingTokens > 0` appends `MAX_THINKING_TOKENS=31999` to the subprocess env. Confirmed by the test at `engine_test.go:270-273` which asserts the stub receives `MTT=31999`. - **Provider lane** (`entrypoint.sh:159`): `provider_of "claude-code/opus:max"` → `claude-code`, matching the existing `claude-code=2` concurrency cap. The 3 claude-code models (sonnet, opus, opus:max) will run 2 at a time — correct. - **Lens concurrency**: `GADFLY_PROVIDER_LENS_CONCURRENCY` has no `claude-code` entry, so claude-code models fall back to `defaultLensConcurrency=1` (sequential lenses per model). This is consistent with the existing sonnet/opus behavior. No logic bugs, no misrouted models, no concurrency misconfiguration. The change is purely additive and mechanically sound. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> **No material issues found.** The diff adds `claude-code/opus:max` to the model fleet. Through the error-handling lens, I verified: - **`:max` parsing**: `newClaudeCodeEngine` (engine.go:86-95) splits `opus:max` on `:` → model=`opus`, thinking=`max`. `parseThinking("max")` (engine.go:114) returns `maxThinkingTokens=31999`. The `runPass` method (engine.go:152-154) appends `MAX_THINKING_TOKENS=31999` to the subprocess env only when `thinkingTokens > 0`. All paths are exercised by existing tests (engine_test.go:257+). - **Graceful degradation**: The engine explicitly documents (engine.go:64-66) that `MAX_THINKING_TOKENS` is a no-op if the CLI build doesn't honor it — `opus:max` falls back to behaving like `opus`. No crash, no error, no hung process. - **Concurrency**: `GADFLY_PROVIDER_CONCURRENCY=claude-code=2` with 3 claude-code models means one queues. At 10 min/lens × 3 lenses × ~2.5 serial batches ≈ 75 min, well within the 90-minute job timeout. No starvation or deadlock risk. - **No new nil/empty/zero/negative paths**: The new spec flows through the same `strings.Cut` and `parseThinking` branches already exercised by `claude-code/opus` (no colon) and `claude-code/sonnet`. No new edge cases introduced. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 59s</sub>

🪰 Gadfly review — nemotron-3-super:cloud (ollama-cloud)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

No material issues found

  • .gitea/workflows/adversarial-review.yml:48 - Routine dependency update (Docker image)
  • .gitea/workflows/adversarial-review.yml:63 - Adds a new model variant (Claude Code opus with max extended thinking) to the model fleet
  • .gitea/workflows/adversarial-review.yml:57-62,65 - Updates comments to accurately reflect the changes
🎯 Correctness — No material issues found

No material issues found

🧯 Error handling & edge cases — Reviewed

Yes, the change shown in the diff is present in the current file .gitea/workflows/adversarial-review.yml. The line:

          GADFLY_MODELS: "minimax-m3:cloud,glm-5.2:cloud,glm-5.1:cloud,deepseek-v4-pro:cloud,nemotron-3-super:cloud,qwen3-coder:480b-cloud,claude-code/sonnet,claude-code/opus,claude-code/opus:max"

matches the updated line shown in the diff (the addition of ,claude-code/opus:max). The file currently contains that exact string, confirming the change has been applied.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 1m 39s

<!-- gadfly-review:ollama:nemotron-3-super:cloud --> ### 🪰 Gadfly review — `nemotron-3-super:cloud` (ollama-cloud) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> No material issues found - .gitea/workflows/adversarial-review.yml:48 - Routine dependency update (Docker image) - .gitea/workflows/adversarial-review.yml:63 - Adds a new model variant (Claude Code opus with max extended thinking) to the model fleet - .gitea/workflows/adversarial-review.yml:57-62,65 - Updates comments to accurately reflect the changes </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> No material issues found </details> <details><summary><b>🧯 Error handling & edge cases</b> — Reviewed</summary> Yes, the change shown in the diff is present in the current file `.gitea/workflows/adversarial-review.yml`. The line: ``` GADFLY_MODELS: "minimax-m3:cloud,glm-5.2:cloud,glm-5.1:cloud,deepseek-v4-pro:cloud,nemotron-3-super:cloud,qwen3-coder:480b-cloud,claude-code/sonnet,claude-code/opus,claude-code/opus:max" ``` matches the updated line shown in the diff (the addition of `,claude-code/opus:max`). The file currently contains that exact string, confirming the change has been applied. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 1m 39s</sub>

🪰 Gadfly review — qwen3-coder:480b-cloud (ollama-cloud)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

VERDICT: No material issues found

This PR modifies a CI workflow file to enable a new reviewer model (claude-code/opus:max) by updating the Docker image tag and adding the new model to the GADFLY_MODELS environment variable. Through the security lens:

  • The change involves no new code paths or logic, only configuration updates in a YAML workflow file.
  • No secrets are exposed or mishandled; the existing CLAUDE_CODE_OAUTH_TOKEN secret is reused appropriately.
  • No input validation or sanitization concerns arise since this is a static configuration change.
  • No authentication or authorization gaps are introduced.
  • No injection, SSRF, or deserialization risks are present.

The diff has been verified as a safe configuration update with no security implications. The addition of a new model to the review fleet is operationally sound and follows the established pattern for model inclusion.

🎯 Correctness — No material issues found

No material issues found.

🧯 Error handling & edge cases — No material issues found

No material issues found.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5s

<!-- gadfly-review:ollama:qwen3-coder:480b-cloud --> ### 🪰 Gadfly review — `qwen3-coder:480b-cloud` (ollama-cloud) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> VERDICT: No material issues found This PR modifies a CI workflow file to enable a new reviewer model (`claude-code/opus:max`) by updating the Docker image tag and adding the new model to the `GADFLY_MODELS` environment variable. Through the security lens: - The change involves no new code paths or logic, only configuration updates in a YAML workflow file. - No secrets are exposed or mishandled; the existing `CLAUDE_CODE_OAUTH_TOKEN` secret is reused appropriately. - No input validation or sanitization concerns arise since this is a static configuration change. - No authentication or authorization gaps are introduced. - No injection, SSRF, or deserialization risks are present. The diff has been verified as a safe configuration update with no security implications. The addition of a new model to the review fleet is operationally sound and follows the established pattern for model inclusion. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> No material issues found. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> No material issues found. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5s</sub>

🪰 Gadfly review — claude-code/opus:max (claude-code)

Verdict: No material issues found — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

I have what I need. The change is workflow-only (image tag bump + one model added to GADFLY_MODELS), and I've verified the security-relevant code path that the new claude-code/opus:max spec exercises.

VERDICT: No material issues found

Reviewed through the 🔒 Security lens. This is a CI workflow-only change: bumps the pinned image from :sha-80d8f53 to :sha-c342bdb and appends claude-code/opus:max to GADFLY_MODELS. I traced the new spec into the code it actually exercises and found no security problem.

What I verified:

  • No injection via the new model spec. claude-code/opus:max is parsed by parseThinking (cmd/gadfly/engine.go:113), which only maps "max" → the hardcoded constant maxThinkingTokens = 31999 or a positive integer → itself; everything else → 0. The value reaches the subprocess as MAX_THINKING_TOKENS=<int> via strconv.Itoa (engine.go:154), and the CLI is launched with exec.CommandContext + an argv slice (engine.go:149) — no shell, so no command/arg injection from the spec.
  • No secret leakage to the third-party CLI. The Claude Code subprocess gets a minimal allowlisted env via claudeEnv() (engine.go:221-233): GITEA_TOKEN, GADFLY_FINDINGS_TOKEN, and provider API keys are deliberately dropped; only CLAUDE_*/ANTHROPIC_*/NODE_*/XDG_*/locale/MAX_THINKING_TOKENS pass through. Adding another claude-code/* model reuses the already-present CLAUDE_CODE_OAUTH_TOKEN — no new secret is introduced or exposed by this diff.
  • Supply-chain pinning preserved. The image is still pinned to an immutable :sha-<short> digest tag rather than :latest, which is the safer choice (and matches the project's own documented lesson about :latest caching). No downgrade in pinning discipline.

Note on scope: I did not re-audit the pre-existing env-allowlist or spec-parsing logic as findings since this PR doesn't modify them — I only confirmed the new value flows through them safely.

Nothing in the security lane is materially wrong here.

🎯 Correctness — No material issues found

I've verified the change against the actual checked-out code. Here is my review.


VERDICT: No material issues found (Correctness lens)

This is a workflow-only change: bump the image tag and append claude-code/opus:max to GADFLY_MODELS. I checked the supporting code paths that make this spec meaningful:

  • :max suffix parses correctly. newClaudeCodeEngine (cmd/gadfly/engine.go:83-96) cuts on the first /, then on :, sending opus:max → model opus, suffix max. parseThinking (engine.go:113-121) maps maxmaxThinkingTokens. Verified by reading the code and the table tests at engine_test.go:228-232 ({"claude-code/opus:max", "opus", maxThinkingTokens}).
  • MAX_THINKING_TOKENS=31999 is the correct value. maxThinkingTokens = 31999 (engine.go:66) is Claude Code's documented "ultrathink" budget (32000−1). Re-derived against the known tier value; it's not an arbitrary magic number. It's injected into the subprocess env only when thinkingTokens > 0 (engine.go:152-154) and whitelisted through claudeEnv() (engine.go:224), so it actually reaches the CLI. The "best-effort / no-op if ignored" claim in the PR is accurate.
  • No collision between opus and opus:max. My main correctness worry was that both specs resolve to model id opus and would clobber each other's PR comment or status file. They don't: the comment MARKER uses the full ${MODEL} spec (scripts/run.sh:50), and status_file_for (entrypoint.sh:165) sanitizes the full spec via trclaude-code_opus vs claude-code_opus_max. Both keys stay distinct, so the two thinking depths get separate comments as intended.
  • Engine selection still triggers. GADFLY_MODEL="claude-code/opus:max" satisfies isClaudeCodeSpec's claude-code/ prefix check (engine.go:72), and provider_of (entrypoint.sh:159) yields lane claude-code, matching the claude-code=2 concurrency entry. Correct routing.

One thing I could not verify (and it's operational, not code-correctness): whether the pinned image :sha-c342bdb actually contains this :max parse code. The repo is shallow here (only the head commit is present), so I can't confirm c342bdb's contents. If that image predates the suffix parsing, claude-code/opus:max would be passed to the CLI as a literal --model opus:max and fail — but that's a deployment claim the PR asserts, and the code at this checkout is correct and tested.

Nothing material in the correctness lane.

🧯 Error handling & edge cases — No material issues found

I've verified the relevant code paths. My lens is error handling & edge cases.

VERDICT: No material issues found

This is a workflow-only change (image SHA bump + adding claude-code/opus:max to GADFLY_MODELS). I checked the unhappy paths the new spec could introduce and they are all handled:

  • Spec parsing of opus:max is defensive. newClaudeCodeEngine splits on the first : (engine.go:90-95) and parseThinking (engine.go:113-121) maps "max"31999, a positive int→itself, and anything else→0 with no panic — strconv.Atoi errors and non-positive values fall through to the safe 0 (thinking off). Bad input can't crash the run. Verified by reading the function.
  • No silent output collision between claude-code/opus and claude-code/opus:max (the one edge case worth checking, since both reduce to --model opus). The two stay distinct everywhere they're keyed:
    • Status file: status_file_for sanitizes via tr -c '[:alnum:]._-' '_' (entrypoint.sh:165), giving claude-code_opus.json vs claude-code_opus_max.json.
    • PR comment marker: <!-- gadfly-review:${PROVIDER}:${MODEL} --> (run.sh:50) keys on the full MODEL, so the two get separate comments.
      Neither overwrites the other.
  • Cleanup on the slow path is intact. opus:max will run longer; if it exceeds the per-lens timeout, cmd.Cancel SIGKILLs the whole process group (engine.go:156-165, Setpgid), so no orphaned claude/node children. A missing/ignoring CLI is a normal pass error, kept advisory (engine.go:81-82), and review_one logs+continues on failure (entrypoint.sh:183).
  • Lane routing is correct. provider_of "claude-code/opus:max"claude-code (segment before first /, entrypoint.sh:159), so the claude-code=2 concurrency cap applies as intended.

The constant 31999 matches the documented Claude Code "ultrathink" budget and the PR description; no off-by-one or unit concern in my lane.

Nothing in the error-handling lens is materially wrong here.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 3m 38s

<!-- gadfly-review:ollama:claude-code/opus:max --> ### 🪰 Gadfly review — `claude-code/opus:max` (claude-code) **Verdict: No material issues found** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> I have what I need. The change is workflow-only (image tag bump + one model added to `GADFLY_MODELS`), and I've verified the security-relevant code path that the new `claude-code/opus:max` spec exercises. ## VERDICT: No material issues found Reviewed through the **🔒 Security** lens. This is a CI workflow-only change: bumps the pinned image from `:sha-80d8f53` to `:sha-c342bdb` and appends `claude-code/opus:max` to `GADFLY_MODELS`. I traced the new spec into the code it actually exercises and found no security problem. **What I verified:** - **No injection via the new model spec.** `claude-code/opus:max` is parsed by `parseThinking` (`cmd/gadfly/engine.go:113`), which only maps `"max"` → the hardcoded constant `maxThinkingTokens = 31999` or a positive integer → itself; everything else → 0. The value reaches the subprocess as `MAX_THINKING_TOKENS=<int>` via `strconv.Itoa` (`engine.go:154`), and the CLI is launched with `exec.CommandContext` + an argv slice (`engine.go:149`) — no shell, so no command/arg injection from the spec. - **No secret leakage to the third-party CLI.** The Claude Code subprocess gets a minimal allowlisted env via `claudeEnv()` (`engine.go:221-233`): `GITEA_TOKEN`, `GADFLY_FINDINGS_TOKEN`, and provider API keys are deliberately dropped; only `CLAUDE_*`/`ANTHROPIC_*`/`NODE_*`/`XDG_*`/locale/`MAX_THINKING_TOKENS` pass through. Adding another `claude-code/*` model reuses the already-present `CLAUDE_CODE_OAUTH_TOKEN` — no new secret is introduced or exposed by this diff. - **Supply-chain pinning preserved.** The image is still pinned to an immutable `:sha-<short>` digest tag rather than `:latest`, which is the safer choice (and matches the project's own documented lesson about `:latest` caching). No downgrade in pinning discipline. **Note on scope:** I did not re-audit the pre-existing env-allowlist or spec-parsing logic as findings since this PR doesn't modify them — I only confirmed the new value flows through them safely. Nothing in the security lane is materially wrong here. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> I've verified the change against the actual checked-out code. Here is my review. --- **VERDICT: No material issues found** (Correctness lens) This is a workflow-only change: bump the image tag and append `claude-code/opus:max` to `GADFLY_MODELS`. I checked the supporting code paths that make this spec meaningful: - **`:max` suffix parses correctly.** `newClaudeCodeEngine` (`cmd/gadfly/engine.go:83-96`) cuts on the first `/`, then on `:`, sending `opus:max` → model `opus`, suffix `max`. `parseThinking` (`engine.go:113-121`) maps `max` → `maxThinkingTokens`. Verified by reading the code and the table tests at `engine_test.go:228-232` (`{"claude-code/opus:max", "opus", maxThinkingTokens}`). - **`MAX_THINKING_TOKENS=31999` is the correct value.** `maxThinkingTokens = 31999` (`engine.go:66`) is Claude Code's documented "ultrathink" budget (32000−1). Re-derived against the known tier value; it's not an arbitrary magic number. It's injected into the subprocess env only when `thinkingTokens > 0` (`engine.go:152-154`) and whitelisted through `claudeEnv()` (`engine.go:224`), so it actually reaches the CLI. The "best-effort / no-op if ignored" claim in the PR is accurate. - **No collision between `opus` and `opus:max`.** My main correctness worry was that both specs resolve to model id `opus` and would clobber each other's PR comment or status file. They don't: the comment `MARKER` uses the full `${MODEL}` spec (`scripts/run.sh:50`), and `status_file_for` (`entrypoint.sh:165`) sanitizes the full spec via `tr` → `claude-code_opus` vs `claude-code_opus_max`. Both keys stay distinct, so the two thinking depths get separate comments as intended. - **Engine selection still triggers.** `GADFLY_MODEL="claude-code/opus:max"` satisfies `isClaudeCodeSpec`'s `claude-code/` prefix check (`engine.go:72`), and `provider_of` (`entrypoint.sh:159`) yields lane `claude-code`, matching the `claude-code=2` concurrency entry. Correct routing. One thing I could **not** verify (and it's operational, not code-correctness): whether the pinned image `:sha-c342bdb` actually contains this `:max` parse code. The repo is shallow here (only the head commit is present), so I can't confirm `c342bdb`'s contents. If that image predates the suffix parsing, `claude-code/opus:max` would be passed to the CLI as a literal `--model opus:max` and fail — but that's a deployment claim the PR asserts, and the code at this checkout is correct and tested. Nothing material in the correctness lane. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> I've verified the relevant code paths. My lens is error handling & edge cases. ## VERDICT: No material issues found This is a workflow-only change (image SHA bump + adding `claude-code/opus:max` to `GADFLY_MODELS`). I checked the unhappy paths the new spec could introduce and they are all handled: - **Spec parsing of `opus:max` is defensive.** `newClaudeCodeEngine` splits on the first `:` (`engine.go:90-95`) and `parseThinking` (`engine.go:113-121`) maps `"max"`→`31999`, a positive int→itself, and **anything else→0** with no panic — `strconv.Atoi` errors and non-positive values fall through to the safe `0` (thinking off). Bad input can't crash the run. Verified by reading the function. - **No silent output collision** between `claude-code/opus` and `claude-code/opus:max` (the one edge case worth checking, since both reduce to `--model opus`). The two stay distinct everywhere they're keyed: - Status file: `status_file_for` sanitizes via `tr -c '[:alnum:]._-' '_'` (`entrypoint.sh:165`), giving `claude-code_opus.json` vs `claude-code_opus_max.json`. - PR comment marker: `<!-- gadfly-review:${PROVIDER}:${MODEL} -->` (`run.sh:50`) keys on the full `MODEL`, so the two get separate comments. Neither overwrites the other. - **Cleanup on the slow path is intact.** `opus:max` will run longer; if it exceeds the per-lens timeout, `cmd.Cancel` SIGKILLs the whole process group (`engine.go:156-165`, `Setpgid`), so no orphaned `claude`/node children. A missing/ignoring CLI is a normal pass error, kept advisory (`engine.go:81-82`), and `review_one` logs+continues on failure (`entrypoint.sh:183`). - **Lane routing is correct.** `provider_of "claude-code/opus:max"` → `claude-code` (segment before first `/`, `entrypoint.sh:159`), so the `claude-code=2` concurrency cap applies as intended. The constant `31999` matches the documented Claude Code "ultrathink" budget and the PR description; no off-by-one or unit concern in my lane. Nothing in the error-handling lens is materially wrong here. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 3m 38s</sub>
steve merged commit a4cdc905c9 into main 2026-06-27 22:49:49 +00:00
steve deleted branch ci/opus-max-thinking 2026-06-27 22:49:49 +00:00
Sign in to join this conversation.
No Reviewers
No Label
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: steve/gadfly#6