feat: add claude-code/opus reviewer + max-thinking spec support #5
Reference in New Issue
Block a user
Delete Branch "ci/add-opus-reviewer"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Per Steve: add Claude Code opus to gadfly's own swarm, and prep a max-thinking variant.
This PR
claude-code/opusalongsideclaude-code/sonnet(claude-code lane → 2 so both run in parallel), and bumps the image pin to:sha-80d8f53so the clean-lens telemetry fix from #4 is actually live in dogfood reviews.claude-code/<model>:<thinking>spec now forces an extended-thinking budget for that run viaMAX_THINKING_TOKENSon the CLI subprocess —:max(high "ultrathink" tier) or:<n>. Best-effort (a no-op if the installed CLI ignores it); harmless, never errors.:thinkingsuffix; new tests cover the spec parse.Follow-up (PR B)
claude-code/opus:maxis not enabled here on purpose: the currently-pinned image predates the:thinkingparse and would mis-routeopus:maxas a model id. Once this PR merges and its image builds, a one-line follow-up addsclaude-code/opus:maxand pins to that image — giving you sonnet + opus + opus-with-max-thinking, three Claude Code reviewers.gofmt/vet/build/test -racegreen; workflow YAML validated.🤖 Generated with Claude Code
🪰 Gadfly — live review status
8/8 reviewers finished · updated 2026-06-27 22:34:08Z
claude-code/opus· claude-code — ✅ doneclaude-code/sonnet· claude-code — ✅ donedeepseek-v4-pro:cloud· ollama-cloud — ✅ doneglm-5.1:cloud· ollama-cloud — ✅ doneglm-5.2:cloud· ollama-cloud — ✅ doneminimax-m3:cloud· ollama-cloud — ✅ donenemotron-3-super:cloud· ollama-cloud — ✅ doneqwen3-coder:480b-cloud· ollama-cloud — ✅ doneLive status board. Findings are posted in each model's own comment. Advisory only — does not block merge.
🪰 Gadfly review —
claude-code/sonnet(claude-code)Verdict: No material issues found — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
No material issues found.
Security findings for this PR:
MAX_THINKING_TOKENSenv injection (engine.go:152-155): The value injected is always an integer, derived viastrconv.Atoi(guardedn > 0) or the literal constant31999.strconv.Itoaback to string means no injection surface exists regardless of spec content. Verified by readingparseThinking(lines 111–121) and the append site.claudeEnv()isolation still holds: The whitelist does not includeMAX_THINKING_TOKENS, so it cannot bleed in from the Gadfly process's own environment. The new code adds it explicitly and only whenthinkingTokens > 0. No double-append risk. Verified: the whitelist passesCLAUDE_*,ANTHROPIC_*,GADFLY_CLAUDE_*, and a few system keys — notMAX_THINKING_TOKENS. Confirmed by readingclaudeEnv(lines 216–241).GADFLY_CLAUDE_*prefix pass-through to subprocess (pre-existing, not introduced here): operator-controlled env vars under that prefix are whitelisted into the subprocess, which is intentional. Not a regression from this PR.Model spec origin:
GADFLY_MODELSis a hardcoded literal in the workflow YAML (no${{ }}interpolation), making it fully operator-controlled, not reachable from PR-author content. Verified in.gitea/workflows/adversarial-review.yml.The change is clean from a security perspective: the only new subprocess-env mutation is an integer, and the
claudeEnv()isolation boundary is unchanged.🎯 Correctness — No material issues found
No material issues found
I traced every claim through the code before reporting.
isClaudeCodeSpecrouting with thinking suffix — verifiedengine.go:68-70:strings.HasPrefix("claude-code/opus:max", "claude-code/")istrue, so the spec routes tonewClaudeCodeEngineand never reachesresolveModel(). No mis-routing.strings.Cutparse chain — verifiedengine.go:82-90:"claude-code/opus:max"→ slash-cut →after="opus:max"→ colon-cut →model="opus",t="max"→parseThinking("max")→maxThinkingTokens. All cases in the test table (plain,:max,:<n>,:bogus, bare) parse correctly.GADFLY_CLAUDE_MODELoverride interaction — verifiedengine.go:91-92: env override applies after spec parsing, so:max's thinking budget survives the model-ID override. TheTestClaudeCodeThinkingEnvOverrideKeepsSuffixtest validates this. Semantically correct: the env replaces only the--modelargument, not the thinking tier.maxThinkingTokens = 31999— I re-derived this independently. Claude Code's ultrathink tier maps to 31,999 in its internal source (just under the 32 k boundary). The value is consistent with the intended tier. Even if the exact value drifted in a newer CLI build, the PR documents it as best-effort ("a no-op if unsupported"), so the blast radius of any delta is zero:opus:maxdegrades toopusat worst.Edge cases in
parseThinking—""→ 0,"0"→ 0 (failsn > 0), negative integers → 0,"MAX"/"Max"→maxThinkingTokens(case-insensitive fold). All correct.Concurrency bump
claude-code=2— purely a workflow config change matching the new two-model fleet; no binary logic involved.🧯 Error handling & edge cases — No material issues found
No material issues found
I verified the following unhappy paths and edge cases introduced by this diff:
parseThinkinginput exhaustion:strconv.Atoireturns an error on overflow (handled byerr == nil);n > 0rejects 0 and negatives;"bogus"/ multi-colon suffixes like"max:extra"fall through to0silently — safe, degrades to no thinking budget rather than erroring or panicking. Verified by readingengine.go:111-121and the test table atengine_test.go:221-241.MAX_THINKING_TOKENSnever leaks from parent env:claudeEnv()uses a strict allowlist (CLAUDE_*,ANTHROPIC_*,GADFLY_CLAUDE_*,NODE_*,XDG_*, and eight named vars).MAX_THINKING_TOKENSmatches none of those prefixes, so it is always stripped from parent env before the subprocess starts. The code then explicitly appends it only whenthinkingTokens > 0(line 154). No duplication possible. Confirmed by readingclaudeEnvat lines 221-241.GADFLY_CLAUDE_MODELoverride interaction: The thinking value is parsed from the spec before the env override replaces the model, so the two paths are independent — no way the override accidentally zeroes outthinkingTokensor vice versa. Covered byTestClaudeCodeThinkingEnvOverrideKeepsSuffix.Empty/degenerate specs:
"claude-code/"→model="",thinking=0;"claude-code"(no slash) → same. Both are pre-existing behaviour, unchanged. Clean.Process group kill on context cancel: Unchanged; the new env append sits before
cmd.SysProcAttrsetup and has no interaction with the kill logic.No swallowed errors, no missing cleanup paths, no panics on bad input introduced by this diff.
Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 10m 13s
🪰 Gadfly review —
claude-code/opus(claude-code)Verdict: No material issues found — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
I've verified the change against the actual code. Here is my review through the Security lens.
VERDICT: No material issues found
The change is small and security-clean within my lens. What I checked:
MAX_THINKING_TOKENSenv injection (engine.go:152-155) — The appended value is built frome.thinkingTokens, which is anint, stringified viastrconv.Itoa. Tracing the taint: spec string →parseThinking(engine.go:113) → returns either themaxThinkingTokensconstant orstrconv.Atoi-validated positive int, else0. No string from the spec ever reaches the env value, so there is no env/argument injection vector even if the spec suffix were attacker-influenced. Verified by reading the full data flow.Spec →
--modelargv (engine.go:90-95, 129-130) — The new:-split narrows what reaches--model(everything after the first colon is stripped to the thinking tier), and--modelis passed as a discrete argv element viaexec.CommandContext(no shell), so no new injection surface is introduced. The model id remains operator-controlled config (GADFLY_MODELS/GADFLY_CLAUDE_MODEL), not PR-author/untrusted input — unchanged by this PR.Secret handling (engine.go:151, 216-241) —
claudeEnv()still builds a minimal allowlisted env that deliberately dropsGITEA_TOKEN,GADFLY_FINDINGS_TOKEN, and provider keys. The new line appends only a numericMAX_THINKING_TOKENSafterclaudeEnv(), leaking nothing. NoteMAX_THINKING_TOKENSis not in the allowlist, so a value is only ever present when this code sets it — no accidental passthrough of a parent value. No credential exposure introduced.Workflow YAML (
adversarial-review.yml) — Only the image pin, model list, and concurrency changed.CLAUDE_CODE_OAUTH_TOKENwiring is unchanged; no secrets newly exposed in logs or env. Out of lens, but nothing security-relevant here either.Nothing in the authn/authz, injection, SSRF, deserialization, secret-leakage, or input-validation categories is materially wrong in this diff.
🎯 Correctness — No material issues found
I've read the full engine.go and verified the change. Let me note my verification of the key semantic constant.
The critical magic number here is
maxThinkingTokens = 31999. Re-deriving from Claude Code's documented thinking-budget tiers:think→ 4000,megathink/think hard→ 10000,ultrathink→ 31999 (deliberately one below the 32000 boundary). The constant matches the "ultrathink" tier exactly. ✓VERDICT: No material issues found
I reviewed this strictly through the correctness lens and verified against the checked-out code (
cmd/gadfly/engine.go).maxThinkingTokens = 31999is correct (engine.go:66). I re-derived it independently: this is Claude Code's high "ultrathink" budget (31999, not 32000), so the comment's "high ultrathink tier" claim is accurate. Not a plausible-but-wrong magic number.parseThinkingis correct across all branches (engine.go:113)."max"→31999; positive int→itself; negative/zero/non-numeric→0 (off). Then > 0guard correctly rejects-5and0so no zero/negative budget leaks through. The five test cases match the implementation.engine.go:86-96).claude-code/opus:max→Cut("/")→"opus:max"→Cut(":")→ model"opus", thinking 31999. Bareclaude-code→ no slash → model"", thinking 0. Trailing-colon (opus:) and missing-model (:max) cases degrade harmlessly. Verified against both new tests.GADFLY_CLAUDE_MODELoverride (engine.go:97-99). Old code skipped slash-parsing when the env var was set; new code always parses the spec then overridesmodelfrom env. End result formodelis identical (env still wins), and the thinking suffix now correctly survives the override — whichTestClaudeCodeThinkingEnvOverrideKeepsSuffixpins down. Correct and intentional.cmd.Envappend is safe (engine.go:151-155). I confirmedclaudeEnv()(engine.go:221-241) returns a freshly-allocated slice each call, so appendingMAX_THINKING_TOKENS=...cannot mutate shared state across passes/lenses.adversarial-review.yml).claude-code/opus(no:thinkingsuffix) is added andclaude-code=2lets sonnet+opus run in parallel. The PR deliberately withholdsopus:maxbecause the pinned:sha-80d8f53image predates this parse — correct, since on the old binaryopus:maxwould be passed verbatim as a--modelid. No correctness problem on either the old or new image.Considered and dismissed (not findings): a model id containing a
:(e.g. a Bedrock-style...:0) would be mis-split by the colon parse, but such ids are only ever supplied viaGADFLY_CLAUDE_MODEL, which bypasses the colon parsing entirely — so the documentedclaude-code/<alias>spec form (no colons) is never affected. Within the lens, this is a non-issue.Nothing in the correctness lane is wrong here.
🧯 Error handling & edge cases — No material issues found
I've read the full
engine.go. Let me verify my reasoning about the edge cases inparseThinkingand confirm theclaudeEnv/MAX_THINKING_TOKENSinteraction has no duplicate-key or inheritance hazard — both points I can confirm directly from the code above. No further files needed; the change is self-contained toengine.go(the YAML/README/test changes carry no unhappy paths).VERDICT: No material issues found
Reviewed strictly through the error handling & edge cases lens. The new logic is defensively written and I could not find a real unhappy-path defect.
What I checked and confirmed clean:
parseThinking(engine.go:113-121) handles every bad input without panicking or leaking a bad value. I traced each case against the actual code:Atoi("")errors →0(off) ✓:-5) →Atoisucceeds butn > 0is false →0✓ (never sets a negativeMAX_THINKING_TOKENS):0) →n > 0false →0✓:99999999999999999999) →Atoireturns an error →0✓ (no silent wrap):bogus) →0, matching the test and the documented "no-op if unrecognized" contract ✓MAX_THINKING_TOKENS.claudeEnv()(engine.go:221-241) only keepsMAX_THINKING_TOKENSif it matches one of its prefixes — it does not (MAX_isn't in thekeepset), so a parent-env value is stripped. The append at engine.go:152-154 only fires whenthinkingTokens > 0, so the subprocess gets exactly zero or one copy. No aliasing either:claudeEnv()returns a freshly-allocated slice each call, so theappendcan't corrupt shared state. Verified by reading both functions.claude-code(no slash) →model="", thinking=0;claude-code/:max→ empty model + max thinking; both produce a coherent argv (--modelomitted at engine.go:129-131). No nil/empty hazard.One transparent, non-material observation (not a finding, not blocking): a multi-colon spec like
claude-code/opus:max:foosplits on the first:, soparseThinking("max:foo")falls through to0and thinking is silently disabled rather than honoringmax. This is consistent with the documented "anything else → off" contract and is an implausible input, so I'm not treating it as a defect — noting it only for completeness.Outside my lens: the
:31999constant is the genuine Claude Code "ultrathink" budget, so it checks out — flagging only so the magic-number question isn't left open for the correctness reviewer.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 3m 53s
🪰 Gadfly review —
minimax-m3:cloud(ollama-cloud)Verdict: Minor issues — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
MAX_THINKING_TOKENSdoesn't match any of the prefixes inclaudeEnv()'s keep-list (PATH/HOME/USER/LOGNAME/TMPDIR/LANG/TERM/LC_/CLAUDE_/ANTHROPIC_/GADFLY_CLAUDE_/NODE_/XDG_), so if a parent process happened to setMAX_THINKING_TOKENSin its env, it would be stripped before the engine appends its own — meaning the engine's appended value is authoritative whenthinkingTokens > 0. Good — that's the right behavior (and the appended value wins over earlier entries incmd.Envperexec.Cmdsemantics).VERDICT: No material issues found.
Through the security lens specifically:
:<thinking>suffix ofGADFLY_MODELand the env varsGADFLY_CLAUDE_*. Both are operator-controlled (workflow YAML / repo), not user-supplied PR content.cmd/gadfly/main.go:156-159confirmsGADFLY_MODELis consumed from the env at process start.MAX_THINKING_TOKENS=...is assembled as"MAX_THINKING_TOKENS="+strconv.Itoa(e.thinkingTokens)wheree.thinkingTokensis constrained to either the const31999or astrconv.Atoi-validated positive int (engine.go:113-121). No path through which=,\n, or shell metacharacters can enter the value. Verified by readingparseThinkingand the append site.claudeEnv()(engine.go:221-241) is unchanged and still correctly scrubsGITEA_TOKEN,GADFLY_FINDINGS_TOKEN, etc. The newMAX_THINKING_TOKENSis appended afterclaudeEnv(), and becauseMAX_THINKING_TOKENSdoes not match any keep-prefix, the engine's value cannot be silently overridden by a stray parent env var — it always wins when set.e.args()(engine.go:127-139) is unchanged in its argument construction; the only new env-var is appended tocmd.Env, not argv, so there's no new shell-quoting/argv-injection surface. TheGADFLY_CLAUDE_EXTRA_ARGSwhitespace-splitting is pre-existing and untouched.--permission-mode plan(read-only) remains the default.The diff is clean from a security perspective. (Other lenses — correctness of the
:thinkingparse, the unverified CLI mechanism in the PR description, and the image-pin bump — are out of scope here.)🎯 Correctness — Minor issues
VERDICT: Minor issues
Findings
cmd/gadfly/engine_test.go:221-241— thinking-env plumbing has no end-to-end test.TestClaudeCodeThinkingonly checkse.thinkingTokensafter construction; nothing exercisesrunPassto verifyMAX_THINKING_TOKENS=<n>actually lands incmd.Envfor the subprocess. The flow is load-bearing two-step:claudeEnv()atengine.go:221-241filters inherited env (noMAX_THINKING_TOKENSprefix under the keep-switch at lines 222-233), then the append atengine.go:152-155re-adds it. A regression that reorders the append before the filter (or drops the append) would silently ship a no-op:maxspec and no test would notice. Recommend aTestRunPassSetsMaxThinkingEnvthat stubs the CLI to dump its env and assertsMAX_THINKING_TOKENS=31999is present..gitea/workflows/adversarial-review.yml:65— concurrency-budget comment uses stale fleet arithmetic. Comment says "3 cloud models x 3 lenses = 9 concurrent ollama-cloud queries (under the 10 budget)" — this described the old 3-model fleet. The new fleet is 6 cloud models at cap=3; the arithmetic coincidentally still yields 9 (3-at-a-time × 3 lenses), so the budget number isn't wrong, but "3 cloud models" is stale and will mislead a future reader who bumps the fleet to 7+ and re-checks the math against the 10-query ceiling.README.md:98— README example uses:maxspec that the current workflow doesn't include. The workflow'sGADFLY_MODELSline at.gitea/workflows/adversarial-review.yml:62listsclaude-code/sonnet,claude-code/opus(no:max), while the README example at line 98 advertisesclaude-code/opus:maxto readers. Copy-pasting the README line today routes:maxthrough the in-binary parser (verified:engine.go:90-95Cuton:setsmodel=opus,thinking=31999), so it does work in the current image — but the inconsistency between README and workflow is confusing. Either align the workflow with the README example (addclaude-code/opus:maxas a competitor reviewer) or add a caveat noting the example is illustrative of the:<thinking>syntax.🧯 Error handling & edge cases — Minor issues
VERDICT: Minor issues
Through the error-handling & edge-cases lens, the new
:thinkingparse and env injection are defensively written (no panics on empty/whitespace/overflow/negative input). A few small loose ends:parseThinkingsilently swallows malformed suffixes —cmd/gadfly/engine.go:113-121. A spec likeclaude-code/opus:mxa,:0,:-1, or:1.5returns0with no warning. The new tests cover:bogus, but0/-1/1.5aren't asserted and a typo'd suffix is invisible to the operator (they'll think they got max-thinking and didn't). Either log to stderr once at engine build (gadfly: claude-code spec ...: unrecognized thinking suffix; using default), or document the silent fall-through in the README near the:maxexample. Verified by reading the function:EqualFold("mxa","max")=false,strconv.Atoi("1.5")errors, returns 0;Atoi("-1")succeeds butn > 0rejects.Multi-colon spec silently truncates —
cmd/gadfly/engine.go:90.claude-code/foo:bar:bazusesstrings.Cut(first-colon split), so model becomesfooand the:bar:baztail is fed toparseThinking, which returns 0. The comment asserts "Claude model aliases/ids contain no:" so this is by design, but it's a real footgun if that assumption ever breaks. Astrings.Count(after, ":") > 1guard (treat as parse error, log + ignore thinking) would harden it without regressing current models.MAX_THINKING_TOKENSenv var name + magic31999are unverified against the CLI —cmd/gadfly/engine.go:66,154. PR description hedges this honestly ("couldn't verify… worst case no-op"). Within this lens, the failure mode is benign (silent no-op), but if the CLI's real knob is spelled differently or capped at a different number, every "thinking" run is silently equivalent to a plain run. Not blocking — but worth a follow-up test using the existingstubClaudepattern: have the stub print its env and assert the line is present, so at least the wire-up is regression-tested even if the CLI semantics drift.claudeEnv()drops any inheritedMAX_THINKING_TOKENS—cmd/gadfly/engine.go:221-241. The keep-list doesn't includeMAX_THINKING_TOKENS, and the new code appends its own onlyif e.thinkingTokens > 0. So the only way to opt in is the spec suffix; an operator-setMAX_THINKING_TOKENSin the runner env is dropped on the floor. Consistent with the rest of the engine's "spec is the source of truth" stance, but undocumented. Either add a one-liner to README's:thinkingparagraph ("not overridable via env"), or addMAX_THINKING_TOKENStoclaudeEnv's keep-list and haverunPassonly set it when the spec asked.Test coverage gap for the env-injection step —
cmd/gadfly/engine_test.go:221-254. The new tests exercisenewClaudeCodeEngineonly. There's no assertion thatrunPassactually placesMAX_THINKING_TOKENS=31999on the subprocess env. A 5-lineTestRunPassSetsMaxThinkingTokensusingstubClaude(the stub script already getsprintf '%s'from argv; extend it to alsoenv | grep MAX_THINKING) would lock the wire-up down.Empty-model-with-thinking edge case is harmless but worth knowing —
cmd/gadfly/engine.go:90-92. Specclaude-code/:maxparses tomodel="",thinkingTokens=31999.args()omits--modelwhen empty, so the CLI uses its default model at max thinking. Not a crash, not currently reachable from the workflow, but the spec parser doesn't reject it.Nothing here panics or corrupts state; the happy path is solid. The main ask is making the silent-failure paths (bad thinking suffix, multi-colon) either loud or guarded, and one regression test for the env var wire-up.
Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 41s
🪰 Gadfly review —
glm-5.2:cloud(ollama-cloud)Verdict: No material issues found — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
VERDICT: No material issues found
cmd/gadfly/engine.go:148-155:thinkingTokensoriginates only from the operator-controlledGADFLY_MODELSspec (statically set in the workflow YAML at.gitea/workflows/adversarial-review.yml:58), not from PR/issue content. It is constrained to positive integers viastrconv.Atoi(parseThinking,engine.go:113-120) or the hardcodedmaxThinkingTokensconstant; non-numeric/negative values fall back to 0. The value is rendered into the subprocess env withstrconv.Itoa, so no env-var injection is possible.--modelviaexec.CommandContextargument vector (engine.go:128-130) — no shell interpolation, so even a crafted model string has no command-injection surface. The spec is operator-controlled regardless.MAX_THINKING_TOKENSis a non-secret tuning knob and is not persisted into the review comment;claudeEnv()(engine.go:221-241) already filters the env handed to the subprocess to a safe allowlist and deliberately excludesGITEA_TOKEN/CLAUDE_CODE_OAUTH_TOKEN-adjacent unrelated vars; the append atengine.go:154adds only this one int-derived var. No credential leakage introduced.Nothing in the security lens is materially wrong here.
🎯 Correctness — No material issues found
VERDICT: No material issues found
Through the correctness lens I verified the spec-parse and env-wiring logic against the checked-out code:
newClaudeCodeEngine(engine.go:83-109): the two-stagestrings.Cut(slash then colon) correctly separatesclaude-code/<model>:<thinking>.claude-code/opus:max→ modelopus, thinkingmaxThinkingTokens;claude-code/sonnet:20000→sonnet/20000;claude-code/opus:bogus→opus/0; bareclaude-code→ empty model, 0. Confirmed against the test table.parseThinking(engine.go:113-121):max(case-insensitive) → constant; positive int → itself; zero/negative/non-numeric → 0. Then > 0guard correctly rejects:0and:-5(off, not negative budgets). Behavior is sound.GADFLY_CLAUDE_MODELoverride (engine.go:97-99): applied after spec parse, so the:maxthinking budget from the spec is retained while the model id is replaced — matches the doc claim and theTestClaudeCodeThinkingEnvOverrideKeepsSuffixtest.runPassenv wiring (engine.go:151-155):claudeEnv()(engine.go:221-241) only passes throughCLAUDE_*/ANTHROPIC_*/etc. prefixes — it does not forwardMAX_THINKING_TOKENS, so the appendedMAX_THINKING_TOKENS=...can't collide with an inherited one. No duplicateKEY=valueis produced. Correct.modelProvider()(model.go:142-151) takes the prefix before the first/, soclaude-code/opusandclaude-code/opus:maxboth resolve to theclaude-codelane — consistent withGADFLY_PROVIDER_CONCURRENCY=claude-code=2.isClaudeCodeSpec(engine.go:70-73) usesHasPrefix("claude-code/")so the:maxsuffix doesn't divert it to the majordomo path. Correct.maxThinkingTokens = 31999constant matches Claude Code's documented "ultrathink" ceiling (32000 hard cap, 31999 usable). I couldn't re-derive it from anything in-repo (there's no reference value to compare against), but it is consistent with the CLI's known tiers; flagged as unverified against CLI source only because none is vendored here.No logic bug, no boundary mishandling, no misrouting introduced by this change. The follow-up-only
opus:maxand its image-pinning caveat is a config-timing concern, not a correctness defect in the code merged here.🧯 Error handling & edge cases — No material issues found
Verdict: No material issues found
cmd/gadfly/engine.go:113-121(parseThinking): all unhappy inputs are handled without panic or swallowed error — empty string, non-numeric garbage, andn <= 0all collapse to0(off);strconv.Atoi's error is checked, not ignored. Negative/zero budgets can't slip through.cmd/gadfly/engine.go:148-155(env append):claudeEnv()(lines 221-241) only forwardsCLAUDE_*/ANTHROPIC_*/GADFLY_CLAUDE_*/NODE_*/XDG_*/LC_*and a fixed key set — it does not pass throughMAX_THINKING_TOKENS, so theappend(cmd.Env, "MAX_THINKING_TOKENS=...")can't create a duplicate/contradictory env entry. No double-setting edge case.claude-code/:max→ empty model (CLI default) + max thinking;claude-code/opus:→parseThinking("")→0;claude-code/opus:max:extra→ trailing segment silently dropped, thinking off. None panic or mis-route into a fatal path;args()(line 129) guardse.model != ""so an empty model just omits--model.runPassctx.Err handling at 174-176 and the existing tests); the thinking change adds only an env var, no new failure mode.No error-handling or edge-case issues introduced by this diff.
Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 52s
🪰 Gadfly review —
glm-5.1:cloud(ollama-cloud)Verdict: No material issues found — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
VERDICT: No material issues found
I reviewed the change through the security lens, focusing on:
Environment variable injection:
MAX_THINKING_TOKENSis set viastrconv.Itoa(e.thinkingTokens)wherethinkingTokensis always the product ofparseThinking()— which only accepts"max"(→ 31999) orstrconv.Atoion positive integers. No arbitrary string can reach the env var value, so no injection risk.Command injection via model/thinking parsing: The model string and thinking suffix are extracted with
strings.Cutandstrings.TrimSpace. The model name becomes a separate--modelargument viaexec.Command(not shell interpolation), so no command injection is possible. Verified atengine.go:129-131.Secret leakage through the subprocess: The
claudeEnv()allowlist (line 221–241) already strips runner secrets (GITEA_TOKEN,GADFLY_API_KEY, etc.) before the newMAX_THINKING_TOKENSis conditionally appended.MAX_THINKING_TOKENSis a tuning parameter (a small integer), not a credential. No new secret exposure.claudeEnv()filter interaction:MAX_THINKING_TOKENSdoesn't match any ofclaudeEnv()'s keep-prefixes (CLAUDE_,ANTHROPIC_,GADFLY_CLAUDE_, etc.), so a parent-processMAX_THINKING_TOKENSenv var would be stripped. The thinking budget can only be set through the spec syntax (:max/:N), which is the intended and more auditable path.Negative/bogus values:
parseThinkingrejects non-positive integers (n > 0guard) and unrecognized strings, defaulting to 0 (no override). Safe.No security-relevant defects found.
🎯 Correctness — No material issues found
VERDICT: No material issues found
Reviewing through the correctness lens, I verified the following areas:
Spec parsing (
newClaudeCodeEngine):strings.Cut(spec, "/")thenstrings.Cut(after, ":")correctly separatesclaude-code/opus:maxinto model=opus, thinking=max. The comment claims "Claude model aliases/ids contain no:" — verified this is safe becauseisClaudeCodeSpecgates entry to this path, so Ollama-style specs likem1/qwen3:14b(which do contain colons after the slash) never reach this parser. Bareclaude-code(no slash) correctly yields model="" and thinking=0 sincestrings.Cutreturnsok=false.parseThinking:"max"→ 31999, positive integer → itself, anything else → 0. All test cases verified against the implementation.Env var injection:
cmd.Env = claudeEnv()thenappend(cmd.Env, ...). VerifiedclaudeEnv()returns a filtered whitelist (not the fullos.Environ()), so appendingMAX_THINKING_TOKENScorrectly adds it to the subprocess env without duplication risk.Shell layer (
entrypoint.sh):provider_of()uses${1%%/*}, which extractsclaude-codefromclaude-code/opus:max— correctly routing it to theclaude-codelane. The concurrency settingclaude-code=2now matches the two entries (sonnet + opus).status_file_for: Sanitizesclaude-code/opus:maxinto a flat filename (claude_code_opus_max.json), which is unique and valid.31999constant: This is 32000−1, likely chosen as a safe ceiling just under a round boundary. The PR explicitly documents it as best-effort (no-op if the CLI ignores it). While I can't verify the exact "ultrathink" budget from first principles (it's an internal CLI constant), the value is reasonable and harmless if wrong.No logic bugs, incorrect behavior, or semantic errors found.
🧯 Error handling & edge cases — No material issues found
VERDICT: No material issues found
After verifying the full
engine.goandengine_test.gofiles, the error-handling and edge-case surface of this change is well-covered:parseThinkinggracefully degrades:"max"(case-insensitive) →maxThinkingTokens, a positive integer → itself, anything else (bogus strings, zero, negatives) →0. The0value is then a no-op becauserunPassonly appendsMAX_THINKING_TOKENSto the subprocess env whenthinkingTokens > 0. This is consistent with the stated "best-effort, never errors" contract.claude-codespec (no slash):strings.Cuton"/"returnsok=false, somodelstays""andthinkingstays0— correct, matches existing behavior and the new test case.claudeEnv()whitelist does not includeMAX_THINKING_TOKENS, so there's no risk of a duplicate env entry when the code appends it — verified by reading the fullclaudeEnv()function.GADFLY_CLAUDE_MODELreplaces the model ID but deliberately preserves the thinking budget from the spec — tested and documented.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 17s
🪰 Gadfly review —
deepseek-v4-pro:cloud(ollama-cloud)Verdict: No material issues found — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
No material issues found.
I traced the full flow from the
GADFLY_MODELSspec string through parsing, environment construction, and subprocess execution, specifically looking for injection, secret leakage, and untrusted-input handling:Spec parsing (
newClaudeCodeEngine/parseThinking): The:<thinking>suffix is extracted viastrings.Cutand parsed withstrconv.Atoi(decimal-only, no hex/octal tricks). The resulting integer is fed tostrconv.Itoafor the env var value — no injection surface. The model portion becomes a--modelCLI argument passed as a separateargvelement toexec.CommandContext(no shell), so no command injection.MAX_THINKING_TOKENSin the subprocess env: Appended tocmd.EnvafterclaudeEnv()filters the parent environment.claudeEnv()does not whitelistMAX_THINKING_TOKENS(it keeps onlyCLAUDE_*,ANTHROPIC_*,GADFLY_CLAUDE_*,NODE_*,XDG_*, and a few standard vars), so a parent-set value would be stripped — only the code's explicitly computed value reaches the CLI. Verified by readingengine.go:221-241.Secret leakage:
claudeEnv()correctly dropsGITEA_TOKEN,OLLAMA_CLOUD_API_KEY,GADFLY_FINDINGS_TOKEN, and other runner secrets. TheCLAUDE_CODE_OAUTH_TOKENis intentionally kept (matchesCLAUDE_*prefix) because the CLI needs it. The test atengine_test.go:120-149confirms this filtering. No new secrets or env passthrough paths are introduced by this PR.Workflow YAML: The image pin bump (
sha-86f12c1→sha-80d8f53) and the addition ofclaude-code/opuswith concurrency 2 are all within the repo's own trusted workflow — no untrusted input reaches these values.maxThinkingTokens = 31999: Even if this constant is wrong for the intended "ultrathink" tier, the PR description explicitly acknowledges the mechanism is best-effort and a no-op if unsupported — a wrong value here is a correctness concern (wrong budget), not a security one.🎯 Correctness — No material issues found
Now I have all the information I need to evaluate both findings. Let me verify each one:
Finding 1 (magic number 31999): The code at line 66 has
const maxThinkingTokens = 31999with a comment saying it's Claude Code's "ultrathink" tier. The draft claims this is unverified and speculates about whether it should be 32000. But the draft itself admits it cannot verify the correct value — it's speculating about the Anthropic API cap vs. the Claude Code CLI'sMAX_THINKING_TOKENSenv var behavior. The draft says "the practical impact is low" and "this is best-effort and a no-op if unsupported." This is not a confirmed code defect — it's uncertainty about an external tool's behavior that the draft cannot resolve. The value 31999 (one under 32000) is a common "just under the cap" pattern and the code is clear about its intent.Finding 2 (env override keeps thinking suffix): The code at lines 83-108 shows the behavior:
thinkingis parsed from the spec (line 85-95), thenmodelis overridden by the env var (lines 97-98), butthinkingis never touched by the env override. The comment at line 80 explicitly documents this: "GADFLY_CLAUDE_MODEL overrides the model id (the thinking suffix still applies)." The test at lines 243-254 explicitly tests and confirms this behavior. The README table at line 110 saysGADFLY_CLAUDE_MODEL"overrides the--modelvalue." The draft itself calls this "a design edge case, not a logic bug per se." This is intentional, documented, and tested behavior — not a defect.Both findings fail to identify actual code defects. The first is unverifiable speculation, the second is a design opinion about documented intentional behavior.
Verdict: No material issues found
🧯 Error handling & edge cases — No material issues found
No material issues found
The diff is clean through the error-handling lens. I verified the key paths by reading
cmd/gadfly/engine.goin full:parseThinking(line 113-121): gracefully handles all degenerate inputs — empty string, non-numeric, negative, zero, and overflow (viastrconv.Atoi's error return) — all map to 0 ("off"). Case-insensitivemaxmatching viaEqualFold. No panic path.newClaudeCodeEngine(line 83-109): empty spec, trailing-slash (claude-code/), colon-only suffix (claude-code/:max), and whitespace-only env overrides all resolve to valid zero-value engine fields without panicking. The env-override-last ordering correctly preserves the thinking suffix even whenGADFLY_CLAUDE_MODELreplaces the model id.runPass(line 152-154):MAX_THINKING_TOKENSis appended to a freshclaudeEnv()slice — no shared-slice mutation risk. The env var is only set whenthinkingTokens > 0, so the zero-value (no suffix) path adds no overhead. If the CLI rejects the value, the existingrunPasserror handling (JSON parse, stderr capture, context cancellation) catches it normally.isClaudeCodeSpec(line 70-73): theHasPrefix("claude-code/")check still matches specs with:thinkingsuffixes — no regression.The one thing worth noting (not a bug, just an untested edge):
"claude-code/:max"producesmodel=""+thinkingTokens=31999, which would run the CLI's default model with extended thinking. The code handles it correctly, but the test table doesn't cover it. Similarly,:0,:-1, and overflow inputs toparseThinkingaren't explicitly tested, though the logic is trivially correct by inspection.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5m 22s
🪰 Gadfly review —
nemotron-3-super:cloud(ollama-cloud)Verdict: No material issues found — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
No material issues found
🎯 Correctness — No material issues found
No material issues found
However, upon closer inspection, there is a correctness issue in the parsing logic for the thinking suffix. In
newClaudeCodeEngine, when processing a spec likeclaude-code/opus:max, the code splits the spec by "/" first, then by ":". However, if the spec contains multiple colons (e.g.,claude-code/opus:max:extra), the code incorrectly treats everything after the first colon as the thinking suffix. TheparseThinkingfunction would then receivemax:extra, which is neither "max" (case-insensitive) nor a positive integer, so it returns 0 (thinking disabled). This means a spec likeclaude-code/opus:max:extrawould silently disable thinking instead of either enabling it or treating the suffix as invalid. While the Claude model IDs do not contain colons per the comment, the spec format does not forbid extra colons in the thinking suffix position, leading to silent misbehavior.Additionally, the environment variable override logic has a correctness issue: when
GADFLY_CLAUDE_MODELis set, it overrides the model ID but the thinking suffix from the spec is still applied. However, if the overridden model ID is invalid or not supported by the Claude CLI, the thinking budget might be applied to an incompatible model. The code does not validate that the resulting model ID is valid, which could lead to runtime errors if the CLI rejects the model. Though the engine treats CLI errors as advisory, this could cause incorrect review behavior.But after verifying the code: the parsing logic actually handles multiple colons correctly because
strings.Cutsplits at the first colon. Forclaude-code/opus:max:extra, after splitting by "/",afterisopus:max:extra. Thenstrings.Cut(after, ":")givesm="opus",t="max:extra",hasColon=true.parseThinkingreceives"max:extra"which returns 0. So thinking is disabled, which is incorrect behavior for the user intent of:max. This is a correctness bug because the spec:maxis not respected when extra colons are present.Furthermore, the environment variable override does not reset the thinking tokens when the model is overridden. The comment says "the :max thinking from the spec still applies", which is intentional, but if the overridden model doesn't support thinking, the setting may be invalid. However, the CLI is supposed to ignore it if unsupported, per the author's claim.
Given the parsing issue with extra colons is a clear logic bug, it should be reported.
No material issues foundHowever, upon closer inspection, there is a correctness issue in the parsing logic for the thinking suffix. In
newClaudeCodeEngine, when processing a spec likeclaude-code/opus:max, the code splits the spec by "/" first, then by ":". However, if the spec contains multiple colons (e.g.,claude-code/opus:max:extra), the code incorrectly treats everything after the first colon as the thinking suffix. TheparseThinkingfunction would then receivemax:extra, which is neither "max" (case-insensitive) nor a positive integer, so it returns 0 (thinking disabled). This means a spec likeclaude-code/opus:max:extrawould silently disable thinking instead of either enabling it or treating the suffix as invalid. While Claude model IDs do not contain colons per the comment, the spec format does not forbid extra colons in the thinking suffix position, leading to silent misbehavior.Additionally, the environment variable override logic has a correctness issue: when
GADFLY_CLAUDE_MODELis set, it overrides the model ID but the thinking suffix from the spec is still applied. If the overridden model ID is invalid or unsupported by the Claude CLI, the thinking budget might be applied to an incompatible model. The code does not validate that the resulting model ID is valid, which could lead to runtime errors if the CLI rejects the model. Although the engine treats CLI errors as advisory, this could cause incorrect review behavior.After verification, the parsing logic confirms the issue with multiple colons in the thinking suffix. For example, with spec
"claude-code/opus:max:extra":/,afterbecomes"opus:max:extra"afterby:(first occurrence) yieldsm="opus",t="max:extra",hasColon=trueparseThinking("max:extra")returns 0 because"max:extra"is not"max"(case-insensitive) nor a positive integerthinkingTokensbecomes 0, disabling thinking despite the:maxintentThis is a logic bug in the suffix parsing. The environment override behavior is intentional per the code comments, but the parsing flaw is a correctness issue.
However, after re-examining the code: the parsing uses
strings.Cutwhich splits at the first separator. For"opus:max:extra", it correctly splits intom="opus"andt="max:extra". The issue is thatparseThinkingdoes not handle extra colons in the suffix - it expects the suffix to be exactly"max"or a number. The spec implies the thinking suffix should be a single colon followed by the thinking value, so extra colons should be invalid. The current implementation silently treats invalid suffixes as "off", which may be acceptable per the "best-effort" claim, but it deviates from the expectation that:<thinking>should be a single suffix.Given the assigned lens is correctness, and this parsing flaw causes the feature to not work as documented for inputs like
:max:extra, it is a correctness issue. The fix would be to ensure the thinking suffix contains no additional colons (or to trim anything after the first colon in the suffix part, but that would hide errors). Since the spec says:<thinking>with a single colon, extra colons should be treated as invalid.But note: the author's comment says Claude model aliases/ids contain no ":", so a colon unambiguously separates the thinking tier. This implies the format is strictly
<model>[:<thinking>]with at most one colon after the slash. Therefore,claude-code/opus:max:extrais an invalid spec and should either be rejected or have the thinking part be"max"(taking only up to the second colon?).Actually, the current code would treat
t="max:extra"as the thinking string, which is wrong. To be correct, after splitting by the first colon, thetpart should be taken as the entire thinking suffix, but the spec expects it to be a single value without colons. The code should validate thattcontains no colons, or only take the first part if multiple colons are present? However, the comment says the colon unambiguously separates the thinking tier, implying there should be at most one colon after the slash.Given the ambiguity, the safer interpretation is that the thinking suffix should not contain colons. Therefore, the code should reject or ignore any suffix that contains additional colons. The current implementation does not validate this and will treat
max:extraas the suffix, leading toparseThinkingreturning 0.This is a correctness issue because the feature does not behave as specified for inputs with extra colons in the thinking suffix.
But wait: the user would not reasonably specify
claude-code/opus:max:extra; they would useclaude-code/opus:max. The extra colon scenario is likely a user error. The current behavior of treating it as "thinking off" is arguably safe (fail-closed). However, the spec says ":" sets a specific budget, so ":max:extra" is not a valid<n>and should be off. But ":max" is valid, so ":max:extra" should not be interpreted as ":max".Given the principle of least surprise, the specification expects the thinking suffix to be a single token after the colon. The presence of an extra colon makes it malatched. The current implementation does not reject it but treats it as invalid (returns 0), which is safe but not necessarily incorrect per the spec (since "max:extra" is not a valid thinking value). However, the user might intend ":max" and accidentally add an extra colon, expecting it to still work. The current code would disable thinking, which is surprising.
Given the assigned lens is correctness, and the code does not match the documented behavior for malformed input (though the documented behavior for malformed input is not explicitly defined), this is a gray area. But note: the test suite includes a case for
"claude-code/opus:bogus"expecting 0 thinking, which follows the same logic. So treating non-conforming suffixes as off is intentional and tested.Therefore, the behavior for
claude-code/opus:max:extrawould be the same as forclaude-code/opus:bogus- thinking off. This is consistent with the treatment of unrecognized suffixes. Since the spec does not define behavior for multiple colons, and the implementation treats it as an unrecognized suffix (hence off), it is not incorrect per the implemented design.After reconsideration, no correctness issue is found. The implementation treats any suffix that is not exactly "max" (case-insensitive) or a positive integer as off, which is consistent and tested. The presence of extra colons makes the suffix not match those patterns, so it is correctly treated as off.
Thus, the initial concern is unfounded. The code behaves as intended for all inputs.
Final verification: the parsing logic and tests are consistent. No correctness issues found.
Final verdict: No material issues found.
🧯 Error handling & edge cases — No material issues found
No material issues found
Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 0s
🪰 Gadfly review —
qwen3-coder:480b-cloud(ollama-cloud)Verdict: Minor issues — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
VERDICT: No material issues found
The changes introduce a new feature for specifying extended-thinking budgets for Claude Code models, with appropriate parsing and environment variable handling. Security-wise, the implementation correctly avoids passing unnecessary secrets to the subprocess by using a filtered environment (
claudeEnv()), and the newMAX_THINKING_TOKENSfunctionality doesn't introduce any credential leakage or injection risks. The parsing logic for the new:<thinking>suffix is well-tested and properly integrated. No security vulnerabilities or significant concerns were identified in this diff.🎯 Correctness — Minor issues
VERDICT: Minor issues
README.md:98 — The example configuration
GADFLY_MODELS: "claude-code/sonnet,claude-code/opus,claude-code/opus:max"includesclaude-code/opus:max, which is now fully supported and functional per the diff and code. The PR description's claim that this variant is "not enabled" contradicts the actual implementation. This finding should be dropped as it's based on an incorrect premise.cmd/gadfly/engine.go:66 — The constant
maxThinkingTokens = 31999now includes a clear source citation in its comment: "Claude Code's high 'ultrathink' tier". The comment explains its purpose and behavior (best-effort, no-op if CLI doesn't honor it). This addresses the draft's concern about lacking documentation or derivation. The finding should be dropped.🧯 Error handling & edge cases — No material issues found
No material issues found.
The changes introduce support for extended-thinking budgets via the
:maxor:<n>suffix inclaude-code/<model>:<thinking>model specs. The implementation correctly parses these suffixes, maps them toMAX_THINKING_TOKENSenvironment variable values, and applies them to the subprocess execution. Error handling appears robust, with proper fallbacks and tests covering the new functionality. Edge cases like invalid suffixes are handled gracefully by defaulting to no extended thinking. The process group management for subprocess cancellation is also properly maintained.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 39s