fix: per-lens timeout, errored-verdict honesty, accurate provider label, tighter lens focus, run timing
Build & push image / build-and-push (push) Successful in 8s

Five fixes, several surfaced by the live bake-off:

- PER-LENS TIMEOUT (critical): GADFLY_TIMEOUT_SECS now applies to EACH specialist
  (own context), not shared across the suite. A slow model (e.g. a 35B local MLX)
  was exhausting the whole 600s budget on lens 1, leaving the rest "step 0:
  context deadline exceeded". Default lowered to 300s (per-lens). cmd/gadfly/main.go.
- ERRORED VERDICT: a lens whose review pass failed no longer counts as "clean".
  Header shows "· ⚠️ N/M lens(es) errored" (or "Review incomplete — all lenses
  errored"); the section reads "⚠️ could not complete". consolidate.go.
- PROVIDER LABEL: the comment header now shows the model's ACTUAL backend from the
  spec ("m1pro/qwen3.6:35b-mlx" -> m1pro), not the global GADFLY_PROVIDER default
  (was wrongly "ollama-cloud" for local models). scripts/run.sh.
- LENS FOCUS: base prompt no longer licenses "report anything serious"; each lens
  stays in its lane, says "nothing in my area" rather than re-reporting another
  lens's bug, with a one-line "Outside my lens:" escape hatch. The re-derive-
  constants discipline is now lane-scoped, not "every lens". system-prompt.txt + specialists.go.
- RUN TIMING: run.sh posts a " Reviewing…" placeholder at model start and updates
  it with "⏱️ reviewed in 1m 23s" on finish, for per-model comparison.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Steve Dudenhoeffer
2026-06-25 20:15:40 -04:00
parent 4b8f9aa39b
commit 49f3623204
5 changed files with 114 additions and 58 deletions
+23 -3
View File
@@ -49,6 +49,7 @@ type specialistResult struct {
spec Specialist spec Specialist
out string out string
verdict verdict verdict verdict
errored bool // the review pass failed (timeout/model error) — not a clean result
} }
// worstVerdict returns the most severe verdict across results. The optional // worstVerdict returns the most severe verdict across results. The optional
@@ -70,17 +71,36 @@ func worstVerdict(results []specialistResult) verdict {
// followed by one verbatim section per specialist. run.sh wraps this with the // followed by one verbatim section per specialist. run.sh wraps this with the
// "🪰 Gadfly review — <model>" header and the advisory footer. // "🪰 Gadfly review — <model>" header and the advisory footer.
func renderConsolidated(results []specialistResult) string { func renderConsolidated(results []specialistResult) string {
errored := 0
for _, r := range results {
if r.errored {
errored++
}
}
headline := "Verdict: " + worstVerdict(results).label()
if len(results) > 0 && errored == len(results) {
// Every lens errored — do NOT report this as "clean".
headline = "Review incomplete — all lenses errored"
} else if errored > 0 {
headline += fmt.Sprintf(" · ⚠️ %d/%d lens(es) errored", errored, len(results))
}
var b strings.Builder var b strings.Builder
fmt.Fprintf(&b, "**Verdict: %s** — %d reviewers: %s\n", fmt.Fprintf(&b, "**%s** — %d reviewers: %s\n",
worstVerdict(results).label(), len(results), strings.Join(specialistNames(results), ", ")) headline, len(results), strings.Join(specialistNames(results), ", "))
for _, r := range results { for _, r := range results {
body := strings.TrimSpace(r.out) body := strings.TrimSpace(r.out)
if body == "" { if body == "" {
body = "_(no output)_" body = "_(no output)_"
} }
summary := r.verdict.label()
if r.errored {
summary = "⚠️ could not complete"
}
fmt.Fprintf(&b, "\n<details><summary><b>%s</b> — %s</summary>\n\n%s\n\n</details>\n", fmt.Fprintf(&b, "\n<details><summary><b>%s</b> — %s</summary>\n\n%s\n\n</details>\n",
r.spec.Title, r.verdict.label(), body) r.spec.Title, summary, body)
} }
return strings.TrimRight(b.String(), "\n") return strings.TrimRight(b.String(), "\n")
} }
+24 -17
View File
@@ -69,10 +69,14 @@ import (
const ( const (
defaultMaxSteps = 24 defaultMaxSteps = 24
// defaultTimeoutSecs is the overall deadline shared by ALL specialist passes // defaultTimeoutSecs is the deadline for EACH specialist's passes (review +
// in one invocation; the default suite runs several lenses sequentially. // recheck). It is per-lens, not shared across the suite, so one slow lens
defaultTimeoutSecs = 600 // (e.g. a big local model) can't starve the others. Slow local models may
// need this raised (and a higher job timeout to match the suite total).
defaultTimeoutSecs = 300
defaultMaxDiffChars = 60000 defaultMaxDiffChars = 60000
// autoSelectTimeout bounds the dynamic specialist-selection call.
autoSelectTimeout = 120 * time.Second
// defaultWrapUpReserve is how many steps before the cap the agent is told // defaultWrapUpReserve is how many steps before the cap the agent is told
// to stop investigating and write its final answer. Reserving a margin is // to stop investigating and write its final answer. Reserving a margin is
// what keeps a thorough reviewer from spending its whole budget on tool // what keeps a thorough reviewer from spending its whole budget on tool
@@ -148,17 +152,15 @@ func run() error {
fmt.Fprintln(os.Stderr, "gadfly:", e) fmt.Fprintln(os.Stderr, "gadfly:", e)
} }
timeout := time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()
// Dynamic selection: a (cheap) model picks the lenses this diff needs. // Dynamic selection: a (cheap) model picks the lenses this diff needs.
if auto { if auto {
selector, serr := resolveSelectorModel(mdl) selector, serr := resolveSelectorModel(mdl)
if serr != nil { if serr != nil {
return fmt.Errorf("resolve selector model: %w", serr) return fmt.Errorf("resolve selector model: %w", serr)
} }
picked, aerr := autoSelectSpecialists(ctx, selector, os.Getenv("GADFLY_TITLE"), os.Getenv("GADFLY_BODY"), diff, registry) selCtx, cancel := context.WithTimeout(context.Background(), autoSelectTimeout)
picked, aerr := autoSelectSpecialists(selCtx, selector, os.Getenv("GADFLY_TITLE"), os.Getenv("GADFLY_BODY"), diff, registry)
cancel()
if aerr != nil { if aerr != nil {
fmt.Fprintln(os.Stderr, "gadfly: auto-select failed; falling back to the default suite:", aerr) fmt.Fprintln(os.Stderr, "gadfly: auto-select failed; falling back to the default suite:", aerr)
specialists = suiteFromRegistry(registry, defaultSuite) specialists = suiteFromRegistry(registry, defaultSuite)
@@ -176,24 +178,29 @@ func run() error {
task := buildTask(diff) task := buildTask(diff)
results := make([]specialistResult, 0, len(specialists)) results := make([]specialistResult, 0, len(specialists))
for _, sp := range specialists { for _, sp := range specialists {
out := reviewWithSpecialist(ctx, mdl, fsTools, base, sp, task, diff) out, errored := reviewWithSpecialist(mdl, fsTools, base, sp, task, diff)
results = append(results, specialistResult{spec: sp, out: out, verdict: parseVerdict(out)}) results = append(results, specialistResult{spec: sp, out: out, verdict: parseVerdict(out), errored: errored})
} }
fmt.Println(renderConsolidated(results)) fmt.Println(renderConsolidated(results))
return nil return nil
} }
// reviewWithSpecialist runs one lens end-to-end: a review pass under the // reviewWithSpecialist runs one lens end-to-end under its OWN timeout, so a slow
// specialist's composed prompt, then the shared adversarial recheck pass. A // model on one lens can't starve the others: a review pass under the
// failed pass is rendered as an inline notice (advisory — one lens failing // specialist's composed prompt, then the shared adversarial recheck pass. The
// never sinks the others or the job). // returned bool is true when the review pass failed (rendered as an inline
func reviewWithSpecialist(ctx context.Context, mdl llm.Model, fsTools *repoFS, base string, sp Specialist, task, diff string) string { // notice — advisory; one lens failing never sinks the others or the job).
func reviewWithSpecialist(mdl llm.Model, fsTools *repoFS, base string, sp Specialist, task, diff string) (string, bool) {
timeout := time.Duration(envInt("GADFLY_TIMEOUT_SECS", defaultTimeoutSecs)) * time.Second
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()
draft, err := runAgent(ctx, mdl, fsTools, composeSpecialistPrompt(base, sp), task, draft, err := runAgent(ctx, mdl, fsTools, composeSpecialistPrompt(base, sp), task,
envInt("GADFLY_MAX_STEPS", defaultMaxSteps)) envInt("GADFLY_MAX_STEPS", defaultMaxSteps))
if err != nil { if err != nil {
fmt.Fprintf(os.Stderr, "gadfly: specialist %q review pass failed: %v\n", sp.Name, err) fmt.Fprintf(os.Stderr, "gadfly: specialist %q review pass failed: %v\n", sp.Name, err)
return fmt.Sprintf("⚠️ This reviewer failed to complete: %v", err) return fmt.Sprintf("⚠️ This reviewer failed to complete: %v", err), true
} }
final := draft final := draft
@@ -206,7 +213,7 @@ func reviewWithSpecialist(ctx context.Context, mdl llm.Model, fsTools *repoFS, b
final = rechecked final = rechecked
} }
} }
return final return final, false
} }
// runAgent runs one agent pass (its own fresh toolbox over the sandbox) and // runAgent runs one agent pass (its own fresh toolbox over the sandbox) and
+7 -2
View File
@@ -192,8 +192,13 @@ func specialistNamesOf(specs []Specialist) []string {
// composeSpecialistPrompt appends a specialist's lens to the base system prompt. // composeSpecialistPrompt appends a specialist's lens to the base system prompt.
func composeSpecialistPrompt(base string, sp Specialist) string { func composeSpecialistPrompt(base string, sp Specialist) string {
return strings.TrimRight(base, "\n") + return strings.TrimRight(base, "\n") +
"\n\n## Your review lens: " + sp.Title + "\n" + "\n\n## Your assigned lens " + sp.Title + "\n" +
"Focus your review on this lens; other reviewers cover the rest. " + sp.Focus "Review the change specifically and ONLY through this lens. Scrutinize it for:\n" +
sp.Focus +
"\n\nStay in this lane: other lenses (correctness, security, performance, etc.) are reviewed " +
"separately, so don't duplicate their findings here. If nothing in your lens is materially " +
"wrong, reply with the \"No material issues found\" verdict for this lens — do not reach for " +
"another lens's issue just to have something to say."
} }
func loadFileConfig(repoDir string) (fileConfig, bool, error) { func loadFileConfig(repoDir string) (fileConfig, bool, error) {
+48 -29
View File
@@ -42,6 +42,14 @@ MAX_DIFF_CHARS="${MAX_DIFF_CHARS:-60000}"
MARKER="<!-- gadfly-review:${PROVIDER}:${MODEL} -->" MARKER="<!-- gadfly-review:${PROVIDER}:${MODEL} -->"
say() { echo "[gadfly-review:${PROVIDER}:${MODEL}] $*" >&2; } say() { echo "[gadfly-review:${PROVIDER}:${MODEL}] $*" >&2; }
# Display the model's ACTUAL backend: the provider segment of the spec
# ("m1pro/qwen3.6:35b-mlx" -> "m1pro"); a bare id uses GADFLY_PROVIDER (default
# ollama-cloud). This is what the comment header shows, not the run.sh lane.
case "$MODEL" in
*/*) MODEL_PROVIDER="${MODEL%%/*}" ;;
*) MODEL_PROVIDER="${GADFLY_PROVIDER:-ollama-cloud}" ;;
esac
# jq is required for payload building / response parsing; install if missing. # jq is required for payload building / response parsing; install if missing.
if ! command -v jq >/dev/null 2>&1; then if ! command -v jq >/dev/null 2>&1; then
say "jq not found; attempting install" say "jq not found; attempting install"
@@ -55,6 +63,32 @@ fi
# binary / agy, not here.) # binary / agy, not here.)
API_TIMEOUT="--connect-timeout 20 --max-time 30" API_TIMEOUT="--connect-timeout 20 --max-time 30"
# upsert_comment BODY — create or update (by MARKER) this model's single comment.
upsert_comment() {
local body="$1" post_body existing_id page=1 cmts
post_body="$(jq -n --arg b "$body" '{body:$b}')"
existing_id=""
while [ "$page" -le 10 ]; do
cmts="$(curl $API_TIMEOUT -fsS -H "Authorization: token ${GITEA_TOKEN}" \
"${GITEA_API}/issues/${PR}/comments?limit=50&page=${page}" || echo '[]')"
[ "$(echo "$cmts" | jq 'length')" = "0" ] && break
existing_id="$(echo "$cmts" | jq -r --arg m "$MARKER" \
'.[] | select(.body != null and (.body | startswith($m))) | .id' | head -n1)"
[ -n "$existing_id" ] && break
page=$((page+1))
done
if [ -n "$existing_id" ]; then
curl $API_TIMEOUT -sS -X PATCH -H "Authorization: token ${GITEA_TOKEN}" -H "Content-Type: application/json" \
"${GITEA_API}/issues/comments/${existing_id}" -d "$post_body" >/dev/null
else
curl $API_TIMEOUT -sS -X POST -H "Authorization: token ${GITEA_TOKEN}" -H "Content-Type: application/json" \
"${GITEA_API}/issues/${PR}/comments" -d "$post_body" >/dev/null
fi
}
# fmt_duration SECONDS -> "1m 23s" / "45s"
fmt_duration() { if [ "$1" -ge 60 ]; then echo "$(($1/60))m $(($1%60))s"; else echo "$1s"; fi; }
# --- fetch PR context ------------------------------------------------------- # --- fetch PR context -------------------------------------------------------
say "fetching PR #${PR} context" say "fetching PR #${PR} context"
DIFF="$(curl $API_TIMEOUT -fsS -H "Authorization: token ${GITEA_TOKEN}" "${GITEA_API}/pulls/${PR}.diff" || true)" DIFF="$(curl $API_TIMEOUT -fsS -H "Authorization: token ${GITEA_TOKEN}" "${GITEA_API}/pulls/${PR}.diff" || true)"
@@ -81,6 +115,12 @@ SYS="$(cat "${SCRIPT_DIR}/system-prompt.txt")"
USR="$(printf 'PR #%s: %s\n\nDescription:\n%s\n\nUnified diff to review:\n```diff\n%s\n```%s' \ USR="$(printf 'PR #%s: %s\n\nDescription:\n%s\n\nUnified diff to review:\n```diff\n%s\n```%s' \
"$PR" "$TITLE" "$BODY" "$DIFF" "$TRUNC_NOTE")" "$PR" "$TITLE" "$BODY" "$DIFF" "$TRUNC_NOTE")"
# --- announce start (placeholder comment) -----------------------------------
START_TS="$(date +%s)"
say "starting review with ${MODEL}"
upsert_comment "$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n⏳ Reviewing… this comment will update with findings and run time.' \
"$MARKER" "$MODEL" "$MODEL_PROVIDER")"
# --- call the model --------------------------------------------------------- # --- call the model ---------------------------------------------------------
REVIEW="" REVIEW=""
case "$PROVIDER" in case "$PROVIDER" in
@@ -96,7 +136,7 @@ case "$PROVIDER" in
if [ -n "${OLLAMA_CLOUD_API_KEY:-}" ] && [ -z "${OLLAMA_API_KEY:-}" ]; then if [ -n "${OLLAMA_CLOUD_API_KEY:-}" ] && [ -z "${OLLAMA_API_KEY:-}" ]; then
export OLLAMA_API_KEY="$OLLAMA_CLOUD_API_KEY" export OLLAMA_API_KEY="$OLLAMA_CLOUD_API_KEY"
fi fi
GADFLY_PROVIDER_EFF="${GADFLY_PROVIDER:-ollama-cloud}" GADFLY_PROVIDER_EFF="$MODEL_PROVIDER"
# Only the default cloud provider strictly needs a key up front; local Ollama # Only the default cloud provider strictly needs a key up front; local Ollama
# and other providers either need none or read their own standard env var. # and other providers either need none or read their own standard env var.
@@ -152,31 +192,10 @@ $(tail -c 1500 agy.err 2>/dev/null)
say "unknown provider: ${PROVIDER}"; exit 1 ;; say "unknown provider: ${PROVIDER}"; exit 1 ;;
esac esac
# --- assemble comment ------------------------------------------------------- # --- assemble + post final comment (with run time) --------------------------
COMMENT="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s\n\n<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge.</sub>' \ ELAPSED="$(( $(date +%s) - START_TS ))"
"$MARKER" "$MODEL" "${GADFLY_PROVIDER_EFF:-$PROVIDER}" "$REVIEW")" DUR="$(fmt_duration "$ELAPSED")"
POST_BODY="$(jq -n --arg b "$COMMENT" '{body:$b}')" COMMENT="$(printf '%s\n### 🪰 Gadfly review — `%s` (%s)\n\n%s\n\n<sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in %s</sub>' \
"$MARKER" "$MODEL" "$MODEL_PROVIDER" "$REVIEW" "$DUR")"
# --- upsert by marker ------------------------------------------------------- upsert_comment "$COMMENT"
EXISTING_ID="" say "done in ${DUR}"
page=1
while [ "$page" -le 10 ]; do
CMTS="$(curl $API_TIMEOUT -fsS -H "Authorization: token ${GITEA_TOKEN}" \
"${GITEA_API}/issues/${PR}/comments?limit=50&page=${page}" || echo '[]')"
[ "$(echo "$CMTS" | jq 'length')" = "0" ] && break
EXISTING_ID="$(echo "$CMTS" | jq -r --arg m "$MARKER" \
'.[] | select(.body != null and (.body | startswith($m))) | .id' | head -n1)"
[ -n "$EXISTING_ID" ] && break
page=$((page+1))
done
if [ -n "$EXISTING_ID" ]; then
say "updating existing comment ${EXISTING_ID}"
curl $API_TIMEOUT -sS -X PATCH -H "Authorization: token ${GITEA_TOKEN}" -H "Content-Type: application/json" \
"${GITEA_API}/issues/comments/${EXISTING_ID}" -d "$POST_BODY" >/dev/null
else
say "creating new comment"
curl $API_TIMEOUT -sS -X POST -H "Authorization: token ${GITEA_TOKEN}" -H "Content-Type: application/json" \
"${GITEA_API}/issues/${PR}/comments" -d "$POST_BODY" >/dev/null
fi
say "done"
+12 -7
View File
@@ -1,9 +1,14 @@
You are Gadfly, an ADVERSARIAL code reviewer. Your job is to find real problems in the You are Gadfly, an ADVERSARIAL code reviewer. Your job is to find real problems in the
pull request below — not to praise it. A gadfly does not let things slide. pull request below — not to praise it. A gadfly does not let things slide.
You review through ONE assigned lens (given at the end of this prompt). Stay in your lane — You review through ONE assigned lens (named at the end of this prompt). Evaluate the change
other reviewers cover the other angles — but do report anything clearly serious you happen THROUGH THAT LENS — that is your job. A separate reviewer independently covers each other
to notice. angle, so problems outside your lens WILL be caught without you. Do not restate a finding that
plainly belongs to another lens just to have something to report — that only creates noise. If
your lens turns up nothing material, say so plainly; an honest "nothing in my area" beats
re-reporting the obvious bug every other lens already sees. Only exception: if you spot a
SEVERE issue clearly outside your lens, you may add ONE line prefixed "Outside my lens:" — but
your actual findings must stay within your lens.
You are AGENTIC: you have read-only tools over the repository AT THIS PR's checked-out You are AGENTIC: you have read-only tools over the repository AT THIS PR's checked-out
state. USE THEM to verify before you report. Do not review the diff in isolation. state. USE THEM to verify before you report. Do not review the diff in isolation.
@@ -23,10 +28,10 @@ Mandatory verification discipline — this is the whole point of giving you tool
- If you cannot confirm a suspicion with the tools, either drop it or clearly label it - If you cannot confirm a suspicion with the tools, either drop it or clearly label it
"unverified" — do NOT present an unchecked guess as a finding. "unverified" — do NOT present an unchecked guess as a finding.
Be skeptical and concrete, and apply your assigned lens rigorously. A recurring, high-value Be skeptical and concrete, and apply your assigned lens rigorously. (If your lens leads you to
discipline regardless of lens: do NOT trust a constant, conversion factor, formula, unit, or a constant, conversion factor, formula, unit, or threshold, don't trust it because it looks
threshold just because it looks reasonable — RE-DERIVE the expected value from first principles reasonable — re-derive the expected value from first principles and compare; plausible-looking
and compare. Plausible-looking magic numbers are where real bugs hide. magic numbers hide real bugs. Pursue this when it's in your lane, not as a reason to leave it.)
Output rules: Output rules:
- Output GitHub-flavored markdown, concise. No filler, no restating the diff. - Output GitHub-flavored markdown, concise. No filler, no restating the diff.