run: fix statusFor — don't relabel a generic error / caller-cancel as timeout (gadfly #11 )

The WithCancelCause+timer rewrite made MaxRuntime surface as Canceled (not DeadlineExceeded), so statusFor's context.Cause(DeadlineExceeded) check could relabel (a) a genuine run error as 'timeout' and (b) a caller cancel/deadline as 'timeout' (was 'cancelled'). Convergent gadfly finding (glm-5.2 + cluster). Fix: keep MaxRuntime as WithTimeout (its DeadlineExceeded propagates → 'timeout', preserving own-timeout vs caller-cancel), add a NESTED WithCancelCause layer only for the kill. statusFor consults context.Cause ONLY for ErrCriticKill; everything else is classified by the run error itself. Tests: generic-error-not-relabeled + caller-cancel-stays-cancelled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run: critic parity — fuller RecordStep + cause-carrying Kill (distinct status)
2026-06-27 17:00:26 -04:00 · 2026-06-27 16:35:13 -04:00 · 2026-06-27 14:53:58 -04:00 · 2026-06-27 14:41:14 -04:00 · 2026-06-27 14:38:48 -04:00 · 2026-06-27 14:16:03 -04:00
14 changed files with 821 additions and 59 deletions
@@ -41,25 +41,32 @@ jobs:
              || github.actor == 'fizi'
              || github.actor == 'dazed'))
    runs-on: ubuntu-latest
-    # 3 cloud models, all concurrent, 3-lens suite. ~12 min typical.
+    # Full fleet: 3 cloud (lens fan-out) + M1/M5 Macs via foreman. The slow local
-    timeout-minutes: 30
+    # lanes dominate wall time, so allow plenty of headroom.
    timeout-minutes: 90
    steps:
-      - uses: docker://gitea.stevedudenhoeffer.com/steve/gadfly:sha-d0de034
+      - uses: docker://gitea.stevedudenhoeffer.com/steve/gadfly:sha-d7f364d
        env:
          GITEA_API: ${{ github.server_url }}/api/v1/repos/${{ github.repository }}
          GITEA_TOKEN: ${{ secrets.GITEA_TOKEN }}
          OLLAMA_CLOUD_API_KEY: ${{ secrets.OLLAMA_CLOUD_API_KEY }}
-          # executus uses CLOUD MODELS ONLY. The local Macs (m1/m5) were dropped:
+          # Local Macs, reached through their foreman queues (native Ollama on the
-          # on a P2-review measurement they took 26–29 min (with lens timeouts)
+          # wire). GADFLY_ENDPOINT_M5 registers provider "m5",
-          # and contributed ZERO real findings — the two cloud models found every
+          # each a foreman-preset Ollama client at the secret's URL, of the form:
-          # genuine bug in 6–12 min. Cloud-only is faster AND higher-signal.
+          #   foreman|https://<foreman-host>|<token>
-          # 3 cloud models. Concurrency now lives in the LENSES, not the models:
+          # Needs an image with foreman provider-type support (this one). If a Mac
-          # one model runs at a time (PROVIDER_CONCURRENCY=1) with its 3 lenses
+          # is offline that model's comment shows an error and the others still post.
-          # concurrent (PROVIDER_LENS_CONCURRENCY=3). So the first model's
+          # (Gitea secrets aren't auto-exposed — map each explicitly.)
-          # comment lands sooner and each model finishes a bit faster, at the
+          GADFLY_ENDPOINT_M5: ${{ secrets.GADFLY_ENDPOINT_M5 }}
-          # cost of the other two models' comments arriving in series after it.
+          # Full fleet: 3 cloud + M1 Pro + M5 Max. The Macs are back so the
-          GADFLY_MODELS: "minimax-m3:cloud,deepseek-v4-flash:cloud,glm-5.2:cloud"
+          # gadfly-reports scoreboard can quantify whether they earn their keep
-          GADFLY_PROVIDER_CONCURRENCY: "ollama-cloud=1"
+          # (they previously took 26–29 min for ZERO real findings — now measured).
          # Cloud concurrency lives in the LENSES: one cloud model at a time
          # (ollama-cloud=1) with its 3 lenses concurrent (LENS ollama-cloud=3) so
          # its comment lands sooner; each Mac runs one model, lenses serial (its
          # foreman queue serializes anyway). All three provider lanes run parallel.
          GADFLY_MODELS: "minimax-m3:cloud,glm-5.2:cloud,glm-5.1:cloud,kimi-k2.7-code:cloud,deepseek-v4-pro:cloud,nemotron-3-super:cloud,gpt-oss:120b-cloud,qwen3-coder:480b-cloud,gemma4:cloud,m5/qwen3.6:35b-mlx"
          GADFLY_PROVIDER_CONCURRENCY: "ollama-cloud=3,m5=1"
          GADFLY_PROVIDER_LENS_CONCURRENCY: "ollama-cloud=3"
          # Default => the 3-lens suite (security, correctness, error-handling).
          # Set the repo var GADFLY_SPECIALISTS to override (csv / "all" / "auto").
@@ -69,6 +76,12 @@ jobs:
          GADFLY_MAX_STEPS: "14"
          # Allow-list for the comment trigger (mirrors the job-level if: guard).
          GADFLY_ALLOWED_USERS: "steve,fizi,dazed"
          # --- findings telemetry: POST runs + findings to the gadfly-reports store ---
          # Advisory & off unless GADFLY_FINDINGS_URL is set; failures only log to
          # stderr and never affect the review. GADFLY_REPO / GADFLY_PR are derived
          # in-container; the URL + token are user-scope secrets.
          GADFLY_FINDINGS_URL: ${{ secrets.GADFLY_FINDINGS_URL }}
          GADFLY_FINDINGS_TOKEN: ${{ secrets.GADFLY_FINDINGS_TOKEN }}
          # --- event context (leave as-is) ---
          EVENT_NAME: ${{ github.event_name }}
          PR: ${{ github.event.pull_request.number || github.event.issue.number || github.event.inputs.pr_number }}
@@ -47,9 +47,10 @@ CORE (majordomo + stdlib):
            toolbox + majordomo loop + compaction +
            run-bounding (V10 detached timeout) + step/
            audit observers + Budget gate; RunnableAgent
-            DTO + nil-safe run.Ports. Follow-ups: wire
+            DTO + nil-safe run.Ports. Palette delegation +
-            Critic/Checkpointer/PaletteSource/Delivery,
+            Critic (monitor/deadline/steer) + Delivery
-            Phases, and the no-tools direct path         [P2]
+            WIRED. Follow-ups: Checkpointer (needs a
            majordomo msg-history hook), Phases           [C0c]
  dispatchguard/  loop/depth/fan-out caps                [P0 ✓]
  pendingattach/  attachment dedupe                      [P0 ✓]
  tool/     registry + 3-stage permissions + ssrf        [P1 ✓]
@@ -10,13 +10,17 @@
 // Mort plugs its LLM critic-agent in as an Escalator; ExtendOnce is the
 // zero-dependency default.
 //
-// NOTE: the executor's call into run.Ports.Critic is a P2 follow-up; this
+// The executor wires run.Ports.Critic (C0b): it feeds the handle activity,
-// battery provides the seam + impl ahead of that wiring.
+// binds the run context to its extendable Deadline, drains its Steer, and polls
 // MaxSteps each step so an Escalator can also raise a long run's step ceiling
 // (Decision.RaiseStepsBy).
 package critic
 import (
 	"context"
 	"errors"
 	"log/slog"
 	"math"
 	"sync"
 	"time"
@@ -38,6 +42,7 @@ type Progress struct {
 type Decision struct {
 	Nudge        []llm.Message // injected before the agent's next turn (a steer)
 	ExtendBy     time.Duration // push the hard deadline out by this much
 	RaiseStepsBy int           // raise the run's tool-dispatch step ceiling by this
 	Kill         bool          // cancel the run now
 	KillReason   string
 }
@@ -136,6 +141,7 @@ func (s *System) Monitor(ctx context.Context, info run.RunInfo, softTimeout time
 		now:          s.now,
 		lastActivity: now,
 		deadline:     now.Add(time.Duration(float64(softTimeout) * s.backstopMul)),
 		maxSteps:     info.MaxIterations, // base ceiling; an Escalator may RaiseStepsBy
 		stopCh:       make(chan struct{}),
 	}
 	go h.watch(ctx, check)
@@ -155,13 +161,17 @@ type handle struct {
 	deadline     time.Time
 	steer        []llm.Message
 	iterations   int
 	maxSteps     int // current tool-dispatch ceiling (base MaxIterations, raised by RaiseStepsBy)
 	lastTool     string
 	killed       bool  // sticky: once an Escalator kills, no later decision un-kills it
 	killCause    error // non-nil once killed; surfaced via KillCause for "killed" status
 	stopped      bool
 	stopCh       chan struct{}
 }
-func (h *handle) RecordStep(iter int) {
+func (h *handle) RecordStep(iter int, _ *llm.Response) {
 	// This battery's Progress tracks iteration count + activity, not per-step
 	// payload, so the response is unused here; a richer Escalator could record it.
 	h.mu.Lock()
 	h.iterations = iter
 	h.lastActivity = h.now()
@@ -192,6 +202,18 @@ func (h *handle) Deadline() time.Time {
 	return h.deadline
 }
 func (h *handle) MaxSteps() int {
 	h.mu.Lock()
 	defer h.mu.Unlock()
 	return h.maxSteps
 }
 func (h *handle) KillCause() error {
 	h.mu.Lock()
 	defer h.mu.Unlock()
 	return h.killCause
 }
 func (h *handle) Stop() {
 	h.mu.Lock()
 	if !h.stopped {
@@ -254,6 +276,11 @@ func (h *handle) tick(ctx context.Context) {
 	}
 	if d.Kill {
 		h.killed = true
 		reason := d.KillReason
 		if reason == "" {
 			reason = "critic killed the run"
 		}
 		h.killCause = errors.New(reason) // surfaced via KillCause → "killed" status
 		h.deadline = h.now()             // immediate hard deadline → executor cancels
 		return                           // ignore any Nudge/ExtendBy paired with a Kill
 	}
@@ -263,4 +290,13 @@ func (h *handle) tick(ctx context.Context) {
 	if d.ExtendBy > 0 {
 		h.deadline = h.deadline.Add(d.ExtendBy)
 	}
 	if d.RaiseStepsBy > 0 {
 		// Overflow-safe: a buggy Escalator returning a huge delta must not wrap
 		// maxSteps negative (which the executor would read as "defer to base").
 		if d.RaiseStepsBy > math.MaxInt-h.maxSteps {
 			h.maxSteps = math.MaxInt
 		} else {
 			h.maxSteps += d.RaiseStepsBy
 		}
 	}
 }
@@ -51,7 +51,7 @@ func TestMonitorEscalatesOncePerIdlePeriodAndExtends(t *testing.T) {
 		t.Error("deadline should have been extended past the original")
 	}
 	// A fresh step re-arms; another idle period escalates again.
-	h.RecordStep(1)
+	h.RecordStep(1, nil)
 	time.Sleep(60 * time.Millisecond)
 	mu.Lock()
 	c2 := calls
@@ -125,6 +125,7 @@ google.golang.org/protobuf v1.23.1-0.20200526195155-81db48ad09cc/go.mod h1:EGpAD
 google.golang.org/protobuf v1.25.0/go.mod h1:9JNX74DMeImyA3h4bdi1ymwjUzf21/xIlbajtzgsN7c=
 google.golang.org/protobuf v1.34.2 h1:6xV6lTsCfpGD21XK49h7MhtcApnLqkfYgPcdHftf6hg=
 google.golang.org/protobuf v1.34.2/go.mod h1:qYOHts0dSfpeUzUFpOMr/WGzszTmLH+DiWniOlNbLDw=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
 gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
 gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
@@ -0,0 +1,124 @@
 package run
 import (
 	"context"
 	"fmt"
 	"time"
 	"gitea.stevedudenhoeffer.com/steve/majordomo/agent"
 	"gitea.stevedudenhoeffer.com/steve/majordomo/llm"
 )
 // criticDeadlineCheck is how often the deadline-watch goroutine polls the
 // critic's hard deadline. Small relative to any realistic soft timeout.
 const criticDeadlineCheck = time.Second
 // criticBinding wires a CriticHandle into a run: the executor forwards activity
 // (steps + tool starts) to it, binds the run's hard cancellation to the critic's
 // extendable deadline, and exposes the critic's Steer messages as an agent
 // RunOption. All methods are nil-safe so the executor can call them
 // unconditionally when no critic is configured.
 type criticBinding struct {
 	h CriticHandle
 }
 // startCritic begins critic monitoring for this run when one is configured and
 // the agent enables it. It launches a goroutine that cancels runCtx (via
 // cancelCause) the moment the critic's hard deadline passes — the critic may
 // extend that deadline, so a healthy-but-slow run is given room while a hung one
 // is killed. When the deadline passes because the critic KILLED the run
 // (KillCause() != nil), the cancellation cause is ErrCriticKill (→ status
 // "killed"); when the backstop simply expired, it is context.DeadlineExceeded (→
 // "timeout"). Returns (nil, no-op stop) when there is no critic. The caller MUST
 // defer the returned stop.
 func (e *Executor) startCritic(runCtx context.Context, cancelCause context.CancelCauseFunc, ra RunnableAgent, info RunInfo) (*criticBinding, func()) {
 	noop := func() {}
 	if e.cfg.Ports.Critic == nil || !ra.Critic.Enabled {
 		return nil, noop
 	}
 	soft := e.cfg.Defaults.CriticSoftTimeout
 	if soft <= 0 {
 		soft = 90 * time.Second // defensive: withFallbacks normally guarantees >0
 	}
 	h := e.cfg.Ports.Critic.Monitor(runCtx, info, soft)
 	if h == nil {
 		return nil, noop
 	}
 	done := make(chan struct{})
 	go func() {
 		// A host CriticHandle.Deadline() that panics must not crash the process
 		// (this runs on its own goroutine, so the executor's top-level recover
 		// can't catch it). Log-free best-effort: just stop watching.
 		defer func() { _ = recover() }()
 		t := time.NewTicker(criticDeadlineCheck)
 		defer t.Stop()
 		for {
 			select {
 			case <-done:
 				return
 			case <-runCtx.Done():
 				return
 			case <-t.C:
 				// A zero deadline = no hard cap (not yet set); otherwise cancel
 				// once we're at or past it, distinguishing an explicit kill from a
 				// natural backstop expiry so the run gets the right status.
 				if d := h.Deadline(); !d.IsZero() && !time.Now().Before(d) {
 					if cause := h.KillCause(); cause != nil {
 						cancelCause(fmt.Errorf("%w: %s", ErrCriticKill, cause.Error()))
 					} else {
 						cancelCause(context.DeadlineExceeded)
 					}
 					return
 				}
 			}
 		}
 	}()
 	return &criticBinding{h: h}, func() {
 		close(done)
 		h.Stop()
 	}
 }
 func (b *criticBinding) recordStep(iter int, resp *llm.Response) {
 	if b != nil {
 		b.h.RecordStep(iter, resp)
 	}
 }
 // recordToolStart forwards a tool call to the critic. NOTE: majordomo's step
 // observer only fires AFTER an iteration completes, so this currently lands
 // post-tool, not at dispatch — the activity clock is refreshed once per
 // iteration, not mid-tool. A single very long tool call (e.g. a 30-min render)
 // therefore won't refresh the clock until it returns; a host that runs such
 // tools should feed interim progress to its Critic (mort's InstallProgressBridge
 // pattern). A true pre-dispatch refresh needs a majordomo hook (follow-up).
 func (b *criticBinding) recordToolStart(name, args string) {
 	if b != nil {
 		b.h.RecordToolStart(name, args)
 	}
 }
 // maxStepsOption returns the agent step-ceiling Option. With no critic it's a
 // fixed WithMaxSteps(base); with a critic it's a DYNAMIC WithMaxStepsFunc that
 // polls the handle each step (so the critic can raise a long run's budget),
 // falling back to base when the handle defers (MaxSteps() <= 0).
 func (b *criticBinding) maxStepsOption(base int) agent.Option {
 	if b == nil {
 		return agent.WithMaxSteps(base)
 	}
 	return agent.WithMaxStepsFunc(func() int {
 		if n := b.h.MaxSteps(); n > 0 {
 			return n
 		}
 		return base
 	})
 }
 // steerOptions returns the agent RunOptions that drain the critic's steer
 // messages into the loop. Empty when there is no critic.
 func (b *criticBinding) steerOptions() []agent.RunOption {
 	if b == nil {
 		return nil
 	}
 	return []agent.RunOption{agent.WithSteer(b.h.Steer)}
 }
@@ -0,0 +1,128 @@
 package run_test
 import (
 	"context"
 	"sync"
 	"testing"
 	"time"
 	"gitea.stevedudenhoeffer.com/steve/majordomo/llm"
 	"gitea.stevedudenhoeffer.com/steve/majordomo/provider/fake"
 	"gitea.stevedudenhoeffer.com/steve/executus/run"
 	"gitea.stevedudenhoeffer.com/steve/executus/tool"
 )
 type fakeCritic struct{ h *fakeCriticHandle }
 func (c *fakeCritic) Monitor(_ context.Context, _ run.RunInfo, _ time.Duration) run.CriticHandle {
 	return c.h
 }
 type fakeCriticHandle struct {
 	mu                  sync.Mutex
 	steps, tools, stops int
 	steered             int
 	maxSteps            int   // 0 => defer to the run's base MaxIterations
 	killCause           error // non-nil simulates a critic kill
 }
 func (h *fakeCriticHandle) RecordStep(int, *llm.Response) { h.mu.Lock(); h.steps++; h.mu.Unlock() }
 func (h *fakeCriticHandle) KillCause() error {
 	h.mu.Lock()
 	defer h.mu.Unlock()
 	return h.killCause
 }
 func (h *fakeCriticHandle) RecordToolStart(string, string) {
 	h.mu.Lock()
 	h.tools++
 	h.mu.Unlock()
 }
 func (h *fakeCriticHandle) Steer() []llm.Message { h.mu.Lock(); h.steered++; h.mu.Unlock(); return nil }
 func (h *fakeCriticHandle) Deadline() time.Time  { return time.Time{} } // no hard deadline
 func (h *fakeCriticHandle) MaxSteps() int        { h.mu.Lock(); defer h.mu.Unlock(); return h.maxSteps }
 func (h *fakeCriticHandle) Stop()                { h.mu.Lock(); h.stops++; h.mu.Unlock() }
 // TestCriticRaisesStepCeiling: a critic returning a higher MaxSteps lets the agent
 // run PAST its base MaxIterations (the dynamic step ceiling). With base=1 and no
 // critic the run would hit ErrMaxSteps after the first tool-dispatch step; the
 // critic raises it to 5 so the run completes.
 func TestCriticRaisesStepCeiling(t *testing.T) {
 	h := &fakeCriticHandle{maxSteps: 5}
 	fp := fake.New("fake")
 	fp.Enqueue("m",
 		// two tool-call steps (unknown tool → tolerated error results), then answer
 		fake.ReplyWith(llm.Response{ToolCalls: []llm.ToolCall{{ID: "c1", Name: "noop", Arguments: []byte(`{}`)}}}),
 		fake.ReplyWith(llm.Response{ToolCalls: []llm.ToolCall{{ID: "c2", Name: "noop", Arguments: []byte(`{}`)}}}),
 		fake.Reply("done after 2 tool steps"),
 	)
 	m, _ := fp.Model("m")
 	ex := run.New(run.Config{
 		Registry: tool.NewRegistry(),
 		Models:   func(ctx context.Context, _ string) (context.Context, llm.Model, error) { return ctx, m, nil },
 		Ports:    run.Ports{Critic: &fakeCritic{h: h}},
 		// large soft timeout so the deadline-watch never interferes in the test
 		Defaults: run.Defaults{CriticSoftTimeout: time.Hour},
 	})
 	res := ex.Run(context.Background(),
 		run.RunnableAgent{Name: "x", ModelTier: "m", MaxIterations: 1, Critic: run.CriticConfig{Enabled: true}},
 		tool.Invocation{RunID: "r"}, "go")
 	if res.Err != nil {
 		t.Fatalf("critic raised the ceiling to 5, run should complete past base=1: %v", res.Err)
 	}
 	if res.Output != "done after 2 tool steps" {
 		t.Errorf("output = %q", res.Output)
 	}
 }
 // TestCriticWired: an agent with Critic.Enabled gets monitored — Monitor returns
 // a handle the executor feeds (RecordStep), drains (Steer), and stops.
 func TestCriticWired(t *testing.T) {
 	h := &fakeCriticHandle{}
 	fp := fake.New("fake")
 	fp.Enqueue("m", fake.Reply("done"))
 	m, _ := fp.Model("m")
 	ex := run.New(run.Config{
 		Registry: tool.NewRegistry(),
 		Models:   func(ctx context.Context, _ string) (context.Context, llm.Model, error) { return ctx, m, nil },
 		Ports:    run.Ports{Critic: &fakeCritic{h: h}},
 	})
 	res := ex.Run(context.Background(),
 		run.RunnableAgent{Name: "watched", ModelTier: "m", Critic: run.CriticConfig{Enabled: true}},
 		tool.Invocation{RunID: "r"}, "go")
 	if res.Err != nil {
 		t.Fatalf("run error: %v", res.Err)
 	}
 	h.mu.Lock()
 	defer h.mu.Unlock()
 	if h.steps < 1 {
 		t.Errorf("critic should have seen >=1 step, got %d", h.steps)
 	}
 	if h.steered < 1 {
 		t.Errorf("critic Steer should be drained at least once, got %d", h.steered)
 	}
 	if h.stops != 1 {
 		t.Errorf("critic Stop should be called exactly once, got %d", h.stops)
 	}
 }
 // TestCriticDisabledNotMonitored: Critic.Enabled=false → Monitor never called.
 func TestCriticDisabledNotMonitored(t *testing.T) {
 	h := &fakeCriticHandle{}
 	fp := fake.New("fake")
 	fp.Enqueue("m", fake.Reply("done"))
 	m, _ := fp.Model("m")
 	ex := run.New(run.Config{
 		Registry: tool.NewRegistry(),
 		Models:   func(ctx context.Context, _ string) (context.Context, llm.Model, error) { return ctx, m, nil },
 		Ports:    run.Ports{Critic: &fakeCritic{h: h}},
 	})
 	ex.Run(context.Background(),
 		run.RunnableAgent{Name: "x", ModelTier: "m"}, // Critic.Enabled=false
 		tool.Invocation{RunID: "r"}, "go")
 	h.mu.Lock()
 	defer h.mu.Unlock()
 	if h.stops != 0 || h.steps != 0 {
 		t.Errorf("disabled critic should not be monitored: steps=%d stops=%d", h.steps, h.stops)
 	}
 }
@@ -0,0 +1,114 @@
 package run_test
 import (
 	"context"
 	"errors"
 	"testing"
 	"gitea.stevedudenhoeffer.com/steve/majordomo/llm"
 	"gitea.stevedudenhoeffer.com/steve/majordomo/provider/fake"
 	"gitea.stevedudenhoeffer.com/steve/executus/deliver"
 	"gitea.stevedudenhoeffer.com/steve/executus/run"
 	"gitea.stevedudenhoeffer.com/steve/executus/tool"
 )
 type recordingDelivery struct {
 	target   deliver.Target
 	output   string
 	errored  error
 	delivers int
 }
 func (d *recordingDelivery) Deliver(_ context.Context, t deliver.Target, output string, _ []deliver.Artifact) (string, error) {
 	d.target, d.output, d.delivers = t, output, d.delivers+1
 	return "msg-1", nil
 }
 func (d *recordingDelivery) DeliverError(_ context.Context, t deliver.Target, e error) error {
 	d.target, d.errored = t, e
 	return nil
 }
 func TestDeliveryWired(t *testing.T) {
 	d := &recordingDelivery{}
 	fp := fake.New("fake")
 	fp.Enqueue("m", fake.Reply("the output"))
 	m, _ := fp.Model("m")
 	ex := run.New(run.Config{
 		Registry: tool.NewRegistry(),
 		Models:   func(ctx context.Context, _ string) (context.Context, llm.Model, error) { return ctx, m, nil },
 		Ports:    run.Ports{Delivery: d},
 	})
 	// With a delivery target, the executor posts the output.
 	ex.Run(context.Background(),
 		run.RunnableAgent{Name: "x", ModelTier: "m"},
 		tool.Invocation{RunID: "r", DeliveryKind: "channel", DeliveryID: "chan-9"}, "go")
 	if d.delivers != 1 || d.output != "the output" || d.target.ID != "chan-9" || d.target.Kind != "channel" {
 		t.Fatalf("delivery wrong: %+v out=%q", d.target, d.output)
 	}
 }
 func TestNoDeliveryWithoutTarget(t *testing.T) {
 	d := &recordingDelivery{}
 	fp := fake.New("fake")
 	fp.Enqueue("m", fake.Reply("x"))
 	m, _ := fp.Model("m")
 	ex := run.New(run.Config{
 		Registry: tool.NewRegistry(),
 		Models:   func(ctx context.Context, _ string) (context.Context, llm.Model, error) { return ctx, m, nil },
 		Ports:    run.Ports{Delivery: d},
 	})
 	// No DeliveryID → executor delivers nothing (caller reads Result.Output).
 	ex.Run(context.Background(),
 		run.RunnableAgent{Name: "x", ModelTier: "m"},
 		tool.Invocation{RunID: "r"}, "go")
 	if d.delivers != 0 {
 		t.Errorf("no target should mean no delivery, got %d", d.delivers)
 	}
 }
 // TestNoDeliveryOnEarlyResolveError: an error BEFORE the run starts (model
 // resolve) returns before delivery is reached — neither Deliver nor DeliverError
 // fires. (Delivery covers run OUTCOMES, not pre-run setup failures.)
 func TestNoDeliveryOnEarlyResolveError(t *testing.T) {
 	d := &recordingDelivery{}
 	ex := run.New(run.Config{
 		Registry: tool.NewRegistry(),
 		Models: func(ctx context.Context, _ string) (context.Context, llm.Model, error) {
 			return ctx, nil, errors.New("resolve boom")
 		},
 		Ports: run.Ports{Delivery: d},
 	})
 	ex.Run(context.Background(),
 		run.RunnableAgent{Name: "x", ModelTier: "m"},
 		tool.Invocation{RunID: "r", DeliveryKind: "channel", DeliveryID: "chan-9"}, "go")
 	if d.delivers != 0 || d.errored != nil {
 		t.Errorf("early resolve failure should neither Deliver nor DeliverError: delivers=%d errored=%v", d.delivers, d.errored)
 	}
 }
 // TestDeliverErrorOnRunFailure: an in-loop run failure (the model errors) routes
 // through DeliverError with the run error.
 func TestDeliverErrorOnRunFailure(t *testing.T) {
 	d := &recordingDelivery{}
 	fp := fake.New("fake")
 	fp.Enqueue("m", fake.Step{Err: errors.New("model boom")}) // model errors mid-run
 	m, _ := fp.Model("m")
 	ex := run.New(run.Config{
 		Registry: tool.NewRegistry(),
 		Models:   func(ctx context.Context, _ string) (context.Context, llm.Model, error) { return ctx, m, nil },
 		Ports:    run.Ports{Delivery: d},
 	})
 	res := ex.Run(context.Background(),
 		run.RunnableAgent{Name: "x", ModelTier: "m"},
 		tool.Invocation{RunID: "r", DeliveryKind: "channel", DeliveryID: "chan-9"}, "go")
 	if res.Err == nil {
 		t.Fatal("expected a run error")
 	}
 	if d.delivers != 0 {
 		t.Errorf("a failed run should not Deliver (success path), got %d", d.delivers)
 	}
 	if d.errored == nil || d.target.ID != "chan-9" {
 		t.Errorf("a failed run with a target should DeliverError to chan-9, got errored=%v target=%+v", d.errored, d.target)
 	}
 }
@@ -10,6 +10,7 @@ import (
 	"gitea.stevedudenhoeffer.com/steve/majordomo/llm"
 	"gitea.stevedudenhoeffer.com/steve/executus/compact"
 	"gitea.stevedudenhoeffer.com/steve/executus/deliver"
 	"gitea.stevedudenhoeffer.com/steve/executus/tool"
 )
@@ -27,6 +28,7 @@ type Defaults struct {
 	MaxConsecutiveToolErrors int           // loop guard; default 3
 	MaxSameToolCallRepeats   int           // retry-storm guard; default 3
 	CompactionThresholdRatio float64       // fraction of model context to compact at; default 0.7
 	CriticSoftTimeout        time.Duration // idle window before the critic wakes; default 90s
 }
 func (d Defaults) withFallbacks() Defaults {
@@ -48,6 +50,9 @@ func (d Defaults) withFallbacks() Defaults {
 	if d.CompactionThresholdRatio <= 0 {
 		d.CompactionThresholdRatio = 0.7
 	}
 	if d.CriticSoftTimeout <= 0 {
 		d.CriticSoftTimeout = 90 * time.Second
 	}
 	return d
 }
@@ -99,10 +104,19 @@ type Result struct {
 }
 // Run executes ra with the given invocation + input and returns the Result. It
-// never propagates a panic; failures surface in Result.Err.
+// never propagates a panic; failures surface in Result.Err (a top-level recover
-func (e *Executor) Run(ctx context.Context, ra RunnableAgent, inv tool.Invocation, input string) Result {
+// converts any panic — including from a host Port — into a run error).
 func (e *Executor) Run(ctx context.Context, ra RunnableAgent, inv tool.Invocation, input string) (res Result) {
 	started := time.Now()
-	res := Result{RunID: inv.RunID}
+	res = Result{RunID: inv.RunID}
 	// Enforce the no-panic contract: a panic anywhere in the run (incl. a host
 	// Critic/Audit/Palette callback on the main goroutine) becomes Result.Err
 	// rather than unwinding into the caller.
 	defer func() {
 		if r := recover(); r != nil {
 			res.Err = fmt.Errorf("run.Executor: recovered panic: %v", r)
 		}
 	}()
 	tier := ra.ModelTier
 	if tier == "" {
@@ -141,10 +155,7 @@ func (e *Executor) Run(ctx context.Context, ra RunnableAgent, inv tool.Invocatio
 	// Audit start (optional). The recorder satisfies RunTally; stamp it on the
 	// invocation so a self-status tool can read live progress.
-	var rec RunRecorder
+	info := RunInfo{
 	var stateAcc *RunStateAccessor
 	if e.cfg.Ports.Audit != nil {
 		rec = e.cfg.Ports.Audit.StartRun(ctx, RunInfo{
 		RunID:         inv.RunID,
 		SubjectID:     ra.ID,
 		Name:          ra.Name,
@@ -153,7 +164,12 @@ func (e *Executor) Run(ctx context.Context, ra RunnableAgent, inv tool.Invocatio
 		ParentRunID:   inv.ParentRunID,
 		Inputs:        inv.SkillInputs,
 		StartedAt:     started,
-		})
+		MaxIterations: maxIter,
 	}
 	var rec RunRecorder
 	var stateAcc *RunStateAccessor
 	if e.cfg.Ports.Audit != nil {
 		rec = e.cfg.Ports.Audit.StartRun(ctx, info)
 	}
 	if rec != nil {
 		stateAcc = NewRunStateAccessor(rec, maxIter, 0, started)
@@ -168,16 +184,38 @@ func (e *Executor) Run(ctx context.Context, ra RunnableAgent, inv tool.Invocatio
 		return res
 	}
 	// Add skill__/agent__ delegation tools from the agent's palette (nil-safe:
 	// no PaletteSource or empty palette → no delegation tools).
 	if err := addDelegationTools(toolbox, ra, inv, e.cfg.Ports.Palette); err != nil {
 		res.Err = fmt.Errorf("build delegation tools: %w", err)
 		e.finishAudit(ctx, rec, "error", res, started, res.Err)
 		return res
 	}
 	// Run context: bound by MaxRuntime, detached from the caller's deadline so a
 	// lane/queue wait doesn't eat the run budget (mort's V10 lesson). Caller
 	// cancellation still propagates via MergeCancellation. Created BEFORE the
 	// step observer so the observer forwards the merged run context (not a
 	// possibly-cancelled caller ctx) to OnStep consumers.
-	runCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), maxRuntime)
+	// MaxRuntime stays a WithTimeout so its DeadlineExceeded propagates through the
-	defer cancel()
+	// child chain (→ "timeout"), preserving the run's-own-timeout vs caller-cancel
 	// distinction. A NESTED cause-carrying layer lets a critic kill surface as a
 	// distinct "killed" without disturbing that: only an ErrCriticKill cause is
 	// consulted in statusFor; a generic run error or a caller cancel is classified
 	// by the run error itself.
 	timeoutCtx, cancelTimeout := context.WithTimeout(context.WithoutCancel(ctx), maxRuntime)
 	defer cancelTimeout()
 	runCtx, cancelCause := context.WithCancelCause(timeoutCtx)
 	defer cancelCause(nil)
 	runCtx, mergeCancel := MergeCancellation(runCtx, ctx)
 	defer mergeCancel()
 	// Critic (optional): monitors the run for a stall, can nudge/extend/kill via
 	// its host Escalator. Its hard deadline is bound to runCtx (cancel on pass).
 	// nil-safe: no-op when no critic is configured or the agent doesn't enable it.
 	critic, stopCritic := e.startCritic(runCtx, cancelCause, ra, info)
 	defer stopCritic()
 	// Step instrumentation: accumulate Result.Steps + fire inv.OnStep, feed the
 	// audit recorder, and keep the live iteration counter fresh. majordomo's
 	// step observer hands us each completed iteration; we zip the model's tool
@@ -192,6 +230,7 @@ func (e *Executor) Run(ctx context.Context, ra RunnableAgent, inv tool.Invocatio
 		if rec != nil {
 			rec.OnStep(s.Index, s.Response)
 		}
 		critic.recordStep(s.Index, s.Response) // keep the critic's activity clock fresh + carry the step payload
 		var calls []llm.ToolCall
 		if s.Response != nil {
 			calls = s.Response.ToolCalls
@@ -202,6 +241,7 @@ func (e *Executor) Run(ctx context.Context, ra RunnableAgent, inv tool.Invocatio
 		}
 		for i := 0; i < n; i++ {
 			call, r := calls[i], s.Results[i]
 			critic.recordToolStart(call.Name, string(call.Arguments))
 			emitter.toolStart(runCtx, call.Name, call.Arguments)
 			emitter.toolEnd(runCtx, call, r.Content, r.IsError)
 			if rec != nil {
@@ -212,7 +252,10 @@ func (e *Executor) Run(ctx context.Context, ra RunnableAgent, inv tool.Invocatio
 	opts := []agent.Option{
 		agent.WithToolbox(toolbox),
-		agent.WithMaxSteps(maxIter),
+		// Step ceiling: a fixed WithMaxSteps(maxIter) normally, but when a critic is
 		// active it owns a DYNAMIC ceiling (WithMaxStepsFunc) so it can raise a
 		// healthy-but-long run's budget mid-flight. Falls back to maxIter.
 		critic.maxStepsOption(maxIter),
 		agent.WithToolErrorLimits(e.cfg.Defaults.MaxConsecutiveToolErrors, e.cfg.Defaults.MaxSameToolCallRepeats),
 		agent.WithStepObserver(stepObserver),
 	}
@@ -236,9 +279,9 @@ func (e *Executor) Run(ctx context.Context, ra RunnableAgent, inv tool.Invocatio
 	}
 	ag := agent.New(model, e.systemPrompt(ra), opts...)
-	runRes, runErr := ag.Run(runCtx, input)
+	runRes, runErr := ag.Run(runCtx, input, critic.steerOptions()...)
-	status := statusFor(runErr)
+	status := statusFor(runCtx, runErr)
 	if runRes != nil {
 		res.Output = runRes.Output
 		res.Usage = runRes.Usage
@@ -250,16 +293,26 @@ func (e *Executor) Run(ctx context.Context, ra RunnableAgent, inv tool.Invocatio
 	if e.cfg.Ports.Budget != nil {
 		e.cfg.Ports.Budget.Commit(detach(ctx), inv.CallerID, time.Since(started).Seconds())
 	}
 	e.deliver(ctx, inv, res, runErr)
 	return res
 }
-// statusFor maps a run error to a RunStats.Status, distinguishing a deadline
+// statusFor maps a run error to a RunStats.Status, distinguishing a critic kill
-// (timeout) and a cancellation (cancelled — caller cancel or shutdown) from a
+// (killed), a deadline (timeout), and a cancellation (cancelled — caller cancel
-// generic error so audit consumers can tell them apart.
+// or shutdown) from a generic error so audit consumers can tell them apart. The
-func statusFor(runErr error) string {
+// run context's cancellation cause carries the distinction (ErrCriticKill /
 // DeadlineExceeded), since ctx.Err() alone only reports Canceled.
 func statusFor(runCtx context.Context, runErr error) string {
 	switch {
 	case runErr == nil:
 		return "ok"
 	// Only the kill is recovered from the cancellation cause — a critic kill
 	// surfaces as a plain Canceled run error, so without this it'd read as
 	// "cancelled". Everything else is classified by the run error itself, so a
 	// genuine run error is never relabeled just because the context was later
 	// cancelled, and a caller cancel/deadline stays "cancelled" (not "timeout").
 	case errors.Is(context.Cause(runCtx), ErrCriticKill):
 		return "killed"
 	case errors.Is(runErr, context.DeadlineExceeded):
 		return "timeout"
 	case errors.Is(runErr, context.Canceled):
@@ -308,6 +361,23 @@ func (e *Executor) compactionThreshold(tier string) int {
 	return int(float64(max) * e.cfg.Defaults.CompactionThresholdRatio)
 }
 // deliver posts the run's output (or error) via run.Ports.Delivery when both a
 // Delivery and a target (inv.DeliveryID) are set. No target = the caller reads
 // Result.Output itself (the synchronous default). Best-effort + detached: a
 // delivery failure must not change the run's outcome.
 func (e *Executor) deliver(ctx context.Context, inv tool.Invocation, res Result, runErr error) {
 	if e.cfg.Ports.Delivery == nil || inv.DeliveryID == "" {
 		return
 	}
 	target := deliver.Target{Kind: inv.DeliveryKind, ID: inv.DeliveryID}
 	dctx := detach(ctx)
 	if runErr != nil {
 		_ = e.cfg.Ports.Delivery.DeliverError(dctx, target, runErr)
 		return
 	}
 	_, _ = e.cfg.Ports.Delivery.Deliver(dctx, target, res.Output, nil)
 }
 // detach derives a bounded cleanup context off ctx, detached from its
 // cancellation, for post-run writes. The cancel is intentionally not returned;
 // CleanupContextTimeout bounds the lifetime.
@@ -148,20 +148,33 @@ func TestExecutorNilModelNoPanic(t *testing.T) {
 	}
 }
-// TestStatusFor maps run errors to RunStats.Status (gadfly F3).
+// TestStatusFor maps run errors + cancellation cause to RunStats.Status (gadfly F3).
 func TestStatusFor(t *testing.T) {
 	bg := context.Background()
 	// A context cancelled with the critic-kill cause: ctx.Err() is Canceled, but
 	// context.Cause carries ErrCriticKill → "killed".
 	killCtx, killCancel := context.WithCancelCause(context.Background())
 	killCancel(fmt.Errorf("%w: hung", ErrCriticKill))
 	// A context cancelled with a non-kill cause must NOT relabel a genuine run
 	// error: a real error stays "error" even though the ctx was later cancelled.
 	cancelledCtx, cc := context.WithCancelCause(context.Background())
 	cc(context.DeadlineExceeded)
 	cases := []struct {
 		ctx  context.Context
 		err  error
 		want string
 	}{
-		{nil, "ok"},
+		{bg, nil, "ok"},
-		{context.DeadlineExceeded, "timeout"},
+		{bg, context.DeadlineExceeded, "timeout"},
-		{context.Canceled, "cancelled"},
+		{bg, context.Canceled, "cancelled"},
-		{fmt.Errorf("wrapped: %w", context.DeadlineExceeded), "timeout"},
+		{bg, fmt.Errorf("wrapped: %w", context.DeadlineExceeded), "timeout"},
-		{errors.New("boom"), "error"},
+		{bg, errors.New("boom"), "error"},
 		{killCtx, context.Canceled, "killed"},
 		{cancelledCtx, errors.New("boom"), "error"},   // generic error not relabeled by cause
 		{cancelledCtx, context.Canceled, "cancelled"}, // caller cancel stays cancelled, not timeout
 	}
 	for _, c := range cases {
-		if got := statusFor(c.err); got != c.want {
+		if got := statusFor(c.ctx, c.err); got != c.want {
 			t.Errorf("statusFor(%v) = %q, want %q", c.err, got, c.want)
 		}
 	}
@@ -0,0 +1,102 @@
 package run
 import (
 	"context"
 	"fmt"
 	"gitea.stevedudenhoeffer.com/steve/majordomo/llm"
 	"gitea.stevedudenhoeffer.com/steve/executus/tool"
 )
 // addDelegationTools adds a delegation tool to the toolbox for each
 // SkillPalette / SubAgentPalette entry, backed by the PaletteSource:
 //
 //   - skill__<name> invokes the named saved skill with structured inputs.
 //   - agent__<name> invokes the named sub-agent with a prompt.
 //
 // Each delegated call runs as a CHILD of the current run (parentRunID =
 // inv.RunID), inheriting the caller + channel. No-op when palette is nil or both
 // palettes are empty — so an agent with no palette (or a host with no
 // PaletteSource) simply has no delegation tools, exactly as before.
 func addDelegationTools(box *llm.Toolbox, ra RunnableAgent, inv tool.Invocation, palette PaletteSource) error {
 	if palette == nil {
 		return nil
 	}
 	seen := map[string]bool{} // dedupe across both palettes by final tool name
 	for _, name := range ra.SkillPalette {
 		name := name // capture
 		toolName := "skill__" + name
 		if name == "" || seen[toolName] { // skip empty / duplicate palette entries
 			continue
 		}
 		seen[toolName] = true
 		t := llm.DefineTool(
 			toolName,
 			fmt.Sprintf("Delegate the task to the %q skill. Provide its declared inputs.", name),
 			func(ctx context.Context, args skillDelegateArgs) (any, error) {
 				out, _, status, err := palette.InvokeSkill(ctx, inv.CallerID, inv.ChannelID, name, args.Inputs, inv.RunID)
 				if err != nil {
 					return nil, delegationErr("skill", name, out, err)
 				}
 				return delegationResult(name, "skill", out, status), nil
 			},
 		)
 		if err := box.Add(t); err != nil {
 			return fmt.Errorf("add %s: %w", toolName, err)
 		}
 	}
 	for _, name := range ra.SubAgentPalette {
 		name := name // capture
 		toolName := "agent__" + name
 		if name == "" || seen[toolName] {
 			continue
 		}
 		seen[toolName] = true
 		t := llm.DefineTool(
 			toolName,
 			fmt.Sprintf("Delegate the task to the %q sub-agent with a natural-language prompt.", name),
 			func(ctx context.Context, args agentDelegateArgs) (any, error) {
 				out, _, status, err := palette.InvokeAgent(ctx, inv.CallerID, inv.ChannelID, name, args.Prompt, inv.RunID, "", "", nil, nil)
 				if err != nil {
 					return nil, delegationErr("agent", name, out, err)
 				}
 				return delegationResult(name, "agent", out, status), nil
 			},
 		)
 		if err := box.Add(t); err != nil {
 			return fmt.Errorf("add %s: %w", toolName, err)
 		}
 	}
 	return nil
 }
 // delegationResult surfaces a non-ok child status to the parent agent (so it can
 // react to a timeout/cancel/budget stop) while still passing the partial output.
 func delegationResult(name, kind, out, status string) string {
 	if status != "" && status != "ok" {
 		header := fmt.Sprintf("[%s %q ended with status %q]", kind, name, status)
 		if out == "" { // no trailing blank line when there's no body
 			return header
 		}
 		return header + "\n" + out
 	}
 	return out
 }
 // delegationErr wraps a hard delegation failure, folding in any partial output
 // the child produced before failing (so it isn't silently lost).
 func delegationErr(kind, name, partial string, err error) error {
 	if partial != "" {
 		return fmt.Errorf("%s %q failed (partial output: %q): %w", kind, name, partial, err)
 	}
 	return fmt.Errorf("%s %q failed: %w", kind, name, err)
 }
 type skillDelegateArgs struct {
 	Inputs map[string]any `json:"inputs" description:"Inputs for the skill, matching its declared input schema."`
 }
 type agentDelegateArgs struct {
 	Prompt string `json:"prompt" description:"The task/prompt to hand the sub-agent."`
 }
@@ -0,0 +1,125 @@
 package run_test
 import (
 	"context"
 	"encoding/json"
 	"testing"
 	"gitea.stevedudenhoeffer.com/steve/majordomo/llm"
 	"gitea.stevedudenhoeffer.com/steve/majordomo/provider/fake"
 	"gitea.stevedudenhoeffer.com/steve/executus/run"
 	"gitea.stevedudenhoeffer.com/steve/executus/tool"
 )
 // recordingPalette captures the delegation call it received.
 type recordingPalette struct {
 	gotName, gotCaller, gotParent string
 	gotInputs                     map[string]any
 }
 func (p *recordingPalette) ResolveSkill(context.Context, string, string) (string, error) {
 	return "", nil
 }
 func (p *recordingPalette) InvokeSkill(_ context.Context, callerID, _, name string, inputs map[string]any, parentRunID string) (string, string, string, error) {
 	p.gotName, p.gotCaller, p.gotParent, p.gotInputs = name, callerID, parentRunID, inputs
 	return "the skill output", "child-run-1", "ok", nil
 }
 func (p *recordingPalette) ResolveAgent(context.Context, string, string) (string, error) {
 	return "", nil
 }
 func (p *recordingPalette) InvokeAgent(context.Context, string, string, string, string, string, string, string, []string, func(context.Context, string, string)) (string, string, string, error) {
 	return "", "", "ok", nil
 }
 // TestPaletteDelegation: an agent with a SkillPalette gets a skill__<name> tool;
 // the model calls it, the executor routes it through run.Ports.Palette as a
 // child of the current run, and the result flows back into the loop.
 func TestPaletteDelegation(t *testing.T) {
 	pal := &recordingPalette{}
 	fp := fake.New("fake")
 	fp.Enqueue("m",
 		fake.ReplyWith(llm.Response{ToolCalls: []llm.ToolCall{{
 			ID:        "c1",
 			Name:      "skill__helper",
 			Arguments: json.RawMessage(`{"inputs":{"q":"hi"}}`),
 		}}}),
 		fake.Reply("delegated and done"),
 	)
 	m, err := fp.Model("m")
 	if err != nil {
 		t.Fatal(err)
 	}
 	ex := run.New(run.Config{
 		Registry: tool.NewRegistry(),
 		Models:   func(ctx context.Context, _ string) (context.Context, llm.Model, error) { return ctx, m, nil },
 		Ports:    run.Ports{Palette: pal},
 	})
 	res := ex.Run(context.Background(),
 		run.RunnableAgent{ID: "a1", Name: "boss", ModelTier: "m", SkillPalette: []string{"helper"}},
 		tool.Invocation{RunID: "parent-run", CallerID: "caller-7", ChannelID: "chan"},
 		"delegate please")
 	if res.Err != nil {
 		t.Fatalf("run error: %v", res.Err)
 	}
 	if res.Output != "delegated and done" {
 		t.Errorf("output = %q", res.Output)
 	}
 	if pal.gotName != "helper" {
 		t.Errorf("InvokeSkill name = %q, want helper", pal.gotName)
 	}
 	if pal.gotCaller != "caller-7" {
 		t.Errorf("InvokeSkill caller = %q, want caller-7", pal.gotCaller)
 	}
 	if pal.gotParent != "parent-run" {
 		t.Errorf("InvokeSkill parentRunID = %q, want parent-run (child of the current run)", pal.gotParent)
 	}
 	if pal.gotInputs["q"] != "hi" {
 		t.Errorf("InvokeSkill inputs = %+v, want q=hi", pal.gotInputs)
 	}
 }
 // TestNoPaletteNoDelegationTools: nil PaletteSource → no delegation tools, run
 // still works (the agent just has no skill__/agent__ tools).
 func TestNoPaletteNoDelegationTools(t *testing.T) {
 	fp := fake.New("fake")
 	fp.Enqueue("m", fake.Reply("ok"))
 	m, _ := fp.Model("m")
 	ex := run.New(run.Config{
 		Registry: tool.NewRegistry(),
 		Models:   func(ctx context.Context, _ string) (context.Context, llm.Model, error) { return ctx, m, nil },
 	})
 	res := ex.Run(context.Background(),
 		run.RunnableAgent{Name: "x", ModelTier: "m", SkillPalette: []string{"helper"}},
 		tool.Invocation{RunID: "r"}, "hi")
 	if res.Err != nil || res.Output != "ok" {
 		t.Fatalf("nil-palette run failed: %v / %q", res.Err, res.Output)
 	}
 }
 // TestDelegationDedupeAndEmptySkip: empty + duplicate palette names are skipped,
 // not turned into "skill__"/duplicate tools that error at box.Add (gadfly C0).
 func TestDelegationDedupeAndEmptySkip(t *testing.T) {
 	pal := &recordingPalette{}
 	fp := fake.New("fake")
 	fp.Enqueue("m", fake.Reply("ok"))
 	m, _ := fp.Model("m")
 	ex := run.New(run.Config{
 		Registry: tool.NewRegistry(),
 		Models:   func(ctx context.Context, _ string) (context.Context, llm.Model, error) { return ctx, m, nil },
 		Ports:    run.Ports{Palette: pal},
 	})
 	// "" (empty) and a duplicate "helper" must not break the build.
 	res := ex.Run(context.Background(),
 		run.RunnableAgent{Name: "x", ModelTier: "m", SkillPalette: []string{"helper", "", "helper"}},
 		tool.Invocation{RunID: "r"}, "hi")
 	if res.Err != nil {
 		t.Fatalf("empty/duplicate palette names should be skipped, not error: %v", res.Err)
 	}
 	if res.Output != "ok" {
 		t.Fatalf("output = %q", res.Output)
 	}
 }
@@ -2,6 +2,7 @@ package run
 import (
 	"context"
 	"errors"
 	"time"
 	"gitea.stevedudenhoeffer.com/steve/majordomo/llm"
@@ -9,6 +10,12 @@ import (
 	"gitea.stevedudenhoeffer.com/steve/executus/deliver"
 )
 // ErrCriticKill is the cancellation cause the executor stamps on a run the
 // critic kills, so a critic kill surfaces as a distinct "killed" status (vs a
 // backstop "timeout" or a caller "cancelled"). A host CriticHandle signals a
 // kill via KillCause(); the executor wraps that reason with this sentinel.
 var ErrCriticKill = errors.New("run: critic killed the run")
 // Ports are the host seams the run executor consumes. Every field is nil-safe:
 // a light host passes the zero Ports and gets a bounded, in-memory run with no
 // persistence, audit, budget, critic, delegation, or delivery — which is
@@ -48,6 +55,9 @@ type RunInfo struct {
 	ParentRunID string
 	Inputs      map[string]any
 	StartedAt   time.Time
 	// MaxIterations is the run's base tool-dispatch step ceiling, so a critic can
 	// raise it relative to the baseline (see CriticHandle.MaxSteps).
 	MaxIterations int
 }
 // RunStats is the terminal roll-up a recorder's Close writes. Mirrors mort's
@@ -113,10 +123,17 @@ type Critic interface {
 }
 // CriticHandle is the executor's live link to a run's critic.
 //
 // Concurrency: the executor calls RecordStep/RecordToolStart/Steer from the run
 // goroutine while a separate watch goroutine polls Deadline() and the run's end
 // calls Stop() — so implementations MUST be safe for concurrent use across these
 // methods (the critic battery's handle guards its state with a mutex).
 type CriticHandle interface {
 	// RecordStep / RecordToolStart keep the critic's activity clock fresh so a
-	// healthy-but-slow run is not mistaken for a hang.
+	// healthy-but-slow run is not mistaken for a hang. RecordStep also carries the
-	RecordStep(iter int)
+	// completed step's model response (nil-safe) so the critic's Trace can show
 	// what the agent actually produced, not just an iteration count.
 	RecordStep(iter int, resp *llm.Response)
 	RecordToolStart(name, args string)
 	// Steer returns any messages the critic wants injected into the loop (a
 	// nudge), drained before each step — matches majordomo agent.WithSteer.
@@ -124,6 +141,17 @@ type CriticHandle interface {
 	// Deadline returns the current hard-kill deadline (the critic may extend
 	// it); the executor binds the run context to it. Zero = no hard deadline.
 	Deadline() time.Time
 	// MaxSteps returns the current tool-dispatch step ceiling, polled by the
 	// executor each step (via majordomo WithMaxStepsFunc) so a critic can raise a
 	// healthy-but-long run's iteration budget mid-flight. Return <= 0 to defer to
 	// the run's base MaxIterations.
 	MaxSteps() int
 	// KillCause returns a non-nil reason iff the critic has decided to KILL this
 	// run (as opposed to letting the hard-deadline backstop expire). The executor
 	// reads it when the deadline passes: non-nil → cancel the run with
 	// ErrCriticKill (status "killed"); nil → the backstop expired naturally
 	// (status "timeout"). Hosts that never distinguish the two may return nil.
 	KillCause() error
 	// Stop ends monitoring when the run finishes.
 	Stop()
 }
@@ -173,6 +173,13 @@ type Invocation struct {
 	CallerID  string
 	ChannelID string
 	GuildID   string
 	// DeliveryKind / DeliveryID name where the executor posts the run's output
 	// via run.Ports.Delivery — a host-interpreted Target ("channel"/"dm"/
 	// "thread"/...). An empty DeliveryID means the executor delivers nothing
 	// and the caller reads Result.Output itself (the synchronous default; the
 	// `.agent run` canary works this way).
 	DeliveryKind string
 	DeliveryID   string
 	// CallerIsAdmin is true when the caller is a mort admin (Member.Admin).
 	// Populated by the executor at run dispatch via Bot.GetMember; defaults
 	// to false on any lookup failure (member not found, DB error, empty
Author	SHA1	Message	Date
steve	be4bbbcad5	run: fix statusFor — don't relabel a generic error / caller-cancel as timeout (gadfly #11 ) executus CI / test (pull_request) Successful in 47s Details executus CI / test (push) Successful in 45s Details The WithCancelCause+timer rewrite made MaxRuntime surface as Canceled (not DeadlineExceeded), so statusFor's context.Cause(DeadlineExceeded) check could relabel (a) a genuine run error as 'timeout' and (b) a caller cancel/deadline as 'timeout' (was 'cancelled'). Convergent gadfly finding (glm-5.2 + cluster). Fix: keep MaxRuntime as WithTimeout (its DeadlineExceeded propagates → 'timeout', preserving own-timeout vs caller-cancel), add a NESTED WithCancelCause layer only for the kill. statusFor consults context.Cause ONLY for ErrCriticKill; everything else is classified by the run error itself. Tests: generic-error-not-relabeled + caller-cancel-stays-cancelled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 17:00:26 -04:00
steve	390e6cf905	run: critic parity — fuller RecordStep + cause-carrying Kill (distinct status) executus CI / test (pull_request) Successful in 46s Details Adversarial Review (Gadfly) / review (pull_request) Successful in 22m30s Details Completes the run-critic seam so a host adapter (mort's agentcritic) has full fidelity, closing the two limitations gadfly surfaced on mort #1334. - RecordStep(iter int, resp *llm.Response): the completed step's model response is now passed to the critic (was index-only), so a host that records a trace (mort's ProgressRecorder) can show what the agent actually produced, not just an iteration count. The executor forwards s.Response; the battery ignores it (its Progress is count-based). - CriticHandle.KillCause() error + ErrCriticKill: the executor now distinguishes an explicit critic KILL from a natural backstop expiry. runCtx uses a cause-carrying cancel (WithCancelCause + a MaxRuntime timer cancelling with DeadlineExceeded); the deadline-watch cancels with ErrCriticKill when KillCause()!=nil, else DeadlineExceeded. statusFor reads context.Cause → killed / timeout / cancelled are now distinct (were all "cancelled"). The battery sets killCause from Decision.KillReason on a Kill. Tests: statusFor "killed" case (cause=ErrCriticKill, err=Canceled); fake handle + battery RecordStep/KillCause signatures. Core stays battery-free. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 16:35:13 -04:00
steve	1a1d5e417b	chore: go mod tidy (add missing go.sum entry; CI tidiness gate) executus CI / test (pull_request) Successful in 2m8s Details executus CI / test (push) Successful in 1m45s Details Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 14:53:58 -04:00
steve	f3bd43b726	ci(gadfly): drop the m1 reviewer (dead weight; keep m5) executus CI / test (pull_request) Failing after 1m1s Details m1/qwen3:14b proved consistently low-value + slowest in the pool over multiple PRs. Removed from GADFLY_MODELS + GADFLY_PROVIDER_CONCURRENCY + its endpoint so it never fires again. m5 retained. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 14:41:14 -04:00
steve	306d575c31	critic: overflow-guard maxSteps += RaiseStepsBy (gadfly 5-model convergence) executus CI / test (pull_request) Has been cancelled Details A buggy/hostile Escalator returning a huge RaiseStepsBy could wrap handle.maxSteps negative (which the executor reads as defer-to-base). Clamp at math.MaxInt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 14:38:48 -04:00
steve	4ba83ab905	run: critic can raise a run's step ceiling mid-flight (CriticHandle.MaxSteps) executus CI / test (pull_request) Failing after 1m1s Details Adversarial Review (Gadfly) / review (pull_request) Successful in 21m8s Details Prerequisite for a full-fidelity mort agentcritic adapter (which adjusts a healthy-but-long run's iteration budget, not just its deadline). executus's CriticHandle was deadline+steer only; this adds the dynamic step ceiling above an unchanged majordomo (which already exposes WithMaxStepsFunc). - run.RunInfo += MaxIterations (the run's base ceiling, so a critic can raise it relative to the baseline). - run.CriticHandle += MaxSteps() int — polled by the executor each step via agent.WithMaxStepsFunc; <=0 defers to the base. The executor uses WithMaxStepsFunc(critic.MaxSteps) when a critic is active, else WithMaxSteps. - critic battery: handle.maxSteps (initialised from RunInfo.MaxIterations) + MaxSteps(); Decision gains RaiseStepsBy so an Escalator can raise the ceiling alongside ExtendBy. ExtendOnce default is unchanged (time-only). Test: a critic returning MaxSteps=5 lets a base-MaxIterations=1 run complete two tool-dispatch steps past the base ceiling. Core stays battery-free (run doesn't import critic). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 14:16:03 -04:00
steve	a103cc5e9f	ci(gadfly): 9-cloud panel @ 3 models x 3 lenses (9 concurrent) executus CI / test (push) Failing after 1m57s Details Match mort: minimax-m3, glm-5.2, glm-5.1 (SWE-Bench Pro SOTA), kimi-k2.7-code, deepseek-v4-pro, nemotron-3-super, gpt-oss:120b, qwen3-coder:480b, gemma4 (8 families) + m1/m5 locals. ollama-cloud=3 x lens=3 = 9 concurrent (10 budget). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 12:17:24 -04:00
steve	4d28cd6e2c	ci(gadfly): 4-cloud pool — add kimi-k2.7-code + deepseek-v4-pro, drop v4-flash executus CI / test (push) Failing after 1m2s Details Match mort's new cloud panel: minimax-m3, glm-5.2, kimi-k2.7-code (Moonshot), deepseek-v4-pro (frontier, replaces v4-flash). Keeps m1/m5 locals + the existing ollama-cloud=1 + lens-concurrency=3 serial-model style. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 11:59:13 -04:00
steve	dcaefff756	ci(gadfly): add M1/M5 Macs back to the reviewer pool (full fleet) executus CI / test (push) Failing after 1m23s Details Re-adds the local Macs (m1/qwen3:14b, m5/qwen3.6:35b-mlx) via their foreman endpoints alongside the 3 cloud models. Cloud keeps lens fan-out (ollama-cloud=1 model + lens=3); each Mac runs one model with lenses serial (foreman serializes anyway); all provider lanes parallel. Bumps the job timeout 30->90m for the slow local lanes. With findings telemetry now on, gadfly-reports can quantify whether the Macs earn their keep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 10:44:22 -04:00
steve	97154395e6	C0b: document recordToolStart post-iteration timing (gadfly glm finding) executus CI / test (pull_request) Failing after 59s Details executus CI / test (push) Failing after 1m1s Details majordomo's step observer fires post-iteration, so the critic's activity clock refreshes per-iteration, not mid-tool — a single long tool call won't refresh it until it returns. Documented + the host-progress-bridge mitigation (mort's pattern). A true pre-dispatch hook needs majordomo support (follow-up). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 10:10:56 -04:00
steve	4aa06f652e	C0b: address verified gadfly findings (panic-safety + test honesty) executus CI / test (pull_request) Failing after 58s Details From PR #9 (minimax + deepseek): - Run now has a top-level recover() — the "never propagates a panic" promise was unenforced; a panicking host Port (Critic/Audit/Palette) on the run goroutine now becomes Result.Err instead of unwinding into the caller. - The critic deadline-watch goroutine recovers panics from a host Deadline() (it's a separate goroutine, so Run's recover can't catch it) — a buggy CriticHandle can't crash the process. - CriticHandle interface documents its concurrency contract (Record*/Steer on the run goroutine vs Deadline()/Stop() from the watch goroutine — impls must be concurrent-safe; the critic battery already is). - startCritic's dead `soft <= 0 -> noop` guard (withFallbacks already coerces to 90s) replaced with a defensive inline 90s default, so a bypass of withFallbacks still gets a working critic instead of silently none. - Delivery tests made honest: the old "error path" test only checked the early-return (no delivery); added TestDeliverErrorOnRunFailure (in-loop model error -> DeliverError to the target) + renamed the early-return test. Graded all #9 findings in the gadfly MCP. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 10:09:22 -04:00
steve	43b2471737	C0b: wire Critic + Delivery into run.Executor executus CI / test (pull_request) Failing after 1m0s Details Adversarial Review (Gadfly) / review (pull_request) Successful in 5m9s Details Continues finishing the executor's run.Ports wiring (after C0's Palette). Critic (run/critic.go): when Ports.Critic is set and the agent enables it, the executor calls Monitor at run start, feeds RecordStep/RecordToolStart from the step observer, drains the critic's Steer messages into the loop via agent.WithSteer, and binds the run's hard cancellation to the critic's (extendable) Deadline through a watch goroutine — a healthy-but-slow run gets room while a hung one is killed. Stop() on run end. Soft timeout from Defaults.CriticSoftTimeout (default 90s). nil-safe: no critic / not-enabled = no-op. Delivery (run/executor.go deliver): after the run, when Ports.Delivery is set and inv.DeliveryID is non-empty, the executor posts Result.Output (or DeliverError on failure) to a host-interpreted deliver.Target {inv.DeliveryKind, inv.DeliveryID}. Empty target = caller reads Result.Output itself (the synchronous default; the `.agent run` canary). Best-effort + detached. tool.Invocation gains DeliveryKind/DeliveryID (host-set egress target). Tests: critic monitored/fed/steered/stopped when enabled, untouched when not; delivery posts on a target, skips without one. Deferred: Checkpointer (needs a majordomo hook to snapshot the running message history). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 10:00:05 -04:00
steve	0c80679719	C0: address verified gadfly findings (trivial fixes) executus CI / test (pull_request) Failing after 1m31s Details executus CI / test (push) Failing after 1m31s Details From the PR #8 review (all graded in the gadfly MCP): - skip empty palette names + dedupe by final tool name, instead of producing a "skill__" tool or an opaque box.Add duplicate error. - delegationResult: no trailing blank line when a non-ok child produced no output. - delegationErr: fold a child's partial output into the hard-failure error so it isn't silently dropped. Deferred to C0b (design-level, not trivial): route delegation through the tool.Registry gate/audit wrappers; expose the skill's real input schema to the LLM instead of a generic inputs map. typed-nil PaletteSource is left as a caller contract (the == nil guard catches the untyped-nil interface). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 09:53:11 -04:00
steve	9d41987b0e	C0: wire Palette delegation into run.Executor (skill__/agent__ tools) executus CI / test (pull_request) Failing after 1m2s Details Adversarial Review (Gadfly) / review (pull_request) Successful in 3m47s Details The first cutover prerequisite: the executor now turns an agent's SkillPalette / SubAgentPalette into delegation tools so a mort agent that delegates works through run.Executor (the piece the `.agent run` canary needs beyond the already-wired audit/budget). - run/palette.go: addDelegationTools builds a skill__<name> tool (structured inputs) per SkillPalette entry and an agent__<name> tool (prompt) per SubAgentPalette entry, each invoking run.Ports.Palette as a CHILD of the current run (parentRunID = inv.RunID, inheriting caller + channel). A non-ok child status is surfaced to the parent with the partial output. nil-safe: no PaletteSource or empty palette → no delegation tools (unchanged behavior). - executor.go: call it right after building the low-level toolbox. Tests: the model calls skill__helper → routed through Palette with the right name/caller/inputs/parent; nil palette → run still works. Deferred to C0b (the remaining run.Ports executor wiring): Critic (soft-timeout monitor + deadline binding + steer), Delivery (output egress for surfaces that need executor-side delivery), Checkpointer (needs a majordomo message-history hook to snapshot resumable state). The `.agent run` canary delivers its returned Result.Output itself, so these aren't on its critical path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 09:28:01 -04:00
steve	e37cf415de	ci(gadfly): emit findings to gadfly-reports + bump image to sha-d7f364d executus CI / test (push) Failing after 2m40s Details Adds GADFLY_FINDINGS_URL / GADFLY_FINDINGS_TOKEN (user-scope secrets) so each review POSTs its run + findings to the gadfly-reports store, and bumps the pinned gadfly image to sha-d7f364d (the build carrying the findings-emit). Advisory only — emit failures never affect the review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-27 09:12:46 -04:00