feat(run): execute multi-phase pipelines (RunnableAgent.Phases) #19

Merged
steve merged 2 commits from feat/kernel-phases into main 2026-06-29 19:52:52 +00:00
Owner

The kernel carried RunnableAgent.Phases as a DTO but never executed itrun.Executor.Run always ran a single agent loop with ra.SystemPrompt. A phased agent (mort's deepresearch/research) therefore silently ran one loop with the base prompt instead of its pipeline. This is the host-side companion to mort's executus migration (mort issue #1355): mort's agent path is now sole-sourced on this kernel, so the phase feature must actually run.

What

  • run/phases.gorunPhases / runOnePhase. Phases run sequentially; each is a fresh agent loop (or a bare LLM call for IsRunFunc phases) with its own template-expanded system prompt ({{.Query}} + {{.<PhaseName>}}), model tier, step cap, and tool subset. Outputs thread into later phases; the final phase's output is the run output. Optional phases swallow errors → FallbackMessage; a non-optional phase that merely exhausts its step/tool budget salvages its partial transcript and continues (a hard error still aborts); per-phase tier-resolve failures fall back to the base model with a WARN. Ported from mort's agentexec pipeline but reusing the kernel's machinery.
  • run/agent.goPhase gains IsRunFunc + FallbackMessage (the struct previously omitted them).
  • run/executor.goRun factors the shared agent options (tool-error limits, step observer, compactor) and branches: single loop (critic's dynamic step ceiling) vs the phase runner (fixed per-phase caps; the run-level critic's steer + hard deadline still apply across phases). The same step observer feeds audit/steps/critic across all phases, so token/tool/step accounting accumulates. systemPrompt delegates to a new systemPromptWithBody so each phase keeps the platform header.

Not carried over (kernel is leaner than mort's legacy pipeline)

  • the legacy submit capture tool (the kernel relies on majordomo's no-tool-call-is-final-answer termination, as its single-loop path already does);
  • the critic's dynamic iteration ceiling for phases (per-phase caps are fixed at phase start; steer + hard deadline still apply).

Tests

run/phases_test.go: sequential output threading + template expansion, Optional-failure→FallbackMessage continues, hard-error abort, IsRunFunc bare call, per-phase SystemHeader, filterToolbox subset, expandPhaseTemplate. Full go test ./... green.

🤖 Generated with Claude Code

The kernel carried `RunnableAgent.Phases` as a DTO but **never executed it** — `run.Executor.Run` always ran a single agent loop with `ra.SystemPrompt`. A phased agent (mort's `deepresearch`/`research`) therefore silently ran one loop with the base prompt instead of its pipeline. This is the host-side companion to mort's executus migration (mort issue #1355): mort's agent path is now sole-sourced on this kernel, so the phase feature must actually run. ## What - **`run/phases.go`** — `runPhases` / `runOnePhase`. Phases run sequentially; each is a fresh agent loop (or a bare LLM call for `IsRunFunc` phases) with its own template-expanded system prompt (`{{.Query}}` + `{{.<PhaseName>}}`), model tier, step cap, and tool subset. Outputs thread into later phases; the final phase's output is the run output. `Optional` phases swallow errors → `FallbackMessage`; a non-optional phase that merely exhausts its step/tool budget salvages its partial transcript and continues (a hard error still aborts); per-phase tier-resolve failures fall back to the base model with a WARN. Ported from mort's `agentexec` pipeline but reusing the kernel's machinery. - **`run/agent.go`** — `Phase` gains `IsRunFunc` + `FallbackMessage` (the struct previously omitted them). - **`run/executor.go`** — `Run` factors the shared agent options (tool-error limits, step observer, compactor) and branches: single loop (critic's *dynamic* step ceiling) vs the phase runner (fixed per-phase caps; the run-level critic's steer + hard deadline still apply across phases). The same step observer feeds audit/steps/critic across all phases, so token/tool/step accounting accumulates. `systemPrompt` delegates to a new `systemPromptWithBody` so each phase keeps the platform header. ## Not carried over (kernel is leaner than mort's legacy pipeline) - the legacy `submit` capture tool (the kernel relies on majordomo's no-tool-call-is-final-answer termination, as its single-loop path already does); - the critic's *dynamic* iteration ceiling for phases (per-phase caps are fixed at phase start; steer + hard deadline still apply). ## Tests `run/phases_test.go`: sequential output threading + template expansion, Optional-failure→FallbackMessage continues, hard-error abort, IsRunFunc bare call, per-phase SystemHeader, `filterToolbox` subset, `expandPhaseTemplate`. Full `go test ./...` green. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
steve added 1 commit 2026-06-29 19:15:06 +00:00
feat(run): execute multi-phase pipelines (RunnableAgent.Phases)
executus CI / test (pull_request) Successful in 1m49s
Adversarial Review (Gadfly) / review (pull_request) Successful in 13m59s
30b79a330f
The kernel carried RunnableAgent.Phases as a DTO but never executed it —
Run always ran a single agent loop with ra.SystemPrompt, so a phased agent
(mort's deepresearch/research) silently ran one loop with the base prompt
instead of its pipeline. This implements the phase loop, ported from mort's
agentexec pipeline but reusing the kernel's own machinery.

- run/phases.go: runPhases / runOnePhase. Phases run sequentially; each is a
  fresh agent loop (or a bare LLM call for IsRunFunc phases) with its own
  template-expanded system prompt ({{.Query}} + {{.<PhaseName>}}), model
  tier, step cap, and tool subset. Outputs thread into later phases; the
  final phase's output is the run output. Optional phases swallow errors and
  substitute FallbackMessage; a non-optional phase that merely exhausts its
  step/tool budget salvages its partial transcript and continues (a hard
  error still aborts); per-phase tier-resolve failures fall back with a WARN.
- run/agent.go: Phase gains IsRunFunc + FallbackMessage (the kernel Phase
  struct previously omitted them).
- run/executor.go: Run factors the shared agent options (tool-error limits,
  step observer, compactor) and branches — single loop (critic's dynamic
  step ceiling) vs the phase runner (fixed per-phase caps; the run-level
  critic's steer + hard deadline still apply across phases). systemPrompt
  now delegates to systemPromptWithBody so each phase keeps the platform
  header. The same step observer feeds audit/steps/critic across all phases.

Tests (run/phases_test.go): sequential output threading + template
expansion, Optional-failure → FallbackMessage continues, hard-error abort,
IsRunFunc bare call, per-phase SystemHeader, filterToolbox subset, template
expansion. Full ./... suite green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

🪰 Gadfly — live review status

7/7 reviewers finished · updated 2026-06-29 19:29:03Z

claude-code/opus · claude-code — done

  • security — No material issues found
  • correctness — Blocking issues found
  • maintainability — Minor issues
  • performance — No material issues found
  • error-handling — Minor issues

claude-code/opus:max · claude-code — done

  • security — No material issues found
  • correctness — Minor issues
  • maintainability — Minor issues
  • performance — No material issues found
  • error-handling — Blocking issues found

deepseek-v4-pro:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — Minor issues
  • maintainability — Minor issues
  • performance — No material issues found
  • error-handling — Minor issues

devstral-2:123b-cloud · ollama-cloud — done

  • security — Minor issues
  • correctness — Minor issues
  • maintainability — Minor issues
  • performance — Minor issues
  • error-handling — Minor issues

glm-5.2:cloud · ollama-cloud — done

  • security — Minor issues
  • correctness — Minor issues
  • maintainability — Minor issues
  • performance — No material issues found
  • error-handling — Minor issues

kimi-k2.6:cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — Blocking issues found
  • maintainability — Minor issues
  • performance — Blocking issues found
  • error-handling — Minor issues

qwen3.5:397b-cloud · ollama-cloud — done

  • security — No material issues found
  • correctness — No material issues found
  • maintainability — Minor issues
  • performance — Minor issues
  • error-handling — No material issues found

Live status board. Findings are posted in each model's own comment. Advisory only — does not block merge.

<!-- gadfly-status-board --> ## 🪰 Gadfly — live review status 7/7 reviewers finished · updated 2026-06-29 19:29:03Z #### `claude-code/opus` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Blocking issues found - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `claude-code/opus:max` · claude-code — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Blocking issues found #### `deepseek-v4-pro:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `devstral-2:123b-cloud` · ollama-cloud — ✅ done - ✅ **security** — Minor issues - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — Minor issues - ✅ **error-handling** — Minor issues #### `glm-5.2:cloud` · ollama-cloud — ✅ done - ✅ **security** — Minor issues - ✅ **correctness** — Minor issues - ✅ **maintainability** — Minor issues - ✅ **performance** — No material issues found - ✅ **error-handling** — Minor issues #### `kimi-k2.6:cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — Blocking issues found - ✅ **maintainability** — Minor issues - ✅ **performance** — Blocking issues found - ✅ **error-handling** — Minor issues #### `qwen3.5:397b-cloud` · ollama-cloud — ✅ done - ✅ **security** — No material issues found - ✅ **correctness** — No material issues found - ✅ **maintainability** — Minor issues - ✅ **performance** — Minor issues - ✅ **error-handling** — No material issues found <sub>Live status board. Findings are posted in each model's own comment. Advisory only — does not block merge.</sub>
gitea-actions bot reviewed 2026-06-29 19:29:03 +00:00
gitea-actions bot left a comment

🪰 Gadfly consensus review — 19 inline findings on changed lines. See the consensus comment for the full ranked summary.

Advisory only — does not block merge.

<!-- gadfly-inline-review --> 🪰 **Gadfly consensus review** — 19 inline findings on changed lines. See the consensus comment for the full ranked summary. <sub>Advisory only — does not block merge.</sub>
run/phases.go Outdated
@@ -0,0 +43,4 @@
// critic's dynamic iteration ceiling (per-phase caps are fixed at phase start —
// the run-level critic's steer + hard deadline still apply across phases).
// phaseDeps carries the per-run state the phase runner shares with Run: the base

🟡 Redundant comments in phaseDeps

maintainability · flagged by 1 model

  1. Redundant Comments - Location: run/phases.go:46-58 - Issue: The phaseDeps struct has a comment that explains its purpose, but the comment is somewhat redundant with the struct's name and fields. The comment could be more concise. - Suggestion: Shorten the comment to focus on the unique aspects of phaseDeps that aren't immediately obvious from the struct definition.

🪰 Gadfly · advisory

🟡 **Redundant comments in phaseDeps** _maintainability · flagged by 1 model_ 3. **Redundant Comments** - **Location**: `run/phases.go:46-58` - **Issue**: The `phaseDeps` struct has a comment that explains its purpose, but the comment is somewhat redundant with the struct's name and fields. The comment could be more concise. - **Suggestion**: Shorten the comment to focus on the unique aspects of `phaseDeps` that aren't immediately obvious from the struct definition. <sub>🪰 Gadfly · advisory</sub>
run/phases.go Outdated
@@ -0,0 +49,4 @@
// the critic/session steer, and the audit recorder (phase events).
type phaseDeps struct {
baseModel llm.Model
baseTier string

🟠 Dead field baseTier in phaseDeps struct — populated but never read

maintainability · flagged by 4 models

  • phaseDeps.baseTier is dead code (run/phases.go:52). The field is populated in executor.go:348 (baseTier: tier) but never read anywhere in the package. grep confirms deps.baseTier has zero usages. This is a dead struct field that will confuse future readers. Remove it from the struct and the construction site.

🪰 Gadfly · advisory

🟠 **Dead field baseTier in phaseDeps struct — populated but never read** _maintainability · flagged by 4 models_ - **`phaseDeps.baseTier` is dead code** (`run/phases.go:52`). The field is populated in `executor.go:348` (`baseTier: tier`) but never read anywhere in the package. `grep` confirms `deps.baseTier` has zero usages. This is a dead struct field that will confuse future readers. Remove it from the struct and the construction site. <sub>🪰 Gadfly · advisory</sub>
run/phases.go Outdated
@@ -0,0 +61,4 @@
// whose Output is the final phase's output, with Usage aggregated across phases
// and Messages set to the last phase's transcript (for the PostRun hook). A hard
// (non-optional, non-budget) phase failure returns the error.
func (e *Executor) runPhases(runCtx context.Context, ra RunnableAgent, deps phaseDeps, query string, images []llm.ImagePart) (*agent.Result, error) {

🟡 Long function runPhases

maintainability · flagged by 1 model

  1. Long Function in runPhases - Location: run/phases.go:64-140 - Issue: The runPhases function is quite long and handles multiple responsibilities, including iterating over phases, handling errors, and managing outputs. This makes it harder to read and maintain. - Suggestion: Break this function into smaller, focused functions. For example, separate the error handling logic into a dedicated function.

🪰 Gadfly · advisory

🟡 **Long function runPhases** _maintainability · flagged by 1 model_ 1. **Long Function in `runPhases`** - **Location**: `run/phases.go:64-140` - **Issue**: The `runPhases` function is quite long and handles multiple responsibilities, including iterating over phases, handling errors, and managing outputs. This makes it harder to read and maintain. - **Suggestion**: Break this function into smaller, focused functions. For example, separate the error handling logic into a dedicated function. <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +73,4 @@
return lastResult, err
}
instructions := expandPhaseTemplate(phase.SystemPrompt, query, outputs)

🟠 Untrusted model outputs (prior phase results) are injected into later phases' system prompts via {{.}}, creating a prompt-injection escalation path where adversarial content gains system-prompt privilege.

security · flagged by 1 model

  • Untrusted model outputs are injected into later phases' system prompts (run/phases.go:76, run/phases.go:122, run/phases.go:144, run/phases.go:262-275)expandPhaseTemplate substitutes prior-phase outputs (model-generated, influenced by adversarial query/tool content) via {{.<PhaseName>}}; the expanded instructions become the next phase's System: prompt (systemPromptWithBodyagent.New(model, system, …) / model.Generate{System: system}). In the single-loop path model…

🪰 Gadfly · advisory

🟠 **Untrusted model outputs (prior phase results) are injected into later phases' system prompts via {{.<PhaseName>}}, creating a prompt-injection escalation path where adversarial content gains system-prompt privilege.** _security · flagged by 1 model_ - **Untrusted model outputs are injected into later phases' system prompts (`run/phases.go:76`, `run/phases.go:122`, `run/phases.go:144`, `run/phases.go:262-275`)** — `expandPhaseTemplate` substitutes prior-phase outputs (model-generated, influenced by adversarial query/tool content) via `{{.<PhaseName>}}`; the expanded `instructions` become the next phase's `System:` prompt (`systemPromptWithBody` → `agent.New(model, system, …)` / `model.Generate{System: system}`). In the single-loop path model… <sub>🪰 Gadfly · advisory</sub>
run/phases.go Outdated
@@ -0,0 +86,4 @@
if err != nil {
isLast := i == len(ra.Phases)-1
switch {
case phase.Optional:

🔴 Optional-phase branch swallows context cancellation/deadline/critic-kill; a killed/timed-out run with an Optional final phase reports status "ok" with a fallback message

correctness, error-handling · flagged by 3 models

1. Optional phase handling swallows context cancellation / deadline / critic-kill — a killed or timed-out run whose final phase is Optional is reported as a successful runrun/phases.go:89

🪰 Gadfly · advisory

🔴 **Optional-phase branch swallows context cancellation/deadline/critic-kill; a killed/timed-out run with an Optional final phase reports status "ok" with a fallback message** _correctness, error-handling · flagged by 3 models_ **1. `Optional` phase handling swallows context cancellation / deadline / critic-kill — a killed or timed-out run whose final phase is `Optional` is reported as a successful run** — `run/phases.go:89` <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +97,4 @@
deps.rec.LogEvent("phase_failed_optional", map[string]any{"phase": phase.Name, "error": err.Error()})
}
case isPhaseBudgetExhaustion(err) && (!isLast || strings.TrimSpace(output) != ""):

🟠 Last non-optional phase exhausting its budget with empty salvage aborts the run with a hard error and empty output, discarding all prior phases' completed work — contradicting the documented 'NOT fatal / never discards earlier work' semantics

correctness, error-handling, maintainability, performance · flagged by 3 models

  • run/phases.go:100-118 — a final non-optional phase that exhausts its step/tool budget with no salvageable narrated text aborts the whole run with a hard error and empty output, discarding all earlier phases' completed work — contradicting the file's own documented "NOT fatal" guarantee. Confirmed by tracing the switch at lines 88–119. runOnePhase (lines 178–190) only sets output to a non-empty value when salvagePhaseTranscript(res) finds assistant prose; otherwise output stays `"…

🪰 Gadfly · advisory

🟠 **Last non-optional phase exhausting its budget with empty salvage aborts the run with a hard error and empty output, discarding all prior phases' completed work — contradicting the documented 'NOT fatal / never discards earlier work' semantics** _correctness, error-handling, maintainability, performance · flagged by 3 models_ - **`run/phases.go:100-118` — a final non-optional phase that exhausts its step/tool budget with no salvageable narrated text aborts the whole run with a hard error and empty output, discarding all earlier phases' completed work — contradicting the file's own documented "NOT fatal" guarantee.** Confirmed by tracing the switch at lines 88–119. `runOnePhase` (lines 178–190) only sets `output` to a non-empty value when `salvagePhaseTranscript(res)` finds assistant prose; otherwise `output` stays `"… <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +115,4 @@
}
default:
return lastResult, fmt.Errorf("pipeline phase %q: %w", phase.Name, err)

🟡 totalUsage not applied to lastResult on hard-error return; prior phases' token usage lost from run result

correctness · flagged by 1 model

  • run/phases.go:118totalUsage not applied to lastResult on hard-error return (verified by tracing lines 82-84, 118, and executor.go:358-360). When a non-optional phase fails hard and runOnePhase returns a non-nil partial res, line 83 overwrites lastResult = res and line 84 accumulates totalUsage correctly. But the error return on line 118 sends lastResult as-is — its .Usage field still holds only the failed phase's partial usage, not the accumulated totalUsage that i…

🪰 Gadfly · advisory

🟡 **totalUsage not applied to lastResult on hard-error return; prior phases' token usage lost from run result** _correctness · flagged by 1 model_ - **`run/phases.go:118` — `totalUsage` not applied to `lastResult` on hard-error return** (verified by tracing lines 82-84, 118, and `executor.go:358-360`). When a non-optional phase fails hard and `runOnePhase` returns a non-nil partial `res`, line 83 overwrites `lastResult = res` and line 84 accumulates `totalUsage` correctly. But the error return on line 118 sends `lastResult` as-is — its `.Usage` field still holds only the failed phase's partial usage, not the accumulated `totalUsage` that i… <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +125,4 @@
// Synthesize the run result: the final phase's output, usage aggregated over
// all phases, and the last phase's transcript for the PostRun hook.
if lastResult == nil {

🟡 When the last phase fails as Optional, synthesized lastResult carries a prior phase's transcript for the PostRun hook

error-handling · flagged by 1 model

  • run/phases.go:128 runPhases — final synthesized lastResult may carry a prior phase's transcript when the last phase fails optionally. When the last phase is Optional and fails with a nil res, lastResult is not updated and retains the previous phase's agent.Result; the code then overwrites Output/Usage with the fallback/accumulated usage, but Messages remains the prior phase's transcript (passed to the PostRun hook). The doc says "last phase's transcript for the PostRu…

🪰 Gadfly · advisory

🟡 **When the last phase fails as Optional, synthesized lastResult carries a prior phase's transcript for the PostRun hook** _error-handling · flagged by 1 model_ - **`run/phases.go:128` `runPhases` — final synthesized `lastResult` may carry a *prior* phase's transcript when the last phase fails optionally.** When the last phase is `Optional` and fails with a nil `res`, `lastResult` is not updated and retains the previous phase's `agent.Result`; the code then overwrites `Output`/`Usage` with the fallback/accumulated usage, but `Messages` remains the prior phase's transcript (passed to the PostRun hook). The doc says "last phase's transcript for the PostRu… <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +140,4 @@
func (e *Executor) runOnePhase(runCtx context.Context, ra RunnableAgent, deps phaseDeps, phase Phase, instructions, query string, images []llm.ImagePart) (string, *agent.Result, error) {
model := e.phaseModel(runCtx, deps, ra, phase)
// The phase's expanded instructions are the system prompt (with the platform
// header so tools keep their run ids); the original query is the user message.

🟡 Inconsistent error handling in runOnePhase

maintainability · flagged by 1 model

  1. Inconsistent Error Handling in runOnePhase - Location: run/phases.go:143-200 - Issue: The runOnePhase function handles errors in a nested manner, which can make it difficult to follow the flow of execution. The error handling logic is intertwined with the main logic. - Suggestion: Extract error handling into separate functions or use a more structured approach to improve readability.

🪰 Gadfly · advisory

🟡 **Inconsistent error handling in runOnePhase** _maintainability · flagged by 1 model_ 4. **Inconsistent Error Handling in `runOnePhase`** - **Location**: `run/phases.go:143-200` - **Issue**: The `runOnePhase` function handles errors in a nested manner, which can make it difficult to follow the flow of execution. The error handling logic is intertwined with the main logic. - **Suggestion**: Extract error handling into separate functions or use a more structured approach to improve readability. <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +152,4 @@
return "", nil, fmt.Errorf("phase %q model call: %w", phase.Name, err)
}
if deps.rec != nil {
deps.rec.OnStep(1, resp)

🟠 IsRunFunc phases bypass shared stepObserver; critic clock not refreshed, steps missing from emitter

correctness · flagged by 2 models

🪰 Gadfly · advisory

🟠 **IsRunFunc phases bypass shared stepObserver; critic clock not refreshed, steps missing from emitter** _correctness · flagged by 2 models_ <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +183,4 @@
// Budget/guard exhaustion leaves a usable partial transcript but an empty
// final answer; salvage the narrated work so the pipeline can carry it forward.
if runErr != nil && isPhaseBudgetExhaustion(runErr) {
if salvaged := salvagePhaseTranscript(res); salvaged != "" {

🟠 Missing nil check for res.Steps

error-handling · flagged by 1 model

  1. Finding at run/phases.go:186: The draft claims that salvagePhaseTranscript assumes res.Steps is non-nil without an explicit check. Looking at the code, the function salvagePhaseTranscript does iterate over res.Steps without a nil check. This is a valid concern, and the finding should be kept.

🪰 Gadfly · advisory

🟠 **Missing nil check for res.Steps** _error-handling · flagged by 1 model_ 2. **Finding at `run/phases.go:186`**: The draft claims that `salvagePhaseTranscript` assumes `res.Steps` is non-nil without an explicit check. Looking at the code, the function `salvagePhaseTranscript` does iterate over `res.Steps` without a nil check. This is a valid concern, and the finding should be kept. <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +196,4 @@
if phase.ModelTier == "" {
return deps.baseModel
}
_, m, err := e.cfg.Models(ctx, phase.ModelTier)

🟠 phaseModel discards the resolver's enriched usage-attribution context, so non-base-tier phases can be mis-attributed when resp.Model is empty (single-loop path adopts it via ctx = modelCtx)

correctness, error-handling · flagged by 2 models

2. Per-phase model resolution discards the enriched usage-attribution contextrun/phases.go:199 (phaseModel: _, m, err := e.cfg.Models(ctx, phase.ModelTier))

🪰 Gadfly · advisory

🟠 **phaseModel discards the resolver's enriched usage-attribution context, so non-base-tier phases can be mis-attributed when resp.Model is empty (single-loop path adopts it via ctx = modelCtx)** _correctness, error-handling · flagged by 2 models_ **2. Per-phase model resolution discards the enriched usage-attribution context** — `run/phases.go:199` (`phaseModel`: `_, m, err := e.cfg.Models(ctx, phase.ModelTier)`) <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +209,4 @@
// loop hit its step cap or tripped a tool-error guard) — which leaves a usable
// partial transcript — as opposed to a hard error (cancellation, model failure).
func isPhaseBudgetExhaustion(err error) bool {
return errors.Is(err, agent.ErrMaxSteps) || errors.Is(err, agent.ErrToolLoop)

🟡 run/phases.go:212

correctness · flagged by 1 model

  • run/phases.go:212: The isPhaseBudgetExhaustion function checks for agent.ErrMaxSteps and agent.ErrToolLoop, but these errors are not defined in the repository. This could lead to incorrect behavior if the errors are not properly defined or imported. Verified by searching for the error definitions in the repository and finding no matches.

🪰 Gadfly · advisory

🟡 **`run/phases.go:212`** _correctness · flagged by 1 model_ - **`run/phases.go:212`**: The `isPhaseBudgetExhaustion` function checks for `agent.ErrMaxSteps` and `agent.ErrToolLoop`, but these errors are not defined in the repository. This could lead to incorrect behavior if the errors are not properly defined or imported. **Verified** by searching for the error definitions in the repository and finding no matches. <sub>🪰 Gadfly · advisory</sub>
run/phases.go Outdated
@@ -0,0 +232,4 @@
}
}
out := strings.TrimSpace(b.String())
const maxSalvage = 8000

🔴 salvagePhaseTranscript truncates by bytes, corrupting multi-byte UTF-8 runes

correctness, error-handling, maintainability, performance · flagged by 7 models

  • run/phases.go:236-238salvagePhaseTranscript truncates by byte length (len(out) > maxSalvage and out[len(out)-maxSalvage:]). For any multi-byte UTF-8 text this slices through runes, producing invalid UTF-8 and garbled output fed into downstream phases. The limit should count runes (utf8.RuneCountInString) and slice on a rune boundary.

🪰 Gadfly · advisory

🔴 **salvagePhaseTranscript truncates by bytes, corrupting multi-byte UTF-8 runes** _correctness, error-handling, maintainability, performance · flagged by 7 models_ - **`run/phases.go:236-238`** — `salvagePhaseTranscript` truncates by **byte length** (`len(out) > maxSalvage` and `out[len(out)-maxSalvage:]`). For any multi-byte UTF-8 text this slices through runes, producing invalid UTF-8 and garbled output fed into downstream phases. The limit should count runes (`utf8.RuneCountInString`) and slice on a rune boundary. <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +241,4 @@
// phaseUserMessage builds a phase's user message: the original query text plus
// any inline images. Mirrors the single-loop multimodal seeding in runAgent.
func phaseUserMessage(query string, images []llm.ImagePart) llm.Message {

🟠 phaseUserMessage duplicates runAgent's multimodal message construction — extract shared helper

maintainability · flagged by 5 models

  • phaseUserMessage duplicates runAgent's multimodal message logic (run/phases.go:244–256 vs run/executor.go:491–506). Both functions build a llm.Message from query + images with the same pattern: no images → simple path, otherwise build []llm.Part with optional text + image parts. The comment on phaseUserMessage even acknowledges it "mirrors" runAgent. Extract a shared helper (e.g., func userMessage(query string, images []llm.ImagePart) llm.Message) so the two call sites…

🪰 Gadfly · advisory

🟠 **phaseUserMessage duplicates runAgent's multimodal message construction — extract shared helper** _maintainability · flagged by 5 models_ - **`phaseUserMessage` duplicates `runAgent`'s multimodal message logic** (`run/phases.go:244–256` vs `run/executor.go:491–506`). Both functions build a `llm.Message` from `query + images` with the same pattern: no images → simple path, otherwise build `[]llm.Part` with optional text + image parts. The comment on `phaseUserMessage` even acknowledges it "mirrors" `runAgent`. Extract a shared helper (e.g., `func userMessage(query string, images []llm.ImagePart) llm.Message`) so the two call sites… <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +257,4 @@
// expandPhaseTemplate applies Go text/template substitution to a phase prompt,
// replacing {{.Query}} with the original query and {{.<PhaseName>}} with a prior
// phase's output. Returns the original string unchanged if parsing or execution

🔴 expandPhaseTemplate silently fails for phase names with hyphens or non-identifier characters

correctness, error-handling, maintainability, performance, security · flagged by 6 models

  • run/phases.go:262-275expandPhaseTemplate builds text/template variables as {{.<PhaseName>}} (line 269: data[k] = v). Go text/template identifiers may contain only letters, digits, and underscores. Phase names with hyphens (deep-research), digits at the start, spaces, etc. parse as invalid expressions or fail execution silently, causing the template to return unexpanded — downstream phases receive literal {{.deep-research}} instead of the actual prior output. This con…

🪰 Gadfly · advisory

🔴 **expandPhaseTemplate silently fails for phase names with hyphens or non-identifier characters** _correctness, error-handling, maintainability, performance, security · flagged by 6 models_ - **`run/phases.go:262-275`** — `expandPhaseTemplate` builds `text/template` variables as `{{.<PhaseName>}}` (line 269: `data[k] = v`). Go `text/template` identifiers may contain only letters, digits, and underscores. Phase names with hyphens (`deep-research`), digits at the start, spaces, etc. parse as invalid expressions or fail execution silently, causing the template to return **unexpanded** — downstream phases receive literal `{{.deep-research}}` instead of the actual prior output. This con… <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +264,4 @@
if err != nil {
return tmpl
}
data := map[string]string{"Query": query}

🟡 Unsanitized user input in template data could lead to XSS or injection

security · flagged by 1 model

Finding 2: Unsanitized User Input in Template Data (run/phases.go:267-270) - Verification: The query and priorOutputs values are directly passed into the template data map without sanitization. - Impact: Confirmed. This could lead to XSS or other injection issues if the output is rendered in a web context. - Conclusion: This finding is valid.

🪰 Gadfly · advisory

🟡 **Unsanitized user input in template data could lead to XSS or injection** _security · flagged by 1 model_ ### Finding 2: Unsanitized User Input in Template Data (`run/phases.go:267-270`) - **Verification**: The `query` and `priorOutputs` values are directly passed into the template data map without sanitization. - **Impact**: Confirmed. This could lead to XSS or other injection issues if the output is rendered in a web context. - **Conclusion**: This finding is valid. <sub>🪰 Gadfly · advisory</sub>
@@ -0,0 +280,4 @@
// a WARN — a typo'd phase tool list should not abort a run mid-pipeline.
func filterToolbox(box *llm.Toolbox, names []string) *llm.Toolbox {
out := llm.NewToolbox(box.Name())
if len(names) == 0 {

🟠 filterToolbox copies entire toolbox when no filtering is requested

correctness, performance · flagged by 3 models

  • filterToolbox copies entire toolbox when no filtering is requestedrun/phases.go:283-287 When a phase does not specify a tool subset (len(names) == 0), filterToolbox still allocates a new llm.Toolbox and copies every tool into it. If the base toolbox is large and the pipeline has many phases, this is unnecessary allocation and Add overhead on every phase of every run. The single-loop path already passes the original toolbox pointer directly without copying. Fix: Return…

🪰 Gadfly · advisory

🟠 **filterToolbox copies entire toolbox when no filtering is requested** _correctness, performance · flagged by 3 models_ * **`filterToolbox` copies entire toolbox when no filtering is requested** — `run/phases.go:283-287` When a phase does not specify a tool subset (`len(names) == 0`), `filterToolbox` still allocates a new `llm.Toolbox` and copies every tool into it. If the base toolbox is large and the pipeline has many phases, this is unnecessary allocation and `Add` overhead on every phase of every run. The single-loop path already passes the original `toolbox` pointer directly without copying. **Fix:** Return… <sub>🪰 Gadfly · advisory</sub>
run/phases.go Outdated
@@ -0,0 +301,4 @@
// addUsage sums two llm.Usage tallies field-by-field so a phased run reports the
// total tokens across all phases.
func addUsage(a, b llm.Usage) llm.Usage {

🟠 addUsage manually sums llm.Usage fields, brittle to upstream changes

maintainability · flagged by 2 models

  • run/phases.go:304-311addUsage manually accumulates every field of llm.Usage. If the upstream struct gains a new field (e.g. a future cost or token category), this function silently under-counts. Prefer an Add method on llm.Usage (if majordomo doesn't provide one, add it there) or generate this with a simple reflection helper; at minimum add a compile-time assertion that verifies field count.

🪰 Gadfly · advisory

🟠 **addUsage manually sums llm.Usage fields, brittle to upstream changes** _maintainability · flagged by 2 models_ - **`run/phases.go:304-311`** — `addUsage` manually accumulates every field of `llm.Usage`. If the upstream struct gains a new field (e.g. a future cost or token category), this function silently under-counts. Prefer an `Add` method on `llm.Usage` (if majordomo doesn't provide one, add it there) or generate this with a simple reflection helper; at minimum add a compile-time assertion that verifies field count. <sub>🪰 Gadfly · advisory</sub>

🪰 Gadfly review — consensus across 7 models

Verdict: Blocking issues found · 21 findings (10 with multi-model agreement)

Finding Where Models Lens
🔴 salvagePhaseTranscript truncates by bytes, corrupting multi-byte UTF-8 runes run/phases.go:235 7/7 correctness, error-handling, maintainability, performance
🔴 expandPhaseTemplate silently fails for phase names with hyphens or non-identifier characters run/phases.go:260 6/7 correctness, error-handling, maintainability, performance, security
🟠 phaseUserMessage duplicates runAgent's multimodal message construction — extract shared helper run/phases.go:244 5/7 maintainability
🟠 Dead field baseTier in phaseDeps struct — populated but never read run/phases.go:52 4/7 maintainability
🔴 Optional-phase branch swallows context cancellation/deadline/critic-kill; a killed/timed-out run with an Optional final phase reports status "ok" with a fallback message run/phases.go:89 3/7 correctness, error-handling
🟠 Last non-optional phase exhausting its budget with empty salvage aborts the run with a hard error and empty output, discarding all prior phases' completed work — contradicting the documented 'NOT fatal / never discards earlier work' semantics run/phases.go:100 3/7 correctness, error-handling, maintainability, performance
🟠 filterToolbox copies entire toolbox when no filtering is requested run/phases.go:283 3/7 correctness, performance
🟠 IsRunFunc phases bypass shared stepObserver; critic clock not refreshed, steps missing from emitter run/phases.go:155 2/7 correctness
🟠 phaseModel discards the resolver's enriched usage-attribution context, so non-base-tier phases can be mis-attributed when resp.Model is empty (single-loop path adopts it via ctx = modelCtx) run/phases.go:199 2/7 correctness, error-handling
🟠 addUsage manually sums llm.Usage fields, brittle to upstream changes run/phases.go:304 2/7 maintainability
11 single-model findings (lower confidence)
Finding Where Model Lens
🟠 Untrusted model outputs (prior phase results) are injected into later phases' system prompts via {{.}}, creating a prompt-injection escalation path where adversarial content gains system-prompt privilege. run/phases.go:76 glm-5.2:cloud security
🟠 Missing nil check for res.Steps run/phases.go:186 devstral-2:123b-cloud error-handling
🟡 Lack of input validation for SystemPrompt field in Phase struct run/agent.go:64 devstral-2:123b-cloud security
🟡 Redundant comments in phaseDeps run/phases.go:46 devstral-2:123b-cloud maintainability
🟡 Long function runPhases run/phases.go:64 devstral-2:123b-cloud maintainability
🟡 totalUsage not applied to lastResult on hard-error return; prior phases' token usage lost from run result run/phases.go:118 deepseek-v4-pro:cloud correctness
🟡 When the last phase fails as Optional, synthesized lastResult carries a prior phase's transcript for the PostRun hook run/phases.go:128 glm-5.2:cloud error-handling
🟡 Inconsistent error handling in runOnePhase run/phases.go:143 devstral-2:123b-cloud maintainability
🟡 'run/phases.go:212' run/phases.go:212 devstral-2:123b-cloud correctness
🟡 Unsanitized user input in template data could lead to XSS or injection run/phases.go:267 devstral-2:123b-cloud security
Live iteration counter regresses each phase (SetIteration stores, not max); RunInfo.MaxIterations understates phased runs — observability only run/executor.go:267 claude-code/opus correctness
Per-model detail
claude-code/opus (claude-code) — Blocking issues found

Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

I've verified the key security-relevant mechanics. Let me confirm one detail about the data-vs-template-text boundary, since it's the crux of the injection question.

The findings:

  • expandPhaseTemplate (run/phases.go:262) uses text/template where the template text is phase.SystemPrompt (host/agent config — trusted) and the data is {Query, prior-phase-outputs} (untrusted). Go's text/template inserts data values as literals; it does not re-parse them as templates. So a query containing {{.Secret}} is rendered as literal text, not evaluated — no server-side template injection / RCE. Verified against the package usage (run/phases.go:10, only text/template, never html/template, and no other call site feeds untrusted strings as template text).
  • filterToolbox (run/phases.go:289) copies from deps.baseToolbox, which is the already-permission-decorated toolbox built once via Registry.Build(..., inv, Visibility("private"), ...) (run/executor.go:191) plus delegation/extra/session tools. Phases can only narrow this set (box.Get(name) over the existing palette) — they cannot add a tool that wasn't already authorized for this invocation, and an unknown name is skipped, not synthesized. No authz escalation across phases. Verified.
  • phaseModel (run/phases.go:218) resolves phase.ModelTier from config, falling back to the base model. Tier strings are host config, not request-tainted. No concern.
  • No command/path/SQL construction, no network/SSRF surface, no deserialization. Error strings reach audit LogEvent/slog.Warn only (phase name + err.Error()), not phase outputs/secrets.

One thing genuinely in-lane worth flagging, though it is a design property rather than a coding defect:

  • Untrusted input is elevated into the system-prompt role. In runOnePhase (run/phases.go:160-165), the expanded instructions — which for any phase whose SystemPrompt contains {{.Query}} or {{.<priorPhase>}} now embed the raw user query and prior (LLM-generated, therefore untrusted) phase outputs — become the system message (system := e.systemPromptWithBody(instructions)), while the single-loop path keeps user input confined to the user message. This widens the prompt-injection surface: untrusted text gains system-role authority across phases. It's opt-in (the template author must write {{.Query}}) and ported intentionally from mort, so it's not a regression I'd block on, but hosts running phased agents on untrusted input should be aware that phase-to-phase output threading carries any injected instructions forward at system authority. Mitigation if desired: keep {{.Query}} substitution in the user message and reserve the system body for static instructions.

Net: the injection/authz/secret surfaces are clean; the only security-relevant note is the inherent prompt-injection amplification of threading untrusted content through system prompts, which is by design.

VERDICT: No material issues found

🎯 Correctness — Blocking issues found

I've verified all three findings against the actual code. Let me confirm my analysis:

Finding 1 — Confirmed. In runPhases, the loop-top guard (phases.go:72) only protects the start of the next phase. The case phase.Optional: (phases.go:89) is checked first and swallows any error, including context.Canceled/DeadlineExceeded/critic-kill. isPhaseBudgetExhaustion (phases.go:212) only matches ErrMaxSteps/ErrToolLoop, so a ctx error falls into the Optional branch. When the Optional phase is the last one, no subsequent iteration's top-guard catches it; runPhases returns nil error. statusFor (executor.go:391) checks runErr == nil first → returns "ok" even when context.Cause(runCtx) is ErrCriticKill. Real bug.

Finding 2 — Confirmed as a deviation. phaseModel (phases.go:199) discards the resolver's enriched ctx with _; the single-loop path adopts it (executor.go:159). Impact is narrower than it could be: recordUsage/resolvedModelName (call.go:235-268) attribute primarily by resp.Model, falling back to ctx attribution only when resp.Model is empty — so mis-attribution only manifests in that fallback path. The draft's low confidence is appropriate; the deviation itself is real.

Finding 3 — Confirmed. SetIteration (runengine.go:142) is an atomic store; each phase is a fresh loop with a shared stepObserver, so the live counter regresses at phase boundaries. RunInfo.MaxIterations = base maxIter (executor.go:172). Token accounting unaffected — Writer.OnStep (writer.go:83) sums resp.Usage regardless of iter. Observability only.

All three survive. (Fixing the stale line number for finding 2 in the JSON block: the discard is at phases.go:199, not 178.)


VERDICT: Blocking issues found

Findings

1. Optional phase handling swallows context cancellation / deadline / critic-kill — a killed or timed-out run whose final phase is Optional is reported as a successful runrun/phases.go:89

The error switch checks case phase.Optional: first and unconditionally substitutes FallbackMessage for any error, including context.Canceled, context.DeadlineExceeded, and ErrCriticKill. The only cancellation guard is at the top of the loop (run/phases.go:72), which protects the start of the next phase, not a cancellation that occurs during a phase.

Trace the case where the final phase is Optional and the run is killed/timed-out mid-phase:

  • runOnePhaserunAgent returns (res, ctxErr) where errors.Is(ctxErr, context.Canceled) (or DeadlineExceeded).
  • isPhaseBudgetExhaustion(ctxErr) is false (phases.go:212 — only matches ErrMaxSteps/ErrToolLoop), so cancellation never hits the budget branch.
  • case phase.Optional matches → error swallowed, output = FallbackMessage, loop ends (it's the last phase).
  • runPhases returns (synthesized result, nil).

Back in Run (run/executor.go:357), statusFor(runCtx, nil) returns "ok" — it checks runErr == nil before consulting the cancellation cause, so even an ErrCriticKill cause is ignored — and res.Output is the fallback string. A run the critic killed (or that blew its MaxRuntime) is recorded as a successful run delivering a canned "phase encountered an error" message, instead of status killed/timeout. The IsRunFunc path has the same exposure (its error wraps the ctx error at run/phases.go:152).

Fix: before the Optional/budget branches, bail on a genuine context error, e.g.

if runCtx.Err() != nil || errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
    return lastResult, err
}

so cancellation/deadline/kill always aborts and surfaces with the correct status.

2. Per-phase model resolution discards the enriched usage-attribution contextrun/phases.go:199 (phaseModel: _, m, err := e.cfg.Models(ctx, phase.ModelTier))

The single-loop path adopts the resolver's returned context (ctx = modelCtx, run/executor.go:159) — its documented purpose is "enriches ctx for usage attribution" (executor.go:19-21). phaseModel resolves a phase's tier but throws the enriched ctx away with _, so a phase running on a non-base tier executes under the base tier's attribution context while using the other tier's model.

Confidence is low — usage attribution (model/call.go:235-268) keys primarily off resp.Model (the serving model name), and only falls back to the context attribution set by ParseModelForContext when resp.Model is empty. So mis-attribution only manifests for responses that don't name their serving model. It is nonetheless a clear, unintended deviation from the established single-loop pattern introduced by this change.

3. (Minor) Live iteration counter regresses across phasesrun/executor.go:267 (stateAcc.SetIteration(s.Index))

Each phase is a fresh agent loop whose s.Index restarts, and SetIteration does an atomic store, not a max/accumulate (RunStateAccessor, run/runengine.go:142). A self-status/progress consumer watching a phased run sees the iteration count jump backwards at each phase boundary, and RunInfo.MaxIterations (set to the base maxIter, executor.go:172) understates a phased run whose phases can each consume up to maxIter steps. Token accounting is not affected — the audit Writer sums resp.Usage per OnStep regardless of iter (audit/writer.go:83-110), so duplicate indices across phases don't corrupt totals. Observability only.

🧹 Code cleanliness & maintainability — Minor issues

Both findings verified against the actual source. Outputting the corrected review.

VERDICT: Minor issues

This is a clean, well-commented port. The structure (one runPhases orchestrator + runOnePhase worker + small pure helpers) reads well, the doc comments are thorough and honest about what was/wasn't carried over, and the sharedOpts refactor in executor.go cleanly factors the common agent options without duplicating them per path. Two minor maintainability nits:

  • run/phases.go:52 — dead struct field baseTier. phaseDeps.baseTier is populated (executor.go:348 baseTier: tier) and documented ("the base model + tier"), but no code in the package ever reads it — phaseModel falls back to deps.baseModel, never a tier. Verified by grep -rn baseTier run/: only the declaration and the assignment, no read. A field that looks load-bearing but isn't will mislead the next maintainer (e.g. someone wiring per-phase compaction thresholds will assume the base tier is already threaded here). Drop the field and the baseTier: tier line, or actually use it (the compaction threshold is computed once from the base tier and shared across phases — if that's a known limitation, baseTier would be the natural hook, but as written it's just dead).

  • run/phases.go:244-256 vs run/executor.go:495-501 — duplicated multimodal part-building. phaseUserMessage and the image-folding block inside runAgent are byte-for-byte the same logic (make([]llm.Part, 0, len(images)+1), trim-guard the text part, append each image). The comment on phaseUserMessage even says "Mirrors the single-loop multimodal seeding in runAgent" — an explicit acknowledgment of the copy-paste. Extract a small func userParts(text string, images []llm.ImagePart) []llm.Part and have both call sites use it, so a future change to multimodal seeding (e.g. a new part type) only has to land in one place. Low-churn, removes the mirror-by-hand maintenance hazard.

Nothing else in my lane: naming is consistent with the surrounding code, the switch in runPhases is readable, helper functions are appropriately small and single-purpose, and the new test file follows the existing table/fake-provider patterns.

Performance — No material issues found

I have enough to assess the Performance lens. The phase runner is LLM-round-trip-bound, and I verified the key allocation/loop sites against the actual code in run/phases.go and run/executor.go.

VERDICT: No material issues found

Reviewed strictly through the Performance lens. The new phase runner is dominated by sequential model round-trips; the surrounding bookkeeping is all bounded by the (small, author-defined) phase count, so there's nothing I can justify as a real efficiency regression.

What I checked and cleared:

  • No per-phase toolbox re-decoration. The expensive Registry.Build (permissions/SSRF/delegation decoration) runs once in Run (executor.go:191-232); each phase only calls filterToolbox (phases.go:281), which copies already-decorated tool references into a fresh *llm.Toolbox. Cost is O(phases × tools) reference copies — negligible vs. the model calls. Good reuse, not a regression.
  • No unbounded growth across phases. Each phase is a fresh agent loop seeded only with the original query + any templated prior outputs (phases.go:149, 176-178); the prior phase's full transcript is not carried forward into the next loop's context. So context size doesn't compound phase-to-phase. The outputs map (phases.go:65) grows by exactly one entry per phase, and salvaged transcripts are tail-bounded to 8000 chars (phases.go:235-238).
  • Shared accumulators are bounded. The single stepObserver/emitter and addUsage tally across phases (executor.go:265-290, phases.go:84); total steps ≤ Σ per-phase caps, total usage is a fixed-field sum. No quadratic accumulation.
  • Template/usage helpers are cheap. expandPhaseTemplate parses one small template per phase (phases.go:262-276) and addUsage is constant-time (phases.go:304-311) — both off the hot path relative to a model call.
  • No N+1 / blocking-call surprise. phaseModel re-resolves the tier only for phases that set a non-empty ModelTier (phases.go:195-206); that's required for per-phase tiers, not redundant work.

Below-threshold, deliberately not flagged as a finding (micro-optimization): when phase.Tools is empty, filterToolbox still allocates a full copy of the base palette per phase (phases.go:283-288) rather than returning deps.baseToolbox directly — but it's a handful of pointer copies per phase, immaterial against the LLM latency.

🧯 Error handling & edge cases — Minor issues

I've read the full implementation. Let me verify each finding against the actual code.

Finding 1 — expandPhaseTemplate (lines 262-276): Confirmed on all three sub-claims by reading the function:

  • Line 263-265: parse failure returns tmpl verbatim, no log. A phase referenced as {{.deep-research}} (hyphen/space/leading-digit name) is an invalid Go template field selector → parse error → entire prompt returned unchanged with the literal {{...}} passed through.
  • Line 267 seeds data["Query"], then lines 268-270 overwrite from priorOutputs, so a phase named Query shadows the original input.
  • Lines 264-265 and 272-273 both return tmpl with no slog.Warn, unlike phaseModel (201) and filterToolbox (292, 296) which do warn. Confirmed.

Finding 2 — salvagePhaseTranscript (lines 236-238): Confirmed. out[len(out)-maxSalvage:] is a byte-offset slice on a string; a multibyte rune straddling the boundary is split. Real (cosmetic) issue.

Both findings independently reproduce against the real source and rest only on in-repo logic. Both survive.


VERDICT: Minor issues

🧯 Error handling & edge cases

  • run/phases.go:262-276expandPhaseTemplate swallows template errors silently, and several valid Phase.Names break threading invisibly. Three coupled edge cases, all silent:

    1. A phase whose Name is not a valid Go template identifier — hyphens, spaces, leading digits (deep-research, step 1, 2nd) — makes any downstream {{.deep-research}} a parse error, so the whole prompt is returned unchanged and the literal {{.deep-research}} is sent to the model. The kernel's own RunnableAgent.Phases DTO places no constraint on Name, so this is reachable from valid config.
    2. A phase named Query silently shadows the original input: data is seeded with "Query": query (line 267) and then for k,v := range priorOutputs { data[k]=v } (268-270) overwrites it, so later phases' {{.Query}} resolves to that phase's output, not the run input.
    3. Both the parse failure (264-265) and execute failure (272-273) return the template verbatim with no slog.Warn — unlike every other fallback in this file (phaseModel, filterToolbox all WARN). A misconfigured pipeline therefore degrades to "model got raw {{...}} text" with zero diagnostics.
      Impact: misconfiguration produces silently-wrong prompts, very hard to debug. Suggested fix: log a WARN on parse/execute failure, and at minimum document/validate that Name must be a bare template identifier (and reserve Query).
  • run/phases.go:236-238salvagePhaseTranscript tail-trim can split a UTF-8 rune. out[len(out)-maxSalvage:] slices at a byte offset, so a multi-byte rune straddling the 8000-byte boundary becomes an invalid leading fragment (a replacement char) in the salvaged text. Cosmetic, but trivial to fix: advance to a rune boundary, e.g. trim on utf8.RuneStart or slice on a []rune.

Everything else in this lane checked out: the runCtx.Err() guard between phases (72-74), the nil-result handling for failed bare calls (82-85, 128-130), the budget-vs-hard-error classification (isPhaseBudgetExhaustion, 211-213), the last-phase-with-empty-salvage falling through to a hard error (100, 117-118), and filterToolbox skipping unknown/duplicate tools without aborting (289-298) are all sound.

claude-code/opus:max (claude-code) — Blocking issues found

Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

I've verified the security-relevant surfaces of this change. Findings:

  • Template expansion is not an injection sink. expandPhaseTemplate (run/phases.go:266) parses phase.SystemPrompt (agent-definition config) as the template source, and injects the user query and prior phase outputs only as data values in the map[string]string. A query containing {{...}} is never re-parsed, so untrusted input can't inject template directives. It uses text/template (correct — prompt text, no HTML/XSS context). Even a fully attacker-controlled phase template can only reach Query/prior outputs and call methods on string values — text/template is not a sandbox escape and exposes no fs/env, so no info disclosure beyond the run's own data.
  • Tool permissioning is preserved across phases. toolbox is built via e.cfg.Registry.Build(... Visibility("private") ...) (run/executor.go:191), i.e. the decorated/permissioned palette. filterToolbox (run/phases.go:289) only re-adds the same tool objects returned by box.Get(name) and can only ever produce a subset — a phase cannot acquire a tool absent from the agent's palette, nor strip the permission decoration. No privilege escalation. (Empty Tools → full palette is documented, agent-author-controlled behavior, not a vuln.)
  • IsRunFunc bare call (run/phases.go:154) passes no tools at all — strictly more restrictive than the loop path; no new sink.
  • No new untrusted-input handling of secrets, no SSRF/command/path/SQL surface, no deserialization. Phase outputs threading into later prompts stays within the same run's existing trust boundary (the model already had that data).

No material security issue introduced by this change.

VERDICT: No material issues found

🎯 Correctness — Minor issues

I've confirmed both findings against the actual source.

Finding 1 (IsRunFunc bypasses critic clock): Confirmed. critic.recordStep is called only at run/executor.go:272, inside stepObserver (defined at line 265). The IsRunFunc branch in run/phases.go calls deps.rec.OnStep(1, resp) directly and never routes through stepObserver, so the critic's activity clock is not refreshed by a bare phase. critic.recordSteph.RecordStep (run/critic.go:82-86) is the sole refresh path, and the executor's deadline goroutine (run/critic.go:48-75) cancels runCtx once the (unrefreshed) hard deadline passes — a real false-kill vector when a stall-based critic is enabled. The executor comment at executor.go:343 and the phases.go header both assert the step observer feeds the critic "across every phase," which IsRunFunc phases violate. Kept.

Finding 2 (silent template swallow): Confirmed. expandPhaseTemplate returns the prompt unchanged on both Parse and Execute errors with no log, while sibling helpers filterToolbox and phaseModel both slog.Warn on their failure paths. A phase name containing -/space can't be referenced via {{.Name}} (parse error → whole prompt left unexpanded). Kept as a minor observability concern.

Both findings survive; verdict unchanged.


VERDICT: Minor issues

The core mechanics are sound. Verified against the actual code:

  • Usage aggregation is correctaddUsage (run/phases.go:304) sums exactly the five llm.Usage fields the rest of the repo references (InputTokens, OutputTokens, CacheReadTokens, CacheWriteTokens, ReasoningTokens); no field is dropped. Audit token accounting (rec.TokenStats() via OnStep) and the returned Result.Usage (via totalUsage) are independent sinks, so there's no double-count (executor.go:426).
  • Output threading, optional-fallback, hard-error-abort, and budget-salvage branches in runPhases are logically consistent with their tests, and the %w wrapping in the default branch preserves errors.Is (matches TestPhases_HardErrorAborts).
  • model.Generate returns *llm.Response, so rec.OnStep(1, resp) and resp.Message() are correctly typed; the IsRunFunc test exercises this path.

Findings:

  • run/phases.go:154-160IsRunFunc phases never refresh the critic's activity clock (possible false stall-kill). The bare-LLM branch calls deps.rec.OnStep(1, resp) directly but bypasses stepObserver, which is the only caller of critic.recordStep (run/executor.go:272, run/critic.go:82-86). So for the entire duration of a bare phase, the run-critic's last-activity timestamp stays frozen at the prior phase's value. A slow IsRunFunc phase — exactly the plan/synthesize transforms this feature targets, which can run long on a large threaded context — can be misjudged as stalled and nudged or killed. This contradicts the PR's stated invariant ("the same step observer feeds audit/steps/critic across all phases"). Impact is gated on the host enabling a stall-based critic, hence not blocking. Fix: call critic.recordStep(1, resp) alongside the rec.OnStep in the IsRunFunc branch. (Real-world kill depends on critic config — medium confidence.)

  • run/phases.go:283-296expandPhaseTemplate silently swallows parse/exec failures, masking broken threading. On any template.Parse/Execute error it returns the prompt unchanged, so the literal {{.x}} text leaks into the system prompt with no diagnostic. Two concrete ways to hit this: (a) a typo'd phase variable, and (b) a phase whose Name isn't a valid Go-template identifier (contains -, space, etc.) — then {{.<PhaseName>}} is a parse error and the whole template silently fails to expand, breaking threading for that phase. Given threading is the headline feature, a silent degradation is a correctness smell. Fix: slog.Warn on the parse/exec error (mirroring the WARN discipline used in filterToolbox and phaseModel) so a misconfigured pipeline is visible. (Medium confidence.)

Items checked and judged not bugs (documented design choices): per-phase fixed step caps vs. the critic's dynamic ceiling; last-phase budget-exhaustion with non-empty output reporting ok rather than error; missingkey=zero forward-references resolving to "". The 8000-byte tail-trim in salvagePhaseTranscript can split a UTF-8 rune but that's trivial.

🧹 Code cleanliness & maintainability — Minor issues

Both findings are confirmed against the actual code:

  • baseTier is declared at run/phases.go:52 and assigned at run/executor.go:348 (baseTier: tier), with no other matches anywhere in run/ — a genuine dead field. The phaseDeps doc comment does advertise "the base model + tier," so it's also misleading.
  • phaseUserMessage (run/phases.go:244-256) and runAgent (run/executor.go:495-501) build []llm.Part with a line-for-line identical block — same make([]llm.Part, 0, len(images)+1), same strings.TrimSpace(...) != "" guard on the text part, same image loop — differing only in the final wrap.

VERDICT: Minor issues

Through the code cleanliness & maintainability lens, the new run/phases.go is well-organized, thoroughly commented, and follows the surrounding executor conventions (nil-safe rec, slog.Warn on degrade, fmt.Errorf("…: %w") wrapping). Two genuine maintainability nits:

  • phaseDeps.baseTier is a dead fieldrun/phases.go:52, set at run/executor.go:348 (baseTier: tier) but never read anywhere in the package (grep -rn "baseTier" run/ matches only the declaration and the assignment). The phaseDeps doc comment even advertises "the base model + tier," so the comment misleads too. phaseModel resolves tiers off phase.ModelTier/deps.baseModel and never needs the base tier string. Fix: drop the field and the baseTier: line (and trim "+ tier" from the comment), or actually use it (e.g. in the resolve-failure WARN).

  • phaseUserMessage duplicates runAgent's image-foldingrun/phases.go:244-256 vs run/executor.go:495-501. The block that builds []llm.Part from a (possibly-blank) query plus inline images is line-for-line identical; the only difference is the final wrap (llm.UserParts(parts...) returned directly vs handed to agent.WithHistory). The new helper's own comment ("Mirrors the single-loop multimodal seeding in runAgent") flags the copy-paste. Fix: extract a small multimodalParts(text string, images []llm.ImagePart) []llm.Part and call it from both runAgent and phaseUserMessage, so the "omit empty text part" rule lives in one place.

Nothing blocking in this lane. The phase/loop branching in Run, the systemPromptWithBody extraction, and the error-classification switch in runPhases are all clean and readable.

Performance — No material issues found

VERDICT: No material issues found

I reviewed run/phases.go and the executor.go branch through the performance lens, verifying the actual checked-out code (read both files in full around the changed regions). The phase runner is well-bounded and I found no efficiency regression worth flagging:

  • No unbounded growth. outputs map, totalUsage, and the accumulated emitter/critic step state are all bounded by the phase count and the fixed per-phase MaxSteps caps. salvagePhaseTranscript is explicitly tail-capped at maxSalvage = 8000 (run/phases.go:235-238), and the tail slice is concatenated into a fresh string so it doesn't retain the larger backing array.
  • No N+1 / quadratic-over-data. expandPhaseTemplate rebuilds its data map per phase by copying prior outputs (run/phases.go:267-270), which is O(phases²) in map entries — but phase counts are single-digit and string values are copied by reference, so this is immaterial. No per-item DB/model fan-out inside any loop.
  • Shared machinery is genuinely shared. sharedOpts (step observer, tool-error limits, compactor) is built once in Run and reused across phases (executor.go:295-355), so the expensive setup isn't repeated per phase. The per-phase append([]agent.Option{...}, deps.sharedOpts...) allocates a small new slice per phase — trivial.

Two things I considered and am deliberately not filing as findings, because they're inherent to the phased design rather than avoidable code inefficiencies:

  • Images re-sent every phase (runOnePhase seeds images into each loop via runAgent, and phaseUserMessage re-attaches them for IsRunFunc phases — run/phases.go:149,178,244-256). For an image-bearing multi-phase run this multiplies image token/upload cost by the phase count vs. a single-loop run. But each phase is a fresh context with no shared message history, so re-seeding the query+images is structurally required; there's no cheaper rewrite without changing the design. Whether later phases need the images at all is a correctness/design question, not my lane.
  • filterToolbox clones the full toolbox even when phase.Tools is empty (run/phases.go:283-288), where the single-loop path uses the toolbox directly. This is O(tools) allocations per loop phase. With single-digit phases and a modest palette it's negligible; I could not verify Toolbox.Add's per-call cost (majordomo lives outside the readable working dir), so I won't assert a real impact.

Nothing here rises to a material performance problem.

🧯 Error handling & edge cases — Blocking issues found

I've verified both findings against the actual source in run/phases.go and run/executor.go.

Finding 1 is confirmed. In runPhases the switch at run/phases.go:88 evaluates case phase.Optional: (line 89) before any context-abort check. For a non-final phase a context cancellation is re-caught by the next iteration's runCtx.Err() guard (line 72), but the final phase has no next iteration: the Optional branch substitutes FallbackMessage, the loop ends, and runPhases returns (lastResult, nil). Back in Run, statusFor(runCtx, nil) hits its first case runErr == nil and returns "ok" (line 393) — its ErrCriticKill relabel at line 400 is never reached. The in-repo comment at line 395-399 confirms a critic kill surfaces as a plain context.Canceled run error, corroborating that the abort flows through runOnePhase as an err. Real.

Finding 2 is confirmed: out[len(out)-maxSalvage:] at line 237 slices a string on a byte index, which can split a multi-byte rune. Trivial/cosmetic.

Both survive.


VERDICT: Blocking issues found

🧯 Findings

1. (high) Optional last phase swallows context-cancellation / deadline / critic-kill → run misreported as "ok"run/phases.go:88-98

The switch in runPhases matches case phase.Optional: first and swallows err unconditionally, without distinguishing the phase's own intrinsic failure from a run-level abort (caller cancel, MaxRuntime deadline, or a critic kill).

For every phase except the last this is masked: the next iteration's top-of-loop runCtx.Err() check (run/phases.go:72) re-detects the cancellation and aborts. But when the final phase is Optional, there is no next iteration:

  • runOnePhaserunAgent returns context.Canceled (the critic kills via cancelCause(ErrCriticKill), which cancels runCtx; a deadline / caller-cancel likewise),
  • the Optional branch substitutes FallbackMessage and falls through,
  • the loop ends and runPhases returns (lastResult, nil).

Back in Run (run/executor.go:357), runErr == nil, so statusFor (run/executor.go:391-394) returns "ok" and res.Err == nil. statusFor keys entirely off runErr; its only runCtx consult is the ErrCriticKill relabel (line 400), which is never reached because the runErr == nil case wins first.

Concrete impact: a timed-out / caller-cancelled / critic-killed run whose trailing phase is Optional is reported as a successful run — audit status "ok", the FallbackMessage delivered as if it were a real answer, and a critic's kill decision silently overridden. This defeats the killed/timeout/cancelled distinction the surrounding code goes to lengths to preserve. Plausible in practice: a best-effort synthesis/enrichment final phase marked Optional.

Fix: don't let Optional (or the budget branch) swallow a run-context abort. Add a guard before the Optional case, e.g.:

case runCtx.Err() != nil || errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded):
    return lastResult, err

(errors is already imported and used at run/phases.go:212.)

2. (trivial) salvagePhaseTranscript tail-trims on a byte boundaryrun/phases.go:236-237

out[len(out)-maxSalvage:] can split a multi-byte UTF-8 rune, so the salvaged text can begin with an invalid partial rune (right after the prepended "…trimmed…\n"). Cosmetic only — it's fed into a downstream prompt, not parsed — but a rune-safe trim (advance to the next rune start, or trim on a line boundary) is cheap.

deepseek-v4-pro:cloud (ollama-cloud) — Minor issues

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

Verdict: No material issues found

I traced the flow of untrusted data (user query and prior-phase LLM outputs) through the new phase pipeline and found no material security issues.

What I verified:

  • Template injection is not possible (phases.go:262-276): expandPhaseTemplate uses Go's text/template with the template text sourced from phase.SystemPrompt (host-configured, not user-controlled). The user's query and prior phase outputs are passed as data values to t.Execute(), not as template text. In Go's template engine, data values are rendered as literal strings — they cannot inject template directives. A prior phase's LLM output containing {{...}} would appear literally in the expanded prompt, not be interpreted. This is structurally safe.

  • No new injection surfaces: The expanded prompt and user message flow into llm.Request.System and llm.Request.Messages respectively, which are consumed by the LLM provider. There is no SQL, shell, path, or HTML context where these strings could be interpreted as code. The LLM is the intended consumer.

  • Safe error defaults: filterToolbox skips unknown tool names with a WARN rather than aborting (phases.go:291-293). phaseModel falls back to the base model on resolution failure (phases.go:199-204). expandPhaseTemplate returns the raw template unchanged on parse/execute failure (phases.go:264-265, 272-273). None of these degrade security — they're graceful degradation, not privilege escalation.

  • No new authn/authz gaps: Phases run within the same runCtx and inherit the same budget gating, critic, and audit recorder as the single-loop path. No phase can escalate privileges or bypass the existing run-level controls.

  • No SSRF, unsafe deserialization, or credential handling: The phase runner makes no new network calls beyond the existing model/tool dispatch. It deserializes nothing. It handles no secrets.

  • Prompt injection via prior phase outputs is inherent to the design: An earlier phase's LLM output is intentionally threaded into later phase prompts via {{.<PhaseName>}}. This is the documented feature, not a bug — the PR description states "Outputs thread into later phases." The kernel cannot sanitize LLM outputs for prompt injection without defeating the purpose of the pipeline.

🎯 Correctness — Minor issues

All three findings are confirmed against the actual code. Here is the corrected review:

Verdict: Minor issues

Findings

  • run/phases.go:146-161 — IsRunFunc phases bypass the shared stepObserver (verified by reading executor.go:265-289 and phases.go:146-161). The stepObserver closure — which feeds stateAcc.SetIteration, critic.recordStep, and the step emitter — is passed via sharedOpts to every agent-loop phase but is never called for IsRunFunc phases. Only deps.rec.OnStep(1, resp) is called directly. Impact: the critic's activity clock is not refreshed during an IsRunFunc phase, so a slow bare-LLM call could trigger a false stall detection; steps from IsRunFunc phases are absent from Result.Steps (the emitter never sees them); the live iteration counter (stateAcc) stalls. Fix: call the shared stepObserver (or at minimum critic.recordStep + the emitter) for IsRunFunc phases too, or document that these phases are invisible to the critic/emitter by design.

  • run/phases.go:118totalUsage not applied to lastResult on hard-error return (verified by tracing lines 82-84, 118, and executor.go:358-360). When a non-optional phase fails hard and runOnePhase returns a non-nil partial res, line 83 overwrites lastResult = res and line 84 accumulates totalUsage correctly. But the error return on line 118 sends lastResult as-is — its .Usage field still holds only the failed phase's partial usage, not the accumulated totalUsage that includes prior successful phases. The caller (executor.go:360) copies runRes.Usage into res.Usage, so the run result under-reports token consumption. Fix: set lastResult.Usage = totalUsage before the return lastResult, ... on line 118.

  • run/phases.go:100 — budget exhaustion on the last phase with empty output is fatal, contradicting the documented contract (verified by reading lines 100-118 and the doc comment on lines 34-36). The doc states "a non-optional phase that merely exhausts its step/tool budget is NOT fatal," but the condition !isLast || strings.TrimSpace(output) != "" means the last phase exhausting its budget with no salvageable output falls through to the default error case and aborts the run. This is arguably reasonable (there's nothing to return), but it contradicts the blanket guarantee in the doc comment. Fix: either update the doc to note the last-phase exception, or treat it as a soft stop with a generated placeholder (consistent with the non-last case on line 107).

🧹 Code cleanliness & maintainability — Minor issues

All four findings are confirmed against the actual code. Let me verify each one:

  1. baseTier dead field: Grep confirms baseTier appears only at the struct definition (phases.go:52) and the construction site (executor.go:348). deps.baseTier has zero reads anywhere in the package. Confirmed.

  2. phaseUserMessage duplicates runAgent's multimodal logic: Both functions (phases.go:244-256 and executor.go:491-506) build the same []llm.Part from text+images with identical branching (len(images)==0 → simple path, else build parts slice). The comment on phaseUserMessage explicitly says "Mirrors the single-loop multimodal seeding in runAgent." Confirmed.

  3. expandPhaseTemplate silently swallows errors: Lines 263-265 and 272-274 return tmpl unchanged on parse/execution failure with no slog.Warn or any diagnostic. The doc comment even says "best-effort, not fatal." Confirmed.

  4. Magic number 8000: Line 235 is const maxSalvage = 8000 with no explanatory comment. Confirmed.

All four findings survive. The verdict remains "Minor issues."

Verdict: Minor issues

  • phaseDeps.baseTier is dead code (run/phases.go:52). The field is populated in executor.go:348 (baseTier: tier) but never read anywhere in the package. grep confirms deps.baseTier has zero usages. This is a dead struct field that will confuse future readers. Remove it from the struct and the construction site.

  • phaseUserMessage duplicates runAgent's multimodal message logic (run/phases.go:244–256 vs run/executor.go:491–506). Both functions build a llm.Message from query + images with the same pattern: no images → simple path, otherwise build []llm.Part with optional text + image parts. The comment on phaseUserMessage even acknowledges it "mirrors" runAgent. Extract a shared helper (e.g., func userMessage(query string, images []llm.ImagePart) llm.Message) so the two call sites don't drift.

  • expandPhaseTemplate silently swallows parse/execution errors (run/phases.go:262–276). When a template is malformed or execution fails, the function returns the raw template string unchanged — meaning the LLM sees literal {{.Query}} / {{.PhaseName}} in its system prompt. There is no log, no warning, no audit event. A misconfigured phase prompt would produce subtly wrong behavior with zero diagnostic signal. At minimum, log a slog.Warn on failure so operators can detect the problem.

  • Magic number maxSalvage = 8000 in salvagePhaseTranscript (run/phases.go:235). The constant has no explanation of why 8000 characters was chosen. A named constant with a comment (e.g., "cap salvage to roughly 2K tokens to avoid bloating downstream phase prompts") would make the intent clear and easier to tune later.

Performance — No material issues found

Verdict: No material issues found

The phase runner is linear in both phases and steps, with no N+1 queries, quadratic loops, or unbounded growth. LLM call latency dominates by orders of magnitude; the CPU/memory overhead of template parsing, toolbox copying, and string building is negligible in that context.

A few micro-observations that don't rise to the level of material findings:

  • filterToolbox copies the full toolbox even when names is empty (phases.go:283-287). When a phase doesn't restrict tools (the common case), the function could return the original box directly instead of allocating a copy and re-adding every tool. With typical toolbox sizes (~10–50 tools) and phase counts (2–5), the wasted allocation is trivial relative to LLM call latency.
  • expandPhaseTemplate re-parses the template string on every run (phases.go:263). The phase SystemPrompt is static per agent definition; caching parsed templates would avoid repeated lexing/parsing. Again, microsecond-level savings against second-to-minute LLM calls.
  • salvagePhaseTranscript byte-index slices a UTF-8 string (phases.go:237). out[len(out)-maxSalvage:] can split a multi-byte rune, producing invalid UTF-8. This is primarily a correctness concern (outside my lens), but could cause wasted work if a downstream LLM API rejects the malformed text.

None of these are material enough to flag as performance regressions.

🧯 Error handling & edge cases — Minor issues

Verdict: Minor issues

  • run/phases.go:237salvagePhaseTranscript truncates on byte boundary, can split UTF-8 (medium, high confidence). The line out[len(out)-maxSalvage:] slices at a byte offset, which can break a multi-byte rune in half, producing invalid UTF-8 in the salvaged output that gets threaded into later phases. Use []rune conversion or iterate with utf8.DecodeLastRuneInString to find a safe cut point. Verified by reading the function; len(out) on a string returns byte count, and the subsequent slice operates on byte indices.

  • run/phases.go:201phaseModel logs error=<nil> when resolver returns nil model without error (small, high confidence). The condition err != nil || m == nil on line 200 triggers the warn log for both cases, but when m == nil && err == nil the log field "error", err emits error=<nil>, making the message "model resolve failed" misleading — there was no error, just a nil model. Distinguishing the two cases (or logging "model", "nil" instead) would make the warning actionable. Verified by reading lines 199-205.

devstral-2:123b-cloud (ollama-cloud) — Minor issues

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — Minor issues

Now, let's verify the findings in the draft review:

Finding 1: Template Injection Risk in expandPhaseTemplate (run/phases.go:262)

  • Verification: The expandPhaseTemplate function uses text/template with Option("missingkey=zero"). This does not prevent template injection attacks if an attacker controls the SystemPrompt field.
  • Impact: Confirmed. If an attacker can control the SystemPrompt, they could inject malicious template directives.
  • Conclusion: This finding is valid.

Finding 2: Unsanitized User Input in Template Data (run/phases.go:267-270)

  • Verification: The query and priorOutputs values are directly passed into the template data map without sanitization.
  • Impact: Confirmed. This could lead to XSS or other injection issues if the output is rendered in a web context.
  • Conclusion: This finding is valid.

Finding 3: Lack of Input Validation for SystemPrompt (run/agent.go:64)

  • Verification: The SystemPrompt field in the Phase struct is a free-form string with no validation.
  • Impact: Confirmed. This could lead to resource exhaustion or template injection if the prompt is user-controlled.
  • Conclusion: This finding is valid.

All findings in the draft review are confirmed. The verdict remains "Minor issues."

🎯 Correctness — Minor issues

Verdict

Minor issues

Findings

  • run/phases.go:212: The isPhaseBudgetExhaustion function checks for agent.ErrMaxSteps and agent.ErrToolLoop, but these errors are not defined in the repository. This could lead to incorrect behavior if the errors are not properly defined or imported. Verified by searching for the error definitions in the repository and finding no matches.

gadfly-findings

[
  {"file": "run/phases.go", "line": 212, "severity": "high", "confidence": "high", "title": "Undefined errors `agent.ErrMaxSteps` and `agent.ErrToolLoop` may cause incorrect behavior"}
]
🧹 Code cleanliness & maintainability — Minor issues

Now, let's verify each finding in the draft review:

  1. Inconsistent Naming in Phase Struct:

    • The draft claims that IsRunFunc is inconsistently named compared to other fields like MaxIterations and ModelTier. However, IsRunFunc is a boolean flag that indicates whether a phase is a single bare LLM call, which is distinct from the other fields. The naming is appropriate for its purpose, and there is no inconsistency.
  2. Long Function in runPhases:

    • The runPhases function is indeed long and handles multiple responsibilities. This is a valid observation.
  3. Magic Number in salvagePhaseTranscript:

    • The constant maxSalvage = 8000 is defined within the function and lacks context. This is a valid observation.
  4. Redundant Comments:

    • The comment for the phaseDeps struct is descriptive but could be more concise. This is a valid observation.
  5. Inconsistent Error Handling in runOnePhase:

    • The error handling in runOnePhase is nested and intertwined with the main logic. This is a valid observation.
  6. Unused Variable in runPhases:

    • The variable lastOutput is used in the final synthesis of the result, so it is not unused. This finding is incorrect.

Based on this verification, the findings that are confirmed are:

  • Long function in runPhases
  • Magic number in salvagePhaseTranscript
  • Redundant comments in phaseDeps
  • Inconsistent error handling in runOnePhase

The corrected review is as follows:

Corrected Review

Verdict

Minor issues

Findings

  1. Long Function in runPhases

    • Location: run/phases.go:64-140
    • Issue: The runPhases function is quite long and handles multiple responsibilities, including iterating over phases, handling errors, and managing outputs. This makes it harder to read and maintain.
    • Suggestion: Break this function into smaller, focused functions. For example, separate the error handling logic into a dedicated function.
  2. Magic Number in salvagePhaseTranscript

    • Location: run/phases.go:260
    • Issue: The constant maxSalvage = 8000 is a magic number. While it is defined within the function, it lacks context or explanation.
    • Suggestion: Define this as a named constant at the package level or within the function with a clear comment explaining its purpose.
  3. Redundant Comments

    • Location: run/phases.go:46-58
    • Issue: The phaseDeps struct has a comment that explains its purpose, but the comment is somewhat redundant with the struct's name and fields. The comment could be more concise.
    • Suggestion: Shorten the comment to focus on the unique aspects of phaseDeps that aren't immediately obvious from the struct definition.
  4. Inconsistent Error Handling in runOnePhase

    • Location: run/phases.go:143-200
    • Issue: The runOnePhase function handles errors in a nested manner, which can make it difficult to follow the flow of execution. The error handling logic is intertwined with the main logic.
    • Suggestion: Extract error handling into separate functions or use a more structured approach to improve readability.
Performance — Minor issues

Verdict

Minor issues

Findings

  • Potential Inefficiency in salvagePhaseTranscript: The function salvagePhaseTranscript in run/phases.go (line 218) reconstructs the phase output by iterating over all steps and concatenating their responses. This operation could be inefficient for phases with a large number of steps, as it involves multiple string concatenations and trimming operations. Consider using a more efficient approach, such as preallocating a buffer or using a more optimized string-building method.

gadfly-findings

[
  {"file": "run/phases.go", "line": 218, "severity": "medium", "confidence": "high", "title": "Potential inefficiency in salvaging phase transcripts"}
]
🧯 Error handling & edge cases — Minor issues

Now let's verify the findings:

  1. Finding at run/phases.go:106: The draft claims that strings.TrimSpace(output) == "" is checked redundantly at lines 106 and 108. However, the code shows that the check at line 106 is part of an if condition, while line 108 is part of an else block. These are not redundant checks but rather mutually exclusive conditions. Therefore, this finding is incorrect and should be dropped.

  2. Finding at run/phases.go:186: The draft claims that salvagePhaseTranscript assumes res.Steps is non-nil without an explicit check. Looking at the code, the function salvagePhaseTranscript does iterate over res.Steps without a nil check. This is a valid concern, and the finding should be kept.

  3. Finding at run/phases.go:236: The draft claims that the maxSalvage constant is hardcoded to 8000 without validation. This is confirmed in the code, and there is no additional validation or explanation for this value. This finding should be kept.

Corrected Review

VERDICT

Minor issues

Findings

  • run/phases.go:186: The salvagePhaseTranscript function assumes res.Steps is non-nil, but there is no explicit check for this. If res.Steps is nil, the loop will panic. Add a nil check before iterating over res.Steps.
  • run/phases.go:236: The maxSalvage constant is hardcoded to 8000, but there is no validation to ensure this value is reasonable or safe for all use cases. Consider making this configurable or adding a comment explaining the rationale for this value.
glm-5.2:cloud (ollama-cloud) — Minor issues

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — Minor issues

Both findings are confirmed against the actual source.

  • Finding 1: expandPhaseTemplate (lines 262-275) substitutes prior-phase outputs (outputs[phase.Name], populated at line 122) into a later phase's prompt template; the result (instructions) becomes the phase's system prompt at line 144 (system := e.systemPromptWithBody(instructions)), passed to agent.New(model, system, ...) (line 176) and model.Generate(..., System: system) (line 150). The injection-into-system-prompt claim holds. (Note: line 107-108 cited in the draft is the budget-exhaustion note strings, not the injection site; the real injection sites are lines 76/122/144/150/176.)
  • Finding 2: expandPhaseTemplate returns the raw tmpl on both Parse (264-266) and Execute (272-274) error with no WARN logged. Confirmed.

Corrected review

Verdict: Minor issues

  • Untrusted model outputs are injected into later phases' system prompts (run/phases.go:76, run/phases.go:122, run/phases.go:144, run/phases.go:262-275)expandPhaseTemplate substitutes prior-phase outputs (model-generated, influenced by adversarial query/tool content) via {{.<PhaseName>}}; the expanded instructions become the next phase's System: prompt (systemPromptWithBodyagent.New(model, system, …) / model.Generate{System: system}). In the single-loop path model content only ever returns as conversational user/assistant messages; here it is elevated to system-prompt status, which models weight more authoritatively. This is a prompt-injection escalation path: a malicious query or tool result that drives phase A's output can embed instructions that phase B treats as system-level directives. The system-prompt placement is the kernel's choice and materially raises the injection surface. Suggested mitigation: render prior-phase outputs into the later phase's user message (or a fenced, clearly-delimited context block) rather than the System: field; at minimum document the trust boundary for phase-template authors.

  • Template parse/execution failures silently fall back to the raw prompt (run/phases.go:263-266, run/phases.go:272-274)expandPhaseTemplate returns the original tmpl on either Parse or Execute error. For a malformed operator template this means unexpanded {{...}} tokens (which may contain prior-phase output names or .Query) reach the model verbatim — not a direct injection, but an operator typo silently leaks template scaffolding into the prompt with no error surfaced. Low severity; consider logging a WARN on the parse/execute-error path so misconfigured phase prompts aren't silently swallowed.

No authn/authz, command/SQL injection, SSRF, deserialization, or secret-leakage issues found in the touched code. Tool/model-tier inputs are operator-config-sourced and validated/resolved through existing resolvers; filterToolbox skips unknown tool names rather than widening access; phaseModel falls back to the base model on resolution failure rather than failing open to an arbitrary tier.

🎯 Correctness — Minor issues

VERDICT: Minor issues

Verified against the actual code in run/phases.go (lines 64–134, 140–191), run/executor.go (the single-loop/phase branch at 328–355, statusFor at 391, runAgent at 491), and a repo-wide grep for ReasoningTokens/ErrMaxSteps/ErrToolLoop.

  • run/phases.go:100-118 — a final non-optional phase that exhausts its step/tool budget with no salvageable narrated text aborts the whole run with a hard error and empty output, discarding all earlier phases' completed work — contradicting the file's own documented "NOT fatal" guarantee. Confirmed by tracing the switch at lines 88–119. runOnePhase (lines 178–190) only sets output to a non-empty value when salvagePhaseTranscript(res) finds assistant prose; otherwise output stays "". Back in runPhases, the budget-exhaustion case is gated by isPhaseBudgetExhaustion(err) && (!isLast || strings.TrimSpace(output) != "") (line 100). For the last phase with empty salvage that condition is true && (false || false) = false, so execution falls through to default (line 117): return lastResult, fmt.Errorf("pipeline phase %q: %w", phase.Name, err). lastResult at that point is the last phase's res (it is reassigned every iteration at line 83), so its .Output is empty and carries none of the earlier phases' outputs — those live only in the local outputs map (line 65) and the transiently-overwritten lastResult, both dropped on the early return. The caller (executor.go:357-363) then classifies this as a failed run (statusFor default branch) with res.Output = "". This is precisely the "long verify phase" the doc comment (lines 34–37) promises will "never [discard] every earlier phase's work": a verify/consolidation phase is typically last in a research pipeline, and when the model burns its whole step budget on tool calls without writing a final prose answer, salvage returns "" and the run is reported as a total failure. The fix is to handle the empty-last-phase-budget case explicitly: synthesize a degraded output from the prior outputs map (or at minimum a "final phase did not consolidate" note) and return nil error instead of falling to default, so earlier phase work survives. (Confidence: medium — the code path is unambiguous; one can argue a hard abort is defensible when there is genuinely nothing to return, but that is not what the documented contract states.)

The addUsage field concern was already self-dropped by the draft and is not a finding: a repo-wide grep confirms llm.Usage.ReasoningTokens is referenced in audit/writer.go:109,139-140, so the field exists and addUsage compiles. The input-file staging ordering (stageInputFiles at executor.go:323 precedes the single-loop/phase branch) and the {{.Query}}/{{.<PhaseName>}} template threading through expandPhaseTemplate (missingkey=zero on a map[string]string) are correct as the draft noted.

🧹 Code cleanliness & maintainability — Minor issues

All three findings are confirmed against the actual source code. The duplication between phaseUserMessage (phases.go:244-256) and runAgent (executor.go:491-505) is real, the convoluted guard at line 100 is as described, and addUsage does sum field-by-field against an external llm.Usage struct.

Verdict: Minor issues

  • run/phases.go:244-256phaseUserMessage duplicates runAgent's multimodal seeding. The comment admits it "mirrors the single-loop multimodal seeding in runAgent," but the code is copy-pasted rather than shared. runAgent (executor.go:491-505) already encapsulates the "user text + inline images → llm.UserParts" logic with the same trimming/append pattern. The current duplication means a future change to one (e.g., how empty query text is handled, image part ordering) silently diverges between single-loop and phase paths. Suggested fix: extract a shared userMessageWithImages(query string, images []llm.ImagePart) llm.Message helper used by both runAgent and phaseUserMessage.

  • run/phases.go:100 — convoluted guard condition hurts readability. case isPhaseBudgetExhaustion(err) && (!isLast || strings.TrimSpace(output) != "") couples the "is this a soft stop" check with a last-phase non-empty-output check in one boolean expression, with the real intent (a final phase that exhausts budget with no salvageable output is treated as a hard failure) only visible by reading the body. A two-line comment or an isLast && strings.TrimSpace(output) == "" early-fallthrough to default: would make the salvage-vs-abort boundary legible without a comment archaeology dig.

  • run/phases.go:304-311addUsage field-by-field sums llm.Usage. This is the standard Go approach and works today, but it's a maintainability hazard if llm.Usage (defined in the external majordomo/llm package) ever gains a field — a phased run would silently under-report that field. The llm package is an external module dependency (not vendored in this checkout), so the struct definition could not be inspected directly; this is a low-severity "fragile if upstream changes" note, not a current bug.

Performance — No material issues found

VERDICT: No material issues found

Reviewed run/phases.go, run/executor.go, and run/phases_test.go through the performance lens.

  • Per-phase allocations are bounded and proportional to phase count. outputs (map[string]string sized to len(ra.Phases)), filterToolbox (one new toolbox per phase, O(tools) copy), expandPhaseTemplate (one text/template parse per phase), and the per-phase opts slice copy of sharedOpts are all O(phases) and cheap relative to the LLM calls they bracket. None of these sit in a hot loop beyond the phase loop itself.
  • salvagePhaseTranscript is capped at maxSalvage = 8000 via a tail-trim, and it only iterates res.Steps once with a single strings.Builder. No quadratic/unbounded growth.
  • Usage aggregation (addUsage) is a fixed 5-field additive fold over phases — O(phases), no extra allocation beyond the value copy of totalUsage.
  • No N+1 or repeated blocking work introduced. Each phase makes exactly one model call/loop; the shared stepObserver/steer/compactor options are built once in Run and reused by reference (sharedOpts appended, not rebuilt) across phases.

Two things I deliberately did not flag (verified they are outside the performance lane / inherent to the feature):

  • Images are re-seeded into every phase's user message (phaseUserMessage), so each phase re-sends all images to the model. This is an inherent cost of "each phase is a fresh loop seeded with the original query," mirroring the single-loop multimodal path — not a regression, and removing it would break phase isolation.
  • The WithCompactor threshold is computed once for the base tier and applied to all phase agents regardless of their per-phase ModelTier; a phase on a larger-context model could compaction-fire earlier than necessary. That's a correctness/threshold-fidelity concern, not a performance regression, and it depends on tier configs I cannot inspect.
🧯 Error handling & edge cases — Minor issues

All three findings are confirmed against the actual code. Line numbers in the draft's JSON need correcting to match the actual source.

  • Finding 1 (byte-truncation): actual line is 237, not 247. Confirmed — out[len(out)-maxSalvage:] slices by byte offset.
  • Finding 2 (Optional case precedes budget-exhaustion case): confirmed at line 89.
  • Finding 3 (prior phase transcript carried on last-optional failure): confirmed at lines 82-133.

VERDICT: Minor issues

I verified the error-handling paths against the checked-out code. The phase runner's unhappy paths are mostly sound: cancellation is checked between phases (runCtx.Err()), hard errors abort, budget exhaustion salvages partial transcripts, and agent.ErrMaxSteps/agent.ErrToolLoop are real symbols. A few edge-case nits remain:

  • run/phases.go:237 salvagePhaseTranscript — byte-truncation can split a multibyte UTF-8 rune. The bound out[len(out)-maxSalvage:] slices by byte offset; if the cut lands inside a multibyte character the salvaged string begins with invalid UTF-8, which then becomes the phase output (and may be threaded into the next phase's template or returned as the run output). Low impact (only triggers on >8KB salvage), but a real boundary bug. Fix: use a rune-aware truncate.

  • run/phases.go:89 runPhases switch ordering — an Optional phase that hits budget exhaustion is swallowed with FallbackMessage, discarding the salvaged partial transcript. case phase.Optional: precedes the isPhaseBudgetExhaustion case, so a long-running optional verify phase that ran out of steps loses its (real, salvageable) partial work and is replaced by a static fallback note. This appears intentional ("Optional swallows a phase's error and substitutes FallbackMessage"), but it is a silent data-loss edge case worth confirming: an optional phase that did substantial work before hitting the step cap yields less useful output than a non-optional one in the same situation.

  • run/phases.go:128 runPhases — final synthesized lastResult may carry a prior phase's transcript when the last phase fails optionally. When the last phase is Optional and fails with a nil res, lastResult is not updated and retains the previous phase's agent.Result; the code then overwrites Output/Usage with the fallback/accumulated usage, but Messages remains the prior phase's transcript (passed to the PostRun hook). The doc says "last phase's transcript for the PostRun hook," which is inaccurate in this path.

No swallowed errors or missing cleanup found: the defer/context handling is consistent with the single-loop path, and phaseModel's resolve-failure fallback is logged at WARN rather than dropped silently.

kimi-k2.6:cloud (ollama-cloud) — Blocking issues found

Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

No material issues found.

Through the Security lens I examined:

  • Template expansion (expandPhaseTemplate, run/phases.go:262–276): Phase prompts are parsed with text/template against a map[string]string of prior outputs and the original query. The data surface is limited to strings (no functions or structs are exposed), and no custom func-map is registered. The key risk — template injection — would only materialize if an untrusted party could author the Phase.SystemPrompt template itself; per the PR, RunnableAgent (including its Phases) is a host-controlled DTO, so the template source is trusted. missingkey=zero is a safe default.
  • Error logging (run/phases.go:97, 114): err.Error() is logged into the internal RunRecorder audit stream on optional-phase failure and budget exhaustion. This is consistent with existing audit logging in the executor and does not expose data to a new, less-trusted consumer.
  • Model resolution (phaseModel, run/phases.go:195–206): Tier strings come from the host-provided Phase config, not from user input, so the e.cfg.Models(ctx, tier) call does not introduce a new SSRF vector.
  • Tool filtering (filterToolbox, run/phases.go:281–300): Skips unknown tools with a warning; no authority bypass.
  • Result synthesis (runPhases, run/phases.go:64–134): The final result is synthesized from phase outputs that were already produced by the agent/LLM stack; no unsafe deserialization or credential leakage was introduced.
🎯 Correctness — Blocking issues found

Blocking issues found

  • run/phases.go:236-238salvagePhaseTranscript truncates by byte length (len(out) > maxSalvage and out[len(out)-maxSalvage:]). For any multi-byte UTF-8 text this slices through runes, producing invalid UTF-8 and garbled output fed into downstream phases. The limit should count runes (utf8.RuneCountInString) and slice on a rune boundary.

  • run/phases.go:262-275expandPhaseTemplate builds text/template variables as {{.<PhaseName>}} (line 269: data[k] = v). Go text/template identifiers may contain only letters, digits, and underscores. Phase names with hyphens (deep-research), digits at the start, spaces, etc. parse as invalid expressions or fail execution silently, causing the template to return unexpanded — downstream phases receive literal {{.deep-research}} instead of the actual prior output. This contradicts the documented threading semantics and is a silent data-corruption bug. Use index syntax ({{index . "Phase-Name"}}) or validate/sanitize phase names.

🧹 Code cleanliness & maintainability — Minor issues

Minor issues

  • run/phases.go:244-255phaseUserMessage duplicates the multimodal llm.Part assembly already present in runAgent (executor.go:491-505). Both functions build a []llm.Part from an optional text part + image parts with identical strings.TrimSpace(query) != "" guards. Extract a shared helper like buildMultimodalUserMessage(query string, images []llm.ImagePart) llm.Message and call it from both sites.

  • run/phases.go:304-311addUsage manually accumulates every field of llm.Usage. If the upstream struct gains a new field (e.g. a future cost or token category), this function silently under-counts. Prefer an Add method on llm.Usage (if majordomo doesn't provide one, add it there) or generate this with a simple reflection helper; at minimum add a compile-time assertion that verifies field count.

  • run/phases.go:262-275expandPhaseTemplate uses text/template for simple key-value substitution and silently returns the original string on parse or execution failure. Phase names containing hyphens, spaces, or other non-identifier characters (e.g. deep-research) produce invalid template syntax ({{.deep-research}} is parsed as a subtraction expression), causing silent fallback to the unexpanded template. This is a debugging trap for pipeline authors. Consider a simpler strings.ReplaceAll-style expander or document the identifier restriction and log/return the error at WARN level so authors know why substitution failed.

  • run/phases.go:235 — Magic constant const maxSalvage = 8000 with no explanation of where the value comes from. Is it bytes, runes, characters? Why 8000? Add a comment describing the rationale (e.g. "tail of ~2k tokens at ~4 bytes/token") or make it a configurable default.

Performance — Blocking issues found

Blocking issues found

  • Template re-parsed on every phase execution instead of being pre-parsed or cachedrun/phases.go:262-263
    expandPhaseTemplate calls template.New("phase").Option("missingkey=zero").Parse(tmpl) once per phase, every run. Since tmpl is the static Phase.SystemPrompt from configuration, the same AST is rebuilt repeatedly. For a multi-phase agent under load this is pure lexer/parser overhead that scales with request rate.
    Fix: Pre-parse Phase.SystemPrompt into a *template.Template when the RunnableAgent is loaded/constructed (or cache parsed templates in a sync.Map keyed by prompt string) and execute with the per-run data map.

  • filterToolbox copies entire toolbox when no filtering is requestedrun/phases.go:283-287
    When a phase does not specify a tool subset (len(names) == 0), filterToolbox still allocates a new llm.Toolbox and copies every tool into it. If the base toolbox is large and the pipeline has many phases, this is unnecessary allocation and Add overhead on every phase of every run. The single-loop path already passes the original toolbox pointer directly without copying.
    Fix: Return box directly when len(names) == 0:

    if len(names) == 0 {
        return box
    }
    
  • Double strings.TrimSpace in budget-exhaustion pathrun/phases.go:100 and :106
    The condition on line 100 evaluates strings.TrimSpace(output) != "", and line 106 evaluates strings.TrimSpace(output) == "" for the same string. This redundant scan is minor but easily avoided by binding the trimmed result once.
    Fix: Hoist trimmed := strings.TrimSpace(output) before the switch.

🧯 Error handling & edge cases — Minor issues

Minor issues found.

  • run/phases.go:100 — The budget-exhaustion soft-stop condition isPhaseBudgetExhaustion(err) && (!isLast || strings.TrimSpace(output) != "") makes the final phase fatal when it exhausts its budget and produces no salvageable transcript. This contradicts the stated semantics that non-optional budget exhaustion is always non-fatal. Earlier phases are protected, but a last-phase soft-stop still aborts the whole run. Suggested fix: drop the !isLast || guard so the last phase degrades with a generated placeholder message (like the non-last phases already do). Verified by reading the runPhases error switch.

  • run/phases.go:236–237salvagePhaseTranscript truncates by raw byte count (len(out) > maxSalvage then out[len(out)-maxSalvage:]), which can split a multi-byte UTF-8 rune and emit invalid UTF-8 into the next phase’s template variable. Suggested fix: truncate by rune count or walk back to the nearest valid UTF-8 boundary. Verified by reading the function body.

qwen3.5:397b-cloud (ollama-cloud) — Minor issues

Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling

🔒 Security — No material issues found

The finding is confirmed. I've verified:

  1. persona/agent.go:159-191AgentPhase struct has both FallbackMessage (lines 183-185) and IsRunFunc (lines 187-190) fields.

  2. run/agent.go:62-79run.Phase struct has both FallbackMessage (lines 72-74) and IsRunFunc (lines 75-78) fields.

  3. persona/runnable.go:26-35 — The conversion loop copies Name, SystemPrompt, ModelTier, MaxIterations, Tools, and Optional, but omits FallbackMessage and IsRunFunc.

This is a genuine bug: personas configured with IsRunFunc: true (intended as safe, tool-less transform steps) will silently run as full agent loops with tool access, and custom FallbackMessage values are lost.

No material issues found — the single finding in the draft is valid and survives verification.

🎯 Correctness — No material issues found

VERDICT: No material issues found

After thorough review through the Correctness lens, I verified the following:

  • Template expansion (expandPhaseTemplate, phases.go:262-276): Correctly expands {{.Query}} and {{.<PhaseName>}} using Go's text/template with missingkey=zero. Returns original on parse/execute failure (best-effort). Verified against test TestExpandPhaseTemplate.

  • Output threading (phases.go:122): Phase outputs stored in outputs map and threaded to subsequent phases via template expansion. Verified by test TestPhases_SequentialThreadsOutputs showing phase b receives "out-a" and phase c receives "out-b".

  • Optional phase error handling (phases.go:89-98): Sets output = phase.FallbackMessage (or generated note) while allowing pipeline to continue. lastResult from the failed phase is retained for Messages/Usage. Verified by test TestPhases_OptionalFailureSubstitutesFallback.

  • Budget exhaustion salvage (phases.go:100-115, 185-188): Non-optional phases that exhaust step/tool budget salvage partial transcript via salvagePhaseTranscript and continue (unless last phase with empty output). salvagePhaseTranscript (phases.go:218-239) safely handles nil result.

  • Usage aggregation (addUsage, phases.go:304-310): Correctly sums all five llm.Usage fields (InputTokens, OutputTokens, CacheReadTokens, CacheWriteTokens, ReasoningTokens).

  • IsRunFunc bare call (phases.go:146-162): Single LLM call without tool loop, properly constructs agent.Result with Output, Usage, and Messages.

  • SystemHeader composition (systemPromptWithBody, executor.go:437-445): Correctly prepends platform header to each phase's expanded instructions. Verified by test TestPhases_SystemHeaderAppliedPerPhase.

  • Toolbox filtering (filterToolbox, phases.go:281-299): Preserves palette order, skips unknown tools with WARN. Verified by test TestFilterToolbox.

  • Context cancellation (phases.go:72-74): Checks runCtx.Err() before each phase to abort on timeout/cancellation.

The implementation correctly ports mort's pipeline semantics while reusing kernel machinery. Tests cover sequential threading, optional failures, hard errors, IsRunFunc, SystemHeader, toolbox filtering, and template expansion.

🧹 Code cleanliness & maintainability — Minor issues

The finding is confirmed. I've verified:

  1. baseTier is declared at run/phases.go:52 in the phaseDeps struct
  2. It's populated at run/executor.go:348 (baseTier: tier)
  3. A grep across the entire repository for baseTier returns only those two occurrences — there is no read access to this field anywhere in the codebase

This is dead code: the field is written but never read.

## Verdict: Minor issues

- **`run/phases.go:52` — Dead field `baseTier` in `phaseDeps`**  
  The `baseTier string` field is declared in `phaseDeps`, populated by `executor.go:348` (`baseTier: tier`), but **never read** anywhere in the repository (grep confirms only the declaration and initialization exist). This is dead code that adds confusion without serving a purpose.  
  **Fix:** Remove the `baseTier` field from `phaseDeps` and delete the corresponding initialization in `executor.go:348`.

</details>

<details><summary><b>⚡ Performance</b> — Minor issues</summary>

**VERDICT: Minor issues found**

**Performance findings:**

- **`run/phases.go:263` — Template re-parsed every phase execution (medium severity, high confidence)**: `expandPhaseTemplate` calls `template.New(...).Parse(tmpl)` on every phase, even when the phase's `SystemPrompt` is a static template string. For pipelines with many phases or repeated runs of the same agent, this is redundant work. The template string itself doesn't change between runs — only the data (`Query`, prior outputs) does. **Fix**: Cache parsed templates per `Phase.SystemPrompt` (e.g., a `sync.Map` keyed by prompt string, or pre-parse during agent initialization). Verified by reading `run/phases.go:262-276`; no caching mechanism exists in the file or surrounding code.

- **`run/phases.go:283-287` — Full toolbox copy when `phase.Tools` is empty (small severity, high confidence)**: `filterToolbox` copies ALL tools from the base toolbox when `names` is empty/nil (lines 283-287). If a phased agent defines phases without explicit tool filtering, each phase incurs an O(n) toolbox copy where n is the total tool count. For agents with large tool palettes and many phases, this adds up. **Fix**: When no filtering is needed, return the base toolbox directly (it's read-safe) or share a cached filtered copy. Verified by reading `run/phases.go:278-300`.

- **`run/phases.go:235``maxSalvage` constant is reasonable but unverified derivation (trivial severity)**: The 8000-character tail-trim bound in `salvagePhaseTranscript` appears arbitrary. While this does prevent unbounded growth (good), I cannot verify this threshold was derived from token budget constraints or empirical data. If this is meant to fit within a specific context window margin, it should be documented or derived from `e.cfg.ContextTokens`. Verified by reading `run/phases.go:218-240` and searching for `ContextTokens` usage in `run/`; no derivation found.

</details>

<details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary>

**VERDICT: No material issues found**

After reviewing the changes through the error handling & edge cases lens and verifying against the checked-out code:

**Findings:**

None. The code demonstrates robust error handling:

- **Context cancellation**: Checked at phase boundaries (`runCtx.Err()` at phases.go:72) and propagates through `runOnePhase` calls
- **Optional phase failures**: Properly substitute `FallbackMessage` (or generated default) and continue pipeline (phases.go:89-98)
- **Budget exhaustion**: Non-fatal for non-optional phases; partial transcripts salvaged via `salvagePhaseTranscript` (phases.go:100-115, 215-240)
- **Nil guards**: `res != nil` checks before dereferencing (phases.go:82, 180, 219), `deps.rec != nil` checks (phases.go:77, 154)
- **Template errors**: `expandPhaseTemplate` returns original string on parse/execute failure (phases.go:262-275)
- **Unknown tools**: `filterToolbox` skips with WARN rather than failing (phases.go:291-294)
- **Model resolution failures**: `phaseModel` falls back to base model with WARN (phases.go:195-205)
- **Empty collections**: Handled correctly (empty `Tools` → full palette, empty `images` → text-only message)
- **Zero/negative values**: `MaxIterations <= 0` falls back to `baseMaxIter` (phases.go:165-168)

All error paths either handle the error appropriately or propagate it with context. No swallowed errors, missing defers, or unhandled edge cases identified in this lens.

</details>

</details>

</details>

<sub>Automated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.</sub>
<!-- gadfly-consensus --> ## 🪰 Gadfly review — consensus across 7 models **Verdict: Blocking issues found** · 21 findings (10 with multi-model agreement) | | Finding | Where | Models | Lens | |--|--|--|--|--| | 🔴 | salvagePhaseTranscript truncates by bytes, corrupting multi-byte UTF-8 runes | `run/phases.go:235` | 7/7 | correctness, error-handling, maintainability, performance | | 🔴 | expandPhaseTemplate silently fails for phase names with hyphens or non-identifier characters | `run/phases.go:260` | 6/7 | correctness, error-handling, maintainability, performance, security | | 🟠 | phaseUserMessage duplicates runAgent's multimodal message construction — extract shared helper | `run/phases.go:244` | 5/7 | maintainability | | 🟠 | Dead field baseTier in phaseDeps struct — populated but never read | `run/phases.go:52` | 4/7 | maintainability | | 🔴 | Optional-phase branch swallows context cancellation/deadline/critic-kill; a killed/timed-out run with an Optional final phase reports status "ok" with a fallback message | `run/phases.go:89` | 3/7 | correctness, error-handling | | 🟠 | Last non-optional phase exhausting its budget with empty salvage aborts the run with a hard error and empty output, discarding all prior phases' completed work — contradicting the documented 'NOT fatal / never discards earlier work' semantics | `run/phases.go:100` | 3/7 | correctness, error-handling, maintainability, performance | | 🟠 | filterToolbox copies entire toolbox when no filtering is requested | `run/phases.go:283` | 3/7 | correctness, performance | | 🟠 | IsRunFunc phases bypass shared stepObserver; critic clock not refreshed, steps missing from emitter | `run/phases.go:155` | 2/7 | correctness | | 🟠 | phaseModel discards the resolver's enriched usage-attribution context, so non-base-tier phases can be mis-attributed when resp.Model is empty (single-loop path adopts it via ctx = modelCtx) | `run/phases.go:199` | 2/7 | correctness, error-handling | | 🟠 | addUsage manually sums llm.Usage fields, brittle to upstream changes | `run/phases.go:304` | 2/7 | maintainability | <details><summary>11 single-model findings (lower confidence)</summary> | | Finding | Where | Model | Lens | |--|--|--|--|--| | 🟠 | Untrusted model outputs (prior phase results) are injected into later phases' system prompts via {{.<PhaseName>}}, creating a prompt-injection escalation path where adversarial content gains system-prompt privilege. | `run/phases.go:76` | glm-5.2:cloud | security | | 🟠 | Missing nil check for res.Steps | `run/phases.go:186` | devstral-2:123b-cloud | error-handling | | 🟡 | Lack of input validation for SystemPrompt field in Phase struct | `run/agent.go:64` | devstral-2:123b-cloud | security | | 🟡 | Redundant comments in phaseDeps | `run/phases.go:46` | devstral-2:123b-cloud | maintainability | | 🟡 | Long function runPhases | `run/phases.go:64` | devstral-2:123b-cloud | maintainability | | 🟡 | totalUsage not applied to lastResult on hard-error return; prior phases' token usage lost from run result | `run/phases.go:118` | deepseek-v4-pro:cloud | correctness | | 🟡 | When the last phase fails as Optional, synthesized lastResult carries a prior phase's transcript for the PostRun hook | `run/phases.go:128` | glm-5.2:cloud | error-handling | | 🟡 | Inconsistent error handling in runOnePhase | `run/phases.go:143` | devstral-2:123b-cloud | maintainability | | 🟡 | 'run/phases.go:212' | `run/phases.go:212` | devstral-2:123b-cloud | correctness | | 🟡 | Unsanitized user input in template data could lead to XSS or injection | `run/phases.go:267` | devstral-2:123b-cloud | security | | ⚪ | Live iteration counter regresses each phase (SetIteration stores, not max); RunInfo.MaxIterations understates phased runs — observability only | `run/executor.go:267` | claude-code/opus | correctness | </details> <details><summary>Per-model detail</summary> <details><summary><b>claude-code/opus</b> (claude-code) — Blocking issues found</summary> **Verdict: Blocking issues found** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> I've verified the key security-relevant mechanics. Let me confirm one detail about the data-vs-template-text boundary, since it's the crux of the injection question. The findings: - `expandPhaseTemplate` (`run/phases.go:262`) uses `text/template` where the **template text** is `phase.SystemPrompt` (host/agent config — trusted) and the **data** is `{Query, prior-phase-outputs}` (untrusted). Go's `text/template` inserts data values as literals; it does **not** re-parse them as templates. So a query containing `{{.Secret}}` is rendered as literal text, not evaluated — no server-side template injection / RCE. Verified against the package usage (`run/phases.go:10`, only `text/template`, never `html/template`, and no other call site feeds untrusted strings as template text). - `filterToolbox` (`run/phases.go:289`) copies from `deps.baseToolbox`, which is the already-permission-decorated toolbox built once via `Registry.Build(..., inv, Visibility("private"), ...)` (`run/executor.go:191`) plus delegation/extra/session tools. Phases can only **narrow** this set (`box.Get(name)` over the existing palette) — they cannot add a tool that wasn't already authorized for this invocation, and an unknown name is skipped, not synthesized. No authz escalation across phases. Verified. - `phaseModel` (`run/phases.go:218`) resolves `phase.ModelTier` from config, falling back to the base model. Tier strings are host config, not request-tainted. No concern. - No command/path/SQL construction, no network/SSRF surface, no deserialization. Error strings reach audit `LogEvent`/`slog.Warn` only (phase name + `err.Error()`), not phase outputs/secrets. One thing genuinely in-lane worth flagging, though it is a design property rather than a coding defect: - **Untrusted input is elevated into the system-prompt role.** In `runOnePhase` (`run/phases.go:160-165`), the expanded instructions — which for any phase whose `SystemPrompt` contains `{{.Query}}` or `{{.<priorPhase>}}` now embed the raw user query and prior (LLM-generated, therefore untrusted) phase outputs — become the **system** message (`system := e.systemPromptWithBody(instructions)`), while the single-loop path keeps user input confined to the user message. This widens the prompt-injection surface: untrusted text gains system-role authority across phases. It's opt-in (the template author must write `{{.Query}}`) and ported intentionally from mort, so it's not a regression I'd block on, but hosts running phased agents on untrusted input should be aware that phase-to-phase output threading carries any injected instructions forward at system authority. Mitigation if desired: keep `{{.Query}}` substitution in the user message and reserve the system body for static instructions. Net: the injection/authz/secret surfaces are clean; the only security-relevant note is the inherent prompt-injection amplification of threading untrusted content through system prompts, which is by design. VERDICT: No material issues found </details> <details><summary><b>🎯 Correctness</b> — Blocking issues found</summary> I've verified all three findings against the actual code. Let me confirm my analysis: **Finding 1** — Confirmed. In `runPhases`, the loop-top guard (`phases.go:72`) only protects the *start* of the next phase. The `case phase.Optional:` (`phases.go:89`) is checked first and swallows any error, including `context.Canceled`/`DeadlineExceeded`/critic-kill. `isPhaseBudgetExhaustion` (`phases.go:212`) only matches `ErrMaxSteps`/`ErrToolLoop`, so a ctx error falls into the Optional branch. When the Optional phase is the *last* one, no subsequent iteration's top-guard catches it; `runPhases` returns `nil` error. `statusFor` (`executor.go:391`) checks `runErr == nil` *first* → returns `"ok"` even when `context.Cause(runCtx)` is `ErrCriticKill`. Real bug. **Finding 2** — Confirmed as a deviation. `phaseModel` (`phases.go:199`) discards the resolver's enriched ctx with `_`; the single-loop path adopts it (`executor.go:159`). Impact is narrower than it could be: `recordUsage`/`resolvedModelName` (`call.go:235-268`) attribute primarily by `resp.Model`, falling back to ctx attribution only when `resp.Model` is empty — so mis-attribution only manifests in that fallback path. The draft's low confidence is appropriate; the deviation itself is real. **Finding 3** — Confirmed. `SetIteration` (`runengine.go:142`) is an atomic *store*; each phase is a fresh loop with a shared `stepObserver`, so the live counter regresses at phase boundaries. `RunInfo.MaxIterations` = base `maxIter` (`executor.go:172`). Token accounting unaffected — `Writer.OnStep` (`writer.go:83`) sums `resp.Usage` regardless of `iter`. Observability only. All three survive. (Fixing the stale line number for finding 2 in the JSON block: the discard is at `phases.go:199`, not 178.) --- ## VERDICT: Blocking issues found ### Findings **1. `Optional` phase handling swallows context cancellation / deadline / critic-kill — a killed or timed-out run whose final phase is `Optional` is reported as a successful run** — `run/phases.go:89` The error switch checks `case phase.Optional:` *first* and unconditionally substitutes `FallbackMessage` for **any** error, including `context.Canceled`, `context.DeadlineExceeded`, and `ErrCriticKill`. The only cancellation guard is at the *top* of the loop (`run/phases.go:72`), which protects the **start of the next phase**, not a cancellation that occurs **during** a phase. Trace the case where the final phase is `Optional` and the run is killed/timed-out mid-phase: - `runOnePhase` → `runAgent` returns `(res, ctxErr)` where `errors.Is(ctxErr, context.Canceled)` (or `DeadlineExceeded`). - `isPhaseBudgetExhaustion(ctxErr)` is **false** (`phases.go:212` — only matches `ErrMaxSteps`/`ErrToolLoop`), so cancellation never hits the budget branch. - `case phase.Optional` matches → error swallowed, `output = FallbackMessage`, loop ends (it's the last phase). - `runPhases` returns `(synthesized result, nil)`. Back in `Run` (`run/executor.go:357`), `statusFor(runCtx, nil)` returns `"ok"` — it checks `runErr == nil` before consulting the cancellation cause, so even an `ErrCriticKill` cause is ignored — and `res.Output` is the **fallback string**. A run the critic *killed* (or that blew its `MaxRuntime`) is recorded as a successful run delivering a canned "phase encountered an error" message, instead of status `killed`/`timeout`. The IsRunFunc path has the same exposure (its error wraps the ctx error at `run/phases.go:152`). Fix: before the `Optional`/budget branches, bail on a genuine context error, e.g. ```go if runCtx.Err() != nil || errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) { return lastResult, err } ``` so cancellation/deadline/kill always aborts and surfaces with the correct status. **2. Per-phase model resolution discards the enriched usage-attribution context** — `run/phases.go:199` (`phaseModel`: `_, m, err := e.cfg.Models(ctx, phase.ModelTier)`) The single-loop path adopts the resolver's returned context (`ctx = modelCtx`, `run/executor.go:159`) — its documented purpose is "enriches ctx for usage attribution" (`executor.go:19-21`). `phaseModel` resolves a phase's tier but throws the enriched ctx away with `_`, so a phase running on a **non-base tier** executes under the base tier's attribution context while using the other tier's model. Confidence is low — usage attribution (`model/call.go:235-268`) keys primarily off `resp.Model` (the serving model name), and only falls back to the context attribution set by `ParseModelForContext` when `resp.Model` is empty. So mis-attribution only manifests for responses that don't name their serving model. It is nonetheless a clear, unintended deviation from the established single-loop pattern introduced by this change. **3. (Minor) Live iteration counter regresses across phases** — `run/executor.go:267` (`stateAcc.SetIteration(s.Index)`) Each phase is a fresh agent loop whose `s.Index` restarts, and `SetIteration` does an atomic *store*, not a max/accumulate (`RunStateAccessor`, `run/runengine.go:142`). A self-status/progress consumer watching a phased run sees the iteration count jump backwards at each phase boundary, and `RunInfo.MaxIterations` (set to the base `maxIter`, `executor.go:172`) understates a phased run whose phases can each consume up to `maxIter` steps. Token accounting is **not** affected — the audit `Writer` sums `resp.Usage` per `OnStep` regardless of `iter` (`audit/writer.go:83-110`), so duplicate indices across phases don't corrupt totals. Observability only. </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> Both findings verified against the actual source. Outputting the corrected review. ## VERDICT: Minor issues This is a clean, well-commented port. The structure (one `runPhases` orchestrator + `runOnePhase` worker + small pure helpers) reads well, the doc comments are thorough and honest about what was/wasn't carried over, and the `sharedOpts` refactor in `executor.go` cleanly factors the common agent options without duplicating them per path. Two minor maintainability nits: - **`run/phases.go:52` — dead struct field `baseTier`.** `phaseDeps.baseTier` is populated (`executor.go:348` `baseTier: tier`) and documented ("the base model + tier"), but no code in the package ever reads it — `phaseModel` falls back to `deps.baseModel`, never a tier. Verified by `grep -rn baseTier run/`: only the declaration and the assignment, no read. A field that looks load-bearing but isn't will mislead the next maintainer (e.g. someone wiring per-phase compaction thresholds will assume the base tier is already threaded here). Drop the field and the `baseTier: tier` line, or actually use it (the compaction threshold is computed once from the base `tier` and shared across phases — if that's a known limitation, `baseTier` would be the natural hook, but as written it's just dead). - **`run/phases.go:244-256` vs `run/executor.go:495-501` — duplicated multimodal part-building.** `phaseUserMessage` and the image-folding block inside `runAgent` are byte-for-byte the same logic (`make([]llm.Part, 0, len(images)+1)`, trim-guard the text part, append each image). The comment on `phaseUserMessage` even says "Mirrors the single-loop multimodal seeding in runAgent" — an explicit acknowledgment of the copy-paste. Extract a small `func userParts(text string, images []llm.ImagePart) []llm.Part` and have both call sites use it, so a future change to multimodal seeding (e.g. a new part type) only has to land in one place. Low-churn, removes the mirror-by-hand maintenance hazard. Nothing else in my lane: naming is consistent with the surrounding code, the `switch` in `runPhases` is readable, helper functions are appropriately small and single-purpose, and the new test file follows the existing table/fake-provider patterns. </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> I have enough to assess the Performance lens. The phase runner is LLM-round-trip-bound, and I verified the key allocation/loop sites against the actual code in `run/phases.go` and `run/executor.go`. ## VERDICT: No material issues found Reviewed strictly through the **⚡ Performance** lens. The new phase runner is dominated by sequential model round-trips; the surrounding bookkeeping is all bounded by the (small, author-defined) phase count, so there's nothing I can justify as a real efficiency regression. What I checked and cleared: - **No per-phase toolbox re-decoration.** The expensive `Registry.Build` (permissions/SSRF/delegation decoration) runs **once** in `Run` (`executor.go:191-232`); each phase only calls `filterToolbox` (`phases.go:281`), which copies already-decorated tool *references* into a fresh `*llm.Toolbox`. Cost is O(phases × tools) reference copies — negligible vs. the model calls. Good reuse, not a regression. - **No unbounded growth across phases.** Each phase is a *fresh* agent loop seeded only with the original query + any templated prior outputs (`phases.go:149`, `176-178`); the prior phase's full transcript is **not** carried forward into the next loop's context. So context size doesn't compound phase-to-phase. The `outputs` map (`phases.go:65`) grows by exactly one entry per phase, and salvaged transcripts are tail-bounded to 8000 chars (`phases.go:235-238`). - **Shared accumulators are bounded.** The single `stepObserver`/emitter and `addUsage` tally across phases (`executor.go:265-290`, `phases.go:84`); total steps ≤ Σ per-phase caps, total usage is a fixed-field sum. No quadratic accumulation. - **Template/usage helpers are cheap.** `expandPhaseTemplate` parses one small template per phase (`phases.go:262-276`) and `addUsage` is constant-time (`phases.go:304-311`) — both off the hot path relative to a model call. - **No N+1 / blocking-call surprise.** `phaseModel` re-resolves the tier only for phases that set a non-empty `ModelTier` (`phases.go:195-206`); that's required for per-phase tiers, not redundant work. Below-threshold, deliberately **not** flagged as a finding (micro-optimization): when `phase.Tools` is empty, `filterToolbox` still allocates a full copy of the base palette per phase (`phases.go:283-288`) rather than returning `deps.baseToolbox` directly — but it's a handful of pointer copies per phase, immaterial against the LLM latency. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> I've read the full implementation. Let me verify each finding against the actual code. **Finding 1 — `expandPhaseTemplate` (lines 262-276):** Confirmed on all three sub-claims by reading the function: - Line 263-265: parse failure returns `tmpl` verbatim, no log. A phase referenced as `{{.deep-research}}` (hyphen/space/leading-digit name) is an invalid Go template field selector → parse error → entire prompt returned unchanged with the literal `{{...}}` passed through. - Line 267 seeds `data["Query"]`, then lines 268-270 overwrite from `priorOutputs`, so a phase named `Query` shadows the original input. - Lines 264-265 and 272-273 both `return tmpl` with no `slog.Warn`, unlike `phaseModel` (201) and `filterToolbox` (292, 296) which do warn. Confirmed. **Finding 2 — `salvagePhaseTranscript` (lines 236-238):** Confirmed. `out[len(out)-maxSalvage:]` is a byte-offset slice on a string; a multibyte rune straddling the boundary is split. Real (cosmetic) issue. Both findings independently reproduce against the real source and rest only on in-repo logic. Both survive. --- **VERDICT: Minor issues** ### 🧯 Error handling & edge cases - **`run/phases.go:262-276` — `expandPhaseTemplate` swallows template errors silently, and several valid `Phase.Name`s break threading invisibly.** Three coupled edge cases, all silent: 1. A phase whose `Name` is not a valid Go template identifier — hyphens, spaces, leading digits (`deep-research`, `step 1`, `2nd`) — makes any downstream `{{.deep-research}}` a *parse* error, so the whole prompt is returned **unchanged** and the literal `{{.deep-research}}` is sent to the model. The kernel's own `RunnableAgent.Phases` DTO places no constraint on `Name`, so this is reachable from valid config. 2. A phase named `Query` silently shadows the original input: `data` is seeded with `"Query": query` (line 267) and then `for k,v := range priorOutputs { data[k]=v }` (268-270) overwrites it, so later phases' `{{.Query}}` resolves to that phase's output, not the run input. 3. Both the parse failure (264-265) and execute failure (272-273) return the template verbatim with **no `slog.Warn`** — unlike every other fallback in this file (`phaseModel`, `filterToolbox` all WARN). A misconfigured pipeline therefore degrades to "model got raw `{{...}}` text" with zero diagnostics. Impact: misconfiguration produces silently-wrong prompts, very hard to debug. Suggested fix: log a WARN on parse/execute failure, and at minimum document/validate that `Name` must be a bare template identifier (and reserve `Query`). - **`run/phases.go:236-238` — `salvagePhaseTranscript` tail-trim can split a UTF-8 rune.** `out[len(out)-maxSalvage:]` slices at a byte offset, so a multi-byte rune straddling the 8000-byte boundary becomes an invalid leading fragment (a replacement char) in the salvaged text. Cosmetic, but trivial to fix: advance to a rune boundary, e.g. trim on `utf8.RuneStart` or slice on a `[]rune`. Everything else in this lane checked out: the `runCtx.Err()` guard between phases (72-74), the `nil`-result handling for failed bare calls (82-85, 128-130), the budget-vs-hard-error classification (`isPhaseBudgetExhaustion`, 211-213), the last-phase-with-empty-salvage falling through to a hard error (100, 117-118), and `filterToolbox` skipping unknown/duplicate tools without aborting (289-298) are all sound. </details> </details> <details><summary><b>claude-code/opus:max</b> (claude-code) — Blocking issues found</summary> **Verdict: Blocking issues found** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> I've verified the security-relevant surfaces of this change. Findings: - **Template expansion is not an injection sink.** `expandPhaseTemplate` (`run/phases.go:266`) parses `phase.SystemPrompt` (agent-definition config) as the template *source*, and injects the user `query` and prior phase outputs only as *data values* in the `map[string]string`. A query containing `{{...}}` is never re-parsed, so untrusted input can't inject template directives. It uses `text/template` (correct — prompt text, no HTML/XSS context). Even a fully attacker-controlled phase template can only reach `Query`/prior outputs and call methods on `string` values — `text/template` is not a sandbox escape and exposes no fs/env, so no info disclosure beyond the run's own data. - **Tool permissioning is preserved across phases.** `toolbox` is built via `e.cfg.Registry.Build(... Visibility("private") ...)` (`run/executor.go:191`), i.e. the decorated/permissioned palette. `filterToolbox` (`run/phases.go:289`) only re-adds the *same* tool objects returned by `box.Get(name)` and can only ever produce a *subset* — a phase cannot acquire a tool absent from the agent's palette, nor strip the permission decoration. No privilege escalation. (Empty `Tools` → full palette is documented, agent-author-controlled behavior, not a vuln.) - **IsRunFunc bare call** (`run/phases.go:154`) passes no tools at all — strictly more restrictive than the loop path; no new sink. - No new untrusted-input handling of secrets, no SSRF/command/path/SQL surface, no deserialization. Phase outputs threading into later prompts stays within the same run's existing trust boundary (the model already had that data). No material security issue introduced by this change. VERDICT: No material issues found </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> I've confirmed both findings against the actual source. **Finding 1 (IsRunFunc bypasses critic clock):** Confirmed. `critic.recordStep` is called only at `run/executor.go:272`, inside `stepObserver` (defined at line 265). The IsRunFunc branch in `run/phases.go` calls `deps.rec.OnStep(1, resp)` directly and never routes through `stepObserver`, so the critic's activity clock is not refreshed by a bare phase. `critic.recordStep`→`h.RecordStep` (`run/critic.go:82-86`) is the sole refresh path, and the executor's deadline goroutine (`run/critic.go:48-75`) cancels `runCtx` once the (unrefreshed) hard deadline passes — a real false-kill vector when a stall-based critic is enabled. The executor comment at `executor.go:343` and the phases.go header both assert the step observer feeds the critic "across every phase," which IsRunFunc phases violate. Kept. **Finding 2 (silent template swallow):** Confirmed. `expandPhaseTemplate` returns the prompt unchanged on both `Parse` and `Execute` errors with no log, while sibling helpers `filterToolbox` and `phaseModel` both `slog.Warn` on their failure paths. A phase name containing `-`/space can't be referenced via `{{.Name}}` (parse error → whole prompt left unexpanded). Kept as a minor observability concern. Both findings survive; verdict unchanged. --- ## VERDICT: Minor issues The core mechanics are sound. Verified against the actual code: - **Usage aggregation is correct** — `addUsage` (`run/phases.go:304`) sums exactly the five `llm.Usage` fields the rest of the repo references (`InputTokens`, `OutputTokens`, `CacheReadTokens`, `CacheWriteTokens`, `ReasoningTokens`); no field is dropped. Audit token accounting (`rec.TokenStats()` via `OnStep`) and the returned `Result.Usage` (via `totalUsage`) are independent sinks, so there's no double-count (`executor.go:426`). - **Output threading, optional-fallback, hard-error-abort, and budget-salvage branches** in `runPhases` are logically consistent with their tests, and the `%w` wrapping in the default branch preserves `errors.Is` (matches `TestPhases_HardErrorAborts`). - `model.Generate` returns `*llm.Response`, so `rec.OnStep(1, resp)` and `resp.Message()` are correctly typed; the `IsRunFunc` test exercises this path. Findings: - **`run/phases.go:154-160` — `IsRunFunc` phases never refresh the critic's activity clock (possible false stall-kill).** The bare-LLM branch calls `deps.rec.OnStep(1, resp)` directly but bypasses `stepObserver`, which is the *only* caller of `critic.recordStep` (`run/executor.go:272`, `run/critic.go:82-86`). So for the entire duration of a bare phase, the run-critic's last-activity timestamp stays frozen at the prior phase's value. A slow `IsRunFunc` phase — exactly the plan/synthesize transforms this feature targets, which can run long on a large threaded context — can be misjudged as stalled and nudged or killed. This contradicts the PR's stated invariant ("the same step observer feeds audit/steps/critic across all phases"). Impact is gated on the host enabling a stall-based critic, hence not blocking. Fix: call `critic.recordStep(1, resp)` alongside the `rec.OnStep` in the `IsRunFunc` branch. *(Real-world kill depends on critic config — medium confidence.)* - **`run/phases.go:283-296` — `expandPhaseTemplate` silently swallows parse/exec failures, masking broken threading.** On any `template.Parse`/`Execute` error it returns the prompt *unchanged*, so the literal `{{.x}}` text leaks into the system prompt with no diagnostic. Two concrete ways to hit this: (a) a typo'd phase variable, and (b) a phase whose `Name` isn't a valid Go-template identifier (contains `-`, space, etc.) — then `{{.<PhaseName>}}` is a *parse* error and the whole template silently fails to expand, breaking threading for that phase. Given threading is the headline feature, a silent degradation is a correctness smell. Fix: `slog.Warn` on the parse/exec error (mirroring the WARN discipline used in `filterToolbox` and `phaseModel`) so a misconfigured pipeline is visible. *(Medium confidence.)* Items checked and judged **not** bugs (documented design choices): per-phase fixed step caps vs. the critic's dynamic ceiling; last-phase budget-exhaustion with non-empty output reporting `ok` rather than `error`; `missingkey=zero` forward-references resolving to `""`. The 8000-byte tail-trim in `salvagePhaseTranscript` can split a UTF-8 rune but that's trivial. </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> Both findings are confirmed against the actual code: - `baseTier` is declared at `run/phases.go:52` and assigned at `run/executor.go:348` (`baseTier: tier`), with no other matches anywhere in `run/` — a genuine dead field. The `phaseDeps` doc comment does advertise "the base model + tier," so it's also misleading. - `phaseUserMessage` (`run/phases.go:244-256`) and `runAgent` (`run/executor.go:495-501`) build `[]llm.Part` with a line-for-line identical block — same `make([]llm.Part, 0, len(images)+1)`, same `strings.TrimSpace(...) != ""` guard on the text part, same image loop — differing only in the final wrap. ## VERDICT: Minor issues Through the **code cleanliness & maintainability** lens, the new `run/phases.go` is well-organized, thoroughly commented, and follows the surrounding executor conventions (nil-safe `rec`, `slog.Warn` on degrade, `fmt.Errorf("…: %w")` wrapping). Two genuine maintainability nits: - **`phaseDeps.baseTier` is a dead field** — `run/phases.go:52`, set at `run/executor.go:348` (`baseTier: tier`) but never read anywhere in the package (`grep -rn "baseTier" run/` matches only the declaration and the assignment). The `phaseDeps` doc comment even advertises "the base model + tier," so the comment misleads too. `phaseModel` resolves tiers off `phase.ModelTier`/`deps.baseModel` and never needs the base tier string. Fix: drop the field and the `baseTier:` line (and trim "+ tier" from the comment), or actually use it (e.g. in the resolve-failure WARN). - **`phaseUserMessage` duplicates `runAgent`'s image-folding** — `run/phases.go:244-256` vs `run/executor.go:495-501`. The block that builds `[]llm.Part` from a (possibly-blank) query plus inline images is line-for-line identical; the only difference is the final wrap (`llm.UserParts(parts...)` returned directly vs handed to `agent.WithHistory`). The new helper's own comment ("Mirrors the single-loop multimodal seeding in runAgent") flags the copy-paste. Fix: extract a small `multimodalParts(text string, images []llm.ImagePart) []llm.Part` and call it from both `runAgent` and `phaseUserMessage`, so the "omit empty text part" rule lives in one place. Nothing blocking in this lane. The phase/loop branching in `Run`, the `systemPromptWithBody` extraction, and the error-classification `switch` in `runPhases` are all clean and readable. </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> ## VERDICT: No material issues found I reviewed `run/phases.go` and the `executor.go` branch through the performance lens, verifying the actual checked-out code (read both files in full around the changed regions). The phase runner is well-bounded and I found no efficiency regression worth flagging: - **No unbounded growth.** `outputs` map, `totalUsage`, and the accumulated `emitter`/critic step state are all bounded by the phase count and the fixed per-phase `MaxSteps` caps. `salvagePhaseTranscript` is explicitly tail-capped at `maxSalvage = 8000` (`run/phases.go:235-238`), and the tail slice is concatenated into a fresh string so it doesn't retain the larger backing array. - **No N+1 / quadratic-over-data.** `expandPhaseTemplate` rebuilds its `data` map per phase by copying prior outputs (`run/phases.go:267-270`), which is O(phases²) in *map entries* — but phase counts are single-digit and string values are copied by reference, so this is immaterial. No per-item DB/model fan-out inside any loop. - **Shared machinery is genuinely shared.** `sharedOpts` (step observer, tool-error limits, compactor) is built once in `Run` and reused across phases (`executor.go:295-355`), so the expensive setup isn't repeated per phase. The per-phase `append([]agent.Option{...}, deps.sharedOpts...)` allocates a small new slice per phase — trivial. Two things I considered and am deliberately **not** filing as findings, because they're inherent to the phased design rather than avoidable code inefficiencies: - **Images re-sent every phase** (`runOnePhase` seeds `images` into each loop via `runAgent`, and `phaseUserMessage` re-attaches them for `IsRunFunc` phases — `run/phases.go:149,178,244-256`). For an image-bearing multi-phase run this multiplies image token/upload cost by the phase count vs. a single-loop run. But each phase is a fresh context with no shared message history, so re-seeding the query+images is structurally required; there's no cheaper rewrite without changing the design. Whether *later* phases need the images at all is a correctness/design question, not my lane. - **`filterToolbox` clones the full toolbox even when `phase.Tools` is empty** (`run/phases.go:283-288`), where the single-loop path uses the toolbox directly. This is O(tools) allocations per loop phase. With single-digit phases and a modest palette it's negligible; I could not verify `Toolbox.Add`'s per-call cost (majordomo lives outside the readable working dir), so I won't assert a real impact. Nothing here rises to a material performance problem. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Blocking issues found</summary> I've verified both findings against the actual source in `run/phases.go` and `run/executor.go`. **Finding 1** is confirmed. In `runPhases` the `switch` at `run/phases.go:88` evaluates `case phase.Optional:` (line 89) before any context-abort check. For a non-final phase a context cancellation is re-caught by the next iteration's `runCtx.Err()` guard (line 72), but the **final** phase has no next iteration: the `Optional` branch substitutes `FallbackMessage`, the loop ends, and `runPhases` returns `(lastResult, nil)`. Back in `Run`, `statusFor(runCtx, nil)` hits its first case `runErr == nil` and returns `"ok"` (line 393) — its `ErrCriticKill` relabel at line 400 is never reached. The in-repo comment at line 395-399 confirms a critic kill surfaces as a plain `context.Canceled` run error, corroborating that the abort flows through `runOnePhase` as an `err`. Real. **Finding 2** is confirmed: `out[len(out)-maxSalvage:]` at line 237 slices a string on a byte index, which can split a multi-byte rune. Trivial/cosmetic. Both survive. --- ## VERDICT: Blocking issues found ### 🧯 Findings **1. (high) Optional last phase swallows context-cancellation / deadline / critic-kill → run misreported as `"ok"`** — `run/phases.go:88-98` The `switch` in `runPhases` matches `case phase.Optional:` first and swallows `err` *unconditionally*, without distinguishing the phase's own intrinsic failure from a run-level abort (caller cancel, `MaxRuntime` deadline, or a critic kill). For every phase except the last this is masked: the next iteration's top-of-loop `runCtx.Err()` check (`run/phases.go:72`) re-detects the cancellation and aborts. But when the **final** phase is `Optional`, there is no next iteration: - `runOnePhase` → `runAgent` returns `context.Canceled` (the critic kills via `cancelCause(ErrCriticKill)`, which cancels `runCtx`; a deadline / caller-cancel likewise), - the `Optional` branch substitutes `FallbackMessage` and falls through, - the loop ends and `runPhases` returns `(lastResult, nil)`. Back in `Run` (`run/executor.go:357`), `runErr == nil`, so `statusFor` (`run/executor.go:391-394`) returns `"ok"` and `res.Err == nil`. `statusFor` keys entirely off `runErr`; its only `runCtx` consult is the `ErrCriticKill` relabel (line 400), which is never reached because the `runErr == nil` case wins first. Concrete impact: a timed-out / caller-cancelled / **critic-killed** run whose trailing phase is `Optional` is reported as a successful run — audit status `"ok"`, the `FallbackMessage` delivered as if it were a real answer, and a critic's kill decision silently overridden. This defeats the `killed`/`timeout`/`cancelled` distinction the surrounding code goes to lengths to preserve. Plausible in practice: a best-effort synthesis/enrichment final phase marked `Optional`. Fix: don't let `Optional` (or the budget branch) swallow a run-context abort. Add a guard *before* the `Optional` case, e.g.: ```go case runCtx.Err() != nil || errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded): return lastResult, err ``` (`errors` is already imported and used at `run/phases.go:212`.) **2. (trivial) `salvagePhaseTranscript` tail-trims on a byte boundary** — `run/phases.go:236-237` `out[len(out)-maxSalvage:]` can split a multi-byte UTF-8 rune, so the salvaged text can begin with an invalid partial rune (right after the prepended `"…trimmed…\n"`). Cosmetic only — it's fed into a downstream prompt, not parsed — but a rune-safe trim (advance to the next rune start, or trim on a line boundary) is cheap. </details> </details> <details><summary><b>deepseek-v4-pro:cloud</b> (ollama-cloud) — Minor issues</summary> **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> ## Verdict: No material issues found I traced the flow of untrusted data (user `query` and prior-phase LLM outputs) through the new phase pipeline and found no material security issues. **What I verified:** - **Template injection is not possible** (`phases.go:262-276`): `expandPhaseTemplate` uses Go's `text/template` with the template text sourced from `phase.SystemPrompt` (host-configured, not user-controlled). The user's `query` and prior phase outputs are passed as *data values* to `t.Execute()`, not as template text. In Go's template engine, data values are rendered as literal strings — they cannot inject template directives. A prior phase's LLM output containing `{{...}}` would appear literally in the expanded prompt, not be interpreted. This is structurally safe. - **No new injection surfaces**: The expanded prompt and user message flow into `llm.Request.System` and `llm.Request.Messages` respectively, which are consumed by the LLM provider. There is no SQL, shell, path, or HTML context where these strings could be interpreted as code. The LLM is the intended consumer. - **Safe error defaults**: `filterToolbox` skips unknown tool names with a WARN rather than aborting (`phases.go:291-293`). `phaseModel` falls back to the base model on resolution failure (`phases.go:199-204`). `expandPhaseTemplate` returns the raw template unchanged on parse/execute failure (`phases.go:264-265, 272-273`). None of these degrade security — they're graceful degradation, not privilege escalation. - **No new authn/authz gaps**: Phases run within the same `runCtx` and inherit the same budget gating, critic, and audit recorder as the single-loop path. No phase can escalate privileges or bypass the existing run-level controls. - **No SSRF, unsafe deserialization, or credential handling**: The phase runner makes no new network calls beyond the existing model/tool dispatch. It deserializes nothing. It handles no secrets. - **Prompt injection via prior phase outputs is inherent to the design**: An earlier phase's LLM output is intentionally threaded into later phase prompts via `{{.<PhaseName>}}`. This is the documented feature, not a bug — the PR description states "Outputs thread into later phases." The kernel cannot sanitize LLM outputs for prompt injection without defeating the purpose of the pipeline. </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> All three findings are confirmed against the actual code. Here is the corrected review: # Verdict: Minor issues ## Findings - **`run/phases.go:146-161` — IsRunFunc phases bypass the shared `stepObserver`** (verified by reading `executor.go:265-289` and `phases.go:146-161`). The `stepObserver` closure — which feeds `stateAcc.SetIteration`, `critic.recordStep`, and the step emitter — is passed via `sharedOpts` to every agent-loop phase but is **never called** for `IsRunFunc` phases. Only `deps.rec.OnStep(1, resp)` is called directly. Impact: the critic's activity clock is not refreshed during an IsRunFunc phase, so a slow bare-LLM call could trigger a false stall detection; steps from IsRunFunc phases are absent from `Result.Steps` (the emitter never sees them); the live iteration counter (`stateAcc`) stalls. **Fix**: call the shared `stepObserver` (or at minimum `critic.recordStep` + the emitter) for IsRunFunc phases too, or document that these phases are invisible to the critic/emitter by design. - **`run/phases.go:118` — `totalUsage` not applied to `lastResult` on hard-error return** (verified by tracing lines 82-84, 118, and `executor.go:358-360`). When a non-optional phase fails hard and `runOnePhase` returns a non-nil partial `res`, line 83 overwrites `lastResult = res` and line 84 accumulates `totalUsage` correctly. But the error return on line 118 sends `lastResult` as-is — its `.Usage` field still holds only the failed phase's partial usage, not the accumulated `totalUsage` that includes prior successful phases. The caller (`executor.go:360`) copies `runRes.Usage` into `res.Usage`, so the run result under-reports token consumption. **Fix**: set `lastResult.Usage = totalUsage` before the `return lastResult, ...` on line 118. - **`run/phases.go:100` — budget exhaustion on the last phase with empty output is fatal, contradicting the documented contract** (verified by reading lines 100-118 and the doc comment on lines 34-36). The doc states "a non-optional phase that merely exhausts its step/tool budget is NOT fatal," but the condition `!isLast || strings.TrimSpace(output) != ""` means the *last* phase exhausting its budget with no salvageable output falls through to the `default` error case and aborts the run. This is arguably reasonable (there's nothing to return), but it contradicts the blanket guarantee in the doc comment. **Fix**: either update the doc to note the last-phase exception, or treat it as a soft stop with a generated placeholder (consistent with the non-last case on line 107). </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> All four findings are confirmed against the actual code. Let me verify each one: 1. **`baseTier` dead field**: Grep confirms `baseTier` appears only at the struct definition (phases.go:52) and the construction site (executor.go:348). `deps.baseTier` has zero reads anywhere in the package. **Confirmed.** 2. **`phaseUserMessage` duplicates `runAgent`'s multimodal logic**: Both functions (phases.go:244-256 and executor.go:491-506) build the same `[]llm.Part` from text+images with identical branching (`len(images)==0` → simple path, else build parts slice). The comment on `phaseUserMessage` explicitly says "Mirrors the single-loop multimodal seeding in runAgent." **Confirmed.** 3. **`expandPhaseTemplate` silently swallows errors**: Lines 263-265 and 272-274 return `tmpl` unchanged on parse/execution failure with no `slog.Warn` or any diagnostic. The doc comment even says "best-effort, not fatal." **Confirmed.** 4. **Magic number 8000**: Line 235 is `const maxSalvage = 8000` with no explanatory comment. **Confirmed.** All four findings survive. The verdict remains "Minor issues." ## Verdict: Minor issues - **`phaseDeps.baseTier` is dead code** (`run/phases.go:52`). The field is populated in `executor.go:348` (`baseTier: tier`) but never read anywhere in the package. `grep` confirms `deps.baseTier` has zero usages. This is a dead struct field that will confuse future readers. Remove it from the struct and the construction site. - **`phaseUserMessage` duplicates `runAgent`'s multimodal message logic** (`run/phases.go:244–256` vs `run/executor.go:491–506`). Both functions build a `llm.Message` from `query + images` with the same pattern: no images → simple path, otherwise build `[]llm.Part` with optional text + image parts. The comment on `phaseUserMessage` even acknowledges it "mirrors" `runAgent`. Extract a shared helper (e.g., `func userMessage(query string, images []llm.ImagePart) llm.Message`) so the two call sites don't drift. - **`expandPhaseTemplate` silently swallows parse/execution errors** (`run/phases.go:262–276`). When a template is malformed or execution fails, the function returns the raw template string unchanged — meaning the LLM sees literal `{{.Query}}` / `{{.PhaseName}}` in its system prompt. There is no log, no warning, no audit event. A misconfigured phase prompt would produce subtly wrong behavior with zero diagnostic signal. At minimum, log a `slog.Warn` on failure so operators can detect the problem. - **Magic number `maxSalvage = 8000` in `salvagePhaseTranscript`** (`run/phases.go:235`). The constant has no explanation of why 8000 characters was chosen. A named constant with a comment (e.g., "cap salvage to roughly 2K tokens to avoid bloating downstream phase prompts") would make the intent clear and easier to tune later. </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> ## Verdict: No material issues found The phase runner is linear in both phases and steps, with no N+1 queries, quadratic loops, or unbounded growth. LLM call latency dominates by orders of magnitude; the CPU/memory overhead of template parsing, toolbox copying, and string building is negligible in that context. A few micro-observations that don't rise to the level of material findings: - **`filterToolbox` copies the full toolbox even when `names` is empty** (`phases.go:283-287`). When a phase doesn't restrict tools (the common case), the function could return the original `box` directly instead of allocating a copy and re-adding every tool. With typical toolbox sizes (~10–50 tools) and phase counts (2–5), the wasted allocation is trivial relative to LLM call latency. - **`expandPhaseTemplate` re-parses the template string on every run** (`phases.go:263`). The phase `SystemPrompt` is static per agent definition; caching parsed templates would avoid repeated lexing/parsing. Again, microsecond-level savings against second-to-minute LLM calls. - **`salvagePhaseTranscript` byte-index slices a UTF-8 string** (`phases.go:237`). `out[len(out)-maxSalvage:]` can split a multi-byte rune, producing invalid UTF-8. This is primarily a correctness concern (outside my lens), but could cause wasted work if a downstream LLM API rejects the malformed text. None of these are material enough to flag as performance regressions. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> ## Verdict: Minor issues - **`run/phases.go:237` — `salvagePhaseTranscript` truncates on byte boundary, can split UTF-8** (medium, high confidence). The line `out[len(out)-maxSalvage:]` slices at a byte offset, which can break a multi-byte rune in half, producing invalid UTF-8 in the salvaged output that gets threaded into later phases. Use `[]rune` conversion or iterate with `utf8.DecodeLastRuneInString` to find a safe cut point. Verified by reading the function; `len(out)` on a `string` returns byte count, and the subsequent slice operates on byte indices. - **`run/phases.go:201` — `phaseModel` logs `error=<nil>` when resolver returns nil model without error** (small, high confidence). The condition `err != nil || m == nil` on line 200 triggers the warn log for both cases, but when `m == nil && err == nil` the log field `"error", err` emits `error=<nil>`, making the message "model resolve failed" misleading — there was no error, just a nil model. Distinguishing the two cases (or logging `"model", "nil"` instead) would make the warning actionable. Verified by reading lines 199-205. </details> </details> <details><summary><b>devstral-2:123b-cloud</b> (ollama-cloud) — Minor issues</summary> **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — Minor issues</summary> Now, let's verify the findings in the draft review: ### Finding 1: Template Injection Risk in `expandPhaseTemplate` (`run/phases.go:262`) - **Verification**: The `expandPhaseTemplate` function uses `text/template` with `Option("missingkey=zero")`. This does not prevent template injection attacks if an attacker controls the `SystemPrompt` field. - **Impact**: Confirmed. If an attacker can control the `SystemPrompt`, they could inject malicious template directives. - **Conclusion**: This finding is valid. ### Finding 2: Unsanitized User Input in Template Data (`run/phases.go:267-270`) - **Verification**: The `query` and `priorOutputs` values are directly passed into the template data map without sanitization. - **Impact**: Confirmed. This could lead to XSS or other injection issues if the output is rendered in a web context. - **Conclusion**: This finding is valid. ### Finding 3: Lack of Input Validation for `SystemPrompt` (`run/agent.go:64`) - **Verification**: The `SystemPrompt` field in the `Phase` struct is a free-form string with no validation. - **Impact**: Confirmed. This could lead to resource exhaustion or template injection if the prompt is user-controlled. - **Conclusion**: This finding is valid. All findings in the draft review are confirmed. The verdict remains "Minor issues." </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> ## Verdict Minor issues ## Findings - **`run/phases.go:212`**: The `isPhaseBudgetExhaustion` function checks for `agent.ErrMaxSteps` and `agent.ErrToolLoop`, but these errors are not defined in the repository. This could lead to incorrect behavior if the errors are not properly defined or imported. **Verified** by searching for the error definitions in the repository and finding no matches. ## gadfly-findings ```json [ {"file": "run/phases.go", "line": 212, "severity": "high", "confidence": "high", "title": "Undefined errors `agent.ErrMaxSteps` and `agent.ErrToolLoop` may cause incorrect behavior"} ] ``` </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> Now, let's verify each finding in the draft review: 1. **Inconsistent Naming in `Phase` Struct**: - The draft claims that `IsRunFunc` is inconsistently named compared to other fields like `MaxIterations` and `ModelTier`. However, `IsRunFunc` is a boolean flag that indicates whether a phase is a single bare LLM call, which is distinct from the other fields. The naming is appropriate for its purpose, and there is no inconsistency. 2. **Long Function in `runPhases`**: - The `runPhases` function is indeed long and handles multiple responsibilities. This is a valid observation. 3. **Magic Number in `salvagePhaseTranscript`**: - The constant `maxSalvage = 8000` is defined within the function and lacks context. This is a valid observation. 4. **Redundant Comments**: - The comment for the `phaseDeps` struct is descriptive but could be more concise. This is a valid observation. 5. **Inconsistent Error Handling in `runOnePhase`**: - The error handling in `runOnePhase` is nested and intertwined with the main logic. This is a valid observation. 6. **Unused Variable in `runPhases`**: - The variable `lastOutput` is used in the final synthesis of the result, so it is not unused. This finding is incorrect. Based on this verification, the findings that are confirmed are: - Long function in `runPhases` - Magic number in `salvagePhaseTranscript` - Redundant comments in `phaseDeps` - Inconsistent error handling in `runOnePhase` The corrected review is as follows: ## Corrected Review ### Verdict **Minor issues** ### Findings 1. **Long Function in `runPhases`** - **Location**: `run/phases.go:64-140` - **Issue**: The `runPhases` function is quite long and handles multiple responsibilities, including iterating over phases, handling errors, and managing outputs. This makes it harder to read and maintain. - **Suggestion**: Break this function into smaller, focused functions. For example, separate the error handling logic into a dedicated function. 2. **Magic Number in `salvagePhaseTranscript`** - **Location**: `run/phases.go:260` - **Issue**: The constant `maxSalvage = 8000` is a magic number. While it is defined within the function, it lacks context or explanation. - **Suggestion**: Define this as a named constant at the package level or within the function with a clear comment explaining its purpose. 3. **Redundant Comments** - **Location**: `run/phases.go:46-58` - **Issue**: The `phaseDeps` struct has a comment that explains its purpose, but the comment is somewhat redundant with the struct's name and fields. The comment could be more concise. - **Suggestion**: Shorten the comment to focus on the unique aspects of `phaseDeps` that aren't immediately obvious from the struct definition. 4. **Inconsistent Error Handling in `runOnePhase`** - **Location**: `run/phases.go:143-200` - **Issue**: The `runOnePhase` function handles errors in a nested manner, which can make it difficult to follow the flow of execution. The error handling logic is intertwined with the main logic. - **Suggestion**: Extract error handling into separate functions or use a more structured approach to improve readability. </details> <details><summary><b>⚡ Performance</b> — Minor issues</summary> ## Verdict Minor issues ## Findings - **Potential Inefficiency in `salvagePhaseTranscript`**: The function `salvagePhaseTranscript` in `run/phases.go` (line 218) reconstructs the phase output by iterating over all steps and concatenating their responses. This operation could be inefficient for phases with a large number of steps, as it involves multiple string concatenations and trimming operations. Consider using a more efficient approach, such as preallocating a buffer or using a more optimized string-building method. ## gadfly-findings ```json [ {"file": "run/phases.go", "line": 218, "severity": "medium", "confidence": "high", "title": "Potential inefficiency in salvaging phase transcripts"} ] ``` </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> Now let's verify the findings: 1. **Finding at `run/phases.go:106`**: The draft claims that `strings.TrimSpace(output) == ""` is checked redundantly at lines 106 and 108. However, the code shows that the check at line 106 is part of an `if` condition, while line 108 is part of an `else` block. These are not redundant checks but rather mutually exclusive conditions. Therefore, this finding is incorrect and should be dropped. 2. **Finding at `run/phases.go:186`**: The draft claims that `salvagePhaseTranscript` assumes `res.Steps` is non-nil without an explicit check. Looking at the code, the function `salvagePhaseTranscript` does iterate over `res.Steps` without a nil check. This is a valid concern, and the finding should be kept. 3. **Finding at `run/phases.go:236`**: The draft claims that the `maxSalvage` constant is hardcoded to `8000` without validation. This is confirmed in the code, and there is no additional validation or explanation for this value. This finding should be kept. ## Corrected Review ## VERDICT Minor issues ## Findings - **`run/phases.go:186`**: The `salvagePhaseTranscript` function assumes `res.Steps` is non-nil, but there is no explicit check for this. If `res.Steps` is `nil`, the loop will panic. Add a nil check before iterating over `res.Steps`. - **`run/phases.go:236`**: The `maxSalvage` constant is hardcoded to `8000`, but there is no validation to ensure this value is reasonable or safe for all use cases. Consider making this configurable or adding a comment explaining the rationale for this value. </details> </details> <details><summary><b>glm-5.2:cloud</b> (ollama-cloud) — Minor issues</summary> **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — Minor issues</summary> Both findings are confirmed against the actual source. - Finding 1: `expandPhaseTemplate` (lines 262-275) substitutes prior-phase outputs (`outputs[phase.Name]`, populated at line 122) into a later phase's prompt template; the result (`instructions`) becomes the phase's **system** prompt at line 144 (`system := e.systemPromptWithBody(instructions)`), passed to `agent.New(model, system, ...)` (line 176) and `model.Generate(..., System: system)` (line 150). The injection-into-system-prompt claim holds. (Note: line 107-108 cited in the draft is the budget-exhaustion note strings, not the injection site; the real injection sites are lines 76/122/144/150/176.) - Finding 2: `expandPhaseTemplate` returns the raw `tmpl` on both `Parse` (264-266) and `Execute` (272-274) error with no WARN logged. Confirmed. ## Corrected review **Verdict: Minor issues** - **Untrusted model outputs are injected into later phases' system prompts (`run/phases.go:76`, `run/phases.go:122`, `run/phases.go:144`, `run/phases.go:262-275`)** — `expandPhaseTemplate` substitutes prior-phase outputs (model-generated, influenced by adversarial query/tool content) via `{{.<PhaseName>}}`; the expanded `instructions` become the next phase's `System:` prompt (`systemPromptWithBody` → `agent.New(model, system, …)` / `model.Generate{System: system}`). In the single-loop path model content only ever returns as conversational user/assistant messages; here it is elevated to system-prompt status, which models weight more authoritatively. This is a prompt-injection escalation path: a malicious query or tool result that drives phase A's output can embed instructions that phase B treats as system-level directives. The *system-prompt* placement is the kernel's choice and materially raises the injection surface. Suggested mitigation: render prior-phase outputs into the later phase's **user message** (or a fenced, clearly-delimited context block) rather than the `System:` field; at minimum document the trust boundary for phase-template authors. - **Template parse/execution failures silently fall back to the raw prompt (`run/phases.go:263-266`, `run/phases.go:272-274`)** — `expandPhaseTemplate` returns the original `tmpl` on either `Parse` or `Execute` error. For a malformed operator template this means unexpanded `{{...}}` tokens (which may contain prior-phase output names or `.Query`) reach the model verbatim — not a direct injection, but an operator typo silently leaks template scaffolding into the prompt with no error surfaced. Low severity; consider logging a WARN on the parse/execute-error path so misconfigured phase prompts aren't silently swallowed. No authn/authz, command/SQL injection, SSRF, deserialization, or secret-leakage issues found in the touched code. Tool/model-tier inputs are operator-config-sourced and validated/resolved through existing resolvers; `filterToolbox` skips unknown tool names rather than widening access; `phaseModel` falls back to the base model on resolution failure rather than failing open to an arbitrary tier. </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> ## VERDICT: Minor issues Verified against the actual code in `run/phases.go` (lines 64–134, 140–191), `run/executor.go` (the single-loop/phase branch at 328–355, `statusFor` at 391, `runAgent` at 491), and a repo-wide grep for `ReasoningTokens`/`ErrMaxSteps`/`ErrToolLoop`. - **`run/phases.go:100-118` — a final non-optional phase that exhausts its step/tool budget with no salvageable narrated text aborts the whole run with a hard error and empty output, discarding all earlier phases' completed work — contradicting the file's own documented "NOT fatal" guarantee.** Confirmed by tracing the switch at lines 88–119. `runOnePhase` (lines 178–190) only sets `output` to a non-empty value when `salvagePhaseTranscript(res)` finds assistant prose; otherwise `output` stays `""`. Back in `runPhases`, the budget-exhaustion case is gated by `isPhaseBudgetExhaustion(err) && (!isLast || strings.TrimSpace(output) != "")` (line 100). For the last phase with empty salvage that condition is `true && (false || false)` = `false`, so execution falls through to `default` (line 117): `return lastResult, fmt.Errorf("pipeline phase %q: %w", phase.Name, err)`. `lastResult` at that point is the *last* phase's `res` (it is reassigned every iteration at line 83), so its `.Output` is empty and carries none of the earlier phases' outputs — those live only in the local `outputs` map (line 65) and the transiently-overwritten `lastResult`, both dropped on the early return. The caller (`executor.go:357-363`) then classifies this as a failed run (`statusFor` default branch) with `res.Output = ""`. This is precisely the "long verify phase" the doc comment (lines 34–37) promises will "never [discard] every earlier phase's work": a verify/consolidation phase is typically *last* in a research pipeline, and when the model burns its whole step budget on tool calls without writing a final prose answer, salvage returns `""` and the run is reported as a total failure. The fix is to handle the empty-last-phase-budget case explicitly: synthesize a degraded output from the prior `outputs` map (or at minimum a "final phase did not consolidate" note) and return `nil` error instead of falling to `default`, so earlier phase work survives. (Confidence: medium — the code path is unambiguous; one can argue a hard abort is defensible when there is genuinely nothing to return, but that is not what the documented contract states.) The `addUsage` field concern was already self-dropped by the draft and is not a finding: a repo-wide grep confirms `llm.Usage.ReasoningTokens` is referenced in `audit/writer.go:109,139-140`, so the field exists and `addUsage` compiles. The input-file staging ordering (`stageInputFiles` at `executor.go:323` precedes the single-loop/phase branch) and the `{{.Query}}`/`{{.<PhaseName>}}` template threading through `expandPhaseTemplate` (`missingkey=zero` on a `map[string]string`) are correct as the draft noted. </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> All three findings are confirmed against the actual source code. The duplication between `phaseUserMessage` (phases.go:244-256) and `runAgent` (executor.go:491-505) is real, the convoluted guard at line 100 is as described, and `addUsage` does sum field-by-field against an external `llm.Usage` struct. ## Verdict: Minor issues - **`run/phases.go:244-256` — `phaseUserMessage` duplicates `runAgent`'s multimodal seeding.** The comment admits it "mirrors the single-loop multimodal seeding in runAgent," but the code is copy-pasted rather than shared. `runAgent` (executor.go:491-505) already encapsulates the "user text + inline images → `llm.UserParts`" logic with the same trimming/append pattern. The current duplication means a future change to one (e.g., how empty query text is handled, image part ordering) silently diverges between single-loop and phase paths. Suggested fix: extract a shared `userMessageWithImages(query string, images []llm.ImagePart) llm.Message` helper used by both `runAgent` and `phaseUserMessage`. - **`run/phases.go:100` — convoluted guard condition hurts readability.** `case isPhaseBudgetExhaustion(err) && (!isLast || strings.TrimSpace(output) != "")` couples the "is this a soft stop" check with a last-phase non-empty-output check in one boolean expression, with the real intent (a final phase that exhausts budget with no salvageable output is treated as a hard failure) only visible by reading the body. A two-line comment or an `isLast && strings.TrimSpace(output) == ""` early-fallthrough to `default:` would make the salvage-vs-abort boundary legible without a comment archaeology dig. - **`run/phases.go:304-311` — `addUsage` field-by-field sums `llm.Usage`.** This is the standard Go approach and works today, but it's a maintainability hazard if `llm.Usage` (defined in the external `majordomo/llm` package) ever gains a field — a phased run would silently under-report that field. The `llm` package is an external module dependency (not vendored in this checkout), so the struct definition could not be inspected directly; this is a low-severity "fragile if upstream changes" note, not a current bug. </details> <details><summary><b>⚡ Performance</b> — No material issues found</summary> **VERDICT: No material issues found** Reviewed `run/phases.go`, `run/executor.go`, and `run/phases_test.go` through the performance lens. - **Per-phase allocations are bounded and proportional to phase count.** `outputs` (`map[string]string` sized to `len(ra.Phases)`), `filterToolbox` (one new toolbox per phase, O(tools) copy), `expandPhaseTemplate` (one `text/template` parse per phase), and the per-phase `opts` slice copy of `sharedOpts` are all O(phases) and cheap relative to the LLM calls they bracket. None of these sit in a hot loop beyond the phase loop itself. - **`salvagePhaseTranscript` is capped** at `maxSalvage = 8000` via a tail-trim, and it only iterates `res.Steps` once with a single `strings.Builder`. No quadratic/unbounded growth. - **Usage aggregation (`addUsage`)** is a fixed 5-field additive fold over phases — O(phases), no extra allocation beyond the value copy of `totalUsage`. - **No N+1 or repeated blocking work introduced.** Each phase makes exactly one model call/loop; the shared `stepObserver`/`steer`/`compactor` options are built once in `Run` and reused by reference (`sharedOpts` appended, not rebuilt) across phases. Two things I deliberately did *not* flag (verified they are outside the performance lane / inherent to the feature): - Images are re-seeded into every phase's user message (`phaseUserMessage`), so each phase re-sends all images to the model. This is an inherent cost of "each phase is a fresh loop seeded with the original query," mirroring the single-loop multimodal path — not a regression, and removing it would break phase isolation. - The `WithCompactor` threshold is computed once for the base `tier` and applied to all phase agents regardless of their per-phase `ModelTier`; a phase on a larger-context model could compaction-fire earlier than necessary. That's a correctness/threshold-fidelity concern, not a performance regression, and it depends on tier configs I cannot inspect. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> All three findings are confirmed against the actual code. Line numbers in the draft's JSON need correcting to match the actual source. - Finding 1 (byte-truncation): actual line is 237, not 247. Confirmed — `out[len(out)-maxSalvage:]` slices by byte offset. - Finding 2 (Optional case precedes budget-exhaustion case): confirmed at line 89. - Finding 3 (prior phase transcript carried on last-optional failure): confirmed at lines 82-133. ## VERDICT: Minor issues I verified the error-handling paths against the checked-out code. The phase runner's unhappy paths are mostly sound: cancellation is checked between phases (`runCtx.Err()`), hard errors abort, budget exhaustion salvages partial transcripts, and `agent.ErrMaxSteps`/`agent.ErrToolLoop` are real symbols. A few edge-case nits remain: - **`run/phases.go:237` `salvagePhaseTranscript` — byte-truncation can split a multibyte UTF-8 rune.** The bound `out[len(out)-maxSalvage:]` slices by byte offset; if the cut lands inside a multibyte character the salvaged string begins with invalid UTF-8, which then becomes the phase output (and may be threaded into the next phase's template or returned as the run output). Low impact (only triggers on >8KB salvage), but a real boundary bug. Fix: use a rune-aware truncate. - **`run/phases.go:89` `runPhases` switch ordering — an `Optional` phase that hits budget exhaustion is swallowed with `FallbackMessage`, discarding the salvaged partial transcript.** `case phase.Optional:` precedes the `isPhaseBudgetExhaustion` case, so a long-running optional verify phase that ran out of steps loses its (real, salvageable) partial work and is replaced by a static fallback note. This appears intentional ("Optional swallows a phase's error and substitutes FallbackMessage"), but it is a silent data-loss edge case worth confirming: an optional phase that did substantial work before hitting the step cap yields *less* useful output than a non-optional one in the same situation. - **`run/phases.go:128` `runPhases` — final synthesized `lastResult` may carry a *prior* phase's transcript when the last phase fails optionally.** When the last phase is `Optional` and fails with a nil `res`, `lastResult` is not updated and retains the previous phase's `agent.Result`; the code then overwrites `Output`/`Usage` with the fallback/accumulated usage, but `Messages` remains the prior phase's transcript (passed to the PostRun hook). The doc says "last phase's transcript for the PostRun hook," which is inaccurate in this path. No swallowed errors or missing cleanup found: the `defer`/context handling is consistent with the single-loop path, and `phaseModel`'s resolve-failure fallback is logged at WARN rather than dropped silently. </details> </details> <details><summary><b>kimi-k2.6:cloud</b> (ollama-cloud) — Blocking issues found</summary> **Verdict: Blocking issues found** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> No material issues found. Through the **Security** lens I examined: - **Template expansion** (`expandPhaseTemplate`, run/phases.go:262–276): Phase prompts are parsed with `text/template` against a `map[string]string` of prior outputs and the original query. The data surface is limited to strings (no functions or structs are exposed), and no custom func-map is registered. The key risk — template injection — would only materialize if an untrusted party could author the `Phase.SystemPrompt` template itself; per the PR, `RunnableAgent` (including its `Phases`) is a host-controlled DTO, so the template source is trusted. `missingkey=zero` is a safe default. - **Error logging** (run/phases.go:97, 114): `err.Error()` is logged into the internal `RunRecorder` audit stream on optional-phase failure and budget exhaustion. This is consistent with existing audit logging in the executor and does not expose data to a new, less-trusted consumer. - **Model resolution** (`phaseModel`, run/phases.go:195–206): Tier strings come from the host-provided `Phase` config, not from user input, so the `e.cfg.Models(ctx, tier)` call does not introduce a new SSRF vector. - **Tool filtering** (`filterToolbox`, run/phases.go:281–300): Skips unknown tools with a warning; no authority bypass. - **Result synthesis** (`runPhases`, run/phases.go:64–134): The final result is synthesized from phase outputs that were already produced by the agent/LLM stack; no unsafe deserialization or credential leakage was introduced. </details> <details><summary><b>🎯 Correctness</b> — Blocking issues found</summary> **Blocking issues found** - **`run/phases.go:236-238`** — `salvagePhaseTranscript` truncates by **byte length** (`len(out) > maxSalvage` and `out[len(out)-maxSalvage:]`). For any multi-byte UTF-8 text this slices through runes, producing invalid UTF-8 and garbled output fed into downstream phases. The limit should count runes (`utf8.RuneCountInString`) and slice on a rune boundary. - **`run/phases.go:262-275`** — `expandPhaseTemplate` builds `text/template` variables as `{{.<PhaseName>}}` (line 269: `data[k] = v`). Go `text/template` identifiers may contain only letters, digits, and underscores. Phase names with hyphens (`deep-research`), digits at the start, spaces, etc. parse as invalid expressions or fail execution silently, causing the template to return **unexpanded** — downstream phases receive literal `{{.deep-research}}` instead of the actual prior output. This contradicts the documented threading semantics and is a silent data-corruption bug. Use `index` syntax (`{{index . "Phase-Name"}}`) or validate/sanitize phase names. </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> **Minor issues** - **`run/phases.go:244-255`** — `phaseUserMessage` duplicates the multimodal `llm.Part` assembly already present in `runAgent` (`executor.go:491-505`). Both functions build a `[]llm.Part` from an optional text part + image parts with identical `strings.TrimSpace(query) != ""` guards. Extract a shared helper like `buildMultimodalUserMessage(query string, images []llm.ImagePart) llm.Message` and call it from both sites. - **`run/phases.go:304-311`** — `addUsage` manually accumulates every field of `llm.Usage`. If the upstream struct gains a new field (e.g. a future cost or token category), this function silently under-counts. Prefer an `Add` method on `llm.Usage` (if majordomo doesn't provide one, add it there) or generate this with a simple reflection helper; at minimum add a compile-time assertion that verifies field count. - **`run/phases.go:262-275`** — `expandPhaseTemplate` uses `text/template` for simple key-value substitution and silently returns the original string on parse or execution failure. Phase names containing hyphens, spaces, or other non-identifier characters (e.g. `deep-research`) produce invalid template syntax (`{{.deep-research}}` is parsed as a subtraction expression), causing silent fallback to the unexpanded template. This is a debugging trap for pipeline authors. Consider a simpler `strings.ReplaceAll`-style expander or document the identifier restriction and log/return the error at WARN level so authors know why substitution failed. - **`run/phases.go:235`** — Magic constant `const maxSalvage = 8000` with no explanation of where the value comes from. Is it bytes, runes, characters? Why 8000? Add a comment describing the rationale (e.g. "tail of ~2k tokens at ~4 bytes/token") or make it a configurable default. </details> <details><summary><b>⚡ Performance</b> — Blocking issues found</summary> **Blocking issues found** * **Template re-parsed on every phase execution instead of being pre-parsed or cached** — `run/phases.go:262-263` `expandPhaseTemplate` calls `template.New("phase").Option("missingkey=zero").Parse(tmpl)` once per phase, every run. Since `tmpl` is the static `Phase.SystemPrompt` from configuration, the same AST is rebuilt repeatedly. For a multi-phase agent under load this is pure lexer/parser overhead that scales with request rate. **Fix:** Pre-parse `Phase.SystemPrompt` into a `*template.Template` when the `RunnableAgent` is loaded/constructed (or cache parsed templates in a `sync.Map` keyed by prompt string) and execute with the per-run data map. * **`filterToolbox` copies entire toolbox when no filtering is requested** — `run/phases.go:283-287` When a phase does not specify a tool subset (`len(names) == 0`), `filterToolbox` still allocates a new `llm.Toolbox` and copies every tool into it. If the base toolbox is large and the pipeline has many phases, this is unnecessary allocation and `Add` overhead on every phase of every run. The single-loop path already passes the original `toolbox` pointer directly without copying. **Fix:** Return `box` directly when `len(names) == 0`: ```go if len(names) == 0 { return box } ``` * **Double `strings.TrimSpace` in budget-exhaustion path** — `run/phases.go:100` and `:106` The condition on line 100 evaluates `strings.TrimSpace(output) != ""`, and line 106 evaluates `strings.TrimSpace(output) == ""` for the same string. This redundant scan is minor but easily avoided by binding the trimmed result once. **Fix:** Hoist `trimmed := strings.TrimSpace(output)` before the switch. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> Minor issues found. * **`run/phases.go:100`** — The budget-exhaustion soft-stop condition `isPhaseBudgetExhaustion(err) && (!isLast || strings.TrimSpace(output) != "")` makes the **final** phase fatal when it exhausts its budget and produces no salvageable transcript. This contradicts the stated semantics that non-optional budget exhaustion is always non-fatal. Earlier phases are protected, but a last-phase soft-stop still aborts the whole run. Suggested fix: drop the `!isLast ||` guard so the last phase degrades with a generated placeholder message (like the non-last phases already do). Verified by reading the `runPhases` error switch. * **`run/phases.go:236–237`** — `salvagePhaseTranscript` truncates by raw byte count (`len(out) > maxSalvage` then `out[len(out)-maxSalvage:]`), which can split a multi-byte UTF-8 rune and emit invalid UTF-8 into the next phase’s template variable. Suggested fix: truncate by rune count or walk back to the nearest valid UTF-8 boundary. Verified by reading the function body. </details> </details> <details><summary><b>qwen3.5:397b-cloud</b> (ollama-cloud) — Minor issues</summary> **Verdict: Minor issues** — 5 reviewers: security, correctness, maintainability, performance, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> The finding is **confirmed**. I've verified: 1. **`persona/agent.go:159-191`** — `AgentPhase` struct has both `FallbackMessage` (lines 183-185) and `IsRunFunc` (lines 187-190) fields. 2. **`run/agent.go:62-79`** — `run.Phase` struct has both `FallbackMessage` (lines 72-74) and `IsRunFunc` (lines 75-78) fields. 3. **`persona/runnable.go:26-35`** — The conversion loop copies `Name`, `SystemPrompt`, `ModelTier`, `MaxIterations`, `Tools`, and `Optional`, but **omits** `FallbackMessage` and `IsRunFunc`. This is a genuine bug: personas configured with `IsRunFunc: true` (intended as safe, tool-less transform steps) will silently run as full agent loops with tool access, and custom `FallbackMessage` values are lost. **No material issues found** — the single finding in the draft is valid and survives verification. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> ## VERDICT: No material issues found After thorough review through the **Correctness lens**, I verified the following: - **Template expansion** (`expandPhaseTemplate`, phases.go:262-276): Correctly expands `{{.Query}}` and `{{.<PhaseName>}}` using Go's `text/template` with `missingkey=zero`. Returns original on parse/execute failure (best-effort). Verified against test `TestExpandPhaseTemplate`. - **Output threading** (phases.go:122): Phase outputs stored in `outputs` map and threaded to subsequent phases via template expansion. Verified by test `TestPhases_SequentialThreadsOutputs` showing phase b receives "out-a" and phase c receives "out-b". - **Optional phase error handling** (phases.go:89-98): Sets `output = phase.FallbackMessage` (or generated note) while allowing pipeline to continue. `lastResult` from the failed phase is retained for Messages/Usage. Verified by test `TestPhases_OptionalFailureSubstitutesFallback`. - **Budget exhaustion salvage** (phases.go:100-115, 185-188): Non-optional phases that exhaust step/tool budget salvage partial transcript via `salvagePhaseTranscript` and continue (unless last phase with empty output). `salvagePhaseTranscript` (phases.go:218-239) safely handles nil result. - **Usage aggregation** (`addUsage`, phases.go:304-310): Correctly sums all five `llm.Usage` fields (InputTokens, OutputTokens, CacheReadTokens, CacheWriteTokens, ReasoningTokens). - **IsRunFunc bare call** (phases.go:146-162): Single LLM call without tool loop, properly constructs `agent.Result` with Output, Usage, and Messages. - **SystemHeader composition** (`systemPromptWithBody`, executor.go:437-445): Correctly prepends platform header to each phase's expanded instructions. Verified by test `TestPhases_SystemHeaderAppliedPerPhase`. - **Toolbox filtering** (`filterToolbox`, phases.go:281-299): Preserves palette order, skips unknown tools with WARN. Verified by test `TestFilterToolbox`. - **Context cancellation** (phases.go:72-74): Checks `runCtx.Err()` before each phase to abort on timeout/cancellation. The implementation correctly ports mort's pipeline semantics while reusing kernel machinery. Tests cover sequential threading, optional failures, hard errors, IsRunFunc, SystemHeader, toolbox filtering, and template expansion. </details> <details><summary><b>🧹 Code cleanliness & maintainability</b> — Minor issues</summary> The finding is **confirmed**. I've verified: 1. `baseTier` is declared at `run/phases.go:52` in the `phaseDeps` struct 2. It's populated at `run/executor.go:348` (`baseTier: tier`) 3. A grep across the entire repository for `baseTier` returns only those two occurrences — there is **no read access** to this field anywhere in the codebase This is dead code: the field is written but never read. ```markdown ## Verdict: Minor issues - **`run/phases.go:52` — Dead field `baseTier` in `phaseDeps`** The `baseTier string` field is declared in `phaseDeps`, populated by `executor.go:348` (`baseTier: tier`), but **never read** anywhere in the repository (grep confirms only the declaration and initialization exist). This is dead code that adds confusion without serving a purpose. **Fix:** Remove the `baseTier` field from `phaseDeps` and delete the corresponding initialization in `executor.go:348`. </details> <details><summary><b>⚡ Performance</b> — Minor issues</summary> **VERDICT: Minor issues found** **Performance findings:** - **`run/phases.go:263` — Template re-parsed every phase execution (medium severity, high confidence)**: `expandPhaseTemplate` calls `template.New(...).Parse(tmpl)` on every phase, even when the phase's `SystemPrompt` is a static template string. For pipelines with many phases or repeated runs of the same agent, this is redundant work. The template string itself doesn't change between runs — only the data (`Query`, prior outputs) does. **Fix**: Cache parsed templates per `Phase.SystemPrompt` (e.g., a `sync.Map` keyed by prompt string, or pre-parse during agent initialization). Verified by reading `run/phases.go:262-276`; no caching mechanism exists in the file or surrounding code. - **`run/phases.go:283-287` — Full toolbox copy when `phase.Tools` is empty (small severity, high confidence)**: `filterToolbox` copies ALL tools from the base toolbox when `names` is empty/nil (lines 283-287). If a phased agent defines phases without explicit tool filtering, each phase incurs an O(n) toolbox copy where n is the total tool count. For agents with large tool palettes and many phases, this adds up. **Fix**: When no filtering is needed, return the base toolbox directly (it's read-safe) or share a cached filtered copy. Verified by reading `run/phases.go:278-300`. - **`run/phases.go:235` — `maxSalvage` constant is reasonable but unverified derivation (trivial severity)**: The 8000-character tail-trim bound in `salvagePhaseTranscript` appears arbitrary. While this does prevent unbounded growth (good), I cannot verify this threshold was derived from token budget constraints or empirical data. If this is meant to fit within a specific context window margin, it should be documented or derived from `e.cfg.ContextTokens`. Verified by reading `run/phases.go:218-240` and searching for `ContextTokens` usage in `run/`; no derivation found. </details> <details><summary><b>🧯 Error handling & edge cases</b> — No material issues found</summary> **VERDICT: No material issues found** After reviewing the changes through the error handling & edge cases lens and verifying against the checked-out code: **Findings:** None. The code demonstrates robust error handling: - **Context cancellation**: Checked at phase boundaries (`runCtx.Err()` at phases.go:72) and propagates through `runOnePhase` calls - **Optional phase failures**: Properly substitute `FallbackMessage` (or generated default) and continue pipeline (phases.go:89-98) - **Budget exhaustion**: Non-fatal for non-optional phases; partial transcripts salvaged via `salvagePhaseTranscript` (phases.go:100-115, 215-240) - **Nil guards**: `res != nil` checks before dereferencing (phases.go:82, 180, 219), `deps.rec != nil` checks (phases.go:77, 154) - **Template errors**: `expandPhaseTemplate` returns original string on parse/execute failure (phases.go:262-275) - **Unknown tools**: `filterToolbox` skips with WARN rather than failing (phases.go:291-294) - **Model resolution failures**: `phaseModel` falls back to base model with WARN (phases.go:195-205) - **Empty collections**: Handled correctly (empty `Tools` → full palette, empty `images` → text-only message) - **Zero/negative values**: `MaxIterations <= 0` falls back to `baseMaxIter` (phases.go:165-168) All error paths either handle the error appropriately or propagate it with context. No swallowed errors, missing defers, or unhandled edge cases identified in this lens. </details> </details> </details> <sub>Automated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.</sub>
steve added 1 commit 2026-06-29 19:44:05 +00:00
fix(run): address gadfly review of the phases PR
executus CI / test (pull_request) Successful in 48s
0dd2ced717
Real findings from the consensus review (37 raw; many devstral dups/noise):

- Optional/budget-salvage branches no longer swallow a context
  cancellation / deadline / critic-kill: such errors return immediately so
  the run is classified cancelled/timeout/killed, not "ok" with a fallback.
  (the most serious finding — an Optional final phase could mask a killed run)
- IsRunFunc bare phase now feeds the SHARED step observer (not just the
  audit recorder), so the critic's activity clock + Result.Steps see it —
  a long synthesize phase no longer looks idle to the critic.
- phaseModel returns the resolver's enriched (usage-attribution) context and
  the phase's calls use it, mirroring the single-loop path (non-base-tier
  phases were mis-attributed).
- salvagePhaseTranscript trims the tail on a rune boundary (was a raw byte
  slice that could split a UTF-8 rune); maxSalvage is now a named const with
  rationale.
- expandPhaseTemplate logs a WARN on parse/execute failure instead of
  silently returning the unexpanded template; documented the phase-name
  identifier requirement + the "Query" shadow.
- removed the dead phaseDeps.baseTier field.
- extracted multimodalUserMessage, shared by runAgent + the phase runner
  (was duplicated image-folding).
- aggregated phase usage is stamped onto the result even on a hard-error
  return; TrimSpace computed once; filterToolbox returns the base toolbox
  as-is for the empty-names (full-palette) case instead of copying;
  phaseModel WARN no longer prints error=<nil>.

New test: Optional phase does not swallow a cancellation. Full ./... green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
steve merged commit c071ed4996 into main 2026-06-29 19:52:52 +00:00
steve deleted branch feat/kernel-phases 2026-06-29 19:52:52 +00:00
Sign in to join this conversation.
No Reviewers
No Label
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: steve/executus#19