feat(run): durable checkpoint + resume (wire Ports.Checkpointer) #20
Reference in New Issue
Block a user
Delete Branch "feat/kernel-checkpoint"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
The kernel defined
run.Ports.Checkpointer+ thecheckpointbattery but never drove them (the documented "P2 follow-up"). This wires durable recovery into the run loop so a run interrupted by shutdown resumes on the next boot instead of being lost — the executus-side half of mort's durable-agent-recovery parity (mort #1355; durable recovery is not host-fixable because the resume state needs the loop's internal[]llm.Messagetranscript).Kernel (
run/)Ports.Checkpointeris now aCheckpointerFactory(Begin(ctx, RunInfo) → (Checkpointer, error), ornilfor a non-durable run). A single per-instanceCheckpointer(Save/Complete/Fail, no run id) can't distinguish concurrent runs; a factory mints one per run, matching mort'sagentexec.CheckpointerFactory. The port was never consumed before, so this is not a breaking change for any caller.RunInfogainsGuildID+ModelTier(so the factory can build resume meta);RunCheckpointStategainsCompletedPhases+ActivePhase(+PhaseOutput).run/checkpoint.go:ResumeState+WithResumeState/WithExistingCheckpointercontext carriers;classifyCheckpointOutcome(success → Complete, shutdown → leave for boot recovery, else → Fail — keyed onrun.ErrShutdown);finalizeCheckpoint.run/executor.go: resolve the per-run checkpointer (existing-from-ctx on a recovery re-run, elsefactory.Begin); the single-loop path wraps the step observer to accumulate the transcript +Saveeach step (the host throttles), and a recovered run seeds the saved transcript viaWithHistoryand continues with no new input (completed tool calls are not re-run); finalize on exit.run/phases.go: phase-boundary checkpointing — record completed phases after each phase; a resumed run skips already-completed phases. The interrupted phase re-runs from its start (boundary-granular, documented) — only the single-loop path resumes mid-loop.Battery (
checkpoint/)NewFactorywires the battery into the factory port (per-run handle, meta derived fromRunInfo);RunCheckpoint+handle.Savenow carry the phase fields.Note for the host (mort)
classifyCheckpointOutcomekeys shutdown onrun.ErrShutdown. mort currently stamps its ownrunengine.ErrShutdown(a distinct value) on the base context — the mort wiring PR will alias them soerrors.Ismatches.Tests (
run/checkpoint_test.go)finalize decision matrix; single-loop Save+Complete; terminal-error Fail; resume seeds history; phase-boundary Saves completed phases; resume skips completed phases. Full
go test ./...green.🤖 Generated with Claude Code
🪰 Gadfly — live review status
7/7 reviewers finished · updated 2026-06-29 20:21:51Z
claude-code/opus· claude-code — ✅ doneclaude-code/opus:max· claude-code — ✅ donedeepseek-v4-pro:cloud· ollama-cloud — ✅ donedevstral-2:123b-cloud· ollama-cloud — ✅ doneglm-5.2:cloud· ollama-cloud — ✅ donekimi-k2.6:cloud· ollama-cloud — ✅ doneqwen3.5:397b-cloud· ollama-cloud — ✅ doneLive status board. Findings are posted in each model's own comment. Advisory only — does not block merge.
🪰 Gadfly consensus review — 20 inline findings on changed lines. See the consensus comment for the full ranked summary.
Advisory only — does not block merge.
@@ -84,0 +87,4 @@// factory is a run.CheckpointerFactory that mints a per-run handle over store,// deriving the per-run meta from the kernel's RunInfo. It is the battery's glue// for the Ports.Checkpointer (factory) seam: every run becomes durable (the// store persists snapshots; a host wanting lazy/short-run skipping uses its own⚪ factory.now field is never set by NewFactory (always nil); dead/untestable field
maintainability · flagged by 1 model
🪰 Gadfly · advisory
@@ -84,0 +92,4 @@type factory struct {store CheckpointStorethrottle time.Durationnow func() time.Time⚪ factory.now field is never assigned (NewFactory omits it) — dead/always-nil clock field
maintainability · flagged by 1 model
checkpoint/handle.go:95—factory.nowis dead. Thefactorystruct carries anow func() time.Timefield, butNewFactorynever sets it (handle.go:103-105), sof.nowis always nil andBeginalways passes nil toNew(handle.go:122), which then defaults totime.Now. Unlikehandle, there's no constructor/option that injects a clock, so the field can never be anything but nil. Either drop it, or add the injection path it implies (mirroringhandle's testable clock).🪰 Gadfly · advisory
@@ -84,0 +100,4 @@// NewFactory returns a run.CheckpointerFactory backed by store: each run gets a// per-run Checkpointer (throttled to at most once per throttle). A nil store// yields factory.Begin returning a no-op Checkpointer.func NewFactory(store CheckpointStore, throttle time.Duration) run.CheckpointerFactory {🟡 NewFactory never initializes factory.now, leaving a configurable-looking but always-nil field
maintainability · flagged by 1 model
f.nowstays nil, passed toNewwhich falls back totime.Now. Confirmed.🪰 Gadfly · advisory
@@ -84,0 +104,4 @@return &factory{store: store, throttle: throttle}}// Begin mints the per-run Checkpointer. The prompt is read from🟡 Potential information leakage in checkpoint metadata
maintainability, security · flagged by 2 models
checkpoint/handle.go:109-121): TheBeginmethod extracts the prompt frominfo.Inputs["prompt"]without validation or sanitization. If the prompt contains sensitive information (e.g., API keys, PII), it will be stored in the checkpoint metadata. This could lead to unintended exposure if the checkpoint store is compromised or accessed by unauthorized parties. Consider redacting sensitive data or providing an allowlist/denylist mecha…🪰 Gadfly · advisory
@@ -0,0 +20,4 @@// WithResumeState; the executor reads it (single-loop seeds the saved transcript// as history; multi-phase skips completed phases and seeds the active phase).type ResumeState struct {History []llm.Message // single-loop transcript OR active-phase transcript🟠 ActivePhase field plumbed through 3 structs but never read/set; doc comments claim active-phase seeding that isn't implemented
correctness, maintainability · flagged by 5 models
ActivePhaseis dead weight, and its docs overpromise —run/checkpoint.go:25,run/ports.go:210,checkpoint/checkpoint.go:41. The field is plumbed through three structs (ResumeState,RunCheckpointState,RunCheckpoint) but the kernel never reads it and never writes a non-empty value to it. Every write site is either the literalActivePhase: ""(run/phases.go:186) or a pass-through copy (checkpoint/handle.go:61), and there is no read site anywhere — the single-loopSave(…🪰 Gadfly · advisory
@@ -0,0 +32,4 @@return context.WithValue(ctx, resumeStateKey{}, rs)}func resumeStateFromContext(ctx context.Context) *ResumeState {🟡 Context value type safety
security · flagged by 1 model
run/checkpoint.go:35,49): TheresumeStateFromContextandexistingCheckpointerFromContextfunctions use type assertions without error handling. If the context contains a value with the wrong type for these keys, the functions will silently returnnil, which could lead to unexpected behavior. While this is not a direct security vulnerability, it could mask configuration errors that might have security implications.🪰 Gadfly · advisory
@@ -0,0 +51,4 @@return cp}// checkpointOutcome is the finalize decision for a durable run.🟡 checkpointOutcome type and constants are not exported, limiting usability outside the package
maintainability · flagged by 1 model
run/checkpoint.go:54-61: ThecheckpointOutcometype and its constants are not exported. This limits their usability outside the package and could be confusing for external consumers who might need to interact with these outcomes. Consider exporting them if they are part of the public API. -run/phases.go:67-74: ThephaseDepsstruct includes acheckpointerfield and aresumefield, but the comments and logic around these fields are not entirely clear. The comments could be i…🪰 Gadfly · advisory
@@ -0,0 +70,4 @@switch {case runErr == nil:return checkpointCompletecase errors.Is(cause, ErrShutdown):🔴 ErrShutdown sentinel mismatch: shutdowns classified as Fail until mort wiring PR lands
correctness, error-handling · flagged by 2 models
run/checkpoint.go:73—classifyCheckpointOutcomekeys shutdown detection onrun.ErrShutdown, but mort currently stamps its ownrunengine.ErrShutdown(a distinct sentinel) on the base context. Until the mort wiring PR aliases them,errors.Is(cause, ErrShutdown)will befalsefor every mort shutdown, causingcheckpointFail(deletes the checkpoint) instead ofcheckpointLeaveRunning. The PR description acknowledges this gap; from the error-handling lens, it means the shutdown→r…🪰 Gadfly · advisory
@@ -0,0 +83,4 @@if cp == nil {return}switch classifyCheckpointOutcome(runErr, cause) {🔴 finalizeCheckpoint silently ignores Complete/Fail errors leaving stale recovery records
error-handling, security · flagged by 4 models
run/checkpoint.go:88-90—finalizeCheckpointswallows errors fromcp.Completeandcp.Fail. If the store fails to delete a finished run's checkpoint (network blip, timeout), the record survives and the next boot will incorrectly attempt to resume a run that already terminated. This creates false recovery candidates.🪰 Gadfly · advisory
@@ -183,0 +185,4 @@// Durable recovery (optional): a recovered run carries a ResumeState (prior// transcript / completed phases) + an existing Checkpointer in ctx so it// continues on the SAME durable record; a fresh run mints a per-run// Checkpointer via the factory (which decides durability — nil = non-durable).🔴 CheckpointerFactory.Begin error silently swallowed with no logging
error-handling · flagged by 1 model
run/executor.go:188—CheckpointerFactory.Beginerror is silently discarded. If the factory returns a storage error (e.g., DB unavailable), the executor proceeds as if the run is non-durable, with no log or metric. Operators have no way to detect a broken checkpoint store.🪰 Gadfly · advisory
@@ -183,0 +190,4 @@resume := resumeStateFromContext(ctx)ckpt := existingCheckpointerFromContext(ctx)if ckpt == nil && e.cfg.Ports.Checkpointer != nil {if c, cerr := e.cfg.Ports.Checkpointer.Begin(ctx, info); cerr == nil {🟠 CheckpointerFactory.Begin error silently swallowed with no log
correctness, error-handling · flagged by 4 models
run/executor.go:193-195— SwallowedBeginerror with no logging. IfCheckpointerFactory.Beginreturns a non-nil error (e.g., a transient store issue the factory didn't self-log), the executor silently degrades to non-durable with zero indication. The interface doc says the factory should log and return(nil, nil), but the executor cannot enforce that. At minimum, log a warning so operators know durability is broken for this run. *(Verified by readingexecutor.go:192-196— the `ce…🪰 Gadfly · advisory
@@ -333,0 +351,4 @@// transcript as history and continues with no new input. acc starts from the// resume history (or the opening user message) and grows as steps complete.obs := stepObserverif ckpt != nil {🔴 Resume logic incorrectly overwrites initial user message
correctness, maintainability, performance · flagged by 5 models
run/executor.go:356-367: The checkpointing logic for single-loop runs incorrectly initializes the transcript accumulator (acc). It starts withmultimodalUserMessage(input, inv.Images)and then overwrites it withresume.Historyif a resume state exists. This means the initial user message is lost during resume, which violates the semantic correctness of resuming a run. The resumed run should continue from the saved transcript, but the initial user message should still be part of…🪰 Gadfly · advisory
@@ -333,0 +364,4 @@if len(s.Results) > 0 {acc = append(acc, llm.ToolResultsMessage(s.Results...))}_ = ckpt.Save(runCtx, RunCheckpointState{Messages: acc, Iteration: s.Index + 1})🔴 Per-step checkpoint Save error silently swallowed
correctness, error-handling, performance, security · flagged by 6 models
run/executor.go:367— Per-step checkpointSaveerror silently swallowed: The per-step save in the single-loop path discards the error with_ =. Same impact as above — checkpoint persistence failures go unnoticed, risking stale recovery state. Verified by readingrun/executor.go:359-368. Suggested fix: Log the error; at minimum ensure the failure is recorded in the audit trail.🪰 Gadfly · advisory
@@ -355,1 +401,4 @@// Finalize durable recovery: clear the checkpoint on success/terminal failure,// or leave it for boot recovery when the run was interrupted by shutdown.finalizeCheckpoint(ctx, ckpt, runErr, context.Cause(runCtx))🟠 finalizeCheckpoint not deferred: a panic (or early setup-error return) bypasses it, leaving an orphaned recoverable checkpoint that should be Fail'd → boot recovery loop
error-handling · flagged by 1 model
run/executor.go:404—finalizeCheckpointis a plain statement, not deferred, so panics and early-error returns bypass it; a panicked run leaves an orphaned recoverable checkpoint. The checkpointer is begun at:193, the wrapped step observerSaves a snapshot each completed step (:367), butfinalizeCheckpoint(...)only runs at:404after the dispatch. The top-levelrecover()defer (:119-123) converts a panic anywhere in the loop intores.Errand returns without finali…🪰 Gadfly · advisory
@@ -64,6 +64,13 @@ type phaseDeps struct {stepObserver func(agent.Step)steer func() []llm.Messagerec RunRecorder// checkpointer records phase-boundary progress (completed phases) for durable🟡 Comments and logic around phaseDeps fields could be improved for clarity
maintainability · flagged by 1 model
run/checkpoint.go:54-61: ThecheckpointOutcometype and its constants are not exported. This limits their usability outside the package and could be confusing for external consumers who might need to interact with these outcomes. Consider exporting them if they are part of the public API. -run/phases.go:67-74: ThephaseDepsstruct includes acheckpointerfield and aresumefield, but the comments and logic around these fields are not entirely clear. The comments could be i…🪰 Gadfly · advisory
@@ -73,10 +80,22 @@ type phaseDeps struct {// deadline/critic-kill — returns the error.func (e *Executor) runPhases(runCtx context.Context, ra RunnableAgent, deps phaseDeps, query string, images []llm.ImagePart) (*agent.Result, error) {outputs := make(map[string]string, len(ra.Phases))var completed []PhaseOutput🟡 Potential Unbounded Growth in
completedSlice inrunPhasesperformance · flagged by 1 model
completedSlice inrunPhasesInrun/phases.go:83, thecompletedslice is appended to in each phase iteration without any bounds checking. This could lead to unbounded growth in memory usage for long-running or frequently resumed multi-phase runs. Consider adding a limit or periodic cleanup mechanism to prevent excessive memory consumption.🪰 Gadfly · advisory
@@ -92,1 +111,4 @@for i, phase := range ra.Phases {// Skip phases already completed on a resumed run (key presence, not output// emptiness — a legitimately-empty phase output still counts as done).if _, done := outputs[phase.Name]; done {🟠 Resume phase-skipping trusts checkpoint integrity without validation
error-handling, maintainability, security · flagged by 3 models
run/phases.go:114-116— Resume phase-skipping trusts checkpoint integrity (medium): On resume, phases are skipped based solely on key presence inoutputsmap (line 114:if _, done := outputs[phase.Name]; done), populated fromdeps.resume.CompletedPhases(lines 91-97). If an attacker compromises the checkpoint store, they can inject arbitraryPhaseOutputentries to skip security-critical phases. Verified: ReadrunPhasesresume logic — no integrity validation of `deps.resume.…🪰 Gadfly · advisory
@@ -154,0 +181,4 @@// hold/serialize it asynchronously.)completed = append(completed, PhaseOutput{Name: phase.Name, Output: output})if deps.checkpointer != nil {_ = deps.checkpointer.Save(runCtx, RunCheckpointState{🔴 Phase-boundary checkpoint Save error silently swallowed
correctness, error-handling, maintainability, performance · flagged by 5 models
run/phases.go:184— Phase-boundary checkpointSaveerror silently swallowed: The phase-boundary save discards the error with_ =. If the checkpoint store fails mid-run (disk full, DB locked, network blip), the run continues but checkpoints aren't persisted. On shutdown/recovery, the host will resume from a stale checkpoint (or none), potentially re-executing work already done or losing progress. Verified by readingrun/phases.go:183-188. Suggested fix: Log the error (at minim…🪰 Gadfly · advisory
@@ -188,2 +200,3 @@// minimal here; the executor extends what it records during the merge.// RunCheckpointState is the resumable snapshot a Checkpointer persists.type RunCheckpointState struct {// Messages is the running transcript (single-loop run) OR the active phase's🟡 Documentation claims multi-phase transcript resume not implemented
correctness · flagged by 1 model
run/ports.go:202-203— Documentation claims unsupported functionality - The comment statesMessagesis "the running transcript (single-loop run) OR the active phase's transcript (multi-phase run)". However, the multi-phase checkpoint save atrun/phases.go:184-187never setsMessages— it only savesCompletedPhasesandActivePhase. This means mid-phase transcript recovery for multi-phase runs is NOT supported (only boundary-granular). - Impact: Documentation promises fu…🪰 Gadfly · advisory
@@ -192,0 +207,4 @@// already finished, in phase order. nil for single-loop runs.CompletedPhases []PhaseOutput// ActivePhase is the name of the in-progress phase (multi-phase only).ActivePhase string🟡 RunCheckpointState.ActivePhase is always empty string — field exists but is never populated with meaningful data
maintainability · flagged by 1 model
run/ports.go:210/run/phases.go:186—RunCheckpointState.ActivePhaseis defined and persisted, but the only write in the entire codebase isActivePhase: ""atrun/phases.go:186. It is never set to a non-empty value.🪰 Gadfly · advisory
🪰 Gadfly review — consensus across 7 models
Verdict: Blocking issues found · 23 findings (9 with multi-model agreement)
run/executor.go:367run/executor.go:354run/phases.go:184run/checkpoint.go:23run/checkpoint.go:86run/executor.go:193run/phases.go:114run/checkpoint.go:73checkpoint/handle.go:10714 single-model findings (lower confidence)
run/executor.go:188run/executor.go:207run/executor.go:404run/phases.go:177checkpoint/handle.go:103run/checkpoint.go:35run/checkpoint.go:54run/phases.go:56run/phases.go:67run/phases.go:83run/ports.go:202run/ports.go:210checkpoint/handle.go:90checkpoint/handle.go:95Per-model detail
claude-code/opus (claude-code) — Minor issues
Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
This PR is internal kernel plumbing for durable checkpoint/resume. I've traced the tainted-input and trust-boundary paths relevant to the security lens.
VERDICT: No material issues found
I reviewed this through the security lens (authz, injection, untrusted-data handling, secret leakage) and verified the key paths against the checked-out code:
run/executor.go:185-196,run/checkpoint.go:18-52):ResumeStateand the existingCheckpointerarrive via in-processcontextvalues set by the host's boot-recovery path, and the saved transcript originates from the host-controlledCheckpointStore(durable infra). This is trusted host infrastructure, not an external/untrusted ingress point, so seedingagent.WithHistory(resume.History)does not cross a new trust boundary.run/phases.go— completed-phase outputs are threaded into the next phase's prompt via{{.<PhaseName>}}substitution. Go'stext/templatesubstitutes a.Fieldvalue as data; it does not re-parse interpolated values, so a phase output containing{{...}}cannot inject template directives. No injection vector here.e.cfg.Ports.Budget.Check(ctx, inv.CallerID),run/executor.go:139-143) runs before any resume seeding, so a recovered run is still gated. Authorization at re-dispatch is the host's responsibility, unchanged by this PR.factory.Begin(checkpoint/handle.go:107-123) copiesPrompt/CallerID/GuildIDintoRunCheckpointMetafor persistence — that is inherent to checkpointing and goes only to the host-supplied store, not to any external sink. Theinfo.Inputs["prompt"].(string)assertion is a safe comma-ok and is not a security concern.No authn/authz gap, injection, SSRF, unsafe deserialization, or insecure default is introduced by this change.
One adjacent observation, explicitly outside my lens (not a security finding): the documented
run.ErrShutdownvs mort'srunengine.ErrShutdownmismatch means a shutdown would currently be classified ascheckpointFailand delete the recovery record — that's an availability/correctness concern for the correctness lens, not security.🎯 Correctness — Minor issues
Both findings are confirmed against the actual code:
Finding 1 — Verified.
resume.Historyis read only in the single-loop path (executor.go:356,376); the multi-phase path reads onlyCompletedPhases(phases.go:91-97).ActivePhasehas no reader anywhere — it is written as the literal""atphases.go:186and merely copied throughhandle.go:61. MeanwhileResumeState(checkpoint.go:21,25) andRunCheckpointState(ports.go:209-210) document active-phase seeding that never occurs. Confirmed dead field + over-promising docs.Finding 2 — Verified.
phaseDepscomment atphases.go:56-57still says the shared options are "(tool-error limits + step observer + compactor)", butexecutor.go:310-312now buildssharedOptswith only tool-error limits + compactor; the step observer is wired per-path (executor.go:373,phases.go:236). Observer is added exactly once per path — no double-count. Pure doc drift.Both hold. Verdict unchanged.
I've traced the durable-recovery logic against the actual checked-out code. The core classification, cause-propagation, resume, and phase-skip logic all hold up. Findings below.
VERDICT: Minor issues
Findings
Multi-phase active-phase resume is documented but unimplemented (write-only/dead fields).
run/checkpoint.go:21,25andrun/ports.go:209-210advertise that a resumed run "seeds the active phase" viaResumeState.ActivePhase/ResumeState.History, andRunCheckpointState.ActivePhasecarries "the name of the in-progress phase (multi-phase only)." Verified by grep + readingrunPhases: the phase path only ever readsdeps.resume.CompletedPhases(run/phases.go:91-97); it never readsresume.ActivePhaseorresume.History(those are read only by the single-loop path atexecutor.go:356,376), and the phase-boundarySavehardcodesActivePhase: ""(run/phases.go:186). So for any multi-phase run the persistedActivePhaseis always empty, and a host that populatesResumeState.ActivePhase+History(per the doc) gets a silent full re-run of the interrupted phase instead of the mid-phase resume the field names imply. The boundary-granular behavior is correct and intentional (it's documented in thephaseDepsblock atphases.go:67-71), but the field-level docs over-promise and theActivePhasefield is effectively dead. Impact: misleading persisted state + a latent host-integration footgun, not a crash. Fix: either dropActivePhasefromRunCheckpointState/ResumeState(and the "seeds the active phase" wording), or actually record/consume it. (low severity.)Stale comment:
phaseDepsstill lists the step observer as part ofsharedOpts.run/phases.go:56-57describes "the shared agent options (tool-error limits + step observer + compactor)". This PR movedWithStepObserverout ofsharedOpts(run/executor.go:306-312) and now wires it per-path (executor.go:373,phases.go:236). The comment now contradicts the code. No behavioral effect — both paths add the observer exactly once, so there's no double-count or audit regression — purely doc drift. (trivial.)Verified clean (within the correctness lens)
classifyCheckpointOutcomekeying onerrors.Is(cause, ErrShutdown)is sound:finalizeCheckpointis called atexecutor.go:404withcontext.Cause(runCtx), before the deferred cancellations run, so the cause reflects the true outcome. Critic-kill/deadline/caller-cancel →Fail, shutdown → leave-running, success →Completeall match the test matrix.executor.go:376-382) seedsWithHistory+ empty input so completed tool calls aren't re-run; theaccreconstruction appends consistently at step granularity.checkpoint/handle.go) is correct; nilnowhandled byNew;RunCheckpointMetahas theGuildID/ModelTierfields, so it compiles.phases.go:114),isLastindexing stays correct across skipped phases, and a fully-resumed run returns the saved final output without a model call.🧹 Code cleanliness & maintainability — Minor issues
All four findings are confirmed against the code:
sharedOpts(executor.go:310-312) now holds only tool-error limits (+ optional compactor); thephaseDepsparenthetical still lists "step observer". The actual stale text is on line 56 (line 57 separately/correctly documents the distinctstepObserverfield). Confirmed.runPhasesconsumes onlyCompletedPhases. The "seeds the active phase" doc is unimplemented. Confirmed.""(phases.go:186; single-loop Save omits it); never read. Confirmed dead plumbing.NewFactory(handle.go:103) never sets it;f.nowis always nil. Confirmed dead field.All survive. Corrected review below (finding 1's line reference adjusted to where the stale text actually sits).
Review
VERDICT: Minor issues
Through the code cleanliness & maintainability lens, the change is generally well-structured (good use of the shared
multimodalUserMessagehelper, the per-path observer wrap reads cleanly, and theclassifyCheckpointOutcome/finalizeCheckpointsplit is tidy). A few stale-doc and speculative-field smells:run/phases.go:56— stale struct doc. ThephaseDepscomment still describes the shared agent options as "(tool-error limits + step observer + compactor)". This PR deliberately moved the step observer out ofsharedOpts(executor.go:310-312 now builds it with only tool-error limits + optional compactor;WithStepObserveronly appears at the per-path sites, executor.go:373 and phases.go:236). The comment now describes the pre-PR layout and contradicts therunOnePhasecomment you did update (phases.go:229-232). Fix: drop "step observer" from thesharedOptsparenthetical here.run/checkpoint.go:25(+ doc lines 20-21) —ResumeState.ActivePhaseis unused; docs describe unimplemented behavior. Grep confirmsResumeState.ActivePhaseis never read anywhere, and in the multi-phase path (runPhases) onlyCompletedPhasesis consumed (phases.go:91-97). Yet the struct doc claims the executor "seeds the active phase". The PR is explicitly boundary-granular (the active phase re-runs from scratch — see phaseDeps doc, phases.go:67-71), so this is a speculative exported field whose documentation promises behavior that doesn't exist. Either dropActivePhaseor clearly mark it "reserved; not yet consumed" so a future reader doesn't assume mid-phase resume works.run/ports.go:210+run/phases.go:186—RunCheckpointState.ActivePhaseis write-only/always empty. The only writer (phases.go:186) setsActivePhase: ""explicitly, and the single-loop path (executor.go:367) never sets it. So the persisted field is always empty in practice — dead plumbing paired with the unusedResumeState.ActivePhaseabove. The explicitActivePhase: ""(zero value) is also redundant.checkpoint/handle.go:95—factory.nowis dead. Thefactorystruct carries anow func() time.Timefield, butNewFactorynever sets it (handle.go:103-105), sof.nowis always nil andBeginalways passes nil toNew(handle.go:122), which then defaults totime.Now. Unlikehandle, there's no constructor/option that injects a clock, so the field can never be anything but nil. Either drop it, or add the injection path it implies (mirroringhandle's testable clock).Nothing here is blocking; these are low-churn doc/dead-field cleanups.
⚡ Performance — Minor issues
I've independently verified both findings against the actual code.
Finding 1 (executor.go:359 — accumulator bypasses compaction): Confirmed. At
executor.go:354-368the checkpoint observer buildsaccas a private append-only slice (acc = append(acc, s.Response.Message())+llm.ToolResultsMessage(s.Results...)every step), entirely independent of the agent's internal history. The compactor is attached only viasharedOpts(executor.go:328), which folds into the agent's options atexecutor.go:374— so compaction shrinks the agent's live working set but never touchesacc. The growth/serialization-cost framing holds.Finding 2 (executor.go:367 + handle.go — full-transcript Save every step, no throttle floor): Confirmed.
ckpt.Save(runCtx, RunCheckpointState{Messages: acc, ...})fires synchronously in the step loop;handle.Save(handle.go:42-72) passes the wholest.Messagestostore.Savewith no delta. The throttle at handle.go:45 is the only guard, andNew's contract (handle.go:30-31) plusNewFactory(handle.go:103-105) confirmthrottle <= 0= save every call with no enforced floor. Config-dependent O(n²) write volume for durable backings is real.Both findings accurately describe the code and stay within the Performance lens. Verdict stands.
VERDICT: Minor issues
The wiring is reasonable and the throttle bounds persistence frequency. Two related performance concerns, both within my lane:
run/executor.go:355-368— the checkpoint accumulator bypasses compaction, so it grows unbounded on exactly the long runs durable recovery targets. The single-loop observer buildsaccby appendings.Response.Message()+llm.ToolResultsMessage(s.Results...)on every step and never drops anything. But the agent itself runs with a compactor (sharedOptsincludesagent.WithCompactor(...)atexecutor.go:328, and single-loopoptsfoldssharedOptsin atexecutor.go:374). When compaction fires, the agent's live working set shrinks;accdoes not. Result: on a long run the checkpoint transcript is strictly larger than the live conversation and grows without the bound compaction is there to provide — both as retained memory (a second, ever-growing copy of every message) and, more importantly, as the payload serialized on each persistedSave. Suggested fix: feed the checkpoint from the agent's current (post-compaction) message history rather than a private append-only accumulator, or cap/compactaccin step with the agent.run/executor.go:367+checkpoint/handle.go:42-72— each persistedSaveserializes the entire growing transcript, and the kernel relies entirely on the host throttle with no floor.ckpt.Save(runCtx, RunCheckpointState{Messages: acc, ...})fires synchronously in the agent step loop on every step. When the throttle window has passed,handle.Savecallsstore.Savewith the fullacc— a blocking store write on the hot path that re-serializes the whole history every time (no delta/incremental write). The throttle (handle.go:45) is the only mitigation, andNew/NewFactoryexplicitly supportthrottle <= 0= "save every call" (handle.go:30-31). With that documented config, a single-loop run does a full-transcript store write on every step → O(n²) cumulative write volume plus blocking I/O between every agent step. This is config-dependent (the in-memory default makes each write cheap), so it's a real edge for the durable (mort durable-job table) backing, not a default-path regression. Suggested fix: enforce a minimum throttle inNewFactory, or document that durable stores must pass a non-zero throttle; ideally combine with the fix above so the serialized payload is bounded.Everything else I checked is fine: the throttled in-window path is cheap (lock +
time.Now+ compare); the phase-boundarySavecopies only the small[]PhaseOutputslice (phases.go:185-189);accappend is amortized O(1) and passing it by reference toSaveis not a per-step copy; the resume seed (executor.go:357,WithHistory) is a one-time copy. No N+1, no quadratic per-step copying.🧯 Error handling & edge cases — Minor issues
Both findings are confirmed against the actual source:
executor.go:404): Verified —finalizeCheckpoint(...)is a plain statement at line 404, executed only after the dispatch returns. The top-levelrecover()defer (lines 119–123) setsres.Errand returns, bypassing line 404 entirely on a panic. The wrapped step observer (lines 359–368)Saves each step via the liveckpt, so a panic after ≥1 Save leaves a persisted record that is neitherCompleted norFailed — whichclassifyCheckpointOutcomereserves forErrShutdown(leave-for-recovery), not a terminal panic. The setup-error early returns afterBegin(lines 209/217/228/241) likewise skip finalize. Confirmed.executor.go:193): Verified —cerris dropped with no log. TheCheckpointerFactorydoc (ports.go:178–183) does place the "log + degrade to (nil,nil)" duty on the factory, whilephaseModel(phases.go:270) emitsslog.Warnon the analogous resolve failure. The inconsistency the finding draws is accurate.Both survive. The verdict is unchanged.
VERDICT: Minor issues
My lens is error handling & edge cases. The happy paths and the documented degraded paths (optional/budget-exhausted phases, nil-safe ports, throttle-skip) are handled cleanly. Two unhappy-path gaps the diff introduces:
run/executor.go:404—finalizeCheckpointis a plain statement, not deferred, so panics and early-error returns bypass it; a panicked run leaves an orphaned recoverable checkpoint. The checkpointer is begun at:193, the wrapped step observerSaves a snapshot each completed step (:367), butfinalizeCheckpoint(...)only runs at:404after the dispatch. The top-levelrecover()defer (:119-123) converts a panic anywhere in the loop intores.Errand returns without finalizing. So if a panic occurs after one or moreSaves, the persisted checkpoint is neitherCompleted norFailed — by omission it gets the "leave for boot recovery" treatment thatclassifyCheckpointOutcomereserves forErrShutdown. A panicking run is terminal and shouldFail(delete) the record; instead it becomes a recovery candidate that will be re-dispatched on the next boot and can re-panic (a recovery loop). This is reachable with the shipped battery (whoseSaveactually persists), so it's store-independent. Suggested fix: make finalize adeferthat capturesrunErr/context.Cause(runCtx)(or have therecoverdeferFailthe checkpointer).Same root cause, lower-impact corollary: the setup-error early returns after
Begin—build toolbox(:209), delegation tools (:217), extra tools (:228), session tools (:241) — also skip finalize. Harmless for the shipped battery (noSavehas happened yet), but a factory whoseBeginmarks the run durable/recoverable (the PR explicitly models this on mort'sagentexec) would leak an un-Failed record on any of these.run/executor.go:193—Beginerror is silently swallowed.if c, cerr := ...Begin(ctx, info); cerr == nil { ckpt = c }drops a non-nilcerrwith no log; the run silently continues non-durable. TheCheckpointerFactorydoc (ports.go:178-183) does place the "log + degrade to (nil,nil)" duty on the factory, so this is defensible — but it diverges from the codebase's own convention for analogous resolution failures (phaseModelatphases.go:270emitsslog.Warnon a model-resolve failure). A misbehaving factory that returns an error here loses durability with zero signal. Suggested fix:slog.Warnthe droppedcerr.Things I checked and cleared (not findings): the shutdown-cause classification reaching
context.Cause(runCtx)is sound —MergeCancellationpropagates the secondary ctx's cause, so a host-setErrShutdownflows through.detach(ctx)correctly bases cleanup writes off the parent (non-cancelled) ctx. The phase-skip uses key-presence not output-emptiness (phases.go:114), correctly handling a legitimately-empty phase output. The liveaccslice handed toSaveat:367is not an aliasing/mutation bug —appendonly writes at/after the handed-off slice'slen, so the store's prefix view is never mutated.claude-code/opus:max (claude-code) — Minor issues
Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — Minor issues
I've verified the finding against the actual source. The factual claims hold:
checkpoint/handle.go:76—Faildoes discard the passed error (_ error) and only deletes, sorunErris never persisted. Confirmed.run/executor.go:359-368— the wrapped step observer accumulatess.Response.Message()andllm.ToolResultsMessage(s.Results...)intoacc, then callsckpt.Saveeach step. Confirmed.Saveand persist the transcript. Confirmed.Save/Checkpointerpath. Confirmed.The finding is accurate and security-relevant; it survives.
VERDICT: Minor issues
This change is internal plumbing — the data crossing the new boundary (transcripts, phase outputs, run meta) originates from the agent's own run, and the only trust boundary is the checkpoint store, which is host-controlled infrastructure. I found no injection, SSRF, deserialization, or authz defect introduced in this diff. One security-relevant property change is worth flagging:
run/executor.go:359-368/checkpoint/handle.go:56— newly-activated unredacted persistence of full transcripts at rest. The checkpointer port was never consumed before this PR (the executor wiring was a pending P2 follow-up), and the new step-observer accumulatess.Response.Message()andllm.ToolResultsMessage(s.Results...)intoacc, then callsckpt.Saveevery step. Tool results routinely contain secrets/credentials/PII the agent fetched (config reads, API responses with tokens, etc.). Previously these lived only in process memory for the run's lifetime; this PR is the first to write the complete[]llm.Messagetranscript to durable storage (mort's job table / any host store) with no redaction hook. Impact: it widens data-at-rest exposure and makes the store's encryption/access-control load-bearing for secret confidentiality. This is largely inherent to durable resume (you can't continue without the transcript), so it's a design tradeoff rather than a code defect — but it deserves an explicit note, and ideally a redaction/opt-out seam, since the in-treeMemory()default is in-process only and hides the at-rest cost from anyone reading the core. Severity: small.Two things I checked and found not to be problems (so I'm not reporting them):
handle.Faildiscards the passederrorand onlyDeletes, sorunErris never persisted.Budget.Check(inv.CallerID)gate and the rebuilt toolbox/permissions still run;CallerID/GuildIDare persisted into meta so recovery can restore attribution. The actual boot re-dispatch + its authz live host-side (mort), explicitly out of this diff, so there's no in-diff authz gap to report.Outside my lens:
classifyCheckpointOutcomekeys shutdown onrun.ErrShutdownwhile mort stamps a distinctrunengine.ErrShutdown(per the PR note, to be aliased later) — until that alias lands, a shutdown-interrupted run would be classifiedcheckpointFailand its recovery record deleted, defeating durable recovery. That's a correctness/availability issue for the other lens, not security.🎯 Correctness — No material issues found
I've verified the change thoroughly against the checked-out code. Build/vet require approval in this sandbox, but I confirmed every cross-file symbol the diff depends on (
MergeCancellationcause-propagation,ErrShutdown/ErrCriticKill,PhaseOutput,phaseProvider,multimodalUserMessage,handle.Newnil-nowdefault).VERDICT: No material issues found
I checked the correctness-sensitive seams and they hold up:
classifyCheckpointOutcomedecision matrix (run/checkpoint.go:67) — verified the cause actually reaches it correctly.finalizeCheckpointreadscontext.Cause(runCtx), andMergeCancellation(run/runengine.go:62) callscancel(context.Cause(secondary)), so a host-stampedErrShutdownon the base ctx propagates torunCtxas its cause →checkpointLeaveRunning. A critic kill stampsErrCriticKillon the innercancelCauselayer, whichcontext.Causesurfaces through the merged child →checkpointFail. TherunErr == nilcheck precedes the cause check, so a run that completes despite a concurrent shutdown stillCompletes (clears). All consistent with the test matrix.run/executor.go:355-368) — the reconstructedaccis always balanced: each step appends the assistant message and (when present) the paired tool-results message beforeSave, so a saved transcript never ends on a dangling tool call. On resume it's correctly re-seeded fromresume.History(themultimodalUserMessageseed is overwritten), andag.Run(runCtx, "", WithHistory(...))continues without re-running completed tool calls — matching the PR claim. System prompt is re-applied viaagent.New, correctly not stored in history. Iterations.Index+1is the right 1-based completed-iteration count.run/phases.go:91-117,177-188) — skip uses key-presence (_, done := outputs[...]), so a legitimately-empty phase output still counts as done; thecompletedslice is copied perSaveand ordering is preserved across resumed + fresh phases. The interrupted phase correctly re-runs from scratch (boundary-granular, as documented), and the all-phases-completed edge returns the last saved output without a model call.checkpoint/handle.go:109-122) — nilnowis defaulted insideNew, so the unsetfactory.nowis harmless;RunCheckpointMetacarries theGuildID/ModelTierfieldsBeginpopulates. No import cycle (battery→core only).One non-blocking observation (not a correctness defect, so excluded from the findings block):
RunCheckpointState.ActivePhaseis plumbed throughhandle.Savebut never written with a non-empty value — the single-loop path leaves it zero and the phase-boundarySaveexplicitly setsActivePhase: ""(run/phases.go:186). Resume doesn't consume it (the active phase re-runs fromCompletedPhases), so behavior is correct today; the field is just latent/dead. Worth a comment so a future host doesn't rely on it to identify the in-flight phase.🧹 Code cleanliness & maintainability — Minor issues
All four findings are confirmed against the actual code:
ActivePhase— grep shows every occurrence is either a field declaration, the literalActivePhase: ""(run/phases.go:186), or the pass-through copy (checkpoint/handle.go:61). No read site exists; the single-loopSave(run/executor.go:367) never sets it, sost.ActivePhaseis always"". The doc comments atrun/checkpoint.go:21,25do claim "seeds the active phase" / "the phase that was in flight." Confirmed.ActivePhase: ""atrun/phases.go:186— confirmed zero-value with no reader.resume != nil && len(resume.History) > 0atrun/executor.go:356and:376. Confirmed.factory.nownever set —NewFactoryreturns&factory{store, throttle}(handle.go:104), leavingnownil; only safe viaNew's nil→time.Nowdefault. No assignment anywhere in the package. Confirmed.All findings survive.
VERDICT: Minor issues
ActivePhaseis dead weight, and its docs overpromise —run/checkpoint.go:25,run/ports.go:210,checkpoint/checkpoint.go:41. The field is plumbed through three structs (ResumeState,RunCheckpointState,RunCheckpoint) but the kernel never reads it and never writes a non-empty value to it. Every write site is either the literalActivePhase: ""(run/phases.go:186) or a pass-through copy (checkpoint/handle.go:61), and there is no read site anywhere — the single-loopSave(run/executor.go:367) doesn't set it, and the phase resume loop (run/phases.go:91-97) only readsCompletedPhases. The design is boundary-granular — the interrupted phase re-runs from its start — soActivePhasecarries no information the resume path uses. Meanwhile the doc comments claim otherwise:run/checkpoint.go:21says multi-phase resume "skips completed phases and seeds the active phase", and the field comment (:25) calls it "the phase that was in flight." No active-phase seeding exists. This will mislead a future maintainer into thinking mid-phase resume is implemented. Fix: either drop theActivePhasefield from all three structs (and the misleading clauses in the comments), or, if it's a deliberate forward hook for the Phases follow-up, mark it clearly as reserved/unused and correct the comments so they don't describe behavior that isn't there.Redundant explicit zero-value with no payoff —
run/phases.go:186. The phase-boundarySavesetsActivePhase: ""explicitly. Since it's always the zero value and nothing reads it, the line is noise that hints at intent the code doesn't deliver. Drop it (folds into the finding above).Duplicated resume guard —
run/executor.go:356and:376.resume != nil && len(resume.History) > 0is computed twice within the single-loop branch (once to seedacc, once to choose theag.Run-with-history vsrunAgentpath). Low-churn fix: hoist to a localresuming := resume != nil && len(resume.History) > 0and reuse it, so the two sites can't drift.(trivial) Unused
nowfield onfactory—checkpoint/handle.go(factory struct +NewFactory).NewFactorynever setsnow, sofactory.nowis always nil; it's only safe becauseNewdefaults nil→time.Now(handle.go:36-38). It's a dead field (no same-package assignment, no test sets it). Either drop it or giveNewFactorya way to inject it, to match the testabilityNewalready offers.No structural problems with the executor refactor itself: moving
WithStepObserverout ofsharedOptsand wiring it per-path (single-loop wraps for checkpointing; phases wire it inrunOnePhase) is a clean, well-commented split, and the updated comments atrun/executor.go:306-309andrun/phases.go:229-232accurately describe the new arrangement.⚡ Performance — No material issues found
I've verified the mechanism against the actual code. My findings:
executor.go:367does callckpt.Save(runCtx, ...)synchronously inside the wrapped step observerobs, wired viaagent.WithStepObserver(obs)— confirmed.handle.Save(handle.go:42–72) is synchronous and throttle-gated; the throttle short-circuits before the store write — confirmed.throttle <= 0"saves every call" is documented (handle.go:30–31) and the factory makes every run durable by default (handle.go:88–91) — confirmed.However, the performance framing does not hold up. The step observer fires once per agent step — i.e., once per model round-trip (see
run/steps.goand theagent.Step-keyed observer). It is not a tight inner loop. A synchronous local checkpoint write at LLM-round-trip cadence (seconds) is negligible against the model latency that dominates each step, even atthrottle=0. The "quadratic-ish" transcript re-serialization is true in big-O but trivial in absolute terms because the number of writes is bounded by the number of model calls. The author themselves rate this severitysmall/ confidencelowand concede it "doesn't bite in the intended (throttled) configuration." As a performance finding there is no material cost to keep.Dropping it leaves nothing.
VERDICT: No material issues found
The checkpoint/resume wiring is performance-conscious: the store write is throttle-gated and short-circuits before serialization, the single-loop
accaccumulation is append-only / amortized O(1) per step, and phase-boundary checkpointing copies a small slice bounded by phase count. The one per-stepckpt.Saveruns synchronously in the step observer, but that observer fires at model-round-trip cadence, so a local durable write there is not a meaningful hot-path cost.🧯 Error handling & edge cases — Minor issues
Both findings verified against the actual code:
Finding 1 (phases.go:114-117) — Confirmed. The skip guard
if _, done := outputs[phase.Name]; doneis newly added in this diff, and the sameoutputsmap is populated at line 177 as each phase completes.Phase.Name(run/agent.go:62) is a free-form string with no uniqueness validation anywhere inrun/(grep confirms onlyuniqueNamefor input files and palette dedup — nothing for phase names). So a fresh pipeline with two same-named phases now skips the second. Real new edge case.Finding 2 (executor.go:193) — Confirmed at the code level:
cerris dropped with no log. The port contract (run/ports.go) does delegate logging to the factory, so this is a minor defensive gap, but the observation is accurate and consistent with the nearbyslog.Warndegrade convention.Both survive.
VERDICT: Minor issues
Reviewed strictly through the error-handling & edge-cases lens. I traced the shutdown-cause propagation (
ctx→WithoutCancel+WithTimeout→WithCancelCause→MergeCancellation) and confirmedcontext.Cause(runCtx)carriesErrShutdownintoclassifyCheckpointOutcome(run/checkpoint.go, run/executor.go:259-264), so the finalize/classify matrix is sound. The accumulator never persists a transcript ending on a dangling tool call (obs appends results beforeSave), and nil/empty inputs are handled (info.Inputs["prompt"]on a nil map,f.nownil →time.Now). Two real but low-severity issues:run/phases.go:114-117— repeated phase name now silently skips the second phase (new edge case, not resume-only). The new skip guard keys onoutputs[phase.Name]presence, but that same map is populated as each phase completes (line 177).Phase.Nameis a free-form string with no uniqueness validation (run/agent.go:62-63; no validator anywhere inrun/). So in a fresh (non-resume) pipeline where two phases share a name, the first runs and setsoutputs["x"]; when the loop reaches the second"x", the guard fires and it is skipped entirely. Before this diff every phase ran unconditionally. The failure is silent: the skipped phase never executes,lastOutputis the earlier phase's text, and the run is then checkpoint-Completed as if finished. Impact: a misconfigured-but-previously-working pipeline (e.g. a duplicated "synthesize" name) returns the wrong output with no error. Trigger likelihood is low (requires duplicate names), hence small severity. Fix: either validate phase-name uniqueness up front, or track skip-set membership in a dedicated set seeded only fromresume.CompletedPhasesrather than reusing the liveoutputsmap.run/executor.go:193—Beginerror silently swallowed, no log.if c, cerr := …Begin(ctx, info); cerr == nil { ckpt = c }drops a non-nilcerrwith noslog. The port contract (run/ports.go) says a storage error "should be logged and degraded to (nil, nil)" — i.e. logging is the factory's job — so degrading to non-durable is defensible, but a contract-violating factory that returns an error becomes invisibly non-durable. This is inconsistent with the codebase's degrade-with-slog.Warnconvention used nearby (phase model resolve, template parse, unknown tool). Fix:slog.Warn("run: checkpointer Begin failed; running non-durable", "run_id", info.RunID, "error", cerr)in the error branch.deepseek-v4-pro:cloud (ollama-cloud) — Minor issues
Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
Verdict: No material issues found
After tracing tainted input through the full checkpoint/resume flow and examining the new code for authn/authz gaps, injection, SSRF, unsafe deserialization, credential leakage, and missing validation, I find no material security issues in this PR.
Here's what I verified:
Context value carriers (
WithResumeState,WithExistingCheckpointer): use unexported key types (resumeStateKey{},existingCheckpointerKey{}), the standard Go pattern preventing cross-package spoofing. Verified atrun/checkpoint.go:28-51.Template injection in phase prompts (
expandPhaseTemplateatrun/phases.go:343): usestext/templatewithmissingkey=zero. The template source is the agent author'sphase.SystemPrompt; user-supplied values (query, prior phase outputs) are substituted as plain strings — Go'stext/templatedoes not recursively re-parse substituted values. A user typing{{.Query}}produces literal text, not expansion.info.Inputs["prompt"].(string)incheckpoint/handle.go:110: safe comma-ok type assertion. A non-string or nilInputsmap yields""— no panic, no injection. The prompt is metadata for re-dispatch, not used for any security decision.detachfor finalization (run/executor.go:526-530): usescontext.WithoutCancel+WithTimeout— the correct pattern for cleanup that must not be aborted by caller cancellation. No context-value leakage beyond what the caller already put there.Checkpoint data at rest:
RunCheckpointcarries[]llm.Message(full transcript) and[]run.PhaseOutput. This is inherent to durable resume — you can't resume without saving state. TheCheckpointStoreinterface leaves encryption/access control to the implementation (mort's durable-job table, SQLite, etc.). The battery itself adds no new exposure.No authz bypass: the
Checkpointerhandle is per-run, minted by the factory and scoped to a singleRunID.Complete/Faildelete by that run ID only. No cross-run access vector exists.No SSRF, path traversal, or unsafe deserialization: the code touches no URLs, file paths, or dynamic
interface{}deserialization. All persisted types are concrete Go structs with string/int/slice fields.Error suppression in finalize (
run/checkpoint.go:88-90):Complete/Failerrors are silently discarded (_ = cp.Complete(...)). While this is a reliability concern (a failedCompleteleaves a stale checkpoint), it is security-positive: store error details are never surfaced to callers.🎯 Correctness — Minor issues
Verdict: Minor issues
run/phases.go:184-187—ActivePhaseis always hardcoded to""in the phase-boundary checkpoint save, and the single-loop path (run/executor.go:367) never sets it either. The field exists inRunCheckpointState,RunCheckpoint, andResumeStatebut the executor never writes a non-empty value. A host loading a checkpoint to build aResumeStatewill always seeActivePhase == "", making the field dead data in the kernel. Either the field should be populated (e.g., set to the next phase name at each boundary, or to""/ the current phase name mid-loop for single-loop runs) or removed from the kernel types to avoid misleading consumers. (Verified by readingrun/phases.go:183-188andrun/executor.go:354-368—ActivePhaseis never assigned a non-zero value in any save path.)run/executor.go:192-195—CheckpointerFactory.Beginerrors are silently swallowed. The interface doc says the factory should log and degrade, but if an implementation returns an error without logging (or the error is unexpected), the executor proceeds with no checkpointing and no diagnostic. Aslog.Warnhere would be defense-in-depth consistent with how the executor logs other non-fatal port failures. (Verified by readingrun/executor.go:192-195— thecerrvariable is checked but never logged.)run/executor.go:367vsrun/phases.go:185— Inconsistent defensive copying: the phase path copiescompletedviaappend([]PhaseOutput(nil), completed...)before passing toSave, but the single-loop path passes the liveaccslice directly (Messages: acc). If anyCheckpointStoreimplementation retains theRunCheckpointstruct (or itsMessagesfield) beyond the synchronousSavereturn — e.g., for async serialization — the single-loop path's checkpoint data would be mutated by subsequent observer calls. The single-loop path should copyaccthe same way the phase path copiescompleted. (Verified by comparingrun/executor.go:367withrun/phases.go:185.)🧹 Code cleanliness & maintainability — Minor issues
All three findings are confirmed against the actual code. Let me verify the exact line numbers for the findings:
run/checkpoint.go:25—ResumeState.ActivePhaseis defined but never read. Grep confirms zero reads ofresume.ActivePhaseanywhere; the only.ActivePhasereads are onRunCheckpointState(incheckpoint/handle.go:61), never onResumeState. The phase runner usesCompletedPhasesbut never consultsActivePhase.run/ports.go:210/run/phases.go:186—RunCheckpointState.ActivePhaseis defined and persisted, but the only write in the entire codebase isActivePhase: ""atrun/phases.go:186. It is never set to a non-empty value.run/executor.go:355-358— On the resume path,accis allocated as[]llm.Message{multimodalUserMessage(input, inv.Images)}at line 355 and then immediately overwritten withresume.Historyat line 357. The initial allocation is dead on the resume path.All three survive verification.
Verdict: Minor issues
run/checkpoint.go:25— Dead fieldResumeState.ActivePhase(verified). The field is defined, documented ("the phase that was in flight"), and carried through the context plumbing, but it is never read anywhere in the codebase. I grepped forresume.ActivePhase,\.ActivePhaseacross the entire repo — the only.ActivePhasereads are onRunCheckpointState(incheckpoint/handle.go:61), never onResumeState. The phase runner (run/phases.go:91-97) usesdeps.resume.CompletedPhasesto skip finished phases but never consultsActivePhase. This is a dead field that will confuse future readers who expect it to drive behavior. Either wire it up (e.g., to validate that the active phase matches expectations on resume) or remove it.run/ports.go:210/run/phases.go:186—RunCheckpointState.ActivePhaseis always""(verified). The field is defined, documented, and persisted to the store, but the only write in the codebase isActivePhase: ""atrun/phases.go:186. It is never set to a non-empty value. A reader seeing this field on the struct would reasonably expect it to carry the name of the in-progress phase during a multi-phase run, but it never does. Either populate it (set it tophase.Namebefore the phase loop body atrun/phases.go:111) or remove it from the struct until it's actually used.run/executor.go:355-358— Wasted allocation inaccinitialization (verified). When resuming,accis first allocated as[]llm.Message{multimodalUserMessage(input, inv.Images)}and then immediately overwritten withresume.Historyon the next line. The initialmultimodalUserMessageallocation is dead on the resume path. A cleaner structure avoids the throwaway allocation:⚡ Performance — No material issues found
After a thorough review of the hot paths, allocation patterns, and synchronization overhead introduced by this PR, I find no material performance regressions. Here's what I verified:
Single-loop checkpoint observer (
run/executor.go:359-368): The closure capturesaccby reference and callsckpt.Saveon every step. However,handle.Save(checkpoint/handle.go:42-72) throttles aggressively — after the first persist, subsequent calls within the throttle window do only a mutex lock/check/unlock (uncontended, single-goroutine). TheRunCheckpointStatestruct is stack-allocated and theMessagesslice header copy is O(1). No deep copy occurs on the hot path.Phase-boundary checkpointing (
run/phases.go:182-188):append([]PhaseOutput(nil), completed...)allocates a copy of the completed-phases slice at each boundary. For N phases this is O(N²) total allocation, but N is small (typically 2–5), making this negligible.Resume path (
run/executor.go:355-357): The initialaccallocation ([]llm.Message{multimodalUserMessage(...)}) is discarded when resume history is present. This is one wasted allocation per resumed run — truly trivial.No unbounded growth, missing pagination, or N+1 patterns were introduced. The transcript grows with the conversation, but that's inherent to durable recovery and bounded by the existing compaction system.
No blocking calls on hot paths beyond the throttled Save. The mutex in
handle.Saveis uncontended (one goroutine per run), andtime.Now()(if used) is vDSO-optimized.Verdict: No material issues found
🧯 Error handling & edge cases — Minor issues
Verdict: Minor issues
run/executor.go:193-195— SwallowedBeginerror with no logging. IfCheckpointerFactory.Beginreturns a non-nil error (e.g., a transient store issue the factory didn't self-log), the executor silently degrades to non-durable with zero indication. The interface doc says the factory should log and return(nil, nil), but the executor cannot enforce that. At minimum, log a warning so operators know durability is broken for this run. (Verified by readingexecutor.go:192-196— thecerris checked only for nil-ness, never logged.)run/executor.go:367—ckpt.Saveerror silently discarded. If the checkpoint store is down mid-run, every step's save fails silently; the run completes,finalizeCheckpointcallsComplete(which may also fail silently), and there is no durable record — but the caller sees a successful run with no indication durability was lost. Same pattern atrun/phases.go:184for phase-boundary saves. (Verified by reading both call sites — both use_ =to discard the error.)run/checkpoint.go:88-89—cp.Complete/cp.Failerrors silently discarded infinalizeCheckpoint. If the store is unavailable at finalize time, a successful run's checkpoint is never cleaned up (orphaned record) or a failed run's checkpoint persists (waste). Neither case is logged. (Verified by readingcheckpoint.go:86-93.)run/executor.go:367—accslice passed toSavewithout copying, inconsistent with the defensive copy inphases.go:185(append([]PhaseOutput(nil), completed...)). Thephases.gocomment explicitly notes the checkpointer "may hold/serialize it asynchronously." In the single-loop path,acccontinues to grow viaappendin subsequent observer calls afterSavereturns. If anyCheckpointerimplementation holds theMessagesslice reference pastSave's return, the transcript it serializes will be corrupted by later appends. In practicehandle.Saveis synchronous, but the inconsistency is a latent race. (Verified by comparingexecutor.go:367withphases.go:184-185.)run/checkpoint.go:73—classifyCheckpointOutcomekeys shutdown detection onrun.ErrShutdown, but mort currently stamps its ownrunengine.ErrShutdown(a distinct sentinel) on the base context. Until the mort wiring PR aliases them,errors.Is(cause, ErrShutdown)will befalsefor every mort shutdown, causingcheckpointFail(deletes the checkpoint) instead ofcheckpointLeaveRunning. The PR description acknowledges this gap; from the error-handling lens, it means the shutdown→recovery path is non-functional with current mort. (Verified by readingrunengine.go:34andcheckpoint.go:73— two distincterrors.Newsentinels.)devstral-2:123b-cloud (ollama-cloud) — Blocking issues found
Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — Minor issues
Verdict
Minor issues
Findings
Potential information leakage in checkpoint metadata (
checkpoint/handle.go:109-121): TheBeginmethod extracts the prompt frominfo.Inputs["prompt"]without validation or sanitization. If the prompt contains sensitive information (e.g., API keys, PII), it will be stored in the checkpoint metadata. This could lead to unintended exposure if the checkpoint store is compromised or accessed by unauthorized parties. Consider redacting sensitive data or providing an allowlist/denylist mechanism for what can be stored in checkpoints.Context value type safety (
run/checkpoint.go:35,49): TheresumeStateFromContextandexistingCheckpointerFromContextfunctions use type assertions without error handling. If the context contains a value with the wrong type for these keys, the functions will silently returnnil, which could lead to unexpected behavior. While this is not a direct security vulnerability, it could mask configuration errors that might have security implications.gadfly-findings
🎯 Correctness — Blocking issues found
Verdict
Blocking issues found
Findings
run/checkpoint.go:73: The logic inclassifyCheckpointOutcomeis incorrect. The function checkserrors.Is(cause, ErrShutdown), butErrShutdownis defined inrunengine.goas a distinct error value (errors.New("mort: shutting down")). However, the PR description explicitly states thatmortcurrently stamps its ownrunengine.ErrShutdown(a distinct value) on the base context, and the mort wiring PR will alias them soerrors.Ismatches. This means the current implementation will not correctly classify shutdowns until the mort wiring PR is merged, which could lead to incorrect checkpoint finalization (e.g., treating shutdowns as failures instead of leaving them for recovery). This is a blocking correctness issue.run/executor.go:356-367: The checkpointing logic for single-loop runs incorrectly initializes the transcript accumulator (acc). It starts withmultimodalUserMessage(input, inv.Images)and then overwrites it withresume.Historyif a resume state exists. This means the initial user message is lost during resume, which violates the semantic correctness of resuming a run. The resumed run should continue from the saved transcript, but the initial user message should still be part of the history. This is a blocking correctness issue.gadfly-findings
🧹 Code cleanliness & maintainability — Minor issues
Based on the verification of the actual code, here is the corrected review:
Verdict
Minor issues
Findings
run/checkpoint.go:54-61: ThecheckpointOutcometype and its constants are not exported. This limits their usability outside the package and could be confusing for external consumers who might need to interact with these outcomes. Consider exporting them if they are part of the public API.run/phases.go:67-74: ThephaseDepsstruct includes acheckpointerfield and aresumefield, but the comments and logic around these fields are not entirely clear. The comments could be improved to better explain the purpose and usage of these fields, especially in the context of durable recovery.⚡ Performance — Minor issues
Corrected Review
Verdict
Minor issues
Findings
Potential Unbounded Growth in
completedSlice inrunPhasesIn
run/phases.go:83, thecompletedslice is appended to in each phase iteration without any bounds checking. This could lead to unbounded growth in memory usage for long-running or frequently resumed multi-phase runs. Consider adding a limit or periodic cleanup mechanism to prevent excessive memory consumption.Unverified: Potential Inefficiency in Slice Copying
In
run/phases.go:185, thecompletedslice is copied usingappend([]PhaseOutput(nil), completed...). While this ensures isolation, it could introduce unnecessary allocations if the slice is large. This is a minor concern but worth noting for optimization in high-throughput scenarios.gadfly-findings
🧯 Error handling & edge cases — Minor issues
All three findings are confirmed in the actual code:
cp.Failis ignored.ckpt.Saveis ignored.deps.checkpointer.Saveis ignored.The draft review is accurate, and all findings are valid. The verdict remains "Minor issues."
glm-5.2:cloud (ollama-cloud) — Minor issues
Verdict: Minor issues — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
I traced every tainted-data path this PR opens, through my assigned security lens:
executor.go:379seedsresume.Historyviaagent.WithHistory. This is the same trust boundary as the existing image-folding path (runAgentatexecutor.go:547already buildsWithHistoryfrominput/images). Resume data originates from the host's own persisted checkpoint, not a new external input.phases.go:92-96pre-populatesoutputsfromresume.CompletedPhases, andexpandPhaseTemplate(phases.go:343) feeds them intotext/template. I verified the template is parsed from the staticphase.SystemPromptand the recovered outputs are substituted as data values, not parsed as template text — so there's no template-injection sink. This is the same flow as in-run phase outputs, which were already attacker-reachable via model output.Runentry point, soBudget.Check(executor.go:139-143), model resolution, and the toolbox build all re-run.ResumeStatecarries only transcript/phase data — no agent/tier/invocation override — so there's no authz bypass.WithResumeState/WithExistingCheckpointer): host-trusted recovery plumbing, not untrusted input surfaces.factory.Beginprompt read (handle.go:110): a safe type-assertion frominfo.Inputs, already part of the trusted run input; not a new injection vector.classifyCheckpointOutcome): keyed onerrors.Is(cause, ErrShutdown)fromcontext.Cause(runCtx)— only the host stamps the cause; an external caller can't forge it to force a leave-for-recovery outcome.Verdict: No material issues found.
🎯 Correctness — Minor issues
I now have enough to verify both findings.
Finding 1 claims that when compaction fires, the majordomo agent's internal message history "is replaced by a summary" while
acckeeps the raw pre-compaction messages, causing resume to replay the full transcript. But thecompact/compactor.godocumentation in this repo explicitly states (lines 36-38):So the agent's internal transcript is not replaced by compaction — the compactor only shapes what's sent to the model per-call, and the loop keeps the raw messages. That means
acc(the accumulated raw transcript) is the actual conversation state, and replaying it on resume is correct. Finding 1's central premise is contradicted by the in-repo compactor docs, so it must be dropped. The reviewer even flagged the agent internals as "unverified" — verification against the available compactor documentation overturns it.Finding 2 —
ResumeState(run/checkpoint.go:22-26) carries onlyHistory,CompletedPhases,ActivePhase; there is no iteration offset. On the single-loop resume path (run/executor.go:357),accis seeded fromresume.History, and each Save recordsIteration: s.Index + 1(run/executor.go:367) wheres.Indexis the new agent loop's step index starting at 0. So a run saved at iteration 5, resumed for one more step, checkpointsIteration: 1. The prior iteration count is genuinely lost in the durable metadata. This is a real but metadata-only issue (it doesn't break resume continuation). Finding 2 is confirmed and kept.VERDICT: Minor issues
run/executor.go:367— On a single-loop resume, each checkpoint Save recordsIteration: s.Index + 1wheres.Indexis the new agent loop's step index (starting at 0), andResumeState(run/checkpoint.go:22-26) carries no iteration offset. A run previously saved at iteration 5 that resumes for one more step therefore checkpointsIteration: 1, losing the prior iteration count in the durable record. This corrupts the durable progress metadata that a host may use for display, throttling, or step-budget decisions on a subsequent recovery. Resume continuation itself is unaffected. Minor, metadata-only. Fix: carry the prior iteration count inResumeStateand adds.Indexto it when saving.🧹 Code cleanliness & maintainability — Minor issues
All four findings verified against the actual code:
ActivePhase —
ResumeState.ActivePhase(run/checkpoint.go:25),RunCheckpointState.ActivePhase(run/ports.go:210),RunCheckpoint.ActivePhase(checkpoint/checkpoint.go:41) are all declared but never read;RunCheckpointState.ActivePhaseis never set to anything but""(run/phases.go:186). Confirmed dead.Checkpoint observer duplicate transcript assembly — confirmed at run/executor.go:359-367, and phases.go:236 wires plain
deps.stepObserverwith no checkpointing, so the "differs" rationale only applies to single-loop. Confirmed.Redundant lastOutput — pre-population loop (run/phases.go:95) sets
lastOutput = pc.Output; skip branch (line 115) reassignslastOutput = outputs[phase.Name](same value). Confirmed no-op.NewFactory never initializes factory.now — confirmed at checkpoint/handle.go:103-104;
f.nowstays nil, passed toNewwhich falls back totime.Now. Confirmed.VERDICT: Minor issues
run/checkpoint.go:25—ResumeState.ActivePhaseis dead. The field is declared and documented ("the phase that was in flight"), but no code reads it: grep for.ActivePhaseonly returns write sites (checkpoint/handle.go:61copiesst.ActivePhaseinto the persisted struct;run/phases.go:186always savesActivePhase: ""), andrunPhasesonly consumesResumeState.CompletedPhases(skipping whole phases, not seeding an in-flight one).RunCheckpointState.ActivePhase/RunCheckpoint.ActivePhaseare likewise never set to a non-empty value, and the persistedRunCheckpointis never loaded back to populateResumeState.ActivePhase(no caller readsRunCheckpoint.ActivePhasefrom aLoad). The comment onRunCheckpointState.Messages("OR the active phase's transcript", run/ports.go:202-203) is therefore also misleading — the multi-phase path never records an active-phase transcript. Either wire active-phase resume or drop the fields/comments to match what's shipped (boundary-granular only).run/executor.go:354-369— checkpoint observer duplicates transcript assembly thatagentalready owns. The wrappedobsmanually reconstructs the conversation (s.Response.Message()+llm.ToolResultsMessage(s.Results...)) intoacc, mirroring exactly what the agent loop builds internally. The comment atrun/executor.go:307-309cites that the observer "differs single-loop vs per-phase" as the reason it moved out ofsharedOpts, but the per-phase path (run/phases.go:236) just wires the plaindeps.stepObserverwith no checkpointing — so the "differs" rationale only applies to single-loop. Atranscript()accessor onagent.Step/agent.Result(or reusingrunRes.Messagesat finalize) would remove this hand-rolled zip, which is exactly the pairwise-call/result shape the existingstepObserveratrun/executor.go:280-303already exercises. Worth a follow-up to share the accumulation rather than duplicate it.run/phases.go:114-117— redundantlastOutputreassignment on the resume skip path.lastOutputwas already set topc.Outputfor every resumed phase during the pre-population loop atrun/phases.go:91-97. Re-assigning it inside the skip branch (lastOutput = outputs[phase.Name]) is a no-op for resumed phases and dead-confusing next to the identical-looking line at178. The skip branch can be a barecontinue.checkpoint/handle.go:103—NewFactorysilently dropsnow.factory.nowis read byBegin→New(passed asf.now), butNewFactorynever initializes it, so it's alwaysnilandNewfalls back totime.Now(handle.go:36-38). That's functionally fine, but it leaves a struct field that looks configurable and isn't — either acceptnowinNewFactory(parity withNew) or dropfactory.nowand passnilexplicitly so the field doesn't read as a forgotten injection point. The omission stands out against the existingNew(store, meta, throttle, now)signature.⚡ Performance — No material issues found
Verdict: No material issues found
Through the performance lens, the durable-checkpoint wiring is reasonably behaved:
Saveis throttled.handle.Save(checkpoint/handle.go:42) takes the throttle window under a brief mutex, releases it before the store I/O, and skips the persist inside the window — so a hot agent loop doesn't block on store writes every step. Per-step work that always runs (theaccappend,llm.ToolResultsMessage, theRunCheckpointStatestruct construction) is O(1) amortized / cheap, so even unthrottled the per-step overhead is minimal.accgrows O(steps) and each persisted save is O(transcript), but frequency is bounded by the throttle, so total cost is O(saves × transcript), not O(steps²). Phase-boundary saves copy the completed-phases list (append([]PhaseOutput(nil), completed...),run/phases.go:185) — O(phases), and phase counts are small. Both match mort's established pattern; not a regression.Complete/Faildelete the record (checkpoint/handle.go:74-76); only shutdown-interrupted runs survive, which is the intended recovery set. The in-processMemorystore'sListInterruptedcopies the whole map (checkpoint/memory.go:47-54), but that store is test-only and explicitly non-durable per its own doc.Minor (trivial) nits, not material:
run/executor.go:355-358: on a resume,multimodalUserMessage(input, inv.Images)is built and immediately discarded (overwritten by the resume-history copy). One wasted message construction per run — negligible.run/executor.go:359-368: the wrapped observer constructsllm.ToolResultsMessage(s.Results...)and aRunCheckpointStateevery step even whenhandle.Savewill be throttled away. Negligible allocation on the hot path; could short-circuit, but not worth it.throttle <= 0toNewFactorymakes every step block synchronously onstore.Save— but this is documented as host-controlled ("the host throttles") and mirrors mort, so it's a configuration footgun, not a defect.🧯 Error handling & edge cases — Minor issues
VERDICT: Minor issues
run/executor.go:207(and the other pre-loop early returns at ~215, ~228) skipfinalizeCheckpoint. Confirmed:finalizeCheckpointhas a single call site atexecutor.go:404, after the agent loop; the checkpointer is acquired earlier atexecutor.go:190-196— before the toolbox/delegation/extra-tool builds. The build-error early returns (return resat lines 207, 215, 228, and the session-tool one at ~242) all run beforefinalizeCheckpoint, leavingckptunfinalized. For a fresh run this is harmless — the handle writes nothing to the store until the firstSave, which only happens inside the loop, soListInterruptedfinds no record. But for a recovered run (WithExistingCheckpointer,executor.go:191), the durable record already exists from the prior interrupted run; a persistent pre-loop error (e.g., a misconfigured tool failingRegistry.Build) takes the earlyreturn reswithoutComplete/Fail, so the record survives → boot recovery re-dispatches the same run → same build error → indefinite retry loop. Suggested fix: route the build-error early returns throughfinalizeCheckpoint(ctx, ckpt, res.Err, context.Cause(runCtx))so a terminal build errorFails an existing record instead of leaving it recoverable (ordeferit guarded by whether the run actually started).run/executor.go:192-196silently swallows theCheckpointerFactory.Beginerror. Confirmed againstrun/ports.go:181-182, whose doc comment onCheckpointerFactoryexplicitly states: "A storage error should be logged and degraded to (nil, nil)". The executor doesif c, cerr := e.cfg.Ports.Checkpointer.Begin(ctx, info); cerr == nil { ckpt = c }and discardscerrwith no log, so a failing/misconfigured checkpoint store makes every run silently non-durable with zero operator signal. Minor, but it contradicts the stated invariant. Suggested fix: emit anslog.Warnon thecerr != nilbranch.The
classifyCheckpointOutcomedecision matrix is sound:ErrShutdown/ErrCriticKillare defined atrun/ports.go-adjacent code, therunErr == nilshort-circuit correctly precedes the shutdown check so a clean success is alwaysCompleteeven if shutdown landed late, and the test matrix inrun/checkpoint_test.gocovers the cases.kimi-k2.6:cloud (ollama-cloud) — Blocking issues found
Verdict: Blocking issues found · ⚠️ 1/5 lens(es) errored — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — No material issues found
No material issues found
I scrutinized the durable-checkpoint wiring for authentication, injection, unsafe deserialization, secret leakage, input validation, and insecure defaults. The changes are clean for their intended threat model:
run/checkpoint.go:WithResumeState/WithExistingCheckpointerare standard context carriers; the unexported key types mean only package-level code can read them.classifyCheckpointOutcomecorrectly useserrors.Is(cause, ErrShutdown)to avoid exact-equality fragility.finalizeCheckpointruns on a detached context so cancellation doesn't prevent terminal-state recording. No injection or deserialization surface here.run/executor.go: The resume path seedsresume.Historyviaagent.WithHistoryand passes empty input. The transcript accumulator starts from the resume history (or the opening user message) and appends step responses + tool results. No command/template/SQL injection vectors introduced — the only template isexpandPhaseTemplate, which uses the agent-definedSystemPrompt(trusted) withmissingkey=zeroagainst user data.checkpoint/handle.go:NewFactory.Beginreadsinfo.Inputs["prompt"]with a safe type assertion to string (defaults to empty). The factory mints a per-runhandlekeyed byRunID, so concurrent runs don't collide. Error fromBeginis silently swallowed on the executor side, but the interface contract explicitly expects degradation to non-durable rather than failing the run.run/phases.go: Phase-boundary checkpointing copies thecompletedslice before saving (defensive against async serialization). Resumed runs skip phases by key presence inoutputs, which is populated from the trustedResumeState.run/ports.go:GuildIDandModelTierare added toRunInfoand checkpoint meta for resume re-dispatch attribution. This is needed metadata, not a leakage bug.No authn/authz gaps, injection points, SSRF, unsafe deserialization, credential leakage, or missing input validation were introduced by this diff.
🎯 Correctness — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 10: context deadline exceeded
🧹 Code cleanliness & maintainability — No material issues found
No material issues found.
Through the cleanliness & maintainability lens, the PR is well-structured:
progressSinkKey).executor.gois localized and clearly commented.agentexec.CheckpointerFactory, keeping cross-repo consistency.phaseProvider) rather than duplicating setup.Begin→ nil checkpointer, nil-safe throughout).No dead code, confusing names, or leaky abstractions are introduced.
⚡ Performance — No material issues found
No material issues found
Through the performance lens, the checkpoint wiring is clean:
ckpt.Saveeach step, but the battery-levelhandle.Savethrottles these calls by time window (skipping eagerly inside the throttle), and the host controls the throttle duration. The work inside the observer (appending toacc, callingSave) is O(1) when throttled and O(n) in total transcript length only when a save actually fires — which is unavoidable since the full transcript is required for mid-loop resume.WithHistoryis a one-time cost at loop start.🧯 Error handling & edge cases — Blocking issues found
Blocking issues found
run/executor.go:188—CheckpointerFactory.Beginerror is silently discarded. If the factory returns a storage error (e.g., DB unavailable), the executor proceeds as if the run is non-durable, with no log or metric. Operators have no way to detect a broken checkpoint store.run/checkpoint.go:88-90—finalizeCheckpointswallows errors fromcp.Completeandcp.Fail. If the store fails to delete a finished run's checkpoint (network blip, timeout), the record survives and the next boot will incorrectly attempt to resume a run that already terminated. This creates false recovery candidates.qwen3.5:397b-cloud (ollama-cloud) — Blocking issues found
Verdict: Blocking issues found — 5 reviewers: security, correctness, maintainability, performance, error-handling
🔒 Security — Minor issues
VERDICT: Minor issues found
Security lens findings:
run/phases.go:177-188+run/phases.go:343-359— Template injection via persisted phase outputs (medium): Phase outputs (derived from LLM responses) are stored in checkpoints at line 182-187 (completed = append(completed, PhaseOutput{Name: phase.Name, Output: output})) and later used as template data inexpandPhaseTemplateon resume. On resume,outputsmap is populated fromdeps.resume.CompletedPhases(lines 91-97), then passed directly toexpandPhaseTemplateat line 123. TheexpandPhaseTemplatefunction usestext/templatewith no sanitization ofpriorOutputs— it directly inserts phase outputs into the template data map (lines 349-352). An attacker controlling LLM output via prompt injection could inject Go template syntax ({{...}}) that gets evaluated when the checkpoint resumes. Verified: ReadexpandPhaseTemplate— usestemplate.New("phase")...Parse(tmpl)andt.Execute(&buf, data)wheredataincludes unsanitizedpriorOutputs. Fix: Sanitize phase outputs before template insertion, or use a non-interpolating format for phase threading.checkpoint/handle.go:110-122— Prompt stored in checkpoint metadata without size validation (small): The user prompt is extracted frominfo.Inputs["prompt"]at line 110 via direct type assertion (prompt, _ := info.Inputs["prompt"].(string)) with no length limit or validation before being stored inRunCheckpointMeta.Promptat line 118. A host using a bounded store (e.g., fixed-size job table) could face DoS via oversized prompts. Verified: ReadBegin()— direct type assertion with no bounds check. Fix: Truncate or reject prompts exceeding a reasonable threshold before storing in meta.run/checkpoint.go:86-93— Checkpoint finalize swallows errors silently (small):finalizeCheckpointdiscards all errors fromComplete()/Fail()with_ =at lines 88 and 90. If the checkpoint store fails during terminal state recording, the run completes but the checkpoint may remain in an inconsistent state (e.g., marked as "interrupted" when it actually finished). Verified: ReadfinalizeCheckpoint— errors explicitly ignored. Fix: At minimum log the error; consider surfacing it to audit/recorder.run/phases.go:114-116— Resume phase-skipping trusts checkpoint integrity (medium): On resume, phases are skipped based solely on key presence inoutputsmap (line 114:if _, done := outputs[phase.Name]; done), populated fromdeps.resume.CompletedPhases(lines 91-97). If an attacker compromises the checkpoint store, they can inject arbitraryPhaseOutputentries to skip security-critical phases. Verified: ReadrunPhasesresume logic — no integrity validation ofdeps.resume.CompletedPhasesbefore using them to skip phases. Fix: Add checkpoint integrity verification (e.g., HMAC) if store is untrusted; document trust boundary.🎯 Correctness — Minor issues
VERDICT: Minor issues found
Findings:
run/phases.go:186—ActivePhasefield unused in resume logic (dead code)ResumeState.ActivePhasefield is populated from checkpoints but never read during resume. The multi-phase resume logic (lines 91-97, 114-117) determines which phase to run next solely by checkingCompletedPhases— it runs the first phase whose name is NOT in the completed set. TheActivePhasefield has no effect on behavior.ActivePhasefromResumeStateandRunCheckpointState, or document it as metadata-only (not used for resume decisions). Verified by readingrun/phases.go:81-117andrun/checkpoint.go:22-26.run/ports.go:202-203— Documentation claims unsupported functionalityMessagesis "the running transcript (single-loop run) OR the active phase's transcript (multi-phase run)". However, the multi-phase checkpoint save atrun/phases.go:184-187never setsMessages— it only savesCompletedPhasesandActivePhase. This means mid-phase transcript recovery for multi-phase runs is NOT supported (only boundary-granular).Messagesis only populated for single-loop runs; multi-phase checkpoints do not preserve the active phase's transcript. Verified by readingrun/phases.go:179-188andrun/ports.go:200-211.run/checkpoint.go:25—ActivePhasefield semantically inconsistentResumeState.ActivePhaseis documented as "the phase that was in flight", but the checkpoint save at phase boundaries (after a phase completes) setsActivePhase: ""(line 186). The field would only be meaningful if saved mid-phase, but no mid-phase save exists for multi-phase runs.run/phases.go:179-188.🧹 Code cleanliness & maintainability — Minor issues
VERDICT: Minor issues found
Findings:
run/checkpoint.go:23— TheResumeState.Historyfield comment claims it holds "single-loop transcript OR active-phase transcript", but the multi-phase path inrun/phases.go:88-97never readsdeps.resume.History— it only usesCompletedPhasesto skip finished phases. The active phase re-runs from scratch with the original query. This is documented as intentional boundary-granular recovery in the code comments, but the struct comment is misleading. Fix: Update the comment to clarifyHistoryis only used for single-loop runs; multi-phase recovery does not resume mid-phase transcripts.run/checkpoint.go:25— TheResumeState.ActivePhasefield is defined but never read by the executor. Grep confirms it is only written (incheckpoint/handle.go:61,run/phases.go:186) but never read. The multi-phase path determines which phase to run by checking key presence inoutputsmap (run/phases.go:114-117), not by consultingActivePhase. This field adds cognitive load without functional purpose. Fix: Either remove the field or add a comment explaining it's metadata only (e.g., for host debugging/logging) and not used by the kernel.checkpoint/handle.go:107-108— Thefactory.Beginmethod extractspromptfrominfo.Inputs["prompt"]with a comment saying this is "so a recovered run can re-dispatch". However, the kernel's multi-phase recovery path (run/phases.go:81-191) does not use any recovered prompt — it skips completed phases and re-runs the active phase with the originalquerypassed torunPhases. TheRunInfo.Promptfield (populated from this extraction) is also never read anywhere in the executor or phases logic. The prompt extraction appears to be dead code for the multi-phase case. Fix: Clarify the comment to state this is only relevant for single-loop resume, or remove if unused.⚡ Performance — No material issues found
No material issues found
After reviewing the checkpoint/resume wiring through the performance lens, the implementation handles the key efficiency concerns appropriately:
Verified findings:
Throttling is correctly implemented (
checkpoint/handle.go:45-48): TheSavemethod checks the throttle window before persisting, preventing excessive I/O on every step. Confirmed by readinghandle.go.No unnecessary slice copies in hot path (
run/executor.go:359-367): The single-loop observer accumulatesaccin-place and passes the slice header toSave. The slice grows naturally viaappend, which is amortized O(1). Confirmed by readingexecutor.go.Multi-phase checkpoint copy is intentional (
run/phases.go:185):append([]PhaseOutput(nil), completed...)creates a snapshot copy at phase boundaries to prevent aliasing—this is correct since the checkpointer may hold the slice asynchronously. Phase boundaries are infrequent (not per-step), so this copy is negligible. Confirmed by readingphases.go.Resume path does one-time O(n) copy (
run/executor.go:356-357):append([]llm.Message(nil), resume.History...)copies the saved transcript once at startup to prevent aliasing between the resumed state and new messages. This is a necessary correctness cost, not a hot-loop regression. Confirmed by readingexecutor.go.No N+1 or unbounded growth issues: Checkpoints are per-run (factory pattern), the
Memorystore holds one checkpoint per run ID (overwrites on save), and completed runs delete their checkpoints immediately viaComplete. Confirmed by readingmemory.go:23-28andhandle.go:74-76.The throttle prevents hot-path I/O, slice handling avoids unnecessary copies while preserving correctness, and there are no quadratic behaviors or blocking calls in performance-critical sections.
🧯 Error handling & edge cases — Blocking issues found
Blocking issues found
run/phases.go:184— Phase-boundary checkpointSaveerror silently swallowed: The phase-boundary save discards the error with_ =. If the checkpoint store fails mid-run (disk full, DB locked, network blip), the run continues but checkpoints aren't persisted. On shutdown/recovery, the host will resume from a stale checkpoint (or none), potentially re-executing work already done or losing progress. Verified by readingrun/phases.go:183-188. Suggested fix: Log the error (at minimum) and consider surfacing it via theRunRecorderaudit trail so the host knows durability failed.run/executor.go:367— Per-step checkpointSaveerror silently swallowed: The per-step save in the single-loop path discards the error with_ =. Same impact as above — checkpoint persistence failures go unnoticed, risking stale recovery state. Verified by readingrun/executor.go:359-368. Suggested fix: Log the error; at minimum ensure the failure is recorded in the audit trail.run/checkpoint.go:90—Failerror silently swallowed infinalizeCheckpoint: When a run terminates with a non-shutdown error,cp.Fail(detach(ctx), runErr)is called but its error is discarded. IfFailitself fails (e.g., store unreachable during cleanup), the checkpoint record may remain in an ambiguous "running" state instead of being marked failed. On boot recovery, the host might incorrectly attempt to resume a run that should be terminal. Verified by readingrun/checkpoint.go:80-94. Suggested fix: Log the error; at minimum ensure the failure is recorded in the audit trail.Automated adversarial review by Gadfly — consensus across the model swarm. Advisory only — does not block merge.