fix(run): address gadfly review of the checkpoint PR

Real findings from the consensus review (44 raw; heavy devstral noise):

- finalizeCheckpoint is now fired from the top-of-Run defer, so it runs on
  EVERY exit: a panic, an early build-error return (before the run loop), AND
  normal completion. Previously an early return on a recovered run left its
  durable record unfinalized → boot recovery would retry it forever on a
  persistent build error. (opus + glm)
- Removed the dead ActivePhase field from run.RunCheckpointState +
  run.ResumeState (and the battery RunCheckpoint) — phase recovery is
  boundary-granular (skip completed phases; the interrupted phase re-runs from
  its start), so ActivePhase was neither written nor read. Docs across
  ports/checkpoint/phases now state this plainly (5-model consensus that the
  field + docs over-promised mid-phase resume).
- CheckpointerFactory.Begin error is now logged (WARN) before degrading to
  non-durable, per the documented contract (was silently swallowed). (4 models)
- finalizeCheckpoint logs Complete/Fail errors (was silent).
- Resume phase-skip now keys off a SEPARATE resumeSkip set, not the live
  outputs map — a fresh run with two same-named phases no longer skips the
  second (the outputs map fills as phases run). (opus:max) + regression test.
- Removed the dead checkpoint.factory.now field (never set). (opus + glm)
- Fixed the stale phaseDeps doc (the step observer moved out of sharedOpts to
  per-path). Hoisted the resume guard to a local; dropped the wasted acc
  allocation on the resume path; documented that Save throttling is the
  Checkpointer's responsibility and the accumulated transcript is pre-compaction
  (host size-caps it).
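
The first bullet's "finalize on every exit" behaviour can be sketched as below. This is a minimal illustration, not the actual Run implementation: the `finalizer` type, the "complete"/"fail" outcome strings, the `errBuild` sentinel, and the choice to recover the panic inside the defer are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// finalizer is a hypothetical stand-in for the durable checkpoint record.
type finalizer struct {
	outcomes []string
}

func (f *finalizer) finalize(outcome string) {
	f.outcomes = append(f.outcomes, outcome)
}

// errBuild is a hypothetical early build error.
var errBuild = errors.New("build failed")

// run registers finalization in a defer at the very top, so it fires on an
// early return, on a panic, and on normal completion alike.
func run(f *finalizer, failBuild, panicMidRun bool) (err error) {
	defer func() {
		// Assumption for this sketch: the panic is recovered and reported as an error.
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
		if err != nil {
			f.finalize("fail")
			return
		}
		f.finalize("complete")
	}()

	if failBuild {
		return errBuild // early return, before any run loop would start
	}
	if panicMidRun {
		panic("boom")
	}
	return nil
}

func main() {
	f := &finalizer{}
	_ = run(f, true, false)  // early build error
	_ = run(f, false, true)  // panic
	_ = run(f, false, false) // normal completion
	fmt.Println(f.outcomes)  // prints [fail fail complete]
}
```

Had the finalize call sat after the build step rather than in the top-level defer, the early-return case would leave the record unfinalized, which is the retry-forever situation described above.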

Note (carried from the PR): classifyCheckpointOutcome keys shutdown on
run.ErrShutdown; mort stamps its own runengine.ErrShutdown — the mort wiring PR
aliases them so errors.Is matches.

New test: duplicate phase names both run on a fresh run. Full ./... green.
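
The duplicate-name regression can be reproduced in isolation. The sketch below is a simplified model of the two skip guards, not the real runPhases: phases are plain strings, outputs are synthetic, and the helper names are hypothetical.

```go
package main

import "fmt"

// runWithOutputsGuard models the old guard: a phase is skipped whenever its
// name is already a key in the live outputs map.
func runWithOutputsGuard(phases []string) []int {
	outputs := map[string]string{}
	var ran []int
	for i, name := range phases {
		if _, done := outputs[name]; done {
			continue
		}
		ran = append(ran, i)
		outputs[name] = fmt.Sprintf("out-%d", i)
	}
	return ran
}

// runWithResumeSkip models the new guard: only names restored from a
// checkpoint are skipped; the live outputs map no longer drives the decision.
func runWithResumeSkip(phases []string, resumed map[string]string) []int {
	outputs := map[string]string{}
	resumeSkip := map[string]bool{}
	for name, out := range resumed {
		outputs[name] = out
		resumeSkip[name] = true
	}
	var ran []int
	for i, name := range phases {
		if resumeSkip[name] {
			continue
		}
		ran = append(ran, i)
		outputs[name] = fmt.Sprintf("out-%d", i)
	}
	return ran
}

func main() {
	phases := []string{"plan", "draft", "draft"}
	fmt.Println(runWithOutputsGuard(phases))                                    // prints [0 1]
	fmt.Println(runWithResumeSkip(phases, nil))                                 // prints [0 1 2]
	fmt.Println(runWithResumeSkip(phases, map[string]string{"plan": "saved"})) // prints [1 2]
}
```

The returned slices are the indices of the phases that actually ran: the old guard drops the second "draft" on a fresh run, while the separate set runs both and still skips the restored "plan" on a resumed run.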

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 16:34:42 -04:00
parent 899059a791
commit 38d656ec71
7 changed files with 98 additions and 39 deletions
+15 -11
@@ -53,9 +53,10 @@ import (
 // phaseDeps carries the per-run state the phase runner shares with Run: the base
 // model, the full decorated toolbox (filtered per phase), the base step cap, the
-// shared agent options (tool-error limits + step observer + compactor), the
-// shared step observer (also fed by IsRunFunc bare calls), the critic/session
-// steer, and the audit recorder (phase events).
+// shared agent options (tool-error limits + compactor — the step observer is
+// added per phase, NOT in sharedOpts, so checkpointing can vary per path), the
+// shared step observer (wired into each phase's loop AND invoked for IsRunFunc
+// bare calls), the critic/session steer, and the audit recorder (phase events).
 type phaseDeps struct {
 	baseModel llm.Model
 	baseToolbox *llm.Toolbox
@@ -85,12 +86,18 @@ func (e *Executor) runPhases(runCtx context.Context, ra RunnableAgent, deps phas
 	var lastOutput string
 	var totalUsage llm.Usage
-	// Resume: pre-populate from the saved checkpoint so already-finished phases are
-	// skipped. The interrupted (active) phase is NOT pre-populated, so it re-runs
-	// from its start (boundary-granular recovery).
+	// resumeSkip is the set of phases already finished on a RECOVERED run — kept
+	// SEPARATE from the live `outputs` map (which fills as phases run this time) so
+	// the skip guard only skips RESUME-completed phases, never a fresh run's own
+	// phases. (Reusing `outputs` would make a second phase with a duplicate name
+	// skip itself.) Pre-populate outputs + completed so a resumed run threads the
+	// saved outputs into later phases. The interrupted (active) phase is NOT
+	// pre-populated, so it re-runs from its start (boundary-granular recovery).
+	resumeSkip := map[string]bool{}
 	if deps.resume != nil {
 		for _, pc := range deps.resume.CompletedPhases {
 			outputs[pc.Name] = pc.Output
+			resumeSkip[pc.Name] = true
 			completed = append(completed, pc)
 			lastOutput = pc.Output
 		}
@@ -109,10 +116,8 @@ func (e *Executor) runPhases(runCtx context.Context, ra RunnableAgent, deps phas
 	}
 	for i, phase := range ra.Phases {
-		// Skip phases already completed on a resumed run (key presence, not output
-		// emptiness — a legitimately-empty phase output still counts as done).
-		if _, done := outputs[phase.Name]; done {
-			lastOutput = outputs[phase.Name]
+		// Skip phases already completed on a resumed run.
+		if resumeSkip[phase.Name] {
 			continue
 		}
 		// A killed/timed-out/cancelled run must not start its next phase.
@@ -183,7 +188,6 @@ func (e *Executor) runPhases(runCtx context.Context, ra RunnableAgent, deps phas
 		if deps.checkpointer != nil {
 			_ = deps.checkpointer.Save(runCtx, RunCheckpointState{
 				CompletedPhases: append([]PhaseOutput(nil), completed...),
-				ActivePhase: "",
 			})
 		}
 	}