fix(run): address gadfly review of the checkpoint PR
executus CI / test (pull_request) Successful in 45s

Real findings from the consensus review (44 raw; heavy devstral noise):

- finalizeCheckpoint is now fired from the top-of-Run defer, so it runs on
  EVERY exit: a panic, an early build-error return (before the run loop), AND
  normal completion. Previously an early return on a recovered run left its
  durable record unfinalized → boot recovery would retry it forever on a
  persistent build error. (opus + glm)
- Removed the dead ActivePhase field from run.RunCheckpointState +
  run.ResumeState (and the battery RunCheckpoint) — phase recovery is
  boundary-granular (skip completed phases; the interrupted phase re-runs from
  its start), so ActivePhase was never written nor read. Docs across
  ports/checkpoint/phases now state this plainly (5-model consensus that the
  field + docs over-promised mid-phase resume).
- CheckpointerFactory.Begin error is now logged (WARN) before degrading to
  non-durable, per the documented contract (was silently swallowed). (4 models)
- finalizeCheckpoint logs Complete/Fail errors (was silent).
- Resume phase-skip now keys off a SEPARATE resumeSkip set, not the live
  outputs map — a fresh run with two same-named phases no longer skips the
  second (the outputs map fills as phases run). (opus:max) + regression test.
- Removed the dead checkpoint.factory.now field (never set). (opus + glm)
- Fixed the stale phaseDeps doc (the step observer moved out of sharedOpts to
  per-path). Hoisted the resume guard to a local; dropped the wasted acc
  allocation on the resume path; documented that Save throttling is the
  Checkpointer's responsibility and the accumulated transcript is pre-compaction
  (host size-caps it).

Note (carried from the PR): classifyCheckpointOutcome keys shutdown on
run.ErrShutdown; mort stamps its own runengine.ErrShutdown — the mort wiring PR
aliases them so errors.Is matches.

New test: duplicate phase names both run on a fresh run. Full ./... green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-29 16:34:42 -04:00
parent 899059a791
commit 38d656ec71
7 changed files with 98 additions and 39 deletions
+1 -3
View File
@@ -58,7 +58,6 @@ func (h *handle) Save(ctx context.Context, st run.RunCheckpointState) error {
Messages: st.Messages,
Iteration: st.Iteration,
CompletedPhases: st.CompletedPhases,
ActivePhase: st.ActivePhase,
UpdatedAt: now,
}); err != nil {
return err
@@ -92,7 +91,6 @@ func (noop) Fail(context.Context, error) error { return nil }
type factory struct {
store CheckpointStore
throttle time.Duration
now func() time.Time
}
var _ run.CheckpointerFactory = (*factory)(nil)
@@ -119,5 +117,5 @@ func (f *factory) Begin(_ context.Context, info run.RunInfo) (run.Checkpointer,
ModelTier: info.ModelTier,
ParentRunID: info.ParentRunID,
}
return New(f.store, meta, f.throttle, f.now), nil
return New(f.store, meta, f.throttle, nil /* now defaults to time.Now */), nil
}