Real findings from the consensus review (44 raw; heavy devstral noise):
- finalizeCheckpoint is now fired from the top-of-Run defer, so it runs on
EVERY exit: a panic, an early build-error return (before the run loop), AND
normal completion. Previously an early return on a recovered run left its
durable record unfinalized → boot recovery would retry it forever on a
persistent build error. (opus + glm)
- Removed the dead ActivePhase field from run.RunCheckpointState +
run.ResumeState (and the battery RunCheckpoint) — phase recovery is
boundary-granular (skip completed phases; the interrupted phase re-runs from
its start), so ActivePhase was neither written nor read. Docs across
ports/checkpoint/phases now state this plainly (5-model consensus that the
field + docs over-promised mid-phase resume).
- CheckpointerFactory.Begin error is now logged (WARN) before degrading to
non-durable, per the documented contract (was silently swallowed). (4 models)
- finalizeCheckpoint logs Complete/Fail errors (was silent).
- Resume phase-skip now keys off a SEPARATE resumeSkip set, not the live
outputs map — a fresh run with two same-named phases no longer skips the
second (the outputs map fills as phases run). (opus:max) + regression test.
- Removed the dead checkpoint.factory.now field (never set). (opus + glm)
- Fixed the stale phaseDeps doc (the step observer moved out of sharedOpts to
per-path). Hoisted the resume guard to a local; dropped the wasted acc
allocation on the resume path; documented that Save throttling is the
Checkpointer's responsibility and the accumulated transcript is pre-compaction
(host size-caps it).
Note (carried from the PR): classifyCheckpointOutcome keys shutdown on
run.ErrShutdown; mort stamps its own runengine.ErrShutdown — the mort wiring PR
aliases them so errors.Is matches.
New test: duplicate phase names both run on a fresh run. Full ./... green.
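The behaviour that test covers can be sketched as follows, assuming a
simplified phase loop. runPhases and its parameters are hypothetical; only the
idea is taken from the fix above: the skip decision reads a set fixed before
the loop, not the outputs map that fills as phases run.

```go
package main

import "fmt"

// runPhases runs phases in order and returns the names that actually ran.
// completed holds the phase names restored from a checkpoint (empty on a
// fresh run).
func runPhases(phases []string, completed []string) []string {
	// resumeSkip is built once from the restored state and never mutated.
	resumeSkip := make(map[string]bool, len(completed))
	for _, name := range completed {
		resumeSkip[name] = true
	}

	outputs := make(map[string]string) // live map, fills as phases run
	var ran []string
	for _, name := range phases {
		if resumeSkip[name] {
			continue // finished before the interruption
		}
		outputs[name] = "output of " + name
		ran = append(ran, name)
	}
	return ran
}

func main() {
	fmt.Println(runPhases([]string{"plan", "plan"}, nil))               // [plan plan]
	fmt.Println(runPhases([]string{"plan", "build"}, []string{"plan"})) // [build]
}
```

Had the skip check read outputs instead, the second "plan" of a fresh run
would have been skipped as soon as the first one stored its output.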
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The kernel defined run.Ports.Checkpointer + the checkpoint battery but never
drove them (the documented "P2 follow-up"). This wires durable recovery into
the run loop so a run interrupted by shutdown can resume on the next boot
instead of being lost — the executus-side half of mort's durable-agent-recovery
parity (mort #1355).
Kernel (run/):
- Ports.Checkpointer is now a CheckpointerFactory (Begin per run → a per-run
Checkpointer, or nil for a non-durable run). The single per-instance
Checkpointer couldn't distinguish runs; a factory mints one per run, matching
mort's agentexec.CheckpointerFactory.
- RunInfo gains GuildID + ModelTier (so the factory can build resume meta);
RunCheckpointState gains CompletedPhases + ActivePhase (+ PhaseOutput).
- run/checkpoint.go: ResumeState + WithResumeState / WithExistingCheckpointer
context carriers, classifyCheckpointOutcome (success→Complete, shutdown→leave
for boot recovery, else→Fail; shutdown is detected via run.ErrShutdown), and
finalizeCheckpoint.
- run/executor.go: resolve the per-run checkpointer (existing-from-ctx on a
recovery re-run, else factory.Begin); the single-loop path wraps the step observer to
accumulate the transcript + Save each step (host throttles), and a recovered
run seeds the saved transcript via WithHistory and continues with no new
input; finalize on exit.
- run/phases.go: phase-boundary checkpointing — record completed phases after
each phase; a resumed run skips already-completed phases (the interrupted
phase re-runs from its start — boundary-granular, documented; only the
single-loop path resumes mid-loop).
Battery (checkpoint/): NewFactory wires the battery into the factory port
(per-run handle, meta derived from RunInfo); RunCheckpoint + handle.Save carry
the phase fields.
Tests (run/checkpoint_test.go): the finalize decision matrix; single-loop
Save+Complete; terminal-error Fail; resume seeds history; phase-boundary Saves
completed phases; resume skips completed phases. Full ./... green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>