3 Commits

Author SHA1 Message Date
steve 38d656ec71 fix(run): address gadfly review of the checkpoint PR
executus CI / test (pull_request) Successful in 45s
Real findings from the consensus review (44 raw; heavy devstral noise):

- finalizeCheckpoint is now fired from the top-of-Run defer, so it runs on
  EVERY exit: a panic, an early build-error return (before the run loop), AND
  normal completion. Previously an early return on a recovered run left its
  durable record unfinalized → boot recovery would retry it forever on a
  persistent build error. (opus + glm)
- Removed the dead ActivePhase field from run.RunCheckpointState +
  run.ResumeState (and the battery RunCheckpoint) — phase recovery is
  boundary-granular (skip completed phases; the interrupted phase re-runs from
  its start), so ActivePhase was never written nor read. Docs across
  ports/checkpoint/phases now state this plainly (5-model consensus that the
  field + docs over-promised mid-phase resume).
- CheckpointerFactory.Begin error is now logged (WARN) before degrading to
  non-durable, per the documented contract (was silently swallowed). (4 models)
- finalizeCheckpoint logs Complete/Fail errors (was silent).
- Resume phase-skip now keys off a SEPARATE resumeSkip set, not the live
  outputs map — a fresh run with two same-named phases no longer skips the
  second (the outputs map fills as phases run). (opus:max) + regression test.
- Removed the dead checkpoint.factory.now field (never set). (opus + glm)
- Fixed the stale phaseDeps doc (the step observer moved out of sharedOpts to
  per-path). Hoisted the resume guard to a local; dropped the wasted acc
  allocation on the resume path; documented that Save throttling is the
  Checkpointer's responsibility and the accumulated transcript is pre-compaction
  (host size-caps it).

Note (carried from the PR): classifyCheckpointOutcome keys shutdown on
run.ErrShutdown; mort stamps its own runengine.ErrShutdown — the mort wiring PR
aliases them so errors.Is matches.

New test: duplicate phase names both run on a fresh run. Full ./... green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 16:34:42 -04:00
steve 899059a791 feat(run): durable checkpoint + resume (wire Ports.Checkpointer)
executus CI / test (pull_request) Successful in 46s
Adversarial Review (Gadfly) / review (pull_request) Successful in 17m25s
The kernel defined run.Ports.Checkpointer + the checkpoint battery but never
drove them (the documented "P2 follow-up"). This wires durable recovery into
the run loop so a run interrupted by shutdown can resume on the next boot
instead of being lost — the executus-side half of mort's durable-agent-recovery
parity (mort #1355).

Kernel (run/):
- Ports.Checkpointer is now a CheckpointerFactory (Begin per run → a per-run
  Checkpointer, or nil for a non-durable run). The single per-instance
  Checkpointer couldn't distinguish runs; a factory mints one per run, matching
  mort's agentexec.CheckpointerFactory.
- RunInfo gains GuildID + ModelTier (so the factory can build resume meta);
  RunCheckpointState gains CompletedPhases + ActivePhase (+ PhaseOutput).
- run/checkpoint.go: ResumeState + WithResumeState / WithExistingCheckpointer
  context carriers, classifyCheckpointOutcome (success→Complete, shutdown→leave
  for boot recovery, else→Fail using run.ErrShutdown), and finalizeCheckpoint.
- run/executor.go: resolve the per-run checkpointer (existing-from-ctx on a
  recovery re-run, else factory.Begin); single-loop wraps the step observer to
  accumulate the transcript + Save each step (host throttles), and a recovered
  run seeds the saved transcript via WithHistory and continues with no new
  input; finalize on exit.
- run/phases.go: phase-boundary checkpointing — record completed phases after
  each phase; a resumed run skips already-completed phases (the interrupted
  phase re-runs from its start — boundary-granular, documented; only the
  single-loop path resumes mid-loop).

Battery (checkpoint/): NewFactory wires the battery into the factory port
(per-run handle, meta derived from RunInfo); RunCheckpoint + handle.Save carry
the phase fields.

Tests (run/checkpoint_test.go): the finalize decision matrix; single-loop
Save+Complete; terminal-error Fail; resume seeds history; phase-boundary Saves
completed phases; resume skips completed phases. Full ./... green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 16:04:06 -04:00
steve dc2d4ec425 P4c: remaining batteries — checkpoint + schedule + critic
executus CI / test (push) Failing after 1m6s
Completes the P4 battery set (squashed onto main from phase-4c-batteries).
- checkpoint/: run.Checkpointer durable-resume (CheckpointStore + throttled
  handle + Memory).
- schedule/: generic cron Runner (Tick/Loop; no cron grammar of its own).
- critic/: two-tier timeout watchdog (run.Critic) + Escalator policy seam +
  ExtendOnce default.
Includes the verified gadfly #6 fixes (ExtendOnce per-run, Kill-sticky, watch
panic-recovery; checkpoint throttle-after-success; schedule Next-before-Run +
nil-guard + Loop recovery).

P4 battery set complete: audit, budget, persona, skill, checkpoint, schedule,
critic — each nil-safe, each with a default, each core-import-clean. Executor
wiring for Critic/Checkpointer remains a P2 follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 00:15:32 -04:00