executus/checkpoint/checkpoint.go
steve 899059a791
feat(run): durable checkpoint + resume (wire Ports.Checkpointer)
The kernel defined run.Ports.Checkpointer + the checkpoint battery but never
drove them (the documented "P2 follow-up"). This wires durable recovery into
the run loop so a run interrupted by shutdown can resume on the next boot
instead of being lost — the executus-side half of mort's durable-agent-recovery
parity (mort #1355).

Kernel (run/):
- Ports.Checkpointer is now a CheckpointerFactory (Begin per run → a per-run
  Checkpointer, or nil for a non-durable run). The single per-instance
  Checkpointer couldn't distinguish runs; a factory mints one per run, matching
  mort's agentexec.CheckpointerFactory.
- RunInfo gains GuildID + ModelTier (so the factory can build resume meta);
  RunCheckpointState gains CompletedPhases + ActivePhase (+ PhaseOutput).
- run/checkpoint.go: ResumeState + WithResumeState / WithExistingCheckpointer
  context carriers, classifyCheckpointOutcome (success→Complete, shutdown→leave
  for boot recovery, else→Fail; shutdown is detected via run.ErrShutdown), and
  finalizeCheckpoint.
- run/executor.go: resolve the per-run checkpointer (existing-from-ctx on a
  recovery re-run, else factory.Begin); single-loop wraps the step observer to
  accumulate the transcript + Save each step (host throttles), and a recovered
  run seeds the saved transcript via WithHistory and continues with no new
  input; finalize on exit.
- run/phases.go: phase-boundary checkpointing — record completed phases after
  each phase; a resumed run skips already-completed phases (the interrupted
  phase re-runs from its start — boundary-granular, documented; only the
  single-loop path resumes mid-loop).

Battery (checkpoint/): NewFactory wires the battery into the factory port
(per-run handle, meta derived from RunInfo); RunCheckpoint + handle.Save carry
the phase fields.

Tests (run/checkpoint_test.go): the finalize decision matrix; single-loop
Save+Complete; terminal-error Fail; resume seeds history; phase-boundary Saves
completed phases; resume skips completed phases. Full ./... green.
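
The phase-boundary behaviour these tests cover (save after each phase, skip completed phases on resume) can be sketched as follows. The names here are hypothetical and simplified: phaseOutput stands in for run.PhaseOutput, and runPhases, exec, and save are invented for this sketch and are not the functions in run/phases.go.

```go
package main

import "fmt"

// phaseOutput is a hypothetical, simplified stand-in for run.PhaseOutput.
type phaseOutput struct {
	Name   string
	Output string
}

// runPhases runs phases in order, skipping any phase already present in
// completed. After each newly finished phase it calls save with the full
// list of completed phases so far.
func runPhases(
	phases []string,
	completed []phaseOutput,
	exec func(name string) string,
	save func(done []phaseOutput),
) []phaseOutput {
	finished := make(map[string]bool, len(completed))
	for _, p := range completed {
		finished[p.Name] = true
	}
	// Copy so the caller's slice is not modified.
	out := append([]phaseOutput(nil), completed...)
	for _, name := range phases {
		if finished[name] {
			continue // restored from the checkpoint, do not re-run
		}
		out = append(out, phaseOutput{Name: name, Output: exec(name)})
		save(out)
	}
	return out
}

func main() {
	restored := []phaseOutput{{Name: "plan", Output: "saved plan"}}
	out := runPhases(
		[]string{"plan", "research", "write"},
		restored,
		func(name string) string { return "output of " + name },
		func(done []phaseOutput) { fmt.Println("checkpointed phases:", len(done)) },
	)
	fmt.Println("total phases:", len(out))
}
```

In this sketch a phase that was interrupted midway never appears in completed, so it re-runs from its start, which matches the boundary-granular resume described in the commit message.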

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 16:04:06 -04:00

54 lines
2.0 KiB
Go

// Package checkpoint is the durable-resume battery: it persists a run's
// resumable progress so a run interrupted by a shutdown can be recovered and
// continued on the next boot, rather than silently lost. It plugs into
// run.Ports.Checkpointer.
//
// Mort backs CheckpointStore with its durable-job table; Memory() is the
// zero-dependency default; contrib/store can add a SQLite one. The executor calls
// run.Ports.Checkpointer (a CheckpointerFactory) during the run loop; NewFactory
// wires this battery into that seam.
package checkpoint

import (
	"context"
	"time"

	"gitea.stevedudenhoeffer.com/steve/majordomo/llm"

	"gitea.stevedudenhoeffer.com/steve/executus/run"
)

// RunCheckpointMeta is the run attribution needed to resume a run from scratch
// (mirrors mort's agentexec.RunCheckpointMeta).
type RunCheckpointMeta struct {
	RunID       string
	AgentID     string
	AgentName   string
	CallerID    string
	ChannelID   string
	GuildID     string
	Prompt      string
	ModelTier   string
	ParentRunID string
}

// RunCheckpoint is one persisted snapshot of a run's resumable progress.
type RunCheckpoint struct {
	Meta            RunCheckpointMeta
	Messages        []llm.Message     // conversation so far (single-loop or active phase)
	Iteration       int               // completed agent-loop iterations
	CompletedPhases []run.PhaseOutput // finished phases, in order (multi-phase agents)
	ActivePhase     string            // current phase name (multi-phase agents); "" otherwise
	UpdatedAt       time.Time
}

// CheckpointStore persists run checkpoints keyed by run id. A live checkpoint
// means "this run was in flight and not cleanly finished"; Complete/Fail delete
// it. ListInterrupted returns every surviving checkpoint at boot for recovery.
type CheckpointStore interface {
	Save(ctx context.Context, cp RunCheckpoint) error
	Load(ctx context.Context, runID string) (*RunCheckpoint, error)
	Delete(ctx context.Context, runID string) error
	ListInterrupted(ctx context.Context) ([]RunCheckpoint, error)
}