feat(run): durable checkpoint + resume (wire Ports.Checkpointer)
The kernel defined run.Ports.Checkpointer and the checkpoint battery but never drove them (the documented "P2 follow-up"). This commit wires durable recovery into the run loop so that a run interrupted by shutdown can resume on the next boot instead of being lost. It is the executus-side half of mort's durable-agent-recovery parity (mort #1355).

Kernel (run/):
- Ports.Checkpointer is now a CheckpointerFactory (Begin per run → a per-run Checkpointer, or nil for a non-durable run). The single per-instance Checkpointer couldn't distinguish runs; a factory mints one per run, matching mort's agentexec.CheckpointerFactory.
- RunInfo gains GuildID + ModelTier (so the factory can build resume meta); RunCheckpointState gains CompletedPhases + ActivePhase (+ PhaseOutput).
- run/checkpoint.go: ResumeState + WithResumeState / WithExistingCheckpointer context carriers, classifyCheckpointOutcome (success→Complete, shutdown→leave for boot recovery, else→Fail using run.ErrShutdown), and finalizeCheckpoint.
- run/executor.go: resolve the per-run checkpointer (existing-from-ctx on a recovery re-run, else factory.Begin); the single-loop path wraps the step observer to accumulate the transcript and Save each step (the host throttles), and a recovered run seeds the saved transcript via WithHistory and continues with no new input; finalize on exit.
- run/phases.go: phase-boundary checkpointing. Completed phases are recorded after each phase; a resumed run skips already-completed phases. The interrupted phase re-runs from its start (boundary-granular, documented); only the single-loop path resumes mid-loop.

Battery (checkpoint/): NewFactory wires the battery into the factory port (per-run handle, meta derived from RunInfo); RunCheckpoint + handle.Save carry the phase fields.

Tests (run/checkpoint_test.go): the finalize decision matrix; single-loop Save+Complete; terminal-error Fail; resume seeds history; phase-boundary Saves completed phases; resume skips completed phases. Full ./... green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
+31
-5
@@ -33,9 +33,10 @@ type Ports struct {
 	Budget Budget
 	// Critic optionally monitors a long run for hangs/runaways. nil = none.
 	Critic Critic
-	// Checkpointer persists resumable progress for durable recovery. nil = no
-	// checkpointing (a run interrupted by shutdown is simply lost).
-	Checkpointer Checkpointer
+	// Checkpointer mints a per-run Checkpointer for durable recovery (it decides
+	// per run whether the run is durable). nil = no checkpointing (a run
+	// interrupted by shutdown is simply lost).
+	Checkpointer CheckpointerFactory
 	// Palette resolves SkillPalette / SubAgentPalette entries into delegation
 	// tools (skill__<name> / agent__<name>). nil = those entries are inert.
 	Palette PaletteSource
@@ -66,7 +67,9 @@ type RunInfo struct {
 	Name string
 	CallerID string
 	ChannelID string
+	GuildID string // the originating guild/server id (empty for DMs/triggers)
 	ParentRunID string
+	ModelTier string // the run's resolved base tier (for checkpoint re-dispatch)
 	Inputs map[string]any
 	StartedAt time.Time
 	// MaxIterations is the run's base tool-dispatch step ceiling, so a critic can
@@ -172,6 +175,16 @@ type CriticHandle interface {
 
 // --- Checkpointer ---
 
+// CheckpointerFactory decides, per run, whether the run is durable and (if so)
+// mints the per-run Checkpointer that records its progress. It returns (nil, nil)
+// for a non-durable run (the common short-run case — no checkpointing overhead).
+// A storage error should be logged and degraded to (nil, nil) so a failing
+// checkpoint store never fails the run. Mirrors mort's
+// agentexec.CheckpointerFactory.
+type CheckpointerFactory interface {
+	Begin(ctx context.Context, info RunInfo) (Checkpointer, error)
+}
+
 // Checkpointer persists a run's resumable progress for durable recovery.
 // Mirrors mort's agentexec.RunCheckpointer.
 type Checkpointer interface {
@@ -184,11 +197,24 @@ type Checkpointer interface {
 	Fail(ctx context.Context, err error) error
 }
 
-// RunCheckpointState is the resumable snapshot a Checkpointer persists. Kept
-// minimal here; the executor extends what it records during the merge.
+// RunCheckpointState is the resumable snapshot a Checkpointer persists.
 type RunCheckpointState struct {
+	// Messages is the running transcript (single-loop run) OR the active phase's
+	// transcript (multi-phase run). May be nil.
 	Messages []llm.Message
 	Iteration int
+	// CompletedPhases is set only for multi-phase runs: the outputs of phases
+	// already finished, in phase order. nil for single-loop runs.
+	CompletedPhases []PhaseOutput
+	// ActivePhase is the name of the in-progress phase (multi-phase only).
+	ActivePhase string
 }
 
+// PhaseOutput is one completed pipeline phase's name and output text, recorded in
+// a checkpoint so a resumed multi-phase run can skip already-finished phases.
+type PhaseOutput struct {
+	Name string
+	Output string
+}
+
 // --- PaletteSource ---
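
To make the boundary-granular resume concrete, here is a minimal sketch of how a resumed multi-phase run could use CompletedPhases to skip finished phases. It is an assumption-laden illustration rather than the logic in run/phases.go: the phase type, runPhases, and the save callback are hypothetical, and matching completed phases by name is a guess at how skipping is keyed.

```go
package main

import "fmt"

// PhaseOutput mirrors the type added in this diff.
type PhaseOutput struct {
	Name   string
	Output string
}

// phase is a hypothetical pipeline step that sees the outputs of earlier phases.
type phase struct {
	Name string
	Run  func(prior []PhaseOutput) string
}

// runPhases executes phases in order, reusing saved outputs for phases listed
// in completed and calling save at each newly crossed phase boundary.
func runPhases(phases []phase, completed []PhaseOutput, save func([]PhaseOutput)) []PhaseOutput {
	done := make(map[string]PhaseOutput, len(completed))
	for _, c := range completed {
		done[c.Name] = c
	}
	outputs := make([]PhaseOutput, 0, len(phases))
	for _, p := range phases {
		if prev, ok := done[p.Name]; ok {
			outputs = append(outputs, prev) // already finished before the interruption
			continue
		}
		// An interrupted phase has no saved output, so it re-runs from its start.
		out := p.Run(outputs)
		outputs = append(outputs, PhaseOutput{Name: p.Name, Output: out})
		// Hand the callback a copy so later appends cannot alias the snapshot.
		save(append([]PhaseOutput(nil), outputs...))
	}
	return outputs
}

func main() {
	var ran []string
	mk := func(name string) phase {
		return phase{Name: name, Run: func([]PhaseOutput) string {
			ran = append(ran, name)
			return name + "-out"
		}}
	}
	phases := []phase{mk("plan"), mk("draft"), mk("review")}

	// Simulate a resume where "plan" finished before shutdown.
	resumed := []PhaseOutput{{Name: "plan", Output: "saved plan"}}
	saves := 0
	outputs := runPhases(phases, resumed, func([]PhaseOutput) { saves++ })

	fmt.Println("re-ran:", ran)
	fmt.Println("outputs:", outputs)
	fmt.Println("saves:", saves)
}
```

In this sketch only "draft" and "review" execute on the resumed run, and the snapshot is saved once per newly completed phase, which matches the boundary-granular behavior the commit message describes.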