38d656ec71
executus CI / test (pull_request) Successful in 45s
Real findings from the consensus review (44 raw; heavy devstral noise):

- finalizeCheckpoint is now fired from the top-of-Run defer, so it runs on EVERY exit: a panic, an early build-error return (before the run loop), AND normal completion. Previously an early return on a recovered run left its durable record unfinalized → boot recovery would retry it forever on a persistent build error. (opus + glm)
- Removed the dead ActivePhase field from run.RunCheckpointState + run.ResumeState (and the battery RunCheckpoint). Phase recovery is boundary-granular (skip completed phases; the interrupted phase re-runs from its start), so ActivePhase was never written nor read. Docs across ports/checkpoint/phases now state this plainly (5-model consensus that the field + docs over-promised mid-phase resume).
- CheckpointerFactory.Begin error is now logged (WARN) before degrading to non-durable, per the documented contract (was silently swallowed). (4 models)
- finalizeCheckpoint logs Complete/Fail errors (was silent).
- Resume phase-skip now keys off a SEPARATE resumeSkip set, not the live outputs map: a fresh run with two same-named phases no longer skips the second (the outputs map fills as phases run). (opus:max) + regression test.
- Removed the dead checkpoint.factory.now field (never set). (opus + glm)
- Fixed the stale phaseDeps doc (the step observer moved out of sharedOpts to per-path). Hoisted the resume guard to a local; dropped the wasted acc allocation on the resume path; documented that Save throttling is the Checkpointer's responsibility and the accumulated transcript is pre-compaction (host size-caps it).

Note (carried from the PR): classifyCheckpointOutcome keys shutdown on run.ErrShutdown; mort stamps its own runengine.ErrShutdown; the mort wiring PR aliases them so errors.Is matches.

New test: duplicate phase names both run on a fresh run. Full ./... green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
53 lines
1.9 KiB
Go
// Package checkpoint is the durable-resume battery: it persists a run's
// resumable progress so a run interrupted by a shutdown can be recovered and
// continued on the next boot, rather than silently lost. It plugs into
// run.Ports.Checkpointer.
//
// Mort backs CheckpointStore with its durable-job table; Memory() is the
// zero-dependency default; contrib/store can add a SQLite one. The executor calls
// run.Ports.Checkpointer (a CheckpointerFactory) during the run loop; NewFactory
// wires this battery into that seam.
package checkpoint

import (
	"context"
	"time"

	"gitea.stevedudenhoeffer.com/steve/majordomo/llm"

	"gitea.stevedudenhoeffer.com/steve/executus/run"
)

// RunCheckpointMeta is the run attribution needed to resume a run from scratch
// (mirrors mort's agentexec.RunCheckpointMeta).
type RunCheckpointMeta struct {
	RunID       string
	AgentID     string
	AgentName   string
	CallerID    string
	ChannelID   string
	GuildID     string
	Prompt      string
	ModelTier   string
	ParentRunID string
}

// RunCheckpoint is one persisted snapshot of a run's resumable progress.
type RunCheckpoint struct {
	Meta            RunCheckpointMeta
	Messages        []llm.Message     // conversation so far (single-loop runs)
	Iteration       int               // completed agent-loop iterations
	CompletedPhases []run.PhaseOutput // finished phases, in order (multi-phase agents)
	UpdatedAt       time.Time
}

// CheckpointStore persists run checkpoints keyed by run id. A live checkpoint
// means "this run was in flight and not cleanly finished"; Complete/Fail delete
// it. ListInterrupted returns every surviving checkpoint at boot for recovery.
type CheckpointStore interface {
	Save(ctx context.Context, cp RunCheckpoint) error
	Load(ctx context.Context, runID string) (*RunCheckpoint, error)
	Delete(ctx context.Context, runID string) error
	ListInterrupted(ctx context.Context) ([]RunCheckpoint, error)
}