fix(run): address gadfly review of the checkpoint PR
executus CI / test (pull_request) Successful in 45s
Real findings from the consensus review (44 raw; heavy devstral noise):

- finalizeCheckpoint is now fired from the top-of-Run defer, so it runs on
  EVERY exit: a panic, an early build-error return (before the run loop), AND
  normal completion. Previously an early return on a recovered run left its
  durable record unfinalized, so boot recovery would retry it forever on a
  persistent build error. (opus + glm)
- Removed the dead ActivePhase field from run.RunCheckpointState +
  run.ResumeState (and the battery RunCheckpoint). Phase recovery is
  boundary-granular (skip completed phases; the interrupted phase re-runs from
  its start), so ActivePhase was never written nor read. Docs across
  ports/checkpoint/phases now state this plainly (5-model consensus that the
  field + docs over-promised mid-phase resume).
- CheckpointerFactory.Begin error is now logged (WARN) before degrading to
  non-durable, per the documented contract (was silently swallowed). (4 models)
- finalizeCheckpoint logs Complete/Fail errors (was silent).
- Resume phase-skip now keys off a SEPARATE resumeSkip set, not the live
  outputs map: a fresh run with two same-named phases no longer skips the
  second (the outputs map fills as phases run). (opus:max) + regression test.
- Removed the dead checkpoint.factory.now field (never set). (opus + glm)
- Fixed the stale phaseDeps doc (the step observer moved out of sharedOpts to
  per-path). Hoisted the resume guard to a local; dropped the wasted acc
  allocation on the resume path; documented that Save throttling is the
  Checkpointer's responsibility and the accumulated transcript is
  pre-compaction (host size-caps it).

Note (carried from the PR): classifyCheckpointOutcome keys shutdown on
run.ErrShutdown; mort stamps its own runengine.ErrShutdown. The mort wiring PR
aliases them so errors.Is matches.

New test: duplicate phase names both run on a fresh run. Full ./... green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
+15
-11
@@ -53,9 +53,10 @@ import (
 
 // phaseDeps carries the per-run state the phase runner shares with Run: the base
 // model, the full decorated toolbox (filtered per phase), the base step cap, the
-// shared agent options (tool-error limits + step observer + compactor), the
-// shared step observer (also fed by IsRunFunc bare calls), the critic/session
-// steer, and the audit recorder (phase events).
+// shared agent options (tool-error limits + compactor — the step observer is
+// added per phase, NOT in sharedOpts, so checkpointing can vary per path), the
+// shared step observer (wired into each phase's loop AND invoked for IsRunFunc
+// bare calls), the critic/session steer, and the audit recorder (phase events).
 type phaseDeps struct {
 	baseModel llm.Model
 	baseToolbox *llm.Toolbox
@@ -85,12 +86,18 @@ func (e *Executor) runPhases(runCtx context.Context, ra RunnableAgent, deps phas
 	var lastOutput string
 	var totalUsage llm.Usage
 
-	// Resume: pre-populate from the saved checkpoint so already-finished phases are
-	// skipped. The interrupted (active) phase is NOT pre-populated, so it re-runs
-	// from its start (boundary-granular recovery).
+	// resumeSkip is the set of phases already finished on a RECOVERED run — kept
+	// SEPARATE from the live `outputs` map (which fills as phases run this time) so
+	// the skip guard only skips RESUME-completed phases, never a fresh run's own
+	// phases. (Reusing `outputs` would make a second phase with a duplicate name
+	// skip itself.) Pre-populate outputs + completed so a resumed run threads the
+	// saved outputs into later phases. The interrupted (active) phase is NOT
+	// pre-populated, so it re-runs from its start (boundary-granular recovery).
+	resumeSkip := map[string]bool{}
 	if deps.resume != nil {
 		for _, pc := range deps.resume.CompletedPhases {
 			outputs[pc.Name] = pc.Output
+			resumeSkip[pc.Name] = true
 			completed = append(completed, pc)
 			lastOutput = pc.Output
 		}
||||
@@ -109,10 +116,8 @@ func (e *Executor) runPhases(runCtx context.Context, ra RunnableAgent, deps phas
 	}
 
 	for i, phase := range ra.Phases {
-		// Skip phases already completed on a resumed run (key presence, not output
-		// emptiness — a legitimately-empty phase output still counts as done).
-		if _, done := outputs[phase.Name]; done {
-			lastOutput = outputs[phase.Name]
+		// Skip phases already completed on a resumed run.
+		if resumeSkip[phase.Name] {
 			continue
 		}
 		// A killed/timed-out/cancelled run must not start its next phase.
||||
@@ -183,7 +188,6 @@ func (e *Executor) runPhases(runCtx context.Context, ra RunnableAgent, deps phas
 		if deps.checkpointer != nil {
 			_ = deps.checkpointer.Save(runCtx, RunCheckpointState{
 				CompletedPhases: append([]PhaseOutput(nil), completed...),
-				ActivePhase: "",
 			})
 		}
 	}