P4c: remaining batteries — checkpoint + schedule + critic #6
Reference in New Issue
Block a user
Delete Branch "phase-4c-batteries"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Completes the P4 battery set (stacked on #5). Three batteries that plug into the remaining
run.Portsseams. NOTE: the executor's call sites for Critic/Checkpointer are a P2 follow-up — these provide the seams + impls + defaults ahead of that wiring, each nil-safe and core-import-clean (verifiedcore → battery= 0).Batteries
checkpoint/—run.Checkpointerdurable-resume:CheckpointStoreseam + a per-runhandle(throttledSave,Complete/Faildelete) +Memorydefault (with the honest caveat that in-memory doesn't survive the restart it recovers from).ListInterruptedis the boot-recovery query.schedule/— a generic cronRunner(Tick= one pass,Loop= run on an interval). Every dependency is wired by the host (Due/Run/Mark/Next), so the battery owns no cron grammar and never duplicates the parser. A bad job is logged + left due (retries); only a failingDuelister is pass-fatal.critic/— a two-tier timeout watchdog (run.Critic). executus owns the deterministic mechanics (activity tracking, fire-once-per-idle-period on soft timeout, extendable deadline, hard-kill backstop); the policy is theEscalatorseam (nudge/extend/kill). Mort plugs its LLM critic-agent in here;ExtendOnceis the default. Race-tested.Also in this PR
ci(gadfly)): image →sha-d0de034, and ollama-cloud concurrency moved from the model axis to the lens axis (PROVIDER_CONCURRENCY=1,PROVIDER_LENS_CONCURRENCY=3) — so this PR's own review runs one model at a time with its 3 lenses concurrent (first comment sooner; the other two models in series).Verification
go build/vet/test -race ./...green; core go.sum still free of host/DB deps.P4 status
Battery set complete: audit, budget, persona, skill (nouns), checkpoint, schedule, critic +
contrib/storeSQLite for budget/persona/skill/audit. Next: P5 (gadfly-on-core canary) and P6 (mort rewrite).🤖 Generated with Claude Code
Plugs into run.Ports.Checkpointer (the executor's call site is a P2 follow-up; this provides the seam + impls ahead of it): - checkpoint.go: CheckpointStore seam + RunCheckpoint{Meta, Messages, Iteration, ActivePhase} + RunCheckpointMeta (mirrors mort's agentexec types). - handle.go: New(store, meta, throttle, now) -> run.Checkpointer. Save writes a throttled snapshot; Complete/Fail delete it (a cleanly finished or terminally failed run is NOT a recovery candidate; a shutdown-interrupted run never calls them, so its checkpoint survives ListInterrupted at boot). nil store -> no-op. - memory.go: NewMemory() default (with the honest caveat that in-memory does not survive the restart it exists to recover from — a durable store is mort's). Tests: save+complete clears the recovery candidate; throttle skips in-window saves; nil-store is a clean no-op. Core imports ZERO from checkpoint. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>🪰 Gadfly review —
minimax-m3:cloud(ollama-cloud)Verdict: Blocking issues found — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
No
os/exec, no SQL, no HTTP, no templates in the three new packages — all DB-touching code lives incontrib/store(out of scope for this diff, and that's host-supplied SQL anyway).There's one minor subtlety worth noting:
schedule.Runner.Runandschedule.Runner.Markare called sequentially on the host-supplied goroutine, and the sameRunner.Due/Markcallbacks will eventually call into whatever storage mort plugs in — but that wiring is host-side, not in this diff, and the same trust boundary already exists in the rest of the codebase. Nothing in this PR creates a new injection sink.The CI workflow change (
gadfly:sha-d0de034, concurrency shifted to lens axis) is pure config — no secrets expanded into the diff beyondsecrets.GITEA_TOKENwhich was already there. No new secret leakage introduced.One small CI note worth flagging: the new workflow pins a specific image sha (
sha-d0de034) — pinning by sha is good practice (avoids tag-mutation supply-chain risk), and no integrity verification is added beyond the registry's own. That's consistent with the prior pin (sha-6e3a83c) and not a regression.VERDICT: No material issues found
Through the security lens:
schedule/runner.go,checkpoint/{handle,memory,checkpoint}.go, andcritic/critic.gofor anyos/exec,database/sql,net/http, ortext/templateusage. None present in this diff (all SQL/DB code lives incontrib/store, not touched here). The only string-handling that touches host-supplied data (j.ID,j.Cron) is passing them back to host-suppliedRun/Mark/Nextcallbacks via structuredslogfields — no format-string eval, no shell.Messages []llm.MessageinRunCheckpointStateflows intoCheckpointStore.Saveand only the in-memory map implementation is shipped here — no JSON marshaling, no SQL interpolation, no template rendering.RunInfofields flow intoEscalator.OnSoftTimeout, but the Escalator is host-supplied and runs in-process on the kernel goroutine; no untrusted-code boundary is crossed.handle.Savecorrectly uses a mutex aroundlastSave;Memoryusessync.RWMutex;critic.handleusessync.Mutexfor all shared state. No obvious data races.http,exec,sql,templateand finding none.sha-d0de034) and shifts concurrency config — supply-chain posture is unchanged (sha pinning is good; same as the prior pin), and the only secret referenced is the pre-existingsecrets.GITEA_TOKEN.Verified by reading
schedule/runner.go,checkpoint/handle.go,checkpoint/memory.go,critic/critic.go,run/ports.go, and grepping the diff paths for injection sinks.🎯 Correctness — Blocking issues found
VERDICT: Blocking issues found
Findings
critic/critic.go:54-74—ExtendOnceis per-Escalator, but the contract is per-run (real bug, verified).By... then takes no further action" — per-run language.firedon theExtendOncestruct itself (line 58), behind amu.System(line 78) holds oneesc Escalator(line 79);New(line 89-97) constructs a single&ExtendOnce{}when esc is nil;Monitor(line 102-125) reusess.escfor every handle it creates. Thetick(line 215) callsh.sys.esc.OnSoftTimeout(...)on the shared esc.e.fired == trueand getsDecision{}— no extension, no nudge. The per-handleescalatedAtde-duplication (line 211) does NOT prevent this because each handle has its ownescalatedAt; it only re-arms within a single handle, not across runs sharing one esc.critic_test.go:79-89,TestExtendOnceOnlyFiresOnce) callsOnSoftTimeouttwice on the sameExtendOnce{}instance with differentRunInfovalues, but it never wires twoMonitors through oneSystemwith the default esc, so the bug is not exercised.ExtendOnce) silently loses the "one extension per stall" guarantee for every run after the first. Under fanout load all but the first stalling run are killed prematurely; single-run deployments only appear safe because there is no second run.run.RunInfo.RunID(e.g.map[string]struct{}protected bymu) so each run gets its own one-shot.critic/critic.go:202-227—Decision.Killis not sticky across re-arms (real but narrow bug, verified).Killtick,deadlineis collapsed toh.now()(line 226). The per-handleescalatedAtis set tolastActivityat the time of that tick (line 211). If the executor hasn't yet observed the collapsed deadline and callsRecordStepbefore its next cancellation check (line 145-150),lastActivityadvances, so a subsequent tick seesh.escalatedAt != h.lastActivityand the gateidle < softTimeout || escalatedAt.Equal(lastActivity)(line 207) lets it through once a fresh idle period exceeds softTimeout. The Escalator is called again; if it now returnsExtendBy > 0, line 222-224 doesh.deadline = h.deadline.Add(d.ExtendBy)— pushing the just-collapsed deadline back into the future.ExtendOnceis immune (itsfiredflag sticks at the Escalator level), but any Escalator that returns a mix ofKillandExtendByacross calls — e.g. a wavering LLM critic that decides to "give it one more try" after first issuing a kill — is susceptible. Also susceptible is any Escalator that returnsKillonce and thenDecision{}(which would leave the deadline collapsed but lose the kill intent if a subsequentExtendByever arrives).killed boolon the handle; once set,tick()short-circuits regardless of Escalator output. Or, equivalently, setescalatedAtto a sentinel that noRecordStepcan produce after Kill.checkpoint/handle.go:42-58—Savewrites viah.store.Saveoutside the throttle mutex (verified clean, no bug).lastSave = nowupdate, and theh.mu.Unlock()happen under the mutex;h.store.Saveruns after the unlock. The gate is monotonic (now.Sub(h.lastSave) < h.throttle) andlastSaveis updated before the write, so a concurrent caller cannot undercount saves. No correctness issue here.schedule/runner.go:60-81— failedMarkis logged but the job is still effectively considered "ran" for this tick (verified, design choice, not a bug).Markfailure after a successfulRunleavesnext_run_atunchanged, so the same job will be picked up next tick and re-run. The doc-comment on lines 56-59 already frames this as the safety/correctness trade-off (skippingMarkis better than skipping the work). Not material.checkpoint/memory.go:30-38—Loadreturnsnil, nilfor missing entries (verified, intentional).CheckpointStore.Loadincheckpoint.go:47("returns (nil, nil) if no checkpoint exists") would prevent future misreading but is not blocking.Summary
The
ExtendOnceper-run/per-Escalator mismatch is a real blocking correctness bug that contradicts the documented contract and will manifest under any multi-run workload that uses the default Escalator. The Kill-stickiness issue is real but narrower and only bites Escalators that issue inconsistent decisions across idle periods. Everything else I traced checks out.🧯 Error handling & edge cases — Minor issues
Error-handling & edge-case review — 🧯
VERDICT: Minor issues
Findings
critic/critic.go:202-215 —
escalatedAtde-dup is racy across theOnSoftTimeoutcall. Betweenh.mu.Unlock()at L213 and the re-lock at L217, a concurrentRecordStep(L145-150) can advancelastActivity. The next tick will seeescalatedAt != lastActivityand re-escalate against what is effectively the same idle period (the post-OnSoftTimeoutlock at L217-227 doesn't re-check the dedup condition). A fast step-then-stall sequence can double-fire the Escalator. Fix: snapshot the decision inputs (lastActivity,escalatedAt) and re-checkescalatedAt.Equal(lastActivity)inside the second locked section before applying the Decision, or hold the mu acrossOnSoftTimeout(the Escalator is documented as policy and must be cheap; if it isn't, document the expectation).checkpoint/memory.go:30-38 —
Loadreturns(nil, nil)for a missing key. Conventional Go returns(nil, ErrNotFound)(or an explicit sentinel) so callers can't conflate "no checkpoint because the run finished cleanly" with "Save never landed". Verified by reading the interface (checkpoint.go:45-49) and theMemoryimpl. Impact: a future durableCheckpointStore(mort's durable-job table, contrib/store SQLite) has to invent its own convention andMemory/SQLitewill diverge silently. Fix: definevar ErrNoCheckpoint = errors.New("no checkpoint")incheckpoint.goand return it fromMemory.Load; have callers (and the package docs) say "checkerrors.Is(err, ErrNoCheckpoint)".schedule/runner.go:67-78 —
Tickretries failing jobs forever with no backoff. Verified against the docstring: aRunfailure (L67-70) or aMarkfailure (L76-78) logs andcontinues without stamping, so the job stays due and re-runs next tick indefinitely. This conflates transient and permanent failures and amplifies load against a flapping downstream (especiallyMark— if the store is down, every due job retries everyInterval). Fix: add a per-job retry budget (e.g.,Runner.MaxRetriesor a circuit-breaker map keyed by job id), or at minimum a linear/exponential backoff stamp on failure instead of leavingnextAtunchanged.schedule/runner.go:62-78 — nil
Due/Run/Mark/Nextpanic with nil deref. Verified: none ofr.Due(L62),r.Run(L67),r.Next(L71),r.Mark(L76) is guarded. Docstring says "Every func field is required except Now and Logger" but a partialRunnercompiles silently and crashes at the first tick. Fix: add aValidate()method that returns an error if any required func is nil, and haveLoopcall it once at start; or have each call site return a wrapped error fromTickso the host learns about misconfiguration instead of taking a nil deref.schedule/runner.go:86-103 —
Loophas no panic recovery; a nil-func panic inTickkills the scheduler goroutine silently. This is distinct from the validation finding above: even after the immediate nil-deref is fixed, any future panic from aDue/Run/Next/Markimpl would unwind throughTick, out ofLoop'sselect, and terminate the scheduler with no log line.Loopcurrently only catches errors returned fromTick(L98-100), not panics. Fix: wrap theTickcall indefer recover()that logs and continues, so a single bad tick doesn't silently kill the cron runner for the process lifetime.critic/critic.go:187-227 —
watch/tickdo not recover panics fromOnSoftTimeout(or from the post-decision apply). A misbehaving Escalator that panics takes down the watch goroutine; the run then has no critic (no steer, no deadline extension, no kill) for the rest of its lifetime, with no log. Verified:watch(L187-200) has nodefer recover(), andtick(L202-228) has none either. Fix: adddefer recover()inwatch(logs and returns — the goroutine exits, but at least it's logged), or intick(logs, treats the panic as aDecision{}, and continues). Watch-level recovery is sufficient and simpler.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 4m 19s
🪰 Gadfly review —
deepseek-v4-flash:cloud(ollama-cloud)Verdict: Blocking issues found — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
Verdict: No material issues found
I reviewed the three new battery packages (
checkpoint/,critic/,schedule/) and the CI/CLAUDE.md changes through the security lens. No material security issues were found.Analysis summary:
No injection vectors. All three batteries operate through abstracted interfaces/seams (
CheckpointStore,Escalator,Runnerfunction fields) that the host wires. The batteries never directly handle SQL, shell commands, or template execution. Data flows from host-provided callbacks through the battery and back to host-provided callbacks — the battery is pure orchestration.No untrusted input paths. The
schedule.RunnerpassesDue.IDandDue.Cronstrings from the host'sDuecallback to the host'sRun/Nextcallbacks. Thecriticpackage passesrun.RunInfoandProgressto the host'sEscalator. Thecheckpointpackage stores[]llm.Messagethrough the host'sCheckpointStore. In every case the host controls both the producer and consumer of the data; the battery introduces no new trust boundary.No secret/credential leakage. The CI workflow change passes
GITEA_TOKENas an env var to the Docker container, but this is a pre-existing pattern unchanged by this PR (only the image tag and concurrency settings were modified).No unsafe defaults.
ExtendOnce(defaultEscalator) fires at most once. Thenoopcheckpointer is a safe nil-store default. Theschedule.Runnerdefaults totime.Nowandslog.Default()— no dangerous fallbacks.No SSRF, deserialization, or authn/authz gaps. These batteries don't make network calls, deserialize untrusted data, or implement access control — those concerns live in the host layer.
The code is clean from a security perspective.
🎯 Correctness — Minor issues
🎯 Correctness — Verdict: Minor issues
Findings
1.
checkpoint/handle.go:50-56—lastSaveupdated before the store write, so a failed save is invisible to the throttleIf
h.store.Savereturns an error,lastSavehas already been bumped. Every subsequentSavecall within the throttle window will be silently skipped (the throttle check seesnow.Sub(lastSave) < throttleand returns nil). The checkpoint is therefore lost — the caller believes it was persisted but it wasn't, and the throttle prevents retries until the window expires.Fix: Move the
h.lastSave = nowassignment after the successful store write, or use a two-phase approach (save first, then update the timestamp under the lock).2.
schedule/runner.go:72-80—Markfailure after a successfulRuncauses duplicate executionWhen
Runsucceeds butMarkfails, the job was already executed but its next-run time was never updated. On the nextTick, the job is still due and runs again. For idempotent jobs this is merely wasteful; for non-idempotent ones (e.g. sending a notification, charging a credit) it is a correctness bug — the same work is performed twice.The same problem applies when
Runsucceeds butNextfails (unparseable cron): the job runs every tick forever.Fix: Either (a) make
Markfailure fatal to the job (skipRunon the next tick by recording a temporary "don't retry" marker), or (b) document this as an intentional design choice and add a note that callers must make theirRunhandler idempotent. The current code silently assumes idempotency without enforcing or documenting it.3.
schedule/runner.go:33-40— Required function fields (Due,Run,Mark,Next) have no nil guardThe doc comment says "Every func field is required except
NowandLogger", butTickcallsr.Due(ctx, now),r.Run(ctx, j.ID),r.Next(j.Cron, now), andr.Mark(ctx, j.ID, now, next)without nil-checking any of them. A zero-valueRunner{}panics on the firstTick.Fix: Add nil checks in
Tick(or a constructor) that return a clear error instead of panicking, or make the fields safe by providing no-op defaults (likeNowandLoggeralready have).4.
critic/critic.go:93-97—backstopMul <= 1silently overrides to 3, butbackstopMul = 1is a valid (if tight) choiceA host that explicitly passes
backstopMul = 1(hard deadline == soft timeout, meaning the run is killed the instant it idles past the soft timeout) has its choice silently discarded. WhilebackstopMul = 1is unusual, silently overriding it violates the principle of least surprise. The condition should bebackstopMul < 1(treat only strictly-less-than-1 as "unset"), or the doc should state that<= 1is treated as "use default 3".Fix: Change
<= 1to< 1so thatbackstopMul = 1is accepted as an intentional (if tight) configuration.5.
critic/critic.go:93—Newdoes not setcheckInterval, so it is always zero and recomputed perMonitorcallThis is not a bug per se, but it means every
Monitorcall re-derives the same interval fromsoftTimeout / 2. If aSystemmonitors runs with different soft timeouts, each run gets a different check frequency. If that's intentional, fine — but the field is exported (checkIntervalis lowercase, so it's unexported — actually it's lowercase, so it's package-private). The test sets it directly (s.checkInterval = 5 * time.Millisecond), which works because they're in the same package. No action needed, just noting the design.No material issues found in the
checkpoint/memory.gostore, thecheckpointinterface definitions, thecriticExtendOncelogic, thecriticde-duplication byescalatedAt.Equal(lastActivity), or thescheduleLoop/Tickcontrol flow — those are all correct after verification againstrun/ports.go.🧯 Error handling & edge cases — Blocking issues found
All three findings are confirmed against the actual code. No findings need to be dropped.
🧯 Error handling & edge cases — Verdict: Blocking issues found
Blocking
checkpoint/handle.go:49-52— Throttle timer advanced before store save, causing silent data loss on transient errors.h.lastSave = nowis set before callingh.store.Save. If the underlying store save fails (e.g. network hiccup, disk full), the caller receives the error — but the throttle window has already been consumed. Any subsequentSavecall within the throttle period is silently dropped (returns nil), even though the previous snapshot was never persisted. The caller believes the retry succeeded when in fact the data is lost.Fix: Move
h.lastSave = nowto after a successfulh.store.Save(...):Minor
schedule/runner.go:60-81— Nil func fields cause panic instead of a clear error.Tickcallsr.Due,r.Run,r.Next, andr.Markwithout nil-guards. The doc says "Every func field is required" but a host that forgets to wire one gets a confusing runtime panic at the first tick, not a clear error at construction or on first call. No constructor exists (the struct is built directly).Fix: Add a constructor or validate at the top of
Tick/Loop:schedule/runner.go:67-74— SuccessfulRun+ failedNextcauses duplicate execution.When
Runsucceeds butNextfails (unparseable cron), the job's next-run time is not updated. The job stays due and is re-fetched on every tick, causing re-execution of an already-successful job. The code logs "not rescheduling" but doesn't acknowledge that the job will be re-run. For non-idempotent jobs this is a correctness bug.Fix: Either stamp a sentinel value (e.g. far-future time) to prevent re-fetch, or document the re-execution risk explicitly so hosts can make their jobs idempotent.
Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 51s
🪰 Gadfly review —
glm-5.2:cloud(ollama-cloud)Verdict: Blocking issues found — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
Verdict: No material issues found
Through the security lens, this PR adds three host-wired batteries (checkpoint, schedule, critic) plus a workflow image bump. I traced the tainted-input paths and found nothing material:
schedule/runner.go: All potentially-untrusted values (Due.ID,Due.Cron, errors) come exclusively from host-supplied funcs (Due/Run/Mark/Next). They flow only into (a) theRun(id)callback (host's own dispatch, its problem) and (b)slog.Warnlog lines. No SQL/command/path interpolation in this package; the only log of raw untrusted content isj.ID/j.Cron/erratrunner.go:68,73,77,99, which are job ids and cron expressions — not credentials or user PII. No injection surface owned by this code.checkpoint/handle.go/memory.go:meta.RunIDis used as a map key inMemory(parameterization is the host's concern when backing with SQL — no query construction here). The storedRunCheckpointcarriesPromptandMessages(conversation content), but that's inherent to a resume store and not newly leaked anywhere —ListInterruptedjust returns them to the boot-recovery caller; nothing logs them. No secret exposure introduced.critic/critic.go: TheEscalator.OnSoftTimeoutreturnsDecision.Nudge([]llm.Message) that the executor injects as a steer. The escalator is host-wired (Mort's LLM critic), so prompt-injection-via-nudge is host-policy territory, not a seam this PR leaves insecure.Progress.LastTool(a tool name) is passed to the escalator, but again the consumer is host-trusted. TheKillpath only mutates an in-memorydeadline; no privileged side effect..gitea/workflows/adversarial-review.yml): swaps a pinned image digest (sha-d0de034) and tunes provider concurrency env vars. No secrets are added or echoed;GITEA_TOKENremains sourced fromsecrets.and not logged. Nothing to flag.I verified the seam types against
run/ports.go:42-150(RunInfo,Critic/CriticHandle,Checkpointer/RunCheckpointState) and confirmed the implementations satisfy them; no undefined/unsafe symbol usage. Thecontrib/storeSQLite layer touched by grepping uses parameterizeddatabase/sqlaccessors (no raw query building in the diff scope).Nothing in the security lane rises to material. Stopping here.
🎯 Correctness — Blocking issues found
Both findings verified against the actual code.
Finding 1 (ExtendOnce): Confirmed at
critic/critic.go:89-96(single&ExtendOnce{}installed inSystem.escwhenesc == nil),:62-73(firedis a plain bool set once, never reset), and:176-183(every handle calls the sharedh.sys.esc.OnSoftTimeout). The package doc at:77and type doc at:76-77state "one System monitors many runs concurrently," whileExtendOnce's own doc at:50-53reads per-run. Real design defect.Finding 2 (schedule hot-loop): Confirmed at
schedule/runner.go:71-74— onNexterror the loopcontinues withoutMarking, leavingnext_rununchanged, so aDuequery of "next_run <= now" returns it again. WithRunsucceeding butNextpermanently failing, the job'sRunexecutes everyIntervalindefinitely. Real, lower severity.Verdict: Blocking issues found
Findings:
critic/critic.go:89-96+critic/critic.go:50-73—ExtendOnceis shared across all runs, breaking the default Escalator under the documented multi-run usage.New(nil, …)installs a single&ExtendOnce{}intoSystem.esc(:90-92).System.Monitor(:102) creates a fresh per-runhandle, but every handle calls the sameh.sys.esc.OnSoftTimeout(...)insidetick(:184), andExtendOnce.firedis a plain bool set totrueon the first stall and never reset (:65-68). The package/type doc (:76-77) explicitly states "oneSystemmonitors many runs concurrently," andExtendOnce's own doc (:50-53) reads as per-run ("the first time a run stalls it extends… then takes no further action — so a genuinely hung run is later killed by the hard backstop"). Under the documented usage the consequence is: the first run to stall gets its one extension; every other run that later stalls receivesDecision{}(no nudge, no extend) and is instead killed outright by the hard backstop — i.e. the "extend once" guarantee silently applies once globally, not once per run. The existing tests don't catch this:TestMonitorEscalatesOncePerIdlePeriodAndExtendsuses a customescFunc, andTestExtendOnceOnlyFiresOnceexercises a singleExtendOncein isolation. Suggested fix: key thefiredstate per run (e.g. amap[string]struct{}of run ids that have fired, keyed offinfo.RunIDinOnSoftTimeout), or document that the defaultExtendOnceis only valid for single-runSystems and haveNewconstruct a freshExtendOnceperMonitorcall (per-run, not per-System).schedule/runner.go:66-78— a permanently-unparseable cron expression causes the job to execute everyIntervalforever, not "retry later."When
Nextreturns an error, the job is left unstamped (:72-74, noMarkcall). Combined with the typicalDuequery ("next_run <= now"), the job remains due andRunis re-invoked on every tick — so a job with a bad cron runs at fullIntervalcadence indefinitely rather than backing off. The comment ("will retry next tick"/"left due (retries)") implies an eventual-success retry, but for a permanent parse failure this is an infinite hot-loop of actualRunexecutions. Lower severity (matches stated intent for transient failures), but worth either skipping jobs whoseNexterror is non-transient, or marking/stamping them to defer the next attempt.No issues found in the
checkpointbattery: the throttle logic (atomic check-and-set underh.muathandle.go:34-48), nil-store no-op (handle.go:25-27), andMemorySave/Load/Delete/ListInterrupted(memory.go) are correct and match therun.Checkpointer/CheckpointStoreinterfaces.🧯 Error handling & edge cases — Minor issues
Minor issues
schedule/runner.go:76-78—Markfailure silently causes duplicate job execution. A job whoseRunsucceeds and whoseNextparses has already executed; ifMarkthen errors, only aWarnis logged and the loop moves on. Because the next-fire time was never stamped,Duereturns the same job on the next tick andRunexecutes it again — duplicate side effects, surfaced only by the one-time Warn. TheTickdocstring (lines 55-59) only promises the "stays due and retries" behavior forRun/Nextfailures, not forMark; theMarkerror path (lines 76-78) has nocontinueand no stamp recovery. Suggested fix: either document thatMarkfailure also implies a duplicate next tick, or treat aMarkfailure as pass-fatal (return it) so the host learns the scheduler lost track of the job.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 2m 7s
Pull request closed