Real findings from the consensus review (37 raw; many devstral dups/noise):
- Optional/budget-salvage branches no longer swallow a context
cancellation / deadline / critic-kill: such errors return immediately so
the run is classified cancelled/timeout/killed, not "ok" with a fallback.
(the most serious finding — an Optional final phase could mask a killed run)
- IsRunFunc bare phase now feeds the SHARED step observer (not just the
audit recorder), so the critic's activity clock + Result.Steps see it —
a long synthesize phase no longer looks idle to the critic.
- phaseModel returns the resolver's enriched (usage-attribution) context and
the phase's calls use it, mirroring the single-loop path (non-base-tier
phases were mis-attributed).
- salvagePhaseTranscript trims the tail on a rune boundary (was a raw byte
slice that could split a UTF-8 rune); maxSalvage is now a named const with
rationale.
- expandPhaseTemplate logs a WARN on parse/execute failure instead of
silently returning the unexpanded template; documented the phase-name
identifier requirement + the "Query" shadow.
- removed the dead phaseDeps.baseTier field.
- extracted multimodalUserMessage, shared by runAgent + the phase runner
(was duplicated image-folding).
- aggregated phase usage is stamped onto the result even on a hard-error
return; TrimSpace computed once; filterToolbox returns the base toolbox
as-is for the empty-names (full-palette) case instead of copying;
phaseModel WARN no longer prints error=<nil>.
New test: Optional phase does not swallow a cancellation. Full ./... green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The kernel carried RunnableAgent.Phases as a DTO but never executed it —
Run always ran a single agent loop with ra.SystemPrompt, so a phased agent
(mort's deepresearch/research) silently ran one loop with the base prompt
instead of its pipeline. This implements the phase loop, ported from mort's
agentexec pipeline but reusing the kernel's own machinery.
- run/phases.go: runPhases / runOnePhase. Phases run sequentially; each is a
fresh agent loop (or a bare LLM call for IsRunFunc phases) with its own
template-expanded system prompt ({{.Query}} + {{.<PhaseName>}}), model
tier, step cap, and tool subset. Outputs thread into later phases; the
final phase's output is the run output. Optional phases swallow errors and
substitute FallbackMessage; a non-optional phase that merely exhausts its
step/tool budget salvages its partial transcript and continues (a hard
error still aborts); per-phase tier-resolve failures fall back with a WARN.
- run/agent.go: Phase gains IsRunFunc + FallbackMessage (the kernel Phase
struct previously omitted them).
- run/executor.go: Run factors the shared agent options (tool-error limits,
step observer, compactor) and branches — single loop (critic's dynamic
step ceiling) vs the phase runner (fixed per-phase caps; the run-level
critic's steer + hard deadline still apply across phases). systemPrompt
now delegates to systemPromptWithBody so each phase keeps the platform
header. The same step observer feeds audit/steps/critic across all phases.
Tests (run/phases_test.go): sequential output threading + template
expansion, Optional-failure → FallbackMessage continues, hard-error abort,
IsRunFunc bare call, per-phase SystemHeader, filterToolbox subset, template
expansion. Full ./... suite green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adopts gadfly's review-representation overhaul: one ranked consensus comment
across the swarm + an advisory COMMENT-state inline PR review, on image
sha-3095ebf. Swarm config still rides the owner variables.
[skip ci]
Second-pass findings on the security fix:
- Mime sanitized ONCE and passed to BOTH StageInputFile and the descriptor (was
passing raw f.MimeType to the host store while only the descriptor sanitized) —
3 models.
- sanitizeField now also strips Unicode format chars (category Cf, incl. the bidi
overrides U+202A–U+202E that can reorder how the descriptor renders); IsControl
already covers \n\r\t so the explicit checks are dropped.
- fileID is sanitized before inlining + an empty file_id drops the file (defense
vs a misbehaving stager).
- humanizeBytes clamps the prefix index so an absurd size (≥1024^6) can't index
past "KMGTPE" and panic — a no-panic guarantee independent of the per-file cap.
- Docs sync: README Ports list gains InputFiles; tool.InputFile.Name doc now says
the executor reduces an untrusted name to a safe base name (was claiming the
field is already safe).
Tests: bidi/control stripping; mime sanitized in staged value + descriptor; empty
file_id drop; humanizeBytes no-panic across sizes up to 1<<62. Suite green (-race).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The full swarm (5-6 models) flagged that stageInputFiles passed the untrusted
attachment filename straight to StageInputFile and inlined it into the
[ATTACHED FILES]/`/workspace/<name>` descriptor with no sanitization — a path
the byte-cap already treats as a trust boundary. A name like ../../etc/passwd or
an absolute/drive path could escape the host store or the sandbox workspace, and
newlines in the name/mime could inject text into the prompt block.
- sanitizeName: strips control chars/newlines, then reduces to a base name
(path.Base after backslash-normalization) so ../, nested dirs, and absolute /
drive paths all collapse to their last element; "attachment" fallback for
empty/"."/"..". Applied BEFORE staging AND inlining.
- sanitizeField: strips control chars from MimeType (also inlined verbatim).
- maxInputFiles (32) count cap — defense-in-depth vs a flood of tiny files,
independent of the per-file byte cap.
Tests: sanitizeName table (traversal/absolute/backslash/control/fallback, +
no-separator invariant); traversal staged+described under the base name only;
oversize skip; count-cap truncation. Full suite green (-race).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
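The filename and field sanitization described in this commit and in the second-pass commit above it (base-name reduction, control and Unicode Cf stripping, empty file_id drop), as an added example that is not the project's own code. The function names follow the commit text; the descriptor line layout and all sample inputs are constructed for illustration.

```go
package main

import (
	"fmt"
	"path"
	"strings"
	"unicode"
)

// sanitizeField removes control characters (category Cc, which already
// covers \n, \r and \t) and format characters (category Cf, which includes
// the bidi overrides U+202A..U+202E).
func sanitizeField(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) || unicode.Is(unicode.Cf, r) {
			return -1 // drop the rune
		}
		return r
	}, s)
}

// sanitizeName reduces an untrusted attachment name to a safe base name:
// strip control/format characters, normalize backslashes to slashes, then
// keep only the last path element. Names that collapse to nothing useful
// fall back to "attachment".
func sanitizeName(name string) string {
	clean := strings.ReplaceAll(sanitizeField(name), `\`, "/")
	base := path.Base(clean) // "" -> ".", "/" -> "/", "a/b/" -> "b"
	switch base {
	case "", ".", "..", "/":
		return "attachment"
	}
	return base
}

// descriptorLine builds one line of an attached-files block. The layout is
// an assumption made for this example. Each value is sanitized once and the
// sanitized value is the only one used afterwards; an empty file_id drops
// the file.
func descriptorLine(rawName, rawMime, rawFileID string) (string, bool) {
	name := sanitizeName(rawName)
	mime := sanitizeField(rawMime)
	fileID := sanitizeField(rawFileID)
	if fileID == "" {
		return "", false
	}
	return fmt.Sprintf("- %s (%s) file_id=%s -> /workspace/%s", name, mime, fileID, name), true
}

func main() {
	// Constructed inputs covering the cases the commit lists.
	names := []string{
		"../../etc/passwd",
		`C:\Windows\system32\cmd.exe`,
		"/abs/path/report.pdf",
		"notes\r\n.txt",
		"invoice\u202Efdp.exe",
		"..",
		"",
	}
	for _, n := range names {
		fmt.Printf("%q -> %q\n", n, sanitizeName(n))
	}
	if line, ok := descriptorLine("../../etc/passwd", "text/plain\nextra line", "f-123"); ok {
		fmt.Println(line)
	}
}
```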
executus's tool.Invocation already carried InputFiles (audio/PDF/binary), but the
executor never staged them — only Images were folded into the run. This adds the
host seam mort's chat/chatbot surfaces need for audio-input parity with agentexec.
- run.Ports gains InputFiles InputFileStager (nil-safe; nil = input files silently
ignored, run still proceeds text-only). The interface mirrors mort's skill
FileStorage: StageInputFile(ctx, runID, agentID, name, mime, content) → file_id.
- run/input_files.go (ported from mort agentexec/input_files.go): stageInputFiles
persists each file under run scope and appends an [ATTACHED FILES] descriptor
block to the prompt so the agent can reach them by file_id (e.g. code_exec
files_in → /workspace/<name>). Bytes are NEVER inlined into model context.
Best-effort: empty/oversized(>50MB)/save-error files are skipped; colliding
base names are disambiguated (name-2, name-3) so they don't clobber at
/workspace/<name>.
- Executor.Run calls it after the model/toolbox build, before the loop, so the
descriptor rides the first user turn (alongside the existing Images folding).
Tests: stages + builds the block; nil stager / no files leave the prompt intact;
dedup; empty/save-error skipping. Full suite green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The reusable now reads swarm config from user-scope vars (GADFLY_DEFAULT_* +
GADFLY_ENDPOINT_*); this immutable @sha bumps past the long-lived-runner ref
cache so the vars-config reusable is adopted. Direct to main + [skip ci] to
avoid triggering the review swarm.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Long-lived act_runners cache the reusable-workflow ref, so a moved @v1 tag
keeps resolving to a stale cached copy and a newly-added reviewer never runs.
Pinning to a unique immutable sha forces a cache miss → fresh fetch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Every reviewer flagged that runAgent appended llm.Text(input) unconditionally, so
an image-only run (blank prompt) emitted an empty TextPart — inconsistent with the
sibling runSession.AttachImages which guards it. Mirror that guard
(strings.TrimSpace(input) != ""). Also:
- copy opts before appending (variadic backing array can have spare capacity; avoid
aliasing a caller's slice).
- reword the doc comment to drop the mort-agentexec reference (executus is a
standalone lib; a consumer name doesn't belong in its godoc).
Tests: image+text are co-located in ONE user message; an image-only run emits no
blank TextPart.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The executor passed only the text `input` to majordomo's agent.Run, silently
dropping inv.Images — so a multimodal run (vision: chatbot @mention, chat API)
lost its images on the executus path. majordomo's Run input arg is text-only, so
fold the images into the first user message (text + image parts) via WithHistory
and call Run with empty input, mirroring mort agentexec's multimodal seeding. The
image-less path is unchanged (prompt passes straight through).
Tests: a run with Images carries the image bytes + prompt into the first model
request; the text-only path still reaches the model.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address the swarm's findings on this rollout:
- Replace `secrets: inherit` (which forwarded ALL repo secrets — registry/
Komodo/Discord/DB creds the reviewer never uses) with explicit forwarding of
only OLLAMA_CLOUD_API_KEY / CLAUDE_CODE_OAUTH_TOKEN / findings tokens.
GITEA_TOKEN is the automatic job token (github.token in the reusable).
- Pin uses: ...@main -> @20a5c43 (immutable) so a push to gadfly can't change
the code that runs with our forwarded secrets.
Requires gadfly's review-reusable.yml secrets contract (steve/gadfly#9, merged).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the full self-contained stub with a thin caller of steve/gadfly's
reusable workflow, using gadfly's own dogfood config: 6 cloud models +
the Claude Code engine (sonnet, opus, opus:max). No local Macs / foreman.
Advisory only.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two convergent gadfly refinements on the PostRun wiring:
- PostRun now runs on detach(ctx), not the caller's ctx — a finished/cancelled
caller no longer aborts artifact production (3-model: glm-5.2/minimax/deepseek).
- Cleanup is panic-isolated via safeCleanup (recover+log), matching runPostRun, so
a misbehaving teardown can't clobber an otherwise-successful run (deepseek).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The session-tool TYPES already lived in tool/ (P4 move) but the executor never
used them. This wires them, unblocking artifact-producing host surfaces (mort's
chat API / chatbot / .skill / scaddy) to run on executus:
- run/session.go: steerMailbox (thread-safe message queue) + runSession
(tool.AgentSession over it: AttachImages → a user-role multimodal message
injected before the agent's next step) + runPostRun (panic-isolated hook call).
- executor: create the mailbox + set inv.AttachImages BEFORE the toolbox build;
add inv.ExtraTools + a SessionToolFactory's per-run Tools to the toolbox; defer
its Cleanup; merge the session mailbox with the critic's nudges into ONE
WithSteer; after the run, call PostRun with the full transcript
(runRes.Messages) → Result.PostRunResult (best-effort, never fails the run).
- run.Result += PostRunResult *tool.PostRunResult.
- dropped the now-dead criticBinding.steerOptions (superseded by drainSteer).
Tests: a factory whose PostRun emits an artifact from the output+transcript +
Cleanup lands on Result.PostRunResult; a factory-added tool is callable.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pool now: minimax-m3, glm-5.2, glm-5.1, deepseek-v4-pro, nemotron-3-super,
qwen3-coder:480b (all cloud, ollama-cloud=3). Removed the low-value reviewers +
the last local endpoint (m5).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The WithCancelCause+timer rewrite made MaxRuntime surface as Canceled (not
DeadlineExceeded), so statusFor's context.Cause(DeadlineExceeded) check could
relabel (a) a genuine run error as 'timeout' and (b) a caller cancel/deadline as
'timeout' (was 'cancelled'). Convergent gadfly finding (glm-5.2 + cluster).
Fix: keep MaxRuntime as WithTimeout (its DeadlineExceeded propagates → 'timeout',
preserving own-timeout vs caller-cancel), add a NESTED WithCancelCause layer only
for the kill. statusFor consults context.Cause ONLY for ErrCriticKill; everything
else is classified by the run error itself. Tests: generic-error-not-relabeled +
caller-cancel-stays-cancelled.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
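The context layering this commit describes (WithTimeout for MaxRuntime, a nested WithCancelCause used only for the kill, and a statusFor that reads context.Cause only for ErrCriticKill), as an added sketch that is not the project's own code. The helper names newRunCtx and classify, the 30 ms runtime and the sample error are assumptions for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrCriticKill marks an explicit critic kill (name taken from the commit text).
var ErrCriticKill = errors.New("critic kill")

// newRunCtx layers the run context: the outer WithTimeout enforces MaxRuntime
// and keeps surfacing DeadlineExceeded; the inner WithCancelCause exists only
// so a kill can attach ErrCriticKill as the cause.
func newRunCtx(parent context.Context, maxRuntime time.Duration) (context.Context, context.CancelCauseFunc, func()) {
	timed, cancelTimed := context.WithTimeout(parent, maxRuntime)
	runCtx, kill := context.WithCancelCause(timed)
	stop := func() { kill(nil); cancelTimed() }
	return runCtx, kill, stop
}

// statusFor consults context.Cause for ErrCriticKill only. Every other
// outcome is classified from the run error itself, so an expired or
// cancelled context cannot relabel a genuine run error.
func statusFor(ctx context.Context, err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(context.Cause(ctx), ErrCriticKill):
		return "killed"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	case errors.Is(err, context.Canceled):
		return "cancelled"
	default:
		return "error"
	}
}

// classify simulates one run ending for the given reason and returns the
// status it is recorded under.
func classify(trigger string) string {
	caller, cancelCaller := context.WithCancel(context.Background())
	defer cancelCaller()
	runCtx, kill, stop := newRunCtx(caller, 30*time.Millisecond)
	defer stop()

	var err error
	switch trigger {
	case "ok":
		// run finished normally, err stays nil
	case "kill":
		kill(ErrCriticKill)
		<-runCtx.Done()
		err = runCtx.Err() // context.Canceled; the cause carries the kill
	case "caller-cancel":
		cancelCaller()
		<-runCtx.Done()
		err = runCtx.Err() // context.Canceled, cause is Canceled too
	case "max-runtime":
		<-runCtx.Done()
		err = runCtx.Err() // context.DeadlineExceeded from WithTimeout
	case "model-error":
		err = errors.New("model: upstream failure") // constructed error
	case "model-error-after-deadline":
		<-runCtx.Done() // the deadline passes, but the run error is unrelated
		err = errors.New("model: upstream failure")
	}
	return statusFor(runCtx, err)
}

func main() {
	for _, trig := range []string{"ok", "kill", "caller-cancel", "max-runtime",
		"model-error", "model-error-after-deadline"} {
		fmt.Printf("%-28s -> %s\n", trig, classify(trig))
	}
}
```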
Completes the run-critic seam so a host adapter (mort's agentcritic) has full
fidelity, closing the two limitations gadfly surfaced on mort #1334.
- RecordStep(iter int, resp *llm.Response): the completed step's model response
is now passed to the critic (was index-only), so a host that records a trace
(mort's ProgressRecorder) can show what the agent actually produced, not just
an iteration count. The executor forwards s.Response; the battery ignores it
(its Progress is count-based).
- CriticHandle.KillCause() error + ErrCriticKill: the executor now distinguishes
an explicit critic KILL from a natural backstop expiry. runCtx uses a
cause-carrying cancel (WithCancelCause + a MaxRuntime timer cancelling with
DeadlineExceeded); the deadline-watch cancels with ErrCriticKill when
KillCause()!=nil, else DeadlineExceeded. statusFor reads context.Cause →
killed / timeout / cancelled are now distinct (were all "cancelled"). The
battery sets killCause from Decision.KillReason on a Kill.
Tests: statusFor "killed" case (cause=ErrCriticKill, err=Canceled); fake handle
+ battery RecordStep/KillCause signatures. Core stays battery-free.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
m1/qwen3:14b proved consistently low-value + slowest in the pool over multiple
PRs. Removed from GADFLY_MODELS + GADFLY_PROVIDER_CONCURRENCY + its endpoint so it
never fires again. m5 retained.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A buggy/hostile Escalator returning a huge RaiseStepsBy could wrap handle.maxSteps
negative (which the executor reads as defer-to-base). Clamp at math.MaxInt.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Prerequisite for a full-fidelity mort agentcritic adapter (which adjusts a
healthy-but-long run's iteration budget, not just its deadline). executus's
CriticHandle was deadline+steer only; this adds the dynamic step ceiling above
an unchanged majordomo (which already exposes WithMaxStepsFunc).
- run.RunInfo += MaxIterations (the run's base ceiling, so a critic can raise it
relative to the baseline).
- run.CriticHandle += MaxSteps() int — polled by the executor each step via
agent.WithMaxStepsFunc; <=0 defers to the base. The executor uses
WithMaxStepsFunc(critic.MaxSteps) when a critic is active, else WithMaxSteps.
- critic battery: handle.maxSteps (initialised from RunInfo.MaxIterations) +
MaxSteps(); Decision gains RaiseStepsBy so an Escalator can raise the ceiling
alongside ExtendBy. ExtendOnce default is unchanged (time-only).
Test: a critic returning MaxSteps=5 lets a base-MaxIterations=1 run complete two
tool-dispatch steps past the base ceiling. Core stays battery-free (run doesn't
import critic).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-adds the local Macs (m1/qwen3:14b, m5/qwen3.6:35b-mlx) via their foreman endpoints alongside the 3 cloud models. Cloud keeps lens fan-out (ollama-cloud=1 model + lens=3); each Mac runs one model with lenses serial (foreman serializes anyway); all provider lanes parallel. Bumps the job timeout 30->90m for the slow local lanes. With findings telemetry now on, gadfly-reports can quantify whether the Macs earn their keep.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
majordomo's step observer fires post-iteration, so the critic's activity clock
refreshes per-iteration, not mid-tool — a single long tool call won't refresh it
until it returns. Documented + the host-progress-bridge mitigation (mort's
pattern). A true pre-dispatch hook needs majordomo support (follow-up).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From PR #9 (minimax + deepseek):
- Run now has a top-level recover() — the "never propagates a panic" promise was
unenforced; a panicking host Port (Critic/Audit/Palette) on the run goroutine
now becomes Result.Err instead of unwinding into the caller.
- The critic deadline-watch goroutine recovers panics from a host Deadline()
(it's a separate goroutine, so Run's recover can't catch it) — a buggy
CriticHandle can't crash the process.
- CriticHandle interface documents its concurrency contract (Record*/Steer on the
run goroutine vs Deadline()/Stop() from the watch goroutine — impls must be
concurrent-safe; the critic battery already is).
- startCritic's dead `soft <= 0 -> noop` guard (withFallbacks already coerces to
90s) replaced with a defensive inline 90s default, so a bypass of withFallbacks
still gets a working critic instead of silently none.
- Delivery tests made honest: the old "error path" test only checked the
early-return (no delivery); added TestDeliverErrorOnRunFailure (in-loop model
error -> DeliverError to the target) + renamed the early-return test.
Graded all #9 findings in the gadfly MCP.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Continues finishing the executor's run.Ports wiring (after C0's Palette).
Critic (run/critic.go): when Ports.Critic is set and the agent enables it, the
executor calls Monitor at run start, feeds RecordStep/RecordToolStart from the
step observer, drains the critic's Steer messages into the loop via
agent.WithSteer, and binds the run's hard cancellation to the critic's
(extendable) Deadline through a watch goroutine — a healthy-but-slow run gets
room while a hung one is killed. Stop() on run end. Soft timeout from
Defaults.CriticSoftTimeout (default 90s). nil-safe: no critic / not-enabled =
no-op.
Delivery (run/executor.go deliver): after the run, when Ports.Delivery is set
and inv.DeliveryID is non-empty, the executor posts Result.Output (or
DeliverError on failure) to a host-interpreted deliver.Target
{inv.DeliveryKind, inv.DeliveryID}. Empty target = caller reads Result.Output
itself (the synchronous default; the `.agent run` canary). Best-effort +
detached.
tool.Invocation gains DeliveryKind/DeliveryID (host-set egress target).
Tests: critic monitored/fed/steered/stopped when enabled, untouched when not;
delivery posts on a target, skips without one. Deferred: Checkpointer (needs a
majordomo hook to snapshot the running message history).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From the PR #8 review (all graded in the gadfly MCP):
- skip empty palette names + dedupe by final tool name, instead of producing a
"skill__" tool or an opaque box.Add duplicate error.
- delegationResult: no trailing blank line when a non-ok child produced no output.
- delegationErr: fold a child's partial output into the hard-failure error so it
isn't silently dropped.
Deferred to C0b (design-level, not trivial): route delegation through the
tool.Registry gate/audit wrappers; expose the skill's real input schema to the
LLM instead of a generic inputs map. typed-nil PaletteSource is left as a caller
contract (the == nil guard catches the untyped-nil interface).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The first cutover prerequisite: the executor now turns an agent's SkillPalette /
SubAgentPalette into delegation tools so a mort agent that delegates works
through run.Executor (the piece the `.agent run` canary needs beyond the
already-wired audit/budget).
- run/palette.go: addDelegationTools builds a skill__<name> tool (structured
inputs) per SkillPalette entry and an agent__<name> tool (prompt) per
SubAgentPalette entry, each invoking run.Ports.Palette as a CHILD of the
current run (parentRunID = inv.RunID, inheriting caller + channel). A non-ok
child status is surfaced to the parent with the partial output. nil-safe: no
PaletteSource or empty palette → no delegation tools (unchanged behavior).
- executor.go: call it right after building the low-level toolbox.
Tests: the model calls skill__helper → routed through Palette with the right
name/caller/inputs/parent; nil palette → run still works.
Deferred to C0b (the remaining run.Ports executor wiring): Critic (soft-timeout
monitor + deadline binding + steer), Delivery (output egress for surfaces that
need executor-side delivery), Checkpointer (needs a majordomo message-history
hook to snapshot resumable state). The `.agent run` canary delivers its returned
Result.Output itself, so these aren't on its critical path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds GADFLY_FINDINGS_URL / GADFLY_FINDINGS_TOKEN (user-scope secrets) so each review POSTs its run + findings to the gadfly-reports store, and bumps the pinned gadfly image to sha-d7f364d (the build carrying the findings-emit). Advisory only — emit failures never affect the review.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
All 3 cloud models converged (all "minor" — example code, no blocking):
- Consolidate: a model whose every lens errored now reads "review incomplete",
not a misleading "no issues found" (all 3 models). + test.
- Consolidate: swarm-cancelled (unattributed) cells now surface a "swarm
cancelled — N cell(s) did not run" banner instead of vanishing (all 3). + test.
- main: io.ReadAll(os.Stdin) error is surfaced (all 3); a TTY stdin no longer
hangs forever (TTY guard, minimax).
- providerOf: a bare tier name now keys its own PerKey bucket instead of all
bare tiers collapsing onto "tier" (minimax, glm-5.2) — distinct tiers throttle
independently.
- Review doc reworded (the closure, not fanout, carries per-cell errors).
Left as documented example-scope behavior: no per-cell timeout (caller supplies
ctx), unknown-severity → lowest rank (no crash).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
examples/reviewer proves the core is sufficient for a static-binary light host
(gadfly's shape) with NO batteries:
- config.Env + model.Configure -> env-driven model fleet + tier overrides
- model.ParseModelForContext -> tier resolution + failover
- fanout.Run (PerKey caps) -> N models x M lenses swarm, per-provider bound
- model.GenerateWith[T] -> structured findings per (model, lens) cell
- Consolidate -> one verdict-led report section per model
Hermetic test runs the full 2x3 swarm against majordomo's fake provider and
asserts the consolidated verdicts. A go list -deps CI check asserts the canary
imports ZERO batteries (the light-tier invariant) — gadfly's go.sum stays free
of gorm/redis/discordgo/sqlite. README + docs updated.
This is the canary; migrating the LIVE gadfly repo onto executus core is a
follow-up (kept separate to not destabilize the active reviewer).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Merges the skill half of the persona/skill pair plus the second nested module.
(Squashed onto main from phase-4b-skill; the audit/budget/persona batteries it
was stacked on already landed via the P4 merge.)
- skill/: clean-redesign Skill noun + LEAN SkillStore (lifecycle/versions/
schedule only) + ToRunnable + Memory default.
- contrib/store/: separate go.mod carrying modernc.org/sqlite, so the driver
never enters the core go.sum. db.Budget()/Personas()/Skills()/Audit() back
all four store seams (JSON-blob + indexed columns; round-trip tested).
Includes the verified gadfly #5 fixes (AppendVersion tx+UNIQUE+error,
Mark*ScheduledRun atomic json_set, busy_timeout, NaN guard).
- CI: builds + tests the nested module and asserts it owns the sqlite driver.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Security (all 3 models — HIGH): audit OnTool persisted raw tool args + results
verbatim for the very tools the OnStep narration-redaction flags as secret
(mcp_call/email_send/http_*) — the args/results are what CARRY the secret, so
they landed in skill_run_logs unredacted. Factored the predicate into
isSecretTool() (single source of truth) and OnTool now emits
args_redacted/result_redacted (+ lengths) for secret tools. Test asserts no
secret reaches the log. (persona) webhook_ip_allowlist entries are now
CIDR/IP-validated at load (malformed dropped + warned) instead of accepted raw.
Contract correctness (glm-5.2 + deepseek) — audit Memory now honors its
documented Storage contract: ListChildrenByParent/ListFinishedRunsBefore return
oldest-first; WalkParentChain returns root-first and honors MaxParentChainDepth;
ListRunsFiltered clamps limit (<=0 or >500 -> 50); ListFinishedRunsBefore with
limit<=0 returns none; an explicit RunFilter.Status (incl. "dry_run") matches
regardless of IncludeDryRun; LastRunBySkills counts only status=="ok" unless
includeFailed. (PurgeOlderThan's FinishedAt key is the SAFE behavior — in-flight
runs retained — so the doc was aligned to it, not the impl.)
Error-handling: appendLog now uses a bounded context (auditAppendTimeout=3s) so
a hung backend can't block the run goroutine on the hot path; Sink.StartRun
logs its (still best-effort) failure instead of swallowing it; budget Memory.Get
uses RLock (RWMutex); budget package doc fixed (was skillexec's); Check uses the
budgetWindow constant, not a duplicated literal.
Triaged false-positive: NewNoOpBudget returning BudgetTracker is assignable to
run.Budget (identical method sets) — no change needed.
Core go.sum still free of host/DB deps.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The headline P4 piece (clean redesign): the Agent persona noun, decoupled from
its Discord shell.
- agent.go/storage.go/builtin_loader.go moved from mort's pkg/logic/agents; the
Storage seam drops the Discord CommandBindingStorage embedding (a host
concern). The host-entangled files (commands, chatbot_provider, command-
binding dispatcher, personalization, system) stay in mort.
- runnable.go: Agent.ToRunnable() lowers a persona into run.RunnableAgent — the
bridge that lets run.Executor run a persona without importing this battery
(the inversion of agentexec.Run(*agents.Agent)).
- memory.go: NewMemory() — zero-dep in-process persona Storage (all 11 CRUD +
trigger-query methods).
Tests: ToRunnable field/phase mapping; Memory round-trip. CI invariant: core
imports ZERO from persona.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Second Tier-2 battery, plugging into run.Ports.Budget:
- budget.go: skillexec's BudgetTracker / NoOpBudget / DBBudget moved clean
(stdlib only). Check/Commit match run.Budget exactly (compile-time proof in
run.go: NoOpBudget and *DBBudget are run.Budget).
- storage.go: the BudgetStorage seam + SkillBudget domain, split out of mort's
GORM file (the GORM impl stays in mort).
- memory.go: NewMemory() — zero-dependency in-process BudgetStorage with the
7-day rolling-window rollover in Add.
Tests: per-user cap enforced, window rolls over after 7 days, NoOp always
allows. CI invariant: core imports ZERO from the budget battery.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
First Tier-2 battery, plugging into run.Ports.Audit:
- storage.go/writer.go: skillaudit's Storage interface + per-run Writer moved
clean (only utils->fmt); the Writer already matches run.RunRecorder's shape.
- sink.go: Sink adapts a Storage to run.Audit (StartRun -> a run row + a Writer
wrapped as run.RunRecorder, converting run.RunStats on Close). NewSink(nil) is
equivalent to no audit. Compile-time proofs: Sink is run.Audit, recorder is
run.RunRecorder.
- memory.go: NewMemory() — a zero-dependency, queryable in-process Storage
(retains runs + logs; all 17 read/filter/purge/walk methods) so a light host
gets run history with no setup. Mort keeps its GORM Storage; contrib/store
adds durable SQLite at P4.
End-to-end test: wire audit.NewSink(audit.NewMemory()) into the executor, run an
agent, and the run is recorded with terminal status/output and queryable by
caller. CI invariant verified: core imports ZERO from the audit battery (proper
battery direction; battery imports core, never the reverse).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
All 3 cloud models converged on a real access-control bug; fixed it + the
other genuine findings (the false-positives were dropped):
Security (HIGH — all 3 models):
- create_file_url skipped ValidateScope: a same-skill caller could mint a
PUBLIC url for a file scoped to another user/run. Now runs ValidateScope
(admin-aware), skipped only for the descendant-grant case — mirroring the
read tools.
Other real fixes:
- ValidateScope hard-coded `false` at every call site (admin branch dead) ->
pass inv.CallerIsAdmin (the executor sets it via the host AdminPolicy; still
false/fail-closed when no admin). Stale "no admin flag" comment corrected.
- create_file_url: ExpiresInSeconds clamped BEFORE the *time.Second multiply
(huge values overflowed to a negative duration that slipped under the cap,
minting already-expired tokens); swallowed json.Marshal error now returned.
- RegisterMeta: build the default budget WITH the configured MaxPerRun (was
NewInMemorySearchBudget(nil) -> hardcoded 10, ignoring MetaDeps.MaxPerRun).
- classify: all-zero scores no longer return a false-positive top-1 winner;
coerceClassifyScore uses strconv.ParseFloat (rejects trailing garbage like
"50extra" that fmt.Sscanf silently accepted).
- file_delete: honor the descendant grant (parent can clean up a worker's
artifacts) — was the lone cross-skill-reject-outright file tool.
- meta tools: input caps truncate at a UTF-8 rune boundary (truncateUTF8), not
mid-rune.
- think: removed the dead `var _ = fmt.Errorf` import-keeper; file_save default
aligned to 16 MiB (matched RegisterStore).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
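The ExpiresInSeconds overflow fixed in this commit, shown as multiply-then-cap beside clamp-then-multiply. This is an added example, not the project's own code: the function names, the 24-hour cap, the 15-minute default and the sample input are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultTokenTTL = 15 * time.Minute // assumed default for this example
	maxTokenTTL     = 24 * time.Hour   // assumed cap for this example

	// overflowingSeconds is a constructed input: multiplied by 1e9
	// (nanoseconds per second) it exceeds math.MaxInt64 and wraps negative.
	overflowingSeconds int64 = 9223372037
)

// ttlMultiplyThenCap is the 'before' shape: the multiplication happens first,
// so a huge request wraps to a negative Duration, and a negative value is
// trivially below the cap.
func ttlMultiplyThenCap(expiresInSeconds int64) time.Duration {
	d := time.Duration(expiresInSeconds) * time.Second
	if d > maxTokenTTL {
		d = maxTokenTTL
	}
	return d
}

// ttlClampThenMultiply bounds the integer in seconds before converting, so
// the multiplication can never overflow.
func ttlClampThenMultiply(expiresInSeconds int64) time.Duration {
	maxSecs := int64(maxTokenTTL / time.Second)
	if expiresInSeconds <= 0 {
		return defaultTokenTTL
	}
	if expiresInSeconds > maxSecs {
		expiresInSeconds = maxSecs
	}
	return time.Duration(expiresInSeconds) * time.Second
}

// alreadyExpired reports whether a token minted at a fixed reference time
// with this TTL would be expired at the moment it is issued.
func alreadyExpired(ttl time.Duration) bool {
	now := time.Unix(1700000000, 0) // fixed reference time
	return !now.Add(ttl).After(now)
}

func main() {
	for _, secs := range []int64{3600, 10 * 24 * 3600, overflowingSeconds} {
		before := ttlMultiplyThenCap(secs)
		after := ttlClampThenMultiply(secs)
		fmt.Printf("request=%ds  multiply-then-cap=%v (expired=%v)  clamp-then-multiply=%v (expired=%v)\n",
			secs, before, alreadyExpired(before), after, alreadyExpired(after))
	}
}
```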
RegisterStore(reg, StoreDeps) registers the persistent-memory tools over the
host's KV and/or File backends:
- kv_get/set/list/delete (KVStorage seam)
- file_save/get/get_text/get_metadata/list/delete (FileStorage seam), plus
file_search (FileSearcher) and create_file_url (FileTokenMinter) when wired.
Near-zero-config: Quota defaults to a generous static cap (staticQuota), the
per-value/per-file caps default, and the kv vs file groups register
independently (a host can take just one). Seams moved clean (interface-only):
kv_storage.go, quota_provider.go, file_descendant_grant.go. The default
in-memory KV/File backends come with contrib/store at P4.
Core go.sum still free of gorm/redis/discordgo/sqlite.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Grow executus/tools into a real generic tool library:
- Register(reg): the always-available, zero-config tools — think, now (UTC
unless a CurrentTimeProvider is wired), cite (inert unless a CitationStorage
is wired). All nil-safe; a light host calls Register and is useful.
- RegisterMeta(reg, MetaDeps): the LLM-backed meta tools — classify,
extract_entities, summarize — over the llmmeta helper. Budget defaults to the
shipped in-memory per-run cap; Files optional; caps default.
- Seams moved (interface/type-only, no host coupling): research_providers.go
(CurrentTimeProvider/CitationStorage/SearchBudget/PageExtractor/PDFFetcher/…)
and file_storage.go (FileStorage + FileDomainMeta). Plus the in-memory budget
default (research_defaults.go) and scope_validate.go.
calculate deferred (drags github.com/Krognol/go-wolfram + a module-path replace
— not worth it in the lean core for one tool). Core go.sum still free of
gorm/redis/discordgo/sqlite/wolfram.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stand up executus/tools — the generic, host-agnostic tool library — and prove
the full pattern end to end:
- tools/tools.go: Register(reg) adds the always-available zero-dependency tools
(currently `think`). A light host calls it and is immediately useful; backed
tools (web/store/meta groups) will register via grouped registrars with
nil-safe Deps as they land.
- tools/think.go: the `think` tool moved from mort (imports only executus/tool).
- tools/integration_test.go: end-to-end proof that the executor runs an agent
which CALLS a registered tool — the fake model emits a `think` tool call, the
executor dispatches it through the registry, the model finalises, and the step
instrumentation captures the `think` step. Exercises the full tool-dispatch
loop through run.Executor.
Stacked on phase-2-run-kernel (P3 needs run.Executor). Remaining P3: the
meta/web/net/store/compose groups + their Deps + default backends (splitting
mort's default.go grab-bag).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bump the gadfly image to sha-d0de034 (adds GADFLY_PROVIDER_LENS_CONCURRENCY)
and move ollama-cloud's concurrency from the MODEL axis to the LENS axis:
- GADFLY_PROVIDER_CONCURRENCY: ollama-cloud=1 (one model at a time)
- GADFLY_PROVIDER_LENS_CONCURRENCY: ollama-cloud=3 (its 3 lenses concurrent)
Net: still 3 models, but reviewed serially — the first model's consolidated
comment lands sooner and each model finishes faster, while the other two
models' comments arrive in series after it (instead of all 3 in parallel).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Measured on the P2 review: the local Macs (m1/m5) took 26–29 min with lens
timeouts and found ZERO real bugs, while the two cloud models found every
genuine finding in 6–12 min. Drop the Macs; add glm-5.2:cloud as a third
cloud reviewer. Net: faster (~29→~12 min) and higher signal.
Models: minimax-m3:cloud, deepseek-v4-flash:cloud, glm-5.2:cloud
(ollama-cloud=3 concurrency). timeout-minutes 90→30.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Independently verified all 18 gadfly findings against the code (18-agent
fan-out). Fixed the 9 real ones; the other 9 were false-positive /
hallucinated / valid-tradeoff (no change).
High:
- F1 nil model: a Models resolver returning (ctx,nil,nil) flowed into the
agent loop and nil-panicked. Now a clean error (Run never panics). +test.
- F9 compactor data-leak: renderTranscript sent tool-call args verbatim to
the summarizer (a possibly-different provider/tier); secret-bearing tool
args (mcp_call/email_send/http_*/webhook_*) are now redacted, with a doc
note that result bodies still flow (summary needs them).
Medium/minor:
- F2 compactor error path returned the folded slice, not the original msgs
(contradicting the documented non-fatal contract) -> return msgs.
- F3 RunStats.Status only ok/error; now timeout (DeadlineExceeded) /
cancelled (Canceled) via statusFor. +test.
- F4 step-zip emitted empty-name "ghost" steps when results>calls; now pairs
min(calls,results) only.
- F5 SetIteration was never called -> RunState.Iteration always 0; the step
observer now updates it each loop.
- F6 matchPending fallback was LIFO; now FIFO (matches the per-key queue).
- F7 estimateTokens had no default arm (future Part kinds counted as 0);
unknown parts now counted conservatively.
- F8 cloud_sync silently truncated >1MiB responses -> opaque JSON error; now
a clear "response exceeded N bytes" via readCapped.
- F12 step observer captured the caller ctx; now the merged runCtx.
- F13 compaction onFire was nil (doc claimed it logged); now wired to
audit LogEvent("compaction_fired").
- F11 (no pre-dispatch hook in majordomo) documented honestly as a known
limitation; F18 UsageSink doc clarified cache tokens are subsets of input.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
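The secret-tool redaction described in F9 above and in the earlier audit OnTool commit, as an added example that is not the project's own code. It assumes a single isSecretTool predicate over the tool names those commits list; the record field names, the transcript line format and the sample call are constructed for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// isSecretTool is the single predicate both the audit path and the compactor
// path consult. The name set is the union of the tools named in the two commits.
func isSecretTool(name string) bool {
	switch name {
	case "mcp_call", "email_send":
		return true
	}
	return strings.HasPrefix(name, "http_") || strings.HasPrefix(name, "webhook_")
}

// toolCall is one completed tool dispatch (hypothetical shape).
type toolCall struct {
	Tool, Args, Result string
}

// auditEntry is a hypothetical persisted log row. For secret tools only the
// redaction markers and lengths are stored, never the payloads.
type auditEntry struct {
	Tool           string `json:"tool"`
	Args           string `json:"args,omitempty"`
	Result         string `json:"result,omitempty"`
	ArgsRedacted   bool   `json:"args_redacted,omitempty"`
	ResultRedacted bool   `json:"result_redacted,omitempty"`
	ArgsLen        int    `json:"args_len,omitempty"`
	ResultLen      int    `json:"result_len,omitempty"`
}

// auditEntryFor redacts both args and result for secret tools.
func auditEntryFor(c toolCall) auditEntry {
	if !isSecretTool(c.Tool) {
		return auditEntry{Tool: c.Tool, Args: c.Args, Result: c.Result}
	}
	return auditEntry{
		Tool:           c.Tool,
		ArgsRedacted:   true,
		ResultRedacted: true,
		ArgsLen:        len(c.Args),
		ResultLen:      len(c.Result),
	}
}

// auditJSON returns the row as it would be written to the log.
func auditJSON(c toolCall) string {
	b, err := json.Marshal(auditEntryFor(c))
	if err != nil {
		return `{"error":"marshal failed"}`
	}
	return string(b)
}

// transcriptLine renders a call for the summarizer. Args are redacted for
// secret tools; result bodies still flow because the summary needs them.
func transcriptLine(c toolCall) string {
	args := c.Args
	if isSecretTool(c.Tool) {
		args = fmt.Sprintf("[redacted, %d bytes]", len(c.Args))
	}
	return fmt.Sprintf("tool=%s args=%s result=%s", c.Tool, args, c.Result)
}

func main() {
	// Constructed sample calls; the bearer value is a placeholder.
	calls := []toolCall{
		{"think", `{"thought":"plan the next step"}`, "ok"},
		{"http_post", `{"url":"https://example.invalid","headers":{"Authorization":"Bearer CONSTRUCTED-TOKEN"}}`, `{"status":200}`},
	}
	for _, c := range calls {
		fmt.Println(auditJSON(c))
		fmt.Println(transcriptLine(c))
	}
}
```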