P1: model layer (convar->config inversion) + llmmeta

Lifts mort's pkg/logic/llms into executus/model, decoupled from mort:

- tiers.go: the tier resolver now reads a host-supplied config.Source under
  "model.tier.<name>" with host-supplied fallbacks (Configure(cfg, defaults,
  ttl)), instead of convar.Manager. Tier NAMES + specs are host config; the
  resolution mechanism (cache, reasoning-suffix dialect, chain validation) is
  generic. No tier names hard-coded in the harness.
- sink.go: usage/trace recording inverted off mort's llmusage/llmtrace into
  UsageSink / TraceSink seams + a model-owned Span, with nil-safe context
  attribution helpers (WithModel/WithTraceID/WithUsageTool/WithUsageUser).
  Both sinks optional (nil = off) so a light host records nothing.
- lane decoration repointed to executus/lane; utils.Errorf -> fmt.Errorf.
- call.go keeps GenerateWith[T] (instrumented structured output): this is the
  structured-output primitive; no separate structured/ package.
- llmmeta moved over model/ (the meta-LLM helper: tier allowlist + JSON retry +
  ledger). Its tests configure a minimal tier table via TestMain.

New tests cover the inversion: config overrides fallback, tier registration,
reasoning-suffix survival, nested-tier rejection, nil-sink no-ops.

Full module: go build/vet/test -race green; core go.sum still free of
gorm/redis/discordgo/sqlite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 19:47:13 -04:00
parent 741d7816ed
commit b424261aca
17 changed files with 3698 additions and 3 deletions
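The commit message describes tier resolution as "host config first, host-supplied fallback second" under keys of the form "model.tier.<name>". A minimal sketch of that lookup order follows. The `Source` interface, `mapSource` type, and `resolveTier` function are hypothetical stand-ins written for illustration, not the actual executus API, and the cache/TTL, reasoning-suffix, and chain-validation steps are left out.

```go
package main

import "fmt"

// Source is a hypothetical stand-in for the host-supplied config.Source;
// the real interface in executus may look different.
type Source interface {
	Get(key string) (string, bool)
}

// mapSource is a toy Source backed by a map (illustration only).
type mapSource map[string]string

func (m mapSource) Get(key string) (string, bool) {
	v, ok := m[key]
	return v, ok
}

// resolveTier returns the model spec for a tier name. A non-empty value
// stored under "model.tier.<name>" in cfg wins; otherwise the fallback
// table is consulted. ok is false when neither knows the tier.
func resolveTier(cfg Source, defaults map[string]string, name string) (string, bool) {
	if cfg != nil {
		if v, found := cfg.Get("model.tier." + name); found && v != "" {
			return v, true
		}
	}
	spec, ok := defaults[name]
	return spec, ok
}

func main() {
	defaults := map[string]string{"fast": "anthropic/claude-haiku-4-5"}
	cfg := mapSource{"model.tier.fast": "example/override-model"}

	fmt.Println(resolveTier(cfg, defaults, "fast"))      // config value wins
	fmt.Println(resolveTier(nil, defaults, "fast"))      // nil config: fallback
	fmt.Println(resolveTier(nil, defaults, "thinking")) // unknown tier: ok=false
}
```

Passing a nil config mirrors the TestMain setup in this commit, which calls Configure with a nil source and relies on the fallback table alone.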
@@ -0,0 +1,615 @@
+// Package llmmeta is the shared meta-LLM helper used by the v12
+// authoring tools (summarize, translate, extract_entities, classify).
+//
+// Why a dedicated package: each of those four tools makes "one fast-tier
+// LLM call → typed result", with shared concerns (tier allowlist,
+// ledger row, JSON-retry on malformed output). Centralising the pattern
+// stops every tool from re-implementing the surrounding bookkeeping and
+// keeps the audit trail uniform.
+//
+// The helper itself does NOT know about the four tools — it just exposes
+// a Call(ctx, CallSpec) → CallResult shape. Each tool builds its own
+// prompt + parses the typed result. The helper records the meta-call
+// ledger row on every call, success or failure.
+//
+// Concurrency / lanes: the helper resolves the tier to an llm.Model via
+// model.ParseModelForContext and uses model.Generate. Lane routing is
+// already baked in at the LLM transport layer (see
+// pkg/logic/llms/lane_transport.go) so each Generate call automatically
+// goes through the right lane without further plumbing. Usage recording
+// is automatic too: parsed models are instrumented by the model package
+// (lifted from pkg/logic/llms), so the helper does NOT call
+// model.RecordUsage itself.
+//
+// Tier allowlist: convar `skills.llm_meta.allowed_tiers` (default
+// `["fast"]`) controls which tiers a meta-tool may use. A request for
+// a disallowed tier returns error_kind="tier_not_allowed" WITHOUT
+// making the call AND WITHOUT recording a ledger row (the call did
+// not happen).
+//
+// Test: helper_test.go covers tier allowed, tier rejected, JSON
+// retry path, malformed-twice path, and ledger-row emission semantics.
+package llmmeta
+
+import (
+	"context"
+	"encoding/json"
+	"fmt"
+	"strings"
+	"time"
+
+	llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
+	"github.com/google/uuid"
+
+	"gitea.stevedudenhoeffer.com/steve/executus/model"
+)
+
+// MetaCall is the domain row written to skill_llm_meta_calls on every
+// helper call.
+//
+// Why a dedicated table (not skill_run_logs): per-skill token
+// aggregation is cleaner with typed columns. Folding meta-calls into
+// the generic event log would force a SUM-from-JSON path on every
+// dashboard query.
+//
+// Why the field set is tight (no payload columns): the request bodies
+// can be 32KB+. The agent's main run already captures system_prompt
+// + user_message in the trace; storing them again here would double
+// the audit footprint with no diagnostic value (the meta-call's
+// inputs are derivable from the parent run's tool-call args).
+type MetaCall struct {
+	ID           string
+	RunID        string
+	SkillID      string
+	ToolName     string
+	TierUsed     string // "fast" / "standard"
+	ModelUsed    string // resolved provider/model
+	InputTokens  int
+	OutputTokens int
+	DurationMs   int
+	Success      bool
+	ErrorKind    string // empty on success; one of the sentinel kinds otherwise
+	CreatedAt    time.Time
+}
+
+// Storage is the narrow surface the helper uses to persist meta-call
+// ledger rows. Production wires a thin adapter around the skills GORM
+// storage; tests substitute a fake.
+//
+// Why an interface (vs depending on pkg/logic/skills.Storage): the
+// skills package imports skilltools (tool registry); having
+// skilltools/llmmeta depend back on skills would form an import
+// cycle. A narrow interface mirrored across the boundary is the
+// project's standard cycle-break pattern (see KVStorage / FileStorage
+// in pkg/skilltools/tools/).
+type Storage interface {
+	RecordMetaCall(ctx context.Context, call MetaCall) error
+}
+
+// ConvarReader is the narrow surface the helper uses to read
+// `skills.llm_meta.allowed_tiers`. The convar package is database-
+// backed; tests pass a static fake.
+//
+// Why an interface (vs reading convars directly): unit tests want to
+// fake the allowlist without spinning up a convar manager.
+type ConvarReader interface {
+	// AllowedTiers returns the list of tier names a meta-tool may use.
+	// Default ["fast"].
+	AllowedTiers(ctx context.Context) []string
+}
+
+// ConvarReaderFunc adapts a closure into a ConvarReader. Useful in
+// production wiring (mort.go) where the underlying access is a
+// single line of logic.
+type ConvarReaderFunc func(ctx context.Context) []string
+
+// AllowedTiers satisfies ConvarReader.
+func (f ConvarReaderFunc) AllowedTiers(ctx context.Context) []string {
+	if f == nil {
+		return []string{"fast"}
+	}
+	return f(ctx)
+}
+
+// Helper makes one fast-tier LLM call with surrounding bookkeeping
+// (tier allowlist, JSON retry, ledger row).
+//
+// Construct once at boot; all four meta-tools share the same Helper.
+type Helper struct {
+	storage Storage
+	convars ConvarReader
+}
+
+// New constructs a Helper. storage should be non-nil: with a nil
+// storage every ledger write is silently skipped (callers that need a
+// fully no-op helper should instead avoid registering the tool).
+//
+// convars may be nil — the helper falls back to the default allowlist
+// `["fast"]`.
+//
+// Why a constructor with explicit deps (vs Helper{...} struct
+// initialiser): forces the deployment-time decision about which
+// dependencies are wired vs nil-safe at the construction call site,
+// not at the call site of each tool.
+func New(storage Storage, convars ConvarReader) *Helper {
+	return &Helper{
+		storage: storage,
+		convars: convars,
+	}
+}
+
+// CallSpec is the per-call input.
+//
+// Why every field is explicit (vs builder pattern): the four meta-tools
+// each populate the spec in one place; a struct literal at the call
+// site is more readable than chained setters.
+type CallSpec struct {
+	// Tier is the tier alias to use ("fast" / "standard"). Empty falls
+	// back to "fast". Disallowed tiers (per the convar allowlist) cause
+	// Call to return CallResult{Success: false, ErrorKind:
+	// "tier_not_allowed"} WITHOUT making the LLM call AND without
+	// writing a ledger row (the call did not happen).
+	Tier string
+
+	// SystemPrompt is the system message. May be empty.
+	SystemPrompt string
+
+	// UserPrompt is the user message. Required.
+	UserPrompt string
+
+	// MaxOutputTokens caps the response. 0 disables the cap (provider
+	// default). The helper uses this both to bound the cost estimate
+	// AND to set llm.WithMaxTokens on the request.
+	MaxOutputTokens int
+
+	// ResponseFormat is "text" or "json". When "json", the helper
+	// attempts to parse the response into JSON. Other values fall
+	// through as "text".
+	ResponseFormat string
+
+	// RetryOnMalformedJSON, when true and ResponseFormat=="json",
+	// retries the call ONCE with a stricter JSON-only prompt prefix
+	// when the first response fails to parse. Second-failure returns
+	// CallResult{Success: true, Parsed: nil, ErrorKind:
+	// "malformed_json"} so callers can fall back to result.Text.
+	RetryOnMalformedJSON bool
+
+	// ToolName is the meta-tool name recorded in the ledger row
+	// ("summarize", "translate", "extract_entities", "classify"). The
+	// helper does not branch on this value.
+	ToolName string
+
+	// RunID is the calling skill run ID. Recorded in the ledger row;
+	// also used by the cost-cap callback to find the running 7-day
+	// total.
+	RunID string
+
+	// SkillID is the calling skill ID. Recorded in the ledger row;
+	// passed to the cost-cap callback.
+	SkillID string
+
+	// CallerID is the Discord member ID that triggered the parent
+	// skill run. Passed to the cost-cap callback so the per-user
+	// 7-day cap can be evaluated.
+	CallerID string
+}
+
+// CallResult is the per-call output.
+//
+// Why text + parsed (vs only one): JSON-format calls expose both the
+// raw response (in .Text) and the parsed map (in .Parsed). Text-format
+// calls leave .Parsed nil. Callers requesting JSON that fails to parse
+// twice get .Text populated and ErrorKind="malformed_json" so they
+// can fall back to text-mode without an error path.
+type CallResult struct {
+	// Text is the raw response text from the LLM. Populated whenever
+	// the LLM returned a response (success=true), including when the
+	// JSON parse failed (parsed=nil, error_kind="malformed_json"). When
+	// the retry call itself fails, Text holds the first response.
+	// Empty on tier_not_allowed rejections (no LLM call happened) and
+	// when the first call fails.
+	Text string
+
+	// Parsed is the JSON-decoded response. nil for text-format calls,
+	// nil for failed JSON parses, populated for successful JSON
+	// responses. The interior shape is whatever the LLM returned; the
+	// caller is responsible for asserting a typed view.
+	Parsed any
+
+	// InputTokens is the tokens billed against the input. 0 when the
+	// provider didn't surface usage.
+	InputTokens int
+
+	// OutputTokens is the tokens billed against the output. 0 when the
+	// provider didn't surface usage.
+	OutputTokens int
+
+	// DurationMs is wall-clock duration of the LLM call (or call+retry
+	// in the JSON-retry case).
+	DurationMs int
+
+	// ModelUsed is the resolved provider/model string ("anthropic/
+	// claude-haiku-4-5-20251001"). Populated on every actual LLM call;
+	// empty on tier_not_allowed rejections.
+	ModelUsed string
+
+	// Success reports whether the LLM call returned a usable response.
+	// True on the happy path AND when the JSON parse failed, with or
+	// without a retry (the caller can fall back to .Text). False on
+	// tier_not_allowed and on llm_unavailable (transport errors).
+	Success bool
+
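+	// ErrorKind, when non-empty, is one of:
+	//   - "tier_not_allowed" → no call; no ledger row when the allowlist
+	//     rejects the tier, but a row IS written when an allowed tier
+	//     fails to resolve to a model
+	//   - "llm_unavailable"  → call attempted, ledger row written
+	//   - "malformed_json"   → call succeeded but JSON parse failed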
+	ErrorKind string
+}
+
+// Sentinel error_kind values for CallResult.ErrorKind.
+const (
+	ErrorKindTierNotAllowed = "tier_not_allowed"
+	ErrorKindLLMUnavailable = "llm_unavailable"
+	ErrorKindMalformedJSON  = "malformed_json"
+)
+
+// Call performs the meta-LLM call and returns a typed CallResult.
+//
+// Why failures travel in CallResult (vs the error return): every
+// meaningful failure is captured as a CallResult.ErrorKind so the
+// caller's branch logic stays single-pathed. Internal transport errors
+// are surfaced as ErrorKind=llm_unavailable. The function only returns
+// a non-nil error for argument-validation failures (empty UserPrompt),
+// a programmer error the caller would have to fix anyway.
+//
+// Test: helper_test.go covers all outcomes (tier_not_allowed, happy
+// text, happy json, malformed_json retry-pass, malformed_json
+// retry-fail, llm_unavailable).
+func (h *Helper) Call(ctx context.Context, spec CallSpec) (CallResult, error) {
+	if strings.TrimSpace(spec.UserPrompt) == "" {
+		return CallResult{}, fmt.Errorf("llmmeta: user_prompt required")
+	}
+	tier := strings.TrimSpace(spec.Tier)
+	if tier == "" {
+		tier = "fast"
+	}
+
+	// Tier allowlist: rejected tiers do NOT make the call AND do NOT
+	// record a ledger row.
+	if !h.tierAllowed(ctx, tier) {
+		return CallResult{
+			Success:   false,
+			ErrorKind: ErrorKindTierNotAllowed,
+		}, nil
+	}
+
+	resolvedModel := model.ResolveModelName(tier)
+
+	// Resolve model. ParseModelForContext attaches the resolved model
+	// name to ctx (for usage attribution) AND returns the llm.Model
+	// whose Generate already routes through the lane wrapper.
+	ctx, model, err := model.ParseModelForContext(ctx, tier)
+	if err != nil {
+		// Tier convar mis-set: surface as tier_not_allowed to the
+		// caller (the agent's recovery path is the same as for an
+		// admin-disabled tier) but DO record the failure for the
+		// admin who needs to fix the convar.
+		h.recordLedger(ctx, MetaCall{
+			ID:        uuid.NewString(),
+			RunID:     spec.RunID,
+			SkillID:   spec.SkillID,
+			ToolName:  spec.ToolName,
+			TierUsed:  tier,
+			ModelUsed: resolvedModel,
+			Success:   false,
+			ErrorKind: ErrorKindTierNotAllowed,
+			CreatedAt: time.Now(),
+		})
+		return CallResult{
+			Success:   false,
+			ErrorKind: ErrorKindTierNotAllowed,
+		}, nil
+	}
+
+	// First call.
+	start := time.Now()
+	systemPrompt := spec.SystemPrompt
+	userMessage := spec.UserPrompt
+	opts := []llm.Option{}
+	if spec.MaxOutputTokens > 0 {
+		opts = append(opts, llm.WithMaxTokens(spec.MaxOutputTokens))
+	}
+	text, usage, llmErr := h.complete(ctx, model, systemPrompt, userMessage, opts)
+	if llmErr != nil {
+		duration := int(time.Since(start) / time.Millisecond)
+		h.recordLedger(ctx, MetaCall{
+			ID:           uuid.NewString(),
+			RunID:        spec.RunID,
+			SkillID:      spec.SkillID,
+			ToolName:     spec.ToolName,
+			TierUsed:     tier,
+			ModelUsed:    resolvedModel,
+			InputTokens:  usage.InputTokens,
+			OutputTokens: usage.OutputTokens,
+			DurationMs:   duration,
+			Success:      false,
+			ErrorKind:    ErrorKindLLMUnavailable,
+			CreatedAt:    time.Now(),
+		})
+		return CallResult{
+			Success:      false,
+			ErrorKind:    ErrorKindLLMUnavailable,
+			ModelUsed:    resolvedModel,
+			DurationMs:   duration,
+			InputTokens:  usage.InputTokens,
+			OutputTokens: usage.OutputTokens,
+		}, nil
+	}
+
+	// Determine outcome based on response format.
+	parsed, parsedOK := tryParseJSON(text, spec.ResponseFormat)
+	wantJSON := strings.EqualFold(spec.ResponseFormat, "json")
+
+	if !wantJSON || parsedOK {
+		// Happy path (text mode OR JSON mode that parsed first try).
+		duration := int(time.Since(start) / time.Millisecond)
+		h.recordLedger(ctx, MetaCall{
+			ID:           uuid.NewString(),
+			RunID:        spec.RunID,
+			SkillID:      spec.SkillID,
+			ToolName:     spec.ToolName,
+			TierUsed:     tier,
+			ModelUsed:    resolvedModel,
+			InputTokens:  usage.InputTokens,
+			OutputTokens: usage.OutputTokens,
+			DurationMs:   duration,
+			Success:      true,
+			CreatedAt:    time.Now(),
+		})
+		return CallResult{
+			Text:         text,
+			Parsed:       parsed,
+			Success:      true,
+			ModelUsed:    resolvedModel,
+			InputTokens:  usage.InputTokens,
+			OutputTokens: usage.OutputTokens,
+			DurationMs:   duration,
+		}, nil
+	}
+
+	// JSON requested but first response failed to parse.
+	if !spec.RetryOnMalformedJSON {
+		duration := int(time.Since(start) / time.Millisecond)
+		h.recordLedger(ctx, MetaCall{
+			ID:           uuid.NewString(),
+			RunID:        spec.RunID,
+			SkillID:      spec.SkillID,
+			ToolName:     spec.ToolName,
+			TierUsed:     tier,
+			ModelUsed:    resolvedModel,
+			InputTokens:  usage.InputTokens,
+			OutputTokens: usage.OutputTokens,
+			DurationMs:   duration,
+			Success:      true,
+			ErrorKind:    ErrorKindMalformedJSON,
+			CreatedAt:    time.Now(),
+		})
+		return CallResult{
+			Text:         text,
+			Success:      true,
+			ErrorKind:    ErrorKindMalformedJSON,
+			ModelUsed:    resolvedModel,
+			InputTokens:  usage.InputTokens,
+			OutputTokens: usage.OutputTokens,
+			DurationMs:   duration,
+		}, nil
+	}
+
+	// Retry once with stricter JSON-only prompt prefix.
+	stricterPrompt := "Return ONLY valid JSON. No prose, no markdown fencing.\n\n" + userMessage
+	text2, usage2, llmErr2 := h.complete(ctx, model, systemPrompt, stricterPrompt, opts)
+	combinedUsage := Tokens{
+		InputTokens:  usage.InputTokens + usage2.InputTokens,
+		OutputTokens: usage.OutputTokens + usage2.OutputTokens,
+	}
+	duration := int(time.Since(start) / time.Millisecond)
+	if llmErr2 != nil {
+		// Retry call itself failed transport-wise. Record the round-
+		// trip tokens and surface llm_unavailable.
+		h.recordLedger(ctx, MetaCall{
+			ID:           uuid.NewString(),
+			RunID:        spec.RunID,
+			SkillID:      spec.SkillID,
+			ToolName:     spec.ToolName,
+			TierUsed:     tier,
+			ModelUsed:    resolvedModel,
+			InputTokens:  combinedUsage.InputTokens,
+			OutputTokens: combinedUsage.OutputTokens,
+			DurationMs:   duration,
+			Success:      false,
+			ErrorKind:    ErrorKindLLMUnavailable,
+			CreatedAt:    time.Now(),
+		})
+		return CallResult{
+			Text:         text,
+			Success:      false,
+			ErrorKind:    ErrorKindLLMUnavailable,
+			ModelUsed:    resolvedModel,
+			InputTokens:  combinedUsage.InputTokens,
+			OutputTokens: combinedUsage.OutputTokens,
+			DurationMs:   duration,
+		}, nil
+	}
+
+	parsed2, parsedOK2 := tryParseJSON(text2, "json")
+	if parsedOK2 {
+		h.recordLedger(ctx, MetaCall{
+			ID:           uuid.NewString(),
+			RunID:        spec.RunID,
+			SkillID:      spec.SkillID,
+			ToolName:     spec.ToolName,
+			TierUsed:     tier,
+			ModelUsed:    resolvedModel,
+			InputTokens:  combinedUsage.InputTokens,
+			OutputTokens: combinedUsage.OutputTokens,
+			DurationMs:   duration,
+			Success:      true,
+			CreatedAt:    time.Now(),
+		})
+		return CallResult{
+			Text:         text2,
+			Parsed:       parsed2,
+			Success:      true,
+			ModelUsed:    resolvedModel,
+			InputTokens:  combinedUsage.InputTokens,
+			OutputTokens: combinedUsage.OutputTokens,
+			DurationMs:   duration,
+		}, nil
+	}
+
+	// Second-failure path. Caller can fall back to result.Text.
+	h.recordLedger(ctx, MetaCall{
+		ID:           uuid.NewString(),
+		RunID:        spec.RunID,
+		SkillID:      spec.SkillID,
+		ToolName:     spec.ToolName,
+		TierUsed:     tier,
+		ModelUsed:    resolvedModel,
+		InputTokens:  combinedUsage.InputTokens,
+		OutputTokens: combinedUsage.OutputTokens,
+		DurationMs:   duration,
+		Success:      true,
+		ErrorKind:    ErrorKindMalformedJSON,
+		CreatedAt:    time.Now(),
+	})
+	return CallResult{
+		Text:         text2,
+		Success:      true,
+		ErrorKind:    ErrorKindMalformedJSON,
+		ModelUsed:    resolvedModel,
+		InputTokens:  combinedUsage.InputTokens,
+		OutputTokens: combinedUsage.OutputTokens,
+		DurationMs:   duration,
+	}, nil
+}
+
+// Tokens is the input/output token count returned by the LLM round-
+// trip. Mirrors llm.Usage's two cost-bearing fields. Exported so
+// downstream test code (the four meta-tools' tests, integration
+// tests) can use SetCompleteForTest.
+type Tokens struct {
+	InputTokens  int
+	OutputTokens int
+}
+
+// CompleteFn is the seam used by tests to fake the LLM round-trip
+// without spinning up a real provider. Exported for tests in other
+// packages (the four meta-tools live in pkg/skilltools/tools/).
+type CompleteFn func(ctx context.Context, model llm.Model, systemPrompt, userMessage string, opts []llm.Option) (string, Tokens, error)
+
+// completeOverride is set in tests via SetCompleteForTest. nil falls
+// back to the real model.Generate path.
+var completeOverride CompleteFn
+
+// complete is the actual LLM round-trip. Calls model.Generate (which
+// already routes through the lane transport wrapper) and returns the
+// text + usage + error.
+//
+// Why not call model.SimpleCall: SimpleCall doesn't surface Usage; we
+// need the input/output token counts for the ledger row.
+//
+// Usage attribution to the per-user / per-skill dashboards is handled
+// by the instrumented model that model.ParseModelForContext returns —
+// a manual model.RecordUsage here would double-count.
+func (h *Helper) complete(ctx context.Context, model llm.Model, systemPrompt, userMessage string, opts []llm.Option) (string, Tokens, error) {
+	if completeOverride != nil {
+		return completeOverride(ctx, model, systemPrompt, userMessage, opts)
+	}
+	req := llm.Request{
+		System:   systemPrompt,
+		Messages: []llm.Message{llm.UserText(userMessage)},
+	}
+	resp, err := model.Generate(ctx, req, opts...)
+	if err != nil {
+		return "", Tokens{}, err
+	}
+	usage := Tokens{
+		InputTokens:  resp.Usage.InputTokens,
+		OutputTokens: resp.Usage.OutputTokens,
+	}
+	return resp.Text(), usage, nil
+}
+
+// SetCompleteForTest installs a fake completer used by Call. Returns a
+// restore function that the test defers to revert the override.
+//
+// Why exported (vs in a _test.go file): the four meta-tools' tests live
+// in pkg/skilltools/tools/, in a different package than the helper.
+// They need a way to fake the LLM without depending on a real model.
+func SetCompleteForTest(fn CompleteFn) func() {
+	prev := completeOverride
+	completeOverride = fn
+	return func() { completeOverride = prev }
+}
+
+// tierAllowed reports whether the given tier appears in the configured
+// allowlist. Empty allowlist defaults to ["fast"].
+func (h *Helper) tierAllowed(ctx context.Context, tier string) bool {
+	var allowed []string
+	if h.convars != nil {
+		allowed = h.convars.AllowedTiers(ctx)
+	}
+	if len(allowed) == 0 {
+		allowed = []string{"fast"}
+	}
+	for _, t := range allowed {
+		if strings.EqualFold(strings.TrimSpace(t), tier) {
+			return true
+		}
+	}
+	return false
+}
+
+// recordLedger writes one meta-call row. Storage failures are logged
+// at the storage layer; the helper does not propagate them — meta-call
+// accounting MUST NOT break user-visible execution.
+func (h *Helper) recordLedger(ctx context.Context, call MetaCall) {
+	if h.storage == nil {
+		return
+	}
+	_ = h.storage.RecordMetaCall(ctx, call)
+}
+
+// tryParseJSON attempts to decode text as JSON. Returns the parsed
+// value (any) and ok=true on success. ok=false on failure or when
+// format is not "json".
+//
+// Why we accept arbitrary JSON shapes (vs requiring an object): the
+// extract_entities tool returns objects, but classify returns objects
+// with arrays inside. Accepting `any` keeps the helper agnostic to the
+// caller's downstream typing.
+//
+// Tolerance: strips a leading "```json" code fence + matching closing
+// fence so a fenced response still parses. Only a fence at the very
+// start of the response is handled; prose before the fence still
+// fails the parse. The stricter retry prompt explicitly asks for no
+// fence, but the same tolerance applies to the retry response too.
+func tryParseJSON(text, format string) (any, bool) {
+	if !strings.EqualFold(format, "json") {
+		return nil, false
+	}
+	trimmed := strings.TrimSpace(text)
+	// Strip optional ```json ... ``` fence.
+	if strings.HasPrefix(trimmed, "```") {
+		// Drop opening fence (with or without language tag).
+		if idx := strings.Index(trimmed, "\n"); idx >= 0 {
+			trimmed = trimmed[idx+1:]
+		}
+		// Drop trailing fence.
+		if idx := strings.LastIndex(trimmed, "```"); idx >= 0 {
+			trimmed = trimmed[:idx]
+		}
+		trimmed = strings.TrimSpace(trimmed)
+	}
+	var parsed any
+	if err := json.Unmarshal([]byte(trimmed), &parsed); err != nil {
+		return nil, false
+	}
+	return parsed, true
+}
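The fence tolerance in tryParseJSON is easiest to judge against concrete inputs. The following standalone sketch repeats the same stripping steps under a hypothetical name, `parseLenient`, and drops the format argument. The fence string is built with strings.Repeat only so this example does not itself contain a literal fence.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// fence is three backticks, assembled at runtime for readability here.
var fence = strings.Repeat("`", 3)

// parseLenient (hypothetical name) mirrors the fence handling above:
// when the text starts with a fence it drops the first line and
// everything from the last fence onward, then decodes the remainder.
func parseLenient(text string) (any, bool) {
	trimmed := strings.TrimSpace(text)
	if strings.HasPrefix(trimmed, fence) {
		if idx := strings.Index(trimmed, "\n"); idx >= 0 {
			trimmed = trimmed[idx+1:]
		}
		if idx := strings.LastIndex(trimmed, fence); idx >= 0 {
			trimmed = trimmed[:idx]
		}
		trimmed = strings.TrimSpace(trimmed)
	}
	var parsed any
	if err := json.Unmarshal([]byte(trimmed), &parsed); err != nil {
		return nil, false
	}
	return parsed, true
}

func main() {
	inputs := []string{
		fence + "json\n{\"x\":1}\n" + fence,                 // fenced object
		"[1, 2, 3]",                                         // bare top-level array
		"Here you go:\n" + fence + "json\n{\"x\":1}\n" + fence, // prose before the fence
	}
	for _, in := range inputs {
		_, ok := parseLenient(in)
		fmt.Printf("%q -> ok=%v\n", in, ok)
	}
}
```

The third input is the case a retry exists for. The response does not start with a fence, so nothing is stripped and the decode fails.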
@@ -0,0 +1,282 @@
+package llmmeta
+
+import (
+	"context"
+	"errors"
+	"strings"
+	"sync"
+	"testing"
+
+	llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
+)
+
+// fakeStorage records every MetaCall handed to RecordMetaCall and
+// makes them available to tests via the captured slice.
+type fakeStorage struct {
+	mu    sync.Mutex
+	calls []MetaCall
+	err   error
+}
+
+func (f *fakeStorage) RecordMetaCall(_ context.Context, call MetaCall) error {
+	f.mu.Lock()
+	defer f.mu.Unlock()
+	f.calls = append(f.calls, call)
+	return f.err
+}
+
+func (f *fakeStorage) snapshot() []MetaCall {
+	f.mu.Lock()
+	defer f.mu.Unlock()
+	out := make([]MetaCall, len(f.calls))
+	copy(out, f.calls)
+	return out
+}
+
+// TestCall_TierNotAllowed: a tier not in the allowlist returns the
+// rejection without recording a ledger row — the call did not happen.
+func TestCall_TierNotAllowed(t *testing.T) {
+	store := &fakeStorage{}
+	convars := ConvarReaderFunc(func(_ context.Context) []string {
+		return []string{"fast"}
+	})
+	h := New(store, convars)
+
+	res, err := h.Call(context.Background(), CallSpec{
+		Tier:       "thinking",
+		UserPrompt: "hello",
+		ToolName:   "summarize",
+	})
+	if err != nil {
+		t.Fatalf("unexpected err: %v", err)
+	}
+	if res.Success {
+		t.Errorf("expected Success=false")
+	}
+	if res.ErrorKind != ErrorKindTierNotAllowed {
+		t.Errorf("ErrorKind = %q, want %q", res.ErrorKind, ErrorKindTierNotAllowed)
+	}
+	if len(store.snapshot()) != 0 {
+		t.Errorf("expected NO ledger row for tier_not_allowed, got %d", len(store.snapshot()))
+	}
+}
+
+// TestCall_TierAllowedHappyText: a permitted tier yields a successful
+// text call AND records a ledger row.
+func TestCall_TierAllowedHappyText(t *testing.T) {
+	store := &fakeStorage{}
+	convars := ConvarReaderFunc(func(_ context.Context) []string {
+		return []string{"fast"}
+	})
+	h := New(store, convars)
+	restore := SetCompleteForTest(func(_ context.Context, _ llm.Model, _, _ string, _ []llm.Option) (string, Tokens, error) {
+		return "summary text here", Tokens{InputTokens: 50, OutputTokens: 12}, nil
+	})
+	defer restore()
+
+	res, err := h.Call(context.Background(), CallSpec{
+		Tier:           "fast",
+		UserPrompt:     "summarise the following ...",
+		ToolName:       "summarize",
+		ResponseFormat: "text",
+		RunID:          "run-1",
+		SkillID:        "sk-1",
+	})
+	if err != nil {
+		t.Fatalf("unexpected err: %v", err)
+	}
+	if !res.Success {
+		t.Errorf("expected Success=true; got ErrorKind=%q", res.ErrorKind)
+	}
+	if res.Text != "summary text here" {
+		t.Errorf("Text = %q, want %q", res.Text, "summary text here")
+	}
+	if res.InputTokens != 50 || res.OutputTokens != 12 {
+		t.Errorf("token counts wrong: in=%d out=%d", res.InputTokens, res.OutputTokens)
+	}
+	if got := len(store.snapshot()); got != 1 {
+		t.Fatalf("expected 1 ledger row, got %d", got)
+	}
+	row := store.snapshot()[0]
+	if !row.Success {
+		t.Errorf("ledger Success = false, want true")
+	}
+	if row.ToolName != "summarize" {
+		t.Errorf("ledger ToolName = %q", row.ToolName)
+	}
+	if row.RunID != "run-1" {
+		t.Errorf("ledger RunID = %q", row.RunID)
+	}
+	if row.InputTokens != 50 || row.OutputTokens != 12 {
+		t.Errorf("ledger token counts wrong: in=%d out=%d",
+			row.InputTokens, row.OutputTokens)
+	}
+}
+
+// TestCall_JSONFirstAttemptParses: JSON-format request, response is
+// valid JSON on first try; result.Parsed populated.
+func TestCall_JSONFirstAttemptParses(t *testing.T) {
+	store := &fakeStorage{}
+	h := New(store, nil)
+	restore := SetCompleteForTest(func(_ context.Context, _ llm.Model, _, _ string, _ []llm.Option) (string, Tokens, error) {
+		return `{"foo":"bar","n":42}`, Tokens{InputTokens: 10, OutputTokens: 5}, nil
+	})
+	defer restore()
+
+	res, _ := h.Call(context.Background(), CallSpec{
+		UserPrompt:           "extract entities",
+		ToolName:             "extract_entities",
+		ResponseFormat:       "json",
+		RetryOnMalformedJSON: true,
+		SkillID:              "sk-2",
+	})
+	if !res.Success || res.ErrorKind != "" {
+		t.Fatalf("expected success, got %+v", res)
+	}
+	m, ok := res.Parsed.(map[string]any)
+	if !ok {
+		t.Fatalf("Parsed not a map: %T %v", res.Parsed, res.Parsed)
+	}
+	if m["foo"] != "bar" {
+		t.Errorf("Parsed[foo] = %v", m["foo"])
+	}
+}
+
+// TestCall_JSONRetryPath: first response is malformed JSON; second
+// response (after stricter prompt) parses cleanly.
+func TestCall_JSONRetryPath(t *testing.T) {
+	store := &fakeStorage{}
+	h := New(store, nil)
+	calls := 0
+	restore := SetCompleteForTest(func(_ context.Context, _ llm.Model, _, prompt string, _ []llm.Option) (string, Tokens, error) {
+		calls++
+		if calls == 1 {
+			return "Here is your JSON: {oh no I forgot to format it", Tokens{InputTokens: 8, OutputTokens: 12}, nil
+		}
+		// Verify stricter prompt prefix appeared on retry.
+		if !strings.Contains(prompt, "Return ONLY valid JSON") {
+			t.Errorf("retry prompt missing stricter prefix: %q", prompt)
+		}
+		return `{"key":"value"}`, Tokens{InputTokens: 14, OutputTokens: 6}, nil
+	})
+	defer restore()
+
+	res, _ := h.Call(context.Background(), CallSpec{
+		UserPrompt:           "extract",
+		ToolName:             "extract_entities",
+		ResponseFormat:       "json",
+		RetryOnMalformedJSON: true,
+	})
+	if !res.Success || res.ErrorKind != "" {
+		t.Fatalf("expected success, got %+v", res)
+	}
+	if calls != 2 {
+		t.Errorf("expected 2 LLM calls, got %d", calls)
+	}
+	m, _ := res.Parsed.(map[string]any)
+	if m["key"] != "value" {
+		t.Errorf("Parsed = %v", res.Parsed)
+	}
+	// Token counts should reflect both attempts.
+	if res.InputTokens != 22 || res.OutputTokens != 18 {
+		t.Errorf("combined tokens wrong: in=%d out=%d", res.InputTokens, res.OutputTokens)
+	}
+}
+
+// TestCall_JSONRetryFailsTwice: second attempt also fails to parse.
+// Surfaces ErrorKind=malformed_json AND keeps Success=true so the
+// caller can fall back to result.Text.
+func TestCall_JSONRetryFailsTwice(t *testing.T) {
+	store := &fakeStorage{}
+	h := New(store, nil)
+	restore := SetCompleteForTest(func(_ context.Context, _ llm.Model, _, _ string, _ []llm.Option) (string, Tokens, error) {
+		return "still not JSON", Tokens{InputTokens: 10, OutputTokens: 4}, nil
+	})
+	defer restore()
+
+	res, _ := h.Call(context.Background(), CallSpec{
+		UserPrompt:           "extract",
+		ToolName:             "extract_entities",
+		ResponseFormat:       "json",
+		RetryOnMalformedJSON: true,
+	})
+	if !res.Success {
+		t.Errorf("expected Success=true (fall-back-to-text), got Success=false")
+	}
+	if res.ErrorKind != ErrorKindMalformedJSON {
+		t.Errorf("ErrorKind = %q, want %q", res.ErrorKind, ErrorKindMalformedJSON)
+	}
+	if res.Parsed != nil {
+		t.Errorf("Parsed = %v, want nil after failed retry", res.Parsed)
+	}
+	rows := store.snapshot()
+	if len(rows) != 1 {
+		t.Fatalf("expected 1 ledger row, got %d", len(rows))
+	}
+	if !rows[0].Success || rows[0].ErrorKind != ErrorKindMalformedJSON {
+		t.Errorf("ledger row mismatch: %+v", rows[0])
+	}
+}
+
+// TestCall_LLMUnavailable: transport error from the model.Generate
+// call is surfaced as ErrorKind=llm_unavailable AND records a ledger
+// row.
+func TestCall_LLMUnavailable(t *testing.T) {
+	store := &fakeStorage{}
+	h := New(store, nil)
+	restore := SetCompleteForTest(func(_ context.Context, _ llm.Model, _, _ string, _ []llm.Option) (string, Tokens, error) {
+		return "", Tokens{}, errors.New("network error")
+	})
+	defer restore()
+
+	res, _ := h.Call(context.Background(), CallSpec{
+		UserPrompt: "hi",
+		ToolName:   "summarize",
+	})
+	if res.Success {
+		t.Errorf("expected Success=false")
+	}
+	if res.ErrorKind != ErrorKindLLMUnavailable {
+		t.Errorf("ErrorKind = %q, want %q", res.ErrorKind, ErrorKindLLMUnavailable)
+	}
+	rows := store.snapshot()
+	if len(rows) != 1 {
+		t.Fatalf("expected 1 ledger row, got %d", len(rows))
+	}
+}
+
+// TestCall_EmptyUserPromptErrors: programmer-error guard.
+func TestCall_EmptyUserPromptErrors(t *testing.T) {
+	h := New(&fakeStorage{}, nil)
+	_, err := h.Call(context.Background(), CallSpec{ToolName: "summarize"})
+	if err == nil {
+		t.Fatal("expected error for empty user_prompt")
+	}
+}
+
+// TestCall_JSONWithCodeFenceParses: tolerance for the first-attempt
+// response wrapped in a ```json ... ``` fence. The retry path uses a
+// stricter prompt; this test pins the first-attempt tolerance so
+// callers don't waste a round-trip on a benign formatting wrapper.
+func TestCall_JSONWithCodeFenceParses(t *testing.T) {
+	store := &fakeStorage{}
+	h := New(store, nil)
+	restore := SetCompleteForTest(func(_ context.Context, _ llm.Model, _, _ string, _ []llm.Option) (string, Tokens, error) {
+		return "```json\n{\"x\":1}\n```", Tokens{InputTokens: 5, OutputTokens: 4}, nil
+	})
+	defer restore()
+
+	res, _ := h.Call(context.Background(), CallSpec{
+		UserPrompt:           "extract",
+		ToolName:             "extract_entities",
+		ResponseFormat:       "json",
+		RetryOnMalformedJSON: true,
+	})
+	if res.ErrorKind != "" {
+		t.Errorf("unexpected ErrorKind %q (fenced JSON should parse on first attempt)", res.ErrorKind)
+	}
+	m, _ := res.Parsed.(map[string]any)
+	if m["x"] != float64(1) {
+		t.Errorf("Parsed[x] = %v, want 1", m["x"])
+	}
+}
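The retry accounting pinned by TestCall_JSONRetryPath (8+14 input tokens, 12+6 output tokens) can be reproduced in isolation. This is a minimal sketch of the retry-once flow, not the helper itself: `tokens`, `completeFn`, `result`, and `callJSON` are hypothetical simplified types, and the model, options, ledger, and fence handling are omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tokens and completeFn are simplified stand-ins for the package's Tokens
// and CompleteFn (assumption: model and option parameters dropped).
type tokens struct{ In, Out int }

type completeFn func(prompt string) (string, tokens, error)

// result is a hypothetical, trimmed-down counterpart of CallResult.
type result struct {
	Text     string
	Parsed   any
	Used     tokens
	Attempts int
}

const strictPrefix = "Return ONLY valid JSON. No prose, no markdown fencing.\n\n"

// callJSON asks once; if the reply does not decode as JSON it asks once
// more with the stricter prefix. Token counts from both attempts are summed.
func callJSON(complete completeFn, prompt string) (result, error) {
	text, used, err := complete(prompt)
	res := result{Text: text, Used: used, Attempts: 1}
	if err != nil {
		return res, err
	}
	if json.Unmarshal([]byte(text), &res.Parsed) == nil {
		return res, nil
	}
	text2, used2, err := complete(strictPrefix + prompt)
	res.Attempts = 2
	res.Used = tokens{In: used.In + used2.In, Out: used.Out + used2.Out}
	if err != nil {
		return res, err // res.Text still holds the first response
	}
	res.Text = text2
	if json.Unmarshal([]byte(text2), &res.Parsed) != nil {
		res.Parsed = nil // still malformed: caller can fall back to res.Text
	}
	return res, nil
}

func main() {
	calls := 0
	fake := func(prompt string) (string, tokens, error) {
		calls++
		if calls == 1 {
			return "Here is your JSON: {oh no", tokens{In: 8, Out: 12}, nil
		}
		return `{"key":"value"}`, tokens{In: 14, Out: 6}, nil
	}
	res, err := callJSON(fake, "extract")
	fmt.Println(res.Attempts, res.Used.In, res.Used.Out, res.Parsed, err)
}
```

Summing usage across both attempts keeps a single ledger row honest about what the retry cost, which is the behaviour the test asserts with its 22/18 totals.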
@@ -0,0 +1,21 @@
+package llmmeta
+
+import (
+	"os"
+	"testing"
+	"time"
+
+	"gitea.stevedudenhoeffer.com/steve/executus/model"
+)
+
+// TestMain configures a minimal model tier table so the helper's
+// model.ParseModelForContext("fast"/"standard") resolves. The actual LLM call
+// is stubbed per-test via SetCompleteForTest, so these specs are only parsed
+// (anthropic registers with an empty key and errors at call time, not parse).
+func TestMain(m *testing.M) {
+	model.Configure(nil, map[string]string{
+		"fast":     "anthropic/claude-haiku-4-5",
+		"standard": "anthropic/claude-sonnet-4-6",
+	}, time.Minute)
+	os.Exit(m.Run())
+}