P1: model layer (convar->config inversion) + llmmeta
executus CI / test (pull_request) Successful in 58s
Adversarial Review (Gadfly) / review (pull_request) Successful in 26m27s
executus CI / test (push) Successful in 1m2s

Lifts mort's pkg/logic/llms into executus/model, decoupled from mort:

- tiers.go: the tier resolver now reads a host-supplied config.Source under
  "model.tier.<name>" with host-supplied fallbacks (Configure(cfg, defaults,
  ttl)), instead of convar.Manager. Tier NAMES + specs are host config; the
  resolution mechanism (cache, reasoning-suffix dialect, chain validation) is
  generic. No tier names hard-coded in the harness.
- sink.go: usage/trace recording inverted off mort's llmusage/llmtrace into
  UsageSink / TraceSink seams + a model-owned Span, with nil-safe context
  attribution helpers (WithModel/WithTraceID/WithUsageTool/WithUsageUser).
  Both sinks optional (nil = off) so a light host records nothing.
- lane decoration repointed to executus/lane; utils.Errorf -> fmt.Errorf.
- call.go keeps GenerateWith[T] (instrumented structured output) — this is the
  structured-output primitive; no separate structured/ package.
- llmmeta moved over model/ (the meta-LLM helper: tier allowlist + JSON retry
  + ledger). Its tests configure a minimal tier table via TestMain.

New tests cover the inversion: config overrides fallback, tier registration,
reasoning-suffix survival, nested-tier rejection, nil-sink no-ops.

Full module: go build/vet/test -race green; core go.sum still free of
gorm/redis/discordgo/sqlite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit was merged in pull request #1.
This commit is contained in:
2026-06-26 19:47:13 -04:00
parent 741d7816ed
commit b424261aca
17 changed files with 3698 additions and 3 deletions
+615
View File
@@ -0,0 +1,615 @@
// Package llmmeta is the shared meta-LLM helper used by the v12
// authoring tools (summarize, translate, extract_entities, classify).
//
// Why a dedicated package: each of those four tools makes "one fast-tier
// LLM call → typed result", with shared concerns (tier allowlist,
// ledger row, JSON-retry on malformed output). Centralising the pattern
// stops every tool from re-implementing the surrounding bookkeeping and
// keeps the audit trail uniform.
//
// The helper itself does NOT know about the four tools — it just exposes
// a Call(ctx, CallSpec) → CallResult shape. Each tool builds its own
// prompt + parses the typed result. The helper records the meta-call
// ledger row on every call, success or failure.
//
// Concurrency / lanes: the helper resolves the tier to an llm.Model via
// model.ParseModelForContext and uses model.Generate. Lane routing is
// already baked in at the LLM transport layer (see
// pkg/logic/llms/lane_transport.go) so each Generate call automatically
// goes through the right lane without further plumbing. Usage recording
// is automatic too: parsed models are instrumented by pkg/logic/llms,
// so the helper does NOT call model.RecordUsage itself.
//
// Tier allowlist: convar `skills.llm_meta.allowed_tiers` (default
// `["fast"]`) controls which tiers a meta-tool may use. A request for
// a disallowed tier returns error_kind="tier_not_allowed" WITHOUT
// making the call AND WITHOUT recording a ledger row (the call did
// not happen).
//
// Test: helper_test.go covers tier allowed, tier rejected, JSON
// retry path, malformed-twice path, and ledger-row emission semantics.
package llmmeta
import (
"context"
"encoding/json"
"fmt"
"strings"
"time"
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
"github.com/google/uuid"
"gitea.stevedudenhoeffer.com/steve/executus/model"
)
// MetaCall is the domain row written to skill_llm_meta_calls on every
// helper call.
//
// Why a dedicated table (not skill_run_logs): per-skill token
// aggregation is cleaner with typed columns. Folding meta-calls into
// the generic event log would force a SUM-from-JSON path on every
// dashboard query.
//
// Why the field set is tight (no payload columns): the request bodies
// can be 32KB+. The agent's main run already captures system_prompt
// + user_message in the trace; storing them again here would double
// the audit footprint with no diagnostic value (the meta-call's
// inputs are derivable from the parent run's tool-call args).
type MetaCall struct {
ID string
RunID string
SkillID string
ToolName string
TierUsed string // "fast" / "standard"
ModelUsed string // resolved provider/model
InputTokens int
OutputTokens int
DurationMs int
Success bool
ErrorKind string // empty on success; one of the sentinel kinds otherwise
CreatedAt time.Time
}
// Storage is the narrow surface the helper uses to persist meta-call
// ledger rows. Production wires a thin adapter around the skills GORM
// storage; tests substitute a fake.
//
// Why an interface (vs depending on pkg/logic/skills.Storage): the
// skills package imports skilltools (tool registry); having
// skilltools/llmmeta depend back on skills would form an import
// cycle. A narrow interface mirrored across the boundary is the
// project's standard cycle-break pattern (see KVStorage / FileStorage
// in pkg/skilltools/tools/).
type Storage interface {
RecordMetaCall(ctx context.Context, call MetaCall) error
}
// ConvarReader is the narrow surface the helper uses to read
// `skills.llm_meta.allowed_tiers`. The convar package is database-
// backed; tests pass a static fake.
//
// Why an interface (vs reading convars directly): unit tests want to
// fake the allowlist without spinning up a convar manager.
type ConvarReader interface {
// AllowedTiers returns the list of tier names a meta-tool may use.
// Default ["fast"].
AllowedTiers(ctx context.Context) []string
}
// ConvarReaderFunc adapts a closure into a ConvarReader. Useful in
// production wiring (mort.go) where the underlying access is a
// single line of logic.
type ConvarReaderFunc func(ctx context.Context) []string
// AllowedTiers satisfies ConvarReader.
func (f ConvarReaderFunc) AllowedTiers(ctx context.Context) []string {
if f == nil {
return []string{"fast"}
}
return f(ctx)
}
// Helper makes one fast-tier LLM call with surrounding bookkeeping
// (tier allowlist, JSON retry, ledger row).
//
// Construct once at boot; all four meta-tools share the same Helper.
type Helper struct {
storage Storage
convars ConvarReader
}
// New constructs a Helper. storage MUST be non-nil; passing nil makes
// every Call write a no-op ledger row (callers that need a fully no-op
// helper should instead avoid registering the tool).
//
// convars may be nil — the helper falls back to the default allowlist
// `["fast"]`.
//
// Why a constructor with explicit deps (vs Helper{...} struct
// initialiser): forces the deployment-time decision about which
// dependencies are wired vs nil-safe at the construction call site,
// not at the call site of each tool.
func New(storage Storage, convars ConvarReader) *Helper {
return &Helper{
storage: storage,
convars: convars,
}
}
// CallSpec is the per-call input.
//
// Why every field is explicit (vs builder pattern): the four meta-tools
// each populate the spec in one place; a struct literal at the call
// site is more readable than chained setters.
type CallSpec struct {
// Tier is the tier alias to use ("fast" / "standard"). Empty falls
// back to "fast". Disallowed tiers (per the convar allowlist) cause
// Call to return CallResult{Success: false, ErrorKind:
// "tier_not_allowed"} WITHOUT making the LLM call AND without
// writing a ledger row (the call did not happen).
Tier string
// SystemPrompt is the system message. May be empty.
SystemPrompt string
// UserPrompt is the user message. Required.
UserPrompt string
// MaxOutputTokens caps the response. 0 disables the cap (provider
// default). The helper uses this both to bound the cost estimate
// AND to set llm.WithMaxTokens on the request.
MaxOutputTokens int
// ResponseFormat is "text" or "json". When "json", the helper
// attempts to parse the response into JSON. Other values fall
// through as "text".
ResponseFormat string
// RetryOnMalformedJSON, when true and ResponseFormat=="json",
// retries the call ONCE with a stricter JSON-only prompt prefix
// when the first response fails to parse. Second-failure returns
// CallResult{Success: true, Parsed: nil, ErrorKind:
// "malformed_json"} so callers can fall back to result.Text.
RetryOnMalformedJSON bool
// ToolName is the meta-tool name recorded in the ledger row
// ("summarize", "translate", "extract_entities", "classify"). The
// helper does not branch on this value.
ToolName string
// RunID is the calling skill run ID. Recorded in the ledger row;
// also used by the cost-cap callback to find the running 7-day
// total.
RunID string
// SkillID is the calling skill ID. Recorded in the ledger row;
// passed to the cost-cap callback.
SkillID string
// CallerID is the Discord member ID that triggered the parent
// skill run. Passed to the cost-cap callback so the per-user
// 7-day cap can be evaluated.
CallerID string
}
// CallResult is the per-call output.
//
// Why text + parsed (vs only one): JSON-format calls expose both the
// raw response (in .Text) and the parsed map (in .Parsed). Text-format
// calls leave .Parsed nil. Callers requesting JSON that fails to parse
// twice get .Text populated and ErrorKind="malformed_json" so they
// can fall back to text-mode without an error path.
type CallResult struct {
// Text is the raw response text from the LLM. Populated on every
// successful call (success=true) AND when JSON parsing failed
// twice (success=true, parsed=nil, error_kind="malformed_json").
// Empty on tier_not_allowed rejections (no LLM call happened).
Text string
// Parsed is the JSON-decoded response. nil for text-format calls,
// nil for failed JSON parses, populated for successful JSON
// responses. The interior shape is whatever the LLM returned; the
// caller is responsible for asserting a typed view.
Parsed any
// InputTokens is the tokens billed against the input. 0 when the
// provider didn't surface usage.
InputTokens int
// OutputTokens is the tokens billed against the output. 0 when the
// provider didn't surface usage.
OutputTokens int
// DurationMs is wall-clock duration of the LLM call (or call+retry
// in the JSON-retry case).
DurationMs int
// ModelUsed is the resolved provider/model string ("anthropic/
// claude-haiku-4-5-20251001"). Populated on every actual LLM call;
// empty on tier_not_allowed rejections.
ModelUsed string
// Success reports whether the LLM call returned a usable response.
// True on happy-path AND on malformed-json second-failure (the
// caller can fall back to .Text). False on transport errors,
// tier_not_allowed, llm_unavailable.
Success bool
// ErrorKind, when non-empty, is one of:
// - "tier_not_allowed" → no call, no ledger row
// - "llm_unavailable" → call attempted, ledger row written
// - "malformed_json" → call succeeded but JSON parse failed
ErrorKind string
}
// Sentinel error_kind values for CallResult.ErrorKind.
const (
ErrorKindTierNotAllowed = "tier_not_allowed"
ErrorKindLLMUnavailable = "llm_unavailable"
ErrorKindMalformedJSON = "malformed_json"
)
// Call performs the meta-LLM call and returns a typed CallResult.
//
// Why no error return (vs an error second value): every meaningful
// failure is captured as a CallResult.ErrorKind so the caller's branch
// logic stays single-pathed. Internal transport errors are surfaced
// as ErrorKind=llm_unavailable. The function only returns a non-nil
// error for argument-validation failures (empty UserPrompt) — a
// programmer error the caller would have to fix anyway.
//
// Test: helper_test.go covers all outcomes (tier_not_allowed, happy
// text, happy json, malformed_json retry-pass, malformed_json
// retry-fail, llm_unavailable).
func (h *Helper) Call(ctx context.Context, spec CallSpec) (CallResult, error) {
if strings.TrimSpace(spec.UserPrompt) == "" {
return CallResult{}, fmt.Errorf("llmmeta: user_prompt required")
}
tier := strings.TrimSpace(spec.Tier)
if tier == "" {
tier = "fast"
}
// Tier allowlist: rejected tiers do NOT make the call AND do NOT
// record a ledger row.
if !h.tierAllowed(ctx, tier) {
return CallResult{
Success: false,
ErrorKind: ErrorKindTierNotAllowed,
}, nil
}
resolvedModel := model.ResolveModelName(tier)
// Resolve model. ParseModelForContext attaches the resolved model
// name to ctx (for usage attribution) AND returns the llm.Model
// whose Generate already routes through the lane wrapper.
ctx, model, err := model.ParseModelForContext(ctx, tier)
if err != nil {
// Tier convar mis-set: surface as tier_not_allowed to the
// caller (the agent's recovery path is the same as for an
// admin-disabled tier) but DO record the failure for the
// admin who needs to fix the convar.
h.recordLedger(ctx, MetaCall{
ID: uuid.NewString(),
RunID: spec.RunID,
SkillID: spec.SkillID,
ToolName: spec.ToolName,
TierUsed: tier,
ModelUsed: resolvedModel,
Success: false,
ErrorKind: ErrorKindTierNotAllowed,
CreatedAt: time.Now(),
})
return CallResult{
Success: false,
ErrorKind: ErrorKindTierNotAllowed,
}, nil
}
// First call.
start := time.Now()
systemPrompt := spec.SystemPrompt
userMessage := spec.UserPrompt
opts := []llm.Option{}
if spec.MaxOutputTokens > 0 {
opts = append(opts, llm.WithMaxTokens(spec.MaxOutputTokens))
}
text, usage, llmErr := h.complete(ctx, model, systemPrompt, userMessage, opts)
if llmErr != nil {
duration := int(time.Since(start) / time.Millisecond)
h.recordLedger(ctx, MetaCall{
ID: uuid.NewString(),
RunID: spec.RunID,
SkillID: spec.SkillID,
ToolName: spec.ToolName,
TierUsed: tier,
ModelUsed: resolvedModel,
InputTokens: usage.InputTokens,
OutputTokens: usage.OutputTokens,
DurationMs: duration,
Success: false,
ErrorKind: ErrorKindLLMUnavailable,
CreatedAt: time.Now(),
})
return CallResult{
Success: false,
ErrorKind: ErrorKindLLMUnavailable,
ModelUsed: resolvedModel,
DurationMs: duration,
InputTokens: usage.InputTokens,
OutputTokens: usage.OutputTokens,
}, nil
}
// Determine outcome based on response format.
parsed, parsedOK := tryParseJSON(text, spec.ResponseFormat)
wantJSON := strings.EqualFold(spec.ResponseFormat, "json")
if !wantJSON || parsedOK {
// Happy path (text mode OR JSON mode that parsed first try).
duration := int(time.Since(start) / time.Millisecond)
h.recordLedger(ctx, MetaCall{
ID: uuid.NewString(),
RunID: spec.RunID,
SkillID: spec.SkillID,
ToolName: spec.ToolName,
TierUsed: tier,
ModelUsed: resolvedModel,
InputTokens: usage.InputTokens,
OutputTokens: usage.OutputTokens,
DurationMs: duration,
Success: true,
CreatedAt: time.Now(),
})
return CallResult{
Text: text,
Parsed: parsed,
Success: true,
ModelUsed: resolvedModel,
InputTokens: usage.InputTokens,
OutputTokens: usage.OutputTokens,
DurationMs: duration,
}, nil
}
// JSON requested but first response failed to parse.
if !spec.RetryOnMalformedJSON {
duration := int(time.Since(start) / time.Millisecond)
h.recordLedger(ctx, MetaCall{
ID: uuid.NewString(),
RunID: spec.RunID,
SkillID: spec.SkillID,
ToolName: spec.ToolName,
TierUsed: tier,
ModelUsed: resolvedModel,
InputTokens: usage.InputTokens,
OutputTokens: usage.OutputTokens,
DurationMs: duration,
Success: true,
ErrorKind: ErrorKindMalformedJSON,
CreatedAt: time.Now(),
})
return CallResult{
Text: text,
Success: true,
ErrorKind: ErrorKindMalformedJSON,
ModelUsed: resolvedModel,
InputTokens: usage.InputTokens,
OutputTokens: usage.OutputTokens,
DurationMs: duration,
}, nil
}
// Retry once with stricter JSON-only prompt prefix.
stricterPrompt := "Return ONLY valid JSON. No prose, no markdown fencing.\n\n" + userMessage
text2, usage2, llmErr2 := h.complete(ctx, model, systemPrompt, stricterPrompt, opts)
combinedUsage := Tokens{
InputTokens: usage.InputTokens + usage2.InputTokens,
OutputTokens: usage.OutputTokens + usage2.OutputTokens,
}
duration := int(time.Since(start) / time.Millisecond)
if llmErr2 != nil {
// Retry call itself failed transport-wise. Record the round-
// trip tokens and surface llm_unavailable.
h.recordLedger(ctx, MetaCall{
ID: uuid.NewString(),
RunID: spec.RunID,
SkillID: spec.SkillID,
ToolName: spec.ToolName,
TierUsed: tier,
ModelUsed: resolvedModel,
InputTokens: combinedUsage.InputTokens,
OutputTokens: combinedUsage.OutputTokens,
DurationMs: duration,
Success: false,
ErrorKind: ErrorKindLLMUnavailable,
CreatedAt: time.Now(),
})
return CallResult{
Text: text,
Success: false,
ErrorKind: ErrorKindLLMUnavailable,
ModelUsed: resolvedModel,
InputTokens: combinedUsage.InputTokens,
OutputTokens: combinedUsage.OutputTokens,
DurationMs: duration,
}, nil
}
parsed2, parsedOK2 := tryParseJSON(text2, "json")
if parsedOK2 {
h.recordLedger(ctx, MetaCall{
ID: uuid.NewString(),
RunID: spec.RunID,
SkillID: spec.SkillID,
ToolName: spec.ToolName,
TierUsed: tier,
ModelUsed: resolvedModel,
InputTokens: combinedUsage.InputTokens,
OutputTokens: combinedUsage.OutputTokens,
DurationMs: duration,
Success: true,
CreatedAt: time.Now(),
})
return CallResult{
Text: text2,
Parsed: parsed2,
Success: true,
ModelUsed: resolvedModel,
InputTokens: combinedUsage.InputTokens,
OutputTokens: combinedUsage.OutputTokens,
DurationMs: duration,
}, nil
}
// Second-failure path. Caller can fall back to result.Text.
h.recordLedger(ctx, MetaCall{
ID: uuid.NewString(),
RunID: spec.RunID,
SkillID: spec.SkillID,
ToolName: spec.ToolName,
TierUsed: tier,
ModelUsed: resolvedModel,
InputTokens: combinedUsage.InputTokens,
OutputTokens: combinedUsage.OutputTokens,
DurationMs: duration,
Success: true,
ErrorKind: ErrorKindMalformedJSON,
CreatedAt: time.Now(),
})
return CallResult{
Text: text2,
Success: true,
ErrorKind: ErrorKindMalformedJSON,
ModelUsed: resolvedModel,
InputTokens: combinedUsage.InputTokens,
OutputTokens: combinedUsage.OutputTokens,
DurationMs: duration,
}, nil
}
// Tokens is the input/output token count returned by the LLM round-
// trip. Mirrors llm.Usage's two cost-bearing fields. Exported so
// downstream test code (the four meta-tools' tests, integration
// tests) can use SetCompleteForTest.
type Tokens struct {
InputTokens int
OutputTokens int
}
// CompleteFn is the seam used by tests to fake the LLM round-trip
// without spinning up a real provider. Exported for tests in other
// packages (the four meta-tools live in pkg/skilltools/tools/).
type CompleteFn func(ctx context.Context, model llm.Model, systemPrompt, userMessage string, opts []llm.Option) (string, Tokens, error)
// completeOverride is set in tests via SetCompleteForTest. nil falls
// back to the real model.Generate path.
var completeOverride CompleteFn
// complete is the actual LLM round-trip. Calls model.Generate (which
// already routes through the lane transport wrapper) and returns the
// text + usage + error.
//
// Why not call model.SimpleCall: SimpleCall doesn't surface Usage; we
// need the input/output token counts for the ledger row.
//
// Usage attribution to the per-user / per-skill dashboards is handled
// by the instrumented model that model.ParseModelForContext returns —
// a manual model.RecordUsage here would double-count.
func (h *Helper) complete(ctx context.Context, model llm.Model, systemPrompt, userMessage string, opts []llm.Option) (string, Tokens, error) {
if completeOverride != nil {
return completeOverride(ctx, model, systemPrompt, userMessage, opts)
}
req := llm.Request{
System: systemPrompt,
Messages: []llm.Message{llm.UserText(userMessage)},
}
resp, err := model.Generate(ctx, req, opts...)
if err != nil {
return "", Tokens{}, err
}
usage := Tokens{
InputTokens: resp.Usage.InputTokens,
OutputTokens: resp.Usage.OutputTokens,
}
return resp.Text(), usage, nil
}
// SetCompleteForTest installs a fake completer used by Call. Returns a
// restore function that the test deferes to revert the override.
//
// Why exported (vs in a _test.go file): the four meta-tools' tests live
// in pkg/skilltools/tools/, in a different package than the helper.
// They need a way to fake the LLM without depending on a real model.
func SetCompleteForTest(fn CompleteFn) func() {
prev := completeOverride
completeOverride = fn
return func() { completeOverride = prev }
}
// tierAllowed reports whether the given tier appears in the configured
// allowlist. Empty allowlist defaults to ["fast"].
func (h *Helper) tierAllowed(ctx context.Context, tier string) bool {
var allowed []string
if h.convars != nil {
allowed = h.convars.AllowedTiers(ctx)
}
if len(allowed) == 0 {
allowed = []string{"fast"}
}
for _, t := range allowed {
if strings.EqualFold(strings.TrimSpace(t), tier) {
return true
}
}
return false
}
// recordLedger writes one meta-call row. Storage failures are logged
// at the storage layer; the helper does not propagate them — meta-call
// accounting MUST NOT break user-visible execution.
func (h *Helper) recordLedger(ctx context.Context, call MetaCall) {
if h.storage == nil {
return
}
_ = h.storage.RecordMetaCall(ctx, call)
}
// tryParseJSON attempts to decode text as JSON. Returns the parsed
// value (any) and ok=true on success. ok=false on failure or when
// format is not "json".
//
// Why we accept arbitrary JSON shapes (vs requiring an object): the
// extract_entities tool returns objects, but classify returns objects
// with arrays inside. Accepting `any` keeps the helper agnostic to the
// caller's downstream typing.
//
// Tolerance: strips a leading "```json" code fence + matching closing
// fence so the agent can include surrounding markdown without
// breaking parse. The stricter retry prompt explicitly asks for no
// fence; this tolerance is for the first-attempt path.
func tryParseJSON(text, format string) (any, bool) {
if !strings.EqualFold(format, "json") {
return nil, false
}
trimmed := strings.TrimSpace(text)
// Strip optional ```json ... ``` fence.
if strings.HasPrefix(trimmed, "```") {
// Drop opening fence (with or without language tag).
if idx := strings.Index(trimmed, "\n"); idx >= 0 {
trimmed = trimmed[idx+1:]
}
// Drop trailing fence.
if idx := strings.LastIndex(trimmed, "```"); idx >= 0 {
trimmed = trimmed[:idx]
}
trimmed = strings.TrimSpace(trimmed)
}
var parsed any
if err := json.Unmarshal([]byte(trimmed), &parsed); err != nil {
return nil, false
}
return parsed, true
}
+282
View File
@@ -0,0 +1,282 @@
package llmmeta
import (
"context"
"errors"
"strings"
"sync"
"testing"
llm "gitea.stevedudenhoeffer.com/steve/majordomo/llm"
)
// fakeStorage records every MetaCall handed to RecordMetaCall and
// makes them available to tests via the captured slice.
type fakeStorage struct {
mu sync.Mutex
calls []MetaCall
err error
}
func (f *fakeStorage) RecordMetaCall(_ context.Context, call MetaCall) error {
f.mu.Lock()
defer f.mu.Unlock()
f.calls = append(f.calls, call)
return f.err
}
func (f *fakeStorage) snapshot() []MetaCall {
f.mu.Lock()
defer f.mu.Unlock()
out := make([]MetaCall, len(f.calls))
copy(out, f.calls)
return out
}
// TestCall_TierNotAllowed: a tier not in the allowlist returns the
// rejection without recording a ledger row — the call did not happen.
func TestCall_TierNotAllowed(t *testing.T) {
store := &fakeStorage{}
convars := ConvarReaderFunc(func(_ context.Context) []string {
return []string{"fast"}
})
h := New(store, convars)
res, err := h.Call(context.Background(), CallSpec{
Tier: "thinking",
UserPrompt: "hello",
ToolName: "summarize",
})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if res.Success {
t.Errorf("expected Success=false")
}
if res.ErrorKind != ErrorKindTierNotAllowed {
t.Errorf("ErrorKind = %q, want %q", res.ErrorKind, ErrorKindTierNotAllowed)
}
if len(store.snapshot()) != 0 {
t.Errorf("expected NO ledger row for tier_not_allowed, got %d", len(store.snapshot()))
}
}
// TestCall_TierAllowedHappyText: a permitted tier yields a successful
// text call AND records a ledger row.
func TestCall_TierAllowedHappyText(t *testing.T) {
store := &fakeStorage{}
convars := ConvarReaderFunc(func(_ context.Context) []string {
return []string{"fast"}
})
h := New(store, convars)
restore := SetCompleteForTest(func(_ context.Context, _ llm.Model, _, _ string, _ []llm.Option) (string, Tokens, error) {
return "summary text here", Tokens{InputTokens: 50, OutputTokens: 12}, nil
})
defer restore()
res, err := h.Call(context.Background(), CallSpec{
Tier: "fast",
UserPrompt: "summarise the following ...",
ToolName: "summarize",
ResponseFormat: "text",
RunID: "run-1",
SkillID: "sk-1",
})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if !res.Success {
t.Errorf("expected Success=true; got ErrorKind=%q", res.ErrorKind)
}
if res.Text != "summary text here" {
t.Errorf("Text = %q, want %q", res.Text, "summary text here")
}
if res.InputTokens != 50 || res.OutputTokens != 12 {
t.Errorf("token counts wrong: in=%d out=%d", res.InputTokens, res.OutputTokens)
}
if got := len(store.snapshot()); got != 1 {
t.Fatalf("expected 1 ledger row, got %d", got)
}
row := store.snapshot()[0]
if !row.Success {
t.Errorf("ledger Success = false, want true")
}
if row.ToolName != "summarize" {
t.Errorf("ledger ToolName = %q", row.ToolName)
}
if row.RunID != "run-1" {
t.Errorf("ledger RunID = %q", row.RunID)
}
if row.InputTokens != 50 || row.OutputTokens != 12 {
t.Errorf("ledger token counts wrong: in=%d out=%d",
row.InputTokens, row.OutputTokens)
}
}
// TestCall_JSONFirstAttemptParses: JSON-format request, response is
// valid JSON on first try; result.Parsed populated.
func TestCall_JSONFirstAttemptParses(t *testing.T) {
store := &fakeStorage{}
h := New(store, nil)
restore := SetCompleteForTest(func(_ context.Context, _ llm.Model, _, _ string, _ []llm.Option) (string, Tokens, error) {
return `{"foo":"bar","n":42}`, Tokens{InputTokens: 10, OutputTokens: 5}, nil
})
defer restore()
res, _ := h.Call(context.Background(), CallSpec{
UserPrompt: "extract entities",
ToolName: "extract_entities",
ResponseFormat: "json",
RetryOnMalformedJSON: true,
SkillID: "sk-2",
})
if !res.Success || res.ErrorKind != "" {
t.Fatalf("expected success, got %+v", res)
}
m, ok := res.Parsed.(map[string]any)
if !ok {
t.Fatalf("Parsed not a map: %T %v", res.Parsed, res.Parsed)
}
if m["foo"] != "bar" {
t.Errorf("Parsed[foo] = %v", m["foo"])
}
}
// TestCall_JSONRetryPath: first response is malformed JSON; second
// response (after stricter prompt) parses cleanly.
func TestCall_JSONRetryPath(t *testing.T) {
store := &fakeStorage{}
h := New(store, nil)
calls := 0
restore := SetCompleteForTest(func(_ context.Context, _ llm.Model, _, prompt string, _ []llm.Option) (string, Tokens, error) {
calls++
if calls == 1 {
return "Here is your JSON: {oh no I forgot to format it", Tokens{InputTokens: 8, OutputTokens: 12}, nil
}
// Verify stricter prompt prefix appeared on retry.
if !strings.Contains(prompt, "Return ONLY valid JSON") {
t.Errorf("retry prompt missing stricter prefix: %q", prompt)
}
return `{"key":"value"}`, Tokens{InputTokens: 14, OutputTokens: 6}, nil
})
defer restore()
res, _ := h.Call(context.Background(), CallSpec{
UserPrompt: "extract",
ToolName: "extract_entities",
ResponseFormat: "json",
RetryOnMalformedJSON: true,
})
if !res.Success || res.ErrorKind != "" {
t.Fatalf("expected success, got %+v", res)
}
if calls != 2 {
t.Errorf("expected 2 LLM calls, got %d", calls)
}
m, _ := res.Parsed.(map[string]any)
if m["key"] != "value" {
t.Errorf("Parsed = %v", res.Parsed)
}
// Token counts should reflect both attempts.
if res.InputTokens != 22 || res.OutputTokens != 18 {
t.Errorf("combined tokens wrong: in=%d out=%d", res.InputTokens, res.OutputTokens)
}
}
// TestCall_JSONRetryFailsTwice: second attempt also fails to parse.
// Surfaces ErrorKind=malformed_json AND keeps Success=true so the
// caller can fall back to result.Text.
func TestCall_JSONRetryFailsTwice(t *testing.T) {
store := &fakeStorage{}
h := New(store, nil)
restore := SetCompleteForTest(func(_ context.Context, _ llm.Model, _, _ string, _ []llm.Option) (string, Tokens, error) {
return "still not JSON", Tokens{InputTokens: 10, OutputTokens: 4}, nil
})
defer restore()
res, _ := h.Call(context.Background(), CallSpec{
UserPrompt: "extract",
ToolName: "extract_entities",
ResponseFormat: "json",
RetryOnMalformedJSON: true,
})
if !res.Success {
t.Errorf("expected Success=true (fall-back-to-text), got Success=false")
}
if res.ErrorKind != ErrorKindMalformedJSON {
t.Errorf("ErrorKind = %q, want %q", res.ErrorKind, ErrorKindMalformedJSON)
}
if res.Parsed != nil {
t.Errorf("Parsed = %v, want nil after failed retry", res.Parsed)
}
rows := store.snapshot()
if len(rows) != 1 {
t.Fatalf("expected 1 ledger row, got %d", len(rows))
}
if !rows[0].Success || rows[0].ErrorKind != ErrorKindMalformedJSON {
t.Errorf("ledger row mismatch: %+v", rows[0])
}
}
// TestCall_LLMUnavailable: transport error from the model.Generate
// call is surfaced as ErrorKind=llm_unavailable AND records a ledger
// row.
func TestCall_LLMUnavailable(t *testing.T) {
store := &fakeStorage{}
h := New(store, nil)
restore := SetCompleteForTest(func(_ context.Context, _ llm.Model, _, _ string, _ []llm.Option) (string, Tokens, error) {
return "", Tokens{}, errors.New("network error")
})
defer restore()
res, _ := h.Call(context.Background(), CallSpec{
UserPrompt: "hi",
ToolName: "summarize",
})
if res.Success {
t.Errorf("expected Success=false")
}
if res.ErrorKind != ErrorKindLLMUnavailable {
t.Errorf("ErrorKind = %q, want %q", res.ErrorKind, ErrorKindLLMUnavailable)
}
rows := store.snapshot()
if len(rows) != 1 {
t.Fatalf("expected 1 ledger row, got %d", len(rows))
}
}
// TestCall_EmptyUserPromptErrors: programmer-error guard.
func TestCall_EmptyUserPromptErrors(t *testing.T) {
h := New(&fakeStorage{}, nil)
_, err := h.Call(context.Background(), CallSpec{ToolName: "summarize"})
if err == nil {
t.Fatal("expected error for empty user_prompt")
}
}
// TestCall_JSONWithCodeFenceParses: tolerance for the first-attempt
// response wrapped in a ```json ... ``` fence. The retry path uses a
// stricter prompt; this test pins the first-attempt tolerance so
// callers don't waste a round-trip on a benign formatting wrapper.
func TestCall_JSONWithCodeFenceParses(t *testing.T) {
store := &fakeStorage{}
h := New(store, nil)
restore := SetCompleteForTest(func(_ context.Context, _ llm.Model, _, _ string, _ []llm.Option) (string, Tokens, error) {
return "```json\n{\"x\":1}\n```", Tokens{InputTokens: 5, OutputTokens: 4}, nil
})
defer restore()
res, _ := h.Call(context.Background(), CallSpec{
UserPrompt: "extract",
ToolName: "extract_entities",
ResponseFormat: "json",
RetryOnMalformedJSON: true,
})
if res.ErrorKind != "" {
t.Errorf("unexpected ErrorKind %q (fenced JSON should parse on first attempt)", res.ErrorKind)
}
m, _ := res.Parsed.(map[string]any)
if m["x"] != float64(1) {
t.Errorf("Parsed[x] = %v, want 1", m["x"])
}
}
+21
View File
@@ -0,0 +1,21 @@
package llmmeta
import (
"os"
"testing"
"time"
"gitea.stevedudenhoeffer.com/steve/executus/model"
)
// TestMain configures a minimal model tier table so the helper's
// model.ParseModelForContext("fast"/"standard") resolves. The actual LLM call
// is stubbed per-test via SetCompleteForTest, so these specs are only parsed
// (anthropic registers with an empty key and errors at call time, not parse).
func TestMain(m *testing.M) {
model.Configure(nil, map[string]string{
"fast": "anthropic/claude-haiku-4-5",
"standard": "anthropic/claude-sonnet-4-6",
}, time.Minute)
os.Exit(m.Run())
}