P1: model layer (convar→config inversion) + llmmeta #1

Merged
steve merged 1 commits from phase-1-model-layer into main 2026-06-27 00:30:39 +00:00
Owner

Phase 1 of the executus extraction (see CLAUDE.md roadmap). Lifts mort's model-access layer into executus/model, fully decoupled from mort.

What's here

  • tool/ (already on main) — the skilltools registry core moved verbatim (registry, 3-stage permissions, gated-tool, ssrf, hmac, encryption). Clean, no mort coupling.
  • model/ — mort's pkg/logic/llms ported with the key inversions:
    • convar → config.Source: the tier resolver reads model.tier.<name> from a host-supplied config source with host-supplied fallbacks (Configure(cfg, defaults, ttl)). Tier names + specs are host config; the resolution mechanism (cache, reasoning-suffix dialect, chain validation) is generic. No tier names hard-coded in the harness.
    • llmusage/llmtrace → UsageSink/TraceSink seams + a model-owned Span, with nil-safe context attribution helpers. Both sinks optional (nil = off) so a light host (gadfly) records nothing.
    • lane decoration repointed to executus/lane; utils.Errorffmt.Errorf.
    • GenerateWith[T] (instrumented structured output) retained — this is the structured-output primitive, no separate package.
  • llmmeta/ — the meta-LLM helper (tier allowlist + JSON retry + ledger) over model/.

Verification

  • go build ./... && go vet ./... && go test -race ./... all green.
  • CI invariant holds: core go.sum free of gorm/redis/discordgo/sqlite.
  • New tests cover the inversion: config overrides fallback, tier registration, reasoning-suffix survival, nested-tier rejection, nil-sink no-ops.

Notes for review

Some doc comments inside model/ still reference mort specifics (env key names, "convar", "mort's executors") — they're carried over and slated for a cleanup pass; flag any that are misleading. This is a simple-system gadfly review — findings are advisory and will be double-checked.

🤖 Generated with Claude Code

Phase 1 of the executus extraction (see `CLAUDE.md` roadmap). Lifts mort's model-access layer into `executus/model`, fully decoupled from mort. ## What's here - **`tool/`** (already on main) — the skilltools registry core moved verbatim (registry, 3-stage permissions, gated-tool, ssrf, hmac, encryption). Clean, no mort coupling. - **`model/`** — mort's `pkg/logic/llms` ported with the key inversions: - **convar → `config.Source`**: the tier resolver reads `model.tier.<name>` from a host-supplied config source with host-supplied fallbacks (`Configure(cfg, defaults, ttl)`). Tier *names* + specs are host config; the resolution *mechanism* (cache, reasoning-suffix dialect, chain validation) is generic. **No tier names hard-coded in the harness.** - **llmusage/llmtrace → `UsageSink`/`TraceSink`** seams + a model-owned `Span`, with nil-safe context attribution helpers. Both sinks optional (nil = off) so a light host (gadfly) records nothing. - lane decoration repointed to `executus/lane`; `utils.Errorf` → `fmt.Errorf`. - `GenerateWith[T]` (instrumented structured output) retained — this is the structured-output primitive, no separate package. - **`llmmeta/`** — the meta-LLM helper (tier allowlist + JSON retry + ledger) over `model/`. ## Verification - `go build ./... && go vet ./... && go test -race ./...` all green. - CI invariant holds: core `go.sum` free of gorm/redis/discordgo/sqlite. - New tests cover the inversion: config overrides fallback, tier registration, reasoning-suffix survival, nested-tier rejection, nil-sink no-ops. ## Notes for review Some doc comments inside `model/` still reference mort specifics (env key names, "convar", "mort's executors") — they're carried over and slated for a cleanup pass; flag any that are misleading. This is a **simple-system gadfly review** — findings are advisory and will be double-checked. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
steve added 1 commit 2026-06-26 23:47:38 +00:00
P1: model layer (convar->config inversion) + llmmeta
executus CI / test (pull_request) Successful in 58s
Adversarial Review (Gadfly) / review (pull_request) Successful in 26m27s
executus CI / test (push) Successful in 1m2s
b424261aca
Lifts mort's pkg/logic/llms into executus/model, decoupled from mort:

- tiers.go: the tier resolver now reads a host-supplied config.Source under
  "model.tier.<name>" with host-supplied fallbacks (Configure(cfg, defaults,
  ttl)), instead of convar.Manager. Tier NAMES + specs are host config; the
  resolution mechanism (cache, reasoning-suffix dialect, chain validation) is
  generic. No tier names hard-coded in the harness.
- sink.go: usage/trace recording inverted off mort's llmusage/llmtrace into
  UsageSink / TraceSink seams + a model-owned Span, with nil-safe context
  attribution helpers (WithModel/WithTraceID/WithUsageTool/WithUsageUser).
  Both sinks optional (nil = off) so a light host records nothing.
- lane decoration repointed to executus/lane; utils.Errorf -> fmt.Errorf.
- call.go keeps GenerateWith[T] (instrumented structured output) — this is the
  structured-output primitive; no separate structured/ package.
- llmmeta moved over model/ (the meta-LLM helper: tier allowlist + JSON retry
  + ledger). Its tests configure a minimal tier table via TestMain.

New tests cover the inversion: config overrides fallback, tier registration,
reasoning-suffix survival, nested-tier rejection, nil-sink no-ops.

Full module: go build/vet/test -race green; core go.sum still free of
gorm/redis/discordgo/sqlite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

🪰 Gadfly review — m1pro/qwen3:14b (m1pro)

Verdict: Blocking issues found · ⚠️ 1/3 lens(es) errored — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

No material issues found

After a thorough review of the code changes through the security lens, no material security issues were found. The code appears to handle input validation, usage tracking, and secure data handling appropriately. The use of UUIDs, proper error handling, and reliance on underlying libraries for secure operations (like LLM injection prevention) are in place. The code also includes tests for various security-related scenarios, such as JSON parsing and tier restrictions.

🎯 Correctness — Blocking issues found

Blocking issues found

  • llmmeta/helper.go:72: Incorrect error classification when model resolution fails. When ParseModelForContext fails (e.g., due to a misconfigured convar), the code records a ledger row with ErrorKindTierNotAllowed, but this is a model resolution error, not a tier disallowance. This misclassification could lead to incorrect error handling and auditing. Fix: Introduce a distinct error kind like model_resolution_error and update the ledger row accordingly.

  • llmmeta/helper.go:19: Documentation discrepancy. The New constructor's comment claims that passing a nil storage will cause every Call to write a no-op ledger row, but the actual code skips ledger recording when storage is nil. This inconsistency could lead to incorrect assumptions about behavior. Fix: Update the comment to match the implementation, or modify the code to record no-op rows when storage is nil (if that's the intended behavior).

🧯 Error handling & edge cases⚠️ could not complete

⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
m1pro/qwen3:14b: ollama m1pro/qwen3:14b: do request: Post "http://192.168.0.175:11434/api/chat": context deadline exceeded

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 26m 26s

<!-- gadfly-review:ollama:m1pro/qwen3:14b --> ### 🪰 Gadfly review — `m1pro/qwen3:14b` (m1pro) **Verdict: Blocking issues found · ⚠️ 1/3 lens(es) errored** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> # No material issues found After a thorough review of the code changes through the security lens, no material security issues were found. The code appears to handle input validation, usage tracking, and secure data handling appropriately. The use of UUIDs, proper error handling, and reliance on underlying libraries for secure operations (like LLM injection prevention) are in place. The code also includes tests for various security-related scenarios, such as JSON parsing and tier restrictions. </details> <details><summary><b>🎯 Correctness</b> — Blocking issues found</summary> Blocking issues found - **`llmmeta/helper.go:72`**: Incorrect error classification when model resolution fails. When `ParseModelForContext` fails (e.g., due to a misconfigured convar), the code records a ledger row with `ErrorKindTierNotAllowed`, but this is a model resolution error, not a tier disallowance. This misclassification could lead to incorrect error handling and auditing. Fix: Introduce a distinct error kind like `model_resolution_error` and update the ledger row accordingly. - **`llmmeta/helper.go:19`**: Documentation discrepancy. The `New` constructor's comment claims that passing a nil `storage` will cause every `Call` to write a no-op ledger row, but the actual code skips ledger recording when `storage` is nil. This inconsistency could lead to incorrect assumptions about behavior. Fix: Update the comment to match the implementation, or modify the code to record no-op rows when `storage` is nil (if that's the intended behavior). </details> <details><summary><b>🧯 Error handling & edge cases</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed m1pro/qwen3:14b: ollama m1pro/qwen3:14b: do request: Post "http://192.168.0.175:11434/api/chat": context deadline exceeded </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 26m 26s</sub>

🪰 Gadfly review — deepseek-v4-flash:cloud (ollama-cloud)

Verdict: Minor issues — 3 reviewers: security, correctness, error-handling

🔒 Security — No material issues found

🔒 Security Review

Verdict: No material issues found

After tracing the full data flow through the new code, I find no exploitable security vulnerabilities in this PR. Here's what I verified:

What I checked

API key handling (model/llms.go:144-175, model/cloud_sync.go:430-435): API keys are read from environment variables at boot time via os.Getenv() and passed directly to provider constructors. The CloudOllamaLimitCache.applyAuth() sets a Bearer token on outbound requests. Keys are stored in-memory as plain strings — standard Go practice, no credential leakage path in this code.

SSRF risk (model/cloud_sync.go:306-382): The CloudOllamaLimitCache makes HTTP requests to a configurable endpoint. However, the endpoint is set once at construction time by the host (boot-time configuration), not from user input. The modelName parameter in fetchContextLength is used only in a properly JSON-encoded POST body (json.Marshal), never in URL construction — no injection vector. The /api/tags response is parsed and its model names are used only in subsequent /api/show POST bodies, again properly encoded.

Injection surfaces (llmmeta/helper.go:200-610): The CallSpec.UserPrompt and SystemPrompt flow to the LLM API — expected behavior, not a SQL/command injection vector. The MetaCall fields (RunID, SkillID, CallerID, ToolName, Tier) are passed to the Storage.RecordMetaCall interface; injection risk depends on the storage implementation (GORM adapter), not this code.

Tier allowlist bypass (llmmeta/helper.go:560-575): The tierAllowed check uses strings.EqualFold for case-insensitive comparison against the configured allowlist. No bypass possible — the comparison is correct and the default allowlist ["fast"] is applied when the ConvarReader is nil.

Test hooks (llmmeta/helper.go:540-550): SetCompleteForTest is exported and modifies a global completeOverride. In production this variable stays nil, so the real model.Generate path is always used. No production code path sets it.

Information disclosure (model/call.go:247-248): The recordUsage function calls debug.Stack() in a slog.Warn when both model and tool are "unknown". This could leak internal code paths to logs in an edge case. Minor — the condition is rare (only fires when usage attribution is completely missing) and the logs are operator-facing, not user-facing.

No other issues found in: tryParseJSON (safe json.Unmarshal into any), marshalMessages/marshalToolCalls (tracing serialization, by-design data capture), buildRegistry (standard env-var provider wiring), or LoadEnv (majordomo's LLM_* DSN parsing).

🎯 Correctness — Minor issues

🎯 Correctness — Verdict: Minor issues

Finding 1: llmmeta/helper.go:283resolvedModel computed before ParseModelForContext may diverge from actual serving model on error path

Verified by reading llmmeta/helper.go:283-300 and model/llms.go:413-460.

At line 283, resolvedModel is computed via model.ResolveModelName(tier). Then at line 288, model.ParseModelForContext(ctx, tier) is called. If ParseModelForContext succeeds, the actual model used is the one returned by ParseModelRequest, which goes through Registry().Parse(clean) — a path that can resolve tier aliases, legacy shortcuts, and provider/model lookups differently than ResolveModelName does in edge cases (e.g., when a tier resolves to a chain and the first element's model name extraction differs between the two functions).

Impact: The ledger row's ModelUsed field and the CallResult.ModelUsed could report a different model name than the one that actually served the request. This is a data-integrity issue for cost accounting and audit trails.

Suggested fix: Compute resolvedModel after ParseModelForContext succeeds, using the resolved model's identity. For example, extract it from the returned llm.Model or from the context after WithModel stamps it:

ctx, llmModel, err := model.ParseModelForContext(ctx, tier)
if err != nil { ... }
resolvedModel := model.ResolveModelName(tier) // or better: derive from llmModel

Finding 2: model/call.go:109GenerateWith[T] usage is recorded by instrumentedModel but captureModel wraps it, creating a fragile layering dependency

Verified by reading model/call.go:101-158 and model/call.go:36-46.

GenerateWith[T] creates a captureModel{inner: model} where model is the caller-supplied llm.Model. The docstring says "models returned by ParseModelRequest / ParseModelForContext record usage automatically" — but GenerateWith does not enforce this. If a caller passes a model that is not an instrumentedModel (e.g., a raw model from Registry().Parse()), usage is silently not recorded with no warning.

Impact: Silent under-counting of token usage for any caller that uses GenerateWith with a non-instrumented model. The captureModel only captures the response for tracing; it does not call recordUsage.

Suggested fix: Either:

  1. Have captureModel itself call recordUsage (making it self-contained), or
  2. Document that GenerateWith requires an instrumented model and add a runtime check / panic, or
  3. Wrap the model in instrumentedModel inside GenerateWith unconditionally.

Finding 3: model/call.go:109majordomo.Generate[T] is called but the function's existence in the majordomo package is unverifiable from this repo

Unverified — I cannot check the majordomo module's API. The call majordomo.Generate[T](ctx, capture, req, opts...) assumes a generic function Generate[T any](ctx context.Context, model llm.Model, req llm.Request, opts ...llm.Option) (T, error) exists in the majordomo package. If this function does not exist or has a different signature, this code will not compile. The go build ./... claim in the PR description suggests it does exist, but I cannot confirm from the checked-out code.


Finding 4: model/llms.go:255-262 — Registry resolver closure captures defaultResolver by variable, not by value — safe today but fragile

Verified by reading model/llms.go:255-262 and model/tiers.go:125-134.

The closure:

r.RegisterResolver(majordomo.ResolverFunc(func(name string) (string, bool) {
    res := defaultResolver
    if res == nil { return "", false }
    spec, _, ok := res.Resolve(name)
    return spec, ok
}))

This reads defaultResolver at call time, not at registration time. This is intentional — Configure (tiers.go:132) replaces defaultResolver after buildRegistry runs. However, the res := defaultResolver local copy is unnecessary since defaultResolver is never set to nil after Configure (it's always a valid *tierResolver). The nil check is dead code after the first Configure call. Not a bug, but misleading.


Finding 5: model/llms.go:297-322ParseModelRequest calls splitReasoningSpec then separately re-resolves the tier for its reasoning level, creating a double-resolve

Verified by reading model/llms.go:297-322 and model/tiers.go:87-110.

splitReasoningSpec(spec) strips reasoning suffixes and returns (clean, level). Then if level == "", the code calls defaultResolver.Resolve(clean) again to get the tier's default reasoning level. This means the tier is resolved twice — once implicitly via Registry().Parse(clean) (which calls the resolver closure) and once explicitly here. The two resolves are not guaranteed to be consistent if the config source changes between them (though the TTL cache mitigates this).

Impact: In theory, a racing config update could cause the reasoning level to be read from a different tier value than the one used for the actual model resolution. Extremely unlikely in practice (config changes are rare and the TTL is 30s), but a semantic inconsistency.

Suggested fix: Resolve the tier once and pass both the spec and level to the registry, or restructure to avoid the double-resolve.


Summary

No blocking correctness issues found. The code is well-structured and the inversion pattern (convar→config.Source, UsageSink/TraceSink seams) is correctly implemented. The findings above are minor: a potential ledger attribution skew (finding 1), a silent under-counting risk (finding 2), and a theoretical race (finding 5). Finding 3 is unverifiable from this repo alone.

🧯 Error handling & edge cases — Minor issues

All five findings have been independently verified against the actual code. Here is the corrected review:

🧯 Error handling & edge cases — Review

Verdict: Minor issues


1. ParseModelForContext failure mislabeled as tier_not_allowed — conflates two distinct failure modes

llmmeta/helper.go:288-308 — When model.ParseModelForContext(ctx, tier) fails, the code records a ledger row and returns CallResult with ErrorKind: ErrorKindTierNotAllowed. But the tier allowlist check (tierAllowed) already passed at line 276, so the tier is allowed. The failure is in resolution/parsing — e.g., the tier resolved to an invalid spec like "anthropic/nonexistent-model", or the registry couldn't parse it.

This conflates "tier not in the allowlist" (a policy rejection) with "tier resolution/parsing failed" (a configuration or infrastructure error). An operator seeing tier_not_allowed would look at the allowlist convar, not at the tier's model spec — wasting debugging time.

Suggested fix: Use a distinct error kind for the resolution-failure path, e.g. "tier_resolution_failed" or reuse ErrorKindLLMUnavailable. The ledger row should carry the actual error from ParseModelForContext (or at least a more precise kind) so operators can distinguish "tier not allowed" from "tier spec is broken."


2. recordUsage calls usageSink.Record even when model name is empty/"unknown"

model/call.go:244-254 — When resolvedModelName returns "" or "unknown", the function logs a slog.Warn but still calls usageSink.Record(ctx, model, ...) with that empty/unknown string. If the host's UsageSink implementation doesn't guard against empty model names (e.g., a DB insert with a non-nullable column), this could cause a silent data integrity failure or a panic in the sink.

Suggested fix: Either return after the warning when model == "" (skip recording for unidentifiable models), or document on UsageSink that the model parameter may be empty and sinks must handle it. The current code does neither.


3. tryParseJSON fence-stripping: opening fence with no newline silently produces empty string

llmmeta/helper.go:599-608 — If the response text is just ```json (no newline, no content), strings.Index(trimmed, "\n") returns -1, so the if body is skipped. Then strings.LastIndex(trimmed, "```") finds the opening fence at index 0, and trimmed = trimmed[:0] produces "". json.Unmarshal([]byte(""), &parsed) fails, returning (nil, false). This is safe (graceful degradation), but the fence-stripping logic has a subtle off-by-one: when idx == -1, trimmed[idx+1:] is trimmed[0:] which is the whole string, not the content after the fence. The code works by accident because the subsequent LastIndex re-strips the same fence.

Suggested fix: Add an explicit guard: if idx < 0 { return nil, false } after strings.Index(trimmed, "\n") to make the intent clear and avoid relying on the accidental double-strip.


4. SetCompleteForTest has no synchronization — data race if tests run in parallel

llmmeta/helper.go:546-550 — The package-level completeOverride variable is written and read without any mutex or atomic. If any test in the llmmeta package (or in downstream packages that import llmmeta and call SetCompleteForTest) runs with t.Parallel(), this is a data race. The current tests don't use t.Parallel(), so this is latent, not active — but it's a ticking bomb for future test authors.

Suggested fix: Either document that SetCompleteForTest is not safe for parallel tests, or add a sync.Mutex guard. A simple sync.Mutex is cheap and prevents the surprise.


5. recordLedger silently drops storage errors — intentional but undocumented contract with Storage

llmmeta/helper.go:573-578 — The function does _ = h.storage.RecordMetaCall(ctx, call), discarding the error. The comment says "Storage failures are logged at the storage layer." But the Storage interface's RecordMetaCall returns an error — there's no contract that the implementation must log the error itself. If a future Storage implementation returns an error without logging it (e.g., a simple in-memory buffer that returns an error on overflow), the error is silently lost.

Suggested fix: At minimum, log the error at the call site: if err := h.storage.RecordMetaCall(ctx, call); err != nil { slog.Warn("llmmeta: failed to record ledger row", "err", err) }. This makes the helper self-diagnosing regardless of the storage implementation.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 15m 4s

<!-- gadfly-review:ollama:deepseek-v4-flash:cloud --> ### 🪰 Gadfly review — `deepseek-v4-flash:cloud` (ollama-cloud) **Verdict: Minor issues** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — No material issues found</summary> ## 🔒 Security Review **Verdict: No material issues found** After tracing the full data flow through the new code, I find no exploitable security vulnerabilities in this PR. Here's what I verified: ### What I checked **API key handling** (`model/llms.go:144-175`, `model/cloud_sync.go:430-435`): API keys are read from environment variables at boot time via `os.Getenv()` and passed directly to provider constructors. The `CloudOllamaLimitCache.applyAuth()` sets a Bearer token on outbound requests. Keys are stored in-memory as plain strings — standard Go practice, no credential leakage path in this code. **SSRF risk** (`model/cloud_sync.go:306-382`): The `CloudOllamaLimitCache` makes HTTP requests to a configurable `endpoint`. However, the endpoint is set once at construction time by the host (boot-time configuration), not from user input. The `modelName` parameter in `fetchContextLength` is used only in a properly JSON-encoded POST body (`json.Marshal`), never in URL construction — no injection vector. The `/api/tags` response is parsed and its model names are used only in subsequent `/api/show` POST bodies, again properly encoded. **Injection surfaces** (`llmmeta/helper.go:200-610`): The `CallSpec.UserPrompt` and `SystemPrompt` flow to the LLM API — expected behavior, not a SQL/command injection vector. The `MetaCall` fields (`RunID`, `SkillID`, `CallerID`, `ToolName`, `Tier`) are passed to the `Storage.RecordMetaCall` interface; injection risk depends on the storage implementation (GORM adapter), not this code. **Tier allowlist bypass** (`llmmeta/helper.go:560-575`): The `tierAllowed` check uses `strings.EqualFold` for case-insensitive comparison against the configured allowlist. No bypass possible — the comparison is correct and the default allowlist `["fast"]` is applied when the `ConvarReader` is nil. **Test hooks** (`llmmeta/helper.go:540-550`): `SetCompleteForTest` is exported and modifies a global `completeOverride`. In production this variable stays nil, so the real `model.Generate` path is always used. No production code path sets it. **Information disclosure** (`model/call.go:247-248`): The `recordUsage` function calls `debug.Stack()` in a `slog.Warn` when both model and tool are "unknown". This could leak internal code paths to logs in an edge case. Minor — the condition is rare (only fires when usage attribution is completely missing) and the logs are operator-facing, not user-facing. **No other issues found** in: `tryParseJSON` (safe `json.Unmarshal` into `any`), `marshalMessages`/`marshalToolCalls` (tracing serialization, by-design data capture), `buildRegistry` (standard env-var provider wiring), or `LoadEnv` (majordomo's `LLM_*` DSN parsing). </details> <details><summary><b>🎯 Correctness</b> — Minor issues</summary> ## 🎯 Correctness — Verdict: Minor issues ### Finding 1: `llmmeta/helper.go:283` — `resolvedModel` computed before `ParseModelForContext` may diverge from actual serving model on error path **Verified by reading** `llmmeta/helper.go:283-300` and `model/llms.go:413-460`. At line 283, `resolvedModel` is computed via `model.ResolveModelName(tier)`. Then at line 288, `model.ParseModelForContext(ctx, tier)` is called. If `ParseModelForContext` **succeeds**, the actual model used is the one returned by `ParseModelRequest`, which goes through `Registry().Parse(clean)` — a path that can resolve tier aliases, legacy shortcuts, and provider/model lookups differently than `ResolveModelName` does in edge cases (e.g., when a tier resolves to a chain and the first element's model name extraction differs between the two functions). **Impact:** The ledger row's `ModelUsed` field and the `CallResult.ModelUsed` could report a different model name than the one that actually served the request. This is a data-integrity issue for cost accounting and audit trails. **Suggested fix:** Compute `resolvedModel` *after* `ParseModelForContext` succeeds, using the resolved model's identity. For example, extract it from the returned `llm.Model` or from the context after `WithModel` stamps it: ```go ctx, llmModel, err := model.ParseModelForContext(ctx, tier) if err != nil { ... } resolvedModel := model.ResolveModelName(tier) // or better: derive from llmModel ``` --- ### Finding 2: `model/call.go:109` — `GenerateWith[T]` usage is recorded by `instrumentedModel` but `captureModel` wraps it, creating a fragile layering dependency **Verified by reading** `model/call.go:101-158` and `model/call.go:36-46`. `GenerateWith[T]` creates a `captureModel{inner: model}` where `model` is the caller-supplied `llm.Model`. The docstring says "models returned by `ParseModelRequest` / `ParseModelForContext` record usage automatically" — but `GenerateWith` does not enforce this. If a caller passes a model that is **not** an `instrumentedModel` (e.g., a raw model from `Registry().Parse()`), usage is silently **not recorded** with no warning. **Impact:** Silent under-counting of token usage for any caller that uses `GenerateWith` with a non-instrumented model. The `captureModel` only captures the response for tracing; it does not call `recordUsage`. **Suggested fix:** Either: 1. Have `captureModel` itself call `recordUsage` (making it self-contained), or 2. Document that `GenerateWith` requires an instrumented model and add a runtime check / panic, or 3. Wrap the model in `instrumentedModel` inside `GenerateWith` unconditionally. --- ### Finding 3: `model/call.go:109` — `majordomo.Generate[T]` is called but the function's existence in the `majordomo` package is unverifiable from this repo **Unverified** — I cannot check the `majordomo` module's API. The call `majordomo.Generate[T](ctx, capture, req, opts...)` assumes a generic function `Generate[T any](ctx context.Context, model llm.Model, req llm.Request, opts ...llm.Option) (T, error)` exists in the `majordomo` package. If this function does not exist or has a different signature, this code will not compile. The `go build ./...` claim in the PR description suggests it does exist, but I cannot confirm from the checked-out code. --- ### Finding 4: `model/llms.go:255-262` — Registry resolver closure captures `defaultResolver` by variable, not by value — safe today but fragile **Verified by reading** `model/llms.go:255-262` and `model/tiers.go:125-134`. The closure: ```go r.RegisterResolver(majordomo.ResolverFunc(func(name string) (string, bool) { res := defaultResolver if res == nil { return "", false } spec, _, ok := res.Resolve(name) return spec, ok })) ``` This reads `defaultResolver` at call time, not at registration time. This is intentional — `Configure` (tiers.go:132) replaces `defaultResolver` after `buildRegistry` runs. However, the `res := defaultResolver` local copy is unnecessary since `defaultResolver` is never set to nil after `Configure` (it's always a valid `*tierResolver`). The nil check is dead code after the first `Configure` call. Not a bug, but misleading. --- ### Finding 5: `model/llms.go:297-322` — `ParseModelRequest` calls `splitReasoningSpec` then separately re-resolves the tier for its reasoning level, creating a double-resolve **Verified by reading** `model/llms.go:297-322` and `model/tiers.go:87-110`. `splitReasoningSpec(spec)` strips reasoning suffixes and returns `(clean, level)`. Then if `level == ""`, the code calls `defaultResolver.Resolve(clean)` again to get the tier's default reasoning level. This means the tier is resolved **twice** — once implicitly via `Registry().Parse(clean)` (which calls the resolver closure) and once explicitly here. The two resolves are not guaranteed to be consistent if the config source changes between them (though the TTL cache mitigates this). **Impact:** In theory, a racing config update could cause the reasoning level to be read from a different tier value than the one used for the actual model resolution. Extremely unlikely in practice (config changes are rare and the TTL is 30s), but a semantic inconsistency. **Suggested fix:** Resolve the tier once and pass both the spec and level to the registry, or restructure to avoid the double-resolve. --- ### Summary No blocking correctness issues found. The code is well-structured and the inversion pattern (convar→config.Source, UsageSink/TraceSink seams) is correctly implemented. The findings above are minor: a potential ledger attribution skew (finding 1), a silent under-counting risk (finding 2), and a theoretical race (finding 5). Finding 3 is unverifiable from this repo alone. </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> All five findings have been independently verified against the actual code. Here is the corrected review: ## 🧯 Error handling & edge cases — Review **Verdict: Minor issues** --- ### 1. `ParseModelForContext` failure mislabeled as `tier_not_allowed` — conflates two distinct failure modes **`llmmeta/helper.go:288-308`** — When `model.ParseModelForContext(ctx, tier)` fails, the code records a ledger row and returns `CallResult` with `ErrorKind: ErrorKindTierNotAllowed`. But the tier allowlist check (`tierAllowed`) already passed at line 276, so the tier **is** allowed. The failure is in resolution/parsing — e.g., the tier resolved to an invalid spec like `"anthropic/nonexistent-model"`, or the registry couldn't parse it. This conflates "tier not in the allowlist" (a policy rejection) with "tier resolution/parsing failed" (a configuration or infrastructure error). An operator seeing `tier_not_allowed` would look at the allowlist convar, not at the tier's model spec — wasting debugging time. **Suggested fix**: Use a distinct error kind for the resolution-failure path, e.g. `"tier_resolution_failed"` or reuse `ErrorKindLLMUnavailable`. The ledger row should carry the actual error from `ParseModelForContext` (or at least a more precise kind) so operators can distinguish "tier not allowed" from "tier spec is broken." --- ### 2. `recordUsage` calls `usageSink.Record` even when model name is empty/"unknown" **`model/call.go:244-254`** — When `resolvedModelName` returns `""` or `"unknown"`, the function logs a `slog.Warn` but **still** calls `usageSink.Record(ctx, model, ...)` with that empty/unknown string. If the host's `UsageSink` implementation doesn't guard against empty model names (e.g., a DB insert with a non-nullable column), this could cause a silent data integrity failure or a panic in the sink. **Suggested fix**: Either `return` after the warning when `model == ""` (skip recording for unidentifiable models), or document on `UsageSink` that the model parameter may be empty and sinks must handle it. The current code does neither. --- ### 3. `tryParseJSON` fence-stripping: opening fence with no newline silently produces empty string **`llmmeta/helper.go:599-608`** — If the response text is just `` ```json `` (no newline, no content), `strings.Index(trimmed, "\n")` returns -1, so the `if` body is skipped. Then `strings.LastIndex(trimmed, "```")` finds the opening fence at index 0, and `trimmed = trimmed[:0]` produces `""`. `json.Unmarshal([]byte(""), &parsed)` fails, returning `(nil, false)`. This is **safe** (graceful degradation), but the fence-stripping logic has a subtle off-by-one: when `idx == -1`, `trimmed[idx+1:]` is `trimmed[0:]` which is the whole string, not the content after the fence. The code works by accident because the subsequent `LastIndex` re-strips the same fence. **Suggested fix**: Add an explicit guard: `if idx < 0 { return nil, false }` after `strings.Index(trimmed, "\n")` to make the intent clear and avoid relying on the accidental double-strip. --- ### 4. `SetCompleteForTest` has no synchronization — data race if tests run in parallel **`llmmeta/helper.go:546-550`** — The package-level `completeOverride` variable is written and read without any mutex or atomic. If any test in the `llmmeta` package (or in downstream packages that import `llmmeta` and call `SetCompleteForTest`) runs with `t.Parallel()`, this is a data race. The current tests don't use `t.Parallel()`, so this is latent, not active — but it's a ticking bomb for future test authors. **Suggested fix**: Either document that `SetCompleteForTest` is not safe for parallel tests, or add a `sync.Mutex` guard. A simple `sync.Mutex` is cheap and prevents the surprise. --- ### 5. `recordLedger` silently drops storage errors — intentional but undocumented contract with `Storage` **`llmmeta/helper.go:573-578`** — The function does `_ = h.storage.RecordMetaCall(ctx, call)`, discarding the error. The comment says "Storage failures are logged at the storage layer." But the `Storage` interface's `RecordMetaCall` returns an `error` — there's no contract that the implementation must log the error itself. If a future `Storage` implementation returns an error without logging it (e.g., a simple in-memory buffer that returns an error on overflow), the error is silently lost. **Suggested fix**: At minimum, log the error at the call site: `if err := h.storage.RecordMetaCall(ctx, call); err != nil { slog.Warn("llmmeta: failed to record ledger row", "err", err) }`. This makes the helper self-diagnosing regardless of the storage implementation. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 15m 4s</sub>

🪰 Gadfly review — m5max/qwen3.6:35b-mlx (m5max)

Review incomplete — all lenses errored — 3 reviewers: security, correctness, error-handling

🔒 Security⚠️ could not complete

⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
m5max/qwen3.6:35b-mlx: ollama m5max/qwen3.6:35b-mlx: do request: Post "http://192.168.0.173:11434/api/chat": dial tcp 192.168.0.173:11434: connect: no route to host

🎯 Correctness⚠️ could not complete

⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
m5max/qwen3.6:35b-mlx: skipped (backed off until 23:47:51.517)

🧯 Error handling & edge cases⚠️ could not complete

⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
m5max/qwen3.6:35b-mlx: skipped (backed off until 23:47:51.517)

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 6s

<!-- gadfly-review:ollama:m5max/qwen3.6:35b-mlx --> ### 🪰 Gadfly review — `m5max/qwen3.6:35b-mlx` (m5max) **Review incomplete — all lenses errored** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed m5max/qwen3.6:35b-mlx: ollama m5max/qwen3.6:35b-mlx: do request: Post "http://192.168.0.173:11434/api/chat": dial tcp 192.168.0.173:11434: connect: no route to host </details> <details><summary><b>🎯 Correctness</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed m5max/qwen3.6:35b-mlx: skipped (backed off until 23:47:51.517) </details> <details><summary><b>🧯 Error handling & edge cases</b> — ⚠️ could not complete</summary> ⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed m5max/qwen3.6:35b-mlx: skipped (backed off until 23:47:51.517) </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 6s</sub>

🪰 Gadfly review — minimax-m3:cloud (ollama-cloud)

Verdict: Minor issues — 3 reviewers: security, correctness, error-handling

🔒 Security — Minor issues

VERDICT: Minor issues

Confirmed against model/cloud_sync.go:

  • model/cloud_sync.go:317 / :370 — unbounded response read. fetchTags (line 317) and fetchContextLength (line 370) both call io.ReadAll(resp.Body) with no http.MaxBytesReader and no explicit Content-Length cap. The client is only bounded by the 15-second wall-clock timeout (NewCloudOllamaLimitCache, line 80). A hijacked or misbehaving endpoint can stream an unbounded payload before the timeout fires, exhausting memory on a long-lived process. Suggested fix: cap with io.LimitReader(resp.Body, 1<<20) (or similar) before ReadAll.

  • model/cloud_sync.go:74-91 — endpoint URL accepted without scheme allowlist. NewCloudOllamaLimitCache(endpoint, apiKey, httpClient) accepts any non-empty string for endpoint; it only checks non-empty and trims a trailing /. There is no https:// requirement and no host allowlist. applyAuth (lines 430-435) then attaches Authorization: Bearer <apiKey> to whatever URL is configured. A misconfiguration — e.g. a typo (http://…), a convar override pointed at an internal HTTP service, or anyone able to set the env var — silently downgrades the channel and exfiltrates the bearer token in cleartext. The operational default https://ollama.com (line 41) is safe but is not enforced. Suggested fix: reject endpoints whose scheme is not https, with a constructor-time error.

🎯 Correctness — No material issues found

Correctness Review — P1 model + llmmeta

VERDICT: No material issues found

(Note: the two issues claimed in the draft — resolvedModel returning the bare tier alias and the JSON-fence stripping with backticks in string values — were the only substantive items, but neither survived verification against the actual code: model.ResolveModelName("fast") returns the resolved model portion via defaultResolver.Resolve and strings.Index(spec, "/"), not the bare alias; and the fence-stripping behavior, while a real tolerance trade-off, is documented as such and only fires on responses that start with a leading fence. All other draft items were either restating correct behavior or non-issues.)

🧯 Error handling & edge cases — Minor issues

VERDICT: Minor issues

Findings (error handling & edge cases lens)

  • model/call.go:240 — cache-only token usage is silently dropped. recordUsage short-circuits when u.InputTokens == 0 && u.OutputTokens == 0, but a response carrying only CacheReadTokens/CacheWriteTokens (cache hit, no fresh tokens) still hits the early return and never reaches usageSink.Record. On providers that price by cache reads (Anthropic prompt caching, OpenAI cached-input billing), a fully-cached call records zero usage. Fix: extend the guard to if u.InputTokens == 0 && u.OutputTokens == 0 && u.CacheReadTokens == 0 && u.CacheWriteTokens == 0 — or drop the guard entirely, since the sink is allowed to be a no-op for zero usage.

  • model/call.go:243-254model == "unknown" is recorded, not rejected. When resolvedModelName returns "unknown" or "" (no WithModel, no resp.Model), the code logs a warning and then still calls usageSink.Record(ctx, model, …). The warning suggests this is a misconfiguration, but the sink now aggregates every unattributed call under the same "unknown" key — silently merging billing-relevant rows into one bucket. Fix: either skip recording (returning early after the warning) or attribute to a per-process caller (e.g. compose model + toolFromContext) so rows are distinguishable.

  • llmmeta/helper.go:276-281 vs :283-309tier_not_allowed ledger-row semantics are inconsistent. The package docstring and CallSpec.Tier comment say disallowed tiers "do not write a ledger row (the call did not happen)". The early tierAllowed rejection at helper.go:276 honours this. The later path at helper.go:288-309 (where model.ParseModelForContext errors on a mis-set tier) writes a row with ErrorKind: "tier_not_allowed" and surfaces ErrorKindTierNotAllowed to the caller. TestCall_TierNotAllowed only pins the first path's "no row" behaviour, so the divergence is uncaught. Pick one: either both paths record (with a reason sub-field to distinguish), or neither does — and update the test/docstring to match.

  • llmmeta/helper.go:567-578recordLedger silently swallows storage errors. The comment "Storage failures are logged at the storage layer" is enforced by nothing in the Storage interface. A host's Storage implementation that forgets to log silently drops audit rows on DB hiccups. Either: (a) add an explicit contract in the interface comment that errors must be logged before being returned, or (b) log at this layer (slog.Warn("llmmeta: ledger write failed", "err", err)) and document that the helper does so.

  • model/tiers.go:91,108Resolve cache expires is computed from a now captured before the locks. now := r.now() is read at line 91 (before the RLock peek and the subsequent upgrade to write Lock). The fresh cache entry at line 108 stores expires: now.Add(r.ttl) using the same pre-lock now. Both expiry checks (lines 94 and 102) use the same now so the round-trip in this caller is internally consistent, but a different goroutine calling Resolve shortly after will read its own (later) now and may find the freshly-stored entry "expired" if lock wait > TTL. Not a real-world bug at the 30s default, but the comment should pin this invariant so a future refactor that recomputes now after the lock doesn't silently introduce immediate-stale entries.

  • model/bench.go:67-75UnbenchModel clears manual AND automatic benches unconditionally. Health().Unbench(model) removes whatever bench the tracker holds (manual or automatic); delete(manualBenches, model) then drops the manual marker. Calling UnbenchModel on a model that was auto-benched (e.g. transient outage) silently removes the auto-backoff too, potentially masking a still-degraded provider. The return value wasBenched reports the pre-clear state but the caller can't tell manual-vs-auto from it. Document this as "clears ALL benches" or return a separate flag.

  • llmmeta/helper.go:593-610tryParseJSON code-fence stripping is fragile on opening-only fences. When the response starts with ```json (with no newline between the language tag and the body), strings.Index(trimmed, "\n") returns -1, the opening fence isn't stripped, and strings.LastIndex(trimmed, "```") then matches the opening fence itself — trimmed becomes empty and parse fails, firing the retry unnecessarily. Edge-case but documented "code fence tolerance" implies it should handle the no-newline form.

Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5m 48s

<!-- gadfly-review:ollama:minimax-m3:cloud --> ### 🪰 Gadfly review — `minimax-m3:cloud` (ollama-cloud) **Verdict: Minor issues** — 3 reviewers: security, correctness, error-handling <details><summary><b>🔒 Security</b> — Minor issues</summary> **VERDICT: Minor issues** Confirmed against `model/cloud_sync.go`: - **`model/cloud_sync.go:317` / `:370` — unbounded response read.** `fetchTags` (line 317) and `fetchContextLength` (line 370) both call `io.ReadAll(resp.Body)` with no `http.MaxBytesReader` and no explicit `Content-Length` cap. The client is only bounded by the 15-second wall-clock timeout (`NewCloudOllamaLimitCache`, line 80). A hijacked or misbehaving endpoint can stream an unbounded payload before the timeout fires, exhausting memory on a long-lived process. Suggested fix: cap with `io.LimitReader(resp.Body, 1<<20)` (or similar) before `ReadAll`. - **`model/cloud_sync.go:74-91` — endpoint URL accepted without scheme allowlist.** `NewCloudOllamaLimitCache(endpoint, apiKey, httpClient)` accepts any non-empty string for `endpoint`; it only checks non-empty and trims a trailing `/`. There is no `https://` requirement and no host allowlist. `applyAuth` (lines 430-435) then attaches `Authorization: Bearer <apiKey>` to whatever URL is configured. A misconfiguration — e.g. a typo (`http://…`), a convar override pointed at an internal HTTP service, or anyone able to set the env var — silently downgrades the channel and exfiltrates the bearer token in cleartext. The operational default `https://ollama.com` (line 41) is safe but is not enforced. Suggested fix: reject endpoints whose scheme is not `https`, with a constructor-time error. </details> <details><summary><b>🎯 Correctness</b> — No material issues found</summary> # Correctness Review — P1 model + llmmeta VERDICT: No material issues found (Note: the two issues claimed in the draft — `resolvedModel` returning the bare tier alias and the JSON-fence stripping with backticks in string values — were the only substantive items, but neither survived verification against the actual code: `model.ResolveModelName("fast")` returns the resolved model portion via `defaultResolver.Resolve` and `strings.Index(spec, "/")`, not the bare alias; and the fence-stripping behavior, while a real tolerance trade-off, is documented as such and only fires on responses that start with a leading fence. All other draft items were either restating correct behavior or non-issues.) </details> <details><summary><b>🧯 Error handling & edge cases</b> — Minor issues</summary> ## VERDICT: Minor issues ### Findings (error handling & edge cases lens) - **`model/call.go:240` — cache-only token usage is silently dropped.** `recordUsage` short-circuits when `u.InputTokens == 0 && u.OutputTokens == 0`, but a response carrying only `CacheReadTokens`/`CacheWriteTokens` (cache hit, no fresh tokens) still hits the early return and never reaches `usageSink.Record`. On providers that price by cache reads (Anthropic prompt caching, OpenAI cached-input billing), a fully-cached call records zero usage. Fix: extend the guard to `if u.InputTokens == 0 && u.OutputTokens == 0 && u.CacheReadTokens == 0 && u.CacheWriteTokens == 0` — or drop the guard entirely, since the sink is allowed to be a no-op for zero usage. - **`model/call.go:243-254` — `model == "unknown"` is recorded, not rejected.** When `resolvedModelName` returns `"unknown"` or `""` (no `WithModel`, no `resp.Model`), the code logs a warning *and then still calls* `usageSink.Record(ctx, model, …)`. The warning suggests this is a misconfiguration, but the sink now aggregates every unattributed call under the same `"unknown"` key — silently merging billing-relevant rows into one bucket. Fix: either skip recording (returning early after the warning) or attribute to a per-process caller (e.g. compose `model` + `toolFromContext`) so rows are distinguishable. - **`llmmeta/helper.go:276-281` vs `:283-309` — `tier_not_allowed` ledger-row semantics are inconsistent.** The package docstring and `CallSpec.Tier` comment say disallowed tiers "do not write a ledger row (the call did not happen)". The early `tierAllowed` rejection at `helper.go:276` honours this. The later path at `helper.go:288-309` (where `model.ParseModelForContext` errors on a mis-set tier) writes a row with `ErrorKind: "tier_not_allowed"` *and* surfaces `ErrorKindTierNotAllowed` to the caller. `TestCall_TierNotAllowed` only pins the first path's "no row" behaviour, so the divergence is uncaught. Pick one: either both paths record (with a `reason` sub-field to distinguish), or neither does — and update the test/docstring to match. - **`llmmeta/helper.go:567-578` — `recordLedger` silently swallows storage errors.** The comment "Storage failures are logged at the storage layer" is enforced by nothing in the `Storage` interface. A host's `Storage` implementation that forgets to log silently drops audit rows on DB hiccups. Either: (a) add an explicit contract in the interface comment that errors *must* be logged before being returned, or (b) log at this layer (`slog.Warn("llmmeta: ledger write failed", "err", err)`) and document that the helper does so. - **`model/tiers.go:91,108` — `Resolve` cache `expires` is computed from a `now` captured before the locks.** `now := r.now()` is read at line 91 (before the RLock peek and the subsequent upgrade to write Lock). The fresh cache entry at line 108 stores `expires: now.Add(r.ttl)` using the same pre-lock `now`. Both expiry checks (lines 94 and 102) use the same `now` so the round-trip in this caller is internally consistent, but a different goroutine calling `Resolve` shortly after will read its own (later) `now` and may find the freshly-stored entry "expired" if lock wait > TTL. Not a real-world bug at the 30s default, but the comment should pin this invariant so a future refactor that recomputes `now` after the lock doesn't silently introduce immediate-stale entries. - **`model/bench.go:67-75` — `UnbenchModel` clears manual AND automatic benches unconditionally.** `Health().Unbench(model)` removes whatever bench the tracker holds (manual or automatic); `delete(manualBenches, model)` then drops the manual marker. Calling `UnbenchModel` on a model that was auto-benched (e.g. transient outage) silently removes the auto-backoff too, potentially masking a still-degraded provider. The return value `wasBenched` reports the pre-clear state but the caller can't tell manual-vs-auto from it. Document this as "clears ALL benches" or return a separate flag. - **`llmmeta/helper.go:593-610` — `tryParseJSON` code-fence stripping is fragile on opening-only fences.** When the response starts with ```` ```json ```` (with no newline between the language tag and the body), `strings.Index(trimmed, "\n")` returns `-1`, the opening fence isn't stripped, and `strings.LastIndex(trimmed, "```")` then matches the opening fence itself — `trimmed` becomes empty and parse fails, firing the retry unnecessarily. Edge-case but documented "code fence tolerance" implies it should handle the no-newline form. </details> <sub>Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5m 48s</sub>
steve merged commit b424261aca into main 2026-06-27 00:30:39 +00:00
steve deleted branch phase-1-model-layer 2026-06-27 00:30:39 +00:00
Sign in to join this conversation.
No Reviewers
No Label
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: steve/executus#1