P1: model layer (convar→config inversion) + llmmeta #1
Reference in New Issue
Block a user
Delete Branch "phase-1-model-layer"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Phase 1 of the executus extraction (see
CLAUDE.mdroadmap). Lifts mort's model-access layer intoexecutus/model, fully decoupled from mort.What's here
tool/(already on main) — the skilltools registry core moved verbatim (registry, 3-stage permissions, gated-tool, ssrf, hmac, encryption). Clean, no mort coupling.model/— mort'spkg/logic/llmsported with the key inversions:config.Source: the tier resolver readsmodel.tier.<name>from a host-supplied config source with host-supplied fallbacks (Configure(cfg, defaults, ttl)). Tier names + specs are host config; the resolution mechanism (cache, reasoning-suffix dialect, chain validation) is generic. No tier names hard-coded in the harness.UsageSink/TraceSinkseams + a model-ownedSpan, with nil-safe context attribution helpers. Both sinks optional (nil = off) so a light host (gadfly) records nothing.executus/lane;utils.Errorf→fmt.Errorf.GenerateWith[T](instrumented structured output) retained — this is the structured-output primitive, no separate package.llmmeta/— the meta-LLM helper (tier allowlist + JSON retry + ledger) overmodel/.Verification
go build ./... && go vet ./... && go test -race ./...all green.go.sumfree of gorm/redis/discordgo/sqlite.Notes for review
Some doc comments inside
model/still reference mort specifics (env key names, "convar", "mort's executors") — they're carried over and slated for a cleanup pass; flag any that are misleading. This is a simple-system gadfly review — findings are advisory and will be double-checked.🤖 Generated with Claude Code
🪰 Gadfly review —
m1pro/qwen3:14b(m1pro)Verdict: Blocking issues found · ⚠️ 1/3 lens(es) errored — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
No material issues found
After a thorough review of the code changes through the security lens, no material security issues were found. The code appears to handle input validation, usage tracking, and secure data handling appropriately. The use of UUIDs, proper error handling, and reliance on underlying libraries for secure operations (like LLM injection prevention) are in place. The code also includes tests for various security-related scenarios, such as JSON parsing and tier restrictions.
🎯 Correctness — Blocking issues found
Blocking issues found
llmmeta/helper.go:72: Incorrect error classification when model resolution fails. WhenParseModelForContextfails (e.g., due to a misconfigured convar), the code records a ledger row withErrorKindTierNotAllowed, but this is a model resolution error, not a tier disallowance. This misclassification could lead to incorrect error handling and auditing. Fix: Introduce a distinct error kind likemodel_resolution_errorand update the ledger row accordingly.llmmeta/helper.go:19: Documentation discrepancy. TheNewconstructor's comment claims that passing a nilstoragewill cause everyCallto write a no-op ledger row, but the actual code skips ledger recording whenstorageis nil. This inconsistency could lead to incorrect assumptions about behavior. Fix: Update the comment to match the implementation, or modify the code to record no-op rows whenstorageis nil (if that's the intended behavior).🧯 Error handling & edge cases — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
m1pro/qwen3:14b: ollama m1pro/qwen3:14b: do request: Post "http://192.168.0.175:11434/api/chat": context deadline exceeded
Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 26m 26s
🪰 Gadfly review —
deepseek-v4-flash:cloud(ollama-cloud)Verdict: Minor issues — 3 reviewers: security, correctness, error-handling
🔒 Security — No material issues found
🔒 Security Review
Verdict: No material issues found
After tracing the full data flow through the new code, I find no exploitable security vulnerabilities in this PR. Here's what I verified:
What I checked
API key handling (
model/llms.go:144-175,model/cloud_sync.go:430-435): API keys are read from environment variables at boot time viaos.Getenv()and passed directly to provider constructors. TheCloudOllamaLimitCache.applyAuth()sets a Bearer token on outbound requests. Keys are stored in-memory as plain strings — standard Go practice, no credential leakage path in this code.SSRF risk (
model/cloud_sync.go:306-382): TheCloudOllamaLimitCachemakes HTTP requests to a configurableendpoint. However, the endpoint is set once at construction time by the host (boot-time configuration), not from user input. ThemodelNameparameter infetchContextLengthis used only in a properly JSON-encoded POST body (json.Marshal), never in URL construction — no injection vector. The/api/tagsresponse is parsed and its model names are used only in subsequent/api/showPOST bodies, again properly encoded.Injection surfaces (
llmmeta/helper.go:200-610): TheCallSpec.UserPromptandSystemPromptflow to the LLM API — expected behavior, not a SQL/command injection vector. TheMetaCallfields (RunID,SkillID,CallerID,ToolName,Tier) are passed to theStorage.RecordMetaCallinterface; injection risk depends on the storage implementation (GORM adapter), not this code.Tier allowlist bypass (
llmmeta/helper.go:560-575): ThetierAllowedcheck usesstrings.EqualFoldfor case-insensitive comparison against the configured allowlist. No bypass possible — the comparison is correct and the default allowlist["fast"]is applied when theConvarReaderis nil.Test hooks (
llmmeta/helper.go:540-550):SetCompleteForTestis exported and modifies a globalcompleteOverride. In production this variable stays nil, so the realmodel.Generatepath is always used. No production code path sets it.Information disclosure (
model/call.go:247-248): TherecordUsagefunction callsdebug.Stack()in aslog.Warnwhen both model and tool are "unknown". This could leak internal code paths to logs in an edge case. Minor — the condition is rare (only fires when usage attribution is completely missing) and the logs are operator-facing, not user-facing.No other issues found in:
tryParseJSON(safejson.Unmarshalintoany),marshalMessages/marshalToolCalls(tracing serialization, by-design data capture),buildRegistry(standard env-var provider wiring), orLoadEnv(majordomo'sLLM_*DSN parsing).🎯 Correctness — Minor issues
🎯 Correctness — Verdict: Minor issues
Finding 1:
llmmeta/helper.go:283—resolvedModelcomputed beforeParseModelForContextmay diverge from actual serving model on error pathVerified by reading
llmmeta/helper.go:283-300andmodel/llms.go:413-460.At line 283,
resolvedModelis computed viamodel.ResolveModelName(tier). Then at line 288,model.ParseModelForContext(ctx, tier)is called. IfParseModelForContextsucceeds, the actual model used is the one returned byParseModelRequest, which goes throughRegistry().Parse(clean)— a path that can resolve tier aliases, legacy shortcuts, and provider/model lookups differently thanResolveModelNamedoes in edge cases (e.g., when a tier resolves to a chain and the first element's model name extraction differs between the two functions).Impact: The ledger row's
ModelUsedfield and theCallResult.ModelUsedcould report a different model name than the one that actually served the request. This is a data-integrity issue for cost accounting and audit trails.Suggested fix: Compute
resolvedModelafterParseModelForContextsucceeds, using the resolved model's identity. For example, extract it from the returnedllm.Modelor from the context afterWithModelstamps it:Finding 2:
model/call.go:109—GenerateWith[T]usage is recorded byinstrumentedModelbutcaptureModelwraps it, creating a fragile layering dependencyVerified by reading
model/call.go:101-158andmodel/call.go:36-46.GenerateWith[T]creates acaptureModel{inner: model}wheremodelis the caller-suppliedllm.Model. The docstring says "models returned byParseModelRequest/ParseModelForContextrecord usage automatically" — butGenerateWithdoes not enforce this. If a caller passes a model that is not aninstrumentedModel(e.g., a raw model fromRegistry().Parse()), usage is silently not recorded with no warning.Impact: Silent under-counting of token usage for any caller that uses
GenerateWithwith a non-instrumented model. ThecaptureModelonly captures the response for tracing; it does not callrecordUsage.Suggested fix: Either:
captureModelitself callrecordUsage(making it self-contained), orGenerateWithrequires an instrumented model and add a runtime check / panic, orinstrumentedModelinsideGenerateWithunconditionally.Finding 3:
model/call.go:109—majordomo.Generate[T]is called but the function's existence in themajordomopackage is unverifiable from this repoUnverified — I cannot check the
majordomomodule's API. The callmajordomo.Generate[T](ctx, capture, req, opts...)assumes a generic functionGenerate[T any](ctx context.Context, model llm.Model, req llm.Request, opts ...llm.Option) (T, error)exists in themajordomopackage. If this function does not exist or has a different signature, this code will not compile. Thego build ./...claim in the PR description suggests it does exist, but I cannot confirm from the checked-out code.Finding 4:
model/llms.go:255-262— Registry resolver closure capturesdefaultResolverby variable, not by value — safe today but fragileVerified by reading
model/llms.go:255-262andmodel/tiers.go:125-134.The closure:
This reads
defaultResolverat call time, not at registration time. This is intentional —Configure(tiers.go:132) replacesdefaultResolverafterbuildRegistryruns. However, theres := defaultResolverlocal copy is unnecessary sincedefaultResolveris never set to nil afterConfigure(it's always a valid*tierResolver). The nil check is dead code after the firstConfigurecall. Not a bug, but misleading.Finding 5:
model/llms.go:297-322—ParseModelRequestcallssplitReasoningSpecthen separately re-resolves the tier for its reasoning level, creating a double-resolveVerified by reading
model/llms.go:297-322andmodel/tiers.go:87-110.splitReasoningSpec(spec)strips reasoning suffixes and returns(clean, level). Then iflevel == "", the code callsdefaultResolver.Resolve(clean)again to get the tier's default reasoning level. This means the tier is resolved twice — once implicitly viaRegistry().Parse(clean)(which calls the resolver closure) and once explicitly here. The two resolves are not guaranteed to be consistent if the config source changes between them (though the TTL cache mitigates this).Impact: In theory, a racing config update could cause the reasoning level to be read from a different tier value than the one used for the actual model resolution. Extremely unlikely in practice (config changes are rare and the TTL is 30s), but a semantic inconsistency.
Suggested fix: Resolve the tier once and pass both the spec and level to the registry, or restructure to avoid the double-resolve.
Summary
No blocking correctness issues found. The code is well-structured and the inversion pattern (convar→config.Source, UsageSink/TraceSink seams) is correctly implemented. The findings above are minor: a potential ledger attribution skew (finding 1), a silent under-counting risk (finding 2), and a theoretical race (finding 5). Finding 3 is unverifiable from this repo alone.
🧯 Error handling & edge cases — Minor issues
All five findings have been independently verified against the actual code. Here is the corrected review:
🧯 Error handling & edge cases — Review
Verdict: Minor issues
1.
ParseModelForContextfailure mislabeled astier_not_allowed— conflates two distinct failure modesllmmeta/helper.go:288-308— Whenmodel.ParseModelForContext(ctx, tier)fails, the code records a ledger row and returnsCallResultwithErrorKind: ErrorKindTierNotAllowed. But the tier allowlist check (tierAllowed) already passed at line 276, so the tier is allowed. The failure is in resolution/parsing — e.g., the tier resolved to an invalid spec like"anthropic/nonexistent-model", or the registry couldn't parse it.This conflates "tier not in the allowlist" (a policy rejection) with "tier resolution/parsing failed" (a configuration or infrastructure error). An operator seeing
tier_not_allowedwould look at the allowlist convar, not at the tier's model spec — wasting debugging time.Suggested fix: Use a distinct error kind for the resolution-failure path, e.g.
"tier_resolution_failed"or reuseErrorKindLLMUnavailable. The ledger row should carry the actual error fromParseModelForContext(or at least a more precise kind) so operators can distinguish "tier not allowed" from "tier spec is broken."2.
recordUsagecallsusageSink.Recordeven when model name is empty/"unknown"model/call.go:244-254— WhenresolvedModelNamereturns""or"unknown", the function logs aslog.Warnbut still callsusageSink.Record(ctx, model, ...)with that empty/unknown string. If the host'sUsageSinkimplementation doesn't guard against empty model names (e.g., a DB insert with a non-nullable column), this could cause a silent data integrity failure or a panic in the sink.Suggested fix: Either
returnafter the warning whenmodel == ""(skip recording for unidentifiable models), or document onUsageSinkthat the model parameter may be empty and sinks must handle it. The current code does neither.3.
tryParseJSONfence-stripping: opening fence with no newline silently produces empty stringllmmeta/helper.go:599-608— If the response text is just```json(no newline, no content),strings.Index(trimmed, "\n")returns -1, so theifbody is skipped. Thenstrings.LastIndex(trimmed, "```")finds the opening fence at index 0, andtrimmed = trimmed[:0]produces"".json.Unmarshal([]byte(""), &parsed)fails, returning(nil, false). This is safe (graceful degradation), but the fence-stripping logic has a subtle off-by-one: whenidx == -1,trimmed[idx+1:]istrimmed[0:]which is the whole string, not the content after the fence. The code works by accident because the subsequentLastIndexre-strips the same fence.Suggested fix: Add an explicit guard:
if idx < 0 { return nil, false }afterstrings.Index(trimmed, "\n")to make the intent clear and avoid relying on the accidental double-strip.4.
SetCompleteForTesthas no synchronization — data race if tests run in parallelllmmeta/helper.go:546-550— The package-levelcompleteOverridevariable is written and read without any mutex or atomic. If any test in thellmmetapackage (or in downstream packages that importllmmetaand callSetCompleteForTest) runs witht.Parallel(), this is a data race. The current tests don't uset.Parallel(), so this is latent, not active — but it's a ticking bomb for future test authors.Suggested fix: Either document that
SetCompleteForTestis not safe for parallel tests, or add async.Mutexguard. A simplesync.Mutexis cheap and prevents the surprise.5.
recordLedgersilently drops storage errors — intentional but undocumented contract withStoragellmmeta/helper.go:573-578— The function does_ = h.storage.RecordMetaCall(ctx, call), discarding the error. The comment says "Storage failures are logged at the storage layer." But theStorageinterface'sRecordMetaCallreturns anerror— there's no contract that the implementation must log the error itself. If a futureStorageimplementation returns an error without logging it (e.g., a simple in-memory buffer that returns an error on overflow), the error is silently lost.Suggested fix: At minimum, log the error at the call site:
if err := h.storage.RecordMetaCall(ctx, call); err != nil { slog.Warn("llmmeta: failed to record ledger row", "err", err) }. This makes the helper self-diagnosing regardless of the storage implementation.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 15m 4s
🪰 Gadfly review —
m5max/qwen3.6:35b-mlx(m5max)Review incomplete — all lenses errored — 3 reviewers: security, correctness, error-handling
🔒 Security — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
m5max/qwen3.6:35b-mlx: ollama m5max/qwen3.6:35b-mlx: do request: Post "http://192.168.0.173:11434/api/chat": dial tcp 192.168.0.173:11434: connect: no route to host
🎯 Correctness — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
m5max/qwen3.6:35b-mlx: skipped (backed off until 23:47:51.517)
🧯 Error handling & edge cases — ⚠️ could not complete
⚠️ This reviewer failed to complete: agent: step 0: all chain targets failed
m5max/qwen3.6:35b-mlx: skipped (backed off until 23:47:51.517)
Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 6s
🪰 Gadfly review —
minimax-m3:cloud(ollama-cloud)Verdict: Minor issues — 3 reviewers: security, correctness, error-handling
🔒 Security — Minor issues
VERDICT: Minor issues
Confirmed against
model/cloud_sync.go:model/cloud_sync.go:317/:370— unbounded response read.fetchTags(line 317) andfetchContextLength(line 370) both callio.ReadAll(resp.Body)with nohttp.MaxBytesReaderand no explicitContent-Lengthcap. The client is only bounded by the 15-second wall-clock timeout (NewCloudOllamaLimitCache, line 80). A hijacked or misbehaving endpoint can stream an unbounded payload before the timeout fires, exhausting memory on a long-lived process. Suggested fix: cap withio.LimitReader(resp.Body, 1<<20)(or similar) beforeReadAll.model/cloud_sync.go:74-91— endpoint URL accepted without scheme allowlist.NewCloudOllamaLimitCache(endpoint, apiKey, httpClient)accepts any non-empty string forendpoint; it only checks non-empty and trims a trailing/. There is nohttps://requirement and no host allowlist.applyAuth(lines 430-435) then attachesAuthorization: Bearer <apiKey>to whatever URL is configured. A misconfiguration — e.g. a typo (http://…), a convar override pointed at an internal HTTP service, or anyone able to set the env var — silently downgrades the channel and exfiltrates the bearer token in cleartext. The operational defaulthttps://ollama.com(line 41) is safe but is not enforced. Suggested fix: reject endpoints whose scheme is nothttps, with a constructor-time error.🎯 Correctness — No material issues found
Correctness Review — P1 model + llmmeta
VERDICT: No material issues found
(Note: the two issues claimed in the draft —
resolvedModelreturning the bare tier alias and the JSON-fence stripping with backticks in string values — were the only substantive items, but neither survived verification against the actual code:model.ResolveModelName("fast")returns the resolved model portion viadefaultResolver.Resolveandstrings.Index(spec, "/"), not the bare alias; and the fence-stripping behavior, while a real tolerance trade-off, is documented as such and only fires on responses that start with a leading fence. All other draft items were either restating correct behavior or non-issues.)🧯 Error handling & edge cases — Minor issues
VERDICT: Minor issues
Findings (error handling & edge cases lens)
model/call.go:240— cache-only token usage is silently dropped.recordUsageshort-circuits whenu.InputTokens == 0 && u.OutputTokens == 0, but a response carrying onlyCacheReadTokens/CacheWriteTokens(cache hit, no fresh tokens) still hits the early return and never reachesusageSink.Record. On providers that price by cache reads (Anthropic prompt caching, OpenAI cached-input billing), a fully-cached call records zero usage. Fix: extend the guard toif u.InputTokens == 0 && u.OutputTokens == 0 && u.CacheReadTokens == 0 && u.CacheWriteTokens == 0— or drop the guard entirely, since the sink is allowed to be a no-op for zero usage.model/call.go:243-254—model == "unknown"is recorded, not rejected. WhenresolvedModelNamereturns"unknown"or""(noWithModel, noresp.Model), the code logs a warning and then still callsusageSink.Record(ctx, model, …). The warning suggests this is a misconfiguration, but the sink now aggregates every unattributed call under the same"unknown"key — silently merging billing-relevant rows into one bucket. Fix: either skip recording (returning early after the warning) or attribute to a per-process caller (e.g. composemodel+toolFromContext) so rows are distinguishable.llmmeta/helper.go:276-281vs:283-309—tier_not_allowedledger-row semantics are inconsistent. The package docstring andCallSpec.Tiercomment say disallowed tiers "do not write a ledger row (the call did not happen)". The earlytierAllowedrejection athelper.go:276honours this. The later path athelper.go:288-309(wheremodel.ParseModelForContexterrors on a mis-set tier) writes a row withErrorKind: "tier_not_allowed"and surfacesErrorKindTierNotAllowedto the caller.TestCall_TierNotAllowedonly pins the first path's "no row" behaviour, so the divergence is uncaught. Pick one: either both paths record (with areasonsub-field to distinguish), or neither does — and update the test/docstring to match.llmmeta/helper.go:567-578—recordLedgersilently swallows storage errors. The comment "Storage failures are logged at the storage layer" is enforced by nothing in theStorageinterface. A host'sStorageimplementation that forgets to log silently drops audit rows on DB hiccups. Either: (a) add an explicit contract in the interface comment that errors must be logged before being returned, or (b) log at this layer (slog.Warn("llmmeta: ledger write failed", "err", err)) and document that the helper does so.model/tiers.go:91,108—Resolvecacheexpiresis computed from anowcaptured before the locks.now := r.now()is read at line 91 (before the RLock peek and the subsequent upgrade to write Lock). The fresh cache entry at line 108 storesexpires: now.Add(r.ttl)using the same pre-locknow. Both expiry checks (lines 94 and 102) use the samenowso the round-trip in this caller is internally consistent, but a different goroutine callingResolveshortly after will read its own (later)nowand may find the freshly-stored entry "expired" if lock wait > TTL. Not a real-world bug at the 30s default, but the comment should pin this invariant so a future refactor that recomputesnowafter the lock doesn't silently introduce immediate-stale entries.model/bench.go:67-75—UnbenchModelclears manual AND automatic benches unconditionally.Health().Unbench(model)removes whatever bench the tracker holds (manual or automatic);delete(manualBenches, model)then drops the manual marker. CallingUnbenchModelon a model that was auto-benched (e.g. transient outage) silently removes the auto-backoff too, potentially masking a still-degraded provider. The return valuewasBenchedreports the pre-clear state but the caller can't tell manual-vs-auto from it. Document this as "clears ALL benches" or return a separate flag.llmmeta/helper.go:593-610—tryParseJSONcode-fence stripping is fragile on opening-only fences. When the response starts with```json(with no newline between the language tag and the body),strings.Index(trimmed, "\n")returns-1, the opening fence isn't stripped, andstrings.LastIndex(trimmed, "```")then matches the opening fence itself —trimmedbecomes empty and parse fails, firing the retry unnecessarily. Edge-case but documented "code fence tolerance" implies it should handle the no-newline form.Automated adversarial review by Gadfly. Advisory only — does not block merge. · ⏱️ reviewed in 5m 48s