ADR-0006: Model health tracking and backoff
Status: Accepted — 2026-06-10
Context
Ollama Cloud models intermittently return "high demand" errors. mort's behavior to preserve: one blip should not fail a request (retry); a model that keeps failing should be benched so chains skip it, then re-admitted after a cooldown. majordomo owns this (the "model health tracker").
Decision
In-memory, process-local, thread-safe tracker in health/, keyed by
"provider/model-id", with an injected clock (func() time.Time) so
every backoff path is unit-testable without sleeping.
- Classification (llm.Classify, overridable via ChainConfig.Classify): transient = HTTP 408/429/5xx, network timeouts, connection refused/reset, DNS failures, context.DeadlineExceeded; permanent = HTTP 400/401/403/404/405/422, ErrModelNotFound, context.Canceled (the caller gave up, so retrying defies intent). Unknown errors default to transient: failing over can only help availability, and a wrongly benched model self-heals via cooldown, while a wrongly fail-fasted request is lost.
- Counting: every failed transient attempt increments the target's consecutive-failure count; any success resets the count and the backoff exponent. At threshold (default 2) the target is benched until now + cooldown, with cooldown = base (default 5s) × multiplier (default 2) per consecutive backoff round, capped (default 5m). After the bench triggers, the count resets, so re-benching needs a fresh run of failures, but at the doubled cooldown.
- All knobs (threshold, base/cap/multiplier, clock, classifier, retry count) are configuration with the above defaults baked in.
- No persistence, no interface. The tracker is a concrete type; health is process-local by design (out-of-scope guardrail). A consumer wanting shared state can wrap the registry; we do not build for it now.
Consequences
- Deterministic tests via fake clock; no time.Sleep anywhere.
- Two providers addressing the same upstream model (e.g. m1/x and m5/x) track independently, which is correct since the backends are different machines.
Alternatives considered
- Persistent/pluggable health store — explicitly out of scope. Rejected.
- Unknown→permanent default — drops availability on novel errors. Rejected.