dcd004289f
Phase 1 of the majordomo build: - llm/ canonical contract (messages, parts, tools, capabilities, streaming, Model/Provider, error classification) - health/ clock-injected tracker (threshold bench, exponential capped cooldown, reset-on-success) - root Registry + Parse (verbatim model ids, inline recursive alias expansion with cycle detection, chain dedup), LLM_* env-DSN providers (go-llm parity: lazy fallback + eager LoadEnv), health-aware chain executor behind the Model interface - provider/fake scriptable test provider; hermetic test suite incl. the trailing-thinking chain and foreman:// env loading - ADRs 0001-0008, CLAUDE.md, README (honest matrix), CI workflow, docs/phase-1-design.md Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
49 lines
2.3 KiB
Markdown
49 lines
2.3 KiB
Markdown
# ADR-0006: Model health tracking and backoff
|
||
|
||
**Status:** Accepted — 2026-06-10
|
||
|
||
## Context
|
||
|
||
Ollama Cloud models intermittently return "high demand" errors. mort's
|
||
behavior to preserve: one blip should not fail a request (retry); a model
|
||
that keeps failing should be benched so chains skip it, then re-admitted
|
||
after a cooldown. majordomo owns this (the "model health tracker").
|
||
|
||
## Decision
|
||
|
||
In-memory, process-local, thread-safe tracker in `health/`, keyed by
|
||
`"provider/model-id"`, with an **injected clock** (`func() time.Time`) so
|
||
every backoff path is unit-testable without sleeping.
|
||
|
||
- **Classification** (`llm.Classify`, overridable via `ChainConfig.Classify`):
|
||
transient = HTTP 408/429/5xx, network timeouts, connection refused/reset,
|
||
DNS failures, `context.DeadlineExceeded`; permanent = HTTP
|
||
400/401/403/404/405/422, `ErrModelNotFound`, `context.Canceled` (the
|
||
caller gave up — retrying defies intent). **Unknown errors default to
|
||
transient**: failing over can only help availability, and a wrongly
|
||
benched model self-heals via cooldown, while a wrongly fail-fasted request
|
||
is lost.
|
||
- **Counting:** every failed transient *attempt* increments the target's
|
||
consecutive-failure count; any success resets count **and** backoff
|
||
exponent. At threshold (default **2**) the target is benched until
|
||
`now + cooldown`, with cooldown = base (default **5s**) × multiplier
|
||
(default **2**) per consecutive backoff round, capped (default **5m**).
|
||
After the bench triggers, the count resets, so re-benching needs a fresh
|
||
run of failures — but at the doubled cooldown.
|
||
- All knobs (threshold, base/cap/multiplier, clock, classifier, retry count)
|
||
are configuration with the above defaults baked in.
|
||
- **No persistence, no interface.** The tracker is a concrete type; health
|
||
is process-local by design (out-of-scope guardrail). A consumer wanting
|
||
shared state can wrap the registry; we do not build for it now.
|
||
|
||
## Consequences
|
||
|
||
- Deterministic tests via fake clock; no `time.Sleep` anywhere.
|
||
- Two providers addressing the same upstream model (e.g. `m1/x` and `m5/x`)
|
||
track independently — correct, since the backends are different machines.
|
||
|
||
## Alternatives considered
|
||
|
||
- Persistent/pluggable health store — explicitly out of scope. Rejected.
|
||
- Unknown→permanent default — drops availability on novel errors. Rejected.
|