majordomo/docs/adr/0006-health-and-backoff.md
steve dcd004289f feat: foundations — canonical types, Parse grammar, env DSNs, health, chains
Phase 1 of the majordomo build:
- llm/ canonical contract (messages, parts, tools, capabilities, streaming,
  Model/Provider, error classification)
- health/ clock-injected tracker (threshold bench, exponential capped
  cooldown, reset-on-success)
- root Registry + Parse (verbatim model ids, inline recursive alias
  expansion with cycle detection, chain dedup), LLM_* env-DSN providers
  (go-llm parity: lazy fallback + eager LoadEnv), health-aware chain
  executor behind the Model interface
- provider/fake scriptable test provider; hermetic test suite incl. the
  trailing-thinking chain and foreman:// env loading
- ADRs 0001-0008, CLAUDE.md, README (honest matrix), CI workflow,
  docs/phase-1-design.md

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 12:35:34 +02:00

# ADR-0006: Model health tracking and backoff
**Status:** Accepted — 2026-06-10
## Context
Ollama Cloud models intermittently return "high demand" errors. mort's
behavior to preserve: one blip should not fail a request (retry); a model
that keeps failing should be benched so chains skip it, then re-admitted
after a cooldown. majordomo owns this (the "model health tracker").
## Decision
In-memory, process-local, thread-safe tracker in `health/`, keyed by
`"provider/model-id"`, with an **injected clock** (`func() time.Time`) so
every backoff path is unit-testable without sleeping.
- **Classification** (`llm.Classify`, overridable via `ChainConfig.Classify`):
transient = HTTP 408/429/5xx, network timeouts, connection refused/reset,
DNS failures, `context.DeadlineExceeded`; permanent = HTTP
400/401/403/404/405/422, `ErrModelNotFound`, `context.Canceled` (the
caller gave up — retrying defies intent). **Unknown errors default to
transient**: failing over can only help availability, and a wrongly
benched model self-heals via cooldown, while a wrongly fail-fasted request
is lost.
- **Counting:** every failed transient *attempt* increments the target's
consecutive-failure count; any success resets both the count **and** the
backoff exponent. At threshold (default **2**) the target is benched until
`now + cooldown`, where cooldown = min(cap, base × multiplier^n): base
defaults to **5s**, multiplier to **2**, cap to **5m**, and n is the number
of bench rounds since the last success (so 5s, 10s, 20s, ... up to 5m).
After the bench triggers, the count resets, so re-benching needs a fresh
run of failures, but at the multiplied cooldown.
- All knobs (threshold, base/cap/multiplier, clock, classifier, retry count)
are configuration with the above defaults baked in.
- **No persistence, no interface.** The tracker is a concrete type; health
is process-local by design (out-of-scope guardrail). A consumer wanting
shared state can wrap the registry; we do not build for it now.
## Consequences
- Deterministic tests via fake clock; no `time.Sleep` anywhere.
- Two providers addressing the same upstream model (e.g. `m1/x` and `m5/x`)
track independently — correct, since the backends are different machines.
## Alternatives considered
- Persistent/pluggable health store — explicitly out of scope. Rejected.
- Unknown→permanent default — drops availability on novel errors. Rejected.