majordomo/docs/adr/0006-health-and-backoff.md
steve dcd004289f feat: foundations — canonical types, Parse grammar, env DSNs, health, chains
Phase 1 of the majordomo build:
- llm/ canonical contract (messages, parts, tools, capabilities, streaming,
  Model/Provider, error classification)
- health/ clock-injected tracker (threshold bench, exponential capped
  cooldown, reset-on-success)
- root Registry + Parse (verbatim model ids, inline recursive alias
  expansion with cycle detection, chain dedup), LLM_* env-DSN providers
  (go-llm parity: lazy fallback + eager LoadEnv), health-aware chain
  executor behind the Model interface
- provider/fake scriptable test provider; hermetic test suite incl. the
  trailing-thinking chain and foreman:// env loading
- ADRs 0001-0008, CLAUDE.md, README (honest matrix), CI workflow,
  docs/phase-1-design.md

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 12:35:34 +02:00

ADR-0006: Model health tracking and backoff

Status: Accepted — 2026-06-10

Context

Ollama Cloud models intermittently return "high demand" errors. The behavior to preserve from mort: a single blip should not fail a request (it is retried); a model that keeps failing should be benched so that chains skip it, then re-admitted after a cooldown. majordomo owns this logic (the "model health tracker").

Decision

In-memory, process-local, thread-safe tracker in health/, keyed by "provider/model-id", with an injected clock (func() time.Time) so every backoff path is unit-testable without sleeping.

  • Classification (llm.Classify, overridable via ChainConfig.Classify): transient = HTTP 408/429/5xx, network timeouts, connection refused/reset, DNS failures, context.DeadlineExceeded; permanent = HTTP 400/401/403/404/405/422, ErrModelNotFound, context.Canceled (the caller gave up — retrying defies intent). Unknown errors default to transient: failing over can only help availability, and a wrongly benched model self-heals via cooldown, while a wrongly fail-fasted request is lost.
  • Counting: every failed transient attempt increments the target's consecutive-failure count; any success resets both the count and the backoff exponent. At threshold (default 2) the target is benched until now + cooldown, where cooldown = min(cap, base × multiplier^n) and n is the number of bench rounds already triggered since the last success, so the first bench lasts base. Defaults: base 5s, multiplier 2, cap 5m. When the bench triggers, the failure count resets, so re-benching needs a fresh run of failures, but the next bench uses the doubled cooldown.
  • All knobs (threshold, base/cap/multiplier, clock, classifier, retry count) are configurable; the defaults above are baked in.
  • No persistence, no interface. The tracker is a concrete type; health is process-local by design (out-of-scope guardrail). A consumer wanting shared state can wrap the registry; we do not build for it now.

Consequences

  • Deterministic tests via fake clock; no time.Sleep anywhere.
  • Two providers addressing the same upstream model (e.g. m1/x and m5/x) track independently — correct, since the backends are different machines.

Alternatives considered

  • Persistent/pluggable health store — explicitly out of scope. Rejected.
  • Unknown→permanent default — drops availability on novel errors. Rejected.