dcd004289f
Phase 1 of the majordomo build:
- llm/ canonical contract (messages, parts, tools, capabilities, streaming, Model/Provider, error classification)
- health/ clock-injected tracker (threshold bench, exponential capped cooldown, reset-on-success)
- root Registry + Parse (verbatim model ids, inline recursive alias expansion with cycle detection, chain dedup), LLM_* env-DSN providers (go-llm parity: lazy fallback + eager LoadEnv), health-aware chain executor behind the Model interface
- provider/fake scriptable test provider; hermetic test suite incl. the trailing-thinking chain and foreman:// env loading
- ADRs 0001-0008, CLAUDE.md, README (honest matrix), CI workflow, docs/phase-1-design.md
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ADR-0008: Failover-chain execution semantics
Status: Accepted — 2026-06-10
Context
A parsed spec is an ordered chain of targets sharing the registry's health tracker. The executor must realize the kickoff's failover story (retry one blip; bench repeat offenders; skip benched targets; clear exhaustion errors) identically for chains of one and many.
Decision
For each request, iterate elements head-to-tail:
- Skip targets currently benched (recorded in the exhaustion error).
- Attempt the target. On success → report success (resets health), return.
- On error, classify:
  - Permanent + model-not-found → advance, no health penalty.
  - Permanent otherwise (auth, malformed) → fail fast by default; failing over cannot fix a bad request. ChainConfig.AdvanceOnPermanent flips this for callers who prefer availability.
  - Transient → report the failed attempt to the tracker; retry the same target while attempts remain (TransientRetries, default 1) unless the tracker just benched it, in which case advance immediately.
- All elements failed/skipped → return errors.Join(ErrChainExhausted, per-target reasons...), naming every target and why.
Other decisions:
- Capabilities() = head element's capabilities. The head is the preferred target and the honest answer to "what should I prepare for?". Per-attempt media normalization (Phase 3) uses the actual target's capabilities, so fallbacks still get correctly-fitted inputs. Intersection semantics were rejected: a rarely-used tail fallback would artificially constrain every request.
- Streaming failover applies to stream establishment only. Once a stream is open, mid-stream errors propagate; silently restarting on another target would re-deliver partial output.
- context.Canceled aborts the chain immediately between and during attempts.
- Duplicate post-expansion elements were already dropped at Parse (ADR-0003).
Consequences
- "One transient error is fine" holds: blip → same-target retry succeeds, no failover, one health mark that the success immediately clears. With default knobs (retries=1, threshold=2), a target whose retry also fails is benched in the same request and the chain advances, exactly the kickoff narrative.
- Single-target specs get the same retry/backoff behavior for free.
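The arithmetic behind the default knobs can be traced with a toy per-attempt counter. The tracker type and request helper below are hypothetical stand-ins written for this example, and cooldown timing is deliberately left out; only the values retries=1 and threshold=2 come from this ADR.

```go
package main

import "fmt"

// tracker is a hypothetical per-attempt failure counter with reset-on-success.
// Cooldown timing is left out to keep the arithmetic visible.
type tracker struct {
	threshold int
	marks     map[string]int
}

// Failure records one failed attempt and reports whether the target is now benched.
func (t *tracker) Failure(target string) bool {
	t.marks[target]++
	return t.marks[target] >= t.threshold
}

// Success clears all marks for the target.
func (t *tracker) Success(target string) { delete(t.marks, target) }

// request replays scripted attempt outcomes (true = success) against one
// target, allowing `retries` extra attempts after the first.
func request(t *tracker, target string, outcomes []bool, retries int) (served, benched bool) {
	for i := 0; i <= retries && i < len(outcomes); i++ {
		if outcomes[i] {
			t.Success(target)
			return true, false
		}
		if t.Failure(target) {
			return false, true
		}
	}
	return false, false
}

func main() {
	tr := &tracker{threshold: 2, marks: map[string]int{}}

	// Blip: the first attempt fails, the retry succeeds and clears the mark.
	served, benched := request(tr, "m", []bool{false, true}, 1)
	fmt.Println(served, benched, tr.marks["m"]) // true false 0

	// Repeat offender: both attempts fail, so the target is benched in one request.
	served, benched = request(tr, "m", []bool{false, false}, 1)
	fmt.Println(served, benched, tr.marks["m"]) // false true 2
}
```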
Alternatives considered
- Per-request (not per-attempt) failure counting — needs two failed requests to bench, letting a dead model eat the retry budget twice. Rejected as weaker than the kickoff's story.
- Intersection capabilities — see above. Rejected.