3 Commits

Author SHA1 Message Date
steve 1206261e6a feat(failover): package-level default observer for transparently-built chains
CI / Build, Test & Lint (push) Successful in 10m43s
The transparent comma-Parse path builds failover chains via NewFailoverModel
with no options, so defaultFailoverConfig() left the observer nil and observers
only fired when a caller passed WithFailoverObserver explicitly. Add a
package-level default observer (SetFailoverObserver / DefaultFailoverObserver),
guarded by the existing defaultsMu, and seed it in defaultFailoverConfig() so
chains built transparently still notify it. An explicit WithFailoverObserver
still overrides the default per-chain. mort sets this at boot to persist
failover events.
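
A minimal sketch of the default-observer pattern described above, not the
actual implementation. The identifiers SetFailoverObserver,
DefaultFailoverObserver, WithFailoverObserver, defaultFailoverConfig, and
defaultsMu come from this message; FailoverEvent, failoverConfig,
FailoverOption, buildConfig, and the observer's function shape are
hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// FailoverEvent is a hypothetical payload; the real observer carries the full request.
type FailoverEvent struct{ From, To string }

// FailoverObserver is assumed here to be a plain callback.
type FailoverObserver func(FailoverEvent)

type failoverConfig struct{ observer FailoverObserver }

// FailoverOption mutates a config; assumed shape of the option type.
type FailoverOption func(*failoverConfig)

var (
	defaultsMu      sync.Mutex // guards all package-level defaults
	defaultObserver FailoverObserver
)

// SetFailoverObserver installs the package-level default observer.
func SetFailoverObserver(o FailoverObserver) {
	defaultsMu.Lock()
	defer defaultsMu.Unlock()
	defaultObserver = o
}

// DefaultFailoverObserver returns the current default (nil if unset).
func DefaultFailoverObserver() FailoverObserver {
	defaultsMu.Lock()
	defer defaultsMu.Unlock()
	return defaultObserver
}

// defaultFailoverConfig seeds the observer so option-free chains still notify it.
func defaultFailoverConfig() failoverConfig {
	return failoverConfig{observer: DefaultFailoverObserver()}
}

// WithFailoverObserver overrides the default for a single chain.
func WithFailoverObserver(o FailoverObserver) FailoverOption {
	return func(c *failoverConfig) { c.observer = o }
}

// buildConfig (hypothetical) applies options on top of the seeded defaults.
func buildConfig(opts ...FailoverOption) failoverConfig {
	c := defaultFailoverConfig()
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	SetFailoverObserver(func(e FailoverEvent) { fmt.Println("default:", e.From, "->", e.To) })
	buildConfig().observer(FailoverEvent{From: "a", To: "b"})
	explicit := WithFailoverObserver(func(e FailoverEvent) { fmt.Println("explicit:", e.From, "->", e.To) })
	buildConfig(explicit).observer(FailoverEvent{From: "a", To: "b"})
}
```

Because the default is read when the config is built, a chain constructed
before SetFailoverObserver is called would not pick it up in this sketch;
that is presumably why the default is set at boot.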

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 00:43:24 +02:00
steve 361999550e fix(failover): preserve manual bench against automatic cooldown downgrade
CI / Build, Test & Lint (push) Successful in 10m48s
Previously, recordTransientFailure and benchNow unconditionally set
manual=false and reset the until deadline to now+cooldown. When the
best-effort all-benched failover path retried a model that an operator had
manually benched via BenchModel, a subsequent failure downgraded manual from
true to false and could shorten the operator's window to the short automatic
cooldown.

Both functions now read existing state: if it is an active manual bench
(manual && now.Before(until)) they bump consecutiveFails but keep manual=true
and the later of the two until deadlines. Non-manual or expired-manual state
still gets the
automatic cooldown. Adds TestFailover_ManualBenchSurvivesAutomaticDowngrade
covering no-prior, prior-auto, active-manual, and expired-manual cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 00:37:38 +02:00
steve ae8e194fad feat(failover): model failover chains via comma-separated specs
Parse("a,b,c") now returns one composite *llm.Model that tries each model
in order, retrying transient failures, benching dead models, and failing
over to the next. Comma-free specs are completely unchanged.
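
A simplified sketch of the try-in-order behaviour, under stated assumptions.
runChain, splitChain, callFunc, and the two sentinel errors are hypothetical;
the real chain is a failoverProvider wrapped in a *Model. This sketch benches
on any non-transient error, uses a plain map instead of the mutex-guarded
registry, trims whitespace around specs, and omits cooldowns and the
best-effort path taken when every model is benched.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Hypothetical sentinel errors standing in for classified provider errors.
var (
	errTransient = errors.New("transient failure")
	errAuthDead  = errors.New("auth dead")
)

// callFunc stands in for one request against a concrete model spec.
type callFunc func(spec string) (string, error)

// splitChain splits "a,b,c" into its specs, dropping empty entries.
func splitChain(spec string) []string {
	var out []string
	for _, p := range strings.Split(spec, ",") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// runChain tries each model in order. Transient errors are retried up to
// maxRetries times; a model that still fails is benched and the next is tried.
// benched must be a non-nil map.
func runChain(spec string, maxRetries int, benched map[string]bool, call callFunc) (string, error) {
	lastErr := errors.New("no usable model in chain")
	for _, m := range splitChain(spec) {
		if benched[m] {
			continue // skip models already on the bench
		}
		for attempt := 0; attempt <= maxRetries; attempt++ {
			out, err := call(m)
			if err == nil {
				return out, nil
			}
			lastErr = err
			if !errors.Is(err, errTransient) {
				break // not worth retrying this model
			}
		}
		benched[m] = true
	}
	return "", fmt.Errorf("all models failed: %w", lastErr)
}

func main() {
	benched := map[string]bool{}
	out, err := runChain("a,b,c", 2, benched, func(m string) (string, error) {
		switch m {
		case "a":
			return "", errTransient
		case "b":
			return "", errAuthDead
		}
		return "ok from " + m, nil
	})
	// Prints: ok from c <nil> true true false
	fmt.Println(out, err, benched["a"], benched["b"], benched["c"])
}
```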

- classify.go: Classify(err) ErrKind + IsTransient(err) error classifier
  mapping anthropic (typed Is*Err helpers + RequestError status),
  openai-go (*openai.Error status), openaicompat.FeatureUnsupportedError,
  context errors, and ollama "HTTP <code>" strings to
  transient/auth-dead/request-specific/unknown.
- failover.go: failoverProvider (satisfies provider.Provider) wrapped into a
  *Model via NewClient. Process-wide mutex-guarded modelHealth bench
  registry keyed by concrete spec, with cooldowns and a control API
  (ListBenched/BenchModel/UnbenchModel/IsBenched). NewFailoverModel +
  ParseChain constructors, FailoverOption config, FailoverObserver (carries
  the full request), and configurable package-level defaults.
- parse.go: comma-aware Parse splits into a failover chain; alias/resolver
  targets that expand to comma chains are routed through the comma-aware
  path and flattened.

All access to global health is mutex-guarded; tests reset it via
resetHealthForTest and pass under go test -race.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 00:30:08 +02:00