majordomo/llm/errors.go
steve 74474c6da0
feat(chain): fail over on empty/degenerate responses
A failover chain previously treated a successful-but-empty completion (no
content parts and no tool calls — a "stop with nothing") as a valid result
and returned it. The agent loop then ended the run with empty output, and
the configured backup models were never tried because no error was raised.
This let a single flaky model silently terminate an agent/skill run with
no answer (observed in the wild with ollama-cloud/glm-5.2 returning empty
completions right after a large tool/think turn).

- Add llm.ErrEmptyResponse (classified transient) and Response.IsEmpty():
  true only when there are no tool calls and no meaningful content (no
  parts, or whitespace-only text). A media/image part counts as content,
  so image-only responses are NOT empty.
- chain.Generate converts an empty completion into ErrEmptyResponse so the
  chain fails over to the next target. Unlike an ordinary transient it is
  NOT retried on the same target (the model just produced it; these calls
  are expensive) — the chain penalizes health (so a persistently-empty
  target benches) and advances immediately.
- When every target returns empty the call fails with ErrChainExhausted
  joined to ErrEmptyResponse — a visible error instead of a hollow success.
  Single-element chains therefore also surface empties as errors.

Stream path is unchanged (can't inspect content before the consumer reads
it). Tests: Response.IsEmpty table; chain fails over past an empty head;
all-empty chain returns ErrChainExhausted/ErrEmptyResponse; repeated
empties bench the target across requests. Full suite green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 10:35:07 -04:00

141 lines
4.3 KiB
Go

package llm

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"strings"
	"syscall"
)

// ErrorClass buckets errors for retry/failover decisions.
type ErrorClass int

const (
	// ClassTransient errors may succeed on retry or on another target:
	// rate limits, server errors, timeouts, connection failures.
	ClassTransient ErrorClass = iota
	// ClassPermanent errors will not improve on retry of the same request:
	// malformed requests, auth failures, model-not-found.
	ClassPermanent
)

// ErrModelNotFound marks a permanent "this target does not know this model"
// condition. Chains advance past it without penalizing the target's health.
var ErrModelNotFound = errors.New("model not found")

// ErrUnsupported marks a request the target cannot serve by declaration,
// e.g. images that cannot be normalized to its capabilities, or a feature
// (tools, structured output) it does not support. Permanent for the target,
// but chains advance past it without penalizing health: another element may
// well be able to serve the request.
var ErrUnsupported = errors.New("request unsupported by target")

// ErrEmptyResponse marks a provider call that returned, without error, no
// usable output: no content parts and no tool calls (a "stop with
// nothing"). It is never a valid completion: an agent step needs either a
// final answer or a tool call, and a one-shot Generate needs content. A
// chain treats it as a per-target failure and fails over to the next
// element (benching the empty target) so a single flaky model cannot
// silently end a run with nothing. See chain.Generate / Response.IsEmpty.
var ErrEmptyResponse = errors.New("model returned an empty response")

// APIError is a structured provider error carrying enough context to
// classify it and to debug it.
type APIError struct {
	// Provider and Model identify the target that failed.
	Provider string
	Model    string
	// Status is the HTTP status code, or 0 when the failure was not an HTTP
	// response (connection error, decode error, ...).
	Status int
	// Code is the provider-specific error code, when one was supplied.
	Code string
	// Message is the provider's human-readable error message.
	Message string
	// Err is the wrapped underlying cause, if any.
	Err error
}

// Error formats the error as "provider/model: HTTP status [code]: message:
// cause", omitting the status, code, message, and cause when they are unset.
func (e *APIError) Error() string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s/%s", e.Provider, e.Model)
	if e.Status != 0 {
		fmt.Fprintf(&b, ": HTTP %d", e.Status)
	}
	if e.Code != "" {
		fmt.Fprintf(&b, " [%s]", e.Code)
	}
	if e.Message != "" {
		fmt.Fprintf(&b, ": %s", e.Message)
	}
	if e.Err != nil {
		fmt.Fprintf(&b, ": %v", e.Err)
	}
	return b.String()
}
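
// Unwrap returns the wrapped cause when there is one. Otherwise an HTTP 404
// unwraps to ErrModelNotFound, so errors.Is(err, ErrModelNotFound) matches
// it. A 404 that carries its own cause does not unwrap to ErrModelNotFound.
func (e *APIError) Unwrap() error {
	if e.Err != nil {
		return e.Err
	}
	if e.Status == http.StatusNotFound {
		return ErrModelNotFound
	}
	return nil
}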

// Classify buckets an error as transient or permanent.
//
// The default policy (overridable via health configuration):
//   - context.Canceled is permanent: the caller gave up, and retrying
//     would defy their intent. context.DeadlineExceeded is transient.
//   - ErrModelNotFound and ErrUnsupported are permanent; ErrEmptyResponse
//     is transient.
//   - Network timeouts, refused/reset connections, and DNS failures are
//     transient ("high demand" conditions).
//   - HTTP 408, 429, and all 5xx are transient; every other 4xx (400, 401,
//     403, 404, 405, 422, ...) is permanent.
//   - Anything unrecognized is transient: when in doubt, failing over to the
//     next target in a chain can only help availability.
//
// A nil error also yields ClassTransient.
func Classify(err error) ErrorClass {
	if err == nil {
		return ClassTransient
	}
	if errors.Is(err, context.Canceled) {
		return ClassPermanent
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return ClassTransient
	}
	if errors.Is(err, ErrModelNotFound) || errors.Is(err, ErrUnsupported) {
		return ClassPermanent
	}
	if errors.Is(err, ErrEmptyResponse) {
		// An empty completion may be a one-off provider hiccup; another
		// target (or, rarely, a retry) can produce real output.
		return ClassTransient
	}
	if errors.Is(err, syscall.ECONNREFUSED) || errors.Is(err, syscall.ECONNRESET) {
		return ClassTransient
	}
	if _, ok := errors.AsType[net.Error](err); ok {
		return ClassTransient
	}
	if apiErr, ok := errors.AsType[*APIError](err); ok && apiErr.Status != 0 {
		switch {
		case apiErr.Status == http.StatusRequestTimeout, // 408
			apiErr.Status == http.StatusTooManyRequests, // 429
			apiErr.Status >= 500:
			return ClassTransient
		case apiErr.Status >= 400:
			return ClassPermanent
		}
	}
	return ClassTransient
}