majordomo/docs/adr/0009-multimodal-strategy.md
steve 043249e0e1 feat: OpenAI, Anthropic, and native-Ollama providers + media pipeline
Phase 3:
- provider/openai: Chat Completions for OpenAI + compat endpoints (SSE
  streaming with by-index tool-call assembly, response_format json_schema,
  legacy max_tokens option, reasoning_effort)
- provider/anthropic: Messages API (tool_use/tool_result, GA structured
  output via output_config.format, full SSE event parser, 529 transient)
- provider/ollama: one native /api/chat client behind the ollama,
  ollama-cloud, and foreman built-ins (presets; NDJSON streaming tolerant
  of foreman's buffered single-object responses; object tool arguments;
  format-schema structured output; think mapping)
- media/: capability normalization (sniff, downscale, transcode, byte
  ladder, ErrUnsupported), wired into the chain executor per target with
  penalty-free advance past incapable elements
- registry: real provider + scheme wiring, WithHTTPClient option, required
  env-foreman TLS chat round-trip test
- ADR-0009 multimodal strategy, ADR-0010 tools/structured mapping; README
  matrix + CLAUDE.md synced

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 12:58:08 +02:00


ADR-0009: Multimodal strategy — normalize per target, enforce at the provider

Status: Accepted — 2026-06-10

Context

Every provider (and some models) imposes different image rules: max dimensions/bytes, allowed MIME types, max images per request. A caller must be able to attach an image without knowing the eventual target — especially with failover chains, where the serving target isn't known until runtime.

Decision

Two cooperating layers:

  1. media.Normalize(req, caps) — the transformation point. The chain executor calls it per target, per attempt, against the actual target's capabilities, before the provider sees the request:
    • The real format is sniffed from magic bytes and wins over the declared MIME (callers lie; jpeg/png/gif/webp recognized).
    • Already-fitting images pass through untouched (fast path: zero copies).
    • Oversize dimensions downscale (aspect-preserving) with a hand-rolled box-filter — stdlib has no scaler and x/image stays out per ADR-0007; box-average quality is ample for vision input.
    • A disallowed MIME type triggers a re-encode. Output format preference: the original format if allowed, else JPEG (q85), else PNG, else the first allowed encodable type.
    • Byte budgets are enforced via a JPEG quality ladder (85→65→45→30), then dimension halving; about 6 attempts before giving up.
    • stdlib cannot decode WebP, so a WebP image passes through only when it already fits and its MIME type is allowed; if any transform is needed, normalization returns a clear error.
    • Anything that cannot be made to fit returns an error wrapping llm.ErrUnsupported; an image is never silently dropped.
  2. Provider backstop — each provider cheaply enforces its effective capabilities at request time (image count/MIME/bytes, plus tools/structured/streaming support flags) and rejects with ErrUnsupported. This keeps providers honest for expert callers who build models directly without the registry.
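The byte-budget step of layer 1 can be sketched as below. This is a minimal illustration, not the project's code: `encodeWithinBudget`, `gradient`, and the local `ErrUnsupported` (standing in for `llm.ErrUnsupported`) are hypothetical names, and the dimension-halving stage that follows the ladder is left out.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"image"
	"image/color"
	"image/jpeg"
)

// ErrUnsupported is a local stand-in for llm.ErrUnsupported (assumption:
// the real sentinel is not shown in this ADR).
var ErrUnsupported = errors.New("unsupported")

// qualityLadder lists the JPEG qualities named in the ADR, tried in order.
var qualityLadder = []int{85, 65, 45, 30}

// encodeWithinBudget returns the first ladder encoding that fits maxBytes,
// together with the quality that produced it.
func encodeWithinBudget(img image.Image, maxBytes int) ([]byte, int, error) {
	for _, q := range qualityLadder {
		var buf bytes.Buffer
		if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: q}); err != nil {
			return nil, 0, err
		}
		if buf.Len() <= maxBytes {
			return buf.Bytes(), q, nil
		}
	}
	// The ADR continues with dimension halving here; this sketch stops
	// and reports the failure instead of dropping the image.
	return nil, 0, fmt.Errorf("image exceeds %d bytes at every ladder quality: %w",
		maxBytes, ErrUnsupported)
}

// gradient builds a small synthetic test image.
func gradient(w, h int) image.Image {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			img.Set(x, y, color.RGBA{R: uint8(x * 4), G: uint8(y * 4), B: 128, A: 255})
		}
	}
	return img
}

func main() {
	img := gradient(64, 64)

	// A generous budget is satisfied by the first rung of the ladder.
	_, q, err := encodeWithinBudget(img, 1<<20)
	fmt.Println(q, err)

	// An impossible budget surfaces as an error wrapping ErrUnsupported.
	_, _, err = encodeWithinBudget(img, 1)
	fmt.Println(errors.Is(err, ErrUnsupported))
}
```

Wrapping the sentinel with `%w` lets a caller use `errors.Is` to tell "this target cannot take the image" apart from any other failure.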

Chain semantics: a normalization failure for one target advances to the next element with no health penalty (the target isn't sick, just incapable). A chain such as fp/text-only,fp/vision therefore serves an image request from the vision element automatically.
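A rough sketch of that penalty-free advance, under stated assumptions: `target`, `normalize`, and `serve` are hypothetical stand-ins for the chain executor and `media.Normalize`, and health tracking is reduced to a counter.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnsupported is a local stand-in for llm.ErrUnsupported.
var ErrUnsupported = errors.New("unsupported")

// target is a hypothetical chain element with a simplified health counter.
type target struct {
	name      string
	vision    bool
	penalties int
}

// normalize stands in for media.Normalize: it fails with ErrUnsupported
// when the target cannot accept the request's image.
func normalize(t *target, hasImage bool) error {
	if hasImage && !t.vision {
		return fmt.Errorf("%s: images not accepted: %w", t.name, ErrUnsupported)
	}
	return nil
}

// serve walks the chain in order and returns the name of the serving target.
func serve(chain []*target, hasImage bool, send func(*target) error) (string, error) {
	lastErr := errors.New("empty chain")
	for _, t := range chain {
		if err := normalize(t, hasImage); err != nil {
			if errors.Is(err, ErrUnsupported) {
				lastErr = err
				continue // incapable, not sick: advance without a penalty
			}
			return "", err
		}
		if err := send(t); err != nil {
			t.penalties++ // a real provider failure does count against health
			lastErr = err
			continue
		}
		return t.name, nil
	}
	return "", lastErr
}

func main() {
	chain := []*target{
		{name: "fp/text-only"},
		{name: "fp/vision", vision: true},
	}
	name, err := serve(chain, true, func(*target) error { return nil })
	fmt.Println(name, err, chain[0].penalties) // fp/vision <nil> 0
}
```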

Canonical image content stays bytes + MIME (ADR-0002); no URL fetching.

Consequences

  • A 100×50 PNG sent to a target with a 32px cap arrives as a 32×16 PNG; the same request served by an 8000px target arrives untouched.
  • Conditional provider rules (e.g. Anthropic's 2000px cap above 20 images) are approximated by the flat declared caps — conservative and simple.
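The 100×50 to 32×16 figure follows from aspect-preserving arithmetic on the longer side. A minimal sketch, assuming a single per-side pixel cap and round-to-nearest on the shorter side (`fitWithin` is a hypothetical helper, not a function named in this ADR):

```go
package main

import "fmt"

// fitWithin returns the dimensions of a w×h image after an aspect-preserving
// downscale so that neither side exceeds maxDim. Images that already fit are
// returned unchanged.
func fitWithin(w, h, maxDim int) (int, int) {
	if w <= maxDim && h <= maxDim {
		return w, h
	}
	if w >= h {
		nh := (h*maxDim + w/2) / w // scale the short side, rounding to nearest
		if nh < 1 {
			nh = 1
		}
		return maxDim, nh
	}
	nw := (w*maxDim + h/2) / h
	if nw < 1 {
		nw = 1
	}
	return nw, maxDim
}

func main() {
	fmt.Println(fitWithin(100, 50, 32))   // 32 16
	fmt.Println(fitWithin(100, 50, 8000)) // 100 50
}
```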

Alternatives considered

  • Normalize once against chain-intersection caps: over-restricts every request for the sake of rarely-used fallbacks. Rejected (ADR-0008).
  • x/image/draw scalers: a dependency for one function. Rejected.
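For a sense of what the hand-rolled alternative involves, a box-filter downscale can be written with the standard `image` and `image/color` packages alone. The sketch below is illustrative only: `boxDownscale` and `checker` are hypothetical names, and the code assumes the destination is no larger than the source.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
)

// boxDownscale shrinks src to dstW×dstH by averaging the box of source
// pixels that maps onto each destination pixel. It assumes
// 1 <= dstW <= source width and 1 <= dstH <= source height.
func boxDownscale(src image.Image, dstW, dstH int) *image.RGBA {
	sb := src.Bounds()
	sw, sh := sb.Dx(), sb.Dy()
	dst := image.NewRGBA(image.Rect(0, 0, dstW, dstH))
	for dy := 0; dy < dstH; dy++ {
		y0, y1 := dy*sh/dstH, (dy+1)*sh/dstH
		for dx := 0; dx < dstW; dx++ {
			x0, x1 := dx*sw/dstW, (dx+1)*sw/dstW
			var r, g, b, a, n uint64
			for y := y0; y < y1; y++ {
				for x := x0; x < x1; x++ {
					// RGBA returns 16-bit alpha-premultiplied channels.
					pr, pg, pb, pa := src.At(sb.Min.X+x, sb.Min.Y+y).RGBA()
					r, g, b, a = r+uint64(pr), g+uint64(pg), b+uint64(pb), a+uint64(pa)
					n++
				}
			}
			// Average, then drop to 8 bits; color.RGBA is premultiplied too.
			dst.SetRGBA(dx, dy, color.RGBA{
				R: uint8((r / n) >> 8),
				G: uint8((g / n) >> 8),
				B: uint8((b / n) >> 8),
				A: uint8((a / n) >> 8),
			})
		}
	}
	return dst
}

// checker builds an opaque black-and-white checkerboard for demonstration.
func checker(w, h int) *image.RGBA {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			v := uint8(0)
			if (x+y)%2 == 0 {
				v = 255
			}
			img.SetRGBA(x, y, color.RGBA{R: v, G: v, B: v, A: 255})
		}
	}
	return img
}

func main() {
	// Two white and two black pixels average to mid-gray.
	out := boxDownscale(checker(2, 2), 1, 1)
	fmt.Println(out.RGBAAt(0, 0)) // {127 127 127 255}
}
```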