majordomo/docs/adr/0009-multimodal-strategy.md
steve 043249e0e1 feat: OpenAI, Anthropic, and native-Ollama providers + media pipeline
Phase 3:
- provider/openai: Chat Completions for OpenAI + compat endpoints (SSE
  streaming with by-index tool-call assembly, response_format json_schema,
  legacy max_tokens option, reasoning_effort)
- provider/anthropic: Messages API (tool_use/tool_result, GA structured
  output via output_config.format, full SSE event parser, 529 transient)
- provider/ollama: one native /api/chat client behind the ollama,
  ollama-cloud, and foreman built-ins (presets; NDJSON streaming tolerant
  of foreman's buffered single-object responses; object tool arguments;
  format-schema structured output; think mapping)
- media/: capability normalization (sniff, downscale, transcode, byte
  ladder, ErrUnsupported), wired into the chain executor per target with
  penalty-free advance past incapable elements
- registry: real provider + scheme wiring, WithHTTPClient option, required
  env-foreman TLS chat round-trip test
- ADR-0009 multimodal strategy, ADR-0010 tools/structured mapping; README
  matrix + CLAUDE.md synced

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 12:58:08 +02:00

# ADR-0009: Multimodal strategy — normalize per target, enforce at the provider
**Status:** Accepted — 2026-06-10
## Context
Every provider (and some models) imposes different image rules: max
dimensions/bytes, allowed MIME types, max images per request. A caller must
be able to attach an image without knowing the eventual target — especially
with failover chains, where the serving target isn't known until runtime.
## Decision
Two cooperating layers:
1. **`media.Normalize(req, caps)`**: the transformation point. The chain
   executor calls it **per target, per attempt**, against the actual
   target's capabilities, before the provider sees the request:
   - The real format is **sniffed from magic bytes** and wins over the
     declared MIME (callers lie; jpeg/png/gif/webp recognized).
   - Already-fitting images pass through untouched (fast path: zero copies).
   - Oversize dimensions are downscaled (aspect-preserving) with a
     hand-rolled box filter: stdlib has no scaler and `x/image` stays out
     per ADR-0007; box-average quality is ample for vision input.
   - Disallowed MIME re-encodes: original format if allowed, else JPEG
     (q85), else PNG, else the first allowed encodable type.
   - Byte budgets are enforced via a quality ladder (jpeg 85→65→45→30),
     then dimension halving; ~6 attempts before giving up.
   - WebP cannot be decoded by stdlib: it passes through when it fits and
     is allowed; any needed transform is a clear error.
   - Anything that cannot be made to fit returns an error **wrapping
     `llm.ErrUnsupported`**; it is never silently dropped.
2. **Provider backstop**: each provider cheaply enforces its effective
   capabilities at request time (image count/MIME/bytes, plus
   tools/structured/streaming support flags) and rejects violations with
   `ErrUnsupported`. This keeps providers honest for expert callers who
   build models directly without the registry.
Chain semantics: a normalization failure for one target **advances** to the
next element with no health penalty (the target isn't sick, it's just
incapable) — so `fp/text-only,fp/vision` serves an image request from the
vision element automatically.
Canonical image content stays **bytes + MIME** (ADR-0002); no URL fetching.
## Consequences
- A 100×50 PNG sent to a target with a 32px cap arrives as a 32×16 PNG; the
same request served by an 8000px target arrives untouched.
- Conditional provider rules (e.g. Anthropic's 2000px cap above 20 images)
are approximated by the flat declared caps — conservative and simple.
## Alternatives considered
- Normalize once against chain-intersection caps: over-restricts every
request for the sake of rarely-used fallbacks. Rejected (ADR-0008).
- `x/image/draw` scalers: a dependency for one function. Rejected.