# ADR-0009: Multimodal strategy — normalize per target, enforce at the provider
**Status:** Accepted — 2026-06-10
## Context

Every provider (and some models) imposes different image rules: max
dimensions/bytes, allowed MIME types, max images per request. A caller must
be able to attach an image without knowing the eventual target, especially
with failover chains, where the serving target isn't known until runtime.

## Decision
Two cooperating layers:

1. **`media.Normalize(req, caps)`** is the transformation point. The chain
   executor calls it **per target, per attempt**, against the actual
   target's capabilities, before the provider sees the request:
   - The real format is **sniffed from magic bytes** and wins over the
     declared MIME (callers lie; jpeg/png/gif/webp recognized).
   - Already-fitting images pass through untouched (fast path: zero copies).
   - Oversize dimensions are downscaled (aspect-preserving) with a
     hand-rolled box filter: stdlib has no scaler and `x/image` stays out
     per ADR-0007; box-average quality is ample for vision input.
   - A disallowed MIME type triggers a re-encode; the output format is the
     original if allowed, else JPEG (q85), else PNG, else the first allowed
     encodable type.
   - Byte budgets are enforced via a quality ladder (JPEG 85→65→45→30), then
     dimension halving; ~6 attempts before giving up.
   - WebP cannot be decoded by stdlib: it passes through when it fits and
     is allowed; any needed transform is a clear error.
   - Anything that cannot be made to fit returns an error **wrapping
     `llm.ErrUnsupported`**; it is never silently dropped.
2. **Provider backstop**: each provider cheaply enforces its effective
   capabilities at request time (image count/MIME/bytes, plus
   tools/structured/streaming support flags) and rejects with
   `ErrUnsupported`. This keeps providers honest for expert callers who
   build models directly without the registry.

Chain semantics: a normalization failure for one target **advances** to the
next element with no health penalty (the target isn't sick, it's just
incapable), so `fp/text-only,fp/vision` serves an image request from the
vision element automatically.

Canonical image content stays **bytes + MIME** (ADR-0002); no URL fetching.
## Consequences

- A 100×50 PNG sent to a target with a 32px cap arrives as a 32×16 PNG; the
  same request served by an 8000px target arrives untouched.
- Conditional provider rules (e.g. Anthropic's 2000px cap above 20 images)
  are approximated by the flat declared caps, which is conservative and
  simple.
## Alternatives considered

- Normalize once against chain-intersection caps: over-restricts every
  request for the sake of rarely-used fallbacks. Rejected (ADR-0008).
- `x/image/draw` scalers: a dependency for one function. Rejected.