feat(llamaswap): add llama-swap provider + canonical imagegen interface

Add provider/llamaswap, a tailored provider for llama-swap (the
model-swapping proxy over llama.cpp / stable-diffusion.cpp). Its chat
path delegates to provider/openai at {base}/v1, with no duplicated wire
client (ADR-0007), legacy max_tokens, a Bearer no-key placeholder for
keyless local instances, and a timeout-free client so cold model swaps
rely on context deadlines. The "tailored" surface is concrete management
methods (ListModels / Running / Unload) that don't belong on the
canonical llm.Provider interface. The llama-swap:// DSN scheme builds an
http base URL (local-first); a no-URL built-in errors clearly on use,
mirroring foreman.

Add imagegen, a new canonical text-to-image interface separate from llm
(Request/Result/Model/Provider; Image = llm.ImagePart so generated
images feed straight back into chat). First backend is llama-swap via
OpenAI /v1/images/generations (b64_json, bytes-only). Re-exported from
the root. v1 is txt2img only.

Hermetic httptest coverage for chat delegation, management endpoints,
image decode, and scheme wiring. ADR-0015 + ADR-0016, README support
matrix + image-gen section, CLAUDE.md package map, and progress.md
updated in the same commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 15:01:54 -04:00
parent 1fd7109a42
commit 96c612e707
14 changed files with 994 additions and 7 deletions
@@ -0,0 +1,58 @@
+# ADR-0015: llama-swap provider
+
+**Status:** Accepted — 2026-06-27
+
+## Context
+
+llama-swap (https://github.com/mostlygeek/llama-swap) is an on-demand
+model-swapping proxy in front of llama.cpp (and stable-diffusion.cpp) servers:
+it extracts the `model` from each request, loads/hot-swaps the matching
+upstream, and serves it. It fills the role foreman reached for, but is more
+robust (groups, TTL unload, health checks, a management API). We want it as a
+first-class majordomo target (`llama-swap://token@host:port` in the DSN), and
+the user explicitly asked for a *tailored* provider, not a bare alias of the
+OpenAI client.
+
+The tension: llama-swap's **chat** API is byte-for-byte OpenAI Chat
+Completions. A new hand-rolled chat wire client would duplicate
+`provider/openai` for zero behavioral gain, which ADR-0007 forbids. But the
+"more robust" surface (model discovery, running list, unload) does not fit the
+canonical `llm.Provider`/`llm.Model` interface (anti-creep: no provider-specific
+features leak into the canonical API).
+
+## Decision
+
+- A dedicated `provider/llamaswap` package, but its chat path **delegates to
+  `provider/openai`** pointed at `{baseURL}/v1` — no duplicated wire client.
+  `Provider.Model` returns `openai.New(...).Model(id)`.
+- Chat construction specifics: `WithLegacyMaxTokens()` (llama.cpp's OpenAI shim
+  honors `max_tokens`, not `max_completion_tokens`); a placeholder `Bearer
+  no-key` when no token is set (the openai client answers a blank key with a
+  synthetic 401, while a keyless local llama-swap ignores a bearer it didn't
+  ask for); and an injected HTTP client with **no timeout**: a cold model swap
+  can block for up to llama-swap's `healthCheckTimeout` (≥15s), so callers
+  bound work with a context deadline, never a client timeout.
+- The "tailored" surface lives as **concrete methods** on `*llamaswap.Provider`,
+  outside the canonical interface: `ListModels` (GET `/v1/models`), `Running`
+  (GET `/running`, returned as raw JSON — its shape is not a stable contract),
+  `Unload` (POST `/api/models/unload[/:model]`). A small `doJSON` helper shares
+  bearer auth + error mapping; non-2xx → `*llm.APIError` (so `llm.Classify`
+  applies), transport errors wrapped raw.
+- DSN: the `llama-swap` scheme builds an **http://** base URL from the host
+  (llama-swap is local-first), deliberately *not* the DSN's https-always
+  `BaseURL()`. A TLS-fronted instance can use the `openai://` scheme for chat.
+  A no-DSN built-in `llama-swap` provider registers but errors on use (mirrors
+  foreman).
+- Image generation is implemented here too, against the new `imagegen`
+  interface (see ADR-0016).
+
+## Consequences
+
+- No new dependency, no duplicated chat client; the chat path inherits every
+  openai feature/fix automatically.
+- Management methods are reachable only by holding the concrete
+  `*llamaswap.Provider` (e.g. mort), not through `Parse`/`llm.Provider` — the
+  correct boundary for non-canonical features.
+- `Running`'s raw-JSON return is honest about llama-swap not publishing a stable
+  schema; a typed shape can be added later without breaking callers that ignore
+  it.
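
Reading aid for the Decision section of ADR-0015 above: a minimal Go sketch of what a `doJSON`-style management helper might look like, assuming only what the ADR states (bearer auth with a `no-key` placeholder, non-2xx mapped to a typed error, a client with no timeout bounded by the caller's context). The `client` and `APIError` types, the `demo` function, and the fake `/running` payload are hypothetical stand-ins, not the code in this commit. "Wrapped raw" is interpreted here as wrapping the transport error with `%w`, which is an assumption.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// APIError is a hypothetical stand-in for *llm.APIError.
type APIError struct {
	Status int
	Body   string
}

func (e *APIError) Error() string { return fmt.Sprintf("HTTP %d: %s", e.Status, e.Body) }

// client is an illustrative type, not the real *llamaswap.Provider.
type client struct {
	baseURL string
	token   string       // empty for a keyless local instance
	hc      *http.Client // no Timeout set; the context bounds each call
}

// doJSON sends one request with bearer auth and returns the raw JSON body.
// Non-2xx responses become *APIError; transport errors are wrapped with %w.
func (c *client) doJSON(ctx context.Context, method, path string) (json.RawMessage, error) {
	req, err := http.NewRequestWithContext(ctx, method, c.baseURL+path, nil)
	if err != nil {
		return nil, err
	}
	tok := c.token
	if tok == "" {
		tok = "no-key" // placeholder bearer described in the ADR
	}
	req.Header.Set("Authorization", "Bearer "+tok)

	resp, err := c.hc.Do(req)
	if err != nil {
		return nil, fmt.Errorf("llama-swap %s %s: %w", method, path, err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("llama-swap read body: %w", err)
	}
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return nil, &APIError{Status: resp.StatusCode, Body: string(body)}
	}
	return json.RawMessage(body), nil
}

// demo exercises doJSON against a fake server: one successful GET /running
// and one failing POST /api/models/unload. The payloads are made up.
func demo() (running, auth string, status int, err error) {
	authCh := make(chan string, 1)
	mux := http.NewServeMux()
	mux.HandleFunc("/running", func(w http.ResponseWriter, r *http.Request) {
		authCh <- r.Header.Get("Authorization")
		io.WriteString(w, `{"running":[]}`)
	})
	mux.HandleFunc("/api/models/unload", func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "boom", http.StatusInternalServerError)
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	c := &client{baseURL: srv.URL, hc: &http.Client{}}
	// The deadline lives on the context, not on the http.Client.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	raw, err := c.doJSON(ctx, http.MethodGet, "/running")
	if err != nil {
		return "", "", 0, err
	}
	auth = <-authCh

	_, err = c.doJSON(ctx, http.MethodPost, "/api/models/unload")
	var apiErr *APIError
	if errors.As(err, &apiErr) {
		status = apiErr.Status
	}
	return string(raw), auth, status, nil
}

func main() {
	running, auth, status, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(running, auth, status)
}
```

Returning `json.RawMessage` mirrors the ADR's choice to leave the `Running` payload untyped. Keeping the deadline on the context rather than on `http.Client.Timeout` lets each caller decide how long a cold swap is allowed to take.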