llama-swap was http-only by DSN, pushing TLS-fronted instances onto the openai:// scheme (which loses the management/image methods). Add a "llama-swaps" scheme that builds an https base URL, alongside "llama-swap" (http, local-first), mirroring redis/rediss. Both share one factory; llama-swaps is scheme-only (no default built-in). The choice stays explicit because a DSN has no reliable http-vs-https signal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ADR-0015: llama-swap provider
Status: Accepted — 2026-06-27
Context
llama-swap (https://github.com/mostlygeek/llama-swap) is an on-demand
model-swapping proxy in front of llama.cpp (and stable-diffusion.cpp) servers:
it extracts the model from each request, loads/hot-swaps the matching
upstream, and serves it. It is what foreman reached for, but more robust
(groups, TTL unload, health checks, a management API). We want it as a
first-class majordomo target — llama-swap://token@host:port in the DSN — and
the user explicitly asked for a tailored provider, not a bare alias of the
OpenAI client.
The tension: llama-swap's chat API is byte-for-byte OpenAI Chat
Completions. A new hand-rolled chat wire client would duplicate
provider/openai for zero behavioral gain, which ADR-0007 forbids. But the
"more robust" surface (model discovery, running list, unload) does not fit the
canonical llm.Provider/llm.Model interface (anti-creep: no provider-specific
features leak into the canonical API).
Decision
- A dedicated provider/llamaswap package, but its chat path delegates to provider/openai pointed at {baseURL}/v1, so there is no duplicated wire client. Provider.Model returns openai.New(...).Model(id).
- Chat construction specifics: WithLegacyMaxTokens() (llama.cpp's OpenAI shim honors max_tokens, not max_completion_tokens); a placeholder Bearer no-key when no token is set (the openai client treats a blank key as a synthetic 401, but a local keyless llama-swap ignores a bearer it didn't ask for); the injected HTTP client carries no timeout, because a cold model swap blocks up to llama-swap's healthCheckTimeout (≥15s), so callers bound work with a context deadline, never a client timeout.
- The "tailored" surface lives as concrete methods on *llamaswap.Provider, outside the canonical interface: ListModels (GET /v1/models), Running (GET /running, returned as raw JSON because its shape is not a stable contract), Unload (POST /api/models/unload[/:model]). A small doJSON helper shares bearer auth + error mapping; non-2xx → *llm.APIError (so llm.Classify applies), transport errors wrapped raw.
- DSN: two schemes share one factory. llama-swap builds an http:// base URL from the host (llama-swap is local-first), deliberately not the DSN's https-always BaseURL(); llama-swaps builds https:// for a TLS-fronted instance (mirrors redis/rediss). Why a second scheme rather than auto-detect: a DSN carries no reliable http-vs-https signal, so the choice stays explicit. Only llama-swap registers a no-DSN built-in provider (errors on use, mirrors foreman); llama-swaps is a scheme only.
- Image generation is implemented here too, against the new imagegen interface (see ADR-0016).
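The two-scheme rule in the DSN bullet can be sketched as an explicit mapping from DSN scheme to transport scheme. This is an illustration only: baseURLFromDSN is a hypothetical helper, not the package's actual factory, and it assumes the token travels as the userinfo part of the DSN, as in llama-swap://token@host:port.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// baseURLFromDSN is a hypothetical helper (not the real factory). It picks the
// transport scheme from the DSN scheme alone, with no auto-detection:
// llama-swap -> http, llama-swaps -> https.
func baseURLFromDSN(dsn string) (string, string, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return "", "", fmt.Errorf("parse dsn: %w", err)
	}

	var scheme string
	switch u.Scheme {
	case "llama-swap":
		scheme = "http" // local-first default
	case "llama-swaps":
		scheme = "https" // TLS-fronted instance
	default:
		return "", "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Host == "" {
		return "", "", errors.New("dsn has no host")
	}

	// Assumption: a lone userinfo component is the bearer token.
	// Username is nil-safe, so a keyless DSN yields an empty token.
	token := u.User.Username()
	return scheme + "://" + u.Host, token, nil
}

func main() {
	for _, dsn := range []string{
		"llama-swap://token@localhost:8080",
		"llama-swaps://token@swap.example.com",
	} {
		base, _, err := baseURLFromDSN(dsn)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(base)
	}
}
```

Keeping the switch exhaustive over the two schemes means an unknown scheme fails loudly rather than silently falling back to http or https.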
Consequences
- No new dependency, no duplicated chat client; the chat path inherits every openai feature/fix automatically.
- Management methods are reachable only by holding the concrete *llamaswap.Provider (e.g. mort), not through Parse/llm.Provider, which is the correct boundary for non-canonical features. Running's raw-JSON return is honest about llama-swap not publishing a stable schema; a typed shape can be added later without breaking callers that ignore it.