majordomo/docs/adr/0015-llama-swap-provider.md
steve de2b2f0f28
feat(llamaswap): add llama-swaps (TLS) DSN scheme
llama-swap was http-only by DSN, pushing TLS-fronted instances onto the openai://
scheme (which loses the management/image methods). Add a "llama-swaps" scheme
that builds an https base URL, alongside "llama-swap" (http, local-first) —
mirroring redis/rediss. Both share one factory; llama-swaps is scheme-only (no
default built-in). The choice stays explicit because a DSN has no reliable
http-vs-https signal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 17:58:59 -04:00

ADR-0015: llama-swap provider

Status: Accepted — 2026-06-27

Context

llama-swap (https://github.com/mostlygeek/llama-swap) is an on-demand model-swapping proxy in front of llama.cpp (and stable-diffusion.cpp) servers: it extracts the model name from each request, loads or hot-swaps the matching upstream, and proxies the request to it. It covers the same ground foreman reached for, but more robustly (groups, TTL unload, health checks, a management API). We want it as a first-class majordomo target, addressed as llama-swap://token@host:port in the DSN, and the user explicitly asked for a tailored provider, not a bare alias of the OpenAI client.

The tension: llama-swap's chat API is byte-for-byte OpenAI Chat Completions. A new hand-rolled chat wire client would duplicate provider/openai for zero behavioral gain, which ADR-0007 forbids. But the "more robust" surface (model discovery, running list, unload) does not fit the canonical llm.Provider/llm.Model interface (anti-creep: no provider-specific features leak into the canonical API).

Decision

  • A dedicated provider/llamaswap package, but its chat path delegates to provider/openai pointed at {baseURL}/v1 — no duplicated wire client. Provider.Model returns openai.New(...).Model(id).
  • Chat construction specifics: WithLegacyMaxTokens(), because llama.cpp's OpenAI shim honors max_tokens, not max_completion_tokens. A placeholder bearer token, no-key, is sent when no token is set: the openai client turns a blank key into a synthetic 401, whereas a local keyless llama-swap ignores a bearer it didn't ask for. The injected HTTP client carries no timeout: a cold model swap can block for up to llama-swap's healthCheckTimeout (≥15s), so callers bound work with a context deadline, never a client timeout.
  • The "tailored" surface lives as concrete methods on *llamaswap.Provider, outside the canonical interface: ListModels (GET /v1/models), Running (GET /running, returned as raw JSON because its shape is not a stable contract), Unload (POST /api/models/unload[/:model]). A small doJSON helper shares bearer auth and error mapping: non-2xx responses become *llm.APIError (so llm.Classify applies), and transport errors are wrapped raw.
  • DSN: two schemes share one factory. llama-swap builds an http:// base URL from the host (llama-swap is local-first), deliberately not the DSN's https-always BaseURL(); llama-swaps builds https:// for a TLS-fronted instance (mirrors redis/rediss). Why a second scheme rather than auto-detect: a DSN carries no reliable http-vs-https signal, so the choice stays explicit. Only llama-swap registers a no-DSN built-in provider (errors on use, mirrors foreman); llama-swaps is a scheme only.
  • Image generation is implemented here too, against the new imagegen interface (see ADR-0016).
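
The scheme-to-transport mapping in the DSN bullet above can be sketched as follows. This is a minimal illustration, not majordomo's factory: baseURLFromDSN is a hypothetical helper, and reading the token from the userinfo part of the URL is an assumption based on the llama-swap://token@host:port form.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// baseURLFromDSN is a hypothetical helper (not majordomo's real factory).
// It picks the transport scheme from the DSN scheme alone, because the
// DSN carries no other http-vs-https signal.
func baseURLFromDSN(dsn string) (baseURL, token string, err error) {
	u, err := url.Parse(dsn)
	if err != nil {
		// Avoid echoing the DSN: it may contain a token.
		return "", "", errors.New("malformed DSN")
	}

	var transport string
	switch u.Scheme {
	case "llama-swap": // local-first: plain http
		transport = "http"
	case "llama-swaps": // TLS-fronted instance, like redis/rediss
		transport = "https"
	default:
		return "", "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Host == "" {
		return "", "", errors.New("DSN has no host")
	}

	// Assumption: in token@host form the token is the userinfo username.
	if u.User != nil {
		token = u.User.Username()
	}
	return transport + "://" + u.Host, token, nil
}

func main() {
	for _, dsn := range []string{
		"llama-swap://secret@localhost:8080",
		"llama-swaps://swap.example.com",
	} {
		base, token, err := baseURLFromDSN(dsn)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("base=%s hasToken=%t\n", base, token != "")
	}
}
```

Keeping the switch exhaustive over exactly two schemes makes the http/https choice visible at the call site instead of hiding it behind a heuristic such as port sniffing.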

Consequences

  • No new dependency, no duplicated chat client; the chat path inherits every openai feature/fix automatically.
  • Management methods are reachable only by holding the concrete *llamaswap.Provider (e.g. mort), not through Parse/llm.Provider — the correct boundary for non-canonical features.
  • Running's raw-JSON return is honest about llama-swap not publishing a stable schema; a typed shape can be added later without breaking callers that ignore it.