llama-swap was http-only by DSN, pushing TLS-fronted instances onto the openai:// scheme (which loses the management/image methods). Add a "llama-swaps" scheme that builds an https base URL, alongside "llama-swap" (http, local-first) — mirroring redis/rediss. Both share one factory; llama-swaps is scheme-only (no default built-in). The choice stays explicit because a DSN has no reliable http-vs-https signal. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# ADR-0015: llama-swap provider

**Status:** Accepted, 2026-06-27

## Context

llama-swap (https://github.com/mostlygeek/llama-swap) is an on-demand
model-swapping proxy in front of llama.cpp (and stable-diffusion.cpp) servers:
it extracts the `model` from each request, loads or hot-swaps the matching
upstream, and serves it. It is what foreman reached for, only more robust
(groups, TTL unload, health checks, a management API). We want it as a
first-class majordomo target, `llama-swap://token@host:port` in the DSN, and
the user explicitly asked for a *tailored* provider, not a bare alias of the
OpenAI client.

The tension: llama-swap's **chat** API is byte-for-byte OpenAI Chat
Completions. A new hand-rolled chat wire client would duplicate
`provider/openai` for zero behavioral gain, which ADR-0007 forbids. But the
"more robust" surface (model discovery, running list, unload) does not fit the
canonical `llm.Provider`/`llm.Model` interface (the anti-creep rule:
provider-specific features must not leak into the canonical API).

## Decision

- A dedicated `provider/llamaswap` package whose chat path **delegates to
  `provider/openai`** pointed at `{baseURL}/v1`, so no wire client is
  duplicated. `Provider.Model` returns `openai.New(...).Model(id)`.
- Chat construction specifics:
  - `WithLegacyMaxTokens()`, because llama.cpp's OpenAI shim honors
    `max_tokens`, not `max_completion_tokens`.
  - A placeholder `Bearer no-key` when no token is set: the openai client
    treats a blank key as a synthetic 401, while a local keyless llama-swap
    ignores a bearer it didn't ask for.
  - The injected HTTP client carries **no timeout**. A cold model swap can
    block for up to llama-swap's `healthCheckTimeout` (≥15s), so callers bound
    work with a context deadline, never a client timeout.
- The "tailored" surface lives as **concrete methods** on `*llamaswap.Provider`,
  outside the canonical interface: `ListModels` (GET `/v1/models`), `Running`
  (GET `/running`, returned as raw JSON because its shape is not a stable
  contract), and `Unload` (POST `/api/models/unload[/:model]`). A small `doJSON`
  helper shares bearer auth and error mapping: a non-2xx response becomes an
  `*llm.APIError` (so `llm.Classify` applies), while transport errors are
  wrapped but otherwise left raw.
- DSN: two schemes share one factory. `llama-swap` builds an **http://** base
  URL from the host (llama-swap is local-first), deliberately *not* the DSN's
  https-always `BaseURL()`; `llama-swaps` builds **https://** for a TLS-fronted
  instance (mirroring redis/rediss). A second scheme is used instead of
  auto-detection because a DSN carries no reliable http-vs-https signal, so the
  choice stays explicit. Only `llama-swap` registers a no-DSN built-in provider
  (it errors on use, mirroring foreman); `llama-swaps` is a scheme only.
- Image generation is implemented here too, against the new `imagegen`
  interface (see ADR-0016).
## Consequences

- No new dependency, no duplicated chat client; the chat path inherits every
  `provider/openai` feature and fix automatically.
- Management methods are reachable only by holding the concrete
  `*llamaswap.Provider` (e.g. mort), not through `Parse`/`llm.Provider`, which
  is the correct boundary for non-canonical features.
- `Running`'s raw-JSON return is honest about llama-swap not publishing a stable
  schema; a typed shape can be added later without breaking callers that ignore
  it.
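The boundary between canonical and management methods can be illustrated with a toy pair of types. The names below are hypothetical stand-ins: `Provider` is a trimmed placeholder for `llm.Provider`, and `SwapProvider` for `*llamaswap.Provider`; the real signatures are not shown in this ADR.

```go
package main

import "fmt"

// Provider is a trimmed, hypothetical stand-in for the canonical llm.Provider.
type Provider interface {
	Model(id string) string
}

// SwapProvider is a hypothetical stand-in for *llamaswap.Provider.
type SwapProvider struct {
	loaded map[string]bool
}

// Model satisfies the canonical interface; here it only returns a label.
func (p *SwapProvider) Model(id string) string { return "llama-swap/" + id }

// Unload exists only on the concrete type. It reports whether the model
// was loaded before the call.
func (p *SwapProvider) Unload(id string) bool {
	was := p.loaded[id]
	delete(p.loaded, id)
	return was
}

// describe sees only the canonical interface, so calling p.Unload here
// would not compile.
func describe(p Provider, id string) string { return p.Model(id) }

func main() {
	sp := &SwapProvider{loaded: map[string]bool{"qwen": true}}
	fmt.Println(describe(sp, "qwen")) // canonical path via the interface
	fmt.Println(sp.Unload("qwen"))    // management path via the concrete type
	fmt.Println(sp.Unload("qwen"))    // already unloaded
}
```

Code written against the interface stays portable across providers, while a caller that constructs the concrete type opts in to the llama-swap-specific surface knowingly.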