0526bada90
Commit pre-existing uncommitted working-tree changes that predate the license/public-readiness work — NOT authored in this session, just flushed so they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013 (embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and the phase-0..3 prompt revisions + prompts/README.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
53 lines
2.6 KiB
Markdown
# ADR-0013: Two-slot residency and embedding bypass
**Status:** Accepted — 2026-05-23
## Context
The target keeps **two** models resident (`OLLAMA_MAX_LOADED_MODELS=2`): a small,
always-resident **embedding model** (e.g. `nomic-embed-text` or
`qwen3-embedding`, operator-swappable) and one rotating **worker model** that
chat jobs queue against (ADR-0009). The embedder is tiny (~0.3–0.6 GB) and
co-resides cheaply with a ~20 GB worker model in 32 GB of memory.

Embeddings are latency-sensitive and high-volume: a single backfill may fire
thousands of `/api/embed` calls. Forcing them through the serialized worker queue
(ADR-0009) would make them wait behind chat jobs on the ~20 GB worker model, and
behind its model swaps, for no reason, since the embedder is always loaded and
never needs swapping.
## Decision
**The target runs exactly two resident models, and embeddings bypass the queue.**
- `OLLAMA_MAX_LOADED_MODELS=2`: slot 1 is the always-resident embedder (pinned
  with `keep_alive: -1`); slot 2 is the rotating worker model managed by the
  single-worker drain-by-model loop (ADR-0009).
- **Routing rule:** only `/api/chat` and `POST /jobs` are serialized through the
  worker queue. `/api/embed` and its `/api/embeddings` alias are proxied
  **directly and concurrently** to the target, never touching the queue, the
  worker loop, or the job store. Concurrent embedding requests are allowed; they
  hit the always-resident embedder and do not contend with worker-model swaps.
- The embedding model name is configurable (`FOREMAN_EMBED_MODEL`); foreman warms
  it on startup and on reconnect after the target has been unreachable, so it
  stays in slot 1.
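
The routing rule and the warm-up request above can be sketched roughly as follows. This is a minimal illustration, not foreman's actual code: the names `classify_route`, `warmup_payload`, `BYPASS_PATHS`, and the fallback model name are hypothetical, and the sketch assumes `/api/chat` is queued regardless of HTTP method and that a short non-empty `input` is enough to load the embedder.

```python
import os

# Hypothetical constants for illustration; the paths themselves come from the ADR.
BYPASS_PATHS = {"/api/embed", "/api/embeddings"}


def classify_route(method: str, path: str) -> str:
    """Return 'bypass', 'queue', or 'other' for an incoming request."""
    # Drop any query string and trailing slash before matching.
    path = path.split("?", 1)[0].rstrip("/") or "/"
    if path in BYPASS_PATHS:
        return "bypass"  # proxied directly and concurrently to the target
    if path == "/api/chat":
        return "queue"   # serialized through the worker queue
    if method.upper() == "POST" and path == "/jobs":
        return "queue"
    return "other"       # handling of remaining routes is outside this sketch


def warmup_payload(fallback_model: str = "nomic-embed-text") -> dict:
    """Build a JSON body for POST /api/embed that pins the embedder.

    The fallback model is an assumption; the ADR only says the name is
    configurable through FOREMAN_EMBED_MODEL.
    """
    model = os.environ.get("FOREMAN_EMBED_MODEL", fallback_model)
    # keep_alive=-1 asks Ollama to keep the model loaded indefinitely.
    return {"model": model, "input": "warmup", "keep_alive": -1}
```

Keeping the classification in one pure function makes the split between the two code paths easy to test without a running target; the same payload builder can serve both the startup and the reconnect warm-up.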
## Consequences
- Embeddings are fast and concurrent regardless of worker-queue depth, which is
  the right behavior for indexing/RAG backfills.
- The "concurrency" of foreman is precisely this: embedder ∥ worker. Worker jobs
  among themselves remain strictly serial (ADR-0009). There is no other
  parallelism, and none should be added.
- foreman must distinguish embedding routes from chat routes at the HTTP layer and
  keep them on separate code paths.
- If the operator misconfigures the target to `OLLAMA_MAX_LOADED_MODELS=1`,
  embeddings and worker jobs will fight for the single slot and thrash; foreman
  should log a startup warning if `/api/ps` shows only one resident model under
  load.
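
One way the single-slot check in the last bullet might look is sketched below. It is a heuristic under stated assumptions, not foreman's implementation: it takes an already-decoded `/api/ps` response (a JSON object with a `models` list whose entries carry a `name`), and it treats "some model is loaded but the embedder is not" as the sign that the embedder was evicted. The function names are hypothetical.

```python
from typing import List, Optional


def resident_models(ps_response: dict) -> List[str]:
    """Extract the names of currently loaded models from an /api/ps response."""
    return [m.get("name", "") for m in ps_response.get("models", [])]


def _same_model(loaded: str, configured: str) -> bool:
    # Loaded names usually carry a tag (e.g. "name:latest"); if the configured
    # name has no tag, compare on the part before the colon only.
    if ":" in configured:
        return loaded == configured
    return loaded.split(":", 1)[0] == configured


def residency_warning(ps_response: dict, embed_model: str) -> Optional[str]:
    """Return a warning string if the embedder appears to have been evicted."""
    loaded = resident_models(ps_response)
    embedder_loaded = any(_same_model(name, embed_model) for name in loaded)
    if loaded and not embedder_loaded:
        return (
            f"embedder {embed_model!r} is not resident while {loaded} is loaded; "
            "check that OLLAMA_MAX_LOADED_MODELS is at least 2 on the target"
        )
    return None
```

Because the check only inspects what is resident at one moment, it is meaningful only once a worker model has been loaded, which matches the "under load" condition in the bullet.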
## Alternatives considered
- **Embeddings through the queue.** Simple and uniform, but serializes a
  high-volume concurrent workload behind chat jobs for no benefit. Rejected.
- **A dedicated second daemon for embeddings.** Violates one-daemon-per-target
  (ADR-0001) and is unnecessary, since Ollama already serves both from one endpoint.