# ADR-0013: Two-slot residency and embedding bypass

**Status:** Accepted (2026-05-23)

## Context

The target keeps **two** models resident (`OLLAMA_MAX_LOADED_MODELS=2`): a small, always-resident **embedding model** (e.g. `nomic-embed-text` or `qwen3-embedding`, operator-swappable) and one rotating **worker model** that chat jobs queue against (ADR-0009). The embedder is tiny (~0.3–0.6 GB) and co-resides cheaply with a ~20 GB worker model on a 32 GB target.

Embeddings are latency-sensitive and high-volume: a single backfill may fire thousands of `/api/embed` calls. Forcing them through the serialized worker queue (ADR-0009) would make them wait behind 20 GB chat jobs and worker-model swaps for no reason, since the embedder is always loaded and never needs swapping.

## Decision

**The target runs exactly two resident models, and embeddings bypass the queue.**

- `OLLAMA_MAX_LOADED_MODELS=2`: slot 1 is the always-resident embedder (pinned with `keep_alive: -1`); slot 2 is the rotating worker model managed by the single-worker drain-by-model loop (ADR-0009).
- **Routing rule:** only `/api/chat` and `POST /jobs` are serialized through the worker queue. `/api/embed` and its `/api/embeddings` alias are proxied **directly and concurrently** to the target, never touching the queue, the worker loop, or the job store. Concurrent embedding requests are allowed; they hit the always-resident embedder and do not contend with worker-model swaps.
- The embedding model name is configurable (`FOREMAN_EMBED_MODEL`); foreman warms it on startup and again on reconnect after the target has been unreachable, so it stays in slot 1.

## Consequences

- Embeddings are fast and concurrent regardless of worker-queue depth, which is the right behavior for indexing/RAG backfills.
- The "concurrency" of foreman is precisely this: embedder ∥ worker. Worker jobs among themselves remain strictly serial (ADR-0009). There is no other parallelism, and none should be added.
- foreman must distinguish embedding routes from chat routes at the HTTP layer and keep them on separate code paths.
- If the operator misconfigures the target to `OLLAMA_MAX_LOADED_MODELS=1`, embeddings and worker jobs will fight for the single slot and thrash; foreman should log a startup warning if it observes only one slot via `/api/ps` under load.

## Alternatives considered

- **Embeddings through the queue.** Simple and uniform, but it serializes a high-volume concurrent workload behind chat jobs for no benefit. Rejected.
- **A dedicated second daemon for embeddings.** Violates one-daemon-per-target (ADR-0001) and is unnecessary, since Ollama already serves both from one endpoint.
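
## Illustrative sketch

The routing rule from the Decision section and the single-slot check from the Consequences section can be sketched roughly as follows. This is a minimal sketch, not foreman's actual code: the names `Route`, `classify`, and `embedder_evicted` are hypothetical, the handling of paths this ADR does not mention is left open (`Route.OTHER`), and the `/api/ps` payload is assumed to have the shape `{"models": [{"name": ...}]}`. Treating a bare model name and its `:latest` tag as the same model is also an assumption.

```python
from enum import Enum


class Route(Enum):
    QUEUE = "queue"    # serialized through the worker queue (ADR-0009)
    BYPASS = "bypass"  # proxied directly and concurrently to the target
    OTHER = "other"    # not covered by this ADR; handling is left open


BYPASS_PATHS = {"/api/embed", "/api/embeddings"}


def classify(method: str, path: str) -> Route:
    """Map an incoming request to the code path it should take."""
    # Drop any query string and trailing slash before matching.
    clean = path.split("?", 1)[0].rstrip("/") or "/"
    if clean in BYPASS_PATHS:
        return Route.BYPASS
    if clean == "/api/chat" or (method.upper() == "POST" and clean == "/jobs"):
        return Route.QUEUE
    return Route.OTHER


def _base(name: str) -> str:
    # Assumption: a bare name and its ":latest" tag refer to the same model.
    return name.removesuffix(":latest")


def embedder_evicted(ps_payload: dict, embed_model: str) -> bool:
    """Heuristic: True when some model is loaded but the embedder is not."""
    loaded = {_base(m.get("name", "")) for m in ps_payload.get("models", [])}
    return bool(loaded) and _base(embed_model) not in loaded
```

Under these assumptions, `classify("POST", "/api/embed")` yields `Route.BYPASS` and `classify("POST", "/jobs")` yields `Route.QUEUE`. `embedder_evicted` is only a heuristic for the one-slot misconfiguration: a pinned embedder that is missing while a worker model is loaded suggests the two are competing for a single slot, but a warm-up that has not run yet would need to be ruled out separately.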