foreman/docs/adr/0013-embeddings-bypass-and-two-slot.md

# ADR-0013: Two-slot residency and embedding bypass

**Status:** Accepted — 2026-05-23

## Context

The target keeps **two** models resident (`OLLAMA_MAX_LOADED_MODELS=2`): a small,
always-resident **embedding model** (e.g. `nomic-embed-text` or
`qwen3-embedding`, operator-swappable) and one rotating **worker model** that
chat jobs queue against (ADR-0009). The embedder is tiny (~0.3–0.6 GB) and
co-resides cheaply with a ~20 GB worker model on a target with 32 GB of memory.
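As a rough check of the sizing above, the arithmetic can be written out. This is a minimal sketch that counts model weights only; it ignores KV cache, context buffers, and OS overhead, and the function name `headroom_gb` is hypothetical.

```python
def headroom_gb(total_gb: float, worker_gb: float, embedder_gb: float) -> float:
    """Memory left once both models are resident (weights only, an approximation)."""
    return round(total_gb - (worker_gb + embedder_gb), 1)


# Upper-bound embedder size from the text (~0.6 GB) next to a ~20 GB worker on 32 GB.
remaining = headroom_gb(total_gb=32.0, worker_gb=20.0, embedder_gb=0.6)
print(f"headroom: {remaining} GB")
```

Even at the upper end of the embedder's size range, the embedder accounts for a small share of the memory left over after the worker model is loaded.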

Embeddings are latency-sensitive and high-volume: a single backfill may fire
thousands of `/api/embed` calls. Forcing them through the serialized worker queue
(ADR-0009) would make them wait behind 20 GB chat jobs and worker-model swaps for
no reason, since the embedder is always loaded and never needs swapping.

## Decision

**The target runs exactly two resident models, and embeddings bypass the queue.**

- `OLLAMA_MAX_LOADED_MODELS=2`: slot 1 is the always-resident embedder (pinned
  with `keep_alive: -1`); slot 2 is the rotating worker model managed by the
  single-worker drain-by-model loop (ADR-0009).
- **Routing rule:** only `/api/chat` and `POST /jobs` are serialized through the
  worker queue. `/api/embed` (and the `/api/embeddings` alias) are proxied
  **directly and concurrently** to the target, never touching the queue, the
  worker loop, or the job store. Concurrent embedding requests are allowed; they
  hit the always-resident embedder and do not contend with worker-model swaps.
- The embedding model name is configurable (`FOREMAN_EMBED_MODEL`); foreman warms
  it on startup and on reconnect after the target was unreachable, so it stays in
  slot 1.
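The routing rule and the warm-up request above can be sketched as follows. This is an illustration under stated assumptions, not foreman's actual code: the function names, the `"other"` category for routes the rule does not mention, the placeholder warm-up input, and the fallback model name are all hypothetical. The `keep_alive: -1` field and the `FOREMAN_EMBED_MODEL` variable come from the decision itself.

```python
import os

# Embedding paths named in the routing rule; these bypass the queue.
EMBED_PATHS = {"/api/embed", "/api/embeddings"}


def classify_route(method: str, path: str) -> str:
    """Return "bypass", "queue", or "other" for an incoming request."""
    if path in EMBED_PATHS:
        return "bypass"  # proxied directly and concurrently to the target
    if path == "/api/chat" or (method.upper() == "POST" and path == "/jobs"):
        return "queue"  # serialized through the worker queue (ADR-0009)
    return "other"  # assumption: anything else is handled outside both paths


def build_warmup_request(embed_model: str) -> dict:
    """Body for a warm-up call to /api/embed that pins the embedder."""
    return {
        "model": embed_model,
        "input": "warmup",  # placeholder text; any short input would do
        "keep_alive": -1,   # keep the model resident indefinitely
    }


# The fallback name here is an assumption made for this example only.
embed_model = os.environ.get("FOREMAN_EMBED_MODEL", "nomic-embed-text")
print(classify_route("POST", "/api/embed"), build_warmup_request(embed_model))
```

Keeping the classification in one small pure function makes the separation between the two code paths easy to test without a running target.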

## Consequences

- Embeddings are fast and concurrent regardless of worker-queue depth — the right
  behavior for indexing/RAG backfills.
- The "concurrency" of foreman is precisely this: embedder ∥ worker. Worker jobs
  among themselves remain strictly serial (ADR-0009). There is no other
  parallelism, and none should be added.
- foreman must distinguish embedding routes from chat routes at the HTTP layer and
  keep them on separate code paths.
- If the operator misconfigures the target with `OLLAMA_MAX_LOADED_MODELS=1`,
  embeddings and worker jobs will fight for the single slot and thrash; foreman
  should log a warning if `/api/ps` shows only one model loaded while both
  embedding and worker requests are in flight.
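The single-slot check in the last consequence could look roughly like this. It is a sketch operating on an already-parsed `/api/ps` response (a `models` list whose entries carry a `name`); the function names, the `under_load` flag, the warning text, and the exact heuristic are assumptions, since the ADR does not define them. The worker model name in the usage lines is a placeholder.

```python
from typing import Optional


def normalize_model_name(name: str) -> str:
    """Treat an untagged model name as carrying the default ':latest' tag."""
    return name if ":" in name else f"{name}:latest"


def residency_warning(ps_response: dict, embed_model: str, under_load: bool) -> Optional[str]:
    """Return a warning string if the target looks limited to one slot, else None."""
    loaded = [
        normalize_model_name(m["name"])
        for m in ps_response.get("models", [])
        if m.get("name")
    ]
    # Assumed heuristic: only judge while both kinds of request are in flight,
    # because an idle target may legitimately show fewer than two models.
    if under_load and len(loaded) < 2:
        embedder_present = normalize_model_name(embed_model) in loaded
        return (
            f"only {len(loaded)} model(s) resident under load "
            f"(embedder resident: {embedder_present}); "
            "check OLLAMA_MAX_LOADED_MODELS on the target"
        )
    return None


# Placeholder responses shaped like /api/ps output.
two_slots = {"models": [{"name": "nomic-embed-text:latest"}, {"name": "example-worker:latest"}]}
one_slot = {"models": [{"name": "example-worker:latest"}]}
print(residency_warning(two_slots, "nomic-embed-text", under_load=True))
print(residency_warning(one_slot, "nomic-embed-text", under_load=True))
```

Gating the check on load avoids false alarms right after startup, before the worker model has been requested for the first time.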

## Alternatives considered

- **Embeddings through the queue.** Simple uniformity, but serializes a
  high-volume concurrent workload behind chat jobs for no benefit. Rejected.
- **A dedicated second daemon for embeddings.** Violates one-daemon-per-target
  (ADR-0001) and is unnecessary, since Ollama already serves both from one
  endpoint. Rejected.