foreman/docs/adr/0013-embeddings-bypass-and-two-slot.md
steve 0526bada90 docs: land prior ADR + prompt updates
Commit pre-existing uncommitted working-tree changes that predate the
license/public-readiness work — NOT authored in this session, just flushed so
they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013
(embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and
the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:33:39 -04:00

# ADR-0013: Two-slot residency and embedding bypass
**Status:** Accepted — 2026-05-23
## Context
The target keeps **two** models resident (`OLLAMA_MAX_LOADED_MODELS=2`): a small,
always-resident **embedding model** (e.g. `nomic-embed-text` or
`qwen3-embedding`, operator-swappable) and one rotating **worker model** that
chat jobs queue against (ADR-0009). The embedder is tiny (~0.3–0.6 GB) and
co-resides cheaply with a ~20 GB worker model on 32 GB.

Embeddings are latency-sensitive and high-volume: a single backfill may fire
thousands of `/api/embed` calls. Forcing them through the serialized worker queue
(ADR-0009) would make them wait behind ~20 GB chat jobs and worker-model swaps
for no reason, since the embedder is always loaded and never needs swapping.
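
The co-residency claim is simple arithmetic, sketched below with the figures from this section. `headroom_gb` is a hypothetical helper, and the model ignores context/KV-cache growth and OS overhead, so treat the result as an upper bound rather than a measurement.

```python
def headroom_gb(total_gb: float, resident_gb: list[float]) -> float:
    """Memory left after all resident models are loaded (weights only)."""
    return total_gb - sum(resident_gb)


# Figures from the Context section: 32 GB target, ~20 GB worker model,
# embedder taken at the top of its ~0.3-0.6 GB range.
left = headroom_gb(32.0, [20.0, 0.6])
print(f"headroom with both slots filled: {left:.1f} GB")
```

Even at the top of its range, the embedder costs a small fraction of what the worker model does, which is why pinning it is cheap.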
## Decision
**The target runs exactly two resident models, and embeddings bypass the queue.**
- `OLLAMA_MAX_LOADED_MODELS=2`: slot 1 is the always-resident embedder (pinned
with `keep_alive: -1`); slot 2 is the rotating worker model managed by the
single-worker drain-by-model loop (ADR-0009).
- **Routing rule:** only `/api/chat` and `POST /jobs` are serialized through the
worker queue. `/api/embed` (and the `/api/embeddings` alias) are proxied
**directly and concurrently** to the target, never touching the queue, the
worker loop, or the job store. Concurrent embedding requests are allowed; they
hit the always-resident embedder and do not contend with worker-model swaps.
- The embedding model name is configurable (`FOREMAN_EMBED_MODEL`); foreman warms
it on startup and on reconnect after the target was unreachable, so it stays in
slot 1.
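
As a rough illustration of the routing rule and the slot-1 pin, the sketch below classifies a request and builds a warm-up request body. It is not foreman's actual code: `route`, `warm_payload`, the `QUEUE`/`BYPASS` labels, the `"warmup"` input string, and the `nomic-embed-text` fallback are all assumptions made for the example. Only the paths, `FOREMAN_EMBED_MODEL`, and `keep_alive: -1` come from this ADR.

```python
import os
from typing import Optional

QUEUE = "queue"    # serialized through the worker queue (ADR-0009)
BYPASS = "bypass"  # proxied directly and concurrently to the target

EMBED_PATHS = {"/api/embed", "/api/embeddings"}


def route(method: str, path: str) -> Optional[str]:
    """Classify a request per the routing rule.

    Returns None for paths this ADR does not cover (hypothetical behavior).
    """
    if path in EMBED_PATHS:
        return BYPASS
    if path == "/api/chat" or (method.upper() == "POST" and path == "/jobs"):
        return QUEUE
    return None


def warm_payload() -> dict:
    """Body for a warm-up embed call that pins the embedder in slot 1."""
    # The fallback model name is an assumption for this sketch.
    model = os.environ.get("FOREMAN_EMBED_MODEL", "nomic-embed-text")
    return {"model": model, "input": "warmup", "keep_alive": -1}
```

Under these assumptions, the same `warm_payload()` body would be posted to `/api/embed` at startup and again after a reconnect, and the HTTP layer would consult `route()` before deciding whether a request ever reaches the queue.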
## Consequences
- Embeddings are fast and concurrent regardless of worker-queue depth — the right
behavior for indexing/RAG backfills.
- The "concurrency" of foreman is precisely this: embedder ∥ worker. Worker jobs
among themselves remain strictly serial (ADR-0009). There is no other
parallelism, and none should be added.
- foreman must distinguish embedding routes from chat routes at the HTTP layer and
keep them on separate code paths.
- If the operator misconfigures the target to `OLLAMA_MAX_LOADED_MODELS=1`, embeddings
and worker jobs will fight for the single slot and thrash; foreman should log a
startup warning if it observes only one slot via `/api/ps` under load.
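
One way the single-slot check in the last bullet might work is sketched below. It assumes the parsed `/api/ps` response has a `models` list whose entries carry a `name` field. `looks_single_slot`, the `worker_busy` flag, and the tag-matching heuristic are hypothetical, not foreman's actual detection logic.

```python
import logging

log = logging.getLogger("foreman")


def resident_models(ps_response: dict) -> list[str]:
    """Names of the models the target reports as loaded."""
    return [m.get("name", "") for m in ps_response.get("models", [])]


def looks_single_slot(ps_response: dict, embed_model: str, worker_busy: bool) -> bool:
    """Heuristic: under load, the pinned embedder should still be resident
    alongside a worker model. If it is not, the target is probably running
    with a single slot."""
    names = resident_models(ps_response)
    embedder_resident = any(
        n == embed_model or n.startswith(embed_model + ":") for n in names
    )
    return worker_busy and (len(names) < 2 or not embedder_resident)


def warn_if_single_slot(ps_response: dict, embed_model: str, worker_busy: bool) -> None:
    if looks_single_slot(ps_response, embed_model, worker_busy):
        log.warning(
            "target appears to have one model slot; "
            "check OLLAMA_MAX_LOADED_MODELS=2 (resident: %s)",
            resident_models(ps_response),
        )
```

The check only fires while a worker job is active. On an idle target only the embedder is loaded, so seeing one model there proves nothing about the slot count.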
## Alternatives considered
- **Embeddings through the queue.** Simple uniformity, but serializes a
high-volume concurrent workload behind chat jobs for no benefit. Rejected.
- **A dedicated second daemon for embeddings.** Violates one-daemon-per-target
(ADR-0001) and is unnecessary — Ollama already serves both from one endpoint.
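
The cost of the first rejected alternative can be seen in a toy `asyncio` model. This is an illustration only: the lock stands in for the single serialized worker, the sleep durations are arbitrary, and `simulate` is a hypothetical name, not part of foreman.

```python
import asyncio


async def simulate(embeds_through_queue: bool) -> list[str]:
    """Return completion order for one chat job and three embed calls."""
    events: list[str] = []
    worker_lock = asyncio.Lock()  # models the single serialized worker

    async def chat_job() -> None:
        async with worker_lock:
            await asyncio.sleep(0.05)  # long-running chat job
        events.append("chat")

    async def embed(i: int) -> None:
        if embeds_through_queue:
            # Rejected alternative: embeds wait their turn behind chat.
            async with worker_lock:
                await asyncio.sleep(0.001)
        else:
            # Accepted design: embeds never touch the worker lock.
            await asyncio.sleep(0.001)
        events.append(f"embed:{i}")

    tasks = [asyncio.create_task(chat_job())]
    tasks += [asyncio.create_task(embed(i)) for i in range(3)]
    await asyncio.gather(*tasks)
    return events


if __name__ == "__main__":
    print("queued:", asyncio.run(simulate(True)))
    print("bypass:", asyncio.run(simulate(False)))
```

In the queued run every embed completes only after the chat job releases the lock. In the bypass run all three finish while the chat job is still in flight, which is the behavior this ADR selects.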