Commit pre-existing uncommitted working-tree changes that predate the license/public-readiness work — NOT authored in this session, just flushed so they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013 (embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and the phase-0..3 prompt revisions + prompts/README.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ADR-0013: Two-slot residency and embedding bypass
Status: Accepted — 2026-05-23
Context
The target keeps two models resident (OLLAMA_MAX_LOADED_MODELS=2): a small,
always-resident embedding model (e.g. nomic-embed-text or
qwen3-embedding, operator-swappable) and one rotating worker model that
chat jobs queue against (ADR-0009). The embedder is tiny (~0.3–0.6 GB) and
co-resides cheaply with a ~20 GB worker model on 32 GB.
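The co-residency claim is simple arithmetic. A minimal sketch using the approximate figures above, assuming the 32 GB is the memory available to the target and counting model weights only (KV cache and runtime overhead are ignored; `headroom_gb` is a hypothetical helper, not part of foreman):

```python
def headroom_gb(total_gb: float, worker_gb: float, embedder_gb: float) -> float:
    """Memory left once both resident models are loaded (weights only)."""
    return total_gb - worker_gb - embedder_gb


# Approximate figures from the text: 32 GB total, ~20 GB worker,
# 0.3 to 0.6 GB embedder.
best = headroom_gb(32.0, 20.0, 0.3)
worst = headroom_gb(32.0, 20.0, 0.6)
print(f"headroom: {worst:.1f} to {best:.1f} GB")
```

Either way the embedder costs well under 1 GB of the remaining headroom, which is why pinning it permanently is cheap.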
Embeddings are latency-sensitive and high-volume: a single backfill may fire
thousands of /api/embed calls. Forcing them through the serialized worker queue
(ADR-0009) would make them wait behind 20 GB chat jobs and their model swaps for
no benefit, since the embedder is always loaded and never needs swapping.
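The cost of queueing can be illustrated with a toy model of a single serial worker. The durations below are illustrative placeholders, not measurements, and the model ignores swap time, which would only make the queued case worse:

```python
from itertools import accumulate


def completion_times(durations: list[float]) -> list[float]:
    """Finish time of each job when one worker runs them strictly in order."""
    return list(accumulate(durations))


# Illustrative numbers only: two chat jobs are already queued when
# three embed calls arrive.
chat = [30.0, 30.0]
embeds = [0.05, 0.05, 0.05]

# Queued: every embed call waits for all chat jobs ahead of it.
queued = completion_times(chat + embeds)[len(chat):]
# Direct: each embed call is served on its own, independent of the queue.
direct = embeds

print("first embed, queued:", queued[0], "s")
print("first embed, direct:", direct[0], "s")
```

In this model the first embed call finishes after more than a minute when queued, versus a fraction of a second when served directly.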
Decision
The target runs exactly two resident models, and embeddings bypass the queue.
- OLLAMA_MAX_LOADED_MODELS=2: slot 1 is the always-resident embedder (pinned with keep_alive: -1); slot 2 is the rotating worker model managed by the single-worker drain-by-model loop (ADR-0009).
- Routing rule: only /api/chat and POST /jobs are serialized through the worker queue. /api/embed (and the /api/embeddings alias) are proxied directly and concurrently to the target, never touching the queue, the worker loop, or the job store. Concurrent embedding requests are allowed; they hit the always-resident embedder and do not contend with worker-model swaps.
- The embedding model name is configurable (FOREMAN_EMBED_MODEL); foreman warms it on startup and on reconnect after the target was unreachable, so it stays in slot 1.
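The routing rule and the warm-up request can be sketched as follows. This is a hedged illustration, not foreman's actual code: `Route`, `classify`, and `warmup_request` are hypothetical names, the fallback model name and the "warmup" input string are assumptions, and only the paths, the `keep_alive: -1` pin, and the FOREMAN_EMBED_MODEL variable come from the decision above.

```python
import os
from enum import Enum


class Route(Enum):
    QUEUE = "queue"    # serialized through the worker queue
    BYPASS = "bypass"  # proxied directly and concurrently to the target
    OTHER = "other"    # not covered by this routing rule


# Paths taken from the decision text.
EMBED_PATHS = {"/api/embed", "/api/embeddings"}


def classify(method: str, path: str) -> Route:
    """Decide whether a request is queued or bypasses the queue."""
    # Drop any query string and trailing slash before matching.
    clean = path.split("?", 1)[0].rstrip("/") or "/"
    if clean in EMBED_PATHS:
        return Route.BYPASS
    if clean == "/api/chat":
        return Route.QUEUE
    if method.upper() == "POST" and clean == "/jobs":
        return Route.QUEUE
    return Route.OTHER


def warmup_request(model: str) -> dict:
    """Body for a POST to /api/embed that loads and pins the embedder."""
    # keep_alive=-1 asks Ollama to keep the model loaded indefinitely.
    # The input string is an arbitrary placeholder (assumption).
    return {"model": model, "input": "warmup", "keep_alive": -1}


# The default model name here is an assumption for the example.
embed_model = os.environ.get("FOREMAN_EMBED_MODEL", "nomic-embed-text")
print(classify("POST", "/api/embed"), warmup_request(embed_model))
```

Keeping the classification in one pure function makes the "separate code paths" requirement easy to test without a running target.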
Consequences
- Embeddings are fast and concurrent regardless of worker-queue depth — the right behavior for indexing/RAG backfills.
- The "concurrency" of foreman is precisely this: embedder ∥ worker. Worker jobs among themselves remain strictly serial (ADR-0009). There is no other parallelism, and none should be added.
- foreman must distinguish embedding routes from chat routes at the HTTP layer and keep them on separate code paths.
- If the operator misconfigures the target to OLLAMA_MAX_LOADED_MODELS=1, embeddings and worker jobs will fight for the single slot and thrash; foreman should log a startup warning if it observes only one slot via /api/ps under load.
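One way the single-slot check could look, as a sketch under stated assumptions: the function names are hypothetical, "under load" is passed in by the caller (for example, a worker job is in flight while the embedder has been warmed), and the input is the parsed JSON body of the target's /api/ps response, which lists loaded models under a `models` key.

```python
import logging

log = logging.getLogger("foreman")


def loaded_model_names(ps_body: dict) -> list[str]:
    """Names of the models the target currently reports as loaded."""
    return [m.get("name", "") for m in ps_body.get("models", [])]


def should_warn_single_slot(ps_body: dict, under_load: bool) -> bool:
    """True if fewer than two models are resident while both should be."""
    return under_load and len(loaded_model_names(ps_body)) < 2


def check_slots(ps_body: dict, under_load: bool) -> None:
    """Log a warning when the target appears to have only one slot."""
    if should_warn_single_slot(ps_body, under_load):
        log.warning(
            "only %d model(s) resident under load; is "
            "OLLAMA_MAX_LOADED_MODELS set to 1 on the target?",
            len(loaded_model_names(ps_body)),
        )


# Example with a hand-written response body (not captured from a real target).
check_slots({"models": [{"name": "nomic-embed-text:latest"}]}, under_load=True)
```

The check only counts resident models, so it cannot tell a misconfigured target from one that is still loading the worker model; treating it as a warning rather than an error reflects that.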
Alternatives considered
- Embeddings through the queue. Simple uniformity, but serializes a high-volume concurrent workload behind chat jobs for no benefit. Rejected.
- A dedicated second daemon for embeddings. Violates one-daemon-per-target (ADR-0001) and is unnecessary — Ollama already serves both from one endpoint.