steve 0526bada90 docs: land prior ADR + prompt updates

Commit pre-existing uncommitted working-tree changes that predate the
license/public-readiness work — NOT authored in this session, just flushed so
they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013
(embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and
the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-26 20:33:39 -04:00


ADR-0013: Two-slot residency and embedding bypass

Status: Accepted — 2026-05-23

Context

The target keeps two models resident (OLLAMA_MAX_LOADED_MODELS=2): a small, always-resident embedding model (e.g. nomic-embed-text or qwen3-embedding, operator-swappable) and one rotating worker model that chat jobs queue against (ADR-0009). The embedder is tiny (~0.3–0.6 GB) and co-resides cheaply with a ~20 GB worker model on a 32 GB target.

Embeddings are latency-sensitive and high-volume: a single backfill may fire thousands of /api/embed calls. Forcing them through the serialized worker queue (ADR-0009) would make them wait behind 20 GB chat jobs and invite swap thrashing for no reason, since the embedder is always loaded and never needs swapping.
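
The queueing cost described above can be illustrated with a toy simulation. This is only a sketch: the durations are invented, an asyncio.Lock stands in for the single serialized worker, and none of the names come from foreman itself.

```python
import asyncio

CHAT_SECONDS = 0.05    # invented stand-in for one long chat job
EMBED_SECONDS = 0.001  # invented stand-in for one embedding call


async def simulate(embeds_use_queue: bool, n_embeds: int = 5) -> list[str]:
    """Return request labels in the order they complete."""
    worker_slot = asyncio.Lock()  # models the single serialized worker
    finished: list[str] = []

    async def chat_job() -> None:
        async with worker_slot:
            await asyncio.sleep(CHAT_SECONDS)
        finished.append("chat")

    async def embed_call(i: int) -> None:
        if embeds_use_queue:
            # Queued: the call waits for the worker slot behind the chat job.
            async with worker_slot:
                await asyncio.sleep(EMBED_SECONDS)
        else:
            # Direct: the call never touches the worker slot.
            await asyncio.sleep(EMBED_SECONDS)
        finished.append(f"embed-{i}")

    chat = asyncio.create_task(chat_job())
    await asyncio.sleep(0)  # let the chat job take the worker slot first
    await asyncio.gather(chat, *(embed_call(i) for i in range(n_embeds)))
    return finished


queued_order = asyncio.run(simulate(embeds_use_queue=True))
direct_order = asyncio.run(simulate(embeds_use_queue=False))
```

In the queued run every embedding call completes after the chat job; in the direct run they all complete before it, which is the behavior a backfill wants.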

Decision

The target runs exactly two resident models, and embeddings bypass the queue.

OLLAMA_MAX_LOADED_MODELS=2: slot 1 is the always-resident embedder (pinned with keep_alive: -1); slot 2 is the rotating worker model managed by the single-worker drain-by-model loop (ADR-0009).
Routing rule: only /api/chat and POST /jobs are serialized through the worker queue. /api/embed (and the /api/embeddings alias) are proxied directly and concurrently to the target, never touching the queue, the worker loop, or the job store. Concurrent embedding requests are allowed; they hit the always-resident embedder and do not contend with worker-model swaps.
The embedding model name is configurable (FOREMAN_EMBED_MODEL); foreman warms it on startup and on reconnect after the target was unreachable, so it stays in slot 1.
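
The routing rule and the warm-up request can be sketched as two small pure functions. This is a hedged illustration, not foreman's actual code: the function names, the "other" category for routes this ADR does not mention, the fallback model name, and the "warmup" input string are all assumptions. The keep_alive: -1 value and the route paths come from the text above.

```python
import os

QUEUED = "queue"    # serialized through the single worker loop
BYPASS = "bypass"   # proxied directly and concurrently to the target
OTHER = "other"     # routes this ADR does not cover (assumed category)

EMBED_PATHS = {"/api/embed", "/api/embeddings"}


def classify(method: str, path: str) -> str:
    """Decide which code path a request takes, per the routing rule."""
    # Drop any query string and trailing slash before matching.
    route = path.split("?", 1)[0].rstrip("/") or "/"
    if route in EMBED_PATHS:
        return BYPASS
    if route == "/api/chat" or (method.upper() == "POST" and route == "/jobs"):
        return QUEUED
    return OTHER


def warmup_body(model: str) -> dict:
    """JSON body for a warm-up POST to /api/embed that pins the embedder."""
    # keep_alive=-1 asks the target to keep the model loaded indefinitely.
    return {"model": model, "input": "warmup", "keep_alive": -1}


# FOREMAN_EMBED_MODEL is named in the ADR; the fallback here is an assumption.
embed_model = os.environ.get("FOREMAN_EMBED_MODEL", "nomic-embed-text")
startup_request = warmup_body(embed_model)
```

Keeping classification separate from dispatch makes the "separate code paths" requirement easy to test: the embedding branch can be checked without instantiating the queue or the job store.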

Consequences

Embeddings are fast and concurrent regardless of worker-queue depth — the right behavior for indexing/RAG backfills.
The "concurrency" of foreman is precisely this: embedder ∥ worker. Worker jobs among themselves remain strictly serial (ADR-0009). There is no other parallelism, and none should be added.
foreman must distinguish embedding routes from chat routes at the HTTP layer and keep them on separate code paths.
If the operator misconfigures the target with OLLAMA_MAX_LOADED_MODELS=1, embeddings and worker jobs will fight for the single slot and thrash; foreman should log a startup warning if it observes only one loaded model via /api/ps under load.
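
The single-slot check in the last consequence might look like the sketch below. It assumes foreman has already fetched and parsed the /api/ps response, and that "under load" means a worker job is currently running; the function names, the logger name, and the sample payloads (including the model name worker-model) are hypothetical.

```python
import logging

log = logging.getLogger("foreman")  # logger name is an assumption


def loaded_models(ps_payload: dict) -> list[str]:
    """Names of the models an /api/ps response reports as loaded."""
    return [m.get("name", "") for m in ps_payload.get("models", [])]


def single_slot_suspected(ps_payload: dict, worker_busy: bool) -> bool:
    """True when a worker job is running yet fewer than two models are loaded."""
    return worker_busy and len(loaded_models(ps_payload)) < 2


def warn_if_single_slot(ps_payload: dict, worker_busy: bool) -> bool:
    """Log a warning when the target looks limited to one resident model."""
    suspected = single_slot_suspected(ps_payload, worker_busy)
    if suspected:
        names = loaded_models(ps_payload)
        log.warning(
            "target reports %d loaded model(s) under load (%s); "
            "check that OLLAMA_MAX_LOADED_MODELS=2",
            len(names), names,
        )
    return suspected


# Illustrative payloads shaped like /api/ps responses (values invented).
healthy = {"models": [{"name": "nomic-embed-text:latest"},
                      {"name": "worker-model:latest"}]}
thrashing = {"models": [{"name": "worker-model:latest"}]}

healthy_flag = warn_if_single_slot(healthy, worker_busy=True)
thrashing_flag = warn_if_single_slot(thrashing, worker_busy=True)
```

A single observation can be a false positive, for example if the embedder has not been warmed yet, so a real check would probably sample /api/ps more than once before warning.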

Alternatives considered

Embeddings through the queue. Simple uniformity, but serializes a high-volume concurrent workload behind chat jobs for no benefit. Rejected.
A dedicated second daemon for embeddings. Violates one-daemon-per-target (ADR-0001) and is unnecessary: Ollama already serves both from one endpoint. Rejected.
