docs: land prior ADR + prompt updates

Commit pre-existing uncommitted working-tree changes that predate the license/public-readiness work — NOT authored in this session, just flushed so they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013 (embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and the phase-0..3 prompt revisions + prompts/README.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:33:39 -04:00
parent 823c0b4ca8
commit 0526bada90
10 changed files with 276 additions and 98 deletions
@@ -4,16 +4,22 @@

 ## Context

-The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one model fast
-at a time; loading a different model is a 5-10s cold start. Running two models
-concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism
-against a single target buys nothing and would reintroduce coordination logic.
+The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one *worker*
+model fast at a time; loading a different worker model is a 5-10s cold start.
+Running two large models concurrently on 32GB either OOMs or pages to a 5-10x
+slowdown. So parallelism among **worker** models against a single target buys
+nothing and would reintroduce coordination logic.
+
+The one exception is a small always-resident embedding model, which co-resides
+cheaply alongside the worker model and is served outside the queue entirely
+(ADR-0013). This ADR governs only the worker slot.

 ## Decision

-**Concurrency against the target is 1.** A single worker loop pulls the next job
-from the queue, ensures the right model is resident, executes, and records the
-result.
+**Worker-model concurrency against the target is 1.** A single worker loop pulls
+the next job from the queue, ensures the right worker model is resident, executes,
+and records the result. (Embeddings are not jobs and never enter this loop —
+ADR-0013.)

 **Drain-by-model scheduling:** before incurring a model swap, the worker finishes
 every queued job that targets the **currently-resident** model (observed via
@@ -25,9 +31,10 @@ heuristic, not a scheduler. There is intentionally **no** priority system,
 fairness weighting, or capacity budgeting (those sank the predecessor; see
 ADR-0001).

-Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
-between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
-to single-resident swap.
+Residency is pinned with Ollama `keep_alive` so the hot worker model isn't
+unloaded between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=2` on the target
+holds two slots: the always-resident embedding model plus the rotating worker
+model (ADR-0013). Worker models still swap one-at-a-time within their single slot.

 ## Consequences