docs: land prior ADR + prompt updates

Commit pre-existing uncommitted working-tree changes that predate the
license/public-readiness work — NOT authored in this session, just flushed so
they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013
(embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and
the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:33:39 -04:00
parent 823c0b4ca8
commit 0526bada90
10 changed files with 276 additions and 98 deletions
+17 -10
@@ -4,16 +4,22 @@
## Context
-The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one model fast
-at a time; loading a different model is a 5-10s cold start. Running two models
-concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism
-against a single target buys nothing and would reintroduce coordination logic.
+The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one *worker*
+model fast at a time; loading a different worker model is a 5-10s cold start.
+Running two large models concurrently on 32GB either OOMs or pages to a 5-10x
+slowdown. So parallelism among **worker** models against a single target buys
+nothing and would reintroduce coordination logic.
+The one exception is a small always-resident embedding model, which co-resides
+cheaply alongside the worker model and is served outside the queue entirely
+(ADR-0013). This ADR governs only the worker slot.
## Decision
-**Concurrency against the target is 1.** A single worker loop pulls the next job
-from the queue, ensures the right model is resident, executes, and records the
-result.
+**Worker-model concurrency against the target is 1.** A single worker loop pulls
+the next job from the queue, ensures the right worker model is resident, executes,
+and records the result. (Embeddings are not jobs and never enter this loop —
+ADR-0013.)
**Drain-by-model scheduling:** before incurring a model swap, the worker finishes
every queued job that targets the **currently-resident** model (observed via
@@ -25,9 +31,10 @@ heuristic, not a scheduler. There is intentionally **no** priority system,
fairness weighting, or capacity budgeting (those sank the predecessor; see
ADR-0001).
-Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
-between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
-to single-resident swap.
+Residency is pinned with Ollama `keep_alive` so the hot worker model isn't
+unloaded between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=2` on the target
+holds two slots: the always-resident embedding model plus the rotating worker
+model (ADR-0013). Worker models still swap one-at-a-time within their single slot.
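Reviewer's illustration, not part of the committed files: a minimal sketch of the drain-by-model rule described in the Decision section, assuming an in-memory FIFO list of jobs. The names `Job`, `pick_next`, and `drain` are hypothetical, and the resident model is passed in as a plain argument rather than observed from the target. Embeddings are deliberately absent, since the ADR says they never enter this loop.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Job:
    job_id: int
    model: str  # worker model this job targets (hypothetical field name)


def pick_next(queue: List[Job], resident: Optional[str]) -> Optional[Job]:
    """Prefer the oldest job for the resident model; else take the oldest job."""
    if not queue:
        return None
    for index, job in enumerate(queue):
        if job.model == resident:
            return queue.pop(index)
    # No queued job targets the resident model, so a swap is unavoidable.
    return queue.pop(0)


def drain(queue: List[Job], resident: Optional[str] = None) -> Tuple[List[int], int]:
    """Run the single worker loop; return execution order and model-load count."""
    order: List[int] = []
    loads = 0
    while True:
        job = pick_next(queue, resident)
        if job is None:
            break
        if job.model != resident:
            resident = job.model  # stand-in for the 5-10s cold start
            loads += 1
        order.append(job.job_id)
    return order, loads


if __name__ == "__main__":
    jobs = [Job(1, "a"), Job(2, "b"), Job(3, "a"), Job(4, "b"), Job(5, "a")]
    print(drain(jobs))
```

With the interleaved queue in the usage example, strict FIFO order would load a model five times; draining by model executes jobs 1, 3, 5, 2, 4 and loads a model twice. This is a heuristic on pick order only, with no priorities or fairness weighting, which matches the ADR's stated intent.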
## Consequences