docs: land prior ADR + prompt updates
Commit pre-existing uncommitted working-tree changes that predate the license/public-readiness work — NOT authored in this session, just flushed so they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013 (embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and the phase-0..3 prompt revisions + prompts/README.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -4,16 +4,22 @@
|
||||
|
||||
## Context
|
||||
|
||||
The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one model fast
|
||||
at a time; loading a different model is a 5-10s cold start. Running two models
|
||||
concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism
|
||||
against a single target buys nothing and would reintroduce coordination logic.
|
||||
The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one *worker*
|
||||
model fast at a time; loading a different worker model is a 5-10s cold start.
|
||||
Running two large models concurrently on 32GB either OOMs or pages to a 5-10x
|
||||
slowdown. So parallelism among **worker** models against a single target buys
|
||||
nothing and would reintroduce coordination logic.
|
||||
|
||||
The one exception is a small always-resident embedding model, which co-resides
|
||||
cheaply alongside the worker model and is served outside the queue entirely
|
||||
(ADR-0013). This ADR governs only the worker slot.
|
||||
|
||||
## Decision
|
||||
|
||||
**Concurrency against the target is 1.** A single worker loop pulls the next job
|
||||
from the queue, ensures the right model is resident, executes, and records the
|
||||
result.
|
||||
**Worker-model concurrency against the target is 1.** A single worker loop pulls
|
||||
the next job from the queue, ensures the right worker model is resident, executes,
|
||||
and records the result. (Embeddings are not jobs and never enter this loop —
|
||||
ADR-0013.)
|
||||
|
||||
**Drain-by-model scheduling:** before incurring a model swap, the worker finishes
|
||||
every queued job that targets the **currently-resident** model (observed via
|
||||
@@ -25,9 +31,10 @@ heuristic, not a scheduler. There is intentionally **no** priority system,
|
||||
fairness weighting, or capacity budgeting (those sank the predecessor; see
|
||||
ADR-0001).
|
||||
|
||||
Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
|
||||
between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
|
||||
to single-resident swap.
|
||||
Residency is pinned with Ollama `keep_alive` so the hot worker model isn't
|
||||
unloaded between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=2` on the target
|
||||
holds two slots: the always-resident embedding model plus the rotating worker
|
||||
model (ADR-0013). Worker models still swap one-at-a-time within their single slot.
|
||||
|
||||
## Consequences
|
||||
|
||||
|
||||
Reference in New Issue
Block a user