0526bada90
Commit pre-existing uncommitted working-tree changes that predate the license/public-readiness work — NOT authored in this session, just flushed so they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013 (embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and the phase-0..3 prompt revisions + prompts/README.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
52 lines
2.4 KiB
Markdown
52 lines
2.4 KiB
Markdown
# ADR-0009: Single-worker serialization and drain-by-model scheduling
|
|
|
|
**Status:** Accepted — 2026-05-23
|
|
|
|
## Context
|
|
|
|
The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one *worker*
|
|
model fast at a time; loading a different worker model is a 5-10s cold start.
|
|
Running two large models concurrently on 32GB either OOMs or pages to a 5-10x
|
|
slowdown. So parallelism among **worker** models against a single target buys
|
|
nothing and would reintroduce coordination logic.
|
|
|
|
The one exception is a small always-resident embedding model, which co-resides
|
|
cheaply alongside the worker model and is served outside the queue entirely
|
|
(ADR-0013). This ADR governs only the worker slot.
|
|
|
|
## Decision
|
|
|
|
**Worker-model concurrency against the target is 1.** A single worker loop pulls
|
|
the next job from the queue, ensures the right worker model is resident, executes,
|
|
and records the result. (Embeddings are not jobs and never enter this loop —
|
|
ADR-0013.)
|
|
|
|
**Drain-by-model scheduling:** before incurring a model swap, the worker finishes
|
|
every queued job that targets the **currently-resident** model (observed via
|
|
`/api/ps`, ADR-0007). Only when no job for the hot model remains does it select a
|
|
job for a different model and pay the swap cost.
|
|
|
|
This is an `ORDER BY (model != current_model), created_at` style selection — a
|
|
heuristic, not a scheduler. There is intentionally **no** priority system,
|
|
fairness weighting, or capacity budgeting (those sank the predecessor; see
|
|
ADR-0001).
|
|
|
|
Residency is pinned with Ollama `keep_alive` so the hot worker model isn't
|
|
unloaded between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=2` on the target
|
|
holds two slots: the always-resident embedding model plus the rotating worker
|
|
model (ADR-0013). Worker models still swap one-at-a-time within their single slot.
|
|
|
|
## Consequences
|
|
|
|
- Swap thrash is minimized without any complex scheduling.
|
|
- A long run of same-model jobs can delay a different-model job — acceptable for a
|
|
background box, and bounded by queue depth. If starvation ever becomes a real
|
|
problem, that is a signal to reconsider, not to pre-build fairness.
|
|
- Throughput is dominated by how well callers batch work by model.
|
|
|
|
## Alternatives considered
|
|
|
|
- **FIFO with naive swapping.** Correct but pays a cold start on every model
|
|
change; wasteful when jobs interleave models. Rejected.
|
|
- **Priority/fair scheduling.** Explicitly rejected as scope creep (ADR-0001).
|