Files
foreman/docs/adr/0009-single-worker-drain-by-model.md
steve 0526bada90 docs: land prior ADR + prompt updates
Commit pre-existing uncommitted working-tree changes that predate the
license/public-readiness work — NOT authored in this session, just flushed so
they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013
(embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and
the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:33:39 -04:00

52 lines
2.4 KiB
Markdown

# ADR-0009: Single-worker serialization and drain-by-model scheduling
**Status:** Accepted — 2026-05-23
## Context
The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one *worker*
model fast at a time; loading a different worker model is a 5-10s cold start.
Running two large models concurrently on 32GB either OOMs or pages to a 5-10x
slowdown. So parallelism among **worker** models against a single target buys
nothing and would reintroduce coordination logic.
The one exception is a small always-resident embedding model, which co-resides
cheaply alongside the worker model and is served outside the queue entirely
(ADR-0013). This ADR governs only the worker slot.
## Decision
**Worker-model concurrency against the target is 1.** A single worker loop pulls
the next job from the queue, ensures the right worker model is resident, executes,
and records the result. (Embeddings are not jobs and never enter this loop —
ADR-0013.)
**Drain-by-model scheduling:** before incurring a model swap, the worker finishes
every queued job that targets the **currently-resident** model (observed via
`/api/ps`, ADR-0007). Only when no job for the hot model remains does it select a
job for a different model and pay the swap cost.
This is an `ORDER BY (model != current_model), created_at` style selection — a
heuristic, not a scheduler. There is intentionally **no** priority system,
fairness weighting, or capacity budgeting (those sank the predecessor; see
ADR-0001).
Residency is pinned with Ollama `keep_alive` so the hot worker model isn't
unloaded between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=2` on the target
holds two slots: the always-resident embedding model plus the rotating worker
model (ADR-0013). Worker models still swap one-at-a-time within their single slot.
## Consequences
- Swap thrash is minimized without any complex scheduling.
- A long run of same-model jobs can delay a different-model job — acceptable for a
background box, and bounded by queue depth. If starvation ever becomes a real
problem, that is a signal to reconsider, not to pre-build fairness.
- Throughput is dominated by how well callers batch work by model.
## Alternatives considered
- **FIFO with naive swapping.** Correct but pays a cold start on every model
change; wasteful when jobs interleave models. Rejected.
- **Priority/fair scheduling.** Explicitly rejected as scope creep (ADR-0001).