initial commit
This commit is contained in:
@@ -0,0 +1,44 @@
|
||||
# ADR-0009: Single-worker serialization and drain-by-model scheduling
|
||||
|
||||
**Status:** Accepted — 2026-05-23
|
||||
|
||||
## Context
|
||||
|
||||
The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one model fast
|
||||
at a time; loading a different model is a 5-10s cold start. Running two models
|
||||
concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism
|
||||
against a single target buys nothing and would reintroduce coordination logic.
|
||||
|
||||
## Decision
|
||||
|
||||
**Concurrency against the target is 1.** A single worker loop pulls the next job
|
||||
from the queue, ensures the right model is resident, executes, and records the
|
||||
result.
|
||||
|
||||
**Drain-by-model scheduling:** before incurring a model swap, the worker finishes
|
||||
every queued job that targets the **currently-resident** model (observed via
|
||||
`/api/ps`, ADR-0007). Only when no job for the hot model remains does it select a
|
||||
job for a different model and pay the swap cost.
|
||||
|
||||
This is an `ORDER BY (model != current_model), created_at` style selection — a
|
||||
heuristic, not a scheduler. There is intentionally **no** priority system,
|
||||
fairness weighting, or capacity budgeting (those sank the predecessor; see
|
||||
ADR-0001).
|
||||
|
||||
Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
|
||||
between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
|
||||
to single-resident swap.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Swap thrash is minimized without any complex scheduling.
|
||||
- A long run of same-model jobs can delay a different-model job — acceptable for a
|
||||
background box, and bounded by queue depth. If starvation ever becomes a real
|
||||
problem, that is a signal to reconsider, not to pre-build fairness.
|
||||
- Throughput is dominated by how well callers batch work by model.
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **FIFO with naive swapping.** Correct but pays a cold start on every model
|
||||
change; wasteful when jobs interleave models. Rejected.
|
||||
- **Priority/fair scheduling.** Explicitly rejected as scope creep (ADR-0001).
|
||||
Reference in New Issue
Block a user