ADR-0009: Single-worker serialization and drain-by-model scheduling

Status: Accepted — 2026-05-23

Context

The target is memory-bandwidth-bound (the M1 Pro provides ~200 GB/s). It runs one model at a time at full speed; loading a different model costs a 5-10 s cold start. Running two models concurrently in 32 GB either runs out of memory or pages, which causes a 5-10x slowdown. Parallelism against a single target therefore buys nothing and would reintroduce coordination logic.

Decision

Concurrency against the target is 1. A single worker loop pulls the next job from the queue, ensures the right model is resident, executes, and records the result.
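The loop described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Job` shape and the four injected callables (`next_job`, `ensure_resident`, `execute`, `record`) are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    # Hypothetical job shape, for illustration only.
    id: int
    model: str
    prompt: str


def run_worker(
    next_job: Callable[[], Optional[Job]],
    ensure_resident: Callable[[str], None],
    execute: Callable[[Job], str],
    record: Callable[[Job, str], None],
) -> int:
    """Process jobs strictly one at a time until next_job() returns None.

    Returns the number of jobs processed.
    """
    processed = 0
    while True:
        job = next_job()              # pull the next job from the queue
        if job is None:
            return processed          # queue is empty
        ensure_resident(job.model)    # load the model if it is not already hot
        output = execute(job)         # run the job; nothing else runs meanwhile
        record(job, output)           # persist the result
        processed += 1
```

Because there is exactly one loop and no threads, no locking or coordination is needed; the selection policy lives entirely inside `next_job`.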

Drain-by-model scheduling: before incurring a model swap, the worker finishes every queued job that targets the currently-resident model (observed via /api/ps, ADR-0007). Only when no job for the hot model remains does it select a job for a different model and pay the swap cost.

Selection amounts to ORDER BY (model != current_model), created_at: jobs for the resident model sort first, and within each group the oldest job wins. It is a heuristic, not a scheduler. There is intentionally no priority system, fairness weighting, or capacity budgeting (those sank the predecessor; see ADR-0001).
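As a concrete sketch of that selection, assuming the queue lives in a SQLite table: the `jobs` table, its columns, and the `status` values below are hypothetical and only serve the example. The comparison `model != ?` evaluates to 0 for the resident model and 1 otherwise, so hot-model jobs sort first.

```python
import sqlite3
from typing import Optional, Tuple

# Hypothetical schema, assumed for this example only.
SCHEMA = """
CREATE TABLE jobs (
    id         INTEGER PRIMARY KEY,
    model      TEXT NOT NULL,
    created_at INTEGER NOT NULL,
    status     TEXT NOT NULL DEFAULT 'queued'
)
"""


def select_next_job(
    conn: sqlite3.Connection, current_model: Optional[str]
) -> Optional[Tuple[int, str]]:
    """Return (id, model) of the next job, preferring the resident model.

    With no resident model, an empty string is compared instead, so every
    job sorts into the same group and the oldest one is chosen.
    """
    return conn.execute(
        "SELECT id, model FROM jobs "
        "WHERE status = 'queued' "
        "ORDER BY (model != ?), created_at, id "
        "LIMIT 1",
        (current_model or "",),
    ).fetchone()
```

The trailing `id` in the ORDER BY is an addition for deterministic tie-breaking when two jobs share a timestamp; the decision text itself names only the first two sort keys.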

Residency is pinned with Ollama's keep_alive so the hot model is not unloaded between closely spaced jobs. Setting OLLAMA_MAX_LOADED_MODELS=1 on the target limits it to one resident model, so loading a different model always evicts the current one.
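A hedged sketch of how a client might pin residency and read back the resident model. It only builds the request and parses a response, so no server is contacted. The base URL is Ollama's default local address; the `"10m"` keep_alive value and both helper names are assumptions for illustration, not values taken from this ADR.

```python
import json
import urllib.request
from typing import Any, Dict, List

OLLAMA_URL = "http://localhost:11434"  # Ollama's default; the target's address may differ


def build_generate_request(
    model: str, prompt: str, keep_alive: str = "10m"
) -> urllib.request.Request:
    """Build a POST to /api/generate that asks Ollama to keep the model loaded.

    keep_alive="10m" is an illustrative value, not one specified by the ADR.
    """
    body = json.dumps(
        {"model": model, "prompt": prompt, "stream": False, "keep_alive": keep_alive}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def resident_models(ps_response: Dict[str, Any]) -> List[str]:
    """Extract model names from a decoded /api/ps response body."""
    return [m["name"] for m in ps_response.get("models", [])]
```

Sending the request is a matter of passing it to `urllib.request.urlopen`. With OLLAMA_MAX_LOADED_MODELS=1 set on the target, `resident_models` should return at most one name.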

Consequences

  • Swap thrash is minimized without any complex scheduling.
  • A long run of same-model jobs can delay a different-model job — acceptable for a background box, and bounded by queue depth. If starvation ever becomes a real problem, that is a signal to reconsider, not to pre-build fairness.
  • Throughput is dominated by how well callers batch work by model.

Alternatives considered

  • FIFO with naive swapping. Correct but pays a cold start on every model change; wasteful when jobs interleave models. Rejected.
  • Priority/fair scheduling. Explicitly rejected as scope creep (ADR-0001).
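To make the FIFO comparison concrete, the following sketch counts model loads for a static queue under both orderings. It is a simplified simulation that assumes no jobs arrive while the queue drains; the function names are hypothetical.

```python
from typing import List, Optional


def count_loads(order: List[str]) -> int:
    """Count cold starts for a sequence of job models, including the first load."""
    loads = 0
    current: Optional[str] = None
    for model in order:
        if model != current:
            loads += 1
            current = model
    return loads


def drain_by_model_order(queue: List[str]) -> List[str]:
    """Reorder a queue so the resident model is drained before any swap.

    When no job matches the resident model, the oldest remaining job is taken.
    """
    pending = list(queue)
    order: List[str] = []
    current: Optional[str] = None
    while pending:
        # Index of the oldest job for the hot model, else 0 (oldest overall).
        idx = next((i for i, m in enumerate(pending) if m == current), 0)
        current = pending.pop(idx)
        order.append(current)
    return order


interleaved = ["a", "b", "a", "b", "a", "b"]
fifo_loads = count_loads(interleaved)                          # 6
drained_loads = count_loads(drain_by_model_order(interleaved)) # 2
```

For this fully interleaved queue, FIFO pays six loads and drain-by-model pays two. At the 5-10 s cold start cited in the Context section, that is roughly 30-60 s of loading versus 10-20 s.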