Files
foreman/docs/adr/0009-single-worker-drain-by-model.md
steve 0526bada90 docs: land prior ADR + prompt updates
Commit pre-existing uncommitted working-tree changes that predate the
license/public-readiness work — NOT authored in this session, just flushed so
they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013
(embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and
the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:33:39 -04:00

2.4 KiB

ADR-0009: Single-worker serialization and drain-by-model scheduling

Status: Accepted — 2026-05-23

Context

The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one worker model fast at a time; loading a different worker model is a 5-10s cold start. Running two large models concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism among worker models against a single target buys nothing and would reintroduce coordination logic.

The one exception is a small always-resident embedding model, which co-resides cheaply alongside the worker model and is served outside the queue entirely (ADR-0013). This ADR governs only the worker slot.

Decision

Worker-model concurrency against the target is 1. A single worker loop pulls the next job from the queue, ensures the right worker model is resident, executes, and records the result. (Embeddings are not jobs and never enter this loop — ADR-0013.)

Drain-by-model scheduling: before incurring a model swap, the worker finishes every queued job that targets the currently-resident model (observed via /api/ps, ADR-0007). Only when no job for the hot model remains does it select a job for a different model and pay the swap cost.

This is an ORDER BY (model != current_model), created_at style selection — a heuristic, not a scheduler. There is intentionally no priority system, fairness weighting, or capacity budgeting (those sank the predecessor; see ADR-0001).

Residency is pinned with Ollama keep_alive so the hot worker model isn't unloaded between closely-spaced jobs. OLLAMA_MAX_LOADED_MODELS=2 on the target holds two slots: the always-resident embedding model plus the rotating worker model (ADR-0013). Worker models still swap one-at-a-time within their single slot.

Consequences

  • Swap thrash is minimized without any complex scheduling.
  • A long run of same-model jobs can delay a different-model job — acceptable for a background box, and bounded by queue depth. If starvation ever becomes a real problem, that is a signal to reconsider, not to pre-build fairness.
  • Throughput is dominated by how well callers batch work by model.

Alternatives considered

  • FIFO with naive swapping. Correct but pays a cold start on every model change; wasteful when jobs interleave models. Rejected.
  • Priority/fair scheduling. Explicitly rejected as scope creep (ADR-0001).