docs: land prior ADR + prompt updates
Commit pre-existing uncommitted working-tree changes that predate the license/public-readiness work. They were NOT authored in this session, just flushed so they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013 (embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -40,8 +40,8 @@ only if a non-go-llm caller needs it.
 - "Set up the Mac as a go-llm target" needs zero provider changes — a thin
   constructor only (ADR-0011).
 - Preserves `think:false`, reliable tool calls, and lower latency.
-- foreman must faithfully proxy native `/api/chat` semantics, including SSE
-  streaming (ADR-0012).
+- foreman must faithfully proxy native `/api/chat` semantics, including NDJSON
+  streaming (`application/x-ndjson`, not SSE; ADR-0012).

 ## Alternatives considered

@@ -22,7 +22,7 @@ that URL on every state transition.
   "state": "loading",
   "previous_state": "queued",
   "timestamp": "2026-05-23T12:00:00Z",
-  "model": "qwen3.6:35b",
+  "model": "qwen3:30b",
   "attempt": 1,
   "error": null,
   "result": null,
@@ -59,5 +59,6 @@ that URL on every state transition.

 - **Polling only.** Simpler for foreman, worse for callers; rejected since
   webhooks were an explicit requirement. (Polling is still available as fallback.)
-- **WebSocket/SSE for state.** Heavier; SSE is reserved for token streaming on the
-  sync surface (ADR-0012), not job-state fan-out.
+- **WebSocket/streamed connection for state.** Heavier; token streaming on the
+  sync surface is NDJSON (ADR-0012), and job-state fan-out doesn't need a
+  persistent connection — discrete webhook POSTs suffice.

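Note on the webhook payload in the hunk above: a receiver could decode it along these lines. This is a minimal sketch covering only the fields visible in the hunk; the type name `StateEvent`, the helper `parseStateEvent`, and the use of `json.RawMessage` for `error` and `result` (whose shapes the hunk does not show) are assumptions, not part of the ADR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// StateEvent mirrors only the payload fields visible in the diff hunk.
// The Go names are hypothetical; error/result stay raw because the
// hunk only shows them as null.
type StateEvent struct {
	State         string          `json:"state"`
	PreviousState string          `json:"previous_state"`
	Timestamp     time.Time       `json:"timestamp"`
	Model         string          `json:"model"`
	Attempt       int             `json:"attempt"`
	Error         json.RawMessage `json:"error"`
	Result        json.RawMessage `json:"result"`
}

// parseStateEvent decodes one webhook body into a StateEvent.
func parseStateEvent(body []byte) (StateEvent, error) {
	var ev StateEvent
	err := json.Unmarshal(body, &ev)
	return ev, err
}

func main() {
	body := []byte(`{"state":"loading","previous_state":"queued",` +
		`"timestamp":"2026-05-23T12:00:00Z","model":"qwen3:30b",` +
		`"attempt":1,"error":null,"result":null}`)
	ev, err := parseStateEvent(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(ev.PreviousState, "->", ev.State, ev.Model, ev.Attempt)
}
```

Unknown fields in a real payload are ignored by `encoding/json`, so the sketch tolerates fields that sit above this hunk in the ADR.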
@@ -4,16 +4,22 @@

 ## Context

-The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one model fast
-at a time; loading a different model is a 5-10s cold start. Running two models
-concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism
-against a single target buys nothing and would reintroduce coordination logic.
+The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one *worker*
+model fast at a time; loading a different worker model is a 5-10s cold start.
+Running two large models concurrently on 32GB either OOMs or pages to a 5-10x
+slowdown. So parallelism among **worker** models against a single target buys
+nothing and would reintroduce coordination logic.
+
+The one exception is a small always-resident embedding model, which co-resides
+cheaply alongside the worker model and is served outside the queue entirely
+(ADR-0013). This ADR governs only the worker slot.

 ## Decision

-**Concurrency against the target is 1.** A single worker loop pulls the next job
-from the queue, ensures the right model is resident, executes, and records the
-result.
+**Worker-model concurrency against the target is 1.** A single worker loop pulls
+the next job from the queue, ensures the right worker model is resident, executes,
+and records the result. (Embeddings are not jobs and never enter this loop —
+ADR-0013.)

 **Drain-by-model scheduling:** before incurring a model swap, the worker finishes
 every queued job that targets the **currently-resident** model (observed via
@@ -25,9 +31,10 @@ heuristic, not a scheduler. There is intentionally **no** priority system,
 fairness weighting, or capacity budgeting (those sank the predecessor; see
 ADR-0001).

-Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
-between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
-to single-resident swap.
+Residency is pinned with Ollama `keep_alive` so the hot worker model isn't
+unloaded between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=2` on the target
+holds two slots: the always-resident embedding model plus the rotating worker
+model (ADR-0013). Worker models still swap one-at-a-time within their single slot.

 ## Consequences

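Note on the drain-by-model rule in the hunks above: the selection step amounts to "prefer the oldest queued job for the resident worker model, and swap only when none is left". The following is a minimal sketch of that idea, not foreman's code. The `Job` type and the `pickNext` helper are hypothetical, and the resident model is passed in as a plain string rather than observed from the target.

```go
package main

import "fmt"

// Job is a hypothetical queue entry; only the model name matters here.
type Job struct {
	ID    string
	Model string
}

// pickNext returns the index of the job to run next: the oldest queued job
// for the resident worker model if one exists, otherwise the oldest job
// overall (which forces a swap). It returns -1 for an empty queue.
func pickNext(queue []Job, resident string) int {
	for i, j := range queue {
		if j.Model == resident {
			return i
		}
	}
	if len(queue) > 0 {
		return 0
	}
	return -1
}

func main() {
	queue := []Job{
		{ID: "a", Model: "model-x"},
		{ID: "b", Model: "model-y"},
		{ID: "c", Model: "model-x"},
		{ID: "d", Model: "model-y"},
	}
	resident := "model-y"

	// Single worker loop: strictly one job at a time.
	for len(queue) > 0 {
		i := pickNext(queue, resident)
		job := queue[i]
		queue = append(queue[:i], queue[i+1:]...)
		if job.Model != resident {
			fmt.Println("swap to", job.Model)
			resident = job.Model
		}
		fmt.Println("run", job.ID)
	}
}
```

With `model-y` resident, the loop runs `b` and `d` before paying for a single swap to `model-x`, then runs `a` and `c`. There is no priority or fairness term, matching the ADR's stated intent to keep this a heuristic rather than a scheduler.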
@@ -13,11 +13,13 @@ different granularity than token streaming.
 ## Decision

 - **Sync passthrough: support streaming.** When a `/api/chat` request sets
-  `stream: true`, foreman streams the target's token deltas back to the caller
-  (SSE/chunked, matching Ollama's native streaming). A streamed job still moves
-  through the queue; streaming begins once the job reaches `working`, so a job
-  waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when
-  its turn comes. go-llm's `Stream()` works against foreman unchanged.
+  `stream: true`, foreman streams the target's token deltas back to the caller as
+  **NDJSON** (`application/x-ndjson`, newline-delimited JSON chunks — Ollama's
+  native streaming wire format, which go-llm reads with a `bufio.Scanner`). This
+  is *not* SSE/`text/event-stream`. A streamed job still moves through the queue;
+  streaming begins once the job reaches `working`, so a job waiting behind the
+  drain-by-model queue (ADR-0009) simply starts streaming when its turn comes.
+  go-llm's `Stream()` works against foreman unchanged.
 - **Async `/jobs` surface: no token streaming in v1.** Webhooks carry coarse state
   transitions (ADR-0005) and the final result/artifacts, not per-token deltas.
   Token-level streaming over a fire-and-forget webhook job is deliberately

@@ -0,0 +1,52 @@
+# ADR-0013: Two-slot residency and embedding bypass
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+The target keeps **two** models resident (`OLLAMA_MAX_LOADED_MODELS=2`): a small,
+always-resident **embedding model** (e.g. `nomic-embed-text` or
+`qwen3-embedding`, operator-swappable) and one rotating **worker model** that
+chat jobs queue against (ADR-0009). The embedder is tiny (~0.3–0.6 GB) and
+co-resides cheaply with a ~20 GB worker model on 32 GB.
+
+Embeddings are latency-sensitive and high-volume — a single backfill may fire
+thousands of `/api/embed` calls. Forcing them through the serialized worker queue
+(ADR-0009) would make them wait behind 20 GB chat jobs and swap thrash for no
+reason, since the embedder is always loaded and never needs swapping.
+
+## Decision
+
+**The target runs exactly two resident models, and embeddings bypass the queue.**
+
+- `OLLAMA_MAX_LOADED_MODELS=2`: slot 1 is the always-resident embedder (pinned
+  with `keep_alive: -1`); slot 2 is the rotating worker model managed by the
+  single-worker drain-by-model loop (ADR-0009).
+- **Routing rule:** only `/api/chat` and `POST /jobs` are serialized through the
+  worker queue. `/api/embed` (and the `/api/embeddings` alias) are proxied
+  **directly and concurrently** to the target, never touching the queue, the
+  worker loop, or the job store. Concurrent embedding requests are allowed; they
+  hit the always-resident embedder and do not contend with worker-model swaps.
+- The embedding model name is configurable (`FOREMAN_EMBED_MODEL`); foreman warms
+  it on startup and on reconnect after the target was unreachable, so it stays in
+  slot 1.
+
+## Consequences
+
+- Embeddings are fast and concurrent regardless of worker-queue depth — the right
+  behavior for indexing/RAG backfills.
+- The "concurrency" of foreman is precisely this: embedder ∥ worker. Worker jobs
+  among themselves remain strictly serial (ADR-0009). There is no other
+  parallelism, and none should be added.
+- foreman must distinguish embedding routes from chat routes at the HTTP layer and
+  keep them on separate code paths.
+- If the operator misconfigures the target to `MAX_LOADED_MODELS=1`, embeddings
+  and worker jobs will fight for the single slot and thrash; foreman should log a
+  startup warning if it observes only one slot via `/api/ps` under load.
+
+## Alternatives considered
+
+- **Embeddings through the queue.** Simple uniformity, but serializes a
+  high-volume concurrent workload behind chat jobs for no benefit. Rejected.
+- **A dedicated second daemon for embeddings.** Violates one-daemon-per-target
+  (ADR-0001) and is unnecessary — Ollama already serves both from one endpoint.
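Note on the routing rule in ADR-0013 above: the split between queued and direct traffic can be decided from method and path alone. The sketch below illustrates that classification. The `route` type, its constants, and the `classify` helper are hypothetical names, and anything the ADR does not mention falls into a catch-all bucket whose handling is left open here.

```go
package main

import (
	"fmt"
	"net/http"
)

// route says which code path a request takes in this sketch.
type route int

const (
	routeOther  route = iota // not covered by the ADR's routing rule
	routeQueued              // serialized through the worker queue
	routeDirect              // proxied straight to the target, concurrently
)

// classify applies the routing rule: /api/chat and POST /jobs are queued;
// /api/embed and its /api/embeddings alias bypass the queue entirely.
func classify(method, path string) route {
	switch {
	case path == "/api/embed" || path == "/api/embeddings":
		return routeDirect
	case path == "/api/chat":
		return routeQueued
	case method == http.MethodPost && path == "/jobs":
		return routeQueued
	}
	return routeOther
}

func main() {
	requests := [][2]string{
		{"POST", "/api/chat"},
		{"POST", "/jobs"},
		{"POST", "/api/embed"},
		{"POST", "/api/embeddings"},
		{"GET", "/jobs"},
	}
	names := map[route]string{
		routeOther:  "other",
		routeQueued: "queued",
		routeDirect: "direct",
	}
	for _, r := range requests {
		fmt.Println(r[0], r[1], "=>", names[classify(r[0], r[1])])
	}
}
```

Keeping the decision in one small function makes it easy to keep the embedding and chat paths on separate handlers, which is the consequence the ADR calls out.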