docs: land prior ADR + prompt updates

Commit pre-existing uncommitted working-tree changes that predate the
license/public-readiness work — NOT authored in this session, just flushed so
they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013
(embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and
the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:33:39 -04:00
parent 823c0b4ca8
commit 0526bada90
10 changed files with 276 additions and 98 deletions
+2 -2
@@ -40,8 +40,8 @@ only if a non-go-llm caller needs it.
- "Set up the Mac as a go-llm target" needs zero provider changes — a thin
constructor only (ADR-0011).
- Preserves `think:false`, reliable tool calls, and lower latency.
- foreman must faithfully proxy native `/api/chat` semantics, including SSE
streaming (ADR-0012).
- foreman must faithfully proxy native `/api/chat` semantics, including NDJSON
streaming (`application/x-ndjson`, not SSE; ADR-0012).
## Alternatives considered
+4 -3
@@ -22,7 +22,7 @@ that URL on every state transition.
"state": "loading",
"previous_state": "queued",
"timestamp": "2026-05-23T12:00:00Z",
"model": "qwen3.6:35b",
"model": "qwen3:30b",
"attempt": 1,
"error": null,
"result": null,
@@ -59,5 +59,6 @@ that URL on every state transition.
- **Polling only.** Simpler for foreman, worse for callers; rejected since
webhooks were an explicit requirement. (Polling is still available as fallback.)
- **WebSocket/SSE for state.** Heavier; SSE is reserved for token streaming on the
sync surface (ADR-0012), not job-state fan-out.
- **WebSocket/streamed connection for state.** Heavier; token streaming on the
sync surface is NDJSON (ADR-0012), and job-state fan-out doesn't need a
persistent connection — discrete webhook POSTs suffice.
+17 -10
@@ -4,16 +4,22 @@
## Context
The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one model fast
at a time; loading a different model is a 5-10s cold start. Running two models
concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism
against a single target buys nothing and would reintroduce coordination logic.
The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one *worker*
model fast at a time; loading a different worker model is a 5-10s cold start.
Running two large models concurrently on 32GB either OOMs or pages to a 5-10x
slowdown. So parallelism among **worker** models against a single target buys
nothing and would reintroduce coordination logic.
The one exception is a small always-resident embedding model, which co-resides
cheaply alongside the worker model and is served outside the queue entirely
(ADR-0013). This ADR governs only the worker slot.
## Decision
**Concurrency against the target is 1.** A single worker loop pulls the next job
from the queue, ensures the right model is resident, executes, and records the
result.
**Worker-model concurrency against the target is 1.** A single worker loop pulls
the next job from the queue, ensures the right worker model is resident, executes,
and records the result. (Embeddings are not jobs and never enter this loop —
ADR-0013.)
**Drain-by-model scheduling:** before incurring a model swap, the worker finishes
every queued job that targets the **currently-resident** model (observed via
@@ -25,9 +31,10 @@ heuristic, not a scheduler. There is intentionally **no** priority system,
fairness weighting, or capacity budgeting (those sank the predecessor; see
ADR-0001).
Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
to single-resident swap.
Residency is pinned with Ollama `keep_alive` so the hot worker model isn't
unloaded between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=2` on the target
holds two slots: the always-resident embedding model plus the rotating worker
model (ADR-0013). Worker models still swap one-at-a-time within their single slot.
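The drain-by-model rule described earlier in this ADR can be sketched as a selection function over the queue. This is an illustrative sketch only: the `job` type and `nextJob` function are hypothetical, and it assumes the queue is ordered oldest-first.

```go
package main

// job is a hypothetical queue entry; only the model name matters here.
type job struct {
	ID    string
	Model string
}

// nextJob returns the index of the job to run next, or -1 if the queue is
// empty. It prefers the oldest job that targets the resident model and
// falls back to the oldest job overall, which implies a model swap.
func nextJob(queue []job, resident string) int {
	for i, j := range queue {
		if j.Model == resident {
			return i
		}
	}
	if len(queue) > 0 {
		return 0
	}
	return -1
}
```

With a queue of jobs for `m1`, `m2`, `m1` and `m2` resident, the function picks the second job; a swap is only incurred once no queued job matches the resident model.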
## Consequences
+7 -5
@@ -13,11 +13,13 @@ different granularity than token streaming.
## Decision
- **Sync passthrough: support streaming.** When a `/api/chat` request sets
`stream: true`, foreman streams the target's token deltas back to the caller
(SSE/chunked, matching Ollama's native streaming). A streamed job still moves
through the queue; streaming begins once the job reaches `working`, so a job
waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when
its turn comes. go-llm's `Stream()` works against foreman unchanged.
`stream: true`, foreman streams the target's token deltas back to the caller as
**NDJSON** (`application/x-ndjson`, newline-delimited JSON chunks — Ollama's
native streaming wire format, which go-llm reads with a `bufio.Scanner`). This
is *not* SSE/`text/event-stream`. A streamed job still moves through the queue;
streaming begins once the job reaches `working`, so a job waiting behind the
drain-by-model queue (ADR-0009) simply starts streaming when its turn comes.
go-llm's `Stream()` works against foreman unchanged.
- **Async `/jobs` surface: no token streaming in v1.** Webhooks carry coarse state
transitions (ADR-0005) and the final result/artifacts, not per-token deltas.
Token-level streaming over a fire-and-forget webhook job is deliberately
@@ -0,0 +1,52 @@
# ADR-0013: Two-slot residency and embedding bypass
**Status:** Accepted — 2026-05-23
## Context
The target keeps **two** models resident (`OLLAMA_MAX_LOADED_MODELS=2`): a small,
always-resident **embedding model** (e.g. `nomic-embed-text` or
`qwen3-embedding`, operator-swappable) and one rotating **worker model** that
chat jobs queue against (ADR-0009). The embedder is tiny (~0.3-0.6 GB) and
co-resides cheaply with a ~20 GB worker model on 32 GB.
Embeddings are latency-sensitive and high-volume: a single backfill may fire
thousands of `/api/embed` calls. Forcing them through the serialized worker queue
(ADR-0009) would make them wait behind 20 GB chat jobs and worker-model swaps for
no reason, since the embedder is always loaded and never needs swapping.
## Decision
**The target runs exactly two resident models, and embeddings bypass the queue.**
- `OLLAMA_MAX_LOADED_MODELS=2`: slot 1 is the always-resident embedder (pinned
with `keep_alive: -1`); slot 2 is the rotating worker model managed by the
single-worker drain-by-model loop (ADR-0009).
- **Routing rule:** only `/api/chat` and `POST /jobs` are serialized through the
worker queue. `/api/embed` (and the `/api/embeddings` alias) are proxied
**directly and concurrently** to the target, never touching the queue, the
worker loop, or the job store. Concurrent embedding requests are allowed; they
hit the always-resident embedder and do not contend with worker-model swaps.
- The embedding model name is configurable (`FOREMAN_EMBED_MODEL`); foreman warms
it on startup and on reconnect after the target was unreachable, so it stays in
slot 1.
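The routing rule above amounts to a small classification step at the HTTP layer. The following is a sketch under assumptions: the `route` type and `classify` function are hypothetical, and any route not named in the rule (for example, job-status reads) is deliberately left as unknown rather than guessed.

```go
package main

// route names the code path a request should take.
type route int

const (
	routeUnknown route = iota // not covered by the routing rule
	routeBypass               // proxied directly and concurrently to the target
	routeQueued               // serialized through the worker queue
)

// classify maps a method and path to a route, following the rule that only
// /api/chat and POST /jobs are queued while the embedding endpoints bypass.
func classify(method, path string) route {
	switch path {
	case "/api/embed", "/api/embeddings":
		return routeBypass
	case "/api/chat":
		return routeQueued
	case "/jobs":
		if method == "POST" {
			return routeQueued
		}
	}
	return routeUnknown
}
```

Keeping the decision in one pure function makes it straightforward to keep the two handlers on separate code paths and to test the rule without a running target.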
## Consequences
- Embeddings are fast and concurrent regardless of worker-queue depth — the right
behavior for indexing/RAG backfills.
- The "concurrency" of foreman is precisely this: embedder ∥ worker. Worker jobs
among themselves remain strictly serial (ADR-0009). There is no other
parallelism, and none should be added.
- foreman must distinguish embedding routes from chat routes at the HTTP layer and
keep them on separate code paths.
- If the operator misconfigures the target to `OLLAMA_MAX_LOADED_MODELS=1`, embeddings
and worker jobs will fight for the single slot and thrash; foreman should log a
warning if it observes only one resident model via `/api/ps` under load.
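One way the `/api/ps` check might look, as a hedged sketch: it assumes the response body is a JSON object with a `models` array whose entries carry a `name` field, and that the configured embedder name matches that field exactly (including any tag). `residentModels` and `embedderResident` are hypothetical helper names.

```go
package main

import "encoding/json"

// psResponse models only the part of the /api/ps body used here.
type psResponse struct {
	Models []struct {
		Name string `json:"name"`
	} `json:"models"`
}

// residentModels parses an /api/ps response body and returns the names of
// the models it lists as loaded.
func residentModels(body []byte) ([]string, error) {
	var ps psResponse
	if err := json.Unmarshal(body, &ps); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(ps.Models))
	for _, m := range ps.Models {
		names = append(names, m.Name)
	}
	return names, nil
}

// embedderResident reports whether embedModel appears in names.
// Assumption: exact string match, so "nomic-embed-text" will not match
// "nomic-embed-text:latest".
func embedderResident(names []string, embedModel string) bool {
	for _, n := range names {
		if n == embedModel {
			return true
		}
	}
	return false
}
```

A caller could log the warning when a worker model is listed but the embedder is not, since that pattern suggests the two are competing for a single slot.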
## Alternatives considered
- **Embeddings through the queue.** Simple uniformity, but serializes a
high-volume concurrent workload behind chat jobs for no benefit. Rejected.
- **A dedicated second daemon for embeddings.** Violates one-daemon-per-target
(ADR-0001) and is unnecessary — Ollama already serves both from one endpoint.