docs: land prior ADR + prompt updates

Commit pre-existing, uncommitted working-tree changes that predate the license/public-readiness work. They were NOT authored in this session; they are flushed here only so they are not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013 (embedding bypass + two-slot residency, already referenced by CLAUDE.md), and the phase-0..3 prompt revisions plus prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:33:39 -04:00
parent 823c0b4ca8
commit 0526bada90
10 changed files with 276 additions and 98 deletions
@@ -40,8 +40,8 @@ only if a non-go-llm caller needs it.
 - "Set up the Mac as a go-llm target" needs zero provider changes — a thin
  constructor only (ADR-0011).
 - Preserves `think:false`, reliable tool calls, and lower latency.
- foreman must faithfully proxy native `/api/chat` semantics, including SSE
-  streaming (ADR-0012).
+- foreman must faithfully proxy native `/api/chat` semantics, including NDJSON
+  streaming (`application/x-ndjson`, not SSE; ADR-0012).

 ## Alternatives considered

@@ -22,7 +22,7 @@ that URL on every state transition.
  "state": "loading",
  "previous_state": "queued",
  "timestamp": "2026-05-23T12:00:00Z",
-  "model": "qwen3.6:35b",
+  "model": "qwen3:30b",
  "attempt": 1,
  "error": null,
  "result": null,
@@ -59,5 +59,6 @@ that URL on every state transition.

 - **Polling only.** Simpler for foreman, worse for callers; rejected since
  webhooks were an explicit requirement. (Polling is still available as fallback.)
- **WebSocket/SSE for state.** Heavier; SSE is reserved for token streaming on the
-  sync surface (ADR-0012), not job-state fan-out.
+- **WebSocket/streamed connection for state.** Heavier; token streaming on the
+  sync surface is NDJSON (ADR-0012), and job-state fan-out doesn't need a
+  persistent connection — discrete webhook POSTs suffice.
@@ -4,16 +4,22 @@

 ## Context

-The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one model fast
-at a time; loading a different model is a 5-10s cold start. Running two models
-concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism
-against a single target buys nothing and would reintroduce coordination logic.
+The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one *worker*
+model fast at a time; loading a different worker model is a 5-10s cold start.
+Running two large models concurrently on 32GB either OOMs or pages to a 5-10x
+slowdown. So parallelism among **worker** models against a single target buys
+nothing and would reintroduce coordination logic.
+
+The one exception is a small always-resident embedding model, which co-resides
+cheaply alongside the worker model and is served outside the queue entirely
+(ADR-0013). This ADR governs only the worker slot.

 ## Decision

-**Concurrency against the target is 1.** A single worker loop pulls the next job
-from the queue, ensures the right model is resident, executes, and records the
-result.
+**Worker-model concurrency against the target is 1.** A single worker loop pulls
+the next job from the queue, ensures the right worker model is resident, executes,
+and records the result. (Embeddings are not jobs and never enter this loop —
+ADR-0013.)

 **Drain-by-model scheduling:** before incurring a model swap, the worker finishes
 every queued job that targets the **currently-resident** model (observed via
@@ -25,9 +31,10 @@ heuristic, not a scheduler. There is intentionally **no** priority system,
 fairness weighting, or capacity budgeting (those sank the predecessor; see
 ADR-0001).

-Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
-between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
-to single-resident swap.
+Residency is pinned with Ollama `keep_alive` so the hot worker model isn't
+unloaded between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=2` on the target
+holds two slots: the always-resident embedding model plus the rotating worker
+model (ADR-0013). Worker models still swap one-at-a-time within their single slot.

 ## Consequences

@@ -13,11 +13,13 @@ different granularity than token streaming.
 ## Decision

 - **Sync passthrough: support streaming.** When a `/api/chat` request sets
-  `stream: true`, foreman streams the target's token deltas back to the caller
-  (SSE/chunked, matching Ollama's native streaming). A streamed job still moves
-  through the queue; streaming begins once the job reaches `working`, so a job
-  waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when
-  its turn comes. go-llm's `Stream()` works against foreman unchanged.
+  `stream: true`, foreman streams the target's token deltas back to the caller as
+  **NDJSON** (`application/x-ndjson`, newline-delimited JSON chunks — Ollama's
+  native streaming wire format, which go-llm reads with a `bufio.Scanner`). This
+  is *not* SSE/`text/event-stream`. A streamed job still moves through the queue;
+  streaming begins once the job reaches `working`, so a job waiting behind the
+  drain-by-model queue (ADR-0009) simply starts streaming when its turn comes.
+  go-llm's `Stream()` works against foreman unchanged.
 - **Async `/jobs` surface: no token streaming in v1.** Webhooks carry coarse state
  transitions (ADR-0005) and the final result/artifacts, not per-token deltas.
  Token-level streaming over a fire-and-forget webhook job is deliberately
@@ -0,0 +1,52 @@
+# ADR-0013: Two-slot residency and embedding bypass
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+The target keeps **two** models resident (`OLLAMA_MAX_LOADED_MODELS=2`): a small,
+always-resident **embedding model** (e.g. `nomic-embed-text` or
+`qwen3-embedding`, operator-swappable) and one rotating **worker model** that
+chat jobs queue against (ADR-0009). The embedder is tiny (~0.3–0.6 GB) and
+co-resides cheaply with a ~20 GB worker model on 32 GB.
+
+Embeddings are latency-sensitive and high-volume — a single backfill may fire
+thousands of `/api/embed` calls. Forcing them through the serialized worker queue
+(ADR-0009) would make them wait behind 20 GB chat jobs and swap thrash for no
+reason, since the embedder is always loaded and never needs swapping.
+
+## Decision
+
+**The target runs exactly two resident models, and embeddings bypass the queue.**
+
+- `OLLAMA_MAX_LOADED_MODELS=2`: slot 1 is the always-resident embedder (pinned
+  with `keep_alive: -1`); slot 2 is the rotating worker model managed by the
+  single-worker drain-by-model loop (ADR-0009).
+- **Routing rule:** only `/api/chat` and `POST /jobs` are serialized through the
+  worker queue. `/api/embed` (and the `/api/embeddings` alias) are proxied
+  **directly and concurrently** to the target, never touching the queue, the
+  worker loop, or the job store. Concurrent embedding requests are allowed; they
+  hit the always-resident embedder and do not contend with worker-model swaps.
+- The embedding model name is configurable (`FOREMAN_EMBED_MODEL`); foreman warms
+  it on startup and on reconnect after the target was unreachable, so it stays in
+  slot 1.
+
+## Consequences
+
+- Embeddings are fast and concurrent regardless of worker-queue depth — the right
+  behavior for indexing/RAG backfills.
+- The "concurrency" of foreman is precisely this: embedder ∥ worker. Worker jobs
+  among themselves remain strictly serial (ADR-0009). There is no other
+  parallelism, and none should be added.
+- foreman must distinguish embedding routes from chat routes at the HTTP layer and
+  keep them on separate code paths.
+- If the operator misconfigures the target to `OLLAMA_MAX_LOADED_MODELS=1`,
+  embeddings and worker jobs will fight for the single slot and thrash; foreman
+  should log a startup warning if it observes only one slot via `/api/ps` under load.
+
+## Alternatives considered
+
+- **Embeddings through the queue.** Simple uniformity, but serializes a
+  high-volume concurrent workload behind chat jobs for no benefit. Rejected.
+- **A dedicated second daemon for embeddings.** Violates one-daemon-per-target
+  (ADR-0001) and is unnecessary — Ollama already serves both from one endpoint.
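
The routing rule in ADR-0013's Decision section can be sketched as a small classifier. This is a minimal illustration, not foreman's actual code: the `Route` type, its constants, and `classify` are hypothetical names, and treating `/api/chat`, `/api/embed`, and `/api/embeddings` as `POST`-only is an assumption based on Ollama's HTTP API (the ADR states the method only for `POST /jobs`).

```go
package main

import "net/http"

// Route is a hypothetical label for how a request is handled;
// the name is illustrative and not taken from foreman.
type Route int

const (
	RouteUnknown Route = iota // not covered by the ADR-0013 routing rule
	RouteQueue                // serialized through the single-worker queue (ADR-0009)
	RouteBypass               // proxied directly and concurrently to the target
)

// classify applies the ADR-0013 routing rule to a method and URL path.
// Assumption: all four endpoints are POST-only.
func classify(method, path string) Route {
	if method != http.MethodPost {
		return RouteUnknown
	}
	switch path {
	case "/api/chat", "/jobs":
		// Chat and async jobs contend for the rotating worker slot.
		return RouteQueue
	case "/api/embed", "/api/embeddings":
		// Embeddings hit the always-resident embedder and skip the queue.
		return RouteBypass
	default:
		return RouteUnknown
	}
}
```

An HTTP front end could call `classify(r.Method, r.URL.Path)` once per request and dispatch to separate handlers, which keeps the embedding and chat code paths apart as the Consequences section requires.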