docs: land prior ADR + prompt updates

Commit pre-existing uncommitted working-tree changes that predate the
license/public-readiness work — NOT authored in this session, just flushed so
they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013
(embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and
the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:33:39 -04:00
parent 823c0b4ca8
commit 0526bada90
10 changed files with 276 additions and 98 deletions
+2 -2
@@ -40,8 +40,8 @@ only if a non-go-llm caller needs it.
 - "Set up the Mac as a go-llm target" needs zero provider changes — a thin
 constructor only (ADR-0011).
 - Preserves `think:false`, reliable tool calls, and lower latency.
-- foreman must faithfully proxy native `/api/chat` semantics, including SSE
-streaming (ADR-0012).
+- foreman must faithfully proxy native `/api/chat` semantics, including NDJSON
+streaming (`application/x-ndjson`, not SSE; ADR-0012).
 ## Alternatives considered
+4 -3
@@ -22,7 +22,7 @@ that URL on every state transition.
 "state": "loading",
 "previous_state": "queued",
 "timestamp": "2026-05-23T12:00:00Z",
-"model": "qwen3.6:35b",
+"model": "qwen3:30b",
 "attempt": 1,
 "error": null,
 "result": null,
@@ -59,5 +59,6 @@ that URL on every state transition.
 - **Polling only.** Simpler for foreman, worse for callers; rejected since
 webhooks were an explicit requirement. (Polling is still available as fallback.)
-- **WebSocket/SSE for state.** Heavier; SSE is reserved for token streaming on the
-sync surface (ADR-0012), not job-state fan-out.
+- **WebSocket/streamed connection for state.** Heavier; token streaming on the
+sync surface is NDJSON (ADR-0012), and job-state fan-out doesn't need a
+persistent connection — discrete webhook POSTs suffice.
+17 -10
@@ -4,16 +4,22 @@
 ## Context
-The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one model fast
-at a time; loading a different model is a 5-10s cold start. Running two models
-concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism
-against a single target buys nothing and would reintroduce coordination logic.
+The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one *worker*
+model fast at a time; loading a different worker model is a 5-10s cold start.
+Running two large models concurrently on 32GB either OOMs or pages to a 5-10x
+slowdown. So parallelism among **worker** models against a single target buys
+nothing and would reintroduce coordination logic.
+The one exception is a small always-resident embedding model, which co-resides
+cheaply alongside the worker model and is served outside the queue entirely
+(ADR-0013). This ADR governs only the worker slot.
 ## Decision
-**Concurrency against the target is 1.** A single worker loop pulls the next job
-from the queue, ensures the right model is resident, executes, and records the
-result.
+**Worker-model concurrency against the target is 1.** A single worker loop pulls
+the next job from the queue, ensures the right worker model is resident, executes,
+and records the result. (Embeddings are not jobs and never enter this loop —
+ADR-0013.)
 **Drain-by-model scheduling:** before incurring a model swap, the worker finishes
 every queued job that targets the **currently-resident** model (observed via
@@ -25,9 +31,10 @@ heuristic, not a scheduler. There is intentionally **no** priority system,
 fairness weighting, or capacity budgeting (those sank the predecessor; see
 ADR-0001).
-Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
-between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
-to single-resident swap.
+Residency is pinned with Ollama `keep_alive` so the hot worker model isn't
+unloaded between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=2` on the target
+holds two slots: the always-resident embedding model plus the rotating worker
+model (ADR-0013). Worker models still swap one-at-a-time within their single slot.
 ## Consequences
+7 -5
@@ -13,11 +13,13 @@ different granularity than token streaming.
 ## Decision
 - **Sync passthrough: support streaming.** When a `/api/chat` request sets
-`stream: true`, foreman streams the target's token deltas back to the caller
-(SSE/chunked, matching Ollama's native streaming). A streamed job still moves
-through the queue; streaming begins once the job reaches `working`, so a job
-waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when
-its turn comes. go-llm's `Stream()` works against foreman unchanged.
+`stream: true`, foreman streams the target's token deltas back to the caller as
+**NDJSON** (`application/x-ndjson`, newline-delimited JSON chunks — Ollama's
+native streaming wire format, which go-llm reads with a `bufio.Scanner`). This
+is *not* SSE/`text/event-stream`. A streamed job still moves through the queue;
+streaming begins once the job reaches `working`, so a job waiting behind the
+drain-by-model queue (ADR-0009) simply starts streaming when its turn comes.
+go-llm's `Stream()` works against foreman unchanged.
 - **Async `/jobs` surface: no token streaming in v1.** Webhooks carry coarse state
 transitions (ADR-0005) and the final result/artifacts, not per-token deltas.
 Token-level streaming over a fire-and-forget webhook job is deliberately
@@ -0,0 +1,52 @@
# ADR-0013: Two-slot residency and embedding bypass
**Status:** Accepted — 2026-05-23
## Context
The target keeps **two** models resident (`OLLAMA_MAX_LOADED_MODELS=2`): a small,
always-resident **embedding model** (e.g. `nomic-embed-text` or
`qwen3-embedding`, operator-swappable) and one rotating **worker model** that
chat jobs queue against (ADR-0009). The embedder is tiny (~0.3–0.6 GB) and
co-resides cheaply with a ~20 GB worker model on 32 GB.
Embeddings are latency-sensitive and high-volume — a single backfill may fire
thousands of `/api/embed` calls. Forcing them through the serialized worker queue
(ADR-0009) would make them wait behind 20 GB chat jobs and swap thrash for no
reason, since the embedder is always loaded and never needs swapping.
## Decision
**The target runs exactly two resident models, and embeddings bypass the queue.**
- `OLLAMA_MAX_LOADED_MODELS=2`: slot 1 is the always-resident embedder (pinned
with `keep_alive: -1`); slot 2 is the rotating worker model managed by the
single-worker drain-by-model loop (ADR-0009).
- **Routing rule:** only `/api/chat` and `POST /jobs` are serialized through the
worker queue. `/api/embed` (and the `/api/embeddings` alias) are proxied
**directly and concurrently** to the target, never touching the queue, the
worker loop, or the job store. Concurrent embedding requests are allowed; they
hit the always-resident embedder and do not contend with worker-model swaps.
- The embedding model name is configurable (`FOREMAN_EMBED_MODEL`); foreman warms
it on startup and on reconnect after the target was unreachable, so it stays in
slot 1.
## Consequences
- Embeddings are fast and concurrent regardless of worker-queue depth — the right
behavior for indexing/RAG backfills.
- The "concurrency" of foreman is precisely this: embedder ∥ worker. Worker jobs
among themselves remain strictly serial (ADR-0009). There is no other
parallelism, and none should be added.
- foreman must distinguish embedding routes from chat routes at the HTTP layer and
keep them on separate code paths.
- If the operator misconfigures the target to `MAX_LOADED_MODELS=1`, embeddings
and worker jobs will fight for the single slot and thrash; foreman should log a
startup warning if it observes only one slot via `/api/ps` under load.
## Alternatives considered
- **Embeddings through the queue.** Simple uniformity, but serializes a
high-volume concurrent workload behind chat jobs for no benefit. Rejected.
- **A dedicated second daemon for embeddings.** Violates one-daemon-per-target
(ADR-0001) and is unnecessary — Ollama already serves both from one endpoint.
+61
@@ -0,0 +1,61 @@
# foreman build prompts
This directory drives the autonomous Claude Code build of foreman.
## How to run
Start with **`phase-0-kickoff.md`** — it is the master driver. Paste it (or the
command below) into Claude Code from the repo root. It reads `CLAUDE.md`, the
ADRs in `docs/adr/`, and the `go-llm` / `steveternet` sources, then runs phases
1 → 6 **autonomously** to a finished, deployable foreman. You do **not** paste the
individual phase files — the kickoff reads them as it goes.
Kickoff command:
```
Read and follow prompts/phase-0-kickoff.md. The per-phase specs it references
(prompts/phase-1.md … phase-6.md) are in this same prompts/ directory. This is a
fully autonomous run: execute all six phases in order to a finished, working
deliverable without pausing between them. Honor docs/adr/ (note the new 0013) and
CLAUDE.md as source of truth. For the two cross-repo changes (llm.Foreman() in
steve/go-llm and the docker-compose in steve/steveternet), open a branch and PR
on each for my review — do not commit to their main. When done, report what each
phase built, the PR links, any ADRs you added, and a smoke-test checklist.
```
## What each phase produces
1. `phase-1` — scaffold, config, SQLite store, health, CI, Dockerfile.
2. `phase-2` — Ollama client + model poller + native passthrough + embedding
bypass. The Mac is usable as a `go-llm` target after this.
3. `phase-3` — durable queue + single worker + drain-by-model. **M0 complete.**
4. `phase-4` — async `/jobs` + job IDs + state webhooks + artifacts. The headline
queue-and-webhooks capability.
5. `phase-5` — Go client package (sync facade) + `llm.Foreman()` in `go-llm`.
6. `phase-6` — deploy: steveternet compose + Traefik, `.env.example`, deploy
docs, model-pull script.
## Conventions during the run
- Each phase must pass the gates before continuing: `go build ./...`,
`go vet ./...`, `go test -race -count=1 ./...`, and `go mod tidy` +
`git diff --exit-code go.mod go.sum`.
- Commits go to foreman's `main` with conventional-commit messages; `progress.md`
gets a dated entry each phase.
- A decision not covered by `CLAUDE.md` or an ADR → record a new ADR
(append-only, next number after 0013) and continue.
## Cross-repo caveat (phases 5 & 6)
The `llm.Foreman()` constructor lives in **steve/go-llm** and the deploy compose
in **steve/steveternet**. Those changes go on a **branch + PR for review — never
their main.**
## Known caveat (phase 6)
Phase 6 mirrors the Traefik/compose conventions from sibling services
(`kalimdor/orgrimmar/warhol-queue`, `ratchet`, `mort`) read via the gitea MCP.
Those reads were intermittently erroring during planning. **If phase 6 can't read
them, paste a sibling `docker-compose.yml` (e.g. ratchet's) straight into the
session** so it can mirror the network name, entrypoint, certresolver, and
router/service labels rather than inventing them.
+94 -61
@@ -1,75 +1,108 @@
-# phase-0-kickoff.md — foreman build kickoff
+# phase-0-kickoff.md — foreman autonomous build
-You are building **foreman**, a Go daemon that fronts one Ollama target and turns
-it into a queued, observable, OpenAI/Ollama-compatible job endpoint. This is a
-deliberately pared-down restart of a system (`peon-overseer`) that died of scope
-creep. Restraint is a feature, not a limitation.
+You are building **foreman** end to end, in **one autonomous run**. Execute all
+six phases in order (1 → 2 → 3 → 4 → 5 → 6) and do not stop between them. The run
+ends when foreman is a working, deployable deliverable. Do not wait for my
+approval at phase boundaries — keep going until done or genuinely blocked.
+foreman is a Go daemon that fronts one Ollama target and turns it into a queued,
+observable, Ollama-compatible job endpoint. It is a deliberately pared-down
+restart of a system (`peon-overseer`) that died of scope creep. Restraint is a
+feature: if a task seems to need distributed dispatch, leases, fair queueing,
+capacity budgets, an auth framework/SSO, a GUI, or multi-target support — stop,
+because that means the design is being violated.
 ## Read these first (authoritative, in order)
-1. `CLAUDE.md` in this repo — the operating manual. It is the source of truth for
-architecture, stack, conventions, and the **out-of-scope guardrails**.
-2. `docs/adr/README.md` then every `docs/adr/00NN-*.md`. The ADRs are the *why*.
-Do not relitigate them; if you believe one is wrong, say so and propose a new
-superseding ADR rather than silently diverging.
-3. Via the **gitea MCP**, read the integration target — `steve/go-llm`:
-`v2/provider/provider.go` (the `Provider` interface you must stay compatible
-with), `v2/ollama/ollama.go` and `v2/constructors.go` (how `Ollama` /
-`OllamaCloud` construct over native `/api/chat` + Bearer), and `v2/CLAUDE.md`
+1. `CLAUDE.md` — the operating manual and source of truth.
+2. `docs/adr/README.md`, then every `docs/adr/00NN-*.md` (0001–0013). The ADRs are
+the *why*. Do not relitigate them.
+3. Via the **gitea MCP**, `steve/go-llm`: `v2/provider/provider.go` (the
+`Provider` interface), `v2/ollama/ollama.go` + `v2/ollama/native.go` +
+`v2/constructors.go` (native `/api/chat` + Bearer + base URL), `v2/CLAUDE.md`
 (DD#8: native API, not OpenAI-compat).
-4. Via the gitea MCP, study deployment conventions in `steve/steveternet`:
-`kalimdor/orgrimmar/warhol-queue/`, `kalimdor/orgrimmar/ratchet/`, and
-`kalimdor/orgrimmar/mort/` for `docker-compose.yml` + `.env.example` patterns,
-and `kalimdor/orgrimmar/traefik/` (incl. `custom/`) for the Traefik network
-name, entrypoint, certresolver, and router/label conventions. foreman will
-live at `kalimdor/orgrimmar/foreman/`. **Mirror these exactly; do not invent
-label syntax.**
+4. Via the gitea MCP, `steve/steveternet`: `kalimdor/orgrimmar/warhol-queue/`,
+`kalimdor/orgrimmar/ratchet/`, `kalimdor/orgrimmar/mort/`, and
+`kalimdor/orgrimmar/traefik/` (incl. `custom/`) for compose/Traefik/network
+conventions. foreman lives at `kalimdor/orgrimmar/foreman/`. Mirror these
+exactly; do not invent label syntax.
-## Working agreement (opusplan)
-- **Plan before code.** For each phase, produce a plan and wait for my approval
-before implementing. Do not run ahead to later phases.
-- **One phase at a time**, in order. Each phase is its own prompt I will paste.
-- After every phase: `go build ./...`, `go vet ./...`, `go test -race -count=1 ./...`
-must all pass. Append a dated entry to `progress.md`. Commit on a phase branch
-with conventional-commit messages (`feat:`, `chore:`, `test:`, `docs:`).
-- **Ask before assuming.** If a detail is ambiguous and not settled by CLAUDE.md
-or an ADR, ask me — don't guess.
-- **Propose an ADR** (append-only, next number) for any architectural decision
-not already covered. Keep `docs/adr/README.md`'s index current.
-- Keep dependencies minimal; match `go-llm` house style (tabs; wrap errors with
-`fmt.Errorf("%w: ...", err)`; imports stdlib → third-party → internal). SQLite
-via `modernc.org/sqlite` (pure-Go, `CGO_ENABLED=0`). No UI.
-- **Refuse scope creep.** No distributed dispatch, leases, fair queueing,
-capacity budgets, auth framework/SSO, GUI, or multi-target support. If a task
-seems to need them, stop and flag it — that means the design is being violated.
-## Definition of done (whole project)
-A deployable daemon that:
-- fronts one configurable Ollama target and transparently proxies native
-`/api/chat`, `/api/tags`, `/api/ps` (so `go-llm` uses the Mac as a target with
-no provider changes), including streaming;
-- runs a durable SQLite-backed queue with a single worker and drain-by-model
-scheduling, surviving restarts and target sleep;
-- exposes an async `POST /jobs` surface returning a job ID, with
-`queued→loading→working→done/failed` state webhooks and artifact delivery;
-- ships a Go client package (synchronous facade over the async surface);
-- passes CI on Gitea, builds as a container, and deploys via a steveternet
-`docker-compose.yml` behind Traefik.
-## Phase map
+## The phases
+Each `prompts/phase-N.md` is the detailed spec for that phase. For each phase, in
+order: read `phase-N.md`, plan it internally, implement it, make the gates pass,
+record progress, commit, then immediately continue to the next phase.
+**Override:** the phase files open with "Plan, get approval, implement" — that was
+written for a paste-one-at-a-time workflow. In *this* autonomous run, treat it as
+"plan internally and proceed." Do not pause for approval at any phase boundary.
 1. Scaffold, config, SQLite store, health, CI, Dockerfile.
-2. Ollama target client + model poller + native passthrough (the go-llm target).
-3. Durable queue + single worker + drain-by-model.
+2. Ollama target client + model poller + native passthrough + embedding bypass.
+3. Durable queue + single worker + drain-by-model (replaces phase-2's chat gate).
 4. Async `/jobs` + job IDs + state webhooks + artifacts.
 5. Go client package (sync facade) + `llm.Foreman()` in go-llm.
-6. Deploy: steveternet compose + Traefik, `.env.example`, deploy docs, model-pull script.
+6. Deploy: steveternet compose + Traefik, `.env.example`, deploy docs, model script.
-## Your task right now
-Confirm you've read the sources above, briefly restate the architecture in your
-own words (so I can check your understanding), flag anything in the ADRs you'd
-push back on, then produce a **detailed plan for Phase 1 only**. Do not write code
-yet. Stop for my approval.
+## Per-phase loop (do this every phase, automatically)
+- Implement to the phase spec and the ADRs.
+- Run the gates; **all** must pass before moving on:
+`go build ./...`, `go vet ./...`, `go test -race -count=1 ./...`, and
+`go mod tidy` followed by `git diff --exit-code go.mod go.sum`.
+- Append a dated entry to `progress.md` (what landed, what's next).
+- Commit to the **foreman** repo with conventional-commit messages
+(`feat:`, `test:`, `chore:`, `docs:`). Committing to foreman's main is fine.
+- Continue to the next phase without pausing.
+## Invariants to honor throughout (from the ADRs)
+- **Two-slot runtime (ADR-0013):** the target runs `OLLAMA_MAX_LOADED_MODELS=2`:
+an always-resident embedder (`FOREMAN_EMBED_MODEL`) plus one rotating worker
+model. `/api/embed` (+ `/api/embeddings`) bypass the queue and run
+concurrently; only `/api/chat` and `POST /jobs` are serialized through the
+single worker. Worker-model concurrency is exactly 1 (ADR-0009).
+- **NDJSON, not SSE (ADR-0012):** stream `/api/chat` as `application/x-ndjson`.
+- **Env namespacing:** every config key is `FOREMAN_*` (incl.
+`FOREMAN_OLLAMA_URL`, `FOREMAN_OLLAMA_TOKEN`). No bare `OLLAMA_*`.
+- **Go 1.26** in `go.mod`, Dockerfile, and CI.
+- Unreachable target = transient/recoverable, never fatal (ADR-0002).
+## Cross-repo changes (phases 5 and 6)
+The `llm.Foreman()` constructor (go-llm) and the steveternet `docker-compose.yml`
+touch repos other than foreman. For those, **open a branch and a PR for my
+review — do NOT commit to their main.** Report the branch names and PR links in
+the final summary.
+## When to stop vs. keep going
+- Keep going through routine ambiguity. If you hit a decision not covered by
+`CLAUDE.md` or an ADR, make the smallest reasonable choice, **record it as a new
+ADR** (append-only, next number after 0013, update the index), and continue.
+- **Only stop** for a true blocker: a gate you cannot make green after honest
+effort, a repo/tool you cannot reach, or a required choice that would
+contradict an accepted ADR or a scope guardrail. If you stop, say exactly why
+and what you need.
+## Definition of done (whole run)
+- foreman fronts one configurable Ollama target; transparently proxies native
+`/api/chat`, `/api/tags`, `/api/ps` (NDJSON streaming) so go-llm uses it as a
+target with no provider changes; `/api/embed` bypasses the queue concurrently.
+- Durable SQLite queue, single worker, drain-by-model; survives restart and
+target sleep.
+- `POST /jobs` returns a ULID job id; `queued→loading→working→done|failed` state
+webhooks (at-least-once, optional HMAC); artifacts inline/fetch.
+- A Go client package (sync facade over `/jobs`); `llm.Foreman()` branch/PR on
+go-llm.
+- CI green; container builds; steveternet compose + Traefik branch/PR.
+## Start now
+Read the sources, then begin Phase 1 and run straight through to a finished
+deliverable. When done, give me: a summary of what was built per phase, the
+go-llm and steveternet PR links, any ADRs you added, and a copy-pasteable
+end-to-end smoke-test checklist (pull models on the Mac → set
+`OLLAMA_MAX_LOADED_MODELS=2` → deploy foreman → go-llm chat → concurrent
+`/api/embed` → `POST /jobs` with a webhook).
+7 -3
@@ -16,9 +16,13 @@ health endpoint — no Ollama logic yet.
 `internal/store`, `internal/server`. Don't create empty packages for later
 phases.
 - `internal/config`: load from env into a struct — `FOREMAN_ADDR` (listen addr,
-default `:8080`), `FOREMAN_OLLAMA_URL` (target, required), `FOREMAN_TOKEN`
-(optional inbound bearer), `FOREMAN_DB_PATH`, `FOREMAN_POLL_INTERVAL`. Provide
-a `.env.example` documenting every key.
+default `:8080`), `FOREMAN_OLLAMA_URL` (target, required), `FOREMAN_OLLAMA_TOKEN`
+(optional outbound bearer to the target, for Ollama-Cloud-style auth),
+`FOREMAN_TOKEN` (optional inbound bearer foreman requires of its callers),
+`FOREMAN_EMBED_MODEL` (the always-resident embedder, e.g. `nomic-embed-text`),
+`FOREMAN_DB_PATH`, `FOREMAN_POLL_INTERVAL`. Namespace **every** key under
+`FOREMAN_` (do not use bare `OLLAMA_*`, which collide with real Ollama client
+vars). Provide a `.env.example` documenting every key.
 - `internal/store`: SQLite via `modernc.org/sqlite`, WAL mode, with an embedded
 migration for the `jobs` and `artifacts` tables (schema sketch in ADR-0008 /
 ADR-0006). Include open/close, migrate-on-start, and basic CRUD with tests
+24 -9
@@ -1,33 +1,48 @@
 # phase-2.md — Ollama target client, model poller, native passthrough
 Re-ground: `CLAUDE.md` + ADR-0003 (API surface), 0007 (model polling), 0012
-(streaming), 0002 (unreachable = transient). Plan, get approval, implement.
+(streaming = NDJSON, not SSE), 0013 (two-slot residency + embedding bypass),
+0002 (unreachable = transient). Plan, get approval, implement.
 ## Objective
 Make foreman a working transparent front for its Ollama target — enough that
 `go-llm` can use the Mac as a target *today*, before any queue exists. (Phase 3
-will move this through the queue; here it can proxy directly.)
+will move chat through the queue; here it proxies behind a single-flight gate.)
 ## Tasks
 - `internal/ollama`: a small client to the target (`FOREMAN_OLLAMA_URL`) behind
 an interface, covering `POST /api/chat` (streaming and non-streaming),
-`GET /api/tags`, `GET /api/ps`. Attach the outbound bearer if configured. Wrap
-errors; classify connection failures distinctly (Phase 3 needs that signal).
+`POST /api/embed` (+ `/api/embeddings` alias), `GET /api/tags`, `GET /api/ps`.
+Attach the outbound bearer (`FOREMAN_OLLAMA_TOKEN`) if configured. Wrap errors;
+classify connection failures distinctly (Phase 3 needs that signal).
+- Warm the embedder: on startup and after any reconnect-from-unreachable, issue a
+trivial `/api/embed` to `FOREMAN_EMBED_MODEL` so it occupies a resident slot
+(ADR-0013). The target must run `OLLAMA_MAX_LOADED_MODELS=2`; log a warning if
+`/api/ps` ever shows only one slot under load.
 - Model poller (goroutine): poll `/api/tags` every `FOREMAN_POLL_INTERVAL`
 (default 30s) into an in-memory inventory with a mutex; track last-poll time
 and a degraded flag. On target unreachable, retain last-known inventory and set
 degraded — do not clear it. Wire degraded state into `/healthz`.
 - Passthrough handlers in `internal/server`:
 - `GET /api/tags` and `GET /api/ps` served from the poller/target.
+- `POST /api/embed` and `POST /api/embeddings`: proxy **directly and
+concurrently** to the target — these BYPASS the queue/worker gate entirely
+(ADR-0013). No serialization.
 - `POST /api/chat`: validate the requested model against the inventory (one
-re-poll on miss, then 4xx if still absent); proxy to the target. Support
-streaming faithfully (stream the target's chunks straight through; set the
-right content type). For now this may call the target directly — no queue.
+re-poll on miss, then 4xx if still absent); proxy to the target. **Serialize
+worker-model access through a single in-flight gate (a buffered channel /
+mutex of 1)** so two concurrent chat requests never hit the worker slot at
+once — this preserves the serial invariant *before* the full queue exists.
+Phase 3 replaces this gate with the SQLite queue + worker loop. Stream
+faithfully as **NDJSON** (`Content-Type: application/x-ndjson`, chunks passed
+straight through — Ollama's native format, not SSE).
 - Tests: a stub HTTP server standing in for Ollama; assert tags/ps proxy,
-model validation rejects unknown models, streaming passes chunks through, and
-the poller flips degraded on target failure and recovers.
+model validation rejects unknown models, NDJSON streaming passes chunks
+through, **concurrent `/api/embed` calls run in parallel while `/api/chat` is
+serialized** (assert no two chats overlap at the stub), and the poller flips
+degraded on target failure and recovers (and re-warms the embedder).
 ## Definition of done
+8 -5
@@ -1,13 +1,16 @@
 # phase-3.md — Durable queue, single worker, drain-by-model
-Re-ground: `CLAUDE.md` + ADR-0009 (single worker / drain-by-model), 0008 (queue),
-0004 (lifecycle/retry). Plan, get approval, implement.
+Re-ground: `CLAUDE.md` + ADR-0009 (single worker / drain-by-model), 0013
+(embeddings bypass — they must NOT be touched here), 0008 (queue), 0004
+(lifecycle/retry). Plan, get approval, implement.
 ## Objective
-Route execution through the SQLite queue with exactly one worker and
-drain-by-model scheduling. The synchronous passthrough from Phase 2 now enqueues
-and blocks on completion instead of calling the target directly.
+Replace Phase 2's interim single-flight chat gate with the real SQLite queue and
+one worker, with drain-by-model scheduling. The synchronous passthrough now
+enqueues and blocks on completion instead of holding a direct gate.
+`/api/embed` stays exactly as Phase 2 built it — direct, concurrent, never
+queued (ADR-0013). Do not route embeddings through any of this.
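Drain-by-model selection can be illustrated with an in-memory queue. This is a sketch under assumptions: the real queue is SQLite-backed and the resident model is observed from the target, whereas here `job`, `pickNext`, and `drainOrder` are hypothetical names, the queue is a FIFO slice, and the resident model is passed in as a string.

```go
package main

import "fmt"

// job is a hypothetical, minimal view of a queued job.
type job struct {
	ID    string
	Model string
}

// pickNext returns the index of the next job to run from a FIFO slice: the
// oldest job for the resident model if one exists, otherwise the oldest job
// overall (which forces a swap). It returns -1 for an empty queue.
func pickNext(queue []job, resident string) int {
	if len(queue) == 0 {
		return -1
	}
	for i, j := range queue {
		if j.Model == resident {
			return i
		}
	}
	return 0
}

// drainOrder simulates the worker loop and returns job IDs in run order.
func drainOrder(queue []job, resident string) []string {
	q := append([]job(nil), queue...) // copy so the caller's slice is untouched
	var order []string
	for {
		i := pickNext(q, resident)
		if i < 0 {
			break
		}
		order = append(order, q[i].ID)
		resident = q[i].Model // running a job makes its model resident
		q = append(q[:i], q[i+1:]...)
	}
	return order
}

func main() {
	queue := []job{
		{ID: "a", Model: "x"},
		{ID: "b", Model: "y"},
		{ID: "c", Model: "x"},
		{ID: "d", Model: "y"},
	}
	// With "x" resident, both x jobs run before the single swap to "y".
	fmt.Println(drainOrder(queue, "x"))
}
```

The selection is a heuristic over arrival order only; it has no priorities or fairness weights.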
 ## Tasks