From 0526bada90ce76b6217c03f9e29147f8668e15c2 Mon Sep 17 00:00:00 2001
From: Steve Dudenhoeffer
Date: Fri, 26 Jun 2026 20:33:39 -0400
Subject: [PATCH] docs: land prior ADR + prompt updates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Commit pre-existing uncommitted working-tree changes that predate the
license/public-readiness work — NOT authored in this session, just flushed so
they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013
(embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and
the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/adr/0003-api-surface.md                  |   4 +-
 docs/adr/0005-webhook-protocol.md             |   7 +-
 docs/adr/0009-single-worker-drain-by-model.md |  27 +--
 docs/adr/0012-streaming.md                    |  12 +-
 .../0013-embeddings-bypass-and-two-slot.md    |  52 ++++++
 prompts/README.md                             |  61 +++++++
 prompts/phase-0-kickoff.md                    | 155 +++++++++++-------
 prompts/phase-1.md                            |  10 +-
 prompts/phase-2.md                            |  33 +++-
 prompts/phase-3.md                            |  13 +-
 10 files changed, 276 insertions(+), 98 deletions(-)
 create mode 100644 docs/adr/0013-embeddings-bypass-and-two-slot.md
 create mode 100644 prompts/README.md

diff --git a/docs/adr/0003-api-surface.md b/docs/adr/0003-api-surface.md
index 25ff622..30e0852 100644
--- a/docs/adr/0003-api-surface.md
+++ b/docs/adr/0003-api-surface.md
@@ -40,8 +40,8 @@ only if a non-go-llm caller needs it.
 - "Set up the Mac as a go-llm target" needs zero provider changes — a thin
   constructor only (ADR-0011).
 - Preserves `think:false`, reliable tool calls, and lower latency.
-- foreman must faithfully proxy native `/api/chat` semantics, including SSE
-  streaming (ADR-0012).
+- foreman must faithfully proxy native `/api/chat` semantics, including NDJSON
+  streaming (`application/x-ndjson`, not SSE; ADR-0012).

 ## Alternatives considered

diff --git a/docs/adr/0005-webhook-protocol.md b/docs/adr/0005-webhook-protocol.md
index 71d7613..980b451 100644
--- a/docs/adr/0005-webhook-protocol.md
+++ b/docs/adr/0005-webhook-protocol.md
@@ -22,7 +22,7 @@ that URL on every state transition.
   "state": "loading",
   "previous_state": "queued",
   "timestamp": "2026-05-23T12:00:00Z",
-  "model": "qwen3.6:35b",
+  "model": "qwen3:30b",
   "attempt": 1,
   "error": null,
   "result": null,
@@ -59,5 +59,6 @@ that URL on every state transition.
 - **Polling only.** Simpler for foreman, worse for callers; rejected since
   webhooks were an explicit requirement. (Polling is still available as
   fallback.)
-- **WebSocket/SSE for state.** Heavier; SSE is reserved for token streaming on the
-  sync surface (ADR-0012), not job-state fan-out.
+- **WebSocket/streamed connection for state.** Heavier; token streaming on the
+  sync surface is NDJSON (ADR-0012), and job-state fan-out doesn't need a
+  persistent connection — discrete webhook POSTs suffice.
diff --git a/docs/adr/0009-single-worker-drain-by-model.md b/docs/adr/0009-single-worker-drain-by-model.md
index 08ccc8f..2e2d39c 100644
--- a/docs/adr/0009-single-worker-drain-by-model.md
+++ b/docs/adr/0009-single-worker-drain-by-model.md
@@ -4,16 +4,22 @@

 ## Context

-The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one model fast
-at a time; loading a different model is a 5-10s cold start. Running two models
-concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism
-against a single target buys nothing and would reintroduce coordination logic.
+The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one *worker*
+model fast at a time; loading a different worker model is a 5-10s cold start.
+Running two large models concurrently on 32GB either OOMs or pages to a 5-10x
+slowdown. So parallelism among **worker** models against a single target buys
+nothing and would reintroduce coordination logic.
+
+The one exception is a small always-resident embedding model, which co-resides
+cheaply alongside the worker model and is served outside the queue entirely
+(ADR-0013). This ADR governs only the worker slot.

 ## Decision

-**Concurrency against the target is 1.** A single worker loop pulls the next job
-from the queue, ensures the right model is resident, executes, and records the
-result.
+**Worker-model concurrency against the target is 1.** A single worker loop pulls
+the next job from the queue, ensures the right worker model is resident, executes,
+and records the result. (Embeddings are not jobs and never enter this loop —
+ADR-0013.)

 **Drain-by-model scheduling:** before incurring a model swap, the worker finishes
 every queued job that targets the **currently-resident** model (observed via
@@ -25,9 +31,10 @@ heuristic, not a scheduler.
 There is intentionally **no** priority system, fairness weighting, or capacity
 budgeting (those sank the predecessor; see ADR-0001).

-Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
-between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
-to single-resident swap.
+Residency is pinned with Ollama `keep_alive` so the hot worker model isn't
+unloaded between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=2` on the target
+holds two slots: the always-resident embedding model plus the rotating worker
+model (ADR-0013). Worker models still swap one-at-a-time within their single slot.

 ## Consequences

diff --git a/docs/adr/0012-streaming.md b/docs/adr/0012-streaming.md
index 0a46ac1..2dd8be5 100644
--- a/docs/adr/0012-streaming.md
+++ b/docs/adr/0012-streaming.md
@@ -13,11 +13,13 @@ different granularity than token streaming.
 ## Decision

 - **Sync passthrough: support streaming.** When a `/api/chat` request sets
-  `stream: true`, foreman streams the target's token deltas back to the caller
-  (SSE/chunked, matching Ollama's native streaming). A streamed job still moves
-  through the queue; streaming begins once the job reaches `working`, so a job
-  waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when
-  its turn comes. go-llm's `Stream()` works against foreman unchanged.
+  `stream: true`, foreman streams the target's token deltas back to the caller as
+  **NDJSON** (`application/x-ndjson`, newline-delimited JSON chunks — Ollama's
+  native streaming wire format, which go-llm reads with a `bufio.Scanner`). This
+  is *not* SSE/`text/event-stream`. A streamed job still moves through the queue;
+  streaming begins once the job reaches `working`, so a job waiting behind the
+  drain-by-model queue (ADR-0009) simply starts streaming when its turn comes.
+  go-llm's `Stream()` works against foreman unchanged.
 - **Async `/jobs` surface: no token streaming in v1.** Webhooks carry coarse state
   transitions (ADR-0005) and the final result/artifacts, not per-token deltas.
   Token-level streaming over a fire-and-forget webhook job is deliberately
diff --git a/docs/adr/0013-embeddings-bypass-and-two-slot.md b/docs/adr/0013-embeddings-bypass-and-two-slot.md
new file mode 100644
index 0000000..add3ca2
--- /dev/null
+++ b/docs/adr/0013-embeddings-bypass-and-two-slot.md
@@ -0,0 +1,52 @@
+# ADR-0013: Two-slot residency and embedding bypass
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+The target keeps **two** models resident (`OLLAMA_MAX_LOADED_MODELS=2`): a small,
+always-resident **embedding model** (e.g. `nomic-embed-text` or
+`qwen3-embedding`, operator-swappable) and one rotating **worker model** that
+chat jobs queue against (ADR-0009). The embedder is tiny (~0.3–0.6 GB) and
+co-resides cheaply with a ~20 GB worker model on 32 GB.
+
+Embeddings are latency-sensitive and high-volume — a single backfill may fire
+thousands of `/api/embed` calls. Forcing them through the serialized worker queue
+(ADR-0009) would make them wait behind 20 GB chat jobs and swap thrash for no
+reason, since the embedder is always loaded and never needs swapping.
+
+## Decision
+
+**The target runs exactly two resident models, and embeddings bypass the queue.**
+
+- `OLLAMA_MAX_LOADED_MODELS=2`: slot 1 is the always-resident embedder (pinned
+  with `keep_alive: -1`); slot 2 is the rotating worker model managed by the
+  single-worker drain-by-model loop (ADR-0009).
+- **Routing rule:** only `/api/chat` and `POST /jobs` are serialized through the
+  worker queue. `/api/embed` (and the `/api/embeddings` alias) are proxied
+  **directly and concurrently** to the target, never touching the queue, the
+  worker loop, or the job store. Concurrent embedding requests are allowed; they
+  hit the always-resident embedder and do not contend with worker-model swaps.
+- The embedding model name is configurable (`FOREMAN_EMBED_MODEL`); foreman warms
+  it on startup and on reconnect after the target was unreachable, so it stays in
+  slot 1.
+
+## Consequences
+
+- Embeddings are fast and concurrent regardless of worker-queue depth — the right
+  behavior for indexing/RAG backfills.
+- The "concurrency" of foreman is precisely this: embedder ∥ worker. Worker jobs
+  among themselves remain strictly serial (ADR-0009). There is no other
+  parallelism, and none should be added.
+- foreman must distinguish embedding routes from chat routes at the HTTP layer and
+  keep them on separate code paths.
+- If the operator misconfigures the target to `MAX_LOADED_MODELS=1`, embeddings
+  and worker jobs will fight for the single slot and thrash; foreman should log a
+  startup warning if it observes only one slot via `/api/ps` under load.
+
+## Alternatives considered
+
+- **Embeddings through the queue.** Simple uniformity, but serializes a
+  high-volume concurrent workload behind chat jobs for no benefit. Rejected.
+- **A dedicated second daemon for embeddings.** Violates one-daemon-per-target
+  (ADR-0001) and is unnecessary — Ollama already serves both from one endpoint.
diff --git a/prompts/README.md b/prompts/README.md
new file mode 100644
index 0000000..7fe4405
--- /dev/null
+++ b/prompts/README.md
@@ -0,0 +1,61 @@
+# foreman build prompts
+
+This directory drives the autonomous Claude Code build of foreman.
+
+## How to run
+
+Start with **`phase-0-kickoff.md`** — it is the master driver. Paste it (or the
+command below) into Claude Code from the repo root. It reads `CLAUDE.md`, the
+ADRs in `docs/adr/`, and the `go-llm` / `steveternet` sources, then runs phases
+1 → 6 **autonomously** to a finished, deployable foreman. You do **not** paste the
+individual phase files — the kickoff reads them as it goes.
+
+Kickoff command:
+
+```
+Read and follow prompts/phase-0-kickoff.md. The per-phase specs it references
+(prompts/phase-1.md … phase-6.md) are in this same prompts/ directory. This is a
+fully autonomous run: execute all six phases in order to a finished, working
+deliverable without pausing between them. Honor docs/adr/ (note the new 0013) and
+CLAUDE.md as source of truth. For the two cross-repo changes (llm.Foreman() in
+steve/go-llm and the docker-compose in steve/steveternet), open a branch and PR
+on each for my review — do not commit to their main. When done, report what each
+phase built, the PR links, any ADRs you added, and a smoke-test checklist.
+```
+
+## What each phase produces
+
+1. `phase-1` — scaffold, config, SQLite store, health, CI, Dockerfile.
+2. `phase-2` — Ollama client + model poller + native passthrough + embedding
+   bypass. The Mac is usable as a `go-llm` target after this.
+3. `phase-3` — durable queue + single worker + drain-by-model. **M0 complete.**
+4. `phase-4` — async `/jobs` + job IDs + state webhooks + artifacts. The headline
+   queue-and-webhooks capability.
+5. `phase-5` — Go client package (sync facade) + `llm.Foreman()` in `go-llm`.
+6. `phase-6` — deploy: steveternet compose + Traefik, `.env.example`, deploy
+   docs, model-pull script.
+
+## Conventions during the run
+
+- Each phase must pass the gates before continuing: `go build ./...`,
+  `go vet ./...`, `go test -race -count=1 ./...`, and `go mod tidy` +
+  `git diff --exit-code go.mod go.sum`.
+- Commits go to foreman's `main` with conventional-commit messages; `progress.md`
+  gets a dated entry each phase.
+- A decision not covered by `CLAUDE.md` or an ADR → record a new ADR
+  (append-only, next number after 0013) and continue.
+
+## Cross-repo caveat (phases 5 & 6)
+
+The `llm.Foreman()` constructor lives in **steve/go-llm** and the deploy compose
+in **steve/steveternet**. Those changes go on a **branch + PR for review — never
+their main.**
+
+## Known caveat (phase 6)
+
+Phase 6 mirrors the Traefik/compose conventions from sibling services
+(`kalimdor/orgrimmar/warhol-queue`, `ratchet`, `mort`) read via the gitea MCP.
+Those reads were intermittently erroring during planning. **If phase 6 can't read
+them, paste a sibling `docker-compose.yml` (e.g. ratchet's) straight into the
+session** so it can mirror the network name, entrypoint, certresolver, and
+router/service labels rather than inventing them.
diff --git a/prompts/phase-0-kickoff.md b/prompts/phase-0-kickoff.md
index 25b1a5a..4cba326 100644
--- a/prompts/phase-0-kickoff.md
+++ b/prompts/phase-0-kickoff.md
@@ -1,75 +1,108 @@
-# phase-0-kickoff.md — foreman build kickoff
+# phase-0-kickoff.md — foreman autonomous build

-You are building **foreman**, a Go daemon that fronts one Ollama target and turns
-it into a queued, observable, OpenAI/Ollama-compatible job endpoint. This is a
-deliberately pared-down restart of a system (`peon-overseer`) that died of scope
-creep. Restraint is a feature, not a limitation.
+You are building **foreman** end to end, in **one autonomous run**. Execute all
+six phases in order (1 → 2 → 3 → 4 → 5 → 6) and do not stop between them. The run
+ends when foreman is a working, deployable deliverable. Do not wait for my
+approval at phase boundaries — keep going until done or genuinely blocked.
+
+foreman is a Go daemon that fronts one Ollama target and turns it into a queued,
+observable, Ollama-compatible job endpoint. It is a deliberately pared-down
+restart of a system (`peon-overseer`) that died of scope creep. Restraint is a
+feature: if a task seems to need distributed dispatch, leases, fair queueing,
+capacity budgets, an auth framework/SSO, a GUI, or multi-target support — stop,
+because that means the design is being violated.

 ## Read these first (authoritative, in order)

-1. `CLAUDE.md` in this repo — the operating manual. It is the source of truth for
-   architecture, stack, conventions, and the **out-of-scope guardrails**.
-2. `docs/adr/README.md` then every `docs/adr/00NN-*.md`. The ADRs are the *why*.
-   Do not relitigate them; if you believe one is wrong, say so and propose a new
-   superseding ADR rather than silently diverging.
-3. Via the **gitea MCP**, read the integration target — `steve/go-llm`:
-   `v2/provider/provider.go` (the `Provider` interface you must stay compatible
-   with), `v2/ollama/ollama.go` and `v2/constructors.go` (how `Ollama` /
-   `OllamaCloud` construct over native `/api/chat` + Bearer), and `v2/CLAUDE.md`
+1. `CLAUDE.md` — the operating manual and source of truth.
+2. `docs/adr/README.md`, then every `docs/adr/00NN-*.md` (0001–0013). The ADRs are
+   the *why*. Do not relitigate them.
+3. Via the **gitea MCP**, `steve/go-llm`: `v2/provider/provider.go` (the
+   `Provider` interface), `v2/ollama/ollama.go` + `v2/ollama/native.go` +
+   `v2/constructors.go` (native `/api/chat` + Bearer + base URL), `v2/CLAUDE.md`
    (DD#8: native API, not OpenAI-compat).
-4. Via the gitea MCP, study deployment conventions in `steve/steveternet`:
-   `kalimdor/orgrimmar/warhol-queue/`, `kalimdor/orgrimmar/ratchet/`, and
-   `kalimdor/orgrimmar/mort/` for `docker-compose.yml` + `.env.example` patterns,
-   and `kalimdor/orgrimmar/traefik/` (incl. `custom/`) for the Traefik network
-   name, entrypoint, certresolver, and router/label conventions. foreman will
-   live at `kalimdor/orgrimmar/foreman/`. **Mirror these exactly; do not invent
-   label syntax.**
+4. Via the gitea MCP, `steve/steveternet`: `kalimdor/orgrimmar/warhol-queue/`,
+   `kalimdor/orgrimmar/ratchet/`, `kalimdor/orgrimmar/mort/`, and
+   `kalimdor/orgrimmar/traefik/` (incl. `custom/`) for compose/Traefik/network
+   conventions. foreman lives at `kalimdor/orgrimmar/foreman/`. Mirror these
+   exactly; do not invent label syntax.

-## Working agreement (opusplan)
+## The phases

-- **Plan before code.** For each phase, produce a plan and wait for my approval
-  before implementing. Do not run ahead to later phases.
-- **One phase at a time**, in order. Each phase is its own prompt I will paste.
-- After every phase: `go build ./...`, `go vet ./...`, `go test -race -count=1 ./...`
-  must all pass. Append a dated entry to `progress.md`. Commit on a phase branch
-  with conventional-commit messages (`feat:`, `chore:`, `test:`, `docs:`).
-- **Ask before assuming.** If a detail is ambiguous and not settled by CLAUDE.md
-  or an ADR, ask me — don't guess.
-- **Propose an ADR** (append-only, next number) for any architectural decision
-  not already covered. Keep `docs/adr/README.md`'s index current.
-- Keep dependencies minimal; match `go-llm` house style (tabs; wrap errors with
-  `fmt.Errorf("%w: ...", err)`; imports stdlib → third-party → internal). SQLite
-  via `modernc.org/sqlite` (pure-Go, `CGO_ENABLED=0`). No UI.
-- **Refuse scope creep.** No distributed dispatch, leases, fair queueing,
-  capacity budgets, auth framework/SSO, GUI, or multi-target support. If a task
-  seems to need them, stop and flag it — that means the design is being violated.
+Each `prompts/phase-N.md` is the detailed spec for that phase. For each phase, in
+order: read `phase-N.md`, plan it internally, implement it, make the gates pass,
+record progress, commit, then immediately continue to the next phase.

-## Definition of done (whole project)
-
-A deployable daemon that:
-- fronts one configurable Ollama target and transparently proxies native
-  `/api/chat`, `/api/tags`, `/api/ps` (so `go-llm` uses the Mac as a target with
-  no provider changes), including streaming;
-- runs a durable SQLite-backed queue with a single worker and drain-by-model
-  scheduling, surviving restarts and target sleep;
-- exposes an async `POST /jobs` surface returning a job ID, with
-  `queued→loading→working→done/failed` state webhooks and artifact delivery;
-- ships a Go client package (synchronous facade over the async surface);
-- passes CI on Gitea, builds as a container, and deploys via a steveternet
-  `docker-compose.yml` behind Traefik.
-
-## Phase map
+**Override:** the phase files open with "Plan, get approval, implement" — that was
+written for a paste-one-at-a-time workflow. In *this* autonomous run, treat it as
+"plan internally and proceed." Do not pause for approval at any phase boundary.

 1. Scaffold, config, SQLite store, health, CI, Dockerfile.
-2. Ollama target client + model poller + native passthrough (the go-llm target).
-3. Durable queue + single worker + drain-by-model.
+2. Ollama target client + model poller + native passthrough + embedding bypass.
+3. Durable queue + single worker + drain-by-model (replaces phase-2's chat gate).
 4. Async `/jobs` + job IDs + state webhooks + artifacts.
 5. Go client package (sync facade) + `llm.Foreman()` in go-llm.
-6. Deploy: steveternet compose + Traefik, `.env.example`, deploy docs, model-pull script.
+6. Deploy: steveternet compose + Traefik, `.env.example`, deploy docs, model script.

-## Your task right now
+## Per-phase loop (do this every phase, automatically)

-Confirm you've read the sources above, briefly restate the architecture in your
-own words (so I can check your understanding), flag anything in the ADRs you'd
-push back on, then produce a **detailed plan for Phase 1 only**. Do not write code
-yet. Stop for my approval.
+- Implement to the phase spec and the ADRs.
+- Run the gates; **all** must pass before moving on:
+  `go build ./...`, `go vet ./...`, `go test -race -count=1 ./...`, and
+  `go mod tidy` followed by `git diff --exit-code go.mod go.sum`.
+- Append a dated entry to `progress.md` (what landed, what's next).
+- Commit to the **foreman** repo with conventional-commit messages
+  (`feat:`, `test:`, `chore:`, `docs:`). Committing to foreman's main is fine.
+- Continue to the next phase without pausing.
+
+## Invariants to honor throughout (from the ADRs)
+
+- **Two-slot runtime (ADR-0013):** the target runs `OLLAMA_MAX_LOADED_MODELS=2` —
+  an always-resident embedder (`FOREMAN_EMBED_MODEL`) plus one rotating worker
+  model. `/api/embed` (+ `/api/embeddings`) bypass the queue and run
+  concurrently; only `/api/chat` and `POST /jobs` are serialized through the
+  single worker. Worker-model concurrency is exactly 1 (ADR-0009).
+- **NDJSON, not SSE (ADR-0012):** stream `/api/chat` as `application/x-ndjson`.
+- **Env namespacing:** every config key is `FOREMAN_*` (incl.
+  `FOREMAN_OLLAMA_URL`, `FOREMAN_OLLAMA_TOKEN`). No bare `OLLAMA_*`.
+- **Go 1.26** in `go.mod`, Dockerfile, and CI.
+- Unreachable target = transient/recoverable, never fatal (ADR-0002).
+
+## Cross-repo changes (phases 5 and 6)
+
+The `llm.Foreman()` constructor (go-llm) and the steveternet `docker-compose.yml`
+touch repos other than foreman. For those, **open a branch and a PR for my
+review — do NOT commit to their main.** Report the branch names and PR links in
+the final summary.
+
+## When to stop vs. keep going
+
+- Keep going through routine ambiguity. If you hit a decision not covered by
+  `CLAUDE.md` or an ADR, make the smallest reasonable choice, **record it as a new
+  ADR** (append-only, next number after 0013, update the index), and continue.
+- **Only stop** for a true blocker: a gate you cannot make green after honest
+  effort, a repo/tool you cannot reach, or a required choice that would
+  contradict an accepted ADR or a scope guardrail. If you stop, say exactly why
+  and what you need.
+
+## Definition of done (whole run)
+
+- foreman fronts one configurable Ollama target; transparently proxies native
+  `/api/chat`, `/api/tags`, `/api/ps` (NDJSON streaming) so go-llm uses it as a
+  target with no provider changes; `/api/embed` bypasses the queue concurrently.
+- Durable SQLite queue, single worker, drain-by-model; survives restart and
+  target sleep.
+- `POST /jobs` returns a ULID job id; `queued→loading→working→done|failed` state
+  webhooks (at-least-once, optional HMAC); artifacts inline/fetch.
+- A Go client package (sync facade over `/jobs`); `llm.Foreman()` branch/PR on
+  go-llm.
+- CI green; container builds; steveternet compose + Traefik branch/PR.
+
+## Start now
+
+Read the sources, then begin Phase 1 and run straight through to a finished
+deliverable. When done, give me: a summary of what was built per phase, the
+go-llm and steveternet PR links, any ADRs you added, and a copy-pasteable
+end-to-end smoke-test checklist (pull models on the Mac → set
+`OLLAMA_MAX_LOADED_MODELS=2` → deploy foreman → go-llm chat → concurrent
+`/api/embed` → `POST /jobs` with a webhook).
diff --git a/prompts/phase-1.md b/prompts/phase-1.md
index b0ee4dd..7d96783 100644
--- a/prompts/phase-1.md
+++ b/prompts/phase-1.md
@@ -16,9 +16,13 @@ health endpoint — no Ollama logic yet.
   `internal/store`, `internal/server`. Don't create empty packages for later
   phases.
 - `internal/config`: load from env into a struct — `FOREMAN_ADDR` (listen addr,
-  default `:8080`), `FOREMAN_OLLAMA_URL` (target, required), `FOREMAN_TOKEN`
-  (optional inbound bearer), `FOREMAN_DB_PATH`, `FOREMAN_POLL_INTERVAL`. Provide
-  a `.env.example` documenting every key.
+  default `:8080`), `FOREMAN_OLLAMA_URL` (target, required), `FOREMAN_OLLAMA_TOKEN`
+  (optional outbound bearer to the target, for Ollama-Cloud-style auth),
+  `FOREMAN_TOKEN` (optional inbound bearer foreman requires of its callers),
+  `FOREMAN_EMBED_MODEL` (the always-resident embedder, e.g. `nomic-embed-text`),
+  `FOREMAN_DB_PATH`, `FOREMAN_POLL_INTERVAL`. Namespace **every** key under
+  `FOREMAN_` (do not use bare `OLLAMA_*`, which collide with real Ollama client
+  vars). Provide a `.env.example` documenting every key.
 - `internal/store`: SQLite via `modernc.org/sqlite`, WAL mode, with an embedded
   migration for the `jobs` and `artifacts` tables (schema sketch in ADR-0008 /
   ADR-0006). Include open/close, migrate-on-start, and basic CRUD with tests
diff --git a/prompts/phase-2.md b/prompts/phase-2.md
index d1dd4af..1f2cad4 100644
--- a/prompts/phase-2.md
+++ b/prompts/phase-2.md
@@ -1,33 +1,48 @@
 # phase-2.md — Ollama target client, model poller, native passthrough

 Re-ground: `CLAUDE.md` + ADR-0003 (API surface), 0007 (model polling), 0012
-(streaming), 0002 (unreachable = transient). Plan, get approval, implement.
+(streaming = NDJSON, not SSE), 0013 (two-slot residency + embedding bypass),
+0002 (unreachable = transient). Plan, get approval, implement.

 ## Objective

 Make foreman a working transparent front for its Ollama target — enough that
 `go-llm` can use the Mac as a target *today*, before any queue exists. (Phase 3
-will move this through the queue; here it can proxy directly.)
+will move chat through the queue; here it proxies behind a single-flight gate.)

 ## Tasks

 - `internal/ollama`: a small client to the target (`FOREMAN_OLLAMA_URL`) behind
   an interface, covering `POST /api/chat` (streaming and non-streaming),
-  `GET /api/tags`, `GET /api/ps`. Attach the outbound bearer if configured. Wrap
-  errors; classify connection failures distinctly (Phase 3 needs that signal).
+  `POST /api/embed` (+ `/api/embeddings` alias), `GET /api/tags`, `GET /api/ps`.
+  Attach the outbound bearer (`FOREMAN_OLLAMA_TOKEN`) if configured. Wrap errors;
+  classify connection failures distinctly (Phase 3 needs that signal).
+- Warm the embedder: on startup and after any reconnect-from-unreachable, issue a
+  trivial `/api/embed` to `FOREMAN_EMBED_MODEL` so it occupies a resident slot
+  (ADR-0013). The target must run `OLLAMA_MAX_LOADED_MODELS=2`; log a warning if
+  `/api/ps` ever shows only one slot under load.
 - Model poller (goroutine): poll `/api/tags` every `FOREMAN_POLL_INTERVAL`
   (default 30s) into an in-memory inventory with a mutex; track last-poll time
   and a degraded flag. On target unreachable, retain last-known inventory and
   set degraded — do not clear it. Wire degraded state into `/healthz`.
 - Passthrough handlers in `internal/server`:
   - `GET /api/tags` and `GET /api/ps` served from the poller/target.
+  - `POST /api/embed` and `POST /api/embeddings`: proxy **directly and
+    concurrently** to the target — these BYPASS the queue/worker gate entirely
+    (ADR-0013). No serialization.
   - `POST /api/chat`: validate the requested model against the inventory (one
-    re-poll on miss, then 4xx if still absent); proxy to the target. Support
-    streaming faithfully (stream the target's chunks straight through; set the
-    right content type). For now this may call the target directly — no queue.
+    re-poll on miss, then 4xx if still absent); proxy to the target. **Serialize
+    worker-model access through a single in-flight gate (a buffered channel /
+    mutex of 1)** so two concurrent chat requests never hit the worker slot at
+    once — this preserves the serial invariant *before* the full queue exists.
+    Phase 3 replaces this gate with the SQLite queue + worker loop. Stream
+    faithfully as **NDJSON** (`Content-Type: application/x-ndjson`, chunks passed
+    straight through — Ollama's native format, not SSE).
 - Tests: a stub HTTP server standing in for Ollama; assert tags/ps proxy,
-  model validation rejects unknown models, streaming passes chunks through, and
-  the poller flips degraded on target failure and recovers.
+  model validation rejects unknown models, NDJSON streaming passes chunks
+  through, **concurrent `/api/embed` calls run in parallel while `/api/chat` is
+  serialized** (assert no two chats overlap at the stub), and the poller flips
+  degraded on target failure and recovers (and re-warms the embedder).

 ## Definition of done

diff --git a/prompts/phase-3.md b/prompts/phase-3.md
index 1a95f51..0f89f98 100644
--- a/prompts/phase-3.md
+++ b/prompts/phase-3.md
@@ -1,13 +1,16 @@
 # phase-3.md — Durable queue, single worker, drain-by-model

-Re-ground: `CLAUDE.md` + ADR-0009 (single worker / drain-by-model), 0008 (queue),
-0004 (lifecycle/retry). Plan, get approval, implement.
+Re-ground: `CLAUDE.md` + ADR-0009 (single worker / drain-by-model), 0013
+(embeddings bypass — they must NOT be touched here), 0008 (queue), 0004
+(lifecycle/retry). Plan, get approval, implement.

 ## Objective

-Route execution through the SQLite queue with exactly one worker and
-drain-by-model scheduling. The synchronous passthrough from Phase 2 now enqueues
-and blocks on completion instead of calling the target directly.
+Replace Phase 2's interim single-flight chat gate with the real SQLite queue and
+one worker, with drain-by-model scheduling. The synchronous passthrough now
+enqueues and blocks on completion instead of holding a direct gate.
+`/api/embed` stays exactly as Phase 2 built it — direct, concurrent, never
+queued (ADR-0013). Do not route embeddings through any of this.

 ## Tasks