docs: land prior ADR + prompt updates

Commit pre-existing uncommitted working-tree changes that predate the license/public-readiness work. They were NOT authored in this session, just flushed so they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013 (embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:33:39 -04:00
parent 823c0b4ca8
commit 0526bada90
10 changed files with 276 additions and 98 deletions
@@ -0,0 +1,61 @@
+# foreman build prompts
+
+This directory drives the autonomous Claude Code build of foreman.
+
+## How to run
+
+Start with **`phase-0-kickoff.md`** — it is the master driver. Paste it (or the
+command below) into Claude Code from the repo root. It reads `CLAUDE.md`, the
+ADRs in `docs/adr/`, and the `go-llm` / `steveternet` sources, then runs phases
+1 → 6 **autonomously** to a finished, deployable foreman. You do **not** paste the
+individual phase files — the kickoff reads them as it goes.
+
+Kickoff command:
+
+```
+Read and follow prompts/phase-0-kickoff.md. The per-phase specs it references
+(prompts/phase-1.md … phase-6.md) are in this same prompts/ directory. This is a
+fully autonomous run: execute all six phases in order to a finished, working
+deliverable without pausing between them. Honor docs/adr/ (note the new 0013) and
+CLAUDE.md as source of truth. For the two cross-repo changes (llm.Foreman() in
+steve/go-llm and the docker-compose in steve/steveternet), open a branch and PR
+on each for my review — do not commit to their main. When done, report what each
+phase built, the PR links, any ADRs you added, and a smoke-test checklist.
+```
+
+## What each phase produces
+
+1. `phase-1` — scaffold, config, SQLite store, health, CI, Dockerfile.
+2. `phase-2` — Ollama client + model poller + native passthrough + embedding
+   bypass. The Mac is usable as a `go-llm` target after this.
+3. `phase-3` — durable queue + single worker + drain-by-model. **M0 complete.**
+4. `phase-4` — async `/jobs` + job IDs + state webhooks + artifacts. The headline
+   queue-and-webhooks capability.
+5. `phase-5` — Go client package (sync facade) + `llm.Foreman()` in `go-llm`.
+6. `phase-6` — deploy: steveternet compose + Traefik, `.env.example`, deploy
+   docs, model-pull script.
+
+## Conventions during the run
+
+- Each phase must pass the gates before continuing: `go build ./...`,
+  `go vet ./...`, `go test -race -count=1 ./...`, and `go mod tidy` +
+  `git diff --exit-code go.mod go.sum`.
+- Commits go to foreman's `main` with conventional-commit messages; `progress.md`
+  gets a dated entry each phase.
+- A decision not covered by `CLAUDE.md` or an ADR → record a new ADR
+  (append-only, next number after 0013) and continue.
+
+## Cross-repo caveat (phases 5 & 6)
+
+The `llm.Foreman()` constructor lives in **steve/go-llm** and the deploy compose
+in **steve/steveternet**. Those changes go on a **branch + PR for review — never
+their main.**
+
+## Known caveat (phase 6)
+
+Phase 6 mirrors the Traefik/compose conventions from sibling services
+(`kalimdor/orgrimmar/warhol-queue`, `ratchet`, `mort`) read via the gitea MCP.
+Those reads were intermittently erroring during planning. **If phase 6 can't read
+them, paste a sibling `docker-compose.yml` (e.g. ratchet's) straight into the
+session** so it can mirror the network name, entrypoint, certresolver, and
+router/service labels rather than inventing them.
@@ -1,75 +1,108 @@
-# phase-0-kickoff.md — foreman build kickoff
+# phase-0-kickoff.md — foreman autonomous build

-You are building **foreman**, a Go daemon that fronts one Ollama target and turns
-it into a queued, observable, OpenAI/Ollama-compatible job endpoint. This is a
-deliberately pared-down restart of a system (`peon-overseer`) that died of scope
-creep. Restraint is a feature, not a limitation.
+You are building **foreman** end to end, in **one autonomous run**. Execute all
+six phases in order (1 → 2 → 3 → 4 → 5 → 6) and do not stop between them. The run
+ends when foreman is a working, deployable deliverable. Do not wait for my
+approval at phase boundaries — keep going until done or genuinely blocked.
+
+foreman is a Go daemon that fronts one Ollama target and turns it into a queued,
+observable, Ollama-compatible job endpoint. It is a deliberately pared-down
+restart of a system (`peon-overseer`) that died of scope creep. Restraint is a
+feature: if a task seems to need distributed dispatch, leases, fair queueing,
+capacity budgets, an auth framework/SSO, a GUI, or multi-target support — stop,
+because that means the design is being violated.

 ## Read these first (authoritative, in order)

-1. `CLAUDE.md` in this repo — the operating manual. It is the source of truth for
-   architecture, stack, conventions, and the **out-of-scope guardrails**.
-2. `docs/adr/README.md` then every `docs/adr/00NN-*.md`. The ADRs are the *why*.
-   Do not relitigate them; if you believe one is wrong, say so and propose a new
-   superseding ADR rather than silently diverging.
-3. Via the **gitea MCP**, read the integration target — `steve/go-llm`:
-   `v2/provider/provider.go` (the `Provider` interface you must stay compatible
-   with), `v2/ollama/ollama.go` and `v2/constructors.go` (how `Ollama` /
-   `OllamaCloud` construct over native `/api/chat` + Bearer), and `v2/CLAUDE.md`
+1. `CLAUDE.md` — the operating manual and source of truth.
+2. `docs/adr/README.md`, then every `docs/adr/00NN-*.md` (0001–0013). The ADRs are
+   the *why*. Do not relitigate them.
+3. Via the **gitea MCP**, `steve/go-llm`: `v2/provider/provider.go` (the
+   `Provider` interface), `v2/ollama/ollama.go` + `v2/ollama/native.go` +
+   `v2/constructors.go` (native `/api/chat` + Bearer + base URL), `v2/CLAUDE.md`
   (DD#8: native API, not OpenAI-compat).
-4. Via the gitea MCP, study deployment conventions in `steve/steveternet`:
-   `kalimdor/orgrimmar/warhol-queue/`, `kalimdor/orgrimmar/ratchet/`, and
-   `kalimdor/orgrimmar/mort/` for `docker-compose.yml` + `.env.example` patterns,
-   and `kalimdor/orgrimmar/traefik/` (incl. `custom/`) for the Traefik network
-   name, entrypoint, certresolver, and router/label conventions. foreman will
-   live at `kalimdor/orgrimmar/foreman/`. **Mirror these exactly; do not invent
-   label syntax.**
+4. Via the gitea MCP, `steve/steveternet`: `kalimdor/orgrimmar/warhol-queue/`,
+   `kalimdor/orgrimmar/ratchet/`, `kalimdor/orgrimmar/mort/`, and
+   `kalimdor/orgrimmar/traefik/` (incl. `custom/`) for compose/Traefik/network
+   conventions. foreman lives at `kalimdor/orgrimmar/foreman/`. Mirror these
+   exactly; do not invent label syntax.

-## Working agreement (opusplan)
+## The phases

- **Plan before code.** For each phase, produce a plan and wait for my approval
-  before implementing. Do not run ahead to later phases.
- **One phase at a time**, in order. Each phase is its own prompt I will paste.
- After every phase: `go build ./...`, `go vet ./...`, `go test -race -count=1 ./...`
-  must all pass. Append a dated entry to `progress.md`. Commit on a phase branch
-  with conventional-commit messages (`feat:`, `chore:`, `test:`, `docs:`).
- **Ask before assuming.** If a detail is ambiguous and not settled by CLAUDE.md
-  or an ADR, ask me — don't guess.
- **Propose an ADR** (append-only, next number) for any architectural decision
-  not already covered. Keep `docs/adr/README.md`'s index current.
- Keep dependencies minimal; match `go-llm` house style (tabs; wrap errors with
-  `fmt.Errorf("%w: ...", err)`; imports stdlib → third-party → internal). SQLite
-  via `modernc.org/sqlite` (pure-Go, `CGO_ENABLED=0`). No UI.
- **Refuse scope creep.** No distributed dispatch, leases, fair queueing,
-  capacity budgets, auth framework/SSO, GUI, or multi-target support. If a task
-  seems to need them, stop and flag it — that means the design is being violated.
+Each `prompts/phase-N.md` is the detailed spec for that phase. For each phase, in
+order: read `phase-N.md`, plan it internally, implement it, make the gates pass,
+record progress, commit, then immediately continue to the next phase.

-## Definition of done (whole project)
-
-A deployable daemon that:
- fronts one configurable Ollama target and transparently proxies native
-  `/api/chat`, `/api/tags`, `/api/ps` (so `go-llm` uses the Mac as a target with
-  no provider changes), including streaming;
- runs a durable SQLite-backed queue with a single worker and drain-by-model
-  scheduling, surviving restarts and target sleep;
- exposes an async `POST /jobs` surface returning a job ID, with
-  `queued→loading→working→done/failed` state webhooks and artifact delivery;
- ships a Go client package (synchronous facade over the async surface);
- passes CI on Gitea, builds as a container, and deploys via a steveternet
-  `docker-compose.yml` behind Traefik.
-
-## Phase map
+**Override:** the phase files open with "Plan, get approval, implement" — that was
+written for a paste-one-at-a-time workflow. In *this* autonomous run, treat it as
+"plan internally and proceed." Do not pause for approval at any phase boundary.

 1. Scaffold, config, SQLite store, health, CI, Dockerfile.
-2. Ollama target client + model poller + native passthrough (the go-llm target).
-3. Durable queue + single worker + drain-by-model.
+2. Ollama target client + model poller + native passthrough + embedding bypass.
+3. Durable queue + single worker + drain-by-model (replaces phase-2's chat gate).
 4. Async `/jobs` + job IDs + state webhooks + artifacts.
 5. Go client package (sync facade) + `llm.Foreman()` in go-llm.
-6. Deploy: steveternet compose + Traefik, `.env.example`, deploy docs, model-pull script.
+6. Deploy: steveternet compose + Traefik, `.env.example`, deploy docs, model script.

-## Your task right now
+## Per-phase loop (do this every phase, automatically)

-Confirm you've read the sources above, briefly restate the architecture in your
-own words (so I can check your understanding), flag anything in the ADRs you'd
-push back on, then produce a **detailed plan for Phase 1 only**. Do not write code
-yet. Stop for my approval.
+- Implement to the phase spec and the ADRs.
+- Run the gates; **all** must pass before moving on:
+  `go build ./...`, `go vet ./...`, `go test -race -count=1 ./...`, and
+  `go mod tidy` followed by `git diff --exit-code go.mod go.sum`.
+- Append a dated entry to `progress.md` (what landed, what's next).
+- Commit to the **foreman** repo with conventional-commit messages
+  (`feat:`, `test:`, `chore:`, `docs:`). Committing to foreman's main is fine.
+- Continue to the next phase without pausing.
+
+## Invariants to honor throughout (from the ADRs)
+
+- **Two-slot runtime (ADR-0013):** the target runs `OLLAMA_MAX_LOADED_MODELS=2` —
+  an always-resident embedder (`FOREMAN_EMBED_MODEL`) plus one rotating worker
+  model. `/api/embed` (+ `/api/embeddings`) bypass the queue and run
+  concurrently; only `/api/chat` and `POST /jobs` are serialized through the
+  single worker. Worker-model concurrency is exactly 1 (ADR-0009).
+- **NDJSON, not SSE (ADR-0012):** stream `/api/chat` as `application/x-ndjson`.
+- **Env namespacing:** every config key is `FOREMAN_*` (incl.
+  `FOREMAN_OLLAMA_URL`, `FOREMAN_OLLAMA_TOKEN`). No bare `OLLAMA_*`.
+- **Go 1.26** in `go.mod`, Dockerfile, and CI.
+- Unreachable target = transient/recoverable, never fatal (ADR-0002).
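As an illustration of the env-namespacing invariant above, here is a minimal sketch of a loader that reads only `FOREMAN_*` keys. The `Config` field names, the `load` and `orDefault` helpers, and the duration-string format for `FOREMAN_POLL_INTERVAL` are assumptions made for this example, not taken from the foreman source; the `:8080` and `30s` defaults come from the phase specs.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config mirrors the keys named in the phase-1 spec. Field names are
// hypothetical and chosen for this sketch only.
type Config struct {
	Addr         string
	OllamaURL    string
	OllamaToken  string
	Token        string
	EmbedModel   string
	DBPath       string
	PollInterval time.Duration
}

// orDefault returns def when v is empty.
func orDefault(v, def string) string {
	if v == "" {
		return def
	}
	return v
}

// load reads config through a lookup function so it can be exercised
// without touching the process environment; pass os.Getenv in real use.
// Only FOREMAN_* keys are consulted, never bare OLLAMA_* ones.
func load(getenv func(string) string) (Config, error) {
	cfg := Config{
		Addr:        orDefault(getenv("FOREMAN_ADDR"), ":8080"),
		OllamaURL:   getenv("FOREMAN_OLLAMA_URL"),
		OllamaToken: getenv("FOREMAN_OLLAMA_TOKEN"),
		Token:       getenv("FOREMAN_TOKEN"),
		EmbedModel:  getenv("FOREMAN_EMBED_MODEL"),
		DBPath:      getenv("FOREMAN_DB_PATH"),
	}
	if cfg.OllamaURL == "" {
		return Config{}, errors.New("FOREMAN_OLLAMA_URL is required")
	}
	// Assumption: the interval is written as a Go duration string ("30s").
	d, err := time.ParseDuration(orDefault(getenv("FOREMAN_POLL_INTERVAL"), "30s"))
	if err != nil {
		return Config{}, fmt.Errorf("parse FOREMAN_POLL_INTERVAL: %w", err)
	}
	cfg.PollInterval = d
	return cfg, nil
}

func main() {
	// Placeholder target URL for the demo.
	env := map[string]string{"FOREMAN_OLLAMA_URL": "http://localhost:11434"}
	cfg, err := load(func(k string) string { return env[k] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(cfg.Addr, cfg.OllamaURL, cfg.PollInterval)
}
```

Injecting the lookup function keeps the required-key check and the defaults testable without mutating the real environment.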
+
+## Cross-repo changes (phases 5 and 6)
+
+The `llm.Foreman()` constructor (go-llm) and the steveternet `docker-compose.yml`
+touch repos other than foreman. For those, **open a branch and a PR for my
+review — do NOT commit to their main.** Report the branch names and PR links in
+the final summary.
+
+## When to stop vs. keep going
+
+- Keep going through routine ambiguity. If you hit a decision not covered by
+  `CLAUDE.md` or an ADR, make the smallest reasonable choice, **record it as a new
+  ADR** (append-only, next number after 0013, update the index), and continue.
+- **Only stop** for a true blocker: a gate you cannot make green after honest
+  effort, a repo/tool you cannot reach, or a required choice that would
+  contradict an accepted ADR or a scope guardrail. If you stop, say exactly why
+  and what you need.
+
+## Definition of done (whole run)
+
+- foreman fronts one configurable Ollama target; transparently proxies native
+  `/api/chat`, `/api/tags`, `/api/ps` (NDJSON streaming) so go-llm uses it as a
+  target with no provider changes; `/api/embed` bypasses the queue concurrently.
+- Durable SQLite queue, single worker, drain-by-model; survives restart and
+  target sleep.
+- `POST /jobs` returns a ULID job id; `queued→loading→working→done|failed` state
+  webhooks (at-least-once, optional HMAC); artifacts inline/fetch.
+- A Go client package (sync facade over `/jobs`); `llm.Foreman()` branch/PR on
+  go-llm.
+- CI green; container builds; steveternet compose + Traefik branch/PR.
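The "optional HMAC" on state webhooks could look roughly like the sketch below. The source does not say which hash, encoding, or header is used, so HMAC-SHA256 with a hex-encoded signature is an assumption here, and the `sign`/`verify` helpers and the JSON payload shape are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns a hex-encoded HMAC-SHA256 of body under secret.
// Assumption: foreman's real scheme may use a different hash or encoding.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the MAC and compares it in constant time.
func verify(secret, body []byte, sig string) bool {
	got, err := hex.DecodeString(sig)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(mac.Sum(nil), got)
}

func main() {
	secret := []byte("shared-secret")
	// Hypothetical webhook payload; the real field names are not given.
	body := []byte(`{"job_id":"example","state":"done"}`)

	sig := sign(secret, body)
	fmt.Println(verify(secret, body, sig))
	fmt.Println(verify(secret, []byte(`{"job_id":"example","state":"failed"}`), sig))
}
```

Because delivery is at-least-once, a receiver should expect the same state event more than once and treat repeats as harmless.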
+
+## Start now
+
+Read the sources, then begin Phase 1 and run straight through to a finished
+deliverable. When done, give me: a summary of what was built per phase, the
+go-llm and steveternet PR links, any ADRs you added, and a copy-pasteable
+end-to-end smoke-test checklist (pull models on the Mac → set
+`OLLAMA_MAX_LOADED_MODELS=2` → deploy foreman → go-llm chat → concurrent
+`/api/embed` → `POST /jobs` with a webhook).
@@ -16,9 +16,13 @@ health endpoint — no Ollama logic yet.
  `internal/store`, `internal/server`. Don't create empty packages for later
  phases.
 - `internal/config`: load from env into a struct — `FOREMAN_ADDR` (listen addr,
-  default `:8080`), `FOREMAN_OLLAMA_URL` (target, required), `FOREMAN_TOKEN`
-  (optional inbound bearer), `FOREMAN_DB_PATH`, `FOREMAN_POLL_INTERVAL`. Provide
-  a `.env.example` documenting every key.
+  default `:8080`), `FOREMAN_OLLAMA_URL` (target, required), `FOREMAN_OLLAMA_TOKEN`
+  (optional outbound bearer to the target, for Ollama-Cloud-style auth),
+  `FOREMAN_TOKEN` (optional inbound bearer foreman requires of its callers),
+  `FOREMAN_EMBED_MODEL` (the always-resident embedder, e.g. `nomic-embed-text`),
+  `FOREMAN_DB_PATH`, `FOREMAN_POLL_INTERVAL`. Namespace **every** key under
+  `FOREMAN_` (do not use bare `OLLAMA_*`, which collide with real Ollama client
+  vars). Provide a `.env.example` documenting every key.
 - `internal/store`: SQLite via `modernc.org/sqlite`, WAL mode, with an embedded
  migration for the `jobs` and `artifacts` tables (schema sketch in ADR-0008 /
  ADR-0006). Include open/close, migrate-on-start, and basic CRUD with tests
@@ -1,33 +1,48 @@
 # phase-2.md — Ollama target client, model poller, native passthrough

 Re-ground: `CLAUDE.md` + ADR-0003 (API surface), 0007 (model polling), 0012
-(streaming), 0002 (unreachable = transient). Plan, get approval, implement.
+(streaming = NDJSON, not SSE), 0013 (two-slot residency + embedding bypass),
+0002 (unreachable = transient). Plan, get approval, implement.

 ## Objective

 Make foreman a working transparent front for its Ollama target — enough that
 `go-llm` can use the Mac as a target *today*, before any queue exists. (Phase 3
-will move this through the queue; here it can proxy directly.)
+will move chat through the queue; here it proxies behind a single-flight gate.)
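The single-flight gate mentioned above can be sketched as a buffered channel of capacity 1. This is only an illustration of the idea: the `gate` type and the `maxOverlap` helper are hypothetical names, and the fake "chat" work stands in for a proxied `/api/chat` call.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// gate admits one holder at a time: a buffered channel of capacity 1.
type gate chan struct{}

func newGate() gate { return make(gate, 1) }

// acquire blocks until the single slot is free.
func (g gate) acquire() { g <- struct{}{} }

// release frees the slot for the next waiter.
func (g gate) release() { <-g }

// maxOverlap pushes n fake chat calls through g concurrently and
// reports the peak number that were inside the gate at once.
func maxOverlap(g gate, n int) int32 {
	var inFlight, peak atomic.Int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.acquire()
			defer g.release()

			cur := inFlight.Add(1)
			// Record the highest concurrency ever observed.
			for {
				p := peak.Load()
				if cur <= p || peak.CompareAndSwap(p, cur) {
					break
				}
			}
			inFlight.Add(-1)
		}()
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	fmt.Println("peak concurrent chats:", maxOverlap(newGate(), 100))
	// prints "peak concurrent chats: 1"
}
```

A real handler would probably also select on the request context while waiting, so a disconnected client does not hold a place in line. Embedding requests would simply never touch the gate.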

 ## Tasks

 - `internal/ollama`: a small client to the target (`FOREMAN_OLLAMA_URL`) behind
  an interface, covering `POST /api/chat` (streaming and non-streaming),
-  `GET /api/tags`, `GET /api/ps`. Attach the outbound bearer if configured. Wrap
-  errors; classify connection failures distinctly (Phase 3 needs that signal).
+  `POST /api/embed` (+ `/api/embeddings` alias), `GET /api/tags`, `GET /api/ps`.
+  Attach the outbound bearer (`FOREMAN_OLLAMA_TOKEN`) if configured. Wrap errors;
+  classify connection failures distinctly (Phase 3 needs that signal).
+- Warm the embedder: on startup and after any reconnect-from-unreachable, issue a
+  trivial `/api/embed` to `FOREMAN_EMBED_MODEL` so it occupies a resident slot
+  (ADR-0013). The target must run `OLLAMA_MAX_LOADED_MODELS=2`; log a warning if
+  `/api/ps` ever shows only one slot under load.
 - Model poller (goroutine): poll `/api/tags` every `FOREMAN_POLL_INTERVAL`
  (default 30s) into an in-memory inventory with a mutex; track last-poll time
  and a degraded flag. On target unreachable, retain last-known inventory and set
  degraded — do not clear it. Wire degraded state into `/healthz`.
 - Passthrough handlers in `internal/server`:
  - `GET /api/tags` and `GET /api/ps` served from the poller/target.
+  - `POST /api/embed` and `POST /api/embeddings`: proxy **directly and
+    concurrently** to the target — these BYPASS the queue/worker gate entirely
+    (ADR-0013). No serialization.
  - `POST /api/chat`: validate the requested model against the inventory (one
-    re-poll on miss, then 4xx if still absent); proxy to the target. Support
-    streaming faithfully (stream the target's chunks straight through; set the
-    right content type). For now this may call the target directly — no queue.
+    re-poll on miss, then 4xx if still absent); proxy to the target. **Serialize
+    worker-model access through a single in-flight gate (a buffered channel of
+    capacity 1, or a mutex)** so two concurrent chat requests never hit the worker
+    slot at once. This preserves the serial invariant *before* the full queue exists.
+    Phase 3 replaces this gate with the SQLite queue + worker loop. Stream
+    faithfully as **NDJSON** (`Content-Type: application/x-ndjson`, chunks passed
+    straight through — Ollama's native format, not SSE).
 - Tests: a stub HTTP server standing in for Ollama; assert tags/ps proxy,
-  model validation rejects unknown models, streaming passes chunks through, and
-  the poller flips degraded on target failure and recovers.
+  model validation rejects unknown models, NDJSON streaming passes chunks
+  through, **concurrent `/api/embed` calls run in parallel while `/api/chat` is
+  serialized** (assert no two chats overlap at the stub), and the poller flips
+  degraded on target failure and recovers (and re-warms the embedder).
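The poller's "retain last-known inventory, set degraded" rule from the task list above can be sketched as a small mutex-guarded struct. The type and method names are hypothetical, the ticker goroutine and the HTTP call to `/api/tags` are left out, and treating the last-poll time as "last successful poll" is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// inventory is the in-memory model list kept by the poller.
type inventory struct {
	mu       sync.Mutex
	models   map[string]bool
	lastPoll time.Time // assumption: time of the last successful poll
	degraded bool
}

// recordSuccess replaces the inventory and clears the degraded flag.
func (inv *inventory) recordSuccess(models []string) {
	set := make(map[string]bool, len(models))
	for _, m := range models {
		set[m] = true
	}
	inv.mu.Lock()
	defer inv.mu.Unlock()
	inv.models = set
	inv.lastPoll = time.Now()
	inv.degraded = false
}

// recordFailure marks the target unreachable but keeps the
// last-known inventory rather than clearing it.
func (inv *inventory) recordFailure() {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	inv.degraded = true
}

// has reports whether model is in the last-known inventory.
func (inv *inventory) has(model string) bool {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	return inv.models[model]
}

// isDegraded is what a health handler would consult.
func (inv *inventory) isDegraded() bool {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	return inv.degraded
}

func main() {
	inv := &inventory{}
	inv.recordSuccess([]string{"model-a"}) // placeholder model name
	inv.recordFailure()
	fmt.Println(inv.has("model-a"), inv.isDegraded()) // prints "true true"

	inv.recordSuccess([]string{"model-a", "model-b"})
	fmt.Println(inv.has("model-b"), inv.isDegraded()) // prints "true false"
}
```

The recovery path in `recordSuccess` is also where a re-warm of the embedder could be triggered when the flag flips from degraded back to healthy.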

 ## Definition of done

@@ -1,13 +1,16 @@
 # phase-3.md — Durable queue, single worker, drain-by-model

-Re-ground: `CLAUDE.md` + ADR-0009 (single worker / drain-by-model), 0008 (queue),
-0004 (lifecycle/retry). Plan, get approval, implement.
+Re-ground: `CLAUDE.md` + ADR-0009 (single worker / drain-by-model), 0013
+(embeddings bypass — they must NOT be touched here), 0008 (queue), 0004
+(lifecycle/retry). Plan, get approval, implement.

 ## Objective

-Route execution through the SQLite queue with exactly one worker and
-drain-by-model scheduling. The synchronous passthrough from Phase 2 now enqueues
-and blocks on completion instead of calling the target directly.
+Replace Phase 2's interim single-flight chat gate with the real SQLite queue and
+one worker, with drain-by-model scheduling. The synchronous passthrough now
+enqueues and blocks on completion instead of holding a direct gate.
+`/api/embed` stays exactly as Phase 2 built it — direct, concurrent, never
+queued (ADR-0013). Do not route embeddings through any of this.
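One plausible reading of "drain-by-model" is sketched below: keep serving jobs for the model already loaded in the worker slot, and switch models only when none remain. ADR-0009 is not shown here, so this policy, the `job` type, and the `next`/`drainOrder` helpers are all assumptions made for illustration; the model names are placeholders.

```go
package main

import "fmt"

// job is a hypothetical queued request; pending slices are oldest-first.
type job struct {
	ID    string
	Model string
}

// next prefers the oldest job for the currently loaded model and falls
// back to the oldest job overall, which forces a model switch.
func next(pending []job, loaded string) (job, bool) {
	if len(pending) == 0 {
		return job{}, false
	}
	for _, j := range pending {
		if j.Model == loaded {
			return j, true
		}
	}
	return pending[0], true
}

// drainOrder simulates a single worker emptying the queue and returns
// the job IDs in the order they would run.
func drainOrder(queue []job) []string {
	pending := append([]job(nil), queue...) // copy; leave the caller's slice alone
	var order []string
	loaded := ""
	for {
		j, ok := next(pending, loaded)
		if !ok {
			return order
		}
		loaded = j.Model
		order = append(order, j.ID)
		for i := range pending {
			if pending[i].ID == j.ID {
				pending = append(pending[:i], pending[i+1:]...)
				break
			}
		}
	}
}

func main() {
	queue := []job{
		{ID: "a", Model: "model-x"},
		{ID: "b", Model: "model-y"},
		{ID: "c", Model: "model-x"},
	}
	fmt.Println(drainOrder(queue)) // prints "[a c b]"
}
```

In the demo, job `c` runs before `b` because it shares the model that `a` already loaded, so the worker pays for one model swap instead of two.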

 ## Tasks