# phase-2.md: Ollama target client, model poller, native passthrough

Re-ground: `CLAUDE.md` + ADR-0003 (API surface), 0007 (model polling), 0012 (streaming = NDJSON, not SSE), 0013 (two-slot residency + embedding bypass), 0002 (unreachable = transient). Plan, get approval, implement.

## Objective

Make foreman a working transparent front for its Ollama target: enough that `go-llm` can use the Mac as a target *today*, before any queue exists. (Phase 3 will move chat through the queue; here it proxies behind a single-flight gate.)

## Tasks

- `internal/ollama`: a small client to the target (`FOREMAN_OLLAMA_URL`) behind an interface, covering `POST /api/chat` (streaming and non-streaming), `POST /api/embed` (+ `/api/embeddings` alias), `GET /api/tags`, `GET /api/ps`. Attach the outbound bearer (`FOREMAN_OLLAMA_TOKEN`) if configured. Wrap errors; classify connection failures distinctly (Phase 3 needs that signal).
- Warm the embedder: on startup and after any reconnect-from-unreachable, issue a trivial `/api/embed` to `FOREMAN_EMBED_MODEL` so it occupies a resident slot (ADR-0013). The target must run `OLLAMA_MAX_LOADED_MODELS=2`; log a warning if `/api/ps` ever shows only one slot under load.
- Model poller (goroutine): poll `/api/tags` every `FOREMAN_POLL_INTERVAL` (default 30s) into an in-memory inventory with a mutex; track last-poll time and a degraded flag. On target unreachable, retain the last-known inventory and set degraded; do not clear it. Wire degraded state into `/healthz`.
- Passthrough handlers in `internal/server`:
  - `GET /api/tags` and `GET /api/ps` served from the poller/target.
  - `POST /api/embed` and `POST /api/embeddings`: proxy **directly and concurrently** to the target. These BYPASS the queue/worker gate entirely (ADR-0013). No serialization.
  - `POST /api/chat`: validate the requested model against the inventory (one re-poll on miss, then 4xx if still absent); proxy to the target. **Serialize worker-model access through a single in-flight gate (a buffered channel / mutex of 1)** so two concurrent chat requests never hit the worker slot at once. This preserves the serial invariant *before* the full queue exists. Phase 3 replaces this gate with the SQLite queue + worker loop. Stream faithfully as **NDJSON** (`Content-Type: application/x-ndjson`, chunks passed straight through; this is Ollama's native format, not SSE).
- Tests: a stub HTTP server standing in for Ollama; assert tags/ps proxy, model validation rejects unknown models, NDJSON streaming passes chunks through, **concurrent `/api/embed` calls run in parallel while `/api/chat` is serialized** (assert no two chats overlap at the stub), and the poller flips degraded on target failure and recovers (and re-warms the embedder).

## Definition of done

- `go build/vet/test -race` green.
- Against a real or stubbed Ollama: `curl .../api/tags` returns the inventory; a non-streaming and a streaming `/api/chat` both work end-to-end.
- Acceptance: from a scratch Go program, `llm.Ollama(llm.WithBaseURL("http://:8080"))` (or `llm.OllamaCloud(token, WithBaseURL(...))` if a token is set) completes a chat through foreman. Note this in `progress.md`.

Wrap up: `progress.md`, commit on `phase-2-passthrough`, note what Phase 3 changes (routing this through the queue).