foreman/prompts/phase-2.md
steve 0526bada90 docs: land prior ADR + prompt updates
Commit pre-existing uncommitted working-tree changes that predate the
license/public-readiness work — NOT authored in this session, just flushed so
they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013
(embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and
the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:33:39 -04:00

phase-2.md — Ollama target client, model poller, native passthrough

Re-ground: CLAUDE.md + ADR-0003 (API surface), 0007 (model polling), 0012 (streaming = NDJSON, not SSE), 0013 (two-slot residency + embedding bypass), 0002 (unreachable = transient). Plan, get approval, implement.

Objective

Make foreman a working transparent front for its Ollama target — enough that go-llm can use the Mac as a target today, before any queue exists. (Phase 3 will move chat through the queue; here it proxies behind a single-flight gate.)

Tasks

  • internal/ollama: a small client to the target (FOREMAN_OLLAMA_URL) behind an interface, covering POST /api/chat (streaming and non-streaming), POST /api/embed (+ /api/embeddings alias), GET /api/tags, GET /api/ps. Attach the outbound bearer (FOREMAN_OLLAMA_TOKEN) if configured. Wrap errors; classify connection failures distinctly (Phase 3 needs that signal).
  • Warm the embedder: on startup and after any reconnect-from-unreachable, issue a trivial /api/embed to FOREMAN_EMBED_MODEL so it occupies a resident slot (ADR-0013). The target must run with OLLAMA_MAX_LOADED_MODELS=2; log a warning if /api/ps ever shows only one model loaded under load.
  • Model poller (goroutine): poll /api/tags every FOREMAN_POLL_INTERVAL (default 30s) into an in-memory inventory guarded by a mutex; track the last-poll time and a degraded flag. When the target is unreachable, retain the last-known inventory (do not clear it) and set degraded. Wire the degraded state into /healthz.
  • Passthrough handlers in internal/server:
    • GET /api/tags and GET /api/ps served from the poller/target.
    • POST /api/embed and POST /api/embeddings: proxy directly and concurrently to the target — these BYPASS the queue/worker gate entirely (ADR-0013). No serialization.
    • POST /api/chat: validate the requested model against the inventory (re-poll once on a miss, then return a 4xx if it is still absent), then proxy to the target. Serialize worker-model access through a single in-flight gate (a buffered channel of capacity 1, or a mutex) so that two concurrent chat requests never hit the worker slot at once; this preserves the serial invariant before the full queue exists. Phase 3 replaces this gate with the SQLite queue + worker loop. Stream faithfully as NDJSON: Content-Type: application/x-ndjson, with chunks passed straight through in Ollama's native format, not SSE.
  • Tests: a stub HTTP server standing in for Ollama; assert tags/ps proxy, model validation rejects unknown models, NDJSON streaming passes chunks through, concurrent /api/embed calls run in parallel while /api/chat is serialized (assert no two chats overlap at the stub), and the poller flips degraded on target failure and recovers (and re-warms the embedder).

Definition of done

  • go build, go vet, and go test -race are all green.
  • Against a real or stubbed Ollama: curl .../api/tags returns the inventory; a non-streaming and a streaming /api/chat both work end-to-end.
  • Acceptance: from a scratch Go program, llm.Ollama(llm.WithBaseURL("http://<foreman>:8080")) (or llm.OllamaCloud(token, llm.WithBaseURL(...)) if a token is set) completes a chat through foreman. Record the result in progress.md.

Wrap up: update progress.md, commit on phase-2-passthrough, and note what Phase 3 changes (routing chat through the queue).