phase-2.md — Ollama target client, model poller, native passthrough
Re-ground: CLAUDE.md + ADR-0003 (API surface), 0007 (model polling), 0012
(streaming = NDJSON, not SSE), 0013 (two-slot residency + embedding bypass),
0002 (unreachable = transient). Plan, get approval, implement.
Objective
Make foreman a working transparent front for its Ollama target — enough that
go-llm can use the Mac as a target today, before any queue exists. (Phase 3
will move chat through the queue; here it proxies behind a single-flight gate.)
Tasks
- internal/ollama: a small client to the target (FOREMAN_OLLAMA_URL) behind an
  interface, covering POST /api/chat (streaming and non-streaming),
  POST /api/embed (+ /api/embeddings alias), GET /api/tags, GET /api/ps. Attach
  the outbound bearer (FOREMAN_OLLAMA_TOKEN) if configured. Wrap errors;
  classify connection failures distinctly (Phase 3 needs that signal).
- Warm the embedder: on startup and after any reconnect-from-unreachable, issue
  a trivial /api/embed to FOREMAN_EMBED_MODEL so it occupies a resident slot
  (ADR-0013). The target must run OLLAMA_MAX_LOADED_MODELS=2; log a warning if
  /api/ps ever shows only one slot under load.
- Model poller (goroutine): poll /api/tags every FOREMAN_POLL_INTERVAL (default
  30s) into an in-memory inventory with a mutex; track last-poll time and a
  degraded flag. On target unreachable, retain last-known inventory and set
  degraded; do not clear it. Wire degraded state into /healthz.
- Passthrough handlers in internal/server:
  - GET /api/tags and GET /api/ps served from the poller/target.
  - POST /api/embed and POST /api/embeddings: proxy directly and concurrently
    to the target. These BYPASS the queue/worker gate entirely (ADR-0013). No
    serialization.
  - POST /api/chat: validate the requested model against the inventory (one
    re-poll on miss, then 4xx if still absent); proxy to the target. Serialize
    worker-model access through a single in-flight gate (a buffered channel /
    mutex of 1) so two concurrent chat requests never hit the worker slot at
    once. This preserves the serial invariant before the full queue exists.
    Phase 3 replaces this gate with the SQLite queue + worker loop. Stream
    faithfully as NDJSON (Content-Type: application/x-ndjson, chunks passed
    straight through; this is Ollama's native format, not SSE).
- Tests: a stub HTTP server standing in for Ollama; assert tags/ps proxy,
  model validation rejects unknown models, NDJSON streaming passes chunks
  through, concurrent /api/embed calls run in parallel while /api/chat is
  serialized (assert no two chats overlap at the stub), and the poller flips
  degraded on target failure and recovers (and re-warms the embedder).
Definition of done
- go build/vet/test -race green.
- Against a real or stubbed Ollama: curl .../api/tags returns the inventory; a
  non-streaming and a streaming /api/chat both work end-to-end.
- Acceptance: from a scratch Go program,
  llm.Ollama(llm.WithBaseURL("http://<foreman>:8080")) (or
  llm.OllamaCloud(token, WithBaseURL(...)) if a token is set) completes a chat
  through foreman. Note this in progress.md.
Wrap up: progress.md, commit on phase-2-passthrough, note what Phase 3 changes
(routing this through the queue).