docs: land prior ADR + prompt updates
Commit pre-existing uncommitted working-tree changes that predate the license/public-readiness work — NOT authored in this session, just flushed so they're not lost: ADR-0003/0005/0009/0012 edits, the new ADR-0013 (embeddings-bypass + two-slot residency, already referenced by CLAUDE.md), and the phase-0..3 prompt revisions + prompts/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
+24 -9
@@ -1,33 +1,48 @@
 # phase-2.md — Ollama target client, model poller, native passthrough
 
 Re-ground: `CLAUDE.md` + ADR-0003 (API surface), 0007 (model polling), 0012
-(streaming), 0002 (unreachable = transient). Plan, get approval, implement.
+(streaming = NDJSON, not SSE), 0013 (two-slot residency + embedding bypass),
+0002 (unreachable = transient). Plan, get approval, implement.
 
 ## Objective
 
 Make foreman a working transparent front for its Ollama target — enough that
 `go-llm` can use the Mac as a target *today*, before any queue exists. (Phase 3
-will move this through the queue; here it can proxy directly.)
+will move chat through the queue; here it proxies behind a single-flight gate.)
 
 ## Tasks
 
 - `internal/ollama`: a small client to the target (`FOREMAN_OLLAMA_URL`) behind
   an interface, covering `POST /api/chat` (streaming and non-streaming),
-  `GET /api/tags`, `GET /api/ps`. Attach the outbound bearer if configured. Wrap
-  errors; classify connection failures distinctly (Phase 3 needs that signal).
+  `POST /api/embed` (+ `/api/embeddings` alias), `GET /api/tags`, `GET /api/ps`.
+  Attach the outbound bearer (`FOREMAN_OLLAMA_TOKEN`) if configured. Wrap errors;
+  classify connection failures distinctly (Phase 3 needs that signal).
+- Warm the embedder: on startup and after any reconnect-from-unreachable, issue a
+  trivial `/api/embed` to `FOREMAN_EMBED_MODEL` so it occupies a resident slot
+  (ADR-0013). The target must run `OLLAMA_MAX_LOADED_MODELS=2`; log a warning if
+  `/api/ps` ever shows only one slot under load.
 - Model poller (goroutine): poll `/api/tags` every `FOREMAN_POLL_INTERVAL`
   (default 30s) into an in-memory inventory with a mutex; track last-poll time
   and a degraded flag. On target unreachable, retain last-known inventory and set
   degraded — do not clear it. Wire degraded state into `/healthz`.
 - Passthrough handlers in `internal/server`:
   - `GET /api/tags` and `GET /api/ps` served from the poller/target.
+  - `POST /api/embed` and `POST /api/embeddings`: proxy **directly and
+    concurrently** to the target — these BYPASS the queue/worker gate entirely
+    (ADR-0013). No serialization.
   - `POST /api/chat`: validate the requested model against the inventory (one
-    re-poll on miss, then 4xx if still absent); proxy to the target. Support
-    streaming faithfully (stream the target's chunks straight through; set the
-    right content type). For now this may call the target directly — no queue.
+    re-poll on miss, then 4xx if still absent); proxy to the target. **Serialize
+    worker-model access through a single in-flight gate (a buffered channel /
+    mutex of 1)** so two concurrent chat requests never hit the worker slot at
+    once — this preserves the serial invariant *before* the full queue exists.
+    Phase 3 replaces this gate with the SQLite queue + worker loop. Stream
+    faithfully as **NDJSON** (`Content-Type: application/x-ndjson`, chunks passed
+    straight through — Ollama's native format, not SSE).
 - Tests: a stub HTTP server standing in for Ollama; assert tags/ps proxy,
-  model validation rejects unknown models, streaming passes chunks through, and
-  the poller flips degraded on target failure and recovers.
+  model validation rejects unknown models, NDJSON streaming passes chunks
+  through, **concurrent `/api/embed` calls run in parallel while `/api/chat` is
+  serialized** (assert no two chats overlap at the stub), and the poller flips
+  degraded on target failure and recovers (and re-warms the embedder).
 
 ## Definition of done
 