From 8fde0242811db2cb3a9932343e1661a3d32ced85 Mon Sep 17 00:00:00 2001
From: Steve Dudenhoeffer
Date: Sat, 23 May 2026 16:41:20 -0400
Subject: [PATCH] initial commit

---
 .gitignore                                    |  27 ++++
 CLAUDE.md                                     | 144 ++++++++++++++++++
 docs/adr/0001-one-daemon-per-target.md        |  37 +++++
 docs/adr/0002-daemon-placement.md             |  36 +++++
 docs/adr/0003-api-surface.md                  |  51 +++++++
 docs/adr/0004-async-job-surface.md            |  52 +++++++
 docs/adr/0005-webhook-protocol.md             |  63 ++++++++
 docs/adr/0006-artifact-handling.md            |  53 +++++++
 docs/adr/0007-model-polling.md                |  48 ++++++
 docs/adr/0008-sqlite-queue.md                 |  42 +++++
 docs/adr/0009-single-worker-drain-by-model.md |  44 ++++++
 docs/adr/0010-auth-and-security.md            |  51 +++++++
 .../0011-go-client-and-go-llm-integration.md  |  73 +++++++++
 docs/adr/0012-streaming.md                    |  41 +++++
 docs/adr/README.md                            |  41 +++++
 15 files changed, 803 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 CLAUDE.md
 create mode 100644 docs/adr/0001-one-daemon-per-target.md
 create mode 100644 docs/adr/0002-daemon-placement.md
 create mode 100644 docs/adr/0003-api-surface.md
 create mode 100644 docs/adr/0004-async-job-surface.md
 create mode 100644 docs/adr/0005-webhook-protocol.md
 create mode 100644 docs/adr/0006-artifact-handling.md
 create mode 100644 docs/adr/0007-model-polling.md
 create mode 100644 docs/adr/0008-sqlite-queue.md
 create mode 100644 docs/adr/0009-single-worker-drain-by-model.md
 create mode 100644 docs/adr/0010-auth-and-security.md
 create mode 100644 docs/adr/0011-go-client-and-go-llm-integration.md
 create mode 100644 docs/adr/0012-streaming.md
 create mode 100644 docs/adr/README.md

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..5d2f263
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,27 @@
+# Compiled binary (cmd/foreman)
+/foreman
+/dist/
+*.exe
+
+# Test & coverage output
+*.out
+*.test
+coverage.*
+
+# SQLite queue + artifacts (local dev data — never commit)
+*.db
+*.db-wal
+*.db-shm
+*.sqlite
+*.sqlite3
+
+# Local config / secrets (commit
.env.example, not .env) +.env +.env.local +*.local + +# Editor / OS cruft +.DS_Store +.idea/ +.vscode/ +*.swp diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..e8c1545 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,144 @@ +# foreman + +A small, always-on daemon that fronts **one** Ollama target. It turns a single +Ollama instance into a queued, observable job endpoint: it polls the target's +installed models, serializes work through the target (managing model swaps), +assigns every job an ID, and reports progress + artifacts via webhooks. On the +wire it speaks **native Ollama**, so it doubles as a drop-in `go-llm` target. + +foreman is the deliberately pared-down successor to `peon-overseer`. One daemon, +one target, one queue. The complexity that sank the predecessor — distributed +dispatch, claim leases, weighted fair queueing, capacity budgets, eligibility +gates — existed to coordinate *multiple* workers and is **out of scope**. +Resisting that creep is a first-class design goal. See `docs/adr/` for the +decisions; this file summarizes them. + +## Topology (ADR-0001, ADR-0002) + +``` +orgrimmar: foreman (Go binary + SQLite queue + HTTP API + worker loop) + | HTTP over the trusted VLAN / Tailscale + v +M1 Pro Mac: Ollama only (models on disk, no foreman logic) +``` + +- One foreman process per Ollama target, configured by a single base URL + (default: the Mac's Tailscale address). A second worker = a second foreman. +- foreman runs on the homelab, containerized, deployed via Komodo. The Mac stays + a dumb appliance. +- The target is a laptop and may sleep. Unreachability is transient/recoverable, + never fatal (poller degraded mode + job retry below). + +## API surfaces (ADR-0003, ADR-0004) + +1. **Primary — transparent native Ollama passthrough:** `/api/chat`, `/api/tags`, + `/api/ps`. foreman looks exactly like an Ollama server. Synchronous: calls are + queued internally but the HTTP response blocks until completion. 
SSE streaming + supported (ADR-0012). This is the `go-llm` target path. +2. **Async jobs — `POST /jobs`, `GET /jobs/{id}`:** body is a native-chat payload + plus optional `state_webhook_url`. Returns `202` + `{ "job_id": "" }` + immediately. For fire-and-forget orchestration callers. +3. **Optional OpenAI-compat `/v1/chat/completions` + `/v1/models`:** deferred; + added only if a non-go-llm caller needs it. + +Job lifecycle: `queued → loading → working → done` (+ terminal `failed`). A +connection failure to the target re-queues the job with backoff (bounded retries +guard poison jobs). IDs are ULIDs (sortable, timestamped). + +## Webhooks & artifacts (ADR-0005, ADR-0006) + +- On each state transition, POST a JSON event to `state_webhook_url` + (`job_id`, `state`, `previous_state`, `timestamp`, `model`, `attempt`, and on + completion `result` / `artifacts` / `error`). +- At-least-once delivery; callers must be idempotent on `job_id`+`state`; missed + events reconcile via `GET /jobs/{id}`. Retry with bounded backoff. Optional + `X-Foreman-Signature` HMAC when a webhook secret is configured. +- Artifacts are named typed blobs; the completion is always artifact `completion`. + Inline under ~256KB, otherwise fetched via `GET /jobs/{id}/artifacts/{name}`. + +## Model inventory (ADR-0007) + +- A poller hits the target's `/api/tags` (default ~30s) to keep an in-sync model + list; backs foreman's `/api/tags` passthrough and job validation. +- `/api/ps` tells foreman what's resident, feeding the scheduler. +- Jobs naming an uninstalled model are rejected at submit time (one re-check on + miss). Target unreachable → retain last-known list, mark degraded on a health + endpoint; do not reject wholesale on a single failed poll. + +## Execution (ADR-0009) + +- **Concurrency against the target is 1.** A single worker loop pulls a job, + ensures the right model is resident, executes, records the result. 
+- **Drain-by-model:** finish every queued job for the currently-resident model + before paying a swap (`ORDER BY (model != current), created_at`). A heuristic, + not a scheduler. No priorities, fairness, or budgets. +- Pin residency with Ollama `keep_alive`; target runs `OLLAMA_MAX_LOADED_MODELS=1` + and `OLLAMA_CONTEXT_LENGTH=8192`+. + +## Persistence (ADR-0008) + +- SQLite, WAL mode, pure-Go `modernc.org/sqlite` (no CGO → trivial Komodo builds). +- `jobs` + `artifacts` tables; single writer (the worker) + HTTP readers. TTL + sweep for pruning. No external broker. + +## Models served + +foreman serves **any installed model** named in a request; it does not own a +role→model mapping (the caller picks the model, e.g. go-llm `.Model(...)`). +Recommended roster to pull on the Mac (32GB, ~26-28GB usable, single-resident +swap): + +- **parse / data** — `qwen3:14b` (~9GB, structured/JSON output). +- **agent + code** — `qwen3.6:35b` (MoE, ~3B active, ~20GB, fast tool-calling). +- Split a dedicated dense coder (`qwen3.6:27b`) off later only if `35b`'s code + quality disappoints; it's bandwidth-bound and slow on this Mac. +- Verify exact tags against the Ollama library before pulling; the registry moves. + +## go-llm integration (ADR-0011) + +Verified: `llm.OllamaCloud(key, WithBaseURL(...))` already targets a private +authenticated native-Ollama endpoint — which foreman is. Integration is a thin +constructor, no new provider: + +- **Level 0 (now):** `llm.Foreman(baseURL, token).Model("qwen3.6:35b")` — delegates + to the ollama provider; transparent, synchronous, full tool/think/stream. +- **Level 1 (later):** a `foreman` client package — synchronous facade over the + async `/jobs` surface (manages a webhook receiver, blocks to done). +- **Level 2 (if needed):** a dedicated `provider.Provider` surfacing job IDs/state. + +## Security (ADR-0010) + +- Network is the boundary: target `:11434` firewalled to foreman, and/or both on + Tailscale. 
foreman is **not** on a public Traefik entrypoint. +- Optional static bearer: validate `Authorization: Bearer `, which reuses + the header `go-llm` already sends via the Foreman/OllamaCloud path. +- No Authentik/SSO, no per-caller identities for v1. No financial/identity data + ever transits foreman. + +## Stack & conventions + +- Go, stdlib `net/http`, minimal deps. SQLite via `modernc.org/sqlite`. +- No UI. HTTP API + small CLI only. +- Match go-llm house style: standard Go tabs; `camelCase`/`PascalCase`; check + errors immediately and wrap with `fmt.Errorf("%w: ...", err)`; imports stdlib → + third-party → internal. The worker loop never panics; it logs, marks the job, + continues. +- ADRs in `docs/adr/` (one decision each, append/supersede). Living `progress.md` + at repo root. Repo: `gitea.stevedudenhoeffer.com`. + +## Out of scope (anti-creep guardrails — ADR-0001) + +Distributed dispatch, multiple workers, claim leases, weighted fair queueing, +capacity budgets, eligibility gates, an auth framework / SSO, a GUI, and managing +more than one target per daemon. Keep the ollama client behind a small interface +so a future second backend is additive — but do not build for it now. + +## Milestones + +- **M0** — native `/api/chat` passthrough + SQLite queue + single-worker loop, one + model end to end, synchronous. +- **M1** — model poller + `/api/tags`/`/api/ps`, drain-by-model, async `/jobs` + + `state_webhook_url` + artifacts + retry-on-unreachable, the CLI, and the + `llm.Foreman()` constructor in go-llm. +- **M2 (later)** — optional OpenAI-compat `/v1`, Level-1 client / dedicated + provider, metrics. 
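The job lifecycle summarized in this file (`queued → loading → working → done`, terminal `failed`, re-queue on a connection failure with bounded retries) can be sketched as a small transition function. This is a minimal illustration under stated assumptions, not foreman's code: the `State` type, the `next` and `onTargetUnreachable` helpers, and the `maxAttempts` knob are hypothetical names; the source only says that retries are bounded.

```go
package main

import "fmt"

// State is a hypothetical type for the job states named in CLAUDE.md.
type State string

const (
	Queued  State = "queued"
	Loading State = "loading"
	Working State = "working"
	Done    State = "done"
	Failed  State = "failed"
)

// next returns the state that follows s on the happy path.
// Terminal states (done, failed) map to themselves.
func next(s State) State {
	switch s {
	case Queued:
		return Loading
	case Loading:
		return Working
	case Working:
		return Done
	default:
		return s
	}
}

// onTargetUnreachable applies the retry rule: a connection failure re-queues
// the job until the attempt count reaches maxAttempts, then marks it failed.
// maxAttempts is an assumed config value, not one named in the source.
func onTargetUnreachable(attempt, maxAttempts int) State {
	if attempt >= maxAttempts {
		return Failed
	}
	return Queued
}

func main() {
	// Walk the happy path from queued to done.
	s := Queued
	for s != Done {
		fmt.Printf("%s -> ", s)
		s = next(s)
	}
	fmt.Println(s)

	// First failed attempt is retried; the third (of three) is terminal.
	fmt.Println(onTargetUnreachable(1, 3), onTargetUnreachable(3, 3))
}
```

Keeping the retry decision in one pure function makes the "never auto-fail on a connection error, but bound the retries" rule easy to unit-test without a live target.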
diff --git a/docs/adr/0001-one-daemon-per-target.md b/docs/adr/0001-one-daemon-per-target.md new file mode 100644 index 0000000..93f0e85 --- /dev/null +++ b/docs/adr/0001-one-daemon-per-target.md @@ -0,0 +1,37 @@ +# ADR-0001: One daemon per Ollama target + +**Status:** Accepted — 2026-05-23 + +## Context + +`peon-overseer` ballooned because it coordinated *multiple* workers from a +central service: pull-based dispatch, claim leases, weighted fair queueing, +capacity budgets, eligibility gates. All of that complexity existed solely to +arbitrate shared workers. We want none of it back. + +The system being built fronts inference hardware (initially the M1 Pro running +Ollama) and exposes it as a managed job endpoint. + +## Decision + +Each `foreman` process is bound to **exactly one** Ollama target, configured by a +single base URL. One target = one daemon = one queue. There is no cross-daemon +awareness and no shared state between daemons. + +If a second worker is added later (the 4090 box, the M5 Max), it gets its own +`foreman` instance. Any fan-out across workers is the concern of a *separate* +higher-level router that talks to multiple foreman instances — explicitly out of +scope here and not to be anticipated in this codebase. + +## Consequences + +- The daemon is radically simple: one target, one serialized work stream. +- Horizontal scale is "run another daemon," an operational act, not a code change. +- No lease/fairness/budget machinery is permitted in this repo. If a change + starts to require it, that is the signal that the multi-worker router (a + different project) is what's actually needed. + +## Alternatives considered + +- **One daemon managing many targets.** Rejected: reintroduces the scheduling and + arbitration complexity that sank the predecessor. 
diff --git a/docs/adr/0002-daemon-placement.md b/docs/adr/0002-daemon-placement.md new file mode 100644 index 0000000..28ab650 --- /dev/null +++ b/docs/adr/0002-daemon-placement.md @@ -0,0 +1,36 @@ +# ADR-0002: Daemon placement and remote target configuration + +**Status:** Accepted — 2026-05-23 + +## Context + +The inference box is an M1 Pro MacBook — a laptop, not always-on infrastructure. +The rest of steveternet runs on the homelab and is deployed/managed through +Komodo. We do not want bespoke job-controller logic living on the Mac. + +## Decision + +`foreman` runs on the homelab (e.g. orgrimmar), containerized and deployed via +Komodo like everything else. It is **given** its Ollama target as a configurable +base URL (default: the Mac's Tailscale address) and reaches it over the network. + +The Mac runs Ollama and nothing `foreman`-specific. It stays a dumb appliance. + +## Consequences + +- Ops consistency: foreman is a normal Komodo-managed container. +- The target URL is config, never hardcoded — swapping the Mac for another + backend is a config edit (within the one-target-per-daemon rule of ADR-0001). +- The Mac is a laptop and may sleep or change networks. The daemon must treat an + unreachable target as a transient, recoverable condition (see ADR-0007 for the + model poller's degraded mode and ADR-0004 for job retry semantics), never as a + fatal error. Operationally: `caffeinate`/`pmset` keeps the Mac awake; Tailscale + keeps its address stable. +- Network is now the trust boundary; Ollama has no auth of its own (see ADR-0010). + +## Alternatives considered + +- **Co-locate foreman on the Mac.** Rejected: contradicts the stated preference to + keep controller logic off the laptop, and complicates Komodo-based deployment. + Note that "given a target URL" keeps this reversible — co-location would just be + pointing the URL at localhost. 
diff --git a/docs/adr/0003-api-surface.md b/docs/adr/0003-api-surface.md new file mode 100644 index 0000000..25ff622 --- /dev/null +++ b/docs/adr/0003-api-surface.md @@ -0,0 +1,51 @@ +# ADR-0003: API surface — native Ollama passthrough vs OpenAI-compat + +**Status:** Accepted — 2026-05-23 (resolved in favor of native Ollama) + +## Context + +Two goals were in mild tension: the original phrasing asked for an +"OpenAI-compatible API," while the stated ultimate goal is to use the M1 Pro +**simply as a target for `go-llm`**. + +`go-llm`'s `v2/CLAUDE.md` Key Design Decision #8 is explicit: its Ollama provider +deliberately uses native `/api/chat`, *not* OpenAI-compat `/v1`, for `think:false` +support, more reliable tool calling, and ~15-20% lower latency. + +**Verified in code (`v2/constructors.go`).** `llm.OllamaCloud(apiKey, opts...)` +sends the key as `Authorization: Bearer ` over native `/api/chat`, and its +doc comment says to "use `WithBaseURL` to point at a private Ollama deployment +that requires auth." So go-llm *already* has a first-class path for a private, +authenticated, native-Ollama endpoint — exactly what foreman is on the wire. +Choosing OpenAI-compat would push go-llm onto a path its own author rejected, for +no benefit to the primary caller. + +## Decision + +Native Ollama is **the** surface for v1. foreman speaks native `/api/chat`, +`/api/tags`, and `/api/ps`, optionally behind a Bearer token (ADR-0010). To +go-llm and any Ollama client it is indistinguishable from a private Ollama +deployment. + +The synchronous passthrough is transparent: calls are queued internally +(ADR-0009) but the HTTP response blocks until the job completes. Async features +(job IDs, `state_webhook_url`, artifacts) live on a separate `/jobs` surface +(ADR-0004), not bolted onto the passthrough. + +OpenAI-compat `/v1/chat/completions` is **deferred**, added in a later milestone +only if a non-go-llm caller needs it. 
+ +## Consequences + +- "Set up the Mac as a go-llm target" needs zero provider changes — a thin + constructor only (ADR-0011). +- Preserves `think:false`, reliable tool calls, and lower latency. +- foreman must faithfully proxy native `/api/chat` semantics, including SSE + streaming (ADR-0012). + +## Alternatives considered + +- **OpenAI-compat as primary/only surface.** Matches the original phrasing but + contradicts go-llm DD#8 and adds nothing for the primary caller. Rejected. +- **Native-only, never add OpenAI-compat.** Fully serves the goal; the secondary + surface is kept as an option, not a commitment. diff --git a/docs/adr/0004-async-job-surface.md b/docs/adr/0004-async-job-surface.md new file mode 100644 index 0000000..2877d39 --- /dev/null +++ b/docs/adr/0004-async-job-surface.md @@ -0,0 +1,52 @@ +# ADR-0004: Async job surface, job IDs, and queued execution + +**Status:** Accepted — 2026-05-23 + +## Context + +The transparent passthrough (ADR-0003) is synchronous: the caller holds an HTTP +connection until the completion returns. That is fine for interactive-length work +and for go-llm, but two needs aren't served by it: + +- Long-running jobs held open through Traefik risk idle-connection timeouts. +- Orchestration callers (mort/ratchet/werk-style) want fire-and-forget: submit, + get an ID back immediately, and be told asynchronously when the work is done. + +## Decision + +Add a distinct async surface: `POST /jobs`. + +- The body carries a chat payload (native-Ollama-shaped, mirroring `/api/chat`) + plus optional extension fields, notably `state_webhook_url` (ADR-0005). +- foreman enqueues the job, assigns it a **ULID** (sortable, timestamped), and + immediately returns `202 Accepted` with `{ "job_id": "" }`. +- The caller correlates later webhook callbacks to its request via `job_id`. +- `GET /jobs/{id}` returns current state, result, and artifact references for + polling-style callers or for recovery after a missed webhook. 
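A caller-side sketch of the submission flow described in the list above: post a chat-shaped body with `state_webhook_url` to `/jobs`, expect `202`, and read back `job_id`. Everything here is illustrative: `jobRequest`, `submitJob`, and `fakeForeman` are hypothetical names, the request fields are a small subset of a native chat payload, and the returned ID and webhook URL are placeholders (a real ID would be a ULID). The fake server stands in for foreman so the example runs on its own.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// message and jobRequest mirror a native /api/chat body plus the
// state_webhook_url extension. Illustrative subset, not foreman's schema.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type jobRequest struct {
	Model           string    `json:"model"`
	Messages        []message `json:"messages"`
	StateWebhookURL string    `json:"state_webhook_url,omitempty"`
}

// submitJob posts req to baseURL/jobs and returns the job_id from the
// 202 Accepted response.
func submitJob(baseURL string, req jobRequest) (string, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return "", fmt.Errorf("encode job: %w", err)
	}
	resp, err := http.Post(baseURL+"/jobs", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", fmt.Errorf("post job: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var out struct {
		JobID string `json:"job_id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", fmt.Errorf("decode response: %w", err)
	}
	return out.JobID, nil
}

// fakeForeman is a stand-in server that accepts any POST /jobs and
// returns a fixed placeholder ID.
func fakeForeman() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/jobs" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusAccepted)
		fmt.Fprint(w, `{"job_id":"01JEXAMPLE"}`)
	}))
}

func main() {
	srv := fakeForeman()
	defer srv.Close()

	id, err := submitJob(srv.URL, jobRequest{
		Model:           "qwen3:14b",
		Messages:        []message{{Role: "user", Content: "hello"}},
		StateWebhookURL: "http://caller.example/hooks/foreman",
	})
	fmt.Println(id, err)
}
```

The caller stores the returned ID and uses it both to match incoming webhook events and to poll `GET /jobs/{id}` if a callback is missed.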
+ +Every unit of work is a row in the queue (ADR-0008) regardless of which surface +created it; the synchronous passthrough is simply a `/jobs` submission whose +handler blocks on the job's completion instead of returning the ID. + +### Job lifecycle + +`queued → loading → working → done`, plus terminal `failed`. A job whose target +is unreachable re-enters `queued` with a backoff (it is retryable, never +auto-failed on a connection error — the target is a laptop, ADR-0002). A bounded +retry count guards against poison jobs; exceeding it moves the job to `failed` +with the last error recorded. + +## Consequences + +- One queue, one execution engine, two entry points (sync passthrough, async + `/jobs`). +- Job IDs are stable, sortable, and meaningful to correlate webhooks. +- `GET /jobs/{id}` gives at-least-once webhook delivery a recovery path. + +## Alternatives considered + +- **Reuse the OpenAI response `id` field instead of a separate `/jobs` surface.** + Workable for sync, but doesn't give async callers an immediate handle before + completion. The explicit `/jobs` surface is clearer. +- **UUIDv4 for IDs.** Rejected in favor of ULID for natural time-ordering in the + queue and logs. diff --git a/docs/adr/0005-webhook-protocol.md b/docs/adr/0005-webhook-protocol.md new file mode 100644 index 0000000..71d7613 --- /dev/null +++ b/docs/adr/0005-webhook-protocol.md @@ -0,0 +1,63 @@ +# ADR-0005: Webhook state-update protocol + +**Status:** Accepted — 2026-05-23 + +## Context + +Async callers (ADR-0004) need to know how their job is progressing without +polling. The requirement: periodically push state updates +(`queued → loading → working → done`) and deliver results/artifacts on +completion. + +## Decision + +When a job is submitted with `state_webhook_url`, foreman POSTs a JSON event to +that URL on every state transition. 
+ +### Event payload + +```json +{ + "job_id": "01J...", + "state": "loading", + "previous_state": "queued", + "timestamp": "2026-05-23T12:00:00Z", + "model": "qwen3.6:35b", + "attempt": 1, + "error": null, + "result": null, + "artifacts": null +} +``` + +- `state`: one of `queued`, `loading`, `working`, `done`, `failed`. +- On `done`: `result` holds the completion (native-Ollama-shaped) and `artifacts` + holds artifact references (ADR-0006). +- On `failed`: `error` holds a message; `result` is null. + +### Delivery semantics + +- **At-least-once.** Callers must be idempotent on `job_id` + `state`. A missed + webhook can always be reconciled via `GET /jobs/{id}` (ADR-0004). +- **Retry with backoff** on non-2xx or connection failure, bounded attempts, then + the event is dropped (the job state itself is unaffected and remains queryable). +- **Ordering is not guaranteed** across retries; `previous_state` + `timestamp` + let callers order/deduplicate. +- **Optional HMAC signing:** if a webhook secret is configured, foreman sends an + `X-Foreman-Signature` header (HMAC-SHA256 of the body) so receivers can verify + authenticity. Off by default; recommended once foreman is reachable beyond a + fully trusted network. + +## Consequences + +- Callers get push observability with a polling fallback. +- Idempotency is pushed onto the caller — documented as a hard requirement. +- Webhook delivery is decoupled from job execution: a flaky receiver never blocks + or fails the job. + +## Alternatives considered + +- **Polling only.** Simpler for foreman, worse for callers; rejected since + webhooks were an explicit requirement. (Polling is still available as fallback.) +- **WebSocket/SSE for state.** Heavier; SSE is reserved for token streaming on the + sync surface (ADR-0012), not job-state fan-out. 
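A receiver-side sketch of two rules from this ADR: verifying `X-Foreman-Signature` and staying idempotent on `job_id` + `state`. The ADR specifies only "HMAC-SHA256 of the body", so the hex encoding of the header value is an assumption here, as are the helper names `sign`, `verify`, and `seen`.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign computes HMAC-SHA256 of body under secret, hex-encoded.
// Hex is an assumed encoding; the ADR does not specify one.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify reports whether header is a valid signature for body.
// hmac.Equal compares in constant time.
func verify(secret, body []byte, header string) bool {
	got, err := hex.DecodeString(header)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

// seen tracks delivered (job_id, state) pairs so redeliveries are ignored.
type seen map[string]bool

// firstDelivery returns true only the first time a pair is observed.
func (s seen) firstDelivery(jobID, state string) bool {
	key := jobID + "|" + state
	if s[key] {
		return false
	}
	s[key] = true
	return true
}

func main() {
	secret := []byte("example-secret")
	body := []byte(`{"job_id":"01J...","state":"done"}`)
	sig := sign(secret, body)

	fmt.Println(verify(secret, body, sig))               // true
	fmt.Println(verify(secret, []byte("tampered"), sig)) // false

	s := seen{}
	fmt.Println(s.firstDelivery("01J...", "done")) // true
	fmt.Println(s.firstDelivery("01J...", "done")) // false: duplicate delivery
}
```

A production receiver would persist the seen set rather than hold it in memory, since at-least-once delivery can repeat an event across receiver restarts.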
diff --git a/docs/adr/0006-artifact-handling.md b/docs/adr/0006-artifact-handling.md new file mode 100644 index 0000000..cc45855 --- /dev/null +++ b/docs/adr/0006-artifact-handling.md @@ -0,0 +1,53 @@ +# ADR-0006: Artifact handling and transport + +**Status:** Accepted — 2026-05-23 + +## Context + +Jobs must "transmit artifacts when done." For a chat completion the obvious +artifact is the assistant's text/tool-call output, but the term is deliberately +broader: a job may produce structured data, multiple named outputs, or content +too large to embed comfortably in a webhook body. + +## Decision + +An **artifact** is a named, typed blob attached to a completed job: + +```json +{ "name": "completion", "content_type": "application/json", "size": 1234, + "inline": { ... }, "url": null } +``` + +- The primary completion is always emitted as an artifact named `completion` + (the native-Ollama response shape), so there is one consistent access pattern. +- Additional artifacts use distinct names. + +### Transport: inline vs fetch + +- **Small artifacts** (under a configurable threshold, default ~256 KB) are + delivered **inline** in the `done` webhook (`inline` populated, `url` null) and + in `GET /jobs/{id}`. +- **Large artifacts** exceed the threshold: the webhook/`GET` carries metadata + plus a `url` (`GET /jobs/{id}/artifacts/{name}`), and the bytes are fetched + on demand. This keeps webhook payloads bounded and avoids shipping megabytes + through a callback POST. + +### Retention + +Artifacts are stored alongside the job in SQLite (ADR-0008) and pruned with the +job after a configurable TTL. No separate blob store in v1; revisit only if +artifact sizes outgrow SQLite comfort (single-digit MB). + +## Consequences + +- One uniform way to read output (`completion` artifact), extensible to richer + jobs later without protocol changes. +- Webhook bodies stay small; large outputs don't bloat or break delivery. 
+- A pull endpoint for artifacts means a missed/oversized webhook never loses data. + +## Alternatives considered + +- **Always inline.** Simple but risks huge webhook bodies and SQLite row bloat in + the hot path. Rejected. +- **External object store (S3/MinIO) from day one.** Over-engineered for the + expected sizes; deferred behind the TTL/threshold knobs. diff --git a/docs/adr/0007-model-polling.md b/docs/adr/0007-model-polling.md new file mode 100644 index 0000000..a2317b8 --- /dev/null +++ b/docs/adr/0007-model-polling.md @@ -0,0 +1,48 @@ +# ADR-0007: Model inventory polling and discovery + +**Status:** Accepted — 2026-05-23 + +## Context + +foreman needs a "relatively in-sync" view of which models are installed on its +target so it can (a) advertise them to callers, (b) reject jobs for missing +models early instead of failing mid-execution, and (c) know what is currently +resident to inform scheduling (ADR-0009). + +## Decision + +A background poller queries the target on a configurable interval (default ~30s): + +- `GET /api/tags` → the installed-model inventory. Cached in memory; this cache + backs foreman's own `/api/tags` passthrough (ADR-0003) and `/v1/models` if the + OpenAI-compat surface is enabled. +- `GET /api/ps` → which model(s) are currently loaded, their VRAM/where-resident, + and the unload timer. Used by the scheduler to decide whether the next job + requires a swap. + +### Behavior + +- **Early validation:** a job naming a model absent from the cached inventory is + rejected at submit time with a clear error (and, for async jobs, the inventory + is recent enough that this is reliable). A small grace path allows a job for a + model that appears between polls by re-checking once on a miss. +- **Degraded mode:** if the target is unreachable, the last-known inventory is + retained and foreman marks itself degraded (surfaced on a health endpoint). 
+ Jobs are not rejected wholesale on a single failed poll — the target is a + laptop that may briefly sleep (ADR-0002). Execution-time unreachability is + handled by job retry (ADR-0004). + +## Consequences + +- Callers can discover available models through the normal Ollama/OpenAI + endpoints; no foreman-specific discovery API needed. +- Bad-model jobs fail fast and cheaply. +- A health/status endpoint exposing degraded state and last-poll time is required. + +## Alternatives considered + +- **No caching; proxy `/api/tags` live per request.** Simpler but couples every + discovery call to target availability and adds latency. Rejected; the poller + also feeds the scheduler, so the cache is needed regardless. +- **Push/event-based inventory.** Ollama offers no such mechanism; polling is the + only option. diff --git a/docs/adr/0008-sqlite-queue.md b/docs/adr/0008-sqlite-queue.md new file mode 100644 index 0000000..2da53f5 --- /dev/null +++ b/docs/adr/0008-sqlite-queue.md @@ -0,0 +1,42 @@ +# ADR-0008: Durable SQLite-backed queue + +**Status:** Accepted — 2026-05-23 + +## Context + +Jobs are queued, carry state, and may be retried across target sleep/restart. A +caller that submitted an async job and is waiting on a webhook must not lose its +job because foreman restarted. State must survive process restarts. + +## Decision + +The job queue and all job state (including artifacts, ADR-0006) live in **SQLite** +in WAL mode, via the pure-Go `modernc.org/sqlite` driver (no CGO, so the Komodo +container build stays trivial). + +### Schema sketch + +- `jobs(id TEXT PK, state TEXT, model TEXT, request BLOB, result BLOB, + error TEXT, webhook_url TEXT, attempt INT, created_at, updated_at, …)` +- `artifacts(job_id TEXT, name TEXT, content_type TEXT, size INT, inline BLOB, + PRIMARY KEY(job_id, name))` + +A single writer (the worker, ADR-0009) plus the HTTP handlers; WAL handles the +concurrent-reader / single-writer pattern well at this scale. 
+ +## Consequences + +- Jobs and results are durable across restarts; webhook recovery via + `GET /jobs/{id}` (ADR-0004) is meaningful. +- Pure-Go driver keeps cross-compilation and container builds painless. +- Pruning is a TTL sweep over `jobs`/`artifacts`; no external store to operate. +- SQLite caps practical artifact size at single-digit MB — acceptable per ADR-0006 + thresholds; revisit if outputs grow. + +## Alternatives considered + +- **In-memory queue.** Loses async jobs on restart; unacceptable given webhooks. +- **Redis / external broker.** Another moving part to run for a single-worker + daemon; over-engineered. Rejected. +- **`mattn/go-sqlite3` (CGO).** Faster in some cases but complicates static builds + and container images. Pure-Go preferred for ops simplicity. diff --git a/docs/adr/0009-single-worker-drain-by-model.md b/docs/adr/0009-single-worker-drain-by-model.md new file mode 100644 index 0000000..08ccc8f --- /dev/null +++ b/docs/adr/0009-single-worker-drain-by-model.md @@ -0,0 +1,44 @@ +# ADR-0009: Single-worker serialization and drain-by-model scheduling + +**Status:** Accepted — 2026-05-23 + +## Context + +The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one model fast +at a time; loading a different model is a 5-10s cold start. Running two models +concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism +against a single target buys nothing and would reintroduce coordination logic. + +## Decision + +**Concurrency against the target is 1.** A single worker loop pulls the next job +from the queue, ensures the right model is resident, executes, and records the +result. + +**Drain-by-model scheduling:** before incurring a model swap, the worker finishes +every queued job that targets the **currently-resident** model (observed via +`/api/ps`, ADR-0007). Only when no job for the hot model remains does it select a +job for a different model and pay the swap cost. 
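The drain-by-model rule above can be sketched in memory as a two-key sort: jobs for the resident model first, then oldest first. This is an illustration of the heuristic only; the `job` type, `pickNext`, and `sampleQueue` are hypothetical names, and the real selection would presumably run as a query against the SQLite queue rather than over a slice.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// job is a hypothetical, trimmed-down queue row.
type job struct {
	ID        string
	Model     string
	CreatedAt time.Time
}

// pickNext returns the job the worker should run next: the oldest job for
// the resident model if one exists, otherwise the oldest job overall.
// ok is false when the queue is empty.
func pickNext(queue []job, resident string) (next job, ok bool) {
	if len(queue) == 0 {
		return job{}, false
	}
	sorted := append([]job(nil), queue...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		aCold, bCold := a.Model != resident, b.Model != resident
		if aCold != bCold {
			return !aCold // resident-model jobs sort first
		}
		return a.CreatedAt.Before(b.CreatedAt)
	})
	return sorted[0], true
}

// sampleQueue builds three jobs across two models, oldest first.
func sampleQueue() []job {
	t0 := time.Date(2026, 5, 23, 12, 0, 0, 0, time.UTC)
	return []job{
		{ID: "a", Model: "qwen3:14b", CreatedAt: t0},
		{ID: "b", Model: "qwen3.6:35b", CreatedAt: t0.Add(time.Second)},
		{ID: "c", Model: "qwen3:14b", CreatedAt: t0.Add(2 * time.Second)},
	}
}

func main() {
	// With qwen3.6:35b resident, job b runs before the older job a.
	next, _ := pickNext(sampleQueue(), "qwen3.6:35b")
	fmt.Println(next.ID) // b
}
```

When nothing is resident (an empty `resident` string), every job counts as cold and the rule degrades to plain FIFO.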
+ +This is an `ORDER BY (model != current_model), created_at` style selection — a +heuristic, not a scheduler. There is intentionally **no** priority system, +fairness weighting, or capacity budgeting (those sank the predecessor; see +ADR-0001). + +Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded +between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it +to single-resident swap. + +## Consequences + +- Swap thrash is minimized without any complex scheduling. +- A long run of same-model jobs can delay a different-model job — acceptable for a + background box, and bounded by queue depth. If starvation ever becomes a real + problem, that is a signal to reconsider, not to pre-build fairness. +- Throughput is dominated by how well callers batch work by model. + +## Alternatives considered + +- **FIFO with naive swapping.** Correct but pays a cold start on every model + change; wasteful when jobs interleave models. Rejected. +- **Priority/fair scheduling.** Explicitly rejected as scope creep (ADR-0001). diff --git a/docs/adr/0010-auth-and-security.md b/docs/adr/0010-auth-and-security.md new file mode 100644 index 0000000..f18f570 --- /dev/null +++ b/docs/adr/0010-auth-and-security.md @@ -0,0 +1,51 @@ +# ADR-0010: Authentication and security boundary + +**Status:** Accepted — 2026-05-23 + +## Context + +Ollama itself has no authentication — anyone who can reach `:11434` can drive it. +foreman sits in front of it and is the network-facing component. We need a real +boundary without dragging in an auth framework (the kind of scope creep ADR-0001 +guards against). + +## Decision + +**Primary boundary is the network.** foreman and its Ollama target sit on a +trusted segment: the target's `:11434` is firewalled to foreman only, and/or +both are bound to the Tailscale interface. foreman is **not** exposed through a +public Traefik entrypoint. 
+ +**Optional static bearer token.** If a token is configured, foreman validates the +`Authorization: Bearer ` header on incoming requests. This reuses headers +that clients already send: + +- `go-llm` via `llm.Ollama()` sends no auth (fine on a trusted segment); via + `ollama.New(key, baseURL)` it sends `Authorization: Bearer ` — so a + configured foreman token slots straight into the existing provider with no new + code. +- The OpenAI-compat surface (if enabled, ADR-0003) carries the same header. + +foreman → target auth: an optional bearer the daemon attaches to its own calls to +Ollama, for the Ollama-Cloud-style case; empty for a local/LAN target. + +## Out of scope for v1 + +- Authentik / SSO. It is painful for service-to-service traffic and adds nothing + over network isolation here. +- Per-caller identities, scopes, rate limiting. Not needed for a single-tenant + homelab daemon. + +## Consequences + +- Minimal but real security: network isolation always, plus an optional shared + secret that integrates with existing clients for free. +- Webhook authenticity is handled separately by optional HMAC signing (ADR-0005). +- No financial/identity/credential data ever transits foreman; it brokers chat + jobs only. + +## Alternatives considered + +- **No auth, network-only.** Acceptable on a fully trusted tailnet; the optional + token exists for when foreman's reachability widens. +- **Full auth framework / SSO.** Rejected as scope creep. diff --git a/docs/adr/0011-go-client-and-go-llm-integration.md b/docs/adr/0011-go-client-and-go-llm-integration.md new file mode 100644 index 0000000..23a2760 --- /dev/null +++ b/docs/adr/0011-go-client-and-go-llm-integration.md @@ -0,0 +1,73 @@ +# ADR-0011: Go integration — the `Foreman` interface + +**Status:** Accepted — 2026-05-23 + +## Context + +The ultimate goal: use the M1 Pro **simply as a target for `go-llm`**. 
+ +**Verified (`v2/constructors.go`, `v2/ollama/ollama.go`):** `llm.OllamaCloud(key, +WithBaseURL(...))` already targets "a private Ollama deployment that requires +auth" — native `/api/chat` + `Authorization: Bearer <key>` against any base URL. +foreman is exactly that on the wire (ADR-0003). So integration needs **no new +provider** — only a clean, intent-revealing seam so call sites say "foreman," not +"Ollama." + +`go-llm`'s provider contract (`v2/provider`) is two methods, `Complete` and +`Stream`; a future dedicated provider would implement them. + +## Decision + +Add a `llm.Foreman(baseURL, apiKey, opts...)` constructor to go-llm that delegates +to the ollama native provider — the ollama translation happens behind the scenes: + +```go +func Foreman(baseURL, apiKey string, opts ...ClientOption) *Client { + cfg := &clientConfig{} + for _, opt := range opts { + opt(cfg) + } + if cfg.baseURL != "" { + baseURL = cfg.baseURL + } + return NewClient(ollamaProvider.New(apiKey, baseURL)) +} + +// model := llm.Foreman("http://foreman.orgrimmar:PORT", token).Model("qwen3.6:35b") +``` + +`baseURL` is required (foreman has no default public address). This is a +deliberate **seam**: v1 is a pass-through to the `ollama` provider; a dedicated +foreman provider can later replace the delegate to surface job IDs / async state +without changing call sites. + +### Three escalating levels + +- **Level 0 — `llm.Foreman(...)` (now, the headline goal).** Transparent, + synchronous, full native tool-calling / `think:false` / streaming. Queueing and + model-swap management happen invisibly inside the daemon. Zero provider code. +- **Level 1 — `foreman` client package (when an orchestration caller needs it).** + A synchronous facade over the async `/jobs` surface: given messages, it manages + an ephemeral webhook receiver, blocks until `done`, and returns result + + artifacts (falling back to `GET /jobs/{id}` polling if it can't receive + callbacks). 
For callers wanting async semantics — surfaced job IDs, no + long-held connection — with a synchronous call signature. +- **Level 2 — dedicated `provider.Provider` (only if needed).** Wraps Level 1 so + foreman is a first-class go-llm backend exposing job IDs / state / artifacts the + plain ollama provider can't. Built only if Level 0 proves insufficient. + +## Consequences + +- Headline goal met with one constructor and no provider code. +- Call sites are foreman-named and future-proofed by the seam. +- Async ergonomics are available later without forcing webhook plumbing on + callers, and without touching Level-0 users. + +## Alternatives considered + +- **Just tell users to call `OllamaCloud` with a base URL.** Works identically + today, but leaks the implementation ("it's Ollama") and offers no seam for + future foreman-specific behavior. The named constructor is the requested + "foreman interface." +- **Ship a dedicated provider from day one (Level 2 first).** More code; bypasses + the zero-friction win. Deferred. diff --git a/docs/adr/0012-streaming.md b/docs/adr/0012-streaming.md new file mode 100644 index 0000000..0a46ac1 --- /dev/null +++ b/docs/adr/0012-streaming.md @@ -0,0 +1,41 @@ +# ADR-0012: Streaming support + +**Status:** Accepted — 2026-05-23 + +## Context + +`go-llm`'s provider interface has a `Stream()` method, and Ollama's native +`/api/chat` streams token-by-token by default. The synchronous passthrough +(ADR-0003) must not break streaming clients. Separately, the async `/jobs` +surface (ADR-0004) reports progress via discrete state webhooks, which is a +different granularity than token streaming. + +## Decision + +- **Sync passthrough: support streaming.** When a `/api/chat` request sets + `stream: true`, foreman streams the target's token deltas back to the caller + (chunked newline-delimited JSON, matching Ollama's native streaming). 
A streamed job still moves + through the queue; streaming begins once the job reaches `working`, so a job + waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when + its turn comes. go-llm's `Stream()` works against foreman unchanged. +- **Async `/jobs` surface: no token streaming in v1.** Webhooks carry coarse state + transitions (ADR-0005) and the final result/artifacts, not per-token deltas. + Token-level streaming over a fire-and-forget webhook job is deliberately + deferred — it adds a transport (persistent connection or chunked webhook) whose + complexity isn't justified yet. + +## Consequences + +- Interactive go-llm usage gets real streaming through the transparent surface. +- Orchestration callers get state + final artifacts, which is what they need; + they can use the sync streaming surface directly if they want tokens. +- The job state machine and webhook protocol stay simple (no streaming transport + to design or operate). + +## Alternatives considered + +- **Stream tokens over the async surface too.** Deferred: requires either a + long-lived connection (defeats the point of async) or chunked-delta webhooks + (complex, rarely needed). Revisit only on a concrete need. +- **No streaming at all.** Would break go-llm's `Stream()` and interactive use on + the very path that is the primary goal. Rejected. diff --git a/docs/adr/README.md b/docs/adr/README.md new file mode 100644 index 0000000..7154f2c --- /dev/null +++ b/docs/adr/README.md @@ -0,0 +1,41 @@ +# foreman — Architecture Decision Records + +`foreman` is a small daemon that fronts **one** Ollama target. It turns a single +Ollama instance into a queued, observable job endpoint: it polls the target's +installed models, serializes jobs through the target (managing model swaps), +assigns every job an ID, and reports progress + artifacts via webhooks. It also +ships a Go client so the target is trivial to use from `go-llm`. 
+ +It is the deliberately pared-down successor to `peon-overseer`. One daemon, one +worker, one queue. No distributed dispatch, no leases, no fair queueing. + +## Index + +| ADR | Title | Status | +|-----|-------|--------| +| 0001 | One daemon per Ollama target | Accepted | +| 0002 | Daemon placement and remote target configuration | Accepted | +| 0003 | API surface: native Ollama passthrough vs OpenAI-compat | Accepted | +| 0004 | Async job surface, job IDs, and queued execution | Accepted | +| 0005 | Webhook state-update protocol | Accepted | +| 0006 | Artifact handling and transport | Accepted | +| 0007 | Model inventory polling and discovery | Accepted | +| 0008 | Durable SQLite-backed queue | Accepted | +| 0009 | Single-worker serialization and drain-by-model scheduling | Accepted | +| 0010 | Authentication and security boundary | Accepted | +| 0011 | Go client library and go-llm integration | Accepted | +| 0012 | Streaming support | Accepted | + +ADR-0003 was resolved in favor of **native Ollama** as the v1 surface: foreman is, +on the wire, a private authenticated Ollama deployment, so `go-llm` integrates via +a thin `llm.Foreman(baseURL, token)` constructor that delegates to the existing +ollama provider (ADR-0011). OpenAI-compat `/v1` is deferred. + +These ADRs refine the API/integration sections of the project `CLAUDE.md`. The +queue, single-worker, drain-by-model, and security guardrails carry forward +unchanged. + +## Format + +Each ADR: Status, Context, Decision, Consequences, and Alternatives where useful. +One decision per file. Append new ADRs; supersede rather than rewrite.