# foreman

A small, always-on daemon that fronts **one** Ollama target. It turns a single
Ollama instance into a queued, observable job endpoint: it polls the target's
installed models, serializes work through the target (managing model swaps),
assigns every job an ID, and reports progress + artifacts via webhooks. On the
wire it speaks **native Ollama**, so it doubles as a drop-in client target — for
any Ollama client, and specifically for
[majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo) (the `go-llm`
library referenced throughout these docs is now majordomo) and the
[gadfly](https://gitea.stevedudenhoeffer.com/steve/gadfly) reviewer built on it.

> This is a public, **vibe-coded** project (built largely by an AI agent). Keep
> that framing honest in the README; don't oversell it. Homelab specifics below
> (orgrimmar, the Macs, Komodo, Tailscale) are the author's deployment and are
> illustrative — the daemon itself is generic.

foreman is the deliberately pared-down successor to `peon-overseer`. One daemon,
one target, one queue. The complexity that sank the predecessor — distributed
dispatch, claim leases, weighted fair queueing, capacity budgets, eligibility
gates — existed to coordinate *multiple* workers and is **out of scope**.
Resisting that creep is a first-class design goal. See `docs/adr/` for the
decisions; this file summarizes them.

## Build / test / run

```sh
go build ./cmd/foreman          # the daemon binary
go test ./...                   # client/ + internal/* unit tests
go vet ./... && gofmt -l .      # must be quiet / clean before committing
```

Run it locally against a real Ollama target (only `FOREMAN_OLLAMA_URL` is
required; full env reference in `.env.example` and the README table):

```sh
FOREMAN_OLLAMA_URL=http://mac.tail:11434 go run ./cmd/foreman serve
curl -s localhost:8080/healthz   # {"status":"ok","degraded":false}
scripts/pull-models.sh           # pull the recommended roster on the target
```

Pure-Go only (`modernc.org/sqlite`, no CGO) so Docker/Komodo builds stay
trivial — keep it that way. The worker loop must never panic: log, mark the job,
continue.

## Topology (ADR-0001, ADR-0002)

```
orgrimmar:   foreman (Go binary + SQLite queue + HTTP API + worker loop)
                 |  HTTP over the trusted VLAN / Tailscale
                 v
M1 Pro Mac:  Ollama only (models on disk, no foreman logic)
```

- One foreman process per Ollama target, configured by a single base URL
  (default: the Mac's Tailscale address). A second worker = a second foreman.
- foreman runs on the homelab, containerized, deployed via Komodo. The Mac stays
  a dumb appliance.
- The target is a laptop and may sleep. Unreachability is transient/recoverable,
  never fatal (poller degraded mode + job retry below).

## API surfaces (ADR-0003, ADR-0004)

1. **Primary — transparent native Ollama passthrough:** `/api/chat`,
   `/api/tags`, `/api/ps`. foreman looks exactly like an Ollama server.
   Synchronous: calls are queued internally but the HTTP response blocks until
   completion. NDJSON streaming supported (`application/x-ndjson` — Ollama's
   native wire format, not SSE; ADR-0012). This is the `go-llm` target path.
2. **Embeddings (bypass the queue) — `/api/embed`, `/api/embeddings`:** proxied
   directly and concurrently to the always-resident embedder; never touch the
   queue or worker loop (ADR-0013).
3. **Async jobs — `POST /jobs`, `GET /jobs/{id}`:** body is a native-chat
   payload plus optional `state_webhook_url`. Returns `202` +
   `{ "job_id": "" }` immediately. For fire-and-forget orchestration callers.
4. **Optional OpenAI-compat `/v1/chat/completions` + `/v1/models`:** deferred;
   added only if a non-go-llm caller needs it.

Job lifecycle: `queued → loading → working → done` (+ terminal `failed`). A
connection failure to the target re-queues the job with backoff (bounded retries
guard poison jobs). IDs are ULIDs (sortable, timestamped).

## Webhooks & artifacts (ADR-0005, ADR-0006)

- On each state transition, POST a JSON event to `state_webhook_url` (`job_id`,
  `state`, `previous_state`, `timestamp`, `model`, `attempt`, and on completion
  `result` / `artifacts` / `error`).
- At-least-once delivery; callers must be idempotent on `job_id`+`state`; missed
  events reconcile via `GET /jobs/{id}`. Retry with bounded backoff. Optional
  `X-Foreman-Signature` HMAC when a webhook secret is configured.
- Artifacts are named typed blobs; the completion is always artifact
  `completion`. Inline under ~256KB, otherwise fetched via
  `GET /jobs/{id}/artifacts/{name}`.

## Model inventory (ADR-0007)

- A poller hits the target's `/api/tags` (default ~30s) to keep an in-sync model
  list; backs foreman's `/api/tags` passthrough and job validation.
- `/api/ps` tells foreman what's resident, feeding the scheduler.
- Jobs naming an uninstalled model are rejected at submit time (one re-check on
  miss). Target unreachable → retain last-known list, mark degraded on a health
  endpoint; do not reject wholesale on a single failed poll.

## Execution (ADR-0009, ADR-0013)

- **Worker-model concurrency against the target is 1.** A single worker loop
  pulls a job, ensures the right worker model is resident, executes, records the
  result. Embeddings are not jobs and bypass this loop entirely (ADR-0013).
- **Drain-by-model:** finish every queued job for the currently-resident worker
  model before paying a swap (`ORDER BY (model != current), created_at`). A
  heuristic, not a scheduler. No priorities, fairness, or budgets.
- **Two resident slots:** target runs `OLLAMA_MAX_LOADED_MODELS=2` — slot 1 is
  the always-resident embedder (`FOREMAN_EMBED_MODEL`, pinned `keep_alive: -1`,
  warmed on startup/reconnect); slot 2 is the rotating worker model. Pin the
  worker with `keep_alive`; set `OLLAMA_CONTEXT_LENGTH=8192`+.

## Persistence (ADR-0008)

- SQLite, WAL mode, pure-Go `modernc.org/sqlite` (no CGO → trivial Komodo
  builds).
- `jobs` + `artifacts` tables; single writer (the worker) + HTTP readers. TTL
  sweep for pruning. No external broker.

## Models served

foreman serves **any installed model** named in a request; it does not own a
role→model mapping (the caller picks the model, e.g. go-llm `.Model(...)`).
Recommended roster to pull on the Mac (32GB; the embedder stays resident in
slot 1, one worker model rotates through slot 2 — ADR-0013):

- **embedder (always resident)** — `nomic-embed-text` (~0.3GB) or
  `qwen3-embedding:0.6b`; selected via `FOREMAN_EMBED_MODEL`.
- **parse / data** — `qwen3:14b` (~9GB, structured/JSON output).
- **agent + code** — `qwen3:30b` (Qwen3-30B-A3B MoE, ~3B active, ~19GB, fast
  tool-calling). This is the default worker model.
- Add a dedicated dense coder only if `qwen3:30b`'s code quality disappoints:
  `gpt-oss:20b` (~13GB, faster) or `qwen2.5-coder:32b` (~20GB, higher quality
  but bandwidth-bound and slow on this Mac).
- Verify exact tags against the Ollama library before pulling; the registry
  moves.

## go-llm integration (ADR-0011)

Verified: `llm.OllamaCloud(key, WithBaseURL(...))` already targets a private
authenticated native-Ollama endpoint — which foreman is. Integration is a thin
constructor, no new provider:

- **Level 0 (now):** `llm.Foreman(baseURL, token).Model("qwen3:30b")` —
  delegates to the ollama provider; transparent, synchronous, full
  tool/think/stream.
- **Level 1 (later):** a `foreman` client package — synchronous facade over the
  async `/jobs` surface (manages a webhook receiver, blocks to done).
- **Level 2 (if needed):** a dedicated `provider.Provider` surfacing job
  IDs/state.

## Security (ADR-0010)

- Network is the boundary: target `:11434` firewalled to foreman, and/or both on
  Tailscale. foreman is **not** on a public Traefik entrypoint.
- Optional static bearer: validate `Authorization: Bearer `, which reuses the
  header `go-llm` already sends via the Foreman/OllamaCloud path.
- No Authentik/SSO, no per-caller identities for v1. No financial/identity data
  ever transits foreman.

## Stack & conventions

- Go 1.26, stdlib `net/http`, minimal deps. SQLite via `modernc.org/sqlite`.
- No UI. HTTP API + small CLI only.
- Match go-llm house style: standard Go tabs; `camelCase`/`PascalCase`; check
  errors immediately and wrap with `fmt.Errorf("%w: ...", err)`; imports
  stdlib → third-party → internal. The worker loop never panics; it logs, marks
  the job, continues.
- ADRs in `docs/adr/` (one decision each, append/supersede). Living
  `progress.md` at repo root. Repo: `gitea.stevedudenhoeffer.com`.

## Out of scope (anti-creep guardrails — ADR-0001)

Distributed dispatch, multiple workers, claim leases, weighted fair queueing,
capacity budgets, eligibility gates, an auth framework / SSO, a GUI, and
managing more than one target per daemon. Keep the ollama client behind a small
interface so a future second backend is additive — but do not build for it now.

## Milestones

- **M0** — native `/api/chat` passthrough + SQLite queue + single-worker loop,
  one model end to end, synchronous.
- **M1** — model poller + `/api/tags`/`/api/ps`, drain-by-model, embedding
  bypass, async `/jobs` + `state_webhook_url` + artifacts +
  retry-on-unreachable, the CLI, and the `llm.Foreman()` constructor in go-llm.
- **M2 (later)** — optional OpenAI-compat `/v1`, Level-1 client / dedicated
  provider, metrics.
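
The drain-by-model rule from the Execution section (`ORDER BY (model != current), created_at`) can be illustrated in memory. This is a minimal sketch, not the daemon's code: the `job` struct, its `CreatedAt` representation, and the `drainOrder` helper are hypothetical names introduced here, and the real ordering is expected to happen in the SQLite query.

```go
package main

import (
	"fmt"
	"sort"
)

// job is a hypothetical in-memory stand-in for a row in the jobs table.
type job struct {
	ID        string
	Model     string
	CreatedAt int64 // unix milliseconds; the real column type is an assumption
}

// drainOrder returns a sorted copy of queued, mirroring
// ORDER BY (model != current), created_at: jobs for the resident model
// come first, and within each group the oldest job comes first.
func drainOrder(queued []job, current string) []job {
	out := make([]job, len(queued))
	copy(out, queued)
	sort.SliceStable(out, func(i, j int) bool {
		swapI := out[i].Model != current
		swapJ := out[j].Model != current
		if swapI != swapJ {
			return !swapI // "no swap needed" sorts before "swap needed"
		}
		return out[i].CreatedAt < out[j].CreatedAt
	})
	return out
}

func main() {
	queued := []job{
		{ID: "a", Model: "qwen3:14b", CreatedAt: 1},
		{ID: "b", Model: "qwen3:30b", CreatedAt: 2},
		{ID: "c", Model: "qwen3:14b", CreatedAt: 3},
		{ID: "d", Model: "qwen3:30b", CreatedAt: 4},
	}
	// With qwen3:30b resident, b and d run before the swap to qwen3:14b.
	for _, j := range drainOrder(queued, "qwen3:30b") {
		fmt.Println(j.ID, j.Model)
	}
}
```

With `qwen3:30b` resident, the sketch yields `b`, `d`, `a`, `c`: both jobs for the loaded model drain first, and only then does the older `qwen3:14b` work trigger a swap. Because the key is a two-way split rather than a per-model grouping, jobs for different non-resident models stay interleaved by age, which matches the "heuristic, not a scheduler" framing.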