foreman
A small, always-on daemon that fronts one Ollama target. It turns a single
Ollama instance into a queued, observable job endpoint: it polls the target's
installed models, serializes work through the target (managing model swaps),
assigns every job an ID, and reports progress + artifacts via webhooks. On the
wire it speaks native Ollama, so it doubles as a drop-in go-llm target.
foreman is the deliberately pared-down successor to peon-overseer. One daemon,
one target, one queue. The complexity that sank the predecessor — distributed
dispatch, claim leases, weighted fair queueing, capacity budgets, eligibility
gates — existed to coordinate multiple workers and is out of scope.
Resisting that creep is a first-class design goal. See docs/adr/ for the
decisions; this file summarizes them.
Topology (ADR-0001, ADR-0002)
orgrimmar: foreman (Go binary + SQLite queue + HTTP API + worker loop)
| HTTP over the trusted VLAN / Tailscale
v
M1 Pro Mac: Ollama only (models on disk, no foreman logic)
- One foreman process per Ollama target, configured by a single base URL (default: the Mac's Tailscale address). A second worker = a second foreman.
- foreman runs on the homelab, containerized, deployed via Komodo. The Mac stays a dumb appliance.
- The target is a laptop and may sleep. Unreachability is transient/recoverable, never fatal (poller degraded mode + job retry below).
API surfaces (ADR-0003, ADR-0004)
- Primary (transparent native Ollama passthrough): /api/chat, /api/tags, /api/ps. foreman looks exactly like an Ollama server. Synchronous: calls are queued internally but the HTTP response blocks until completion. SSE streaming supported (ADR-0012). This is the go-llm target path.
- Async jobs (POST /jobs, GET /jobs/{id}): body is a native-chat payload plus optional state_webhook_url. Returns 202 + { "job_id": "<ulid>" } immediately. For fire-and-forget orchestration callers.
- Optional OpenAI-compat (/v1/chat/completions + /v1/models): deferred; added only if a non-go-llm caller needs it.
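A minimal sketch of the async request and response bodies, assuming state_webhook_url sits at the top level next to the standard Ollama chat fields (model, messages). The type and function names (JobRequest, JobAccepted, EncodeJobRequest, DecodeJobAccepted) are hypothetical and not taken from the foreman codebase.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message is one entry of a native Ollama chat payload.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// JobRequest is a hypothetical shape for the POST /jobs body: a native chat
// payload plus the optional state_webhook_url (assumed to be a top-level field).
type JobRequest struct {
	Model           string    `json:"model"`
	Messages        []Message `json:"messages"`
	StateWebhookURL string    `json:"state_webhook_url,omitempty"`
}

// JobAccepted mirrors the 202 response body { "job_id": "<ulid>" }.
type JobAccepted struct {
	JobID string `json:"job_id"`
}

// EncodeJobRequest serializes the request body for POST /jobs.
func EncodeJobRequest(req JobRequest) (string, error) {
	b, err := json.Marshal(req)
	if err != nil {
		return "", fmt.Errorf("%w: encoding job request", err)
	}
	return string(b), nil
}

// DecodeJobAccepted parses the 202 response body.
func DecodeJobAccepted(body string) (JobAccepted, error) {
	var acc JobAccepted
	if err := json.Unmarshal([]byte(body), &acc); err != nil {
		return JobAccepted{}, fmt.Errorf("%w: decoding job response", err)
	}
	return acc, nil
}

func main() {
	body, err := EncodeJobRequest(JobRequest{
		Model:    "qwen3:14b",
		Messages: []Message{{Role: "user", Content: "hi"}},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body)
}
```

A caller that omits StateWebhookURL gets a body identical to a plain chat payload, and can poll GET /jobs/{id} instead of receiving events.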
Job lifecycle: queued → loading → working → done (+ terminal failed). A
connection failure to the target re-queues the job with backoff (bounded retries
guard poison jobs). IDs are ULIDs (sortable, timestamped).
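The lifecycle above can be sketched as a small transition table plus a retry decision. This is a sketch under stated assumptions: the retry limit and delays (maxAttempts, baseDelay, maxDelay) are hypothetical values, since the text only says retries and backoff are bounded, and the exact set of edges into failed is a guess.

```go
package main

import (
	"fmt"
	"time"
)

// State is a job lifecycle state as named in the text above.
type State string

const (
	Queued  State = "queued"
	Loading State = "loading"
	Working State = "working"
	Done    State = "done"
	Failed  State = "failed"
)

// transitions lists the allowed next states. The edges back to Queued model
// the "connection failure re-queues the job" rule. Done and Failed are terminal.
var transitions = map[State][]State{
	Queued:  {Loading},
	Loading: {Working, Queued, Failed},
	Working: {Done, Queued, Failed},
}

// CanTransition reports whether moving from one state to another is allowed.
func CanTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// Hypothetical bounds, chosen only for illustration.
const (
	maxAttempts = 5
	baseDelay   = 2 * time.Second
	maxDelay    = 2 * time.Minute
)

// NextAfterConnFailure decides what happens after a connection failure on the
// given attempt (1-based): re-queue with a capped exponential delay, or give
// up and mark the job failed so a poison job cannot loop forever.
func NextAfterConnFailure(attempt int) (State, time.Duration) {
	if attempt < 1 {
		attempt = 1
	}
	if attempt >= maxAttempts {
		return Failed, 0
	}
	delay := baseDelay << (attempt - 1)
	if delay > maxDelay {
		delay = maxDelay
	}
	return Queued, delay
}

func main() {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		state, delay := NextAfterConnFailure(attempt)
		fmt.Println(attempt, state, delay)
	}
}
```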
Webhooks & artifacts (ADR-0005, ADR-0006)
- On each state transition, POST a JSON event to state_webhook_url (job_id, state, previous_state, timestamp, model, attempt, and on completion result/artifacts/error).
- At-least-once delivery; callers must be idempotent on job_id+state; missed events reconcile via GET /jobs/{id}. Retry with bounded backoff. Optional X-Foreman-Signature HMAC when a webhook secret is configured.
- Artifacts are named typed blobs; the completion is always the artifact named completion. Inline under ~256KB, otherwise fetched via GET /jobs/{id}/artifacts/{name}.
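On the receiving side, the two caller obligations above (verify the signature, stay idempotent on job_id+state) might look like the following sketch. It assumes X-Foreman-Signature carries a hex-encoded HMAC-SHA256 of the raw request body; the text does not specify the algorithm or encoding, so treat that as a placeholder. Sign, Verify, and Deduper are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// Sign returns the hex HMAC-SHA256 of body under secret.
// Assumption: the real header format is not specified in this document.
func Sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify recomputes the MAC and compares it in constant time.
func Verify(secret, body []byte, signature string) bool {
	got, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

// Deduper remembers (job_id, state) pairs so that a redelivered event is
// processed only once. It is in-memory only; a real receiver would persist it.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

// NewDeduper returns an empty Deduper.
func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// FirstDelivery reports true only the first time a (jobID, state) pair is seen.
func (d *Deduper) FirstDelivery(jobID, state string) bool {
	key := jobID + "\x00" + state
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, dup := d.seen[key]; dup {
		return false
	}
	d.seen[key] = struct{}{}
	return true
}

func main() {
	secret := []byte("example-secret")
	body := []byte(`{"job_id":"01EXAMPLE","state":"done"}`)
	sig := Sign(secret, body)

	d := NewDeduper()
	fmt.Println(Verify(secret, body, sig), d.FirstDelivery("01EXAMPLE", "done"))
	fmt.Println(Verify(secret, body, sig), d.FirstDelivery("01EXAMPLE", "done"))
}
```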
Model inventory (ADR-0007)
- A poller hits the target's /api/tags (default ~30s) to keep an in-sync model list; backs foreman's /api/tags passthrough and job validation.
- /api/ps tells foreman what's resident, feeding the scheduler.
- Jobs naming an uninstalled model are rejected at submit time (one re-check on miss). Target unreachable → retain last-known list, mark degraded on a health endpoint; do not reject wholesale on a single failed poll.
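The degraded-mode and re-check rules above can be illustrated with a small in-memory inventory. This is a sketch, not the foreman implementation: Inventory, ErrUnknownModel, and the refresh callback (standing in for one extra /api/tags poll) are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrUnknownModel is returned when a job names a model the target lacks.
var ErrUnknownModel = errors.New("model not installed on target")

// Inventory holds the last-known model list and a degraded flag.
type Inventory struct {
	mu       sync.RWMutex
	models   map[string]struct{}
	degraded bool
}

// Update records one poll result. A failed poll keeps the last-known list and
// only marks the inventory degraded; a successful poll replaces the list.
func (inv *Inventory) Update(names []string, pollErr error) {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	if pollErr != nil {
		inv.degraded = true
		return
	}
	inv.models = make(map[string]struct{}, len(names))
	for _, n := range names {
		inv.models[n] = struct{}{}
	}
	inv.degraded = false
}

// Has reports whether the model is in the last-known list.
func (inv *Inventory) Has(name string) bool {
	inv.mu.RLock()
	defer inv.mu.RUnlock()
	_, ok := inv.models[name]
	return ok
}

// Degraded reports whether the most recent poll failed.
func (inv *Inventory) Degraded() bool {
	inv.mu.RLock()
	defer inv.mu.RUnlock()
	return inv.degraded
}

// Validate is the submit-time check: on a miss it polls once more via refresh
// (the "one re-check on miss") before rejecting the job.
func (inv *Inventory) Validate(model string, refresh func() ([]string, error)) error {
	if inv.Has(model) {
		return nil
	}
	names, err := refresh()
	inv.Update(names, err)
	if inv.Has(model) {
		return nil
	}
	return fmt.Errorf("%w: %s", ErrUnknownModel, model)
}

func main() {
	inv := &Inventory{}
	inv.Update([]string{"qwen3:14b"}, nil)
	inv.Update(nil, errors.New("target asleep")) // failed poll: list is kept
	fmt.Println(inv.Has("qwen3:14b"), inv.Degraded())
}
```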
Execution (ADR-0009)
- Concurrency against the target is 1. A single worker loop pulls a job, ensures the right model is resident, executes, records the result.
- Drain-by-model: finish every queued job for the currently-resident model before paying a swap (ORDER BY (model != current), created_at). A heuristic, not a scheduler. No priorities, fairness, or budgets.
- Pin residency with Ollama keep_alive; target runs OLLAMA_MAX_LOADED_MODELS=1 and OLLAMA_CONTEXT_LENGTH=8192+.
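To make the drain-by-model ordering concrete, here is the same ORDER BY (model != current), created_at expressed as an in-memory sort. In foreman the ordering would live in the SQLite query; this sketch only shows the resulting order. Job, its fields, and DrainOrder are hypothetical, and CreatedAt is assumed to be Unix milliseconds for simplicity.

```go
package main

import (
	"fmt"
	"sort"
)

// Job is a hypothetical queued job; CreatedAt is Unix milliseconds.
type Job struct {
	ID        string
	Model     string
	CreatedAt int64
}

// DrainOrder returns a copy of jobs sorted so that jobs for the currently
// resident model come first, and each group is ordered oldest first.
func DrainOrder(jobs []Job, current string) []Job {
	out := append([]Job(nil), jobs...)
	sort.SliceStable(out, func(i, j int) bool {
		iOther := out[i].Model != current
		jOther := out[j].Model != current
		if iOther != jOther {
			return !iOther // resident-model jobs sort ahead of the rest
		}
		return out[i].CreatedAt < out[j].CreatedAt
	})
	return out
}

func main() {
	queue := []Job{
		{ID: "1", Model: "qwen3:14b", CreatedAt: 100},
		{ID: "2", Model: "qwen3.6:35b", CreatedAt: 200},
		{ID: "3", Model: "qwen3:14b", CreatedAt: 300},
		{ID: "4", Model: "qwen3.6:35b", CreatedAt: 400},
	}
	for _, j := range DrainOrder(queue, "qwen3.6:35b") {
		fmt.Println(j.ID, j.Model)
	}
}
```

Once the resident model's jobs are exhausted, the head of the list is the oldest job for another model; the worker swaps to it and that model becomes current for the next pass.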
Persistence (ADR-0008)
- SQLite, WAL mode, pure-Go modernc.org/sqlite (no CGO → trivial Komodo builds).
- jobs + artifacts tables; single writer (the worker) + HTTP readers. TTL sweep for pruning. No external broker.
Models served
foreman serves any installed model named in a request; it does not own a
role→model mapping (the caller picks the model, e.g. go-llm .Model(...)).
Recommended roster to pull on the Mac (32GB, ~26-28GB usable, single-resident
swap):
- parse / data: qwen3:14b (~9GB, structured/JSON output).
- agent + code: qwen3.6:35b (MoE, ~3B active, ~20GB, fast tool-calling).
- Split a dedicated dense coder (qwen3.6:27b) off later only if 35b's code quality disappoints; it's bandwidth-bound and slow on this Mac.
- Verify exact tags against the Ollama library before pulling; the registry moves.
go-llm integration (ADR-0011)
Verified: llm.OllamaCloud(key, WithBaseURL(...)) already targets a private
authenticated native-Ollama endpoint — which foreman is. Integration is a thin
constructor, no new provider:
- Level 0 (now): llm.Foreman(baseURL, token).Model("qwen3.6:35b"), which delegates to the ollama provider; transparent, synchronous, full tool/think/stream.
- Level 1 (later): a foreman client package, a synchronous facade over the async /jobs surface (manages a webhook receiver, blocks to done).
- Level 2 (if needed): a dedicated provider.Provider surfacing job IDs/state.
Security (ADR-0010)
- Network is the boundary: target :11434 firewalled to foreman, and/or both on Tailscale. foreman is not on a public Traefik entrypoint.
- Optional static bearer: validate Authorization: Bearer <token>, which reuses the header go-llm already sends via the Foreman/OllamaCloud path.
- No Authentik/SSO, no per-caller identities for v1. No financial/identity data ever transits foreman.
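The optional static bearer check can be sketched as a small net/http middleware. This assumes an empty configured token means the check is disabled, which is one reading of "optional"; RequireBearer, okHandler, and statusFor are hypothetical names used only for this example.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// RequireBearer wraps next with a static bearer-token check. An empty token
// disables the check (assumption: that is what "optional" means here).
func RequireBearer(token string, next http.Handler) http.Handler {
	if token == "" {
		return next
	}
	want := []byte(token)
	const prefix = "Bearer "
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		header := r.Header.Get("Authorization")
		got := strings.TrimPrefix(header, prefix)
		// Constant-time comparison avoids leaking the token through timing.
		if !strings.HasPrefix(header, prefix) ||
			subtle.ConstantTimeCompare([]byte(got), want) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// okHandler stands in for the real API handlers.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// statusFor sends one in-memory request and returns the response status.
func statusFor(h http.Handler, authHeader string) int {
	req := httptest.NewRequest(http.MethodGet, "/api/tags", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := RequireBearer("s3cret", okHandler)
	fmt.Println(statusFor(h, "Bearer s3cret"))
	fmt.Println(statusFor(h, "Bearer wrong"))
	fmt.Println(statusFor(h, ""))
}
```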
Stack & conventions
- Go, stdlib net/http, minimal deps. SQLite via modernc.org/sqlite.
- No UI. HTTP API + small CLI only.
- Match go-llm house style: standard Go tabs; camelCase/PascalCase; check errors immediately and wrap with fmt.Errorf("%w: ...", err); imports stdlib → third-party → internal. The worker loop never panics; it logs, marks the job, continues.
- ADRs in docs/adr/ (one decision each, append/supersede). Living progress.md at repo root. Repo: gitea.stevedudenhoeffer.com.
Out of scope (anti-creep guardrails — ADR-0001)
Distributed dispatch, multiple workers, claim leases, weighted fair queueing, capacity budgets, eligibility gates, an auth framework / SSO, a GUI, and managing more than one target per daemon. Keep the ollama client behind a small interface so a future second backend is additive — but do not build for it now.
Milestones
- M0: native /api/chat passthrough + SQLite queue + single-worker loop, one model end to end, synchronous.
- M1: model poller + /api/tags / /api/ps, drain-by-model, async /jobs + state_webhook_url + artifacts + retry-on-unreachable, the CLI, and the llm.Foreman() constructor in go-llm.
- M2 (later): optional OpenAI-compat /v1, Level-1 client / dedicated provider, metrics.