foreman

A small, always-on daemon that fronts one Ollama target. It turns a single Ollama instance into a queued, observable job endpoint: it polls the target's installed models, serializes work through the target (managing model swaps), assigns every job an ID, and reports progress + artifacts via webhooks. On the wire it speaks native Ollama, so it doubles as a drop-in go-llm target.

foreman is the deliberately pared-down successor to peon-overseer. One daemon, one target, one queue. The complexity that sank the predecessor — distributed dispatch, claim leases, weighted fair queueing, capacity budgets, eligibility gates — existed to coordinate multiple workers and is out of scope. Resisting that creep is a first-class design goal. See docs/adr/ for the decisions; this file summarizes them.

Topology (ADR-0001, ADR-0002)

orgrimmar:  foreman  (Go binary + SQLite queue + HTTP API + worker loop)
              |  HTTP over the trusted VLAN / Tailscale
              v
M1 Pro Mac:  Ollama only  (models on disk, no foreman logic)

One foreman process per Ollama target, configured by a single base URL (default: the Mac's Tailscale address). A second worker = a second foreman.
foreman runs on the homelab, containerized, deployed via Komodo. The Mac stays a dumb appliance.
The target is a laptop and may sleep. Unreachability is transient/recoverable, never fatal (poller degraded mode + job retry below).

API surfaces (ADR-0003, ADR-0004)

Primary — transparent native Ollama passthrough: /api/chat, /api/tags, /api/ps. foreman looks exactly like an Ollama server. Synchronous: calls are queued internally but the HTTP response blocks until completion. SSE streaming supported (ADR-0012). This is the go-llm target path.
Async jobs — POST /jobs, GET /jobs/{id}: body is a native-chat payload plus optional state_webhook_url. Returns 202 + { "job_id": "<ulid>" } immediately. For fire-and-forget orchestration callers.
Optional OpenAI-compat /v1/chat/completions + /v1/models: deferred; added only if a non-go-llm caller needs it.
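
To make the async surface concrete, here is a minimal sketch of how a caller might build a POST /jobs request. The struct names, the helper newJobRequest, and the exact set of accepted fields are assumptions for illustration; only the /jobs path, the native-chat body, and state_webhook_url come from the description above. Error wrapping follows the "%w: ..." form listed under Stack & conventions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// chatMessage mirrors one entry of the native Ollama /api/chat "messages" array.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// jobRequest is a native-chat payload plus the optional webhook field.
// Hypothetical type: foreman may accept more native-chat fields than shown.
type jobRequest struct {
	Model           string        `json:"model"`
	Messages        []chatMessage `json:"messages"`
	StateWebhookURL string        `json:"state_webhook_url,omitempty"`
}

// newJobRequest builds, but does not send, a POST /jobs request.
func newJobRequest(baseURL string, job jobRequest) (*http.Request, error) {
	body, err := json.Marshal(job)
	if err != nil {
		return nil, fmt.Errorf("%w: encode job", err)
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/jobs", bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("%w: build request", err)
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}
```

A caller would pass the result to an http.Client and expect a 202 carrying the job_id, then either wait for webhook events or poll GET /jobs/{id}.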

Job lifecycle: queued → loading → working → done, plus a terminal failed state. A connection failure to the target re-queues the job with backoff; retries are bounded to guard against poison jobs. IDs are ULIDs (sortable, timestamped).
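
The lifecycle and retry rule can be sketched as a small transition table plus a capped exponential backoff. The transition map (in particular which states may fail or re-queue), the doubling schedule, and the names canTransition and retryDelay are assumptions for illustration; the source fixes only the state names and that retries are bounded.

```go
package main

import "time"

type state string

const (
	queued  state = "queued"
	loading state = "loading"
	working state = "working"
	done    state = "done"
	failed  state = "failed"
)

// next lists the allowed transitions. Re-queueing from loading or working
// models the retry after a connection failure to the target (assumed).
var next = map[state][]state{
	queued:  {loading, failed},
	loading: {working, queued, failed},
	working: {done, queued, failed},
}

// canTransition reports whether from -> to is allowed; done and failed
// are terminal and therefore have no outgoing edges.
func canTransition(from, to state) bool {
	for _, s := range next[from] {
		if s == to {
			return true
		}
	}
	return false
}

// retryDelay returns the wait before the given 1-based attempt, doubling
// from base and capped at ceiling. ok is false once attempts are exhausted,
// which is the point where a poison job would be marked failed.
func retryDelay(attempt, maxAttempts int, base, ceiling time.Duration) (delay time.Duration, ok bool) {
	if attempt < 1 || attempt > maxAttempts {
		return 0, false
	}
	delay = base
	for i := 1; i < attempt && delay < ceiling; i++ {
		delay *= 2
	}
	if delay > ceiling {
		delay = ceiling
	}
	return delay, true
}
```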

Webhooks & artifacts (ADR-0005, ADR-0006)

On each state transition, POST a JSON event to state_webhook_url (job_id, state, previous_state, timestamp, model, attempt, and on completion result / artifacts / error).
At-least-once delivery; callers must be idempotent on job_id+state; missed events reconcile via GET /jobs/{id}. Retry with bounded backoff. Optional X-Foreman-Signature HMAC when a webhook secret is configured.
Artifacts are named, typed blobs; the model's completion is always delivered as an artifact named completion. Blobs under ~256KB are inlined; larger ones are fetched via GET /jobs/{id}/artifacts/{name}.
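
A receiver has two jobs under at-least-once delivery: check the signature and ignore duplicates. The sketch below assumes X-Foreman-Signature carries a hex-encoded HMAC-SHA256 of the raw request body; the source does not specify the encoding or hash, so treat that as a placeholder. The helper names sign, verify, and seen are likewise hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// sign returns hex(HMAC-SHA256(secret, body)). The hash and encoding are
// assumptions; match whatever the deployed foreman actually emits.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify checks a received signature header against the raw body using a
// constant-time comparison.
func verify(secret, body []byte, header string) bool {
	got, err := hex.DecodeString(header)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

// seen tracks processed (job_id, state) pairs so redelivered events
// become no-ops.
type seen map[string]struct{}

// firstTime reports whether this (job_id, state) pair is new, and records it.
func (s seen) firstTime(jobID, state string) bool {
	key := jobID + ":" + state
	if _, dup := s[key]; dup {
		return false
	}
	s[key] = struct{}{}
	return true
}
```

An in-memory set is enough to show the idea; a receiver that must survive restarts would persist the keys instead.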

Model inventory (ADR-0007)

A poller hits the target's /api/tags every ~30s by default to keep the model list in sync; that list backs foreman's /api/tags passthrough and job validation.
/api/ps tells foreman what's resident, feeding the scheduler.
Jobs naming an uninstalled model are rejected at submit time (one re-check on miss). Target unreachable → retain last-known list, mark degraded on a health endpoint; do not reject wholesale on a single failed poll.
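
The degraded-mode rule and the one re-check on miss can be sketched as follows. The inventory type, its method names, and the errUnreachable sentinel are hypothetical; the repoll callback stands in for a real /api/tags request.

```go
package main

import (
	"errors"
	"sync"
)

// errUnreachable is a hypothetical sentinel for a failed poll.
var errUnreachable = errors.New("target unreachable")

// inventory holds the last-known model list and a degraded flag.
type inventory struct {
	mu       sync.RWMutex
	models   map[string]struct{}
	degraded bool
}

// apply records the outcome of one poll. On error the last-known list is
// kept and the inventory is only marked degraded.
func (inv *inventory) apply(names []string, err error) {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	if err != nil {
		inv.degraded = true
		return
	}
	inv.models = make(map[string]struct{}, len(names))
	for _, n := range names {
		inv.models[n] = struct{}{}
	}
	inv.degraded = false
}

// has reports whether the model is in the last-known list.
func (inv *inventory) has(model string) bool {
	inv.mu.RLock()
	defer inv.mu.RUnlock()
	_, ok := inv.models[model]
	return ok
}

// isDegraded is what a health endpoint would report.
func (inv *inventory) isDegraded() bool {
	inv.mu.RLock()
	defer inv.mu.RUnlock()
	return inv.degraded
}

// installed validates a job's model at submit time, re-polling exactly
// once on a miss before rejecting.
func (inv *inventory) installed(model string, repoll func() ([]string, error)) bool {
	if inv.has(model) {
		return true
	}
	inv.apply(repoll())
	return inv.has(model)
}
```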

Execution (ADR-0009)

Concurrency against the target is 1. A single worker loop pulls a job, ensures the right model is resident, executes, records the result.
Drain-by-model: finish every queued job for the currently-resident model before paying a swap (ORDER BY (model != current), created_at). A heuristic, not a scheduler. No priorities, fairness, or budgets.
Pin residency with Ollama keep_alive; the target runs with OLLAMA_MAX_LOADED_MODELS=1 and OLLAMA_CONTEXT_LENGTH set to 8192 or higher.
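
The drain-by-model ordering is easiest to see on an in-memory slice. This sketch mirrors ORDER BY (model != current), created_at in plain Go; the job struct and the drainOrder name are illustrative, and the real queue would presumably run the query in SQLite instead.

```go
package main

import "sort"

// job carries only the fields the ordering needs.
type job struct {
	ID        string
	Model     string
	CreatedAt int64 // any monotonically increasing timestamp
}

// drainOrder sorts jobs in place: jobs for the resident model first
// (model != current is false, which sorts before true), then oldest first
// within each group.
func drainOrder(jobs []job, current string) {
	sort.SliceStable(jobs, func(i, j int) bool {
		iOther := jobs[i].Model != current
		jOther := jobs[j].Model != current
		if iOther != jOther {
			return !iOther
		}
		return jobs[i].CreatedAt < jobs[j].CreatedAt
	})
}
```

The worker would take the first element after sorting. Note the trade-off the heuristic accepts: a steady stream of jobs for the resident model can delay jobs for other models, since there is no fairness term.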

Persistence (ADR-0008)

SQLite, WAL mode, pure-Go modernc.org/sqlite (no CGO → trivial Komodo builds).
jobs + artifacts tables; single writer (the worker) + HTTP readers. TTL sweep for pruning. No external broker.

Models served

foreman serves any installed model named in a request; it does not own a role→model mapping (the caller picks the model, e.g. go-llm .Model(...)). Recommended roster to pull on the Mac (32GB, ~26-28GB usable, single-resident swap):

parse / data — qwen3:14b (~9GB, structured/JSON output).
agent + code — qwen3.6:35b (MoE, ~3B active, ~20GB, fast tool-calling).
Split a dedicated dense coder (qwen3.6:27b) off later only if 35b's code quality disappoints; the dense model is bandwidth-bound and slow on this Mac.
Verify exact tags against the Ollama library before pulling; the registry moves.

go-llm integration (ADR-0011)

Verified: llm.OllamaCloud(key, WithBaseURL(...)) already targets a private authenticated native-Ollama endpoint — which foreman is. Integration is a thin constructor, no new provider:

Level 0 (now): llm.Foreman(baseURL, token).Model("qwen3.6:35b") — delegates to the ollama provider; transparent, synchronous, full tool/think/stream.
Level 1 (later): a foreman client package — synchronous facade over the async /jobs surface (manages a webhook receiver, blocks to done).
Level 2 (if needed): a dedicated provider.Provider surfacing job IDs/state.

Security (ADR-0010)

Network is the boundary: target :11434 firewalled to foreman, and/or both on Tailscale. foreman is not on a public Traefik entrypoint.
Optional static bearer: validate Authorization: Bearer <token>, which reuses the header go-llm already sends via the Foreman/OllamaCloud path.
No Authentik/SSO, no per-caller identities for v1. No financial/identity data ever transits foreman.
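
The optional static bearer check fits in a few lines of stdlib middleware. This is a sketch under the assumption that an empty configured token means the check is disabled; validBearer and requireBearer are hypothetical names.

```go
package main

import (
	"crypto/subtle"
	"net/http"
	"strings"
)

// validBearer reports whether an Authorization header value carries the
// expected token, compared in constant time.
func validBearer(header, token string) bool {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	return subtle.ConstantTimeCompare([]byte(got), []byte(token)) == 1
}

// requireBearer wraps next with the static token check. An empty token
// disables the check (assumed meaning of "optional").
func requireBearer(token string, next http.Handler) http.Handler {
	if token == "" {
		return next
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !validBearer(r.Header.Get("Authorization"), token) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```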

Stack & conventions

Go, stdlib net/http, minimal deps. SQLite via modernc.org/sqlite.
No UI. HTTP API + small CLI only.
Match go-llm house style: standard Go tabs; camelCase/PascalCase; check errors immediately and wrap with fmt.Errorf("%w: ...", err); imports stdlib → third-party → internal. The worker loop never panics; it logs, marks the job, continues.
ADRs in docs/adr/ (one decision each, append/supersede). Living progress.md at repo root. Repo: gitea.stevedudenhoeffer.com.
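
One way to honor "the worker loop never panics" is to run each job behind a recover and turn a panic into an ordinary error that the loop can log and record on the job. The names runSafely, isUnreachable, and errTargetUnreachable are hypothetical; the wrapping uses the "%w: ..." form stated above.

```go
package main

import (
	"errors"
	"fmt"
)

// errTargetUnreachable is a hypothetical sentinel the worker could use to
// decide between re-queueing and failing a job.
var errTargetUnreachable = errors.New("target unreachable")

// runSafely executes one job. A panic inside run is recovered and returned
// as an error, so the caller can log it, mark the job, and continue.
func runSafely(jobID string, run func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic in job %s: %v", jobID, r)
		}
	}()
	if runErr := run(); runErr != nil {
		return fmt.Errorf("%w: job %s", runErr, jobID)
	}
	return nil
}

// isUnreachable reports whether err wraps the unreachable sentinel.
func isUnreachable(err error) bool {
	return errors.Is(err, errTargetUnreachable)
}
```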

Out of scope (anti-creep guardrails — ADR-0001)

Distributed dispatch, multiple workers, claim leases, weighted fair queueing, capacity budgets, eligibility gates, an auth framework / SSO, a GUI, and managing more than one target per daemon. Keep the ollama client behind a small interface so a future second backend is additive — but do not build for it now.

Milestones

M0 — native /api/chat passthrough + SQLite queue + single-worker loop, one model end to end, synchronous.
M1 — model poller + /api/tags and /api/ps, drain-by-model, async /jobs + state_webhook_url + artifacts + retry-on-unreachable, the CLI, and the llm.Foreman() constructor in go-llm.
M2 (later) — optional OpenAI-compat /v1, Level-1 client / dedicated provider, metrics.
