initial commit

2026-05-23 16:41:20 -04:00
commit 8fde024281
15 changed files with 803 additions and 0 deletions
@@ -0,0 +1,144 @@
+# foreman
+
+A small, always-on daemon that fronts **one** Ollama target. It turns a single
+Ollama instance into a queued, observable job endpoint: it polls the target's
+installed models, serializes work through the target (managing model swaps),
+assigns every job an ID, and reports progress + artifacts via webhooks. On the
+wire it speaks **native Ollama**, so it doubles as a drop-in `go-llm` target.
+
+foreman is the deliberately pared-down successor to `peon-overseer`. One daemon,
+one target, one queue. The complexity that sank the predecessor — distributed
+dispatch, claim leases, weighted fair queueing, capacity budgets, eligibility
+gates — existed to coordinate *multiple* workers and is **out of scope**.
+Resisting that creep is a first-class design goal. See `docs/adr/` for the
+decisions; this file summarizes them.
+
+## Topology (ADR-0001, ADR-0002)
+
+```
+orgrimmar:  foreman  (Go binary + SQLite queue + HTTP API + worker loop)
+              |  HTTP over the trusted VLAN / Tailscale
+              v
+M1 Pro Mac:  Ollama only  (models on disk, no foreman logic)
+```
+
+- One foreman process per Ollama target, configured by a single base URL
+  (default: the Mac's Tailscale address). A second worker = a second foreman.
+- foreman runs on the homelab, containerized, deployed via Komodo. The Mac stays
+  a dumb appliance.
+- The target is a laptop and may sleep. Unreachability is transient and
+  recoverable, never fatal (see the poller's degraded mode and job retry below).
+
+## API surfaces (ADR-0003, ADR-0004)
+
+1. **Primary — transparent native Ollama passthrough:** `/api/chat`, `/api/tags`,
+   `/api/ps`. foreman looks exactly like an Ollama server. Synchronous: calls are
+   queued internally but the HTTP response blocks until completion. SSE streaming
+   supported (ADR-0012). This is the `go-llm` target path.
+2. **Async jobs — `POST /jobs`, `GET /jobs/{id}`:** body is a native-chat payload
+   plus optional `state_webhook_url`. Returns `202` + `{ "job_id": "<ulid>" }`
+   immediately. For fire-and-forget orchestration callers.
+3. **Optional OpenAI-compat `/v1/chat/completions` + `/v1/models`:** deferred;
+   added only if a non-go-llm caller needs it.
+
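The async surface above can be exercised with a small client. This is a minimal sketch rather than foreman's actual client: the `JobRequest` type, `SubmitJob`, `Demo`, and the placeholder job ID are hypothetical names introduced for illustration, and the stub server merely stands in for foreman. Only the `POST /jobs` path, the `state_webhook_url` field, the `202` status, and the `job_id` response key come from the text.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Message mirrors one entry of a native Ollama chat payload.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// JobRequest is a native chat payload plus the optional webhook field.
// The Go type itself is an assumption made for this sketch.
type JobRequest struct {
	Model           string    `json:"model"`
	Messages        []Message `json:"messages"`
	StateWebhookURL string    `json:"state_webhook_url,omitempty"`
}

// SubmitJob posts the request to /jobs and returns the job ID from the 202 body.
func SubmitJob(baseURL string, req JobRequest) (string, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return "", fmt.Errorf("encode job: %w", err)
	}
	resp, err := http.Post(baseURL+"/jobs", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", fmt.Errorf("post job: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		return "", fmt.Errorf("post job: unexpected status %s", resp.Status)
	}
	var out struct {
		JobID string `json:"job_id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", fmt.Errorf("decode response: %w", err)
	}
	return out.JobID, nil
}

// Demo runs SubmitJob against a local stub that stands in for foreman.
func Demo() (string, error) {
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/jobs" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusAccepted)
		fmt.Fprint(w, `{"job_id":"PLACEHOLDER-ULID"}`) // placeholder, not a real ULID
	}))
	defer stub.Close()

	return SubmitJob(stub.URL, JobRequest{
		Model:           "qwen3:14b",
		Messages:        []Message{{Role: "user", Content: "hello"}},
		StateWebhookURL: "http://caller.invalid/hook", // hypothetical receiver
	})
}

func main() {
	id, err := Demo()
	fmt.Println(id, err)
}
```

A caller that omits `state_webhook_url` would simply poll `GET /jobs/{id}` instead of waiting for events.
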
+Job lifecycle: `queued → loading → working → done` (+ terminal `failed`). A
+connection failure to the target re-queues the job with backoff (bounded retries
+guard poison jobs). IDs are ULIDs (sortable, timestamped).
+
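The lifecycle and re-queue rule above can be sketched as a transition table plus a capped backoff. The exact edge set, the attempt limit, and the delay values below are assumptions chosen for illustration; the text only names the states and says retries are bounded and backed off.

```go
package main

import (
	"fmt"
	"time"
)

// State is a job lifecycle state as named in the README.
type State string

const (
	Queued  State = "queued"
	Loading State = "loading"
	Working State = "working"
	Done    State = "done"
	Failed  State = "failed"
)

// next lists the allowed transitions. The edges back to Queued model the
// re-queue after a connection failure; the precise edge set is an assumption.
var next = map[State][]State{
	Queued:  {Loading, Failed},
	Loading: {Working, Queued, Failed},
	Working: {Done, Queued, Failed},
}

// CanTransition reports whether from -> to is an allowed edge.
// Done and Failed have no outgoing edges, so they are terminal.
func CanTransition(from, to State) bool {
	for _, s := range next[from] {
		if s == to {
			return true
		}
	}
	return false
}

// Hypothetical retry policy values, not taken from the ADRs.
const (
	maxAttempts = 5
	baseDelay   = 2 * time.Second
	maxDelay    = 10 * time.Second
)

// RetryDelay returns how long to wait after the given number of failed
// attempts, and false once the retry budget is spent (the poison-job guard).
func RetryDelay(failures int) (time.Duration, bool) {
	if failures < 1 || failures >= maxAttempts {
		return 0, false
	}
	d := baseDelay << (failures - 1) // doubles with each failure
	if d > maxDelay {
		d = maxDelay
	}
	return d, true
}

func main() {
	fmt.Println(CanTransition(Working, Queued), CanTransition(Done, Queued))
	for f := 1; f <= maxAttempts; f++ {
		d, ok := RetryDelay(f)
		fmt.Println(f, d, ok)
	}
}
```

When `RetryDelay` reports false, a worker following this sketch would move the job to `failed` instead of re-queueing it.
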
+## Webhooks & artifacts (ADR-0005, ADR-0006)
+
+- On each state transition, POST a JSON event to `state_webhook_url`
+  (`job_id`, `state`, `previous_state`, `timestamp`, `model`, `attempt`, and on
+  completion `result` / `artifacts` / `error`).
+- At-least-once delivery; callers must be idempotent on the (`job_id`, `state`)
+  pair, and can reconcile missed events via `GET /jobs/{id}`. Delivery is retried
+  with bounded backoff. An optional `X-Foreman-Signature` HMAC is sent when a
+  webhook secret is configured.
+- Artifacts are named, typed blobs; the completion is always the artifact named
+  `completion`. Artifacts under ~256KB are inlined; larger ones are fetched via
+  `GET /jobs/{id}/artifacts/{name}`.
+
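On the receiving side, the signature check and the idempotency requirement above might look like the following. The README does not specify the signature scheme, so hex-encoded HMAC-SHA256 over the raw request body is an assumption here, and `Sign`, `Verify`, and `Dedup` are hypothetical helper names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Sign returns the hex-encoded HMAC-SHA256 of body under secret.
// Assumption: this is the value carried in X-Foreman-Signature.
func Sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify recomputes the MAC and compares it in constant time.
// A malformed (non-hex) signature is rejected.
func Verify(secret, body []byte, got string) bool {
	want, err := hex.DecodeString(got)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(mac.Sum(nil), want)
}

// Dedup remembers which (job_id, state) pairs were already handled, so a
// redelivered event becomes a no-op. It is not safe for concurrent use.
type Dedup struct{ seen map[string]bool }

// NewDedup returns an empty Dedup.
func NewDedup() *Dedup { return &Dedup{seen: make(map[string]bool)} }

// FirstTime reports true only the first time a pair is observed.
func (d *Dedup) FirstTime(jobID, state string) bool {
	key := jobID + "\x00" + state
	if d.seen[key] {
		return false
	}
	d.seen[key] = true
	return true
}

func main() {
	secret := []byte("webhook-secret") // hypothetical shared secret
	body := []byte(`{"job_id":"PLACEHOLDER-ULID","state":"done"}`)
	sig := Sign(secret, body)

	d := NewDedup()
	for i := 0; i < 2; i++ { // simulate an at-least-once redelivery
		fmt.Println(Verify(secret, body, sig), d.FirstTime("PLACEHOLDER-ULID", "done"))
	}
}
```

A production receiver would persist the seen pairs instead of holding them in memory, since a restart would otherwise reopen the duplicate window.
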
+## Model inventory (ADR-0007)
+
+- A poller hits the target's `/api/tags` (every ~30s by default) to keep the model
+  list in sync; that list backs foreman's `/api/tags` passthrough and job validation.
+- `/api/ps` tells foreman what's resident, feeding the scheduler.
+- Jobs naming an uninstalled model are rejected at submit time, after one
+  re-check on a miss. If the target is unreachable, retain the last-known list and
+  mark foreman degraded on a health endpoint; do not reject wholesale on a single
+  failed poll.
+
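The inventory behaviour above (keep the last-known list on a failed poll, re-check once on a miss) can be sketched as a small cache. The `Inventory` type, its methods, and the two sentinel errors are hypothetical names for this illustration; the `refresh` callback stands in for whatever fetches `/api/tags`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Sentinel errors introduced for this sketch.
var (
	ErrUnknownModel = errors.New("model not installed")
	ErrUnreachable  = errors.New("target unreachable")
)

// Inventory caches the target's installed model names.
type Inventory struct {
	mu       sync.RWMutex
	models   map[string]bool
	degraded bool
}

// Update records the outcome of one poll. On error the last-known list is
// kept and only the degraded flag flips; a successful poll replaces the list.
func (inv *Inventory) Update(names []string, err error) {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	if err != nil {
		inv.degraded = true
		return
	}
	m := make(map[string]bool, len(names))
	for _, n := range names {
		m[n] = true
	}
	inv.models = m
	inv.degraded = false
}

// Has reports whether the model is in the last-known list.
func (inv *Inventory) Has(name string) bool {
	inv.mu.RLock()
	defer inv.mu.RUnlock()
	return inv.models[name]
}

// Degraded reports whether the most recent poll failed.
func (inv *Inventory) Degraded() bool {
	inv.mu.RLock()
	defer inv.mu.RUnlock()
	return inv.degraded
}

// Validate accepts a known model immediately and otherwise refreshes exactly
// once before rejecting.
func (inv *Inventory) Validate(model string, refresh func() ([]string, error)) error {
	if inv.Has(model) {
		return nil
	}
	inv.Update(refresh())
	if inv.Has(model) {
		return nil
	}
	return fmt.Errorf("%w: %s", ErrUnknownModel, model)
}

func main() {
	inv := &Inventory{}
	inv.Update([]string{"qwen3:14b"}, nil)
	inv.Update(nil, ErrUnreachable) // failed poll: list kept, degraded set
	fmt.Println(inv.Has("qwen3:14b"), inv.Degraded())

	err := inv.Validate("qwen3.6:35b", func() ([]string, error) {
		return []string{"qwen3:14b", "qwen3.6:35b"}, nil
	})
	fmt.Println(err, inv.Degraded())
}
```

In this sketch a job for a model missing from the last-known list is still rejected while the target is unreachable; only jobs for already-known models survive an outage.
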
+## Execution (ADR-0009)
+
+- **Concurrency against the target is 1.** A single worker loop pulls a job,
+  ensures the right model is resident, executes, records the result.
+- **Drain-by-model:** finish every queued job for the currently-resident model
+  before paying a swap (`ORDER BY (model != current), created_at`). A heuristic,
+  not a scheduler. No priorities, fairness, or budgets.
+- Pin residency with Ollama `keep_alive`; the target runs with
+  `OLLAMA_MAX_LOADED_MODELS=1` and `OLLAMA_CONTEXT_LENGTH` set to 8192 or higher.
+
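The drain-by-model ordering above, `ORDER BY (model != current), created_at`, can be mirrored in memory to see what it does to a mixed queue. This is an illustrative sketch, not the worker's code: the `Job` struct and `DrainOrder` are hypothetical, and `CreatedAt` is a plain integer timestamp to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Job carries only the fields the ordering needs.
type Job struct {
	ID        string
	Model     string
	CreatedAt int64 // e.g. Unix milliseconds
}

// DrainOrder returns a sorted copy of queue: jobs for the resident model
// first, then everything else, each group oldest first.
func DrainOrder(queue []Job, current string) []Job {
	out := append([]Job(nil), queue...) // leave the caller's slice untouched
	sort.SliceStable(out, func(i, j int) bool {
		swapI, swapJ := out[i].Model != current, out[j].Model != current
		if swapI != swapJ {
			return !swapI // "no swap needed" sorts before "swap needed"
		}
		return out[i].CreatedAt < out[j].CreatedAt
	})
	return out
}

func main() {
	queue := []Job{
		{ID: "a", Model: "qwen3:14b", CreatedAt: 1},
		{ID: "b", Model: "qwen3.6:35b", CreatedAt: 2},
		{ID: "c", Model: "qwen3:14b", CreatedAt: 3},
		{ID: "d", Model: "qwen3.6:35b", CreatedAt: 4},
	}
	var ids []string
	for _, j := range DrainOrder(queue, "qwen3.6:35b") {
		ids = append(ids, j.ID)
	}
	fmt.Println(strings.Join(ids, " ")) // prints "b d a c"
}
```

With `qwen3.6:35b` resident, both of its jobs run before the older `qwen3:14b` job, so the queue pays one swap instead of three. The trade-off is that a steady stream of jobs for the resident model can delay the others, which the README accepts by calling this a heuristic.
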
+## Persistence (ADR-0008)
+
+- SQLite, WAL mode, pure-Go `modernc.org/sqlite` (no CGO → trivial Komodo builds).
+- `jobs` + `artifacts` tables; single writer (the worker) + HTTP readers. TTL
+  sweep for pruning. No external broker.
+
+## Models served
+
+foreman serves **any installed model** named in a request; it does not own a
+role→model mapping (the caller picks the model, e.g. go-llm `.Model(...)`).
+Recommended roster to pull on the Mac (32GB, ~26-28GB usable, single-resident
+swap):
+
+- **parse / data** — `qwen3:14b` (~9GB, structured/JSON output).
+- **agent + code** — `qwen3.6:35b` (MoE, ~3B active, ~20GB, fast tool-calling).
+- Split a dedicated dense coder (`qwen3.6:27b`) off later only if `35b`'s code
+  quality disappoints; the dense model is bandwidth-bound and slow on this Mac.
+- Verify exact tags against the Ollama library before pulling; the registry moves.
+
+## go-llm integration (ADR-0011)
+
+Verified: `llm.OllamaCloud(key, WithBaseURL(...))` already targets a private
+authenticated native-Ollama endpoint — which foreman is. Integration is a thin
+constructor, no new provider:
+
+- **Level 0 (now):** `llm.Foreman(baseURL, token).Model("qwen3.6:35b")` — delegates
+  to the ollama provider; transparent, synchronous, full tool/think/stream.
+- **Level 1 (later):** a `foreman` client package — synchronous facade over the
+  async `/jobs` surface (manages a webhook receiver, blocks to done).
+- **Level 2 (if needed):** a dedicated `provider.Provider` surfacing job IDs/state.
+
+## Security (ADR-0010)
+
+- Network is the boundary: target `:11434` firewalled to foreman, and/or both on
+  Tailscale. foreman is **not** on a public Traefik entrypoint.
+- Optional static bearer: validate `Authorization: Bearer <token>`, which reuses
+  the header `go-llm` already sends via the Foreman/OllamaCloud path.
+- No Authentik/SSO, no per-caller identities for v1. No financial/identity data
+  ever transits foreman.
+
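The optional static bearer check above fits in a few lines of stdlib middleware. This is a sketch under assumptions: `RequireBearer` and `StatusFor` are hypothetical names, an empty token is taken to mean "auth disabled", and the constant-time comparison is a precaution the README does not mention.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// RequireBearer wraps next with a static-token check. An empty token leaves
// the handler unprotected, matching the "optional" wording.
func RequireBearer(token string, next http.Handler) http.Handler {
	if token == "" {
		return next
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		const prefix = "Bearer "
		h := r.Header.Get("Authorization")
		got := strings.TrimPrefix(h, prefix)
		// Reject a missing prefix, then compare without leaking timing.
		if !strings.HasPrefix(h, prefix) ||
			subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// StatusFor sends one in-memory request through the middleware and returns
// the resulting status code. It exists only to demonstrate the behaviour.
func StatusFor(token, authHeader string) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(http.MethodGet, "/api/tags", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	RequireBearer(token, ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(StatusFor("tok", "Bearer tok"))  // matching token
	fmt.Println(StatusFor("tok", "Bearer nope")) // wrong token
	fmt.Println(StatusFor("tok", ""))            // header missing
	fmt.Println(StatusFor("", ""))               // auth disabled
}
```

Because the check is a plain `http.Handler` wrapper, it can sit in front of both the passthrough routes and `/jobs` without either knowing about it.
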
+## Stack & conventions
+
+- Go, stdlib `net/http`, minimal deps. SQLite via `modernc.org/sqlite`.
+- No UI. HTTP API + small CLI only.
+- Match go-llm house style: standard Go tabs; `camelCase`/`PascalCase`; check
+  errors immediately and wrap with `fmt.Errorf("%w: ...", err)`; imports stdlib →
+  third-party → internal. The worker loop never panics; it logs, marks the job,
+  continues.
+- ADRs in `docs/adr/` (one decision each, append/supersede). Living `progress.md`
+  at repo root. Repo: `gitea.stevedudenhoeffer.com`.
+
+## Out of scope (anti-creep guardrails — ADR-0001)
+
+Distributed dispatch, multiple workers, claim leases, weighted fair queueing,
+capacity budgets, eligibility gates, an auth framework / SSO, a GUI, and managing
+more than one target per daemon. Keep the ollama client behind a small interface
+so a future second backend is additive — but do not build for it now.
+
+## Milestones
+
+- **M0** — native `/api/chat` passthrough + SQLite queue + single-worker loop, one
+  model end to end, synchronous.
+- **M1** — model poller + `/api/tags`/`/api/ps`, drain-by-model, async `/jobs` +
+  `state_webhook_url` + artifacts + retry-on-unreachable, the CLI, and the
+  `llm.Foreman()` constructor in go-llm.
+- **M2 (later)** — optional OpenAI-compat `/v1`, Level-1 client / dedicated
+  provider, metrics.