initial commit

2026-05-23 16:41:20 -04:00
commit 8fde024281
15 changed files with 803 additions and 0 deletions
@@ -0,0 +1,27 @@
+# Compiled binary (cmd/foreman)
+/foreman
+/dist/
+*.exe
+
+# Test & coverage output
+*.out
+*.test
+coverage.*
+
+# SQLite queue + artifacts (local dev data — never commit)
+*.db
+*.db-wal
+*.db-shm
+*.sqlite
+*.sqlite3
+
+# Local config / secrets (commit .env.example, not .env)
+.env
+.env.local
+*.local
+
+# Editor / OS cruft
+.DS_Store
+.idea/
+.vscode/
+*.swp
@@ -0,0 +1,144 @@
+# foreman
+
+A small, always-on daemon that fronts **one** Ollama target. It turns a single
+Ollama instance into a queued, observable job endpoint: it polls the target's
+installed models, serializes work through the target (managing model swaps),
+assigns every job an ID, and reports progress + artifacts via webhooks. On the
+wire it speaks **native Ollama**, so it doubles as a drop-in `go-llm` target.
+
+foreman is the deliberately pared-down successor to `peon-overseer`. One daemon,
+one target, one queue. The complexity that sank the predecessor — distributed
+dispatch, claim leases, weighted fair queueing, capacity budgets, eligibility
+gates — existed to coordinate *multiple* workers and is **out of scope**.
+Resisting that creep is a first-class design goal. See `docs/adr/` for the
+decisions; this file summarizes them.
+
+## Topology (ADR-0001, ADR-0002)
+
+```
+orgrimmar:  foreman  (Go binary + SQLite queue + HTTP API + worker loop)
+              |  HTTP over the trusted VLAN / Tailscale
+              v
+M1 Pro Mac:  Ollama only  (models on disk, no foreman logic)
+```
+
+- One foreman process per Ollama target, configured by a single base URL
+  (default: the Mac's Tailscale address). A second worker = a second foreman.
+- foreman runs on the homelab, containerized, deployed via Komodo. The Mac stays
+  a dumb appliance.
+- The target is a laptop and may sleep. Unreachability is transient/recoverable,
+  never fatal (poller degraded mode + job retry below).
+
+## API surfaces (ADR-0003, ADR-0004)
+
+1. **Primary — transparent native Ollama passthrough:** `/api/chat`, `/api/tags`,
+   `/api/ps`. foreman looks exactly like an Ollama server. Synchronous: calls are
+   queued internally but the HTTP response blocks until completion. Ollama's
+   native streaming (newline-delimited JSON) is supported (ADR-0012). This is
+   the `go-llm` target path.
+2. **Async jobs — `POST /jobs`, `GET /jobs/{id}`:** body is a native-chat payload
+   plus optional `state_webhook_url`. Returns `202` + `{ "job_id": "<ulid>" }`
+   immediately. For fire-and-forget orchestration callers.
+3. **Optional OpenAI-compat `/v1/chat/completions` + `/v1/models`:** deferred;
+   added only if a non-go-llm caller needs it.
+
+Job lifecycle: `queued → loading → working → done` (+ terminal `failed`). A
+connection failure to the target re-queues the job with backoff (bounded retries
+guard poison jobs). IDs are ULIDs (sortable, timestamped).
+
+## Webhooks & artifacts (ADR-0005, ADR-0006)
+
+- On each state transition, POST a JSON event to `state_webhook_url`
+  (`job_id`, `state`, `previous_state`, `timestamp`, `model`, `attempt`, and on
+  completion `result` / `artifacts` / `error`).
+- At-least-once delivery; callers must be idempotent on `job_id`+`state`; missed
+  events reconcile via `GET /jobs/{id}`. Retry with bounded backoff. Optional
+  `X-Foreman-Signature` HMAC when a webhook secret is configured.
+- Artifacts are named typed blobs; the completion is always artifact `completion`.
+  Inline under ~256KB, otherwise fetched via `GET /jobs/{id}/artifacts/{name}`.
+
+## Model inventory (ADR-0007)
+
+- A poller hits the target's `/api/tags` (default ~30s) to keep an in-sync model
+  list; backs foreman's `/api/tags` passthrough and job validation.
+- `/api/ps` tells foreman what's resident, feeding the scheduler.
+- Jobs naming an uninstalled model are rejected at submit time (one re-check on
+  miss). Target unreachable → retain last-known list, mark degraded on a health
+  endpoint; do not reject wholesale on a single failed poll.
+
+## Execution (ADR-0009)
+
+- **Concurrency against the target is 1.** A single worker loop pulls a job,
+  ensures the right model is resident, executes, records the result.
+- **Drain-by-model:** finish every queued job for the currently-resident model
+  before paying a swap (`ORDER BY (model != current), created_at`). A heuristic,
+  not a scheduler. No priorities, fairness, or budgets.
+- Pin residency with Ollama `keep_alive`; the target runs with
+  `OLLAMA_MAX_LOADED_MODELS=1` and `OLLAMA_CONTEXT_LENGTH` set to 8192 or higher.
+
+## Persistence (ADR-0008)
+
+- SQLite, WAL mode, pure-Go `modernc.org/sqlite` (no CGO → trivial Komodo builds).
+- `jobs` + `artifacts` tables; single writer (the worker) + HTTP readers. TTL
+  sweep for pruning. No external broker.
+
+## Models served
+
+foreman serves **any installed model** named in a request; it does not own a
+role→model mapping (the caller picks the model, e.g. go-llm `.Model(...)`).
+Recommended roster to pull on the Mac (32GB, ~26-28GB usable, single-resident
+swap):
+
+- **parse / data** — `qwen3:14b` (~9GB, structured/JSON output).
+- **agent + code** — `qwen3.6:35b` (MoE, ~3B active, ~20GB, fast tool-calling).
+- Split a dedicated dense coder (`qwen3.6:27b`) off later only if `35b`'s code
+  quality disappoints; it's bandwidth-bound and slow on this Mac.
+- Verify exact tags against the Ollama library before pulling; the registry moves.
+
+## go-llm integration (ADR-0011)
+
+Verified: `llm.OllamaCloud(key, WithBaseURL(...))` already targets a private
+authenticated native-Ollama endpoint — which foreman is. Integration is a thin
+constructor, no new provider:
+
+- **Level 0 (now):** `llm.Foreman(baseURL, token).Model("qwen3.6:35b")` — delegates
+  to the ollama provider; transparent, synchronous, full tool/think/stream.
+- **Level 1 (later):** a `foreman` client package — synchronous facade over the
+  async `/jobs` surface (manages a webhook receiver, blocks to done).
+- **Level 2 (if needed):** a dedicated `provider.Provider` surfacing job IDs/state.
+
+## Security (ADR-0010)
+
+- Network is the boundary: target `:11434` firewalled to foreman, and/or both on
+  Tailscale. foreman is **not** on a public Traefik entrypoint.
+- Optional static bearer: validate `Authorization: Bearer <token>`, which reuses
+  the header `go-llm` already sends via the Foreman/OllamaCloud path.
+- No Authentik/SSO, no per-caller identities for v1. No financial/identity data
+  ever transits foreman.
+
+## Stack & conventions
+
+- Go, stdlib `net/http`, minimal deps. SQLite via `modernc.org/sqlite`.
+- No UI. HTTP API + small CLI only.
+- Match go-llm house style: standard Go tabs; `camelCase`/`PascalCase`; check
+  errors immediately and wrap with `fmt.Errorf("...: %w", err)`; imports stdlib →
+  third-party → internal. The worker loop never panics; it logs, marks the job,
+  and continues.
+- ADRs in `docs/adr/` (one decision each, append/supersede). Living `progress.md`
+  at repo root. Repo: `gitea.stevedudenhoeffer.com`.
+
+## Out of scope (anti-creep guardrails — ADR-0001)
+
+Distributed dispatch, multiple workers, claim leases, weighted fair queueing,
+capacity budgets, eligibility gates, an auth framework / SSO, a GUI, and managing
+more than one target per daemon. Keep the ollama client behind a small interface
+so a future second backend is additive — but do not build for it now.
+
+## Milestones
+
+- **M0** — native `/api/chat` passthrough + SQLite queue + single-worker loop, one
+  model end to end, synchronous.
+- **M1** — model poller + `/api/tags`/`/api/ps`, drain-by-model, async `/jobs` +
+  `state_webhook_url` + artifacts + retry-on-unreachable, the CLI, and the
+  `llm.Foreman()` constructor in go-llm.
+- **M2 (later)** — optional OpenAI-compat `/v1`, Level-1 client / dedicated
+  provider, metrics.
@@ -0,0 +1,37 @@
+# ADR-0001: One daemon per Ollama target
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+`peon-overseer` ballooned because it coordinated *multiple* workers from a
+central service: pull-based dispatch, claim leases, weighted fair queueing,
+capacity budgets, eligibility gates. All of that complexity existed solely to
+arbitrate shared workers. We want none of it back.
+
+The system being built fronts inference hardware (initially the M1 Pro running
+Ollama) and exposes it as a managed job endpoint.
+
+## Decision
+
+Each `foreman` process is bound to **exactly one** Ollama target, configured by a
+single base URL. One target = one daemon = one queue. There is no cross-daemon
+awareness and no shared state between daemons.
+
+If a second worker is added later (the 4090 box, the M5 Max), it gets its own
+`foreman` instance. Any fan-out across workers is the concern of a *separate*
+higher-level router that talks to multiple foreman instances — explicitly out of
+scope here and not to be anticipated in this codebase.
+
+## Consequences
+
+- The daemon is radically simple: one target, one serialized work stream.
+- Horizontal scale is "run another daemon," an operational act, not a code change.
+- No lease/fairness/budget machinery is permitted in this repo. If a change
+  starts to require it, that is the signal that the multi-worker router (a
+  different project) is what's actually needed.
+
+## Alternatives considered
+
+- **One daemon managing many targets.** Rejected: reintroduces the scheduling and
+  arbitration complexity that sank the predecessor.
@@ -0,0 +1,36 @@
+# ADR-0002: Daemon placement and remote target configuration
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+The inference box is an M1 Pro MacBook — a laptop, not always-on infrastructure.
+The rest of steveternet runs on the homelab and is deployed/managed through
+Komodo. We do not want bespoke job-controller logic living on the Mac.
+
+## Decision
+
+`foreman` runs on the homelab (e.g. orgrimmar), containerized and deployed via
+Komodo like everything else. It is **given** its Ollama target as a configurable
+base URL (default: the Mac's Tailscale address) and reaches it over the network.
+
+The Mac runs Ollama and nothing `foreman`-specific. It stays a dumb appliance.
+
+## Consequences
+
+- Ops consistency: foreman is a normal Komodo-managed container.
+- The target URL is config, never hardcoded — swapping the Mac for another
+  backend is a config edit (within the one-target-per-daemon rule of ADR-0001).
+- The Mac is a laptop and may sleep or change networks. The daemon must treat an
+  unreachable target as a transient, recoverable condition (see ADR-0007 for the
+  model poller's degraded mode and ADR-0004 for job retry semantics), never as a
+  fatal error. Operationally: `caffeinate`/`pmset` keeps the Mac awake; Tailscale
+  keeps its address stable.
+- Network is now the trust boundary; Ollama has no auth of its own (see ADR-0010).
+
+## Alternatives considered
+
+- **Co-locate foreman on the Mac.** Rejected: contradicts the stated preference to
+  keep controller logic off the laptop, and complicates Komodo-based deployment.
+  Note that "given a target URL" keeps this reversible — co-location would just be
+  pointing the URL at localhost.
@@ -0,0 +1,51 @@
+# ADR-0003: API surface — native Ollama passthrough vs OpenAI-compat
+
+**Status:** Accepted — 2026-05-23 (resolved in favor of native Ollama)
+
+## Context
+
+Two goals were in mild tension: the original phrasing asked for an
+"OpenAI-compatible API," while the stated ultimate goal is to use the M1 Pro
+**simply as a target for `go-llm`**.
+
+`go-llm`'s `v2/CLAUDE.md` Key Design Decision #8 is explicit: its Ollama provider
+deliberately uses native `/api/chat`, *not* OpenAI-compat `/v1`, for `think:false`
+support, more reliable tool calling, and ~15-20% lower latency.
+
+**Verified in code (`v2/constructors.go`).** `llm.OllamaCloud(apiKey, opts...)`
+sends the key as `Authorization: Bearer <key>` over native `/api/chat`, and its
+doc comment says to "use `WithBaseURL` to point at a private Ollama deployment
+that requires auth." So go-llm *already* has a first-class path for a private,
+authenticated, native-Ollama endpoint — exactly what foreman is on the wire.
+Choosing OpenAI-compat would push go-llm onto a path its own author rejected, for
+no benefit to the primary caller.
+
+## Decision
+
+Native Ollama is **the** surface for v1. foreman speaks native `/api/chat`,
+`/api/tags`, and `/api/ps`, optionally behind a Bearer token (ADR-0010). To
+go-llm and any Ollama client it is indistinguishable from a private Ollama
+deployment.
+
+The synchronous passthrough is transparent: calls are queued internally
+(ADR-0009) but the HTTP response blocks until the job completes. Async features
+(job IDs, `state_webhook_url`, artifacts) live on a separate `/jobs` surface
+(ADR-0004), not bolted onto the passthrough.
+
+OpenAI-compat `/v1/chat/completions` is **deferred**, added in a later milestone
+only if a non-go-llm caller needs it.
+
+## Consequences
+
+- "Set up the Mac as a go-llm target" needs zero provider changes — a thin
+  constructor only (ADR-0011).
+- Preserves `think:false`, reliable tool calls, and lower latency.
+- foreman must faithfully proxy native `/api/chat` semantics, including its
+  newline-delimited JSON streaming (ADR-0012).
+
+## Alternatives considered
+
+- **OpenAI-compat as primary/only surface.** Matches the original phrasing but
+  contradicts go-llm DD#8 and adds nothing for the primary caller. Rejected.
+- **Native-only, never add OpenAI-compat.** Fully serves the goal; the secondary
+  surface is kept as an option, not a commitment.
@@ -0,0 +1,52 @@
+# ADR-0004: Async job surface, job IDs, and queued execution
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+The transparent passthrough (ADR-0003) is synchronous: the caller holds an HTTP
+connection until the completion returns. That is fine for interactive-length work
+and for go-llm, but two needs aren't served by it:
+
+- Long-running jobs held open through Traefik risk idle-connection timeouts.
+- Orchestration callers (mort/ratchet/werk-style) want fire-and-forget: submit,
+  get an ID back immediately, and be told asynchronously when the work is done.
+
+## Decision
+
+Add a distinct async surface: `POST /jobs`.
+
+- The body carries a chat payload (native-Ollama-shaped, mirroring `/api/chat`)
+  plus optional extension fields, notably `state_webhook_url` (ADR-0005).
+- foreman enqueues the job, assigns it a **ULID** (sortable, timestamped), and
+  immediately returns `202 Accepted` with `{ "job_id": "<ulid>" }`.
+- The caller correlates later webhook callbacks to its request via `job_id`.
+- `GET /jobs/{id}` returns current state, result, and artifact references for
+  polling-style callers or for recovery after a missed webhook.
+
+Every unit of work is a row in the queue (ADR-0008) regardless of which surface
+created it; the synchronous passthrough is simply a `/jobs` submission whose
+handler blocks on the job's completion instead of returning the ID.
+
+### Job lifecycle
+
+`queued → loading → working → done`, plus terminal `failed`. A job whose target
+is unreachable re-enters `queued` with a backoff (it is retryable, never
+auto-failed on a connection error — the target is a laptop, ADR-0002). A bounded
+retry count guards against poison jobs; exceeding it moves the job to `failed`
+with the last error recorded.
+
+## Consequences
+
+- One queue, one execution engine, two entry points (sync passthrough, async
+  `/jobs`).
+- Job IDs are stable and sortable, and they let callers correlate webhooks with requests.
+- `GET /jobs/{id}` gives at-least-once webhook delivery a recovery path.
+
+## Alternatives considered
+
+- **Reuse the OpenAI response `id` field instead of a separate `/jobs` surface.**
+  Workable for sync, but doesn't give async callers an immediate handle before
+  completion. The explicit `/jobs` surface is clearer.
+- **UUIDv4 for IDs.** Rejected in favor of ULID for natural time-ordering in the
+  queue and logs.
@@ -0,0 +1,63 @@
+# ADR-0005: Webhook state-update protocol
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+Async callers (ADR-0004) need to know how their job is progressing without
+polling. The requirement: periodically push state updates
+(`queued → loading → working → done`) and deliver results/artifacts on
+completion.
+
+## Decision
+
+When a job is submitted with `state_webhook_url`, foreman POSTs a JSON event to
+that URL on every state transition.
+
+### Event payload
+
+```json
+{
+  "job_id": "01J...",
+  "state": "loading",
+  "previous_state": "queued",
+  "timestamp": "2026-05-23T12:00:00Z",
+  "model": "qwen3.6:35b",
+  "attempt": 1,
+  "error": null,
+  "result": null,
+  "artifacts": null
+}
+```
+
+- `state`: one of `queued`, `loading`, `working`, `done`, `failed`.
+- On `done`: `result` holds the completion (native-Ollama-shaped) and `artifacts`
+  holds artifact references (ADR-0006).
+- On `failed`: `error` holds a message; `result` is null.
+
+### Delivery semantics
+
+- **At-least-once.** Callers must be idempotent on `job_id` + `state`. A missed
+  webhook can always be reconciled via `GET /jobs/{id}` (ADR-0004).
+- **Retry with backoff** on non-2xx or connection failure, bounded attempts, then
+  the event is dropped (the job state itself is unaffected and remains queryable).
+- **Ordering is not guaranteed** across retries; `previous_state` + `timestamp`
+  let callers order/deduplicate.
+- **Optional HMAC signing:** if a webhook secret is configured, foreman sends an
+  `X-Foreman-Signature` header (HMAC-SHA256 of the body) so receivers can verify
+  authenticity. Off by default; recommended once foreman is reachable beyond a
+  fully trusted network.
+
+## Consequences
+
+- Callers get push observability with a polling fallback.
+- Idempotency is pushed onto the caller — documented as a hard requirement.
+- Webhook delivery is decoupled from job execution: a flaky receiver never blocks
+  or fails the job.
+
+## Alternatives considered
+
+- **Polling only.** Simpler for foreman, worse for callers; rejected since
+  webhooks were an explicit requirement. (Polling is still available as fallback.)
+- **WebSocket/SSE for state.** Heavier; a long-lived stream is used only for token
+  streaming on the sync surface (ADR-0012), not job-state fan-out.
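The signing and idempotency rules in ADR-0005 might look like this on both ends of a webhook. It is a sketch under assumptions: the ADR specifies HMAC-SHA256 of the body but not the header encoding, so hex is a guess, and `Sign`, `Verify`, and `DedupKey` are illustrative names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Event mirrors the payload fields listed in ADR-0005.
type Event struct {
	JobID         string          `json:"job_id"`
	State         string          `json:"state"`
	PreviousState string          `json:"previous_state"`
	Timestamp     string          `json:"timestamp"`
	Model         string          `json:"model"`
	Attempt       int             `json:"attempt"`
	Error         *string         `json:"error"`
	Result        json.RawMessage `json:"result"`
	Artifacts     json.RawMessage `json:"artifacts"`
}

// Sign returns the hex-encoded HMAC-SHA256 of body (hex is an assumption).
func Sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify is what a receiver could do with the X-Foreman-Signature value.
func Verify(secret, body []byte, signature string) bool {
	want, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(mac.Sum(nil), want) // constant-time comparison
}

// DedupKey is the job_id + state pair callers are asked to be idempotent on.
func DedupKey(e Event) string { return e.JobID + "/" + e.State }

func main() {
	secret := []byte("example-secret")
	ev := Event{JobID: "01J...", State: "loading", PreviousState: "queued",
		Timestamp: "2026-05-23T12:00:00Z", Model: "qwen3.6:35b", Attempt: 1}
	body, err := json.Marshal(ev)
	if err != nil {
		panic(err)
	}
	sig := Sign(secret, body)
	fmt.Println("verified:", Verify(secret, body, sig), "dedup key:", DedupKey(ev))
}
```

A receiver would verify the signature against the raw request body before decoding it, then drop any event whose dedup key it has already processed.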
@@ -0,0 +1,53 @@
+# ADR-0006: Artifact handling and transport
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+Jobs must "transmit artifacts when done." For a chat completion the obvious
+artifact is the assistant's text/tool-call output, but the term is deliberately
+broader: a job may produce structured data, multiple named outputs, or content
+too large to embed comfortably in a webhook body.
+
+## Decision
+
+An **artifact** is a named, typed blob attached to a completed job:
+
+```json
+{ "name": "completion", "content_type": "application/json", "size": 1234,
+  "inline": { ... }, "url": null }
+```
+
+- The primary completion is always emitted as an artifact named `completion`
+  (the native-Ollama response shape), so there is one consistent access pattern.
+- Additional artifacts use distinct names.
+
+### Transport: inline vs fetch
+
+- **Small artifacts** (under a configurable threshold, default ~256 KB) are
+  delivered **inline** in the `done` webhook (`inline` populated, `url` null) and
+  in `GET /jobs/{id}`.
+- **Large artifacts** exceed the threshold: the webhook/`GET` carries metadata
+  plus a `url` (`GET /jobs/{id}/artifacts/{name}`), and the bytes are fetched
+  on demand. This keeps webhook payloads bounded and avoids shipping megabytes
+  through a callback POST.
+
+### Retention
+
+Artifacts are stored alongside the job in SQLite (ADR-0008) and pruned with the
+job after a configurable TTL. No separate blob store in v1; revisit only if
+artifact sizes outgrow SQLite comfort (single-digit MB).
+
+## Consequences
+
+- One uniform way to read output (`completion` artifact), extensible to richer
+  jobs later without protocol changes.
+- Webhook bodies stay small; large outputs don't bloat or break delivery.
+- A pull endpoint for artifacts means a missed/oversized webhook never loses data.
+
+## Alternatives considered
+
+- **Always inline.** Simple but risks huge webhook bodies and SQLite row bloat in
+  the hot path. Rejected.
+- **External object store (S3/MinIO) from day one.** Over-engineered for the
+  expected sizes; deferred behind the TTL/threshold knobs.
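The inline-versus-fetch rule in ADR-0006 reduces to a size check when the artifact reference is built. The sketch below assumes JSON artifacts (so the bytes can be embedded as-is) and uses a hypothetical `Describe` helper; how non-JSON content would be inlined is not covered by the ADR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inlineThreshold is the ADR's default of roughly 256 KB.
const inlineThreshold = 256 * 1024

// Artifact mirrors the reference shape shown in ADR-0006.
type Artifact struct {
	Name        string          `json:"name"`
	ContentType string          `json:"content_type"`
	Size        int             `json:"size"`
	Inline      json.RawMessage `json:"inline"`
	URL         *string         `json:"url"`
}

// Describe builds the reference sent in the done webhook and GET /jobs/{id}.
// Small artifacts carry their bytes inline; large ones carry only a fetch URL.
func Describe(jobID, name, contentType string, data []byte) Artifact {
	a := Artifact{Name: name, ContentType: contentType, Size: len(data)}
	if len(data) < inlineThreshold {
		a.Inline = json.RawMessage(data) // assumes data is valid JSON
		return a
	}
	u := fmt.Sprintf("/jobs/%s/artifacts/%s", jobID, name)
	a.URL = &u
	return a
}

func main() {
	ref := Describe("01J...", "completion", "application/json", []byte(`{"done":true}`))
	out, err := json.Marshal(ref)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because `inline` and `url` are mutually exclusive, a caller can branch on whichever is non-null without knowing the threshold.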
@@ -0,0 +1,48 @@
+# ADR-0007: Model inventory polling and discovery
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+foreman needs a "relatively in-sync" view of which models are installed on its
+target so it can (a) advertise them to callers, (b) reject jobs for missing
+models early instead of failing mid-execution, and (c) know what is currently
+resident to inform scheduling (ADR-0009).
+
+## Decision
+
+A background poller queries the target on a configurable interval (default ~30s):
+
+- `GET /api/tags` → the installed-model inventory. Cached in memory; this cache
+  backs foreman's own `/api/tags` passthrough (ADR-0003) and `/v1/models` if the
+  OpenAI-compat surface is enabled.
+- `GET /api/ps` → which model(s) are currently loaded, their memory footprint
+  (total and in VRAM), and the unload timer. Used by the scheduler to decide
+  whether the next job requires a swap.
+
+### Behavior
+
+- **Early validation:** a job naming a model absent from the cached inventory is
+  rejected at submit time with a clear error. With a ~30s poll interval the cache
+  is fresh enough for this to be reliable, and a small grace path covers a model
+  installed between polls: on a miss, foreman re-checks the target once before
+  rejecting.
+- **Degraded mode:** if the target is unreachable, the last-known inventory is
+  retained and foreman marks itself degraded (surfaced on a health endpoint).
+  Jobs are not rejected wholesale on a single failed poll — the target is a
+  laptop that may briefly sleep (ADR-0002). Execution-time unreachability is
+  handled by job retry (ADR-0004).
+
+## Consequences
+
+- Callers can discover available models through the normal Ollama/OpenAI
+  endpoints; no foreman-specific discovery API needed.
+- Bad-model jobs fail fast and cheaply.
+- A health/status endpoint exposing degraded state and last-poll time is required.
+
+## Alternatives considered
+
+- **No caching; proxy `/api/tags` live per request.** Simpler but couples every
+  discovery call to target availability and adds latency. Rejected; the poller
+  also feeds the scheduler, so the cache is needed regardless.
+- **Push/event-based inventory.** Ollama offers no such mechanism; polling is the
+  only option.
@@ -0,0 +1,42 @@
+# ADR-0008: Durable SQLite-backed queue
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+Jobs are queued, carry state, and may be retried across target sleep/restart. A
+caller that submitted an async job and is waiting on a webhook must not lose its
+job because foreman restarted. State must survive process restarts.
+
+## Decision
+
+The job queue and all job state (including artifacts, ADR-0006) live in **SQLite**
+in WAL mode, via the pure-Go `modernc.org/sqlite` driver (no CGO, so the Komodo
+container build stays trivial).
+
+### Schema sketch
+
+- `jobs(id TEXT PK, state TEXT, model TEXT, request BLOB, result BLOB,
+  error TEXT, webhook_url TEXT, attempt INT, created_at, updated_at, …)`
+- `artifacts(job_id TEXT, name TEXT, content_type TEXT, size INT, inline BLOB,
+  PRIMARY KEY(job_id, name))`
+
+The worker (ADR-0009) is the single writer of execution state; the HTTP handlers
+insert new jobs and otherwise read. WAL handles this concurrent-reader,
+near-single-writer pattern well at this scale.
+
+## Consequences
+
+- Jobs and results are durable across restarts; webhook recovery via
+  `GET /jobs/{id}` (ADR-0004) is meaningful.
+- Pure-Go driver keeps cross-compilation and container builds painless.
+- Pruning is a TTL sweep over `jobs`/`artifacts`; no external store to operate.
+- SQLite caps practical artifact size at single-digit MB — acceptable per ADR-0006
+  thresholds; revisit if outputs grow.
+
+## Alternatives considered
+
+- **In-memory queue.** Loses async jobs on restart; unacceptable given webhooks.
+- **Redis / external broker.** Another moving part to run for a single-worker
+  daemon; over-engineered. Rejected.
+- **`mattn/go-sqlite3` (CGO).** Faster in some cases but complicates static builds
+  and container images. Pure-Go preferred for ops simplicity.
@@ -0,0 +1,44 @@
+# ADR-0009: Single-worker serialization and drain-by-model scheduling
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+The target is memory-bandwidth-bound (the M1 Pro has ~200 GB/s). It runs one model fast
+at a time; loading a different model is a 5-10s cold start. Running two models
+concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism
+against a single target buys nothing and would reintroduce coordination logic.
+
+## Decision
+
+**Concurrency against the target is 1.** A single worker loop pulls the next job
+from the queue, ensures the right model is resident, executes, and records the
+result.
+
+**Drain-by-model scheduling:** before incurring a model swap, the worker finishes
+every queued job that targets the **currently-resident** model (observed via
+`/api/ps`, ADR-0007). Only when no job for the hot model remains does it select a
+job for a different model and pay the swap cost.
+
+This is an `ORDER BY (model != current_model), created_at` style selection — a
+heuristic, not a scheduler. There is intentionally **no** priority system,
+fairness weighting, or capacity budgeting (those sank the predecessor; see
+ADR-0001).
+
+Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
+between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
+to single-resident swap.
+
+## Consequences
+
+- Swap thrash is minimized without any complex scheduling.
+- A long run of same-model jobs can delay a different-model job — acceptable for a
+  background box, and bounded by queue depth. If starvation ever becomes a real
+  problem, that is a signal to reconsider, not to pre-build fairness.
+- Throughput is dominated by how well callers batch work by model.
+
+## Alternatives considered
+
+- **FIFO with naive swapping.** Correct but pays a cold start on every model
+  change; wasteful when jobs interleave models. Rejected.
+- **Priority/fair scheduling.** Explicitly rejected as scope creep (ADR-0001).
@@ -0,0 +1,51 @@
+# ADR-0010: Authentication and security boundary
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+Ollama itself has no authentication — anyone who can reach `:11434` can drive it.
+foreman sits in front of it and is the network-facing component. We need a real
+boundary without dragging in an auth framework (the kind of scope creep ADR-0001
+guards against).
+
+## Decision
+
+**Primary boundary is the network.** foreman and its Ollama target sit on a
+trusted segment: the target's `:11434` is firewalled to foreman only, and/or
+both are bound to the Tailscale interface. foreman is **not** exposed through a
+public Traefik entrypoint.
+
+**Optional static bearer token.** If a token is configured, foreman validates the
+`Authorization: Bearer <token>` header on incoming requests. This reuses headers
+that clients already send:
+
+- `go-llm` via `llm.Ollama()` sends no auth (fine on a trusted segment); via
+  `ollama.New(key, baseURL)` it sends `Authorization: Bearer <key>` — so a
+  configured foreman token slots straight into the existing provider with no new
+  code.
+- The OpenAI-compat surface (if enabled, ADR-0003) carries the same header.
+
+foreman → target auth: an optional bearer the daemon attaches to its own calls to
+Ollama, for the Ollama-Cloud-style case; empty for a local/LAN target.
+
+## Out of scope for v1
+
+- Authentik / SSO. It is painful for service-to-service traffic and adds nothing
+  over network isolation here.
+- Per-caller identities, scopes, rate limiting. Not needed for a single-tenant
+  homelab daemon.
+
+## Consequences
+
+- Minimal but real security: network isolation always, plus an optional shared
+  secret that integrates with existing clients for free.
+- Webhook authenticity is handled separately by optional HMAC signing (ADR-0005).
+- No financial/identity/credential data ever transits foreman; it brokers chat
+  jobs only.
+
+## Alternatives considered
+
+- **No auth, network-only.** Acceptable on a fully trusted tailnet; the optional
+  token exists for when foreman's reachability widens.
+- **Full auth framework / SSO.** Rejected as scope creep.
@@ -0,0 +1,73 @@
+# ADR-0011: Go integration — the `Foreman` interface
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+The ultimate goal: use the M1 Pro **simply as a target for `go-llm`**.
+
+**Verified (`v2/constructors.go`, `v2/ollama/ollama.go`):** `llm.OllamaCloud(key,
+WithBaseURL(...))` already targets "a private Ollama deployment that requires
+auth" — native `/api/chat` + `Authorization: Bearer <key>` against any base URL.
+foreman is exactly that on the wire (ADR-0003). So integration needs **no new
+provider** — only a clean, intent-revealing seam so call sites say "foreman," not
+"Ollama."
+
+`go-llm`'s provider contract (`v2/provider`) is two methods, `Complete` and
+`Stream`; a future dedicated provider would implement them.
+
+## Decision
+
+Add a `llm.Foreman(baseURL, apiKey, opts...)` constructor to go-llm that delegates
+to the ollama native provider — the ollama translation happens behind the scenes:
+
+```go
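+// Foreman returns a Client for a foreman daemon. foreman speaks native Ollama
+// on the wire, so this delegates to the existing ollama provider.
+func Foreman(baseURL, apiKey string, opts ...ClientOption) *Client {
+    cfg := &clientConfig{}
+    for _, opt := range opts {
+        opt(cfg)
+    }
+    // A base URL set through opts takes precedence over the positional argument.
+    if cfg.baseURL != "" {
+        baseURL = cfg.baseURL
+    }
+    return NewClient(ollamaProvider.New(apiKey, baseURL))
+}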
+
+// model := llm.Foreman("http://foreman.orgrimmar:PORT", token).Model("qwen3.6:35b")
+```
+
+`baseURL` is required (foreman has no default public address). This is a
+deliberate **seam**: v1 is a pass-through to the `ollama` provider; a dedicated
+foreman provider can later replace the delegate to surface job IDs / async state
+without changing call sites.
+
+### Three escalating levels
+
+- **Level 0 — `llm.Foreman(...)` (now, the headline goal).** Transparent,
+  synchronous, full native tool-calling / `think:false` / streaming. Queueing and
+  model-swap management happen invisibly inside the daemon. Zero provider code.
+- **Level 1 — `foreman` client package (when an orchestration caller needs it).**
+  A synchronous facade over the async `/jobs` surface: given messages, it manages
+  an ephemeral webhook receiver, blocks until `done`, and returns result +
+  artifacts (falling back to `GET /jobs/{id}` polling if it can't receive
+  callbacks). For callers wanting async semantics — surfaced job IDs, no
+  long-held connection — with a synchronous call signature.
+- **Level 2 — dedicated `provider.Provider` (only if needed).** Wraps Level 1 so
+  foreman is a first-class go-llm backend exposing job IDs / state / artifacts the
+  plain ollama provider can't. Built only if Level 0 proves insufficient.
+
+## Consequences
+
+- Headline goal met with one constructor and no provider code.
+- Call sites are foreman-named and future-proofed by the seam.
+- Async ergonomics are available later without forcing webhook plumbing on
+  callers, and without touching Level-0 users.
+
+## Alternatives considered
+
+- **Just tell users to call `OllamaCloud` with a base URL.** Works identically
+  today, but leaks the implementation ("it's Ollama") and offers no seam for
+  future foreman-specific behavior. The named constructor is the requested
+  "foreman interface."
+- **Ship a dedicated provider from day one (Level 2 first).** More code; bypasses
+  the zero-friction win. Deferred.
@@ -0,0 +1,41 @@
+# ADR-0012: Streaming support
+
+**Status:** Accepted — 2026-05-23
+
+## Context
+
+`go-llm`'s provider interface has a `Stream()` method, and Ollama's native
+`/api/chat` streams token-by-token by default. The synchronous passthrough
+(ADR-0003) must not break streaming clients. Separately, the async `/jobs`
+surface (ADR-0004) reports progress via discrete state webhooks, which is a
+different granularity than token streaming.
+
+## Decision
+
+- **Sync passthrough: support streaming.** When a `/api/chat` request streams
+  (Ollama's default; `stream: false` turns it off), foreman relays the target's
+  token deltas back to the caller as chunked newline-delimited JSON, matching
+  Ollama's native streaming. A streamed job still moves
+  through the queue; streaming begins once the job reaches `working`, so a job
+  waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when
+  its turn comes. go-llm's `Stream()` works against foreman unchanged.
+- **Async `/jobs` surface: no token streaming in v1.** Webhooks carry coarse state
+  transitions (ADR-0005) and the final result/artifacts, not per-token deltas.
+  Token-level streaming over a fire-and-forget webhook job is deliberately
+  deferred — it adds a transport (persistent connection or chunked webhook) whose
+  complexity isn't justified yet.
+
+## Consequences
+
+- Interactive go-llm usage gets real streaming through the transparent surface.
+- Orchestration callers get state + final artifacts, which is what they need;
+  they can use the sync streaming surface directly if they want tokens.
+- The job state machine and webhook protocol stay simple (no streaming transport
+  to design or operate).
+
+## Alternatives considered
+
+- **Stream tokens over the async surface too.** Deferred: requires either a
+  long-lived connection (defeats the point of async) or chunked-delta webhooks
+  (complex, rarely needed). Revisit only on a concrete need.
+- **No streaming at all.** Would break go-llm's `Stream()` and interactive use on
+  the very path that is the primary goal. Rejected.
@@ -0,0 +1,41 @@
+# foreman — Architecture Decision Records
+
+`foreman` is a small daemon that fronts **one** Ollama target. It turns a single
+Ollama instance into a queued, observable job endpoint: it polls the target's
+installed models, serializes jobs through the target (managing model swaps),
+assigns every job an ID, and reports progress + artifacts via webhooks. A thin
+`llm.Foreman(...)` constructor makes the target trivial to use from `go-llm`.
+
+It is the deliberately pared-down successor to `peon-overseer`. One daemon, one
+worker, one queue. No distributed dispatch, no leases, no fair queueing.
+
+## Index
+
+| ADR | Title | Status |
+|-----|-------|--------|
+| 0001 | One daemon per Ollama target | Accepted |
+| 0002 | Daemon placement and remote target configuration | Accepted |
+| 0003 | API surface: native Ollama passthrough vs OpenAI-compat | Accepted |
+| 0004 | Async job surface, job IDs, and queued execution | Accepted |
+| 0005 | Webhook state-update protocol | Accepted |
+| 0006 | Artifact handling and transport | Accepted |
+| 0007 | Model inventory polling and discovery | Accepted |
+| 0008 | Durable SQLite-backed queue | Accepted |
+| 0009 | Single-worker serialization and drain-by-model scheduling | Accepted |
+| 0010 | Authentication and security boundary | Accepted |
+| 0011 | Go integration: the `Foreman` interface | Accepted |
+| 0012 | Streaming support | Accepted |
+
+ADR-0003 was resolved in favor of **native Ollama** as the v1 surface: foreman is,
+on the wire, a private authenticated Ollama deployment, so `go-llm` integrates via
+a thin `llm.Foreman(baseURL, token)` constructor that delegates to the existing
+ollama provider (ADR-0011). OpenAI-compat `/v1` is deferred.
+
+These ADRs refine the API/integration sections of the project `CLAUDE.md`. The
+queue, single-worker, drain-by-model, and security guardrails carry forward
+unchanged.
+
+## Format
+
+Each ADR: Status, Context, Decision, Consequences, and Alternatives where useful.
+One decision per file. Append new ADRs; supersede rather than rewrite.