initial commit

2026-05-23 16:41:20 -04:00
commit 8fde024281
15 changed files with 803 additions and 0 deletions
@@ -0,0 +1,27 @@
 # Compiled binary (cmd/foreman)
 /foreman
 /dist/
 *.exe
 # Test & coverage output
 *.out
 *.test
 coverage.*
 # SQLite queue + artifacts (local dev data — never commit)
 *.db
 *.db-wal
 *.db-shm
 *.sqlite
 *.sqlite3
 # Local config / secrets (commit .env.example, not .env)
 .env
 .env.local
 *.local
 # Editor / OS cruft
 .DS_Store
 .idea/
 .vscode/
 *.swp
@@ -0,0 +1,144 @@
 # foreman
 A small, always-on daemon that fronts **one** Ollama target. It turns a single
 Ollama instance into a queued, observable job endpoint: it polls the target's
 installed models, serializes work through the target (managing model swaps),
 assigns every job an ID, and reports progress + artifacts via webhooks. On the
 wire it speaks **native Ollama**, so it doubles as a drop-in `go-llm` target.
 foreman is the deliberately pared-down successor to `peon-overseer`. One daemon,
 one target, one queue. The complexity that sank the predecessor — distributed
 dispatch, claim leases, weighted fair queueing, capacity budgets, eligibility
 gates — existed to coordinate *multiple* workers and is **out of scope**.
 Resisting that creep is a first-class design goal. See `docs/adr/` for the
 decisions; this file summarizes them.
 ## Topology (ADR-0001, ADR-0002)
 ```
 orgrimmar:  foreman  (Go binary + SQLite queue + HTTP API + worker loop)
              |  HTTP over the trusted VLAN / Tailscale
              v
 M1 Pro Mac:  Ollama only  (models on disk, no foreman logic)
 ```
 - One foreman process per Ollama target, configured by a single base URL
  (default: the Mac's Tailscale address). A second worker = a second foreman.
 - foreman runs on the homelab, containerized, deployed via Komodo. The Mac stays
  a dumb appliance.
 - The target is a laptop and may sleep. Unreachability is transient/recoverable,
  never fatal (poller degraded mode + job retry below).
 ## API surfaces (ADR-0003, ADR-0004)
 1. **Primary — transparent native Ollama passthrough:** `/api/chat`, `/api/tags`,
   `/api/ps`. foreman looks exactly like an Ollama server. Synchronous: calls are
   queued internally but the HTTP response blocks until completion. Streaming is
   supported in Ollama's native newline-delimited JSON format (ADR-0012). This is
   the `go-llm` target path.
 2. **Async jobs — `POST /jobs`, `GET /jobs/{id}`:** body is a native-chat payload
   plus optional `state_webhook_url`. Returns `202` + `{ "job_id": "<ulid>" }`
   immediately. For fire-and-forget orchestration callers.
 3. **Optional OpenAI-compat `/v1/chat/completions` + `/v1/models`:** deferred;
   added only if a non-go-llm caller needs it.
 Job lifecycle: `queued → loading → working → done` (+ terminal `failed`). A
 connection failure to the target re-queues the job with backoff (bounded retries
 guard poison jobs). IDs are ULIDs (sortable, timestamped).
 ## Webhooks & artifacts (ADR-0005, ADR-0006)
 - On each state transition, POST a JSON event to `state_webhook_url`
  (`job_id`, `state`, `previous_state`, `timestamp`, `model`, `attempt`, and on
  completion `result` / `artifacts` / `error`).
 - At-least-once delivery; callers must be idempotent on `job_id`+`state`; missed
  events reconcile via `GET /jobs/{id}`. Retry with bounded backoff. Optional
  `X-Foreman-Signature` HMAC when a webhook secret is configured.
 - Artifacts are named typed blobs; the completion is always artifact `completion`.
  Inline under ~256KB, otherwise fetched via `GET /jobs/{id}/artifacts/{name}`.
 ## Model inventory (ADR-0007)
 - A poller hits the target's `/api/tags` (default ~30s) to keep an in-sync model
  list; backs foreman's `/api/tags` passthrough and job validation.
 - `/api/ps` tells foreman what's resident, feeding the scheduler.
 - Jobs naming an uninstalled model are rejected at submit time (one re-check on
  miss). Target unreachable → retain last-known list, mark degraded on a health
  endpoint; do not reject wholesale on a single failed poll.
 ## Execution (ADR-0009)
 - **Concurrency against the target is 1.** A single worker loop pulls a job,
  ensures the right model is resident, executes, records the result.
 - **Drain-by-model:** finish every queued job for the currently-resident model
  before paying a swap (`ORDER BY (model != current), created_at`). A heuristic,
  not a scheduler. No priorities, fairness, or budgets.
 - Pin residency with Ollama `keep_alive`; target runs `OLLAMA_MAX_LOADED_MODELS=1`
  and `OLLAMA_CONTEXT_LENGTH=8192`+.
 ## Persistence (ADR-0008)
 - SQLite, WAL mode, pure-Go `modernc.org/sqlite` (no CGO → trivial Komodo builds).
 - `jobs` + `artifacts` tables; single writer (the worker) + HTTP readers. TTL
  sweep for pruning. No external broker.
 ## Models served
 foreman serves **any installed model** named in a request; it does not own a
 role→model mapping (the caller picks the model, e.g. go-llm `.Model(...)`).
 Recommended roster to pull on the Mac (32GB, ~26-28GB usable, single-resident
 swap):
 - **parse / data** — `qwen3:14b` (~9GB, structured/JSON output).
 - **agent + code** — `qwen3.6:35b` (MoE, ~3B active, ~20GB, fast tool-calling).
 - Split a dedicated dense coder (`qwen3.6:27b`) off later only if `35b`'s code
  quality disappoints; it's bandwidth-bound and slow on this Mac.
 - Verify exact tags against the Ollama library before pulling; the registry moves.
 ## go-llm integration (ADR-0011)
 Verified: `llm.OllamaCloud(key, WithBaseURL(...))` already targets a private
 authenticated native-Ollama endpoint — which foreman is. Integration is a thin
 constructor, no new provider:
 - **Level 0 (now):** `llm.Foreman(baseURL, token).Model("qwen3.6:35b")` — delegates
  to the ollama provider; transparent, synchronous, full tool/think/stream.
 - **Level 1 (later):** a `foreman` client package — synchronous facade over the
  async `/jobs` surface (manages a webhook receiver, blocks to done).
 - **Level 2 (if needed):** a dedicated `provider.Provider` surfacing job IDs/state.
 ## Security (ADR-0010)
 - Network is the boundary: target `:11434` firewalled to foreman, and/or both on
  Tailscale. foreman is **not** on a public Traefik entrypoint.
 - Optional static bearer: validate `Authorization: Bearer <token>`, which reuses
  the header `go-llm` already sends via the Foreman/OllamaCloud path.
 - No Authentik/SSO, no per-caller identities for v1. No financial/identity data
  ever transits foreman.
 ## Stack & conventions
 - Go, stdlib `net/http`, minimal deps. SQLite via `modernc.org/sqlite`.
 - No UI. HTTP API + small CLI only.
 - Match go-llm house style: standard Go tabs; `camelCase`/`PascalCase`; check
  errors immediately and wrap with `fmt.Errorf("%w: ...", err)`; imports stdlib →
  third-party → internal. The worker loop never panics; it logs, marks the job,
  continues.
 - ADRs in `docs/adr/` (one decision each, append/supersede). Living `progress.md`
  at repo root. Repo: `gitea.stevedudenhoeffer.com`.
 ## Out of scope (anti-creep guardrails — ADR-0001)
 Distributed dispatch, multiple workers, claim leases, weighted fair queueing,
 capacity budgets, eligibility gates, an auth framework / SSO, a GUI, and managing
 more than one target per daemon. Keep the ollama client behind a small interface
 so a future second backend is additive — but do not build for it now.
 ## Milestones
 - **M0** — native `/api/chat` passthrough + SQLite queue + single-worker loop, one
  model end to end, synchronous.
 - **M1** — model poller + `/api/tags`/`/api/ps`, drain-by-model, async `/jobs` +
  `state_webhook_url` + artifacts + retry-on-unreachable, the CLI, and the
  `llm.Foreman()` constructor in go-llm.
 - **M2 (later)** — optional OpenAI-compat `/v1`, Level-1 client / dedicated
  provider, metrics.
@@ -0,0 +1,37 @@
 # ADR-0001: One daemon per Ollama target
 **Status:** Accepted — 2026-05-23
 ## Context
 `peon-overseer` ballooned because it coordinated *multiple* workers from a
 central service: pull-based dispatch, claim leases, weighted fair queueing,
 capacity budgets, eligibility gates. All of that complexity existed solely to
 arbitrate shared workers. We want none of it back.
 The system being built fronts inference hardware (initially the M1 Pro running
 Ollama) and exposes it as a managed job endpoint.
 ## Decision
 Each `foreman` process is bound to **exactly one** Ollama target, configured by a
 single base URL. One target = one daemon = one queue. There is no cross-daemon
 awareness and no shared state between daemons.
 If a second worker is added later (the 4090 box, the M5 Max), it gets its own
 `foreman` instance. Any fan-out across workers is the concern of a *separate*
 higher-level router that talks to multiple foreman instances — explicitly out of
 scope here and not to be anticipated in this codebase.
 ## Consequences
 - The daemon is radically simple: one target, one serialized work stream.
 - Horizontal scale is "run another daemon," an operational act, not a code change.
 - No lease/fairness/budget machinery is permitted in this repo. If a change
  starts to require it, that is the signal that the multi-worker router (a
  different project) is what's actually needed.
 ## Alternatives considered
 - **One daemon managing many targets.** Rejected: reintroduces the scheduling and
  arbitration complexity that sank the predecessor.
@@ -0,0 +1,36 @@
 # ADR-0002: Daemon placement and remote target configuration
 **Status:** Accepted — 2026-05-23
 ## Context
 The inference box is an M1 Pro MacBook — a laptop, not always-on infrastructure.
 The rest of steveternet runs on the homelab and is deployed/managed through
 Komodo. We do not want bespoke job-controller logic living on the Mac.
 ## Decision
 `foreman` runs on the homelab (e.g. orgrimmar), containerized and deployed via
 Komodo like everything else. It is **given** its Ollama target as a configurable
 base URL (default: the Mac's Tailscale address) and reaches it over the network.
 The Mac runs Ollama and nothing `foreman`-specific. It stays a dumb appliance.
 ## Consequences
 - Ops consistency: foreman is a normal Komodo-managed container.
 - The target URL is config, never hardcoded — swapping the Mac for another
  backend is a config edit (within the one-target-per-daemon rule of ADR-0001).
 - The Mac is a laptop and may sleep or change networks. The daemon must treat an
  unreachable target as a transient, recoverable condition (see ADR-0007 for the
  model poller's degraded mode and ADR-0004 for job retry semantics), never as a
  fatal error. Operationally: `caffeinate`/`pmset` keeps the Mac awake; Tailscale
  keeps its address stable.
 - Network is now the trust boundary; Ollama has no auth of its own (see ADR-0010).
 ## Alternatives considered
 - **Co-locate foreman on the Mac.** Rejected: contradicts the stated preference to
  keep controller logic off the laptop, and complicates Komodo-based deployment.
  Note that "given a target URL" keeps this reversible — co-location would just be
  pointing the URL at localhost.
@@ -0,0 +1,51 @@
 # ADR-0003: API surface — native Ollama passthrough vs OpenAI-compat
 **Status:** Accepted — 2026-05-23 (resolved in favor of native Ollama)
 ## Context
 Two goals were in mild tension: the original phrasing asked for an
 "OpenAI-compatible API," while the stated ultimate goal is to use the M1 Pro
 **simply as a target for `go-llm`**.
 `go-llm`'s `v2/CLAUDE.md` Key Design Decision #8 is explicit: its Ollama provider
 deliberately uses native `/api/chat`, *not* OpenAI-compat `/v1`, for `think:false`
 support, more reliable tool calling, and ~15-20% lower latency.
 **Verified in code (`v2/constructors.go`).** `llm.OllamaCloud(apiKey, opts...)`
 sends the key as `Authorization: Bearer <key>` over native `/api/chat`, and its
 doc comment says to "use `WithBaseURL` to point at a private Ollama deployment
 that requires auth." So go-llm *already* has a first-class path for a private,
 authenticated, native-Ollama endpoint — exactly what foreman is on the wire.
 Choosing OpenAI-compat would push go-llm onto a path its own author rejected, for
 no benefit to the primary caller.
 ## Decision
 Native Ollama is **the** surface for v1. foreman speaks native `/api/chat`,
 `/api/tags`, and `/api/ps`, optionally behind a Bearer token (ADR-0010). To
 go-llm and any Ollama client it is indistinguishable from a private Ollama
 deployment.
 The synchronous passthrough is transparent: calls are queued internally
 (ADR-0009) but the HTTP response blocks until the job completes. Async features
 (job IDs, `state_webhook_url`, artifacts) live on a separate `/jobs` surface
 (ADR-0004), not bolted onto the passthrough.
 OpenAI-compat `/v1/chat/completions` is **deferred**, added in a later milestone
 only if a non-go-llm caller needs it.
 ## Consequences
 - "Set up the Mac as a go-llm target" needs zero provider changes — a thin
  constructor only (ADR-0011).
 - Preserves `think:false`, reliable tool calls, and lower latency.
 - foreman must faithfully proxy native `/api/chat` semantics, including
  streaming (ADR-0012).
 ## Alternatives considered
 - **OpenAI-compat as primary/only surface.** Matches the original phrasing but
  contradicts go-llm DD#8 and adds nothing for the primary caller. Rejected.
 - **Native-only, never add OpenAI-compat.** Fully serves the goal; the secondary
  surface is kept as an option, not a commitment.
@@ -0,0 +1,52 @@
 # ADR-0004: Async job surface, job IDs, and queued execution
 **Status:** Accepted — 2026-05-23
 ## Context
 The transparent passthrough (ADR-0003) is synchronous: the caller holds an HTTP
 connection until the completion returns. That is fine for interactive-length work
 and for go-llm, but two needs aren't served by it:
 - Long-running jobs held open through Traefik risk idle-connection timeouts.
 - Orchestration callers (mort/ratchet/werk-style) want fire-and-forget: submit,
  get an ID back immediately, and be told asynchronously when the work is done.
 ## Decision
 Add a distinct async surface: `POST /jobs`.
 - The body carries a chat payload (native-Ollama-shaped, mirroring `/api/chat`)
  plus optional extension fields, notably `state_webhook_url` (ADR-0005).
 - foreman enqueues the job, assigns it a **ULID** (sortable, timestamped), and
  immediately returns `202 Accepted` with `{ "job_id": "<ulid>" }`.
 - The caller correlates later webhook callbacks to its request via `job_id`.
 - `GET /jobs/{id}` returns current state, result, and artifact references for
  polling-style callers or for recovery after a missed webhook.
 Every unit of work is a row in the queue (ADR-0008) regardless of which surface
 created it; the synchronous passthrough is simply a `/jobs` submission whose
 handler blocks on the job's completion instead of returning the ID.
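The submit path described above can be sketched as a small handler. This is a minimal illustration, not the daemon's actual code: `jobsHandler`, `newID`, `enqueue`, `submit`, and the 1 MiB body cap are hypothetical names and values, and the placeholder ID is not a real ULID.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// jobRequest is the assumed POST /jobs body: a native-chat payload plus the
// optional state_webhook_url extension field. Other chat fields are omitted.
type jobRequest struct {
	Model           string `json:"model"`
	StateWebhookURL string `json:"state_webhook_url,omitempty"`
}

// jobsHandler enqueues the raw payload and answers 202 with the job ID.
// newID and enqueue are hypothetical seams; a real daemon would plug in a
// ULID generator and the SQLite queue.
func jobsHandler(newID func() string, enqueue func(id string, payload []byte) error) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		// The 1 MiB cap is an assumed limit, not one stated in the ADR.
		payload, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 1<<20))
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		var req jobRequest
		if err := json.Unmarshal(payload, &req); err != nil || req.Model == "" {
			http.Error(w, "body must be JSON with a model", http.StatusBadRequest)
			return
		}
		id := newID()
		if err := enqueue(id, payload); err != nil {
			http.Error(w, "enqueue failed", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusAccepted)
		json.NewEncoder(w).Encode(map[string]string{"job_id": id})
	})
}

// submit drives the handler in-process and returns the status and body.
func submit(h http.Handler, body string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/jobs", strings.NewReader(body)))
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	h := jobsHandler(
		func() string { return "01JEXAMPLEID" }, // placeholder, not a real ULID
		func(string, []byte) error { return nil },
	)
	code, body := submit(h, `{"model":"qwen3:14b","messages":[]}`)
	fmt.Println(code, body)
}
```

The synchronous passthrough would reuse the same enqueue step and then wait on the job instead of writing the `202`.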
 ### Job lifecycle
 `queued → loading → working → done`, plus terminal `failed`. A job whose target
 is unreachable re-enters `queued` with a backoff (it is retryable, never
 auto-failed on a connection error — the target is a laptop, ADR-0002). A bounded
 retry count guards against poison jobs; exceeding it moves the job to `failed`
 with the last error recorded.
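The retry rule above reduces to one decision per attempt. The sketch below assumes a doubling backoff and a limit of five attempts; the ADR fixes neither number, and `afterAttempt`, `backoff`, and `errUnreachable` are illustrative names. Treating any non-connection error as immediately terminal is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type State string

const (
	Queued  State = "queued"
	Loading State = "loading"
	Working State = "working"
	Done    State = "done"
	Failed  State = "failed"
)

// errUnreachable stands in for a connection failure to the target.
var errUnreachable = errors.New("target unreachable")

const (
	maxAttempts = 5               // assumed bound on retries
	baseBackoff = 2 * time.Second // assumed first delay
	maxBackoff  = 2 * time.Minute // assumed ceiling
)

// backoff doubles the delay per attempt and caps it at maxBackoff.
func backoff(attempt int) time.Duration {
	d := baseBackoff
	for i := 1; i < attempt && d < maxBackoff; i++ {
		d *= 2
	}
	if d > maxBackoff {
		d = maxBackoff
	}
	return d
}

// afterAttempt maps the outcome of one execution attempt to the next state
// and the delay before the job becomes eligible to run again.
func afterAttempt(attempt int, err error) (State, time.Duration) {
	switch {
	case err == nil:
		return Done, 0
	case errors.Is(err, errUnreachable) && attempt < maxAttempts:
		return Queued, backoff(attempt) // retryable: back into the queue
	default:
		return Failed, 0 // retries exhausted, or a non-connection error
	}
}

func main() {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		state, wait := afterAttempt(attempt, errUnreachable)
		fmt.Printf("attempt %d -> %s (retry in %s)\n", attempt, state, wait)
	}
}
```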
 ## Consequences
 - One queue, one execution engine, two entry points (sync passthrough, async
  `/jobs`).
 - Job IDs are stable, sortable, and usable for correlating webhooks to requests.
 - `GET /jobs/{id}` gives at-least-once webhook delivery a recovery path.
 ## Alternatives considered
 - **Reuse the OpenAI response `id` field instead of a separate `/jobs` surface.**
  Workable for sync, but doesn't give async callers an immediate handle before
  completion. The explicit `/jobs` surface is clearer.
 - **UUIDv4 for IDs.** Rejected in favor of ULID for natural time-ordering in the
  queue and logs.
@@ -0,0 +1,63 @@
 # ADR-0005: Webhook state-update protocol
 **Status:** Accepted — 2026-05-23
 ## Context
 Async callers (ADR-0004) need to know how their job is progressing without
 polling. The requirement: periodically push state updates
 (`queued → loading → working → done`) and deliver results/artifacts on
 completion.
 ## Decision
 When a job is submitted with `state_webhook_url`, foreman POSTs a JSON event to
 that URL on every state transition.
 ### Event payload
 ```json
 {
  "job_id": "01J...",
  "state": "loading",
  "previous_state": "queued",
  "timestamp": "2026-05-23T12:00:00Z",
  "model": "qwen3.6:35b",
  "attempt": 1,
  "error": null,
  "result": null,
  "artifacts": null
 }
 ```
 - `state`: one of `queued`, `loading`, `working`, `done`, `failed`.
 - On `done`: `result` holds the completion (native-Ollama-shaped) and `artifacts`
  holds artifact references (ADR-0006).
 - On `failed`: `error` holds a message; `result` is null.
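On the receiving side, the payload maps onto a plain struct, and the idempotency requirement becomes a lookup keyed on `job_id` plus `state`. This is a sketch under assumptions: the `Event` and `seen` types are hypothetical, and an in-memory map stands in for whatever durable store a real receiver would use.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event mirrors the webhook payload fields listed above.
type Event struct {
	JobID         string          `json:"job_id"`
	State         string          `json:"state"`
	PreviousState string          `json:"previous_state"`
	Timestamp     time.Time       `json:"timestamp"`
	Model         string          `json:"model"`
	Attempt       int             `json:"attempt"`
	Error         *string         `json:"error"`
	Result        json.RawMessage `json:"result"`
	Artifacts     json.RawMessage `json:"artifacts"`
}

// seen remembers job_id+state pairs so redelivered events are ignored.
type seen map[string]bool

// Accept decodes one event and reports whether it is new to this receiver.
func (s seen) Accept(body []byte) (Event, bool, error) {
	var ev Event
	if err := json.Unmarshal(body, &ev); err != nil {
		return Event{}, false, fmt.Errorf("decode event: %w", err)
	}
	key := ev.JobID + "/" + ev.State
	if s[key] {
		return ev, false, nil // duplicate delivery: acknowledge, do nothing
	}
	s[key] = true
	return ev, true, nil
}

func main() {
	body := []byte(`{"job_id":"01J","state":"loading","previous_state":"queued",
		"timestamp":"2026-05-23T12:00:00Z","model":"qwen3.6:35b","attempt":1,
		"error":null,"result":null,"artifacts":null}`)
	s := seen{}
	for i := 0; i < 2; i++ {
		ev, fresh, err := s.Accept(body)
		fmt.Println(ev.JobID, ev.State, fresh, err)
	}
}
```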
 ### Delivery semantics
 - **At-least-once.** Callers must be idempotent on `job_id` + `state`. A missed
  webhook can always be reconciled via `GET /jobs/{id}` (ADR-0004).
 - **Retry with backoff** on non-2xx or connection failure, bounded attempts, then
  the event is dropped (the job state itself is unaffected and remains queryable).
 - **Ordering is not guaranteed** across retries; `previous_state` + `timestamp`
  let callers order/deduplicate.
 - **Optional HMAC signing:** if a webhook secret is configured, foreman sends an
  `X-Foreman-Signature` header (HMAC-SHA256 of the body) so receivers can verify
  authenticity. Off by default; recommended once foreman is reachable beyond a
  fully trusted network.
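Signing and verification can be sketched with the standard library alone. The ADR fixes the algorithm (HMAC-SHA256 of the body) and the header name; the hex encoding of the header value and the function names `sign` and `verify` are assumptions made for this example.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns the hex-encoded HMAC-SHA256 of body under secret. Hex is an
// assumed encoding for the X-Foreman-Signature header value.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the MAC and compares it to the received header value in
// constant time, so the comparison does not leak how many bytes matched.
func verify(secret, body []byte, header string) bool {
	got, err := hex.DecodeString(header)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("example-secret")
	body := []byte(`{"job_id":"01J","state":"done"}`)
	sig := sign(secret, body)
	fmt.Println(verify(secret, body, sig))
	fmt.Println(verify(secret, []byte(`{"job_id":"01J","state":"failed"}`), sig))
}
```

A receiver should verify against the raw request bytes before decoding JSON, since re-encoding can change the byte sequence that was signed.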
 ## Consequences
 - Callers get push observability with a polling fallback.
 - Idempotency is pushed onto the caller — documented as a hard requirement.
 - Webhook delivery is decoupled from job execution: a flaky receiver never blocks
  or fails the job.
 ## Alternatives considered
 - **Polling only.** Simpler for foreman, worse for callers; rejected since
  webhooks were an explicit requirement. (Polling is still available as fallback.)
 - **WebSocket/SSE for state.** Heavier; streaming is reserved for token deltas on
  the sync surface (ADR-0012), not job-state fan-out.
@@ -0,0 +1,53 @@
 # ADR-0006: Artifact handling and transport
 **Status:** Accepted — 2026-05-23
 ## Context
 Jobs must "transmit artifacts when done." For a chat completion the obvious
 artifact is the assistant's text/tool-call output, but the term is deliberately
 broader: a job may produce structured data, multiple named outputs, or content
 too large to embed comfortably in a webhook body.
 ## Decision
 An **artifact** is a named, typed blob attached to a completed job:
 ```json
 { "name": "completion", "content_type": "application/json", "size": 1234,
  "inline": { ... }, "url": null }
 ```
 - The primary completion is always emitted as an artifact named `completion`
  (the native-Ollama response shape), so there is one consistent access pattern.
 - Additional artifacts use distinct names.
 ### Transport: inline vs fetch
 - **Small artifacts** (under a configurable threshold, default ~256 KB) are
  delivered **inline** in the `done` webhook (`inline` populated, `url` null) and
  in `GET /jobs/{id}`.
 - **Large artifacts** exceed the threshold: the webhook/`GET` carries metadata
  plus a `url` (`GET /jobs/{id}/artifacts/{name}`), and the bytes are fetched
  on demand. This keeps webhook payloads bounded and avoids shipping megabytes
  through a callback POST.
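The inline-versus-fetch choice is a size check at the point where the artifact reference is built. The sketch below assumes JSON artifacts only (the `inline` field in the example above is a JSON value); `describe` is a hypothetical helper, and falling back to a URL for non-JSON bytes is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// inlineThreshold is the default from the ADR; it is configurable there.
const inlineThreshold = 256 * 1024

// Artifact is the reference shape carried in webhooks and GET /jobs/{id}.
type Artifact struct {
	Name        string          `json:"name"`
	ContentType string          `json:"content_type"`
	Size        int             `json:"size"`
	Inline      json.RawMessage `json:"inline"`
	URL         *string         `json:"url"`
}

// describe builds the reference for one artifact: small JSON payloads are
// embedded, everything else is exposed through the fetch endpoint.
func describe(jobID, name, contentType string, data []byte, threshold int) Artifact {
	a := Artifact{Name: name, ContentType: contentType, Size: len(data)}
	if len(data) < threshold && json.Valid(data) {
		a.Inline = json.RawMessage(data)
		return a
	}
	u := "/jobs/" + url.PathEscape(jobID) + "/artifacts/" + url.PathEscape(name)
	a.URL = &u
	return a
}

func main() {
	small := describe("01J", "completion", "application/json", []byte(`{"ok":true}`), inlineThreshold)
	out, err := json.Marshal(small)
	fmt.Println(string(out), err)

	// A tiny threshold forces the fetch path for the same bytes.
	large := describe("01J", "completion", "application/json", []byte(`{"ok":true}`), 4)
	fmt.Println(*large.URL)
}
```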
 ### Retention
 Artifacts are stored alongside the job in SQLite (ADR-0008) and pruned with the
 job after a configurable TTL. No separate blob store in v1; revisit only if
 artifact sizes outgrow SQLite comfort (single-digit MB).
 ## Consequences
 - One uniform way to read output (`completion` artifact), extensible to richer
  jobs later without protocol changes.
 - Webhook bodies stay small; large outputs don't bloat or break delivery.
 - A pull endpoint for artifacts means a missed/oversized webhook never loses data.
 ## Alternatives considered
 - **Always inline.** Simple but risks huge webhook bodies and SQLite row bloat in
  the hot path. Rejected.
 - **External object store (S3/MinIO) from day one.** Over-engineered for the
  expected sizes; deferred behind the TTL/threshold knobs.
@@ -0,0 +1,48 @@
 # ADR-0007: Model inventory polling and discovery
 **Status:** Accepted — 2026-05-23
 ## Context
 foreman needs a "relatively in-sync" view of which models are installed on its
 target so it can (a) advertise them to callers, (b) reject jobs for missing
 models early instead of failing mid-execution, and (c) know what is currently
 resident to inform scheduling (ADR-0009).
 ## Decision
 A background poller queries the target on a configurable interval (default ~30s):
 - `GET /api/tags` → the installed-model inventory. Cached in memory; this cache
  backs foreman's own `/api/tags` passthrough (ADR-0003) and `/v1/models` if the
  OpenAI-compat surface is enabled.
 - `GET /api/ps` → which model(s) are currently loaded, how much of each sits in
  VRAM, and when each is due to unload. Used by the scheduler to decide whether
  the next job requires a swap.
 ### Behavior
 - **Early validation:** a job naming a model absent from the cached inventory is
  rejected at submit time with a clear error. At the default ~30s poll interval
  the cache is fresh enough for this check to be reliable, and a small grace path
  covers a model installed between polls: on a miss, foreman re-checks the target
  once before rejecting.
 - **Degraded mode:** if the target is unreachable, the last-known inventory is
  retained and foreman marks itself degraded (surfaced on a health endpoint).
  Jobs are not rejected wholesale on a single failed poll — the target is a
  laptop that may briefly sleep (ADR-0002). Execution-time unreachability is
  handled by job retry (ADR-0004).
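Both behaviors fit in a small cache type. This is a sketch under assumptions: `Inventory`, `fetch`, and the error values are hypothetical names, and `fetch` stands in for the real `GET /api/tags` call against the target.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	ErrUnknownModel = errors.New("model not installed on target")
	errTargetDown   = errors.New("target unreachable") // demo stand-in
)

// Inventory caches the installed-model list and tracks degraded state.
type Inventory struct {
	mu       sync.Mutex
	models   map[string]bool
	degraded bool
	lastPoll time.Time
	fetch    func() ([]string, error) // hypothetical: wraps GET /api/tags
}

// Refresh polls the target once. On failure the last-known list is kept and
// the inventory is marked degraded.
func (inv *Inventory) Refresh() error {
	names, err := inv.fetch()
	inv.mu.Lock()
	defer inv.mu.Unlock()
	if err != nil {
		inv.degraded = true
		return fmt.Errorf("poll target: %w", err)
	}
	inv.models = make(map[string]bool, len(names))
	for _, n := range names {
		inv.models[n] = true
	}
	inv.degraded = false
	inv.lastPoll = time.Now()
	return nil
}

func (inv *Inventory) has(model string) bool {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	return inv.models[model]
}

// Degraded reports whether the most recent poll failed.
func (inv *Inventory) Degraded() bool {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	return inv.degraded
}

// Validate accepts a model from the cache, re-checking the target once on a
// miss to catch a model installed between polls.
func (inv *Inventory) Validate(model string) error {
	if inv.has(model) {
		return nil
	}
	_ = inv.Refresh() // a failed re-check keeps the old list
	if inv.has(model) {
		return nil
	}
	return fmt.Errorf("%w: %s", ErrUnknownModel, model)
}

func main() {
	up := true
	inv := &Inventory{fetch: func() ([]string, error) {
		if !up {
			return nil, errTargetDown
		}
		return []string{"qwen3:14b"}, nil
	}}
	_ = inv.Refresh()
	fmt.Println(inv.Validate("qwen3:14b"), inv.Validate("missing:1b"))
	up = false
	_ = inv.Refresh()
	fmt.Println(inv.Degraded(), inv.Validate("qwen3:14b"))
}
```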
 ## Consequences
 - Callers can discover available models through the normal Ollama/OpenAI
  endpoints; no foreman-specific discovery API needed.
 - Bad-model jobs fail fast and cheaply.
 - A health/status endpoint exposing degraded state and last-poll time is required.
 ## Alternatives considered
 - **No caching; proxy `/api/tags` live per request.** Simpler but couples every
  discovery call to target availability and adds latency. Rejected; the poller
  also feeds the scheduler, so the cache is needed regardless.
 - **Push/event-based inventory.** Ollama offers no such mechanism; polling is the
  only option.
@@ -0,0 +1,42 @@
 # ADR-0008: Durable SQLite-backed queue
 **Status:** Accepted — 2026-05-23
 ## Context
 Jobs are queued, carry state, and may be retried across target sleep/restart. A
 caller that submitted an async job and is waiting on a webhook must not lose its
 job because foreman restarted. State must survive process restarts.
 ## Decision
 The job queue and all job state (including artifacts, ADR-0006) live in **SQLite**
 in WAL mode, via the pure-Go `modernc.org/sqlite` driver (no CGO, so the Komodo
 container build stays trivial).
 ### Schema sketch
 - `jobs(id TEXT PK, state TEXT, model TEXT, request BLOB, result BLOB,
  error TEXT, webhook_url TEXT, attempt INT, created_at, updated_at, …)`
 - `artifacts(job_id TEXT, name TEXT, content_type TEXT, size INT, inline BLOB,
  PRIMARY KEY(job_id, name))`
 The access pattern is a single writer (the worker, ADR-0009) plus the HTTP
 handlers; WAL handles the concurrent-reader / single-writer pattern well at this
 scale.
 ## Consequences
 - Jobs and results are durable across restarts; webhook recovery via
  `GET /jobs/{id}` (ADR-0004) is meaningful.
 - Pure-Go driver keeps cross-compilation and container builds painless.
 - Pruning is a TTL sweep over `jobs`/`artifacts`; no external store to operate.
 - SQLite caps practical artifact size at single-digit MB — acceptable per ADR-0006
  thresholds; revisit if outputs grow.
 ## Alternatives considered
 - **In-memory queue.** Loses async jobs on restart; unacceptable given webhooks.
 - **Redis / external broker.** Another moving part to run for a single-worker
  daemon; over-engineered. Rejected.
 - **`mattn/go-sqlite3` (CGO).** Faster in some cases but complicates static builds
  and container images. Pure-Go preferred for ops simplicity.
@@ -0,0 +1,44 @@
 # ADR-0009: Single-worker serialization and drain-by-model scheduling
 **Status:** Accepted — 2026-05-23
 ## Context
 The target is memory-bandwidth-bound (the M1 Pro tops out at ~200 GB/s). It runs
 one model at a time at full speed; loading a different model is a 5-10s cold
 start. Running two models concurrently on 32GB either OOMs or pages, which costs
 a 5-10x slowdown. So parallelism against a single target buys nothing and would
 reintroduce coordination logic.
 ## Decision
 **Concurrency against the target is 1.** A single worker loop pulls the next job
 from the queue, ensures the right model is resident, executes, and records the
 result.
 **Drain-by-model scheduling:** before incurring a model swap, the worker finishes
 every queued job that targets the **currently-resident** model (observed via
 `/api/ps`, ADR-0007). Only when no job for the hot model remains does it select a
 job for a different model and pay the swap cost.
 This is an `ORDER BY (model != current_model), created_at` style selection — a
 heuristic, not a scheduler. There is intentionally **no** priority system,
 fairness weighting, or capacity budgeting (those sank the predecessor; see
 ADR-0001).
 Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
 between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
 to single-resident swap.
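The selection rule is small enough to express as an in-memory comparison that mirrors `ORDER BY (model != current_model), created_at`. The sketch below is illustrative only: the `Job` shape, the integer timestamp, and the names `before` and `next` are assumptions, and the real worker would issue the query against SQLite instead.

```go
package main

import "fmt"

// Job carries only the fields the selection rule needs.
type Job struct {
	ID        string
	Model     string
	CreatedAt int64 // Unix milliseconds; an assumed representation
}

// before reports whether a should run ahead of b: jobs for the resident
// model first, then oldest first.
func before(a, b Job, resident string) bool {
	aSwap, bSwap := a.Model != resident, b.Model != resident
	if aSwap != bSwap {
		return !aSwap // the job that avoids a swap wins
	}
	return a.CreatedAt < b.CreatedAt
}

// next returns the job the worker should run, or false for an empty queue.
func next(queue []Job, resident string) (Job, bool) {
	if len(queue) == 0 {
		return Job{}, false
	}
	best := queue[0]
	for _, j := range queue[1:] {
		if before(j, best, resident) {
			best = j
		}
	}
	return best, true
}

func main() {
	queue := []Job{
		{ID: "1", Model: "qwen3.6:35b", CreatedAt: 100},
		{ID: "2", Model: "qwen3:14b", CreatedAt: 200},
		{ID: "3", Model: "qwen3.6:35b", CreatedAt: 300},
	}
	j, _ := next(queue, "qwen3:14b")
	fmt.Println(j.ID) // the resident-model job runs before the older one
}
```

With nothing resident, every job counts as a swap and the rule degrades to plain oldest-first.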
 ## Consequences
 - Swap thrash is minimized without any complex scheduling.
 - A long run of same-model jobs can delay a different-model job — acceptable for a
  background box, and bounded by queue depth. If starvation ever becomes a real
  problem, that is a signal to reconsider, not to pre-build fairness.
 - Throughput is dominated by how well callers batch work by model.
 ## Alternatives considered
 - **FIFO with naive swapping.** Correct but pays a cold start on every model
  change; wasteful when jobs interleave models. Rejected.
 - **Priority/fair scheduling.** Explicitly rejected as scope creep (ADR-0001).
@@ -0,0 +1,51 @@
 # ADR-0010: Authentication and security boundary
 **Status:** Accepted — 2026-05-23
 ## Context
 Ollama itself has no authentication — anyone who can reach `:11434` can drive it.
 foreman sits in front of it and is the network-facing component. We need a real
 boundary without dragging in an auth framework (the kind of scope creep ADR-0001
 guards against).
 ## Decision
 **Primary boundary is the network.** foreman and its Ollama target sit on a
 trusted segment: the target's `:11434` is firewalled to foreman only, and/or
 both are bound to the Tailscale interface. foreman is **not** exposed through a
 public Traefik entrypoint.
 **Optional static bearer token.** If a token is configured, foreman validates the
 `Authorization: Bearer <token>` header on incoming requests. This reuses headers
 that clients already send:
 - `go-llm` via `llm.Ollama()` sends no auth (fine on a trusted segment); via
  `ollama.New(key, baseURL)` it sends `Authorization: Bearer <key>` — so a
  configured foreman token slots straight into the existing provider with no new
  code.
 - The OpenAI-compat surface (if enabled, ADR-0003) carries the same header.
 foreman → target auth: an optional bearer the daemon attaches to its own calls to
 Ollama, for the Ollama-Cloud-style case; empty for a local/LAN target.
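The optional incoming-token check can be sketched as a thin `net/http` middleware. The names `requireBearer`, `okHandler`, and `status` are hypothetical, and the constant-time comparison and the `WWW-Authenticate` response header are choices made for this example, not requirements stated in the ADR.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requireBearer wraps next with a static bearer-token check. An empty token
// disables the check, matching the "optional" wording of the decision.
func requireBearer(token string, next http.Handler) http.Handler {
	if token == "" {
		return next
	}
	want := []byte(token)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got, ok := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !ok || subtle.ConstantTimeCompare([]byte(got), want) != 1 {
			w.Header().Set("WWW-Authenticate", "Bearer")
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// okHandler stands in for the real API handlers.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// status issues an in-process GET with an optional Authorization header and
// returns the response status code.
func status(h http.Handler, authorization string) int {
	req := httptest.NewRequest(http.MethodGet, "/api/tags", nil)
	if authorization != "" {
		req.Header.Set("Authorization", authorization)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := requireBearer("s3cret", okHandler)
	fmt.Println(status(h, "Bearer s3cret"), status(h, "Bearer nope"), status(h, ""))
}
```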
 ## Out of scope for v1
 - Authentik / SSO. It is painful for service-to-service traffic and adds nothing
  over network isolation here.
 - Per-caller identities, scopes, rate limiting. Not needed for a single-tenant
  homelab daemon.
 ## Consequences
 - Minimal but real security: network isolation always, plus an optional shared
  secret that integrates with existing clients for free.
 - Webhook authenticity is handled separately by optional HMAC signing (ADR-0005).
 - No financial/identity/credential data ever transits foreman; it brokers chat
  jobs only.
 ## Alternatives considered
 - **No auth, network-only.** Acceptable on a fully trusted tailnet; the optional
  token exists for when foreman's reachability widens.
 - **Full auth framework / SSO.** Rejected as scope creep.
@@ -0,0 +1,73 @@
 # ADR-0011: Go integration — the `Foreman` interface
 **Status:** Accepted — 2026-05-23
 ## Context
 The ultimate goal: use the M1 Pro **simply as a target for `go-llm`**.
 **Verified (`v2/constructors.go`, `v2/ollama/ollama.go`):** `llm.OllamaCloud(key,
 WithBaseURL(...))` already targets "a private Ollama deployment that requires
 auth" — native `/api/chat` + `Authorization: Bearer <key>` against any base URL.
 foreman is exactly that on the wire (ADR-0003). So integration needs **no new
 provider** — only a clean, intent-revealing seam so call sites say "foreman," not
 "Ollama."
 `go-llm`'s provider contract (`v2/provider`) is two methods, `Complete` and
 `Stream`; a future dedicated provider would implement them.
 ## Decision
 Add a `llm.Foreman(baseURL, apiKey, opts...)` constructor to go-llm that delegates
 to the ollama native provider — the ollama translation happens behind the scenes:
 ```go
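 // Foreman returns a Client for a foreman daemon at baseURL. It delegates to the
 // native ollama provider, which sends apiKey as a Bearer token.
 func Foreman(baseURL, apiKey string, opts ...ClientOption) *Client {
 	cfg := &clientConfig{}
 	for _, opt := range opts {
 		opt(cfg)
 	}
 	// A base URL set through opts overrides the positional argument.
 	if cfg.baseURL != "" {
 		baseURL = cfg.baseURL
 	}
 	return NewClient(ollamaProvider.New(apiKey, baseURL))
 }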
 // model := llm.Foreman("http://foreman.orgrimmar:PORT", token).Model("qwen3.6:35b")
 ```
 `baseURL` is required (foreman has no default public address). This is a
 deliberate **seam**: v1 is a pass-through to the `ollama` provider; a dedicated
 foreman provider can later replace the delegate to surface job IDs / async state
 without changing call sites.
 ### Three escalating levels
 - **Level 0 — `llm.Foreman(...)` (now, the headline goal).** Transparent,
  synchronous, full native tool-calling / `think:false` / streaming. Queueing and
  model-swap management happen invisibly inside the daemon. Zero provider code.
 - **Level 1 — `foreman` client package (when an orchestration caller needs it).**
  A synchronous facade over the async `/jobs` surface: given messages, it manages
  an ephemeral webhook receiver, blocks until `done`, and returns result +
  artifacts (falling back to `GET /jobs/{id}` polling if it can't receive
  callbacks). For callers wanting async semantics — surfaced job IDs, no
  long-held connection — with a synchronous call signature.
 - **Level 2 — dedicated `provider.Provider` (only if needed).** Wraps Level 1 so
  foreman is a first-class go-llm backend exposing job IDs / state / artifacts the
  plain ollama provider can't. Built only if Level 0 proves insufficient.
 ## Consequences
 - Headline goal met with one constructor and no provider code.
 - Call sites are foreman-named and future-proofed by the seam.
 - Async ergonomics are available later without forcing webhook plumbing on
  callers, and without touching Level-0 users.
 ## Alternatives considered
 - **Just tell users to call `OllamaCloud` with a base URL.** Works identically
  today, but leaks the implementation ("it's Ollama") and offers no seam for
  future foreman-specific behavior. The named constructor is the requested
  "foreman interface."
 - **Ship a dedicated provider from day one (Level 2 first).** More code; bypasses
  the zero-friction win. Deferred.
@@ -0,0 +1,41 @@
 # ADR-0012: Streaming support
 **Status:** Accepted — 2026-05-23
 ## Context
 `go-llm`'s provider interface has a `Stream()` method, and Ollama's native
 `/api/chat` streams token-by-token by default. The synchronous passthrough
 (ADR-0003) must not break streaming clients. Separately, the async `/jobs`
 surface (ADR-0004) reports progress via discrete state webhooks, which is a
 different granularity than token streaming.
 ## Decision
 - **Sync passthrough: support streaming.** Unless a `/api/chat` request sets
  `stream: false`, foreman streams the target's token deltas back to the caller
  as chunked newline-delimited JSON, matching Ollama's native streaming. A streamed job still moves
  through the queue; streaming begins once the job reaches `working`, so a job
  waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when
  its turn comes. go-llm's `Stream()` works against foreman unchanged.
 - **Async `/jobs` surface: no token streaming in v1.** Webhooks carry coarse state
  transitions (ADR-0005) and the final result/artifacts, not per-token deltas.
  Token-level streaming over a fire-and-forget webhook job is deliberately
  deferred — it adds a transport (persistent connection or chunked webhook) whose
  complexity isn't justified yet.
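The sync streaming path amounts to relaying the target's response line by line and flushing after each one. The sketch below assumes the target emits newline-delimited JSON chunks, as Ollama's native API does; `relay` and `relayString` are hypothetical names, and queueing, cancellation, and error reporting to the caller are left out.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// relay copies a newline-delimited stream to w, flushing after every line so
// the caller sees each chunk as soon as the target produces it. It returns
// the number of lines forwarded.
func relay(w http.ResponseWriter, upstream io.Reader) (int, error) {
	flusher, _ := w.(http.Flusher) // nil when the writer cannot flush
	w.Header().Set("Content-Type", "application/x-ndjson")
	r := bufio.NewReader(upstream)
	lines := 0
	for {
		line, err := r.ReadBytes('\n')
		if len(line) > 0 {
			if _, werr := w.Write(line); werr != nil {
				return lines, fmt.Errorf("write to caller: %w", werr)
			}
			if flusher != nil {
				flusher.Flush()
			}
			lines++
		}
		if err == io.EOF {
			return lines, nil
		}
		if err != nil {
			return lines, fmt.Errorf("read from target: %w", err)
		}
	}
}

// relayString runs relay against an in-memory stream and returns the line
// count and everything the caller would have received.
func relayString(s string) (int, string) {
	rec := httptest.NewRecorder()
	n, _ := relay(rec, strings.NewReader(s))
	return n, rec.Body.String()
}

func main() {
	stream := "{\"message\":{\"content\":\"Hel\"},\"done\":false}\n" +
		"{\"message\":{\"content\":\"lo\"},\"done\":true}\n"
	n, out := relayString(stream)
	fmt.Println(n)
	fmt.Print(out)
}
```

In the daemon, `upstream` would be the body of the request foreman makes to the target once the job reaches `working`.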
 ## Consequences
 - Interactive go-llm usage gets real streaming through the transparent surface.
 - Orchestration callers get state + final artifacts, which is what they need;
  they can use the sync streaming surface directly if they want tokens.
 - The job state machine and webhook protocol stay simple (no streaming transport
  to design or operate).
 ## Alternatives considered
 - **Stream tokens over the async surface too.** Deferred: requires either a
  long-lived connection (defeats the point of async) or chunked-delta webhooks
  (complex, rarely needed). Revisit only on a concrete need.
 - **No streaming at all.** Would break go-llm's `Stream()` and interactive use on
  the very path that is the primary goal. Rejected.
@@ -0,0 +1,41 @@
 # foreman — Architecture Decision Records
 `foreman` is a small daemon that fronts **one** Ollama target. It turns a single
 Ollama instance into a queued, observable job endpoint: it polls the target's
 installed models, serializes jobs through the target (managing model swaps),
 assigns every job an ID, and reports progress + artifacts via webhooks. A thin
 `llm.Foreman(...)` constructor makes the target trivial to use from `go-llm`.
 It is the deliberately pared-down successor to `peon-overseer`. One daemon, one
 worker, one queue. No distributed dispatch, no leases, no fair queueing.
 ## Index
 | ADR | Title | Status |
 |-----|-------|--------|
 | 0001 | One daemon per Ollama target | Accepted |
 | 0002 | Daemon placement and remote target configuration | Accepted |
 | 0003 | API surface: native Ollama passthrough vs OpenAI-compat | Accepted |
 | 0004 | Async job surface, job IDs, and queued execution | Accepted |
 | 0005 | Webhook state-update protocol | Accepted |
 | 0006 | Artifact handling and transport | Accepted |
 | 0007 | Model inventory polling and discovery | Accepted |
 | 0008 | Durable SQLite-backed queue | Accepted |
 | 0009 | Single-worker serialization and drain-by-model scheduling | Accepted |
 | 0010 | Authentication and security boundary | Accepted |
 | 0011 | Go integration: the `Foreman` interface | Accepted |
 | 0012 | Streaming support | Accepted |
 ADR-0003 was resolved in favor of **native Ollama** as the v1 surface: foreman is,
 on the wire, a private authenticated Ollama deployment, so `go-llm` integrates via
 a thin `llm.Foreman(baseURL, token)` constructor that delegates to the existing
 ollama provider (ADR-0011). OpenAI-compat `/v1` is deferred.
 These ADRs refine the API/integration sections of the project `CLAUDE.md`. The
 queue, single-worker, drain-by-model, and security guardrails carry forward
 unchanged.
 ## Format
 Each ADR: Status, Context, Decision, Consequences, and Alternatives where useful.
 One decision per file. Append new ADRs; supersede rather than rewrite.