initial commit

2026-05-23 16:41:20 -04:00
commit 8fde024281
15 changed files with 803 additions and 0 deletions
+27
@@ -0,0 +1,27 @@
# Compiled binary (cmd/foreman)
/foreman
/dist/
*.exe
# Test & coverage output
*.out
*.test
coverage.*
# SQLite queue + artifacts (local dev data — never commit)
*.db
*.db-wal
*.db-shm
*.sqlite
*.sqlite3
# Local config / secrets (commit .env.example, not .env)
.env
.env.local
*.local
# Editor / OS cruft
.DS_Store
.idea/
.vscode/
*.swp
+144
@@ -0,0 +1,144 @@
# foreman
A small, always-on daemon that fronts **one** Ollama target. It turns a single
Ollama instance into a queued, observable job endpoint: it polls the target's
installed models, serializes work through the target (managing model swaps),
assigns every job an ID, and reports progress + artifacts via webhooks. On the
wire it speaks **native Ollama**, so it doubles as a drop-in `go-llm` target.
foreman is the deliberately pared-down successor to `peon-overseer`. One daemon,
one target, one queue. The complexity that sank the predecessor — distributed
dispatch, claim leases, weighted fair queueing, capacity budgets, eligibility
gates — existed to coordinate *multiple* workers and is **out of scope**.
Resisting that creep is a first-class design goal. See `docs/adr/` for the
decisions; this file summarizes them.
## Topology (ADR-0001, ADR-0002)
```
orgrimmar:   foreman (Go binary + SQLite queue + HTTP API + worker loop)
                 |  HTTP over the trusted VLAN / Tailscale
                 v
M1 Pro Mac:  Ollama only (models on disk, no foreman logic)
```
- One foreman process per Ollama target, configured by a single base URL
(default: the Mac's Tailscale address). A second worker = a second foreman.
- foreman runs on the homelab, containerized, deployed via Komodo. The Mac stays
a dumb appliance.
- The target is a laptop and may sleep. Unreachability is transient and
recoverable, never fatal (see the poller's degraded mode and job retry below).
## API surfaces (ADR-0003, ADR-0004)
1. **Primary — transparent native Ollama passthrough:** `/api/chat`, `/api/tags`,
`/api/ps`. foreman looks exactly like an Ollama server. Synchronous: calls are
queued internally but the HTTP response blocks until completion. Streaming is
supported in Ollama's native newline-delimited JSON format (ADR-0012). This is
the `go-llm` target path.
2. **Async jobs — `POST /jobs`, `GET /jobs/{id}`:** body is a native-chat payload
plus optional `state_webhook_url`. Returns `202` + `{ "job_id": "<ulid>" }`
immediately. For fire-and-forget orchestration callers.
3. **Optional OpenAI-compat `/v1/chat/completions` + `/v1/models`:** deferred;
added only if a non-go-llm caller needs it.
Job lifecycle: `queued → loading → working → done` (+ terminal `failed`). A
connection failure to the target re-queues the job with backoff (bounded retries
guard poison jobs). IDs are ULIDs (sortable, timestamped).
## Webhooks & artifacts (ADR-0005, ADR-0006)
- On each state transition, POST a JSON event to `state_webhook_url`
(`job_id`, `state`, `previous_state`, `timestamp`, `model`, `attempt`, and on
completion `result` / `artifacts` / `error`).
- At-least-once delivery; callers must be idempotent on `job_id`+`state`; missed
events reconcile via `GET /jobs/{id}`. Retry with bounded backoff. Optional
`X-Foreman-Signature` HMAC when a webhook secret is configured.
- Artifacts are named typed blobs; the completion is always artifact `completion`.
Inline under ~256KB, otherwise fetched via `GET /jobs/{id}/artifacts/{name}`.
## Model inventory (ADR-0007)
- A poller hits the target's `/api/tags` (default ~30s) to keep an in-sync model
list; backs foreman's `/api/tags` passthrough and job validation.
- `/api/ps` tells foreman what's resident, feeding the scheduler.
- Jobs naming an uninstalled model are rejected at submit time (one re-check on
miss). Target unreachable → retain last-known list, mark degraded on a health
endpoint; do not reject wholesale on a single failed poll.
## Execution (ADR-0009)
- **Concurrency against the target is 1.** A single worker loop pulls a job,
ensures the right model is resident, executes, records the result.
- **Drain-by-model:** finish every queued job for the currently-resident model
before paying a swap (`ORDER BY (model != current), created_at`). A heuristic,
not a scheduler. No priorities, fairness, or budgets.
- Pin residency with Ollama `keep_alive`; the target runs with
`OLLAMA_MAX_LOADED_MODELS=1` and `OLLAMA_CONTEXT_LENGTH` set to 8192 or higher.
## Persistence (ADR-0008)
- SQLite, WAL mode, pure-Go `modernc.org/sqlite` (no CGO → trivial Komodo builds).
- `jobs` + `artifacts` tables; single writer (the worker) + HTTP readers. TTL
sweep for pruning. No external broker.
## Models served
foreman serves **any installed model** named in a request; it does not own a
role→model mapping (the caller picks the model, e.g. go-llm `.Model(...)`).
Recommended roster to pull on the Mac (32GB, ~26-28GB usable, single-resident
swap):
- **parse / data** — `qwen3:14b` (~9GB, structured/JSON output).
- **agent + code** — `qwen3.6:35b` (MoE, ~3B active, ~20GB, fast tool-calling).
- Split a dedicated dense coder (`qwen3.6:27b`) off later only if `35b`'s code
quality disappoints; a dense model that size is bandwidth-bound and slow on this Mac.
- Verify exact tags against the Ollama library before pulling; the registry moves.
## go-llm integration (ADR-0011)
Verified: `llm.OllamaCloud(key, WithBaseURL(...))` already targets a private
authenticated native-Ollama endpoint — which foreman is. Integration is a thin
constructor, no new provider:
- **Level 0 (now):** `llm.Foreman(baseURL, token).Model("qwen3.6:35b")` — delegates
to the ollama provider; transparent, synchronous, full tool/think/stream.
- **Level 1 (later):** a `foreman` client package — synchronous facade over the
async `/jobs` surface (manages a webhook receiver, blocks to done).
- **Level 2 (if needed):** a dedicated `provider.Provider` surfacing job IDs/state.
## Security (ADR-0010)
- Network is the boundary: target `:11434` firewalled to foreman, and/or both on
Tailscale. foreman is **not** on a public Traefik entrypoint.
- Optional static bearer: validate `Authorization: Bearer <token>`, which reuses
the header `go-llm` already sends via the Foreman/OllamaCloud path.
- No Authentik/SSO, no per-caller identities for v1. No financial/identity data
ever transits foreman.
## Stack & conventions
- Go, stdlib `net/http`, minimal deps. SQLite via `modernc.org/sqlite`.
- No UI. HTTP API + small CLI only.
- Match go-llm house style: standard Go tabs; `camelCase`/`PascalCase`; check
errors immediately and wrap with `fmt.Errorf("%w: ...", err)`; imports stdlib →
third-party → internal. The worker loop never panics; it logs, marks the job,
continues.
- ADRs in `docs/adr/` (one decision each, append/supersede). Living `progress.md`
at repo root. Repo: `gitea.stevedudenhoeffer.com`.
## Out of scope (anti-creep guardrails — ADR-0001)
Distributed dispatch, multiple workers, claim leases, weighted fair queueing,
capacity budgets, eligibility gates, an auth framework / SSO, a GUI, and managing
more than one target per daemon. Keep the ollama client behind a small interface
so a future second backend is additive — but do not build for it now.
## Milestones
- **M0** — native `/api/chat` passthrough + SQLite queue + single-worker loop, one
model end to end, synchronous.
- **M1** — model poller + `/api/tags`/`/api/ps`, drain-by-model, async `/jobs` +
`state_webhook_url` + artifacts + retry-on-unreachable, the CLI, and the
`llm.Foreman()` constructor in go-llm.
- **M2 (later)** — optional OpenAI-compat `/v1`, Level-1 client / dedicated
provider, metrics.
+37
@@ -0,0 +1,37 @@
# ADR-0001: One daemon per Ollama target
**Status:** Accepted — 2026-05-23
## Context
`peon-overseer` ballooned because it coordinated *multiple* workers from a
central service: pull-based dispatch, claim leases, weighted fair queueing,
capacity budgets, eligibility gates. All of that complexity existed solely to
arbitrate shared workers. We want none of it back.
The system being built fronts inference hardware (initially the M1 Pro running
Ollama) and exposes it as a managed job endpoint.
## Decision
Each `foreman` process is bound to **exactly one** Ollama target, configured by a
single base URL. One target = one daemon = one queue. There is no cross-daemon
awareness and no shared state between daemons.
If a second worker is added later (the 4090 box, the M5 Max), it gets its own
`foreman` instance. Any fan-out across workers is the concern of a *separate*
higher-level router that talks to multiple foreman instances — explicitly out of
scope here and not to be anticipated in this codebase.
## Consequences
- The daemon is radically simple: one target, one serialized work stream.
- Horizontal scale is "run another daemon," an operational act, not a code change.
- No lease/fairness/budget machinery is permitted in this repo. If a change
starts to require it, that is the signal that the multi-worker router (a
different project) is what's actually needed.
## Alternatives considered
- **One daemon managing many targets.** Rejected: reintroduces the scheduling and
arbitration complexity that sank the predecessor.
+36
@@ -0,0 +1,36 @@
# ADR-0002: Daemon placement and remote target configuration
**Status:** Accepted — 2026-05-23
## Context
The inference box is an M1 Pro MacBook — a laptop, not always-on infrastructure.
The rest of steveternet runs on the homelab and is deployed/managed through
Komodo. We do not want bespoke job-controller logic living on the Mac.
## Decision
`foreman` runs on the homelab (e.g. orgrimmar), containerized and deployed via
Komodo like everything else. It is **given** its Ollama target as a configurable
base URL (default: the Mac's Tailscale address) and reaches it over the network.
The Mac runs Ollama and nothing `foreman`-specific. It stays a dumb appliance.
## Consequences
- Ops consistency: foreman is a normal Komodo-managed container.
- The target URL is config, never hardcoded — swapping the Mac for another
backend is a config edit (within the one-target-per-daemon rule of ADR-0001).
- The Mac is a laptop and may sleep or change networks. The daemon must treat an
unreachable target as a transient, recoverable condition (see ADR-0007 for the
model poller's degraded mode and ADR-0004 for job retry semantics), never as a
fatal error. Operationally: `caffeinate`/`pmset` keeps the Mac awake; Tailscale
keeps its address stable.
- Network is now the trust boundary; Ollama has no auth of its own (see ADR-0010).
## Alternatives considered
- **Co-locate foreman on the Mac.** Rejected: contradicts the stated preference to
keep controller logic off the laptop, and complicates Komodo-based deployment.
Note that "given a target URL" keeps this reversible — co-location would just be
pointing the URL at localhost.
+51
@@ -0,0 +1,51 @@
# ADR-0003: API surface — native Ollama passthrough vs OpenAI-compat
**Status:** Accepted — 2026-05-23 (resolved in favor of native Ollama)
## Context
Two goals were in mild tension: the original phrasing asked for an
"OpenAI-compatible API," while the stated ultimate goal is to use the M1 Pro
**simply as a target for `go-llm`**.
`go-llm`'s `v2/CLAUDE.md` Key Design Decision #8 is explicit: its Ollama provider
deliberately uses native `/api/chat`, *not* OpenAI-compat `/v1`, for `think:false`
support, more reliable tool calling, and ~15-20% lower latency.
**Verified in code (`v2/constructors.go`).** `llm.OllamaCloud(apiKey, opts...)`
sends the key as `Authorization: Bearer <key>` over native `/api/chat`, and its
doc comment says to "use `WithBaseURL` to point at a private Ollama deployment
that requires auth." So go-llm *already* has a first-class path for a private,
authenticated, native-Ollama endpoint — exactly what foreman is on the wire.
Choosing OpenAI-compat would push go-llm onto a path its own author rejected, for
no benefit to the primary caller.
## Decision
Native Ollama is **the** surface for v1. foreman speaks native `/api/chat`,
`/api/tags`, and `/api/ps`, optionally behind a Bearer token (ADR-0010). To
go-llm and any Ollama client it is indistinguishable from a private Ollama
deployment.
The synchronous passthrough is transparent: calls are queued internally
(ADR-0009) but the HTTP response blocks until the job completes. Async features
(job IDs, `state_webhook_url`, artifacts) live on a separate `/jobs` surface
(ADR-0004), not bolted onto the passthrough.
OpenAI-compat `/v1/chat/completions` is **deferred**, added in a later milestone
only if a non-go-llm caller needs it.
## Consequences
- "Set up the Mac as a go-llm target" needs zero provider changes — a thin
constructor only (ADR-0011).
- Preserves `think:false`, reliable tool calls, and lower latency.
- foreman must faithfully proxy native `/api/chat` semantics, including its
streamed responses (ADR-0012).
## Alternatives considered
- **OpenAI-compat as primary/only surface.** Matches the original phrasing but
contradicts go-llm DD#8 and adds nothing for the primary caller. Rejected.
- **Native-only, never add OpenAI-compat.** Fully serves the goal; the secondary
surface is kept as an option, not a commitment.
+52
@@ -0,0 +1,52 @@
# ADR-0004: Async job surface, job IDs, and queued execution
**Status:** Accepted — 2026-05-23
## Context
The transparent passthrough (ADR-0003) is synchronous: the caller holds an HTTP
connection until the completion returns. That is fine for interactive-length work
and for go-llm, but two needs aren't served by it:
- Long-running jobs held open through Traefik risk idle-connection timeouts.
- Orchestration callers (mort/ratchet/werk-style) want fire-and-forget: submit,
get an ID back immediately, and be told asynchronously when the work is done.
## Decision
Add a distinct async surface: `POST /jobs`.
- The body carries a chat payload (native-Ollama-shaped, mirroring `/api/chat`)
plus optional extension fields, notably `state_webhook_url` (ADR-0005).
- foreman enqueues the job, assigns it a **ULID** (sortable, timestamped), and
immediately returns `202 Accepted` with `{ "job_id": "<ulid>" }`.
- The caller correlates later webhook callbacks to its request via `job_id`.
- `GET /jobs/{id}` returns current state, result, and artifact references for
polling-style callers or for recovery after a missed webhook.
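
From the caller's side, the submit step might look like the sketch below. It assumes only what this ADR states (a native-chat-shaped body, an optional `state_webhook_url`, and a `202` carrying `job_id`); the type names, `submitJob`, and the stand-in server are hypothetical, and most chat fields are omitted.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// chatMessage and jobRequest model just enough of a native /api/chat body for
// the example; real requests carry more fields.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type jobRequest struct {
	Model           string        `json:"model"`
	Messages        []chatMessage `json:"messages"`
	StateWebhookURL string        `json:"state_webhook_url,omitempty"`
}

type jobAccepted struct {
	JobID string `json:"job_id"`
}

// submitJob POSTs to /jobs and returns the job ID from the 202 response.
func submitJob(baseURL string, req jobRequest) (string, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return "", err
	}
	resp, err := http.Post(baseURL+"/jobs", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var out jobAccepted
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.JobID, nil
}

// newFakeForeman is a stand-in server so the client can run without a daemon.
func newFakeForeman(id string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/jobs" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusAccepted)
		json.NewEncoder(w).Encode(jobAccepted{JobID: id})
	}))
}

func main() {
	srv := newFakeForeman("01JEXAMPLE")
	defer srv.Close()
	id, err := submitJob(srv.URL, jobRequest{
		Model:    "qwen3:14b",
		Messages: []chatMessage{{Role: "user", Content: "hi"}},
	})
	fmt.Println(id, err)
}
```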
Every unit of work is a row in the queue (ADR-0008) regardless of which surface
created it; the synchronous passthrough is simply a `/jobs` submission whose
handler blocks on the job's completion instead of returning the ID.
### Job lifecycle
`queued → loading → working → done`, plus terminal `failed`. A job whose target
is unreachable re-enters `queued` with a backoff (it is retryable, never
auto-failed on a connection error — the target is a laptop, ADR-0002). A bounded
retry count guards against poison jobs; exceeding it moves the job to `failed`
with the last error recorded.
## Consequences
- One queue, one execution engine, two entry points (sync passthrough, async
`/jobs`).
- Job IDs are stable, sortable, and usable as the key that correlates a request with its webhooks.
- `GET /jobs/{id}` gives at-least-once webhook delivery a recovery path.
## Alternatives considered
- **Reuse the OpenAI response `id` field instead of a separate `/jobs` surface.**
Workable for sync, but doesn't give async callers an immediate handle before
completion. The explicit `/jobs` surface is clearer.
- **UUIDv4 for IDs.** Rejected in favor of ULID for natural time-ordering in the
queue and logs.
+63
@@ -0,0 +1,63 @@
# ADR-0005: Webhook state-update protocol
**Status:** Accepted — 2026-05-23
## Context
Async callers (ADR-0004) need to know how their job is progressing without
polling. The requirement: push a state update on each transition
(`queued → loading → working → done`) and deliver results/artifacts on
completion.
## Decision
When a job is submitted with `state_webhook_url`, foreman POSTs a JSON event to
that URL on every state transition.
### Event payload
```json
{
  "job_id": "01J...",
  "state": "loading",
  "previous_state": "queued",
  "timestamp": "2026-05-23T12:00:00Z",
  "model": "qwen3.6:35b",
  "attempt": 1,
  "error": null,
  "result": null,
  "artifacts": null
}
```
- `state`: one of `queued`, `loading`, `working`, `done`, `failed`.
- On `done`: `result` holds the completion (native-Ollama-shaped) and `artifacts`
holds artifact references (ADR-0006).
- On `failed`: `error` holds a message; `result` is null.
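
On the receiving side, the payload above maps onto a plain struct, and the idempotency requirement reduces to remembering which `job_id` + `state` pairs were already handled. The sketch below is a hypothetical receiver, not part of foreman: `Event`, `Receiver`, and `Handle` are illustrative names, `result` and `artifacts` are left as raw JSON, and the in-memory map is neither durable nor safe for concurrent use.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event mirrors the webhook payload fields listed in this ADR.
type Event struct {
	JobID         string          `json:"job_id"`
	State         string          `json:"state"`
	PreviousState string          `json:"previous_state"`
	Timestamp     time.Time       `json:"timestamp"`
	Model         string          `json:"model"`
	Attempt       int             `json:"attempt"`
	Error         *string         `json:"error"`
	Result        json.RawMessage `json:"result"`
	Artifacts     json.RawMessage `json:"artifacts"`
}

// Receiver remembers which (job_id, state) pairs it has already handled.
type Receiver struct {
	seen map[string]bool
}

func NewReceiver() *Receiver {
	return &Receiver{seen: make(map[string]bool)}
}

// Handle decodes one webhook body and reports whether the event is new.
// A redelivered event returns fresh == false and should be ignored.
func (r *Receiver) Handle(body []byte) (ev Event, fresh bool, err error) {
	if err := json.Unmarshal(body, &ev); err != nil {
		return Event{}, false, err
	}
	key := ev.JobID + "/" + ev.State
	if r.seen[key] {
		return ev, false, nil
	}
	r.seen[key] = true
	return ev, true, nil
}

// sample is the example payload with a made-up job ID.
const sample = `{"job_id":"01JEXAMPLE","state":"loading","previous_state":"queued",
"timestamp":"2026-05-23T12:00:00Z","model":"qwen3.6:35b","attempt":1,
"error":null,"result":null,"artifacts":null}`

func main() {
	r := NewReceiver()
	for i := 0; i < 2; i++ {
		ev, fresh, err := r.Handle([]byte(sample))
		fmt.Println(ev.JobID, ev.State, fresh, err)
	}
}
```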
### Delivery semantics
- **At-least-once.** Callers must be idempotent on `job_id` + `state`. A missed
webhook can always be reconciled via `GET /jobs/{id}` (ADR-0004).
- **Retry with backoff** on non-2xx or connection failure, bounded attempts, then
the event is dropped (the job state itself is unaffected and remains queryable).
- **Ordering is not guaranteed** across retries; `previous_state` + `timestamp`
let callers order/deduplicate.
- **Optional HMAC signing:** if a webhook secret is configured, foreman sends an
`X-Foreman-Signature` header (HMAC-SHA256 of the body) so receivers can verify
authenticity. Off by default; recommended once foreman is reachable beyond a
fully trusted network.
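
The signing step can be sketched with the standard library alone. The ADR specifies HMAC-SHA256 over the body but not how the digest is encoded in `X-Foreman-Signature`; hex encoding is an assumption here, and `sign` / `verify` are hypothetical helper names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns the HMAC-SHA256 of body under secret, hex-encoded.
// Hex is an assumed encoding; the ADR does not specify one.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the MAC and compares it to the header value in constant
// time. A malformed header is rejected.
func verify(secret, body []byte, header string) bool {
	got, err := hex.DecodeString(header)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("webhook-secret")
	body := []byte(`{"job_id":"01JEXAMPLE","state":"done"}`)
	sig := sign(secret, body)
	fmt.Println(verify(secret, body, sig))
	fmt.Println(verify(secret, []byte(`{"tampered":true}`), sig))
}
```

The receiver must verify against the raw request bytes, before any JSON decoding or re-encoding, because a re-serialized body will generally not match the signature.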
## Consequences
- Callers get push observability with a polling fallback.
- Idempotency is pushed onto the caller — documented as a hard requirement.
- Webhook delivery is decoupled from job execution: a flaky receiver never blocks
or fails the job.
## Alternatives considered
- **Polling only.** Simpler for foreman, worse for callers; rejected since
webhooks were an explicit requirement. (Polling is still available as fallback.)
- **WebSocket/SSE for state.** Heavier: both need a long-lived connection per
caller. Streaming is reserved for token output on the sync surface (ADR-0012),
not job-state fan-out.
+53
@@ -0,0 +1,53 @@
# ADR-0006: Artifact handling and transport
**Status:** Accepted — 2026-05-23
## Context
Jobs must "transmit artifacts when done." For a chat completion the obvious
artifact is the assistant's text/tool-call output, but the term is deliberately
broader: a job may produce structured data, multiple named outputs, or content
too large to embed comfortably in a webhook body.
## Decision
An **artifact** is a named, typed blob attached to a completed job:
```json
{ "name": "completion", "content_type": "application/json", "size": 1234,
"inline": { ... }, "url": null }
```
- The primary completion is always emitted as an artifact named `completion`
(the native-Ollama response shape), so there is one consistent access pattern.
- Additional artifacts use distinct names.
### Transport: inline vs fetch
- **Small artifacts** (under a configurable threshold, default ~256 KB) are
delivered **inline** in the `done` webhook (`inline` populated, `url` null) and
in `GET /jobs/{id}`.
- **Large artifacts** exceed the threshold: the webhook/`GET` carries metadata
plus a `url` (`GET /jobs/{id}/artifacts/{name}`), and the bytes are fetched
on demand. This keeps webhook payloads bounded and avoids shipping megabytes
through a callback POST.
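
The inline-versus-fetch decision is a single size comparison. The sketch below assumes the artifact body is valid, non-empty JSON (as in the `completion` case) so it can be embedded directly; `Artifact`, `describeArtifact`, and the choice of a relative URL are illustrative, not confirmed details of foreman.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// Artifact mirrors the reference shape shown in this ADR.
type Artifact struct {
	Name        string          `json:"name"`
	ContentType string          `json:"content_type"`
	Size        int             `json:"size"`
	Inline      json.RawMessage `json:"inline"`
	URL         *string         `json:"url"`
}

// describeArtifact inlines data below the threshold; otherwise it leaves
// Inline empty and points at the artifact fetch endpoint.
func describeArtifact(jobID, name, contentType string, data []byte, threshold int) Artifact {
	a := Artifact{Name: name, ContentType: contentType, Size: len(data)}
	if len(data) < threshold {
		a.Inline = json.RawMessage(data) // assumes data is valid JSON
		return a
	}
	u := "/jobs/" + url.PathEscape(jobID) + "/artifacts/" + url.PathEscape(name)
	a.URL = &u
	return a
}

func main() {
	const threshold = 256 * 1024 // the ADR's default, roughly 256 KB
	small := describeArtifact("01JEXAMPLE", "completion", "application/json",
		[]byte(`{"done":true}`), threshold)
	out, err := json.Marshal(small)
	fmt.Println(string(out), err)

	large := describeArtifact("01JEXAMPLE", "transcript", "application/json",
		make([]byte, threshold+1), threshold)
	fmt.Println(large.Size, *large.URL)
}
```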
### Retention
Artifacts are stored alongside the job in SQLite (ADR-0008) and pruned with the
job after a configurable TTL. No separate blob store in v1; revisit only if
artifact sizes outgrow SQLite comfort (single-digit MB).
## Consequences
- One uniform way to read output (`completion` artifact), extensible to richer
jobs later without protocol changes.
- Webhook bodies stay small; large outputs don't bloat or break delivery.
- A pull endpoint for artifacts means a missed/oversized webhook never loses data.
## Alternatives considered
- **Always inline.** Simple but risks huge webhook bodies and SQLite row bloat in
the hot path. Rejected.
- **External object store (S3/MinIO) from day one.** Over-engineered for the
expected sizes; deferred behind the TTL/threshold knobs.
+48
@@ -0,0 +1,48 @@
# ADR-0007: Model inventory polling and discovery
**Status:** Accepted — 2026-05-23
## Context
foreman needs a "relatively in-sync" view of which models are installed on its
target so it can (a) advertise them to callers, (b) reject jobs for missing
models early instead of failing mid-execution, and (c) know what is currently
resident to inform scheduling (ADR-0009).
## Decision
A background poller queries the target on a configurable interval (default ~30s):
- `GET /api/tags` → the installed-model inventory. Cached in memory; this cache
backs foreman's own `/api/tags` passthrough (ADR-0003) and `/v1/models` if the
OpenAI-compat surface is enabled.
- `GET /api/ps` → which model(s) are currently loaded, their memory footprint
(including how much sits in VRAM), and the unload timer. Used by the scheduler
to decide whether the next job requires a swap.
### Behavior
- **Early validation:** a job naming a model absent from the cached inventory is
rejected at submit time with a clear error. This applies to async jobs as well:
the cached inventory is recent enough for submit-time rejection to be reliable.
A small grace path covers a model installed between polls: on a miss, foreman
re-checks the target once before rejecting.
- **Degraded mode:** if the target is unreachable, the last-known inventory is
retained and foreman marks itself degraded (surfaced on a health endpoint).
Jobs are not rejected wholesale on a single failed poll — the target is a
laptop that may briefly sleep (ADR-0002). Execution-time unreachability is
handled by job retry (ADR-0004).
## Consequences
- Callers can discover available models through the normal Ollama/OpenAI
endpoints; no foreman-specific discovery API needed.
- Bad-model jobs fail fast and cheaply.
- A health/status endpoint exposing degraded state and last-poll time is required.
## Alternatives considered
- **No caching; proxy `/api/tags` live per request.** Simpler but couples every
discovery call to target availability and adds latency. Rejected; the poller
also feeds the scheduler, so the cache is needed regardless.
- **Push/event-based inventory.** Ollama offers no such mechanism; polling is the
only option.
+42
@@ -0,0 +1,42 @@
# ADR-0008: Durable SQLite-backed queue
**Status:** Accepted — 2026-05-23
## Context
Jobs are queued, carry state, and may be retried across target sleep/restart. A
caller that submitted an async job and is waiting on a webhook must not lose its
job because foreman restarted. State must survive process restarts.
## Decision
The job queue and all job state (including artifacts, ADR-0006) live in **SQLite**
in WAL mode, via the pure-Go `modernc.org/sqlite` driver (no CGO, so the Komodo
container build stays trivial).
### Schema sketch
- `jobs(id TEXT PK, state TEXT, model TEXT, request BLOB, result BLOB,
error TEXT, webhook_url TEXT, attempt INT, created_at, updated_at, …)`
- `artifacts(job_id TEXT, name TEXT, content_type TEXT, size INT, inline BLOB,
PRIMARY KEY(job_id, name))`
The access pattern is a single state writer (the worker, ADR-0009) plus the HTTP
handlers, which read job state and insert new rows on submit; WAL handles this
many-readers / light-writer pattern well at this scale.
## Consequences
- Jobs and results are durable across restarts; webhook recovery via
`GET /jobs/{id}` (ADR-0004) is meaningful.
- Pure-Go driver keeps cross-compilation and container builds painless.
- Pruning is a TTL sweep over `jobs`/`artifacts`; no external store to operate.
- SQLite caps practical artifact size at single-digit MB — acceptable per ADR-0006
thresholds; revisit if outputs grow.
## Alternatives considered
- **In-memory queue.** Loses async jobs on restart; unacceptable given webhooks.
- **Redis / external broker.** Another moving part to run for a single-worker
daemon; over-engineered. Rejected.
- **`mattn/go-sqlite3` (CGO).** Faster in some cases but complicates static builds
and container images. Pure-Go preferred for ops simplicity.
@@ -0,0 +1,44 @@
# ADR-0009: Single-worker serialization and drain-by-model scheduling
**Status:** Accepted — 2026-05-23
## Context
The target is memory-bandwidth-bound (the M1 Pro has ~200 GB/s of memory
bandwidth). It runs one model at a time, fast; loading a different model is a
5-10s cold start. Running two models concurrently on 32GB either OOMs or pages
to a 5-10x slowdown. So parallelism against a single target buys nothing and
would reintroduce coordination logic.
## Decision
**Concurrency against the target is 1.** A single worker loop pulls the next job
from the queue, ensures the right model is resident, executes, and records the
result.
**Drain-by-model scheduling:** before incurring a model swap, the worker finishes
every queued job that targets the **currently-resident** model (observed via
`/api/ps`, ADR-0007). Only when no job for the hot model remains does it select a
job for a different model and pay the swap cost.
This is an `ORDER BY (model != current_model), created_at` style selection — a
heuristic, not a scheduler. There is intentionally **no** priority system,
fairness weighting, or capacity budgeting (those sank the predecessor; see
ADR-0001).
Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
to single-resident swap.
## Consequences
- Swap thrash is minimized without any complex scheduling.
- A long run of same-model jobs can delay a different-model job — acceptable for a
background box, and bounded by queue depth. If starvation ever becomes a real
problem, that is a signal to reconsider, not to pre-build fairness.
- Throughput is dominated by how well callers batch work by model.
## Alternatives considered
- **FIFO with naive swapping.** Correct but pays a cold start on every model
change; wasteful when jobs interleave models. Rejected.
- **Priority/fair scheduling.** Explicitly rejected as scope creep (ADR-0001).
+51
@@ -0,0 +1,51 @@
# ADR-0010: Authentication and security boundary
**Status:** Accepted — 2026-05-23
## Context
Ollama itself has no authentication — anyone who can reach `:11434` can drive it.
foreman sits in front of it and is the network-facing component. We need a real
boundary without dragging in an auth framework (the kind of scope creep ADR-0001
guards against).
## Decision
**Primary boundary is the network.** foreman and its Ollama target sit on a
trusted segment: the target's `:11434` is firewalled to foreman only, and/or
both are bound to the Tailscale interface. foreman is **not** exposed through a
public Traefik entrypoint.
**Optional static bearer token.** If a token is configured, foreman validates the
`Authorization: Bearer <token>` header on incoming requests. This reuses headers
that clients already send:
- `go-llm` via `llm.Ollama()` sends no auth (fine on a trusted segment); via
`ollama.New(key, baseURL)` it sends `Authorization: Bearer <key>` — so a
configured foreman token slots straight into the existing provider with no new
code.
- The OpenAI-compat surface (if enabled, ADR-0003) carries the same header.
foreman → target auth: an optional bearer the daemon attaches to its own calls to
Ollama, for the Ollama-Cloud-style case; empty for a local/LAN target.
## Out of scope for v1
- Authentik / SSO. It is painful for service-to-service traffic and adds nothing
over network isolation here.
- Per-caller identities, scopes, rate limiting. Not needed for a single-tenant
homelab daemon.
## Consequences
- Minimal but real security: network isolation always, plus an optional shared
secret that integrates with existing clients for free.
- Webhook authenticity is handled separately by optional HMAC signing (ADR-0005).
- No financial/identity/credential data ever transits foreman; it brokers chat
jobs only.
## Alternatives considered
- **No auth, network-only.** Acceptable on a fully trusted tailnet; the optional
token exists for when foreman's reachability widens.
- **Full auth framework / SSO.** Rejected as scope creep.
@@ -0,0 +1,73 @@
# ADR-0011: Go integration — the `Foreman` interface
**Status:** Accepted — 2026-05-23
## Context
The ultimate goal: use the M1 Pro **simply as a target for `go-llm`**.
**Verified (`v2/constructors.go`, `v2/ollama/ollama.go`):** `llm.OllamaCloud(key,
WithBaseURL(...))` already targets "a private Ollama deployment that requires
auth" — native `/api/chat` + `Authorization: Bearer <key>` against any base URL.
foreman is exactly that on the wire (ADR-0003). So integration needs **no new
provider** — only a clean, intent-revealing seam so call sites say "foreman," not
"Ollama."
`go-llm`'s provider contract (`v2/provider`) is two methods, `Complete` and
`Stream`; a future dedicated provider would implement them.
## Decision
Add a `llm.Foreman(baseURL, apiKey, opts...)` constructor to go-llm that delegates
to the ollama native provider — the ollama translation happens behind the scenes:
```go
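// Foreman returns a client for a foreman daemon at baseURL. In v1 it delegates
// to the native ollama provider; apiKey is sent as a Bearer token.
func Foreman(baseURL, apiKey string, opts ...ClientOption) *Client {
	cfg := &clientConfig{}
	for _, opt := range opts {
		opt(cfg)
	}
	// A base URL supplied through opts overrides the positional argument.
	if cfg.baseURL != "" {
		baseURL = cfg.baseURL
	}
	return NewClient(ollamaProvider.New(apiKey, baseURL))
}

// Usage:
// model := llm.Foreman("http://foreman.orgrimmar:PORT", token).Model("qwen3.6:35b")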
```
`baseURL` is required (foreman has no default public address). This is a
deliberate **seam**: v1 is a pass-through to the `ollama` provider; a dedicated
foreman provider can later replace the delegate to surface job IDs / async state
without changing call sites.
### Three escalating levels
- **Level 0 — `llm.Foreman(...)` (now, the headline goal).** Transparent,
synchronous, full native tool-calling / `think:false` / streaming. Queueing and
model-swap management happen invisibly inside the daemon. Zero provider code.
- **Level 1 — `foreman` client package (when an orchestration caller needs it).**
A synchronous facade over the async `/jobs` surface: given messages, it manages
an ephemeral webhook receiver, blocks until `done`, and returns result +
artifacts (falling back to `GET /jobs/{id}` polling if it can't receive
callbacks). For callers wanting async semantics — surfaced job IDs, no
long-held connection — with a synchronous call signature.
- **Level 2 — dedicated `provider.Provider` (only if needed).** Wraps Level 1 so
foreman is a first-class go-llm backend exposing job IDs / state / artifacts the
plain ollama provider can't. Built only if Level 0 proves insufficient.
## Consequences
- Headline goal met with one constructor and no provider code.
- Call sites are foreman-named and future-proofed by the seam.
- Async ergonomics are available later without forcing webhook plumbing on
callers, and without touching Level-0 users.
## Alternatives considered
- **Just tell users to call `OllamaCloud` with a base URL.** Works identically
today, but leaks the implementation ("it's Ollama") and offers no seam for
future foreman-specific behavior. The named constructor is the requested
"foreman interface."
- **Ship a dedicated provider from day one (Level 2 first).** More code; bypasses
the zero-friction win. Deferred.
+41
@@ -0,0 +1,41 @@
# ADR-0012: Streaming support
**Status:** Accepted — 2026-05-23
## Context
`go-llm`'s provider interface has a `Stream()` method, and Ollama's native
`/api/chat` streams token-by-token by default. The synchronous passthrough
(ADR-0003) must not break streaming clients. Separately, the async `/jobs`
surface (ADR-0004) reports progress via discrete state webhooks, which is a
different granularity than token streaming.
## Decision
- **Sync passthrough: support streaming.** When a `/api/chat` request streams
(Ollama's default unless the request sets `stream: false`), foreman relays the
target's token deltas to the caller as they arrive, in the same chunked
newline-delimited JSON format Ollama uses natively. A streamed job still moves
through the queue; streaming begins once the job reaches `working`, so a job
waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when
its turn comes. go-llm's `Stream()` works against foreman unchanged.
- **Async `/jobs` surface: no token streaming in v1.** Webhooks carry coarse state
transitions (ADR-0005) and the final result/artifacts, not per-token deltas.
Token-level streaming over a fire-and-forget webhook job is deliberately
deferred — it adds a transport (persistent connection or chunked webhook) whose
complexity isn't justified yet.
## Consequences
- Interactive go-llm usage gets real streaming through the transparent surface.
- Orchestration callers get state + final artifacts, which is what they need;
they can use the sync streaming surface directly if they want tokens.
- The job state machine and webhook protocol stay simple (no streaming transport
to design or operate).
## Alternatives considered
- **Stream tokens over the async surface too.** Deferred: requires either a
long-lived connection (defeats the point of async) or chunked-delta webhooks
(complex, rarely needed). Revisit only on a concrete need.
- **No streaming at all.** Would break go-llm's `Stream()` and interactive use on
the very path that is the primary goal. Rejected.
+41
@@ -0,0 +1,41 @@
# foreman — Architecture Decision Records
`foreman` is a small daemon that fronts **one** Ollama target. It turns a single
Ollama instance into a queued, observable job endpoint: it polls the target's
installed models, serializes jobs through the target (managing model swaps),
assigns every job an ID, and reports progress + artifacts via webhooks. It also
ships a Go client so the target is trivial to use from `go-llm`.
It is the deliberately pared-down successor to `peon-overseer`. One daemon, one
worker, one queue. No distributed dispatch, no leases, no fair queueing.
## Index
| ADR | Title | Status |
|-----|-------|--------|
| 0001 | One daemon per Ollama target | Accepted |
| 0002 | Daemon placement and remote target configuration | Accepted |
| 0003 | API surface: native Ollama passthrough vs OpenAI-compat | Accepted |
| 0004 | Async job surface, job IDs, and queued execution | Accepted |
| 0005 | Webhook state-update protocol | Accepted |
| 0006 | Artifact handling and transport | Accepted |
| 0007 | Model inventory polling and discovery | Accepted |
| 0008 | Durable SQLite-backed queue | Accepted |
| 0009 | Single-worker serialization and drain-by-model scheduling | Accepted |
| 0010 | Authentication and security boundary | Accepted |
| 0011 | Go integration: the `Foreman` interface | Accepted |
| 0012 | Streaming support | Accepted |
ADR-0003 was resolved in favor of **native Ollama** as the v1 surface: foreman is,
on the wire, a private authenticated Ollama deployment, so `go-llm` integrates via
a thin `llm.Foreman(baseURL, token)` constructor that delegates to the existing
ollama provider (ADR-0011). OpenAI-compat `/v1` is deferred.
These ADRs refine the API/integration sections of the project `CLAUDE.md`. The
queue, single-worker, drain-by-model, and security guardrails carry forward
unchanged.
## Format
Each ADR: Status, Context, Decision, Consequences, and Alternatives where useful.
One decision per file. Append new ADRs; supersede rather than rewrite.