initial commit
@@ -0,0 +1,27 @@
# Compiled binary (cmd/foreman)
/foreman
/dist/
*.exe

# Test & coverage output
*.out
*.test
coverage.*

# SQLite queue + artifacts (local dev data — never commit)
*.db
*.db-wal
*.db-shm
*.sqlite
*.sqlite3

# Local config / secrets (commit .env.example, not .env)
.env
.env.local
*.local

# Editor / OS cruft
.DS_Store
.idea/
.vscode/
*.swp
@@ -0,0 +1,144 @@
# foreman

A small, always-on daemon that fronts **one** Ollama target. It turns a single
Ollama instance into a queued, observable job endpoint: it polls the target's
installed models, serializes work through the target (managing model swaps),
assigns every job an ID, and reports progress + artifacts via webhooks. On the
wire it speaks **native Ollama**, so it doubles as a drop-in `go-llm` target.

foreman is the deliberately pared-down successor to `peon-overseer`. One daemon,
one target, one queue. The complexity that sank the predecessor — distributed
dispatch, claim leases, weighted fair queueing, capacity budgets, eligibility
gates — existed to coordinate *multiple* workers and is **out of scope**.
Resisting that creep is a first-class design goal. See `docs/adr/` for the
decisions; this file summarizes them.

## Topology (ADR-0001, ADR-0002)

```
orgrimmar: foreman (Go binary + SQLite queue + HTTP API + worker loop)
              | HTTP over the trusted VLAN / Tailscale
              v
M1 Pro Mac: Ollama only (models on disk, no foreman logic)
```

- One foreman process per Ollama target, configured by a single base URL
  (default: the Mac's Tailscale address). A second worker = a second foreman.
- foreman runs on the homelab, containerized, deployed via Komodo. The Mac stays
  a dumb appliance.
- The target is a laptop and may sleep. Unreachability is transient/recoverable,
  never fatal (poller degraded mode + job retry below).

## API surfaces (ADR-0003, ADR-0004)

1. **Primary — transparent native Ollama passthrough:** `/api/chat`, `/api/tags`,
   `/api/ps`. foreman looks exactly like an Ollama server. Synchronous: calls are
   queued internally but the HTTP response blocks until completion. SSE streaming
   supported (ADR-0012). This is the `go-llm` target path.
2. **Async jobs — `POST /jobs`, `GET /jobs/{id}`:** body is a native-chat payload
   plus optional `state_webhook_url`. Returns `202` + `{ "job_id": "<ulid>" }`
   immediately. For fire-and-forget orchestration callers.
3. **Optional OpenAI-compat `/v1/chat/completions` + `/v1/models`:** deferred;
   added only if a non-go-llm caller needs it.

Job lifecycle: `queued → loading → working → done` (+ terminal `failed`). A
connection failure to the target re-queues the job with backoff (bounded retries
guard poison jobs). IDs are ULIDs (sortable, timestamped).

## Webhooks & artifacts (ADR-0005, ADR-0006)

- On each state transition, POST a JSON event to `state_webhook_url`
  (`job_id`, `state`, `previous_state`, `timestamp`, `model`, `attempt`, and on
  completion `result` / `artifacts` / `error`).
- At-least-once delivery; callers must be idempotent on `job_id`+`state`; missed
  events reconcile via `GET /jobs/{id}`. Retry with bounded backoff. Optional
  `X-Foreman-Signature` HMAC when a webhook secret is configured.
- Artifacts are named typed blobs; the completion is always artifact `completion`.
  Inline under ~256KB, otherwise fetched via `GET /jobs/{id}/artifacts/{name}`.

## Model inventory (ADR-0007)

- A poller hits the target's `/api/tags` (default ~30s) to keep an in-sync model
  list; backs foreman's `/api/tags` passthrough and job validation.
- `/api/ps` tells foreman what's resident, feeding the scheduler.
- Jobs naming an uninstalled model are rejected at submit time (one re-check on
  miss). Target unreachable → retain last-known list, mark degraded on a health
  endpoint; do not reject wholesale on a single failed poll.

## Execution (ADR-0009)

- **Concurrency against the target is 1.** A single worker loop pulls a job,
  ensures the right model is resident, executes, records the result.
- **Drain-by-model:** finish every queued job for the currently-resident model
  before paying a swap (`ORDER BY (model != current), created_at`). A heuristic,
  not a scheduler. No priorities, fairness, or budgets.
- Pin residency with Ollama `keep_alive`; target runs `OLLAMA_MAX_LOADED_MODELS=1`
  and `OLLAMA_CONTEXT_LENGTH=8192`+.

## Persistence (ADR-0008)

- SQLite, WAL mode, pure-Go `modernc.org/sqlite` (no CGO → trivial Komodo builds).
- `jobs` + `artifacts` tables; single writer (the worker) + HTTP readers. TTL
  sweep for pruning. No external broker.

## Models served

foreman serves **any installed model** named in a request; it does not own a
role→model mapping (the caller picks the model, e.g. go-llm `.Model(...)`).
Recommended roster to pull on the Mac (32GB, ~26-28GB usable, single-resident
swap):

- **parse / data** — `qwen3:14b` (~9GB, structured/JSON output).
- **agent + code** — `qwen3.6:35b` (MoE, ~3B active, ~20GB, fast tool-calling).
- Split a dedicated dense coder (`qwen3.6:27b`) off later only if `35b`'s code
  quality disappoints; it's bandwidth-bound and slow on this Mac.
- Verify exact tags against the Ollama library before pulling; the registry moves.

## go-llm integration (ADR-0011)

Verified: `llm.OllamaCloud(key, WithBaseURL(...))` already targets a private
authenticated native-Ollama endpoint — which foreman is. Integration is a thin
constructor, no new provider:

- **Level 0 (now):** `llm.Foreman(baseURL, token).Model("qwen3.6:35b")` — delegates
  to the ollama provider; transparent, synchronous, full tool/think/stream.
- **Level 1 (later):** a `foreman` client package — synchronous facade over the
  async `/jobs` surface (manages a webhook receiver, blocks to done).
- **Level 2 (if needed):** a dedicated `provider.Provider` surfacing job IDs/state.

## Security (ADR-0010)

- Network is the boundary: target `:11434` firewalled to foreman, and/or both on
  Tailscale. foreman is **not** on a public Traefik entrypoint.
- Optional static bearer: validate `Authorization: Bearer <token>`, which reuses
  the header `go-llm` already sends via the Foreman/OllamaCloud path.
- No Authentik/SSO, no per-caller identities for v1. No financial/identity data
  ever transits foreman.

## Stack & conventions

- Go, stdlib `net/http`, minimal deps. SQLite via `modernc.org/sqlite`.
- No UI. HTTP API + small CLI only.
- Match go-llm house style: standard Go tabs; `camelCase`/`PascalCase`; check
  errors immediately and wrap with `fmt.Errorf("%w: ...", err)`; imports stdlib →
  third-party → internal. The worker loop never panics; it logs, marks the job,
  continues.
- ADRs in `docs/adr/` (one decision each, append/supersede). Living `progress.md`
  at repo root. Repo: `gitea.stevedudenhoeffer.com`.

## Out of scope (anti-creep guardrails — ADR-0001)

Distributed dispatch, multiple workers, claim leases, weighted fair queueing,
capacity budgets, eligibility gates, an auth framework / SSO, a GUI, and managing
more than one target per daemon. Keep the ollama client behind a small interface
so a future second backend is additive — but do not build for it now.

## Milestones

- **M0** — native `/api/chat` passthrough + SQLite queue + single-worker loop, one
  model end to end, synchronous.
- **M1** — model poller + `/api/tags`/`/api/ps`, drain-by-model, async `/jobs` +
  `state_webhook_url` + artifacts + retry-on-unreachable, the CLI, and the
  `llm.Foreman()` constructor in go-llm.
- **M2 (later)** — optional OpenAI-compat `/v1`, Level-1 client / dedicated
  provider, metrics.
@@ -0,0 +1,37 @@
# ADR-0001: One daemon per Ollama target

**Status:** Accepted — 2026-05-23

## Context

`peon-overseer` ballooned because it coordinated *multiple* workers from a
central service: pull-based dispatch, claim leases, weighted fair queueing,
capacity budgets, eligibility gates. All of that complexity existed solely to
arbitrate shared workers. We want none of it back.

The system being built fronts inference hardware (initially the M1 Pro running
Ollama) and exposes it as a managed job endpoint.

## Decision

Each `foreman` process is bound to **exactly one** Ollama target, configured by a
single base URL. One target = one daemon = one queue. There is no cross-daemon
awareness and no shared state between daemons.

If a second worker is added later (the 4090 box, the M5 Max), it gets its own
`foreman` instance. Any fan-out across workers is the concern of a *separate*
higher-level router that talks to multiple foreman instances — explicitly out of
scope here and not to be anticipated in this codebase.

## Consequences

- The daemon is radically simple: one target, one serialized work stream.
- Horizontal scale is "run another daemon," an operational act, not a code change.
- No lease/fairness/budget machinery is permitted in this repo. If a change
  starts to require it, that is the signal that the multi-worker router (a
  different project) is what's actually needed.

## Alternatives considered

- **One daemon managing many targets.** Rejected: reintroduces the scheduling and
  arbitration complexity that sank the predecessor.
@@ -0,0 +1,36 @@
# ADR-0002: Daemon placement and remote target configuration

**Status:** Accepted — 2026-05-23

## Context

The inference box is an M1 Pro MacBook — a laptop, not always-on infrastructure.
The rest of steveternet runs on the homelab and is deployed/managed through
Komodo. We do not want bespoke job-controller logic living on the Mac.

## Decision

`foreman` runs on the homelab (e.g. orgrimmar), containerized and deployed via
Komodo like everything else. It is **given** its Ollama target as a configurable
base URL (default: the Mac's Tailscale address) and reaches it over the network.

The Mac runs Ollama and nothing `foreman`-specific. It stays a dumb appliance.

## Consequences

- Ops consistency: foreman is a normal Komodo-managed container.
- The target URL is config, never hardcoded — swapping the Mac for another
  backend is a config edit (within the one-target-per-daemon rule of ADR-0001).
- The Mac is a laptop and may sleep or change networks. The daemon must treat an
  unreachable target as a transient, recoverable condition (see ADR-0007 for the
  model poller's degraded mode and ADR-0004 for job retry semantics), never as a
  fatal error. Operationally: `caffeinate`/`pmset` keeps the Mac awake; Tailscale
  keeps its address stable.
- Network is now the trust boundary; Ollama has no auth of its own (see ADR-0010).

## Alternatives considered

- **Co-locate foreman on the Mac.** Rejected: contradicts the stated preference to
  keep controller logic off the laptop, and complicates Komodo-based deployment.
  Note that "given a target URL" keeps this reversible — co-location would just be
  pointing the URL at localhost.
@@ -0,0 +1,51 @@
# ADR-0003: API surface — native Ollama passthrough vs OpenAI-compat

**Status:** Accepted — 2026-05-23 (resolved in favor of native Ollama)

## Context

Two goals were in mild tension: the original phrasing asked for an
"OpenAI-compatible API," while the stated ultimate goal is to use the M1 Pro
**simply as a target for `go-llm`**.

`go-llm`'s `v2/CLAUDE.md` Key Design Decision #8 is explicit: its Ollama provider
deliberately uses native `/api/chat`, *not* OpenAI-compat `/v1`, for `think:false`
support, more reliable tool calling, and ~15-20% lower latency.

**Verified in code (`v2/constructors.go`).** `llm.OllamaCloud(apiKey, opts...)`
sends the key as `Authorization: Bearer <key>` over native `/api/chat`, and its
doc comment says to "use `WithBaseURL` to point at a private Ollama deployment
that requires auth." So go-llm *already* has a first-class path for a private,
authenticated, native-Ollama endpoint — exactly what foreman is on the wire.
Choosing OpenAI-compat would push go-llm onto a path its own author rejected, for
no benefit to the primary caller.

## Decision

Native Ollama is **the** surface for v1. foreman speaks native `/api/chat`,
`/api/tags`, and `/api/ps`, optionally behind a Bearer token (ADR-0010). To
go-llm and any Ollama client it is indistinguishable from a private Ollama
deployment.

The synchronous passthrough is transparent: calls are queued internally
(ADR-0009) but the HTTP response blocks until the job completes. Async features
(job IDs, `state_webhook_url`, artifacts) live on a separate `/jobs` surface
(ADR-0004), not bolted onto the passthrough.
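
As a rough illustration of "queued internally but the response blocks", the sketch below enqueues a job and waits on a per-job channel. The `job` type, the in-memory channel queue, and the echo "completion" are all hypothetical stand-ins; foreman's real queue is SQLite-backed (ADR-0008) and nothing here is taken from its code.

```go
package main

import "fmt"

// job is a hypothetical unit of work; done is closed by the worker once
// result has been written.
type job struct {
	prompt string
	result string
	done   chan struct{}
}

// worker drains the queue one job at a time (concurrency 1 against the target).
func worker(queue <-chan *job) {
	for j := range queue {
		j.result = "echo: " + j.prompt // stand-in for the call to the Ollama target
		close(j.done)
	}
}

// handleSync enqueues a job and blocks until the worker finishes it, which is
// what makes the passthrough look synchronous to the caller.
func handleSync(queue chan<- *job, prompt string) string {
	j := &job{prompt: prompt, done: make(chan struct{})}
	queue <- j
	<-j.done
	return j.result
}

func main() {
	queue := make(chan *job, 8)
	go worker(queue)
	fmt.Println(handleSync(queue, "hello"))
}
```

A real handler would presumably also select on the request context so that a disconnected caller does not wait forever; that is left out to keep the sketch short.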

OpenAI-compat `/v1/chat/completions` is **deferred**, added in a later milestone
only if a non-go-llm caller needs it.

## Consequences

- "Set up the Mac as a go-llm target" needs zero provider changes — a thin
  constructor only (ADR-0011).
- Preserves `think:false`, reliable tool calls, and lower latency.
- foreman must faithfully proxy native `/api/chat` semantics, including SSE
  streaming (ADR-0012).

## Alternatives considered

- **OpenAI-compat as primary/only surface.** Matches the original phrasing but
  contradicts go-llm DD#8 and adds nothing for the primary caller. Rejected.
- **Native-only, never add OpenAI-compat.** Fully serves the goal; the secondary
  surface is kept as an option, not a commitment.
@@ -0,0 +1,52 @@
# ADR-0004: Async job surface, job IDs, and queued execution

**Status:** Accepted — 2026-05-23

## Context

The transparent passthrough (ADR-0003) is synchronous: the caller holds an HTTP
connection until the completion returns. That is fine for interactive-length work
and for go-llm, but two needs aren't served by it:

- Long-running jobs held open through Traefik risk idle-connection timeouts.
- Orchestration callers (mort/ratchet/werk-style) want fire-and-forget: submit,
  get an ID back immediately, and be told asynchronously when the work is done.

## Decision

Add a distinct async surface: `POST /jobs`.

- The body carries a chat payload (native-Ollama-shaped, mirroring `/api/chat`)
  plus optional extension fields, notably `state_webhook_url` (ADR-0005).
- foreman enqueues the job, assigns it a **ULID** (sortable, timestamped), and
  immediately returns `202 Accepted` with `{ "job_id": "<ulid>" }`.
- The caller correlates later webhook callbacks to its request via `job_id`.
- `GET /jobs/{id}` returns current state, result, and artifact references for
  polling-style callers or for recovery after a missed webhook.
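
The submit path described above could look roughly like the following `net/http` handler. This is a sketch under assumptions: `submitRequest`, `jobsHandler`, and `submit` are hypothetical names, the ID generator and the enqueue step are injected so the example needs neither a ULID library nor SQLite, and the fixed ID string is a placeholder rather than a valid ULID.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// submitRequest is the native chat payload plus foreman's extension field.
// Only the fields this sketch inspects are declared.
type submitRequest struct {
	Model           string `json:"model"`
	StateWebhookURL string `json:"state_webhook_url,omitempty"`
}

// jobsHandler enqueues the job and answers 202 with its ID without waiting
// for execution. newID and enqueue are hypothetical injection points.
func jobsHandler(newID func() string, enqueue func(id string, req submitRequest)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		var req submitRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Model == "" {
			http.Error(w, "invalid job payload", http.StatusBadRequest)
			return
		}
		id := newID()
		enqueue(id, req)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusAccepted)
		json.NewEncoder(w).Encode(map[string]string{"job_id": id})
	}
}

// submit drives the handler in-process and returns the status and job ID.
func submit(body string) (int, string) {
	h := jobsHandler(
		func() string { return "01JEXAMPLEULID" }, // placeholder, not a real ULID
		func(string, submitRequest) {},            // no-op queue for the sketch
	)
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodPost, "/jobs", strings.NewReader(body)))
	var out map[string]string
	_ = json.Unmarshal(rec.Body.Bytes(), &out) // error bodies are plain text; out stays nil
	return rec.Code, out["job_id"]
}

func main() {
	fmt.Println(submit(`{"model":"qwen3:14b","state_webhook_url":"http://caller.example/hook"}`))
}
```

Model validation against the cached inventory (ADR-0007) would slot in between decoding and enqueueing.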

Every unit of work is a row in the queue (ADR-0008) regardless of which surface
created it; the synchronous passthrough is simply a `/jobs` submission whose
handler blocks on the job's completion instead of returning the ID.

### Job lifecycle

`queued → loading → working → done`, plus terminal `failed`. A job whose target
is unreachable re-enters `queued` with a backoff (it is retryable, never
auto-failed on a connection error — the target is a laptop, ADR-0002). A bounded
retry count guards against poison jobs; exceeding it moves the job to `failed`
with the last error recorded.
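
The retry rule can be sketched as a small pure function. The attempt limit, the delays, and the doubling schedule below are illustrative assumptions; the ADR only states that retries are bounded and backed off, not which numbers or curve foreman uses.

```go
package main

import (
	"fmt"
	"time"
)

type State string

const (
	Queued  State = "queued"
	Loading State = "loading"
	Working State = "working"
	Done    State = "done"
	Failed  State = "failed"
)

// Illustrative values only; not foreman's defaults.
const (
	maxAttempts = 5
	baseDelay   = 2 * time.Second
	maxDelay    = 2 * time.Minute
)

type Job struct {
	State     State
	Attempt   int           // attempts started so far
	LastError string        // most recent failure message
	NotBefore time.Duration // delay before the job may be picked up again
}

// onUnreachable applies the rule after a connection error: re-queue with a
// backoff until the attempt budget is spent, then fail with the last error.
func onUnreachable(j Job, errMsg string) Job {
	j.LastError = errMsg
	if j.Attempt >= maxAttempts {
		j.State = Failed
		return j
	}
	j.State = Queued
	j.NotBefore = backoff(j.Attempt)
	return j
}

// backoff doubles the delay for each attempt after the first, capped at maxDelay.
func backoff(attempt int) time.Duration {
	d := baseDelay
	for i := 1; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

func main() {
	j := onUnreachable(Job{State: Working, Attempt: 1}, "dial tcp: connection refused")
	fmt.Println(j.State, j.NotBefore)
}
```

Only connection errors take this path; an error returned by the model itself would presumably go straight to `failed`.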

## Consequences

- One queue, one execution engine, two entry points (sync passthrough, async
  `/jobs`).
- Job IDs are stable, sortable, and meaningful to correlate webhooks.
- `GET /jobs/{id}` gives at-least-once webhook delivery a recovery path.

## Alternatives considered

- **Reuse the OpenAI response `id` field instead of a separate `/jobs` surface.**
  Workable for sync, but doesn't give async callers an immediate handle before
  completion. The explicit `/jobs` surface is clearer.
- **UUIDv4 for IDs.** Rejected in favor of ULID for natural time-ordering in the
  queue and logs.
@@ -0,0 +1,63 @@
# ADR-0005: Webhook state-update protocol

**Status:** Accepted — 2026-05-23

## Context

Async callers (ADR-0004) need to know how their job is progressing without
polling. The requirement: periodically push state updates
(`queued → loading → working → done`) and deliver results/artifacts on
completion.

## Decision

When a job is submitted with `state_webhook_url`, foreman POSTs a JSON event to
that URL on every state transition.

### Event payload

```json
{
  "job_id": "01J...",
  "state": "loading",
  "previous_state": "queued",
  "timestamp": "2026-05-23T12:00:00Z",
  "model": "qwen3.6:35b",
  "attempt": 1,
  "error": null,
  "result": null,
  "artifacts": null
}
```

- `state`: one of `queued`, `loading`, `working`, `done`, `failed`.
- On `done`: `result` holds the completion (native-Ollama-shaped) and `artifacts`
  holds artifact references (ADR-0006).
- On `failed`: `error` holds a message; `result` is null.
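
One possible Go shape for this payload is sketched below. The `Event` type and its field types are assumptions chosen so that unset `error`, `result`, and `artifacts` marshal to JSON `null` as in the example; the ADR fixes only the JSON field names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event mirrors the webhook payload. A nil *string and a nil json.RawMessage
// both marshal to JSON null, matching the example payload.
type Event struct {
	JobID         string          `json:"job_id"`
	State         string          `json:"state"`
	PreviousState string          `json:"previous_state"`
	Timestamp     time.Time       `json:"timestamp"`
	Model         string          `json:"model"`
	Attempt       int             `json:"attempt"`
	Error         *string         `json:"error"`
	Result        json.RawMessage `json:"result"`
	Artifacts     json.RawMessage `json:"artifacts"`
}

// encode renders the event as the JSON body of the webhook POST.
func encode(e Event) string {
	b, err := json.Marshal(e)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	e := Event{
		JobID:         "01J...",
		State:         "loading",
		PreviousState: "queued",
		Timestamp:     time.Date(2026, 5, 23, 12, 0, 0, 0, time.UTC),
		Model:         "qwen3.6:35b",
		Attempt:       1,
	}
	fmt.Println(encode(e))
}
```

Keeping `result` and `artifacts` as raw JSON lets the native-Ollama response pass through without foreman re-modelling it.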

### Delivery semantics

- **At-least-once.** Callers must be idempotent on `job_id` + `state`. A missed
  webhook can always be reconciled via `GET /jobs/{id}` (ADR-0004).
- **Retry with backoff** on non-2xx or connection failure, bounded attempts, then
  the event is dropped (the job state itself is unaffected and remains queryable).
- **Ordering is not guaranteed** across retries; `previous_state` + `timestamp`
  let callers order/deduplicate.
- **Optional HMAC signing:** if a webhook secret is configured, foreman sends an
  `X-Foreman-Signature` header (HMAC-SHA256 of the body) so receivers can verify
  authenticity. Off by default; recommended once foreman is reachable beyond a
  fully trusted network.
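
For the signing bullet, a minimal sketch of both sides using the standard library follows. The hex encoding of the header value is an assumption (the ADR names only the algorithm), and `sign` and `verify` are hypothetical helper names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns the hex-encoded HMAC-SHA256 of body under secret. Hex is an
// assumed encoding for the X-Foreman-Signature header value.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the MAC on the receiver and compares in constant time.
func verify(secret, body []byte, header string) bool {
	got, err := hex.DecodeString(header)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("example-secret")
	body := []byte(`{"job_id":"01J...","state":"done"}`)
	sig := sign(secret, body)
	fmt.Println(verify(secret, body, sig), verify(secret, []byte("tampered"), sig))
}
```

The receiver has to verify against the raw request bytes before decoding the JSON, since re-serializing the body can change it and invalidate the signature.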

## Consequences

- Callers get push observability with a polling fallback.
- Idempotency is pushed onto the caller — documented as a hard requirement.
- Webhook delivery is decoupled from job execution: a flaky receiver never blocks
  or fails the job.

## Alternatives considered

- **Polling only.** Simpler for foreman, worse for callers; rejected since
  webhooks were an explicit requirement. (Polling is still available as fallback.)
- **WebSocket/SSE for state.** Heavier; SSE is reserved for token streaming on the
  sync surface (ADR-0012), not job-state fan-out.
@@ -0,0 +1,53 @@
# ADR-0006: Artifact handling and transport

**Status:** Accepted — 2026-05-23

## Context

Jobs must "transmit artifacts when done." For a chat completion the obvious
artifact is the assistant's text/tool-call output, but the term is deliberately
broader: a job may produce structured data, multiple named outputs, or content
too large to embed comfortably in a webhook body.

## Decision

An **artifact** is a named, typed blob attached to a completed job:

```json
{ "name": "completion", "content_type": "application/json", "size": 1234,
  "inline": { ... }, "url": null }
```

- The primary completion is always emitted as an artifact named `completion`
  (the native-Ollama response shape), so there is one consistent access pattern.
- Additional artifacts use distinct names.

### Transport: inline vs fetch

- **Small artifacts** (under a configurable threshold, default ~256 KB) are
  delivered **inline** in the `done` webhook (`inline` populated, `url` null) and
  in `GET /jobs/{id}`.
- **Large artifacts** exceed the threshold: the webhook/`GET` carries metadata
  plus a `url` (`GET /jobs/{id}/artifacts/{name}`), and the bytes are fetched
  on demand. This keeps webhook payloads bounded and avoids shipping megabytes
  through a callback POST.
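
The inline-versus-fetch decision reduces to a size check, sketched below. `ArtifactRef` and `describe` are hypothetical names, the threshold constant stands in for the configurable default, and an empty `URL` string plays the role of JSON `null`.

```go
package main

import (
	"fmt"
	"net/url"
)

// inlineThreshold stands in for the configurable default of roughly 256 KB.
const inlineThreshold = 256 * 1024

// ArtifactRef is the metadata carried in the done webhook and GET /jobs/{id}.
// Exactly one of Inline and URL is set.
type ArtifactRef struct {
	Name        string
	ContentType string
	Size        int
	Inline      []byte // populated for small artifacts
	URL         string // fetch path for large artifacts, empty otherwise
}

// describe embeds the bytes when they are under the threshold and otherwise
// points at the per-artifact fetch endpoint.
func describe(jobID, name, contentType string, data []byte, threshold int) ArtifactRef {
	ref := ArtifactRef{Name: name, ContentType: contentType, Size: len(data)}
	if len(data) < threshold {
		ref.Inline = data
		return ref
	}
	ref.URL = fmt.Sprintf("/jobs/%s/artifacts/%s", url.PathEscape(jobID), url.PathEscape(name))
	return ref
}

func main() {
	small := describe("01J...", "completion", "application/json", []byte(`{"done":true}`), inlineThreshold)
	large := describe("01J...", "transcript", "text/plain", make([]byte, 300*1024), inlineThreshold)
	fmt.Println(small.URL == "", large.URL)
}
```

Escaping the artifact name keeps a caller-chosen name from altering the path structure.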

### Retention

Artifacts are stored alongside the job in SQLite (ADR-0008) and pruned with the
job after a configurable TTL. No separate blob store in v1; revisit only if
artifact sizes outgrow SQLite comfort (single-digit MB).

## Consequences

- One uniform way to read output (`completion` artifact), extensible to richer
  jobs later without protocol changes.
- Webhook bodies stay small; large outputs don't bloat or break delivery.
- A pull endpoint for artifacts means a missed/oversized webhook never loses data.

## Alternatives considered

- **Always inline.** Simple but risks huge webhook bodies and SQLite row bloat in
  the hot path. Rejected.
- **External object store (S3/MinIO) from day one.** Over-engineered for the
  expected sizes; deferred behind the TTL/threshold knobs.
@@ -0,0 +1,48 @@
# ADR-0007: Model inventory polling and discovery

**Status:** Accepted — 2026-05-23

## Context

foreman needs a "relatively in-sync" view of which models are installed on its
target so it can (a) advertise them to callers, (b) reject jobs for missing
models early instead of failing mid-execution, and (c) know what is currently
resident to inform scheduling (ADR-0009).

## Decision

A background poller queries the target on a configurable interval (default ~30s):

- `GET /api/tags` → the installed-model inventory. Cached in memory; this cache
  backs foreman's own `/api/tags` passthrough (ADR-0003) and `/v1/models` if the
  OpenAI-compat surface is enabled.
- `GET /api/ps` → which model(s) are currently loaded, their VRAM/where-resident,
  and the unload timer. Used by the scheduler to decide whether the next job
  requires a swap.

### Behavior

- **Early validation:** a job naming a model absent from the cached inventory is
  rejected at submit time with a clear error (and, for async jobs, the inventory
  is recent enough that this is reliable). A small grace path allows a job for a
  model that appears between polls by re-checking once on a miss.
- **Degraded mode:** if the target is unreachable, the last-known inventory is
  retained and foreman marks itself degraded (surfaced on a health endpoint).
  Jobs are not rejected wholesale on a single failed poll — the target is a
  laptop that may briefly sleep (ADR-0002). Execution-time unreachability is
  handled by job retry (ADR-0004).
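
Both behaviors can be sketched with a small cache type. `Inventory`, its fields, and the injected `fetch` function (a stand-in for `GET /api/tags` on the target) are hypothetical; locking between the poller and HTTP handlers is omitted for brevity and would be needed in a real daemon.

```go
package main

import (
	"errors"
	"fmt"
)

// Inventory is a hypothetical in-memory cache of the target's installed models.
type Inventory struct {
	models   map[string]bool
	degraded bool
	fetch    func() ([]string, error) // stand-in for GET /api/tags on the target
}

// refresh replaces the cache on success. On failure it keeps the last-known
// list and marks the inventory degraded.
func (inv *Inventory) refresh() {
	names, err := inv.fetch()
	if err != nil {
		inv.degraded = true
		return
	}
	inv.models = make(map[string]bool, len(names))
	for _, n := range names {
		inv.models[n] = true
	}
	inv.degraded = false
}

var errUnknownModel = errors.New("model not installed on target")

// validate rejects unknown models, re-checking the target once on a miss so
// that a model pulled since the last poll is still accepted.
func (inv *Inventory) validate(model string) error {
	if inv.models[model] {
		return nil
	}
	inv.refresh()
	if inv.models[model] {
		return nil
	}
	return fmt.Errorf("%w: %q", errUnknownModel, model)
}

func main() {
	installed := []string{"qwen3:14b"}
	inv := &Inventory{fetch: func() ([]string, error) { return installed, nil }}
	inv.refresh()
	installed = append(installed, "qwen3.6:35b") // pulled between polls
	fmt.Println(inv.validate("qwen3.6:35b"), inv.validate("missing:1b"))
}
```

Because a failed refresh leaves the old map in place, models that were known before the target went to sleep keep validating while the degraded flag is set.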

## Consequences

- Callers can discover available models through the normal Ollama/OpenAI
  endpoints; no foreman-specific discovery API needed.
- Bad-model jobs fail fast and cheaply.
- A health/status endpoint exposing degraded state and last-poll time is required.

## Alternatives considered

- **No caching; proxy `/api/tags` live per request.** Simpler but couples every
  discovery call to target availability and adds latency. Rejected; the poller
  also feeds the scheduler, so the cache is needed regardless.
- **Push/event-based inventory.** Ollama offers no such mechanism; polling is the
  only option.
@@ -0,0 +1,42 @@
# ADR-0008: Durable SQLite-backed queue

**Status:** Accepted — 2026-05-23

## Context

Jobs are queued, carry state, and may be retried across target sleep/restart. A
caller that submitted an async job and is waiting on a webhook must not lose its
job because foreman restarted. State must survive process restarts.

## Decision

The job queue and all job state (including artifacts, ADR-0006) live in **SQLite**
in WAL mode, via the pure-Go `modernc.org/sqlite` driver (no CGO, so the Komodo
container build stays trivial).

### Schema sketch

- `jobs(id TEXT PK, state TEXT, model TEXT, request BLOB, result BLOB,
  error TEXT, webhook_url TEXT, attempt INT, created_at, updated_at, …)`
- `artifacts(job_id TEXT, name TEXT, content_type TEXT, size INT, inline BLOB,
  PRIMARY KEY(job_id, name))`

A single writer (the worker, ADR-0009) plus the HTTP handlers; WAL handles the
concurrent-reader / single-writer pattern well at this scale.

## Consequences

- Jobs and results are durable across restarts; webhook recovery via
  `GET /jobs/{id}` (ADR-0004) is meaningful.
- Pure-Go driver keeps cross-compilation and container builds painless.
- Pruning is a TTL sweep over `jobs`/`artifacts`; no external store to operate.
- SQLite caps practical artifact size at single-digit MB — acceptable per ADR-0006
  thresholds; revisit if outputs grow.

## Alternatives considered

- **In-memory queue.** Loses async jobs on restart; unacceptable given webhooks.
- **Redis / external broker.** Another moving part to run for a single-worker
  daemon; over-engineered. Rejected.
- **`mattn/go-sqlite3` (CGO).** Faster in some cases but complicates static builds
  and container images. Pure-Go preferred for ops simplicity.
@@ -0,0 +1,44 @@
# ADR-0009: Single-worker serialization and drain-by-model scheduling

**Status:** Accepted — 2026-05-23

## Context

The target is bandwidth-bound (the M1 Pro is ~200 GB/s). It runs one model fast
at a time; loading a different model is a 5-10s cold start. Running two models
concurrently on 32GB either OOMs or pages to a 5-10x slowdown. So parallelism
against a single target buys nothing and would reintroduce coordination logic.

## Decision

**Concurrency against the target is 1.** A single worker loop pulls the next job
from the queue, ensures the right model is resident, executes, and records the
result.

**Drain-by-model scheduling:** before incurring a model swap, the worker finishes
every queued job that targets the **currently-resident** model (observed via
`/api/ps`, ADR-0007). Only when no job for the hot model remains does it select a
job for a different model and pay the swap cost.

This is an `ORDER BY (model != current_model), created_at` style selection — a
heuristic, not a scheduler. There is intentionally **no** priority system,
fairness weighting, or capacity budgeting (those sank the predecessor; see
ADR-0001).
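
The same selection can be expressed in memory as a two-key sort, sketched below. `Job` and `next` are hypothetical names, and using the ID as the age key assumes ULIDs sort in creation order, which holds only to the extent the generator is monotonic. In foreman itself the ordering would presumably live in the SQL query quoted above.

```go
package main

import (
	"fmt"
	"sort"
)

// Job carries only what the selection needs.
type Job struct {
	ID    string // ULID, assumed to sort in creation order
	Model string
}

// next picks the job to run: jobs for the resident model first, oldest first
// within each group, mirroring ORDER BY (model != current_model), created_at.
func next(queue []Job, resident string) (Job, bool) {
	if len(queue) == 0 {
		return Job{}, false
	}
	sorted := append([]Job(nil), queue...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if (a.Model != resident) != (b.Model != resident) {
			return a.Model == resident
		}
		return a.ID < b.ID
	})
	return sorted[0], true
}

func main() {
	queue := []Job{{"01A", "qwen3.6:35b"}, {"01B", "qwen3:14b"}, {"01C", "qwen3:14b"}}
	j, _ := next(queue, "qwen3:14b")
	fmt.Println(j.ID)
}
```

When no queued job matches the resident model, the first key is equal for every job and the choice falls back to plain oldest-first, which is the point at which the swap cost is paid.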

Residency is pinned with Ollama `keep_alive` so the hot model isn't unloaded
between closely-spaced jobs. `OLLAMA_MAX_LOADED_MODELS=1` on the target keeps it
to single-resident swap.

## Consequences

- Swap thrash is minimized without any complex scheduling.
- A long run of same-model jobs can delay a different-model job — acceptable for a
  background box, and bounded by queue depth. If starvation ever becomes a real
  problem, that is a signal to reconsider, not to pre-build fairness.
- Throughput is dominated by how well callers batch work by model.

## Alternatives considered

- **FIFO with naive swapping.** Correct but pays a cold start on every model
  change; wasteful when jobs interleave models. Rejected.
- **Priority/fair scheduling.** Explicitly rejected as scope creep (ADR-0001).
@@ -0,0 +1,51 @@
# ADR-0010: Authentication and security boundary

**Status:** Accepted — 2026-05-23

## Context

Ollama itself has no authentication — anyone who can reach `:11434` can drive it.
foreman sits in front of it and is the network-facing component. We need a real
boundary without dragging in an auth framework (the kind of scope creep ADR-0001
guards against).

## Decision

**Primary boundary is the network.** foreman and its Ollama target sit on a
trusted segment: the target's `:11434` is firewalled to foreman only, and/or
both are bound to the Tailscale interface. foreman is **not** exposed through a
public Traefik entrypoint.

**Optional static bearer token.** If a token is configured, foreman validates the
`Authorization: Bearer <token>` header on incoming requests. This reuses headers
that clients already send:

- `go-llm` via `llm.Ollama()` sends no auth (fine on a trusted segment); via
  `ollama.New(key, baseURL)` it sends `Authorization: Bearer <key>` — so a
  configured foreman token slots straight into the existing provider with no new
  code.
- The OpenAI-compat surface (if enabled, ADR-0003) carries the same header.
|
||||||
|
|
||||||
|
**foreman → target auth.** The daemon can attach an optional bearer token to its
own calls to Ollama for the Ollama-Cloud-style case; the token is left empty for a
local/LAN target.
|
||||||
|
|
||||||
|
## Out of scope for v1
|
||||||
|
|
||||||
|
- Authentik / SSO. It is painful for service-to-service traffic and adds nothing
|
||||||
|
over network isolation here.
|
||||||
|
- Per-caller identities, scopes, rate limiting. Not needed for a single-tenant
|
||||||
|
homelab daemon.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Minimal but real security: network isolation always, plus an optional shared
|
||||||
|
secret that integrates with existing clients for free.
|
||||||
|
- Webhook authenticity is handled separately by optional HMAC signing (ADR-0005).
|
||||||
|
- No financial/identity/credential data ever transits foreman; it brokers chat
|
||||||
|
jobs only.
|
||||||
|
|
||||||
|
## Alternatives considered
|
||||||
|
|
||||||
|
- **No auth, network-only.** Acceptable on a fully trusted tailnet; the optional
|
||||||
|
token exists for when foreman's reachability widens.
|
||||||
|
- **Full auth framework / SSO.** Rejected as scope creep.
|
||||||
@@ -0,0 +1,73 @@
|
|||||||
|
# ADR-0011: Go integration — the `Foreman` interface
|
||||||
|
|
||||||
|
**Status:** Accepted — 2026-05-23
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
The ultimate goal: use the M1 Pro **simply as a target for `go-llm`**.
|
||||||
|
|
||||||
|
**Verified (`v2/constructors.go`, `v2/ollama/ollama.go`):** `llm.OllamaCloud(key,
|
||||||
|
WithBaseURL(...))` already targets "a private Ollama deployment that requires
|
||||||
|
auth" — native `/api/chat` + `Authorization: Bearer <key>` against any base URL.
|
||||||
|
foreman is exactly that on the wire (ADR-0003). So integration needs **no new
|
||||||
|
provider** — only a clean, intent-revealing seam so call sites say "foreman," not
|
||||||
|
"Ollama."
|
||||||
|
|
||||||
|
`go-llm`'s provider contract (`v2/provider`) is two methods, `Complete` and
|
||||||
|
`Stream`; a future dedicated provider would implement them.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
Add a `llm.Foreman(baseURL, apiKey, opts...)` constructor to go-llm that delegates
|
||||||
|
to the ollama native provider — the ollama translation happens behind the scenes:
|
||||||
|
|
||||||
|
```go
// Foreman returns a client that talks to a foreman daemon at baseURL.
// v1 is a pass-through to the existing ollama provider.
func Foreman(baseURL, apiKey string, opts ...ClientOption) *Client {
	cfg := &clientConfig{}
	for _, opt := range opts {
		opt(cfg)
	}
	// A base URL supplied through an option overrides the positional argument.
	if cfg.baseURL != "" {
		baseURL = cfg.baseURL
	}
	return NewClient(ollamaProvider.New(apiKey, baseURL))
}

// model := llm.Foreman("http://foreman.orgrimmar:PORT", token).Model("qwen3.6:35b")
```
|
||||||
|
|
||||||
|
`baseURL` is required (foreman has no default public address). This is a
|
||||||
|
deliberate **seam**: v1 is a pass-through to the `ollama` provider; a dedicated
|
||||||
|
foreman provider can later replace the delegate to surface job IDs / async state
|
||||||
|
without changing call sites.
|
||||||
|
|
||||||
|
### Three escalating levels
|
||||||
|
|
||||||
|
- **Level 0 — `llm.Foreman(...)` (now, the headline goal).** Transparent,
|
||||||
|
synchronous, full native tool-calling / `think:false` / streaming. Queueing and
|
||||||
|
model-swap management happen invisibly inside the daemon. Zero provider code.
|
||||||
|
- **Level 1 — `foreman` client package (when an orchestration caller needs it).**
|
||||||
|
A synchronous facade over the async `/jobs` surface: given messages, it manages
|
||||||
|
an ephemeral webhook receiver, blocks until `done`, and returns result +
|
||||||
|
artifacts (falling back to `GET /jobs/{id}` polling if it can't receive
|
||||||
|
callbacks). For callers wanting async semantics — surfaced job IDs, no
|
||||||
|
long-held connection — with a synchronous call signature.
|
||||||
|
- **Level 2 — dedicated `provider.Provider` (only if needed).** Wraps Level 1 so
|
||||||
|
foreman is a first-class go-llm backend exposing job IDs / state / artifacts the
|
||||||
|
plain ollama provider can't. Built only if Level 0 proves insufficient.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Headline goal met with one constructor and no provider code.
|
||||||
|
- Call sites are foreman-named and future-proofed by the seam.
|
||||||
|
- Async ergonomics are available later without forcing webhook plumbing on
|
||||||
|
callers, and without touching Level-0 users.
|
||||||
|
|
||||||
|
## Alternatives considered
|
||||||
|
|
||||||
|
- **Just tell users to call `OllamaCloud` with a base URL.** Works identically
|
||||||
|
today, but leaks the implementation ("it's Ollama") and offers no seam for
|
||||||
|
future foreman-specific behavior. The named constructor is the requested
|
||||||
|
"foreman interface."
|
||||||
|
- **Ship a dedicated provider from day one (Level 2 first).** More code; bypasses
|
||||||
|
the zero-friction win. Deferred.
|
||||||
@@ -0,0 +1,41 @@
|
|||||||
|
# ADR-0012: Streaming support
|
||||||
|
|
||||||
|
**Status:** Accepted — 2026-05-23
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
`go-llm`'s provider interface has a `Stream()` method, and Ollama's native
|
||||||
|
`/api/chat` streams token-by-token by default. The synchronous passthrough
|
||||||
|
(ADR-0003) must not break streaming clients. Separately, the async `/jobs`
|
||||||
|
surface (ADR-0004) reports progress via discrete state webhooks, which is a
|
||||||
|
different granularity than token streaming.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
- **Sync passthrough: support streaming.** When a `/api/chat` request streams
  (Ollama's default unless the caller sets `stream: false`), foreman relays the
  target's token deltas to the caller as newline-delimited JSON over a chunked
  response, matching Ollama's native streaming. A streamed job still moves
|
||||||
|
through the queue; streaming begins once the job reaches `working`, so a job
|
||||||
|
waiting behind the drain-by-model queue (ADR-0009) simply starts streaming when
|
||||||
|
its turn comes. go-llm's `Stream()` works against foreman unchanged.
|
||||||
|
- **Async `/jobs` surface: no token streaming in v1.** Webhooks carry coarse state
|
||||||
|
transitions (ADR-0005) and the final result/artifacts, not per-token deltas.
|
||||||
|
Token-level streaming over a fire-and-forget webhook job is deliberately
|
||||||
|
deferred — it adds a transport (persistent connection or chunked webhook) whose
|
||||||
|
complexity isn't justified yet.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Interactive go-llm usage gets real streaming through the transparent surface.
|
||||||
|
- Orchestration callers get state + final artifacts, which is what they need;
|
||||||
|
they can use the sync streaming surface directly if they want tokens.
|
||||||
|
- The job state machine and webhook protocol stay simple (no streaming transport
|
||||||
|
to design or operate).
|
||||||
|
|
||||||
|
## Alternatives considered
|
||||||
|
|
||||||
|
- **Stream tokens over the async surface too.** Deferred: requires either a
|
||||||
|
long-lived connection (defeats the point of async) or chunked-delta webhooks
|
||||||
|
(complex, rarely needed). Revisit only on a concrete need.
|
||||||
|
- **No streaming at all.** Would break go-llm's `Stream()` and interactive use on
|
||||||
|
the very path that is the primary goal. Rejected.
|
||||||
@@ -0,0 +1,41 @@
|
|||||||
|
# foreman — Architecture Decision Records
|
||||||
|
|
||||||
|
`foreman` is a small daemon that fronts **one** Ollama target. It turns a single
|
||||||
|
Ollama instance into a queued, observable job endpoint: it polls the target's
|
||||||
|
installed models, serializes jobs through the target (managing model swaps),
|
||||||
|
assigns every job an ID, and reports progress + artifacts via webhooks. It also
|
||||||
|
ships a Go client so the target is trivial to use from `go-llm`.
|
||||||
|
|
||||||
|
It is the deliberately pared-down successor to `peon-overseer`. One daemon, one
|
||||||
|
worker, one queue. No distributed dispatch, no leases, no fair queueing.
|
||||||
|
|
||||||
|
## Index
|
||||||
|
|
||||||
|
| ADR | Title | Status |
|
||||||
|
|-----|-------|--------|
|
||||||
|
| 0001 | One daemon per Ollama target | Accepted |
|
||||||
|
| 0002 | Daemon placement and remote target configuration | Accepted |
|
||||||
|
| 0003 | API surface: native Ollama passthrough vs OpenAI-compat | Accepted |
|
||||||
|
| 0004 | Async job surface, job IDs, and queued execution | Accepted |
|
||||||
|
| 0005 | Webhook state-update protocol | Accepted |
|
||||||
|
| 0006 | Artifact handling and transport | Accepted |
|
||||||
|
| 0007 | Model inventory polling and discovery | Accepted |
|
||||||
|
| 0008 | Durable SQLite-backed queue | Accepted |
|
||||||
|
| 0009 | Single-worker serialization and drain-by-model scheduling | Accepted |
|
||||||
|
| 0010 | Authentication and security boundary | Accepted |
|
||||||
|
| 0011 | Go integration: the `Foreman` interface | Accepted |
|
||||||
|
| 0012 | Streaming support | Accepted |
|
||||||
|
|
||||||
|
ADR-0003 was resolved in favor of **native Ollama** as the v1 surface: foreman is,
|
||||||
|
on the wire, a private authenticated Ollama deployment, so `go-llm` integrates via
|
||||||
|
a thin `llm.Foreman(baseURL, token)` constructor that delegates to the existing
|
||||||
|
ollama provider (ADR-0011). OpenAI-compat `/v1` is deferred.
|
||||||
|
|
||||||
|
These ADRs refine the API/integration sections of the project `CLAUDE.md`. The
|
||||||
|
queue, single-worker, drain-by-model, and security guardrails carry forward
|
||||||
|
unchanged.
|
||||||
|
|
||||||
|
## Format
|
||||||
|
|
||||||
|
Each ADR: Status, Context, Decision, Consequences, and Alternatives where useful.
|
||||||
|
One decision per file. Append new ADRs; supersede rather than rewrite.
|
||||||