diff --git a/CLAUDE.md b/CLAUDE.md
index e8c1545..de196ea 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,7 +4,16 @@ A small, always-on daemon that fronts **one** Ollama target. It turns a single
 Ollama instance into a queued, observable job endpoint: it polls the target's
 installed models, serializes work through the target (managing model swaps),
 assigns every job an ID, and reports progress + artifacts via webhooks. On the
-wire it speaks **native Ollama**, so it doubles as a drop-in `go-llm` target.
+wire it speaks **native Ollama**, so it doubles as a drop-in client target — for
+any Ollama client, and specifically for
+[majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo) (the `go-llm`
+library referenced throughout these docs is now majordomo) and the
+[gadfly](https://gitea.stevedudenhoeffer.com/steve/gadfly) reviewer built on it.
+
+> This is a public, **vibe-coded** project (built largely by an AI agent). Keep
+> that framing honest in the README; don't oversell it. Homelab specifics below
+> (orgrimmar, the Macs, Komodo, Tailscale) are the author's deployment and are
+> illustrative — the daemon itself is generic.
 
 foreman is the deliberately pared-down successor to `peon-overseer`. One daemon,
 one target, one queue. The complexity that sank the predecessor — distributed
@@ -13,6 +22,26 @@ gates — existed to coordinate *multiple* workers and is **out of scope**.
 Resisting that creep is a first-class design goal. See `docs/adr/` for the
 decisions; this file summarizes them.
 
+## Build / test / run
+
+```sh
+go build ./cmd/foreman        # the daemon binary
+go test ./...                 # client/ + internal/* unit tests
+go vet ./... && gofmt -l .    # must be quiet / clean before committing
+```
+
+Run it locally against a real Ollama target (only `FOREMAN_OLLAMA_URL` is
+required; full env reference in `.env.example` and the README table):
+
+```sh
+FOREMAN_OLLAMA_URL=http://mac.tail:11434 go run ./cmd/foreman serve
+curl -s localhost:8080/healthz   # {"status":"ok","degraded":false}
+scripts/pull-models.sh           # pull the recommended roster on the target
+```
+
+Pure-Go only (`modernc.org/sqlite`, no CGO) so Docker/Komodo builds stay trivial
+— keep it that way. The worker loop must never panic: log, mark the job, continue.
+
 ## Topology (ADR-0001, ADR-0002)
 
 ```
@@ -33,12 +62,16 @@ M1 Pro Mac: Ollama only (models on disk, no foreman logic)
 
 1. **Primary — transparent native Ollama passthrough:** `/api/chat`, `/api/tags`,
    `/api/ps`. foreman looks exactly like an Ollama server. Synchronous: calls are
-   queued internally but the HTTP response blocks until completion. SSE streaming
-   supported (ADR-0012). This is the `go-llm` target path.
-2. **Async jobs — `POST /jobs`, `GET /jobs/{id}`:** body is a native-chat payload
+   queued internally but the HTTP response blocks until completion. NDJSON
+   streaming supported (`application/x-ndjson` — Ollama's native wire format, not
+   SSE; ADR-0012). This is the `go-llm` target path.
+2. **Embeddings (bypass the queue) — `/api/embed`, `/api/embeddings`:** proxied
+   directly and concurrently to the always-resident embedder; never touch the
+   queue or worker loop (ADR-0013).
+3. **Async jobs — `POST /jobs`, `GET /jobs/{id}`:** body is a native-chat payload
    plus optional `state_webhook_url`. Returns `202` + `{ "job_id": "" }` immediately.
    For fire-and-forget orchestration callers.
-3. **Optional OpenAI-compat `/v1/chat/completions` + `/v1/models`:** deferred;
+4. **Optional OpenAI-compat `/v1/chat/completions` + `/v1/models`:** deferred;
    added only if a non-go-llm caller needs it.
 
 Job lifecycle: `queued → loading → working → done` (+ terminal `failed`). A
@@ -65,15 +98,18 @@ guard poison jobs). IDs are ULIDs (sortable, timestamped).
   miss). Target unreachable → retain last-known list, mark degraded on a health
   endpoint; do not reject wholesale on a single failed poll.
 
-## Execution (ADR-0009)
+## Execution (ADR-0009, ADR-0013)
 
-- **Concurrency against the target is 1.** A single worker loop pulls a job,
-  ensures the right model is resident, executes, records the result.
-- **Drain-by-model:** finish every queued job for the currently-resident model
-  before paying a swap (`ORDER BY (model != current), created_at`). A heuristic,
-  not a scheduler. No priorities, fairness, or budgets.
-- Pin residency with Ollama `keep_alive`; target runs `OLLAMA_MAX_LOADED_MODELS=1`
-  and `OLLAMA_CONTEXT_LENGTH=8192`+.
+- **Worker-model concurrency against the target is 1.** A single worker loop pulls
+  a job, ensures the right worker model is resident, executes, records the result.
+  Embeddings are not jobs and bypass this loop entirely (ADR-0013).
+- **Drain-by-model:** finish every queued job for the currently-resident worker
+  model before paying a swap (`ORDER BY (model != current), created_at`). A
+  heuristic, not a scheduler. No priorities, fairness, or budgets.
+- **Two resident slots:** target runs `OLLAMA_MAX_LOADED_MODELS=2` — slot 1 is the
+  always-resident embedder (`FOREMAN_EMBED_MODEL`, pinned `keep_alive: -1`,
+  warmed on startup/reconnect); slot 2 is the rotating worker model. Pin the
+  worker with `keep_alive`; set `OLLAMA_CONTEXT_LENGTH=8192`+.
 
 ## Persistence (ADR-0008)
 
@@ -85,13 +121,17 @@ guard poison jobs). IDs are ULIDs (sortable, timestamped).
 foreman serves **any installed model** named in a request; it does not own a
 role→model mapping (the caller picks the model, e.g. go-llm `.Model(...)`).
 
-Recommended roster to pull on the Mac (32GB, ~26-28GB usable, single-resident
-swap):
+Recommended roster to pull on the Mac (32GB; the embedder stays resident in slot
+1, one worker model rotates through slot 2 — ADR-0013):
 
+- **embedder (always resident)** — `nomic-embed-text` (~0.3GB) or
+  `qwen3-embedding:0.6b`; selected via `FOREMAN_EMBED_MODEL`.
 - **parse / data** — `qwen3:14b` (~9GB, structured/JSON output).
-- **agent + code** — `qwen3.6:35b` (MoE, ~3B active, ~20GB, fast tool-calling).
-- Split a dedicated dense coder (`qwen3.6:27b`) off later only if `35b`'s code
-  quality disappoints; it's bandwidth-bound and slow on this Mac.
+- **agent + code** — `qwen3:30b` (Qwen3-30B-A3B MoE, ~3B active, ~19GB, fast
+  tool-calling). This is the default worker model.
+- Add a dedicated dense coder only if `qwen3:30b`'s code quality disappoints:
+  `gpt-oss:20b` (~13GB, faster) or `qwen2.5-coder:32b` (~20GB, higher quality but
+  bandwidth-bound and slow on this Mac).
 - Verify exact tags against the Ollama library before pulling; the registry moves.
 
 ## go-llm integration (ADR-0011)
@@ -100,7 +140,7 @@ Verified: `llm.OllamaCloud(key, WithBaseURL(...))` already targets a private
 authenticated native-Ollama endpoint — which foreman is. Integration is a thin
 constructor, no new provider:
 
-- **Level 0 (now):** `llm.Foreman(baseURL, token).Model("qwen3.6:35b")` — delegates
+- **Level 0 (now):** `llm.Foreman(baseURL, token).Model("qwen3:30b")` — delegates
   to the ollama provider; transparent, synchronous, full tool/think/stream.
 - **Level 1 (later):** a `foreman` client package — synchronous facade over the
   async `/jobs` surface (manages a webhook receiver, blocks to done).
@@ -117,7 +157,7 @@ constructor, no new provider:
 
 ## Stack & conventions
 
-- Go, stdlib `net/http`, minimal deps. SQLite via `modernc.org/sqlite`.
+- Go 1.26, stdlib `net/http`, minimal deps. SQLite via `modernc.org/sqlite`.
 - No UI. HTTP API + small CLI only.
 - Match go-llm house style: standard Go tabs; `camelCase`/`PascalCase`; check
   errors immediately and wrap with `fmt.Errorf("%w: ...", err)`; imports stdlib →
@@ -137,8 +177,8 @@ so a future second backend is additive — but do not build for it now.
 
 - **M0** — native `/api/chat` passthrough + SQLite queue + single-worker loop,
   one model end to end, synchronous.
-- **M1** — model poller + `/api/tags`/`/api/ps`, drain-by-model, async `/jobs` +
-  `state_webhook_url` + artifacts + retry-on-unreachable, the CLI, and the
-  `llm.Foreman()` constructor in go-llm.
+- **M1** — model poller + `/api/tags`/`/api/ps`, drain-by-model, embedding bypass,
+  async `/jobs` + `state_webhook_url` + artifacts + retry-on-unreachable, the CLI,
+  and the `llm.Foreman()` constructor in go-llm.
 - **M2 (later)** — optional OpenAI-compat `/v1`, Level-1 client / dedicated
   provider, metrics.
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..6ab7069
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Steve Dudenhoeffer
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
index 59a496e..b35e5c6 100644
--- a/README.md
+++ b/README.md
@@ -1,12 +1,22 @@
 # foreman
 
-A small, always-on Go daemon that fronts **one** Ollama target. It turns a
+🪓 A small, always-on Go daemon that fronts **one** Ollama target. It turns a
 single Ollama instance into a queued, observable job endpoint: it polls the
 target's installed models, serializes work through the target (managing model
 swaps), assigns every job an ID, and reports progress via webhooks.
 
-On the wire it speaks **native Ollama**, so it doubles as a drop-in `go-llm`
-target.
+On the wire it speaks **native Ollama**, so it doubles as a drop-in target for
+any Ollama client — including [majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo)
+(via its `ollama.Foreman(url, token)` preset) and, through that,
+[gadfly](https://gitea.stevedudenhoeffer.com/steve/gadfly). Point a client at the
+foreman URL instead of the raw Ollama and you get queuing + model-swap
+serialization for free.
+
+> **This is a public, vibe-coded project** (built largely by an AI agent). It runs
+> the author's homelab but is intentionally generic — one daemon, one target, one
+> queue. Treat the homelab specifics in the docs as illustrative, and don't
+> oversell it: it's a deliberately small queue in front of Ollama, not a
+> distributed scheduler.
 
 ## Quickstart
 
@@ -61,3 +71,7 @@ See [`docs/adr/`](docs/adr/) for design decisions. Key points:
 - Single worker loop with drain-by-model scheduling (ADR-0009)
 - Native Ollama passthrough + async `/jobs` surface (ADR-0003, ADR-0004)
 - Embeddings bypass the queue entirely (ADR-0013)
+
+## License
+
+[MIT](LICENSE) © 2026 Steve Dudenhoeffer.
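
Note on the drain-by-model rule in the CLAUDE.md hunk above: the `ORDER BY (model != current), created_at` heuristic can be sketched in memory as below. This is an illustration only, not code from this patch. The `Job` struct, its fields, and `drainOrder` are hypothetical names, and the integer `CreatedAt` stands in for whatever timestamp column the real queue table uses.

```go
package main

import (
	"fmt"
	"sort"
)

// Job is a hypothetical, pared-down queue row; the real schema is not shown in this diff.
type Job struct {
	ID        string
	Model     string
	CreatedAt int64 // assumed monotonic timestamp, oldest = smallest
}

// drainOrder mirrors `ORDER BY (model != current), created_at`:
// jobs for the resident model come first (the boolean is false, which sorts
// before true), and within each group the oldest job comes first.
func drainOrder(queued []Job, current string) []Job {
	out := append([]Job(nil), queued...) // copy so the caller's slice is untouched
	sort.SliceStable(out, func(i, j int) bool {
		iSwap := out[i].Model != current
		jSwap := out[j].Model != current
		if iSwap != jSwap {
			return !iSwap // no-swap jobs sort ahead of jobs that need a model swap
		}
		return out[i].CreatedAt < out[j].CreatedAt
	})
	return out
}

func main() {
	queued := []Job{
		{ID: "a", Model: "qwen3:14b", CreatedAt: 1},
		{ID: "b", Model: "qwen3:30b", CreatedAt: 2},
		{ID: "c", Model: "qwen3:14b", CreatedAt: 3},
		{ID: "d", Model: "qwen3:30b", CreatedAt: 4},
	}
	for _, j := range drainOrder(queued, "qwen3:30b") {
		fmt.Print(j.ID, " ")
	}
	fmt.Println() // prints "b d a c"
}
```

With `qwen3:30b` resident, both of its jobs run before the older `qwen3:14b` job, so the queue pays one swap instead of three. The sketch also shows the cost the doc accepts by calling this a heuristic: nothing bounds how long job `a` waits if new jobs for the resident model keep arriving.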