docs: MIT license + public-readiness framing

Add MIT LICENSE (matches gadfly/majordomo, same author).

README + CLAUDE.md: note this is a public, vibe-coded project; clarify the `go-llm` referenced in the docs is now majordomo, and link it + gadfly as the downstream consumers (foreman is a drop-in native-Ollama target via majordomo's ollama.Foreman preset). CLAUDE.md gains a Build / test / run section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 20:30:52 -04:00
parent 7cd7eaff8b
commit 823c0b4ca8
3 changed files with 101 additions and 26 deletions
@@ -4,7 +4,16 @@ A small, always-on daemon that fronts **one** Ollama target. It turns a single
 Ollama instance into a queued, observable job endpoint: it polls the target's
 installed models, serializes work through the target (managing model swaps),
 assigns every job an ID, and reports progress + artifacts via webhooks. On the
-wire it speaks **native Ollama**, so it doubles as a drop-in `go-llm` target.
+wire it speaks **native Ollama**, so it doubles as a drop-in client target — for
+any Ollama client, and specifically for
+[majordomo](https://gitea.stevedudenhoeffer.com/steve/majordomo) (the `go-llm`
+library referenced throughout these docs is now majordomo) and the
+[gadfly](https://gitea.stevedudenhoeffer.com/steve/gadfly) reviewer built on it.
+
+> This is a public, **vibe-coded** project (built largely by an AI agent). Keep
+> that framing honest in the README; don't oversell it. Homelab specifics below
+> (orgrimmar, the Macs, Komodo, Tailscale) are the author's deployment and are
+> illustrative — the daemon itself is generic.

 foreman is the deliberately pared-down successor to `peon-overseer`. One daemon,
 one target, one queue. The complexity that sank the predecessor — distributed
@@ -13,6 +22,26 @@ gates — existed to coordinate *multiple* workers and is **out of scope**.
 Resisting that creep is a first-class design goal. See `docs/adr/` for the
 decisions; this file summarizes them.

+## Build / test / run
+
+```sh
+go build ./cmd/foreman        # the daemon binary
+go test ./...                 # client/ + internal/* unit tests
+go vet ./... && gofmt -l .    # must be quiet / clean before committing
+```
+
+Run it locally against a real Ollama target (only `FOREMAN_OLLAMA_URL` is
+required; full env reference in `.env.example` and the README table):
+
+```sh
+FOREMAN_OLLAMA_URL=http://mac.tail:11434 go run ./cmd/foreman serve
+curl -s localhost:8080/healthz          # {"status":"ok","degraded":false}
+scripts/pull-models.sh                  # pull the recommended roster on the target
+```
+
+Pure-Go only (`modernc.org/sqlite`, no CGO) so Docker/Komodo builds stay trivial
+— keep it that way. The worker loop must never panic: log, mark the job, continue.
+
 ## Topology (ADR-0001, ADR-0002)

 ```
@@ -33,12 +62,16 @@ M1 Pro Mac:  Ollama only  (models on disk, no foreman logic)

 1. **Primary — transparent native Ollama passthrough:** `/api/chat`, `/api/tags`,
   `/api/ps`. foreman looks exactly like an Ollama server. Synchronous: calls are
-   queued internally but the HTTP response blocks until completion. SSE streaming
-   supported (ADR-0012). This is the `go-llm` target path.
-2. **Async jobs — `POST /jobs`, `GET /jobs/{id}`:** body is a native-chat payload
+   queued internally but the HTTP response blocks until completion. NDJSON
+   streaming supported (`application/x-ndjson` — Ollama's native wire format, not
+   SSE; ADR-0012). This is the `go-llm` target path.
+2. **Embeddings (bypass the queue) — `/api/embed`, `/api/embeddings`:** proxied
+   directly and concurrently to the always-resident embedder; never touch the
+   queue or worker loop (ADR-0013).
+3. **Async jobs — `POST /jobs`, `GET /jobs/{id}`:** body is a native-chat payload
   plus optional `state_webhook_url`. Returns `202` + `{ "job_id": "<ulid>" }`
   immediately. For fire-and-forget orchestration callers.
-3. **Optional OpenAI-compat `/v1/chat/completions` + `/v1/models`:** deferred;
+4. **Optional OpenAI-compat `/v1/chat/completions` + `/v1/models`:** deferred;
   added only if a non-go-llm caller needs it.

 Job lifecycle: `queued → loading → working → done` (+ terminal `failed`). A
@@ -65,15 +98,18 @@ guard poison jobs). IDs are ULIDs (sortable, timestamped).
  miss). Target unreachable → retain last-known list, mark degraded on a health
  endpoint; do not reject wholesale on a single failed poll.

-## Execution (ADR-0009)
+## Execution (ADR-0009, ADR-0013)

- **Concurrency against the target is 1.** A single worker loop pulls a job,
-  ensures the right model is resident, executes, records the result.
- **Drain-by-model:** finish every queued job for the currently-resident model
-  before paying a swap (`ORDER BY (model != current), created_at`). A heuristic,
-  not a scheduler. No priorities, fairness, or budgets.
- Pin residency with Ollama `keep_alive`; target runs `OLLAMA_MAX_LOADED_MODELS=1`
-  and `OLLAMA_CONTEXT_LENGTH=8192`+.
+- **Worker-model concurrency against the target is 1.** A single worker loop pulls
+  a job, ensures the right worker model is resident, executes, records the result.
+  Embeddings are not jobs and bypass this loop entirely (ADR-0013).
+- **Drain-by-model:** finish every queued job for the currently-resident worker
+  model before paying a swap (`ORDER BY (model != current), created_at`). A
+  heuristic, not a scheduler. No priorities, fairness, or budgets.
+- **Two resident slots:** target runs `OLLAMA_MAX_LOADED_MODELS=2` — slot 1 is the
+  always-resident embedder (`FOREMAN_EMBED_MODEL`, pinned `keep_alive: -1`,
+  warmed on startup/reconnect); slot 2 is the rotating worker model. Pin the
+  worker with `keep_alive`; set `OLLAMA_CONTEXT_LENGTH=8192`+.

 ## Persistence (ADR-0008)

@@ -85,13 +121,17 @@ guard poison jobs). IDs are ULIDs (sortable, timestamped).

 foreman serves **any installed model** named in a request; it does not own a
 role→model mapping (the caller picks the model, e.g. go-llm `.Model(...)`).
-Recommended roster to pull on the Mac (32GB, ~26-28GB usable, single-resident
-swap):
+Recommended roster to pull on the Mac (32GB; the embedder stays resident in slot
+1, one worker model rotates through slot 2 — ADR-0013):

+- **embedder (always resident)** — `nomic-embed-text` (~0.3GB) or
+  `qwen3-embedding:0.6b`; selected via `FOREMAN_EMBED_MODEL`.
 - **parse / data** — `qwen3:14b` (~9GB, structured/JSON output).
- **agent + code** — `qwen3.6:35b` (MoE, ~3B active, ~20GB, fast tool-calling).
- Split a dedicated dense coder (`qwen3.6:27b`) off later only if `35b`'s code
-  quality disappoints; it's bandwidth-bound and slow on this Mac.
+- **agent + code** — `qwen3:30b` (Qwen3-30B-A3B MoE, ~3B active, ~19GB, fast
+  tool-calling). This is the default worker model.
+- Add a dedicated dense coder only if `qwen3:30b`'s code quality disappoints:
+  `gpt-oss:20b` (~13GB, faster) or `qwen2.5-coder:32b` (~20GB, higher quality but
+  bandwidth-bound and slow on this Mac).
 - Verify exact tags against the Ollama library before pulling; the registry moves.

 ## go-llm integration (ADR-0011)
@@ -100,7 +140,7 @@ Verified: `llm.OllamaCloud(key, WithBaseURL(...))` already targets a private
 authenticated native-Ollama endpoint — which foreman is. Integration is a thin
 constructor, no new provider:

- **Level 0 (now):** `llm.Foreman(baseURL, token).Model("qwen3.6:35b")` — delegates
+- **Level 0 (now):** `llm.Foreman(baseURL, token).Model("qwen3:30b")` — delegates
  to the ollama provider; transparent, synchronous, full tool/think/stream.
 - **Level 1 (later):** a `foreman` client package — synchronous facade over the
  async `/jobs` surface (manages a webhook receiver, blocks to done).
@@ -117,7 +157,7 @@ constructor, no new provider:

 ## Stack & conventions

- Go, stdlib `net/http`, minimal deps. SQLite via `modernc.org/sqlite`.
+- Go 1.26, stdlib `net/http`, minimal deps. SQLite via `modernc.org/sqlite`.
 - No UI. HTTP API + small CLI only.
 - Match go-llm house style: standard Go tabs; `camelCase`/`PascalCase`; check
  errors immediately and wrap with `fmt.Errorf("%w: ...", err)`; imports stdlib →
@@ -137,8 +177,8 @@ so a future second backend is additive — but do not build for it now.

 - **M0** — native `/api/chat` passthrough + SQLite queue + single-worker loop, one
  model end to end, synchronous.
- **M1** — model poller + `/api/tags`/`/api/ps`, drain-by-model, async `/jobs` +
-  `state_webhook_url` + artifacts + retry-on-unreachable, the CLI, and the
-  `llm.Foreman()` constructor in go-llm.
+- **M1** — model poller + `/api/tags`/`/api/ps`, drain-by-model, embedding bypass,
+  async `/jobs` + `state_webhook_url` + artifacts + retry-on-unreachable, the CLI,
+  and the `llm.Foreman()` constructor in go-llm.
 - **M2 (later)** — optional OpenAI-compat `/v1`, Level-1 client / dedicated
  provider, metrics.