majordomo/CLAUDE.md

# CLAUDE.md — majordomo operating manual

majordomo is a clean-slate Go substrate for LLM-backed agents:
target-agnostic model access, a parseable model naming / failover / tiering
system with health tracking, multimodality, tool calls, structured output,
and agents composed from model + system prompt + toolboxes + skills.

> **Public, vibe-coded project.** This is built almost entirely by an AI agent
> (Claude Code) and is public. Keep that framing honest in the README — don't
> oversell it — and keep the README/support-matrix/examples updated in the same
> commit as the behavior they describe (that in-sync promise is part of the
> project's credibility).

**North star:** majordomo exists to re-architect mort's agentic layer. mort
is the first consumer and the design's acceptance test — when a choice is a
toss-up, pick what makes mort's tiers, failover chains, toolboxes, and
skills cleanest to express. But majordomo itself stays general-purpose and
mort-agnostic: no mort types, no Discord, no mort config.

## Module & stack

- Module: `gitea.stevedudenhoeffer.com/steve/majordomo`, Go 1.26.
- Stdlib-first (ADR-0007): hand-rolled `net/http` clients for
  OpenAI(+compat), Anthropic(+compat), Ollama (cloud+local), foreman. The
  one approved dependency is `google.golang.org/genai` (Google provider).
  Anything else needs an ADR. No `go-llm`, no `go-agentkit` — importing
  either is an automatic failure.

## Package map (ADR-0001)

```
majordomo        Registry, Parse, env-DSN loading, chain executor, re-exports
  llm/           canonical contract: Message/Part/Request/Response/Option,
                 Tool/Toolbox, Capabilities, Stream, Model, Provider, errors
  imagegen/      canonical text-to-image contract: Request/Result/Model/
                 Provider (separate from llm; Image = llm.ImagePart)  (ADR-0016)
  health/        clock-injected health tracker (bench/backoff)
  media/         image normalization to target capabilities (sniff real
                 format, downscale, transcode, byte ladder; ErrUnsupported
                 for what can't fit) — chains normalize PER TARGET
  provider/fake/ scriptable in-memory provider for hermetic tests
  provider/openai/    Chat Completions client (+ all OpenAI-compat targets)
  provider/anthropic/ Messages API client (+ Anthropic-compat targets)
  provider/ollama/    one native /api/chat client serving the ollama,
                      ollama-cloud, and foreman built-ins via presets
  provider/llamaswap/ llama-swap proxy: chat delegates to provider/openai,
                      plus management methods + imagegen image client (ADR-0015)
  provider/google/    Gemini on google.golang.org/genai (the one approved
                      dependency; lazy client, raw-JSON-schema tools,
                      ThinkingLevel reasoning, iter.Pull2 streaming)
  agent/         Agent run loop                               (Phase 5)
  skill/         Skill interface + composition                (Phase 6)
  examples/      one runnable example per hard requirement    (Phase 7-8)
```

Canonical types live in leaf package `llm`; the root re-exports them via
type aliases. Providers import `llm`, never each other, never the root.

## Parse grammar (ADR-0003)

```
spec    := element ("," element)*       # ordered failover chain
element := target | alias
target  := provider "/" model           # model id VERBATIM after first "/"
alias   := bare token (no slash), expands INLINE, recursively, cycle-checked
```

- `Parse("ollama-cloud/minimax-m3:cloud,ollama-cloud/kimi-k2.6:cloud,anthropic/opus-4.8")`
  → try head-to-tail. Appending `,thinking` expands the registered alias in
  place at the tail.
- Provider resolution: registry (built-ins, RegisterProvider, eager env) →
  lazy `LLM_{UPPER(name)}` env DSN → error.
- Single element ≡ chain of one; same Model interface, same semantics.
- No reasoning suffixes (`:high` etc. are NOT stripped — model ids are
  verbatim). Reasoning effort becomes a request option (provider phases).
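
The grammar above can be sketched as a small recursive expander. This is an illustration only, not majordomo's `Parse`: `Target`, `expand`, the `aliases` map, and the whitespace trimming are all assumptions made for the example, and provider resolution is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Target is one provider/model pair in a failover chain (hypothetical type).
type Target struct {
	Provider string
	Model    string
}

// expand turns a spec into an ordered list of targets. aliases maps a bare
// token to the spec it stands for; active tracks aliases currently being
// expanded so that cycles are rejected instead of recursing forever.
func expand(spec string, aliases map[string]string, active map[string]bool) ([]Target, error) {
	var out []Target
	for _, el := range strings.Split(spec, ",") {
		el = strings.TrimSpace(el) // assumption: surrounding spaces are ignored
		if el == "" {
			return nil, fmt.Errorf("empty element in spec %q", spec)
		}
		// Cut at the FIRST slash only: the rest is the model id, verbatim,
		// so suffixes like ":cloud" or ":high" and later slashes survive.
		if provider, model, ok := strings.Cut(el, "/"); ok {
			if provider == "" || model == "" {
				return nil, fmt.Errorf("malformed target %q", el)
			}
			out = append(out, Target{Provider: provider, Model: model})
			continue
		}
		// No slash: the element is an alias, expanded inline at this position.
		body, known := aliases[el]
		if !known {
			return nil, fmt.Errorf("unknown alias %q", el)
		}
		if active[el] {
			return nil, fmt.Errorf("alias cycle at %q", el)
		}
		active[el] = true
		sub, err := expand(body, aliases, active)
		delete(active, el)
		if err != nil {
			return nil, err
		}
		out = append(out, sub...)
	}
	return out, nil
}

func main() {
	// Hypothetical alias body; the real "thinking" alias is whatever was registered.
	aliases := map[string]string{"thinking": "anthropic/opus-4.8"}
	chain, err := expand("ollama-cloud/minimax-m3:cloud,thinking", aliases, map[string]bool{})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, t := range chain {
		fmt.Printf("%d: provider=%s model=%s\n", i, t.Provider, t.Model)
	}
}
```

Removing the alias from `active` after its expansion finishes lets the same alias appear twice in one chain while still catching true cycles.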

## LLM_* env-DSN providers (ADR-0004, go-llm parity)

`LLM_<NAME>=scheme://[token@]host[/path]` — e.g.
`LLM_M5=foreman://token@foreman-m5.example` defines provider `m5`; then
`m5/qwen3:30b` works in Parse, chains, and aliases. Scheme ∈ {foreman,
ollama, ollama-cloud, openai, anthropic, google, gemini, llama-swap,
llama-swaps} ∪ RegisterScheme. Token = credential; base URL = `https://host`,
**except for `llama-swap`, which builds `http://host` (local-first);
`llama-swaps` is its TLS twin (`https://host`), mirroring redis/rediss
(ADR-0015).** `New()` scans the process env eagerly; unknown names also resolve
lazily at Parse time (`my-prov` → `LLM_MY_PROV`). Malformed entries fail on use,
not at startup.
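
As a rough sketch of the DSN rules above, the snippet below maps a provider name to its env variable and splits a DSN into scheme, token, and base URL using `net/url`. The names `envKey`, `dsn`, and `parseDSN` are hypothetical, appending the DSN path to the base URL is an assumption, and scheme validation against the registered set is omitted.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// envKey maps a provider name to its env variable, e.g. "my-prov" -> "LLM_MY_PROV".
func envKey(name string) string {
	return "LLM_" + strings.ToUpper(strings.ReplaceAll(name, "-", "_"))
}

// dsn is the decoded form of scheme://[token@]host[/path] (hypothetical type).
type dsn struct {
	Scheme  string
	Token   string
	BaseURL string
}

// parseDSN decodes one LLM_* value. Errors deliberately omit the raw value,
// because it may carry a credential.
func parseDSN(raw string) (dsn, error) {
	u, err := url.Parse(raw)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return dsn{}, errors.New("malformed DSN: want scheme://[token@]host[/path]")
	}
	// https everywhere, except llama-swap, which is local-first plain http.
	transport := "https"
	if u.Scheme == "llama-swap" {
		transport = "http"
	}
	d := dsn{Scheme: u.Scheme, BaseURL: transport + "://" + u.Host + u.Path}
	if u.User != nil {
		d.Token = u.User.Username()
	}
	return d, nil
}

func main() {
	d, err := parseDSN("foreman://token@foreman-m5.example")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Print only non-secret fields.
	fmt.Println(envKey("m5"), d.Scheme, d.BaseURL)
}
```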

## Health & failover (ADR-0006, ADR-0008)

- Transient (408/429/5xx, timeouts, conn refused/reset, DNS, deadline) vs
  permanent (400/401/403/404/405/422, model-not-found, ctx.Canceled).
  Unknown → transient. Classifier overridable.
- One transient error → retry same target (default 1 retry). Every failed
  attempt counts toward the bench threshold; at the threshold (default 2
  consecutive failures) the target is benched for base 5s × 2^n, capped 5m.
  Success fully resets. Chains skip
  benched targets; 404 advances penalty-free; auth/malformed fail fast
  (configurable); exhaustion returns a joined error naming every target.
- **Empty response = failover.** A target that returns *without error* but
  with no usable output — no content parts and no tool calls (`Response.IsEmpty`;
  a media/image part counts as content) — is treated as a per-target failure
  (`llm.ErrEmptyResponse`, classified transient). Unlike an ordinary
  transient it is **not** retried on the same target (the model just did
  this; the call is expensive): the chain penalizes health and advances
  immediately. If every target comes back empty the call fails with
  `ErrChainExhausted` rather than a hollow "successful" empty completion, so
  a single flaky model can't silently end an agent run with nothing.
- Tracker is in-memory, process-local, clock-injected. No persistence.

## House conventions (mirror foreman)

- gofmt; check errors immediately and wrap with `fmt.Errorf("%w: ...")`;
  imports stdlib → third-party → internal; `// Why:` doc comments where
  rationale isn't obvious.
- ADRs in `docs/adr/`, one decision each, append-only, indexed in that
  directory's README. `progress.md` gets a dated entry per phase.
- Conventional commits (`feat:`, `test:`, `docs:`, `chore:`, `refactor:`).
- Tests are hermetic: fake provider + fake clock; provider clients test
  against `httptest`; **no network or credentials in the default suite**.
  Live tests sit behind `//go:build live` / `examples/live/` and skip
  without their env vars.
- `.env` holds live keys (gitignored, never committed/printed/quoted);
  `.env.example` carries placeholders.

## Gates (every phase; what CI runs)

```
go build ./...
go vet ./...
go test -race -count=1 ./...
go mod tidy && git diff --exit-code go.mod go.sum
```

CI: `.gitea/workflows/ci.yaml` (Gitea Actions, mirrors foreman). README.md
must match reality in the same commit that changes behavior — no
aspirational docs; unbuilt features are marked pending in the matrix.

## Adversarial review loop (Gadfly)

Ship work through PRs and let Gadfly review it before merge:

- **Push to a PR, never straight to `main`.** Branch, push, open a PR.
  `.gitea/workflows/adversarial-review.yml` runs Gadfly (the standalone
  agentic adversarial reviewer) — a fleet of 6 ollama-cloud models, each
  running the 3-lens suite (security, correctness, error-handling). Advisory
  only; it never blocks the merge.
- **Wait for Gadfly to finish, then read its output.** Don't merge while the
  review is still running. Each model posts one consolidated comment; weigh
  every finding on its merits and fix the real ones (Gadfly is a simple
  system — findings are advisory, so confirm before acting).
- **Grade the findings back to the Gadfly MCP.** For each finding, call
  `mcp__gadfly__record_finding_grade`: `is_real=true` + a `severity`
  (trivial|small|medium|high|critical) for a genuine problem, or
  `is_real=false` for a false positive; add `notes`/`usefulness` when
  useful. Use `mcp__gadfly__list_findings` (`only_ungraded=true`) to find
  what still needs grading and `mcp__gadfly__scoreboard` for the per-model
  rollup. This telemetry is how we measure whether each model earns its keep.

## Out of scope (anti-creep)

No persistent store (health is in-memory behind the registry), no
observability/metrics stack, no config-file framework beyond LLM_* env
DSNs, no CLI beyond examples, no provider-specific features leaking into
the canonical API, nothing mort-specific in the library.