steve/majordomo

steve 98a2164aba

ci(gadfly): trim the weakest reviewers from the swarm

Drop the four lowest-graded reviewers — m5/qwen3.6:35b-mlx, gemma4:cloud,
gpt-oss:120b-cloud, kimi-k2.7-code:cloud. Removing m5/qwen3.6 takes the last
local Mac out, so this is now a cloud-only fleet of 6 ollama-cloud models;
GADFLY_ENDPOINT_M5 and the m5 concurrency entry are gone and the per-job timeout
drops to 45m. README/CLAUDE.md kept in sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-27 18:07:27 -04:00

CLAUDE.md — majordomo operating manual

majordomo is a clean-slate Go substrate for LLM-backed agents: target-agnostic model access, a parseable model naming / failover / tiering system with health tracking, multimodality, tool calls, structured output, and agents composed from model + system prompt + toolboxes + skills.

Public, vibe-coded project. It is built almost entirely by an AI agent (Claude Code). Keep that framing honest in the README: don't oversell it, and keep the README, support matrix, and examples updated in the same commit as the behavior they describe (that in-sync promise is part of the project's credibility).

North star: majordomo exists to re-architect mort's agentic layer. mort is the first consumer and the design's acceptance test — when a choice is a toss-up, pick what makes mort's tiers, failover chains, toolboxes, and skills cleanest to express. But majordomo itself stays general-purpose and mort-agnostic: no mort types, no Discord, no mort config.

Module & stack

Module: gitea.stevedudenhoeffer.com/steve/majordomo, Go 1.26.
Stdlib-first (ADR-0007): hand-rolled net/http clients for OpenAI(+compat), Anthropic(+compat), Ollama (cloud+local), foreman. The one approved dependency is google.golang.org/genai (Google provider). Anything else needs an ADR. No go-llm, no go-agentkit — importing either is an automatic failure.

Package map (ADR-0001)

majordomo        Registry, Parse, env-DSN loading, chain executor, re-exports
  llm/           canonical contract: Message/Part/Request/Response/Option,
                 Tool/Toolbox, Capabilities, Stream, Model, Provider, errors
  imagegen/      canonical text-to-image contract: Request/Result/Model/
                 Provider (separate from llm; Image = llm.ImagePart)  (ADR-0016)
  health/        clock-injected health tracker (bench/backoff)
  media/         image normalization to target capabilities (sniff real
                 format, downscale, transcode, byte ladder; ErrUnsupported
                 for what can't fit) — chains normalize PER TARGET
  provider/fake/ scriptable in-memory provider for hermetic tests
  provider/openai/    Chat Completions client (+ all OpenAI-compat targets)
  provider/anthropic/ Messages API client (+ Anthropic-compat targets)
  provider/ollama/    one native /api/chat client serving the ollama,
                      ollama-cloud, and foreman built-ins via presets
  provider/llamaswap/ llama-swap proxy: chat delegates to provider/openai,
                      plus management methods + imagegen image client (ADR-0015)
  provider/google/    Gemini on google.golang.org/genai (the one approved
                      dependency; lazy client, raw-JSON-schema tools,
                      ThinkingLevel reasoning, iter.Pull2 streaming)
  agent/         Agent run loop                               (Phase 5)
  skill/         Skill interface + composition                (Phase 6)
  examples/      one runnable example per hard requirement    (Phase 7-8)

Canonical types live in leaf package llm; the root re-exports them via type aliases. Providers import llm, never each other, never the root.

Parse grammar (ADR-0003)

spec    := element ("," element)*       # ordered failover chain
element := target | alias
target  := provider "/" model           # model id VERBATIM after first "/"
alias   := bare token (no slash), expands INLINE, recursively, cycle-checked

Parse("ollama-cloud/minimax-m3:cloud,ollama-cloud/kimi-k2.6:cloud,anthropic/opus-4.8") → try head-to-tail. Appending ,thinking expands the registered alias in place at the tail.
Provider resolution: registry (built-ins, RegisterProvider, eager env) → lazy LLM_{UPPER(name)} env DSN → error.
Single element ≡ chain of one; same Model interface, same semantics.
No reasoning suffixes (:high etc. are NOT stripped — model ids are verbatim). Reasoning effort becomes a request option (provider phases).
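
As an illustration of the grammar above, here is a minimal sketch of chain expansion. It is not majordomo's actual Parse: the names Target and expand, the alias table, and the error messages are all hypothetical, and the real implementation also resolves providers through the registry, which is omitted here.

```go
package main

import (
	"fmt"
	"strings"
)

// Target is one provider/model pair in a failover chain (hypothetical type).
type Target struct {
	Provider string
	Model    string
}

// expand turns a comma-separated spec into an ordered list of targets,
// expanding aliases inline. active holds the aliases currently being
// expanded so that a cycle is reported instead of recursing forever.
func expand(spec string, aliases map[string]string, active map[string]bool) ([]Target, error) {
	var out []Target
	for _, el := range strings.Split(spec, ",") {
		if el == "" {
			return nil, fmt.Errorf("empty element in spec %q", spec)
		}
		// A target contains a slash. Everything after the FIRST slash is
		// the model id, kept verbatim: "a/b/c:high" has model "b/c:high".
		if provider, model, ok := strings.Cut(el, "/"); ok {
			if provider == "" || model == "" {
				return nil, fmt.Errorf("malformed target %q", el)
			}
			out = append(out, Target{Provider: provider, Model: model})
			continue
		}
		// A bare token is an alias: expand it in place, recursively.
		body, ok := aliases[el]
		if !ok {
			return nil, fmt.Errorf("unknown alias %q", el)
		}
		if active[el] {
			return nil, fmt.Errorf("alias cycle through %q", el)
		}
		active[el] = true
		sub, err := expand(body, aliases, active)
		delete(active, el)
		if err != nil {
			return nil, err
		}
		out = append(out, sub...)
	}
	return out, nil
}

func main() {
	// The alias body here is made up for the example.
	aliases := map[string]string{"thinking": "anthropic/opus-4.8"}
	chain, err := expand("ollama-cloud/minimax-m3:cloud,thinking", aliases, map[string]bool{})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, tg := range chain {
		fmt.Printf("%s -> %s\n", tg.Provider, tg.Model)
	}
}
```

Removing an alias from the active set after its expansion returns lets the same alias appear twice in one chain without being mistaken for a cycle.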

LLM_* env-DSN providers (ADR-0004, go-llm parity)

LLM_<NAME>=scheme://[token@]host[/path] — e.g. LLM_M5=foreman://token@foreman-m5.example defines provider m5; then m5/qwen3:30b works in Parse, chains, and aliases. Scheme ∈ {foreman, ollama, ollama-cloud, openai, anthropic, google, gemini, llama-swap} ∪ RegisterScheme. Token = credential; base URL = https://host always — except llama-swap, which builds http://host (local-first; ADR-0015). New() scans the process env eagerly; unknown names also resolve lazily at Parse time (my-prov → LLM_MY_PROV). Malformed entries fail on use, not at startup.
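
The DSN shape described above can be sketched with net/url. This is an assumption-laden illustration, not the library's loader: envName, DSN, and parseDSN are hypothetical names, scheme validation against the registered set is left out, and appending the DSN path to the base URL is a guess at how [/path] is used.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// envName maps a provider name to its env var, e.g. my-prov -> LLM_MY_PROV.
func envName(provider string) string {
	return "LLM_" + strings.ToUpper(strings.ReplaceAll(provider, "-", "_"))
}

// DSN is the parsed form of scheme://[token@]host[/path] (hypothetical type).
type DSN struct {
	Scheme  string
	Token   string
	BaseURL string
}

// parseDSN splits a DSN into scheme, credential, and base URL.
// Why: the raw DSN may carry a credential, so it is kept out of error text.
func parseDSN(raw string) (DSN, error) {
	u, err := url.Parse(raw)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return DSN{}, errors.New("malformed DSN: want scheme://[token@]host[/path]")
	}
	// Base URL is https://host, except llama-swap, which is http://host.
	transport := "https"
	if u.Scheme == "llama-swap" {
		transport = "http"
	}
	token := ""
	if u.User != nil {
		token = u.User.Username()
	}
	// Assumption: an optional path is carried over onto the base URL.
	return DSN{Scheme: u.Scheme, Token: token, BaseURL: transport + "://" + u.Host + u.Path}, nil
}

func main() {
	d, err := parseDSN("foreman://token@foreman-m5.example")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(envName("m5"), d.Scheme, d.BaseURL)
}
```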

Health & failover (ADR-0006, ADR-0008)

Transient (408/429/5xx, timeouts, conn refused/reset, DNS, deadline) vs permanent (400/401/403/404/405/422, model-not-found, ctx.Canceled). Unknown → transient. Classifier overridable.
One transient error → retry same target (default 1 retry). Every failed attempt counts; at threshold (default 2 consecutive) the target is benched for base 5s × 2^n, capped 5m. Success fully resets. Chains skip benched targets; 404 advances penalty-free; auth/malformed fail fast (configurable); exhaustion returns a joined error naming every target.
Empty response = failover. A target that returns without error but with no usable output (no content parts and no tool calls, per Response.IsEmpty; a media/image part counts as content) is treated as a per-target failure (llm.ErrEmptyResponse, classified transient). Unlike an ordinary transient error, it is not retried on the same target, because the model has just produced nothing and another call is expensive: the chain penalizes the target's health and advances immediately. If every target comes back empty, the call fails with ErrChainExhausted rather than returning a hollow "successful" empty completion, so a single flaky model can't silently end an agent run with nothing.
Tracker is in-memory, process-local, clock-injected. No persistence.

House conventions (mirror foreman)

gofmt; check errors immediately and wrap with fmt.Errorf("%w: ..."); imports stdlib → third-party → internal; // Why: doc comments where rationale isn't obvious.
ADRs in docs/adr/, one decision each, append-only, indexed in its README. progress.md gets a dated entry per phase.
Conventional commits (feat:, test:, docs:, chore:, refactor:).
Tests are hermetic: fake provider + fake clock; provider clients test against httptest; no network or credentials in the default suite. Live tests sit behind //go:build live / examples/live/ and skip without their env vars.
.env holds live keys (gitignored, never committed/printed/quoted); .env.example carries placeholders.

Gates (every phase; what CI runs)

go build ./...
go vet ./...
go test -race -count=1 ./...
go mod tidy && git diff --exit-code go.mod go.sum

CI: .gitea/workflows/ci.yaml (Gitea Actions, mirrors foreman). README.md must match reality in the same commit that changes behavior — no aspirational docs; unbuilt features are marked pending in the matrix.

Adversarial review loop (Gadfly)

Ship work through PRs and let Gadfly review it before merge:

Push to a PR, never straight to main. Branch, push, open a PR. .gitea/workflows/adversarial-review.yml runs Gadfly (the standalone agentic adversarial reviewer) — a fleet of 6 ollama-cloud models, each running the 3-lens suite (security, correctness, error-handling). Advisory only; it never blocks the merge.
Wait for Gadfly to finish, then read its output. Don't merge while the review is still running. Each model posts one consolidated comment; weigh every finding on its merits and fix the real ones (Gadfly is a simple system — findings are advisory, so confirm before acting).
Grade the findings back to the Gadfly MCP. For each finding, call mcp__gadfly__record_finding_grade: is_real=true + a severity (trivial|small|medium|high|critical) for a genuine problem, or is_real=false for a false positive; add notes/usefulness when useful. Use mcp__gadfly__list_findings (only_ungraded=true) to find what still needs grading and mcp__gadfly__scoreboard for the per-model rollup. This telemetry is how we measure whether each model earns its keep.

Out of scope (anti-creep)

No persistent store (health is in-memory behind the registry), no observability/metrics stack, no config-file framework beyond LLM_* env DSNs, no CLI beyond examples, no provider-specific features leaking into the canonical API, nothing mort-specific in the library.
