P3: meta + primitive tool group (think/now/cite + classify/extract/summarize)

Grow executus/tools into a real generic tool library: - Register(reg): the always-available, zero-config tools — think, now (UTC unless a CurrentTimeProvider is wired), cite (inert unless a CitationStorage is wired). All nil-safe; a light host calls Register and is useful. - RegisterMeta(reg, MetaDeps): the LLM-backed meta tools — classify, extract_entities, summarize — over the llmmeta helper. Budget defaults to the shipped in-memory per-run cap; Files optional; caps default. - Seams moved (interface/type-only, no host coupling): research_providers.go (CurrentTimeProvider/CitationStorage/SearchBudget/PageExtractor/PDFFetcher/…) and file_storage.go (FileStorage + FileDomainMeta). Plus the in-memory budget default (research_defaults.go) and scope_validate.go. calculate deferred (drags github.com/Krognol/go-wolfram + a module-path replace — not worth it in the lean core for one tool). Core go.sum still free of gorm/redis/discordgo/sqlite/wolfram. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 21:00:45 -04:00
parent df95425bb5
commit 1e201550b3
11 changed files with 1802 additions and 17 deletions
@@ -0,0 +1,332 @@
+// Package tools — research provider plumbing for v11.
+//
+// This file declares the narrow interfaces v11's research tools
+// (web_search, read_page, read_video, read_pdf, verify_url, etc.) need
+// at execute time. Production wiring lives in pkg/logic/mort.go and
+// closes over the searcher chain, the extractor / chromedp client, the
+// PDF extractor, and the yt-dlp wrapper.
+//
+// Why narrow interfaces (vs importing pkg/logic/searcher / extractor
+// directly): the same cycle-break pattern used by KVStorage, FileStorage,
+// HTTPConfigProvider — keeps pkg/skilltools/tools free of the wiring
+// layer so tests can stub each dependency. Each provider is nil-safe:
+// the tool surfaces "not configured" at first call rather than failing
+// at registration.
+//
+// Test: each tool under pkg/skilltools/tools/ wired against these
+// interfaces has its own *_test.go using the in-package fakes in
+// research_providers_fakes_test.go.
+package tools
+
+import (
+	"context"
+	"errors"
+	"time"
+)
+
+// PageCache is the narrow surface read_page (and read_pdf) consult to
+// avoid re-fetching the same URL within the cache's TTL. Production
+// wiring bridges this interface to the legacy *cache.Cache held by
+// pkg/logic/query.System so a `.query foo.com` and a
+// `.skill query foo.com` for the same URL share one cache slot.
+//
+// Why a narrow interface (vs importing the cache package directly):
+// same cycle-break pattern as KVStorage / FileStorage / CitationStorage
+// — keeps pkg/skilltools/tools free of the wiring layer. The legacy
+// cache slot key is `sha256(url)`; the production adapter is
+// responsible for hashing so this interface stays clean (raw URL in/out)
+// and skill-tool authors never need to know the slot shape.
+//
+// nil-safe: a tool constructed with a nil PageCache simply skips the
+// cache layer (always treat Get as a miss; Set is a no-op).
+//
+// Test: tests pass a fake PageCache that records Get/Set calls and
+// returns canned hits. See page_cache_test.go for the read_page hit /
+// miss scenarios.
+type PageCache interface {
+	// Get returns the cached body for urlStr and true on hit, or
+	// (nil, false) on miss. Implementations MUST treat any backing-
+	// store error as a miss (best-effort, never fail the caller).
+	Get(ctx context.Context, urlStr string) ([]byte, bool)
+
+	// Set writes body under the slot for urlStr with the supplied TTL.
+	// Implementations MUST swallow backing-store errors (best-effort
+	// caching is correct: a write failure should not propagate to the
+	// agent loop).
+	Set(ctx context.Context, urlStr string, body []byte, ttl time.Duration)
+}
+
+// PageCacheTTL is the default TTL applied by tools that consult a
+// PageCache. Mirrors the legacy `query.pageCacheTTL` constant
+// (1 hour) so a `.query`-warmed slot reads back from a `.skill query`
+// (and vice versa) within the same window.
+//
+// Tools that want a different TTL pass an explicit value to
+// PageCache.Set; this constant is the project default the v11 / v-research
+// tools all use.
+const PageCacheTTL = 1 * time.Hour
+
+// PageExtractor is the narrow surface read_page needs at execute
+// time. The production adapter wraps mort's existing extractor
+// (Ollama web_fetch first, chromedp fallback on JS-heavy pages).
+//
+// nil-safe: a tool constructed with a nil PageExtractor surfaces
+// "not configured" at first call.
+//
+// Why: read_page used to be a thin io.ReadAll over the URL — it
+// missed JS rendering, didn't honour the v6 page cache, and could
+// not surface the underlying provider name. v11 routes through this
+// interface so the production wiring (mort.go) can plug in the
+// existing query-side extractor without exposing query.Agent.
+type PageExtractor interface {
+	// ExtractPage fetches and extracts readable text from urlStr.
+	// Returns the extracted body, a final URL (after any redirects
+	// the extractor followed), the provider name ("ollama" |
+	// "chromedp" | "ytdlp"), and an error.
+	//
+	// The returned body is the FULL extracted text — callers apply
+	// the v10 byte-vs-reference cap before surfacing to the agent.
+	//
+	// bypassCache=true skips any page cache and forces a fresh
+	// extraction. Default false.
+	ExtractPage(ctx context.Context, urlStr string, bypassCache bool) (text string, finalURL string, provider string, err error)
+}
+
+// VideoTranscriber is the narrow surface read_video needs at
+// execute time. Production wiring wraps internal/ytdlp.
+//
+// nil-safe: tool surfaces "not configured" at first call.
+//
+// Why a separate interface from PageExtractor: video is a different
+// shape (transcript + metadata) and a different binary (yt-dlp).
+// Keeping them distinct lets tests stub each independently.
+type VideoTranscriber interface {
+	// ExtractVideoTranscript returns the transcript text and the
+	// best-effort metadata (title, duration in seconds, channel).
+	// Implementations MUST return a non-empty transcript or an
+	// error — empty-transcript success is interpreted by the tool
+	// as a "transcript_unavailable" failure.
+	ExtractVideoTranscript(ctx context.Context, urlStr string) (transcript string, meta VideoMeta, err error)
+}
+
+// VideoMeta is best-effort metadata returned alongside a video
+// transcript. Any field may be empty/zero if the implementation
+// could not extract it.
+type VideoMeta struct {
+	Title           string
+	Channel         string
+	DurationSeconds int
+}
+
+// PDFFetcher is the narrow surface read_pdf needs at execute time.
+// Production wiring uses an HTTP-aware fetcher that HEAD-validates
+// content-type before downloading the body.
+//
+// nil-safe: tool surfaces "not configured" at first call.
+//
+// Why: a tool that just embedded PDF extraction would couple
+// fetching + parsing. Splitting the fetch (allowlist + SSRF +
+// HEAD check) from the extract (page-level parsing) keeps each
+// step testable and lets the same fetcher serve verify_url one
+// day if we want a PDF-aware fast path.
+type PDFFetcher interface {
+	// FetchPDF downloads the PDF at urlStr (after HEAD-validating
+	// content-type) and returns the raw bytes plus the final URL.
+	// HEAD-validation rejects a URL whose Content-Type is not a
+	// PDF mime AND whose path does not end in .pdf.
+	FetchPDF(ctx context.Context, urlStr string) (body []byte, finalURL string, err error)
+}
+
+// PDFExtractor parses PDF bytes into plain text + page count.
+// Production wires internal.ExtractPDFText.
+//
+// Why split from PDFFetcher: tests want to vary the fetch (mock
+// server returning bytes) without rebuilding the extractor.
+type PDFExtractor interface {
+	// ExtractPDFText returns the concatenated plain-text content
+	// of the PDF along with the page count. The caller applies any
+	// per-page cap and the v10 byte-vs-reference cap on the result.
+	ExtractPDFText(ctx context.Context, body []byte, maxPages int) (text string, pageCount int, truncated bool, err error)
+}
+
+// HEADChecker is the narrow surface verify_url needs at execute
+// time. Production wiring uses the same SSRF-pinned transport as
+// http_get so the security envelope is consistent.
+//
+// Why a separate interface (vs reusing HTTPConfigProvider+doHTTP):
+// verify_url's contract is simpler — HEAD only, no body bytes
+// returned, and the agent only cares about reachable / status /
+// final URL / content-type. A bespoke surface lets the production
+// adapter optimise for that path (no body buffer, no body close).
+type HEADChecker interface {
+	// HEAD performs a HEAD request against urlStr (with SSRF +
+	// allowlist enforcement) and returns the final URL after any
+	// redirects, the HTTP status code, and the Content-Type header.
+	// Returns reachable=false with a non-nil err for transport
+	// failures (DNS, TCP, allowlist rejection); reachable=true with
+	// any HTTP status (including 4xx/5xx) is the success shape —
+	// the agent decides whether the URL is "real".
+	HEAD(ctx context.Context, urlStr string) (finalURL string, status int, contentType string, reachable bool, err error)
+}
+
+// CitationStorage is the narrow surface cite() needs at execute
+// time. Production wires *skills.System.Storage(); tests stub.
+//
+// nil-safe: tool surfaces "not configured" at first call.
+//
+// Why a narrow interface (vs importing pkg/logic/skills): same
+// cycle constraint as KVStorage / FileStorage. Production adapter
+// in mort.go bridges to skills.Storage's RecordCitation /
+// ListCitations methods AND a separate URL-history tracker.
+//
+// Two responsibilities, deliberately separate:
+//
+//  1. RecordCitation writes a row into skill_run_sources — this is
+//     the user-visible citations table for the Sources panel and
+//     CSV export. ONLY rows the agent successfully cited via
+//     cite() land here.
+//  2. RecordURLTouch / GetTouchedURLs maintains a per-run set of
+//     URLs the agent has interacted with (web_search results,
+//     read_page input, read_pdf input, read_video input). cite()
+//     reads this set to reject claims for URLs the agent never
+//     touched. This set lives in a different table or scope from
+//     the citations table — it's working state, not a record.
+type CitationStorage interface {
+	// RecordCitation appends one (run_id, url, claim, cited_at)
+	// row to the citations table (skill_run_sources). cited_at is
+	// set by the storage layer to time.Now() when zero. The caller
+	// has already verified the URL is in the touched-URL set
+	// (via GetTouchedURLs); this method is the persistence step.
+	RecordCitation(ctx context.Context, runID, url, claim string) error
+
+	// RecordURLTouch records that the agent has interacted with
+	// `url` during `runID`. Called by web_search (per result),
+	// read_page, read_pdf, and read_video. Idempotent — repeat
+	// calls for the same (run_id, url) are no-ops at the storage
+	// layer.
+	RecordURLTouch(ctx context.Context, runID, url string) error
+
+	// GetTouchedURLs returns the set of URLs the run has
+	// interacted with. Used by cite() to verify that a claim's
+	// URL is one the agent actually visited. Empty for a fresh
+	// run — cite() then rejects every claim with
+	// "url_not_in_run_history".
+	GetTouchedURLs(ctx context.Context, runID string) (map[string]struct{}, error)
+
+	// ListCitations returns all citations recorded for the run, in
+	// insertion order. Powers the /skills/{id}/runs/{run_id}
+	// Sources panel.
+	ListCitations(ctx context.Context, runID string) ([]CitationRow, error)
+}
+
+// CitationRow mirrors the skill_run_sources row shape. Fields
+// match the spec: run_id is implicit in the query, url + claim are
+// what the agent submitted, cited_at is the wall-clock timestamp
+// at insert.
+type CitationRow struct {
+	URL     string
+	Claim   string
+	CitedAt int64 // unix-seconds; storage adapter normalises from time.Time
+}
+
+// CurrentTimeProvider exposes a "now" + per-user timezone lookup.
+// Production wiring closes over the bot's member-config getter.
+//
+// nil-safe: a tool constructed with a nil provider falls back to
+// server-time + UTC (current behaviour of NewNow before v11).
+type CurrentTimeProvider interface {
+	// UserTimezone returns the IANA timezone name configured for
+	// the given Discord member ID, or "" when the member has no
+	// timezone configured. Empty fallback is "UTC".
+	UserTimezone(ctx context.Context, memberID string) string
+}
+
+// SearchBudget is the narrow surface web_search reads at execute
+// time to honour skills.web_search.max_per_run.
+//
+// Production wiring closes over a per-run counter held by the
+// executor. nil-safe: tool falls back to a built-in package
+// counter (process-wide, NOT per-run) — useful for tests but NOT
+// production-correct because budget bleeds across runs. The
+// production adapter MUST be wired.
+type SearchBudget interface {
+	// CheckAndIncrement returns the current count AFTER incrementing
+	// for the given runID, the configured max, and an error when
+	// the call would exceed the cap. The handler returns a clean
+	// "search_budget_exceeded" string on exceed (not an error so
+	// the agent can react).
+	CheckAndIncrement(ctx context.Context, runID, kind string) (count, max int, exceeded bool)
+}
+
+// ResearchConfig is the narrow surface that read_page / read_video /
+// read_pdf / verify_url read at execute time for per-tool budget caps
+// and inline-vs-file_id thresholds. Production wiring closes over
+// the relevant convars.
+//
+// nil-safe: tools fall back to package defaults.
+type ResearchConfig interface {
+	// MaxInlineBytes returns the cap above which extracted text is
+	// persisted as a file_id under run-scope (v10 byte-vs-reference
+	// principle). Default 12 KiB.
+	MaxInlineBytes(ctx context.Context) int
+
+	// PDFMaxPages returns the cap on pages extracted from a PDF
+	// before truncation. Default 50.
+	PDFMaxPages(ctx context.Context) int
+
+	// WebSearchEnabled is the master switch for web_search.
+	WebSearchEnabled(ctx context.Context) bool
+
+	// WebSearchMaxPerRun is the per-run search cap.
+	WebSearchMaxPerRun(ctx context.Context) int
+
+	// ReadPageMaxPerRun is the per-run page-read cap.
+	ReadPageMaxPerRun(ctx context.Context) int
+
+	// VideoMaxPerRun is the per-run video-read cap.
+	VideoMaxPerRun(ctx context.Context) int
+
+	// VerifyURLMaxPerRun is the per-run HEAD-check cap.
+	VerifyURLMaxPerRun(ctx context.Context) int
+
+	// ReadPDFMaxPerRun is the per-run PDF-read cap.
+	ReadPDFMaxPerRun(ctx context.Context) int
+
+	// HTTPGetMaxPerRun (v15.2) is the per-run http_get cap. The agent
+	// otherwise can retry-storm through random URLs and bloat its own
+	// context with each tool result. Default 20.
+	HTTPGetMaxPerRun(ctx context.Context) int
+
+	// HTTPPostMaxPerRun (v15.2) is the per-run http_post cap. Default 20.
+	HTTPPostMaxPerRun(ctx context.Context) int
+
+	// WebSearchAugmentThreshold is the minimum number of primary
+	// (Ollama) results required to skip the secondary (DDG/Brave)
+	// search. When the primary backend returns fewer than this many
+	// results, the augmented searcher also queries the secondary and
+	// merges both result sets. Default 5.
+	WebSearchAugmentThreshold(ctx context.Context) int
+
+	// ReplyChainDepthMax is unused here; placeholder shape for
+	// future per-tool caps. Kept off this interface — callers reach
+	// into the convar reader directly when they need it.
+}
+
+// ErrPageExtractionFailed is the sentinel returned by a PageExtractor
+// when both Ollama and chromedp paths produce empty content.
+var ErrPageExtractionFailed = errors.New("page extraction failed: empty content")
+
+// ErrVideoTranscriptUnavailable is the sentinel returned by a
+// VideoTranscriber when no captions / transcript could be obtained.
+var ErrVideoTranscriptUnavailable = errors.New("video transcript unavailable")
+
+// ErrPDFNotPDF is the sentinel returned by a PDFFetcher when the
+// HEAD response indicates a non-PDF content-type AND the URL path
+// has no .pdf extension. Surfaces a clean "url_is_not_a_pdf"
+// rejection rather than a generic transport error.
+var ErrPDFNotPDF = errors.New("url does not serve a PDF")
+
+// ErrPDFEncrypted is returned by a PDFExtractor when the PDF refuses
+// extraction because it is password-protected. Surfaces a clean
+// "pdf_encrypted" rejection.
+var ErrPDFEncrypted = errors.New("pdf is encrypted")