# go-extractor Developer Guide

## Project Overview

**Repository:** gitea.stevedudenhoeffer.com/steve/go-extractor

**Language:** Go 1.24

**Primary Dependencies:**
- `github.com/playwright-community/playwright-go` - Browser automation
- `github.com/go-shiori/go-readability` - Article content extraction
- `github.com/urfave/cli/v3` - CLI framework (cmd tools only)
- `golang.org/x/text` - Currency formatting (megamillions)

go-extractor is a browser-based web scraping library. It wraps Playwright to provide a clean Go API for opening pages, selecting DOM elements, extracting content, and interacting with pages.

## Architecture

### Core Types

| File | Type | Purpose |
|------|------|---------|
| `browser.go` | `Browser` interface | Opens pages, manages browser lifecycle |
| `document.go` | `Document` interface | Represents an open page (URL, content, refresh, wait) |
| `node.go` | `Node` interface | DOM element operations (select, click, type, text, attr) |
| `nodes.go` | `Nodes` type | Collection of Nodes with bulk operations |
| `article.go` | `Article` struct | Readability extraction result |
| `cookiejar.go` | `CookieJar` interface | Cookie storage abstraction |
| `interactive.go` | `InteractiveBrowser` interface | Low-level mouse/keyboard/screenshot control |

### Implementation Flow

```
NewBrowser(ctx, opts)
  └─> initBrowser(opts)                # browser_init.go
        ├─> playwright.Run()           # start Playwright
        ├─> bt.Connect() or Launch     # remote or local browser
        ├─> browser.NewContext()       # with UA, viewport, cookies
        └─> return browserInitResult

Browser.Open(ctx, url, opts)
  └─> openPage()                       # playwright.go - creates page, navigates
  └─> updateCookies()                  # sync cookies back to jar
  └─> newDocument()                    # document.go - wraps page as Document
```

### Site Extractors

Each site extractor in `sites/` follows the same pattern:

1. `Config` struct with site-specific options and `validate()` method
2. `DefaultConfig` package-level variable
3.
   Methods on Config that take `(ctx, Browser)` and return parsed data
4. A `cmd/` subdirectory with a CLI tool using `urfave/cli`

## Development Guidelines

### Error Handling

- Always check and propagate errors with `fmt.Errorf("context: %w", err)`
- Never discard errors with `_ =` unless explicitly intended (like `DeferClose`)
- Check for nil before calling methods on `SelectFirst()` results: it returns nil if no element matches
- Place `defer DeferClose(x)` after the error check, not before it

### Testing

- Core types have unit tests using mock implementations in `mock_test.go` and `nodes_test.go`
- `mockDocument` and `mockNode` implement the interfaces without Playwright
- Site extractors currently lack mock-based tests; they need HTML fixtures
- Run tests: `go test ./...`
- Tests that need a browser should use build tags or skip when Playwright is unavailable

### Adding a New Site Extractor

1. Create `sites/mysite/mysite.go`:

   ```go
   package mysite

   import (
       "context"
       "fmt"

       // Import path assumed to match the repository location.
       extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
   )

   type Config struct{}

   var DefaultConfig = Config{}

   // Result is a placeholder for the data this extractor returns.
   type Result struct{}

   func (c Config) validate() Config { return c }

   func (c Config) Extract(ctx context.Context, b extractor.Browser) (Result, error) {
       doc, err := b.Open(ctx, "https://mysite.com", extractor.OpenPageOptions{})
       if err != nil {
           return Result{}, fmt.Errorf("failed to open page: %w", err)
       }
       defer extractor.DeferClose(doc) // deferred only after the error check

       // ... extract data using doc.Select(), doc.ForEach(), etc.

       return Result{}, nil
   }
   ```

2. Create `sites/mysite/cmd/mysite/main.go` with a CLI wrapper
3.
   Add tests in `sites/mysite/mysite_test.go`

### Browser Options

When creating browsers, understand the option merging behavior:

- `mergeOptions()` in `browser_init.go` merges variadic `BrowserOptions`
- String/pointer fields: only overwritten if non-zero
- Boolean fields: `RequireServer` and `UseLocalOnly` are one-way (only set to true); `ShowBrowser` always overwrites (known issue #16)

### Playwright Server

The library supports connecting to a remote Playwright server:

- Environment variables: `PLAYWRIGHT_SERVER_ADDRESS_FIREFOX`, `PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM`, `PLAYWRIGHT_SERVER_ADDRESS_WEBKIT`
- `BrowserOptions.ServerAddress` overrides the env var
- `RequireServer: true` prevents fallback to local browser
- `UseLocalOnly: true` skips server connection entirely

### DOM Interaction

Use the `Node` interface for all DOM operations:

- `Select(selector)` returns `Nodes` (may be empty, but never causes a nil panic)
- `SelectFirst(selector)` returns `Node` or **nil**; always nil-check before use
- `ForEach(selector, fn)` iterates over matching elements
- `SetAttribute(name, value)` uses JavaScript evaluation, so be aware of escaping limitations (see #12)

## Building

```bash
go build ./...
go test ./...
```

CLI tools:

```bash
go build ./cmd/browser
go build ./sites/duckduckgo/cmd/duckduckgo
```

## Issue Labels

### Priority Labels

| Label | Color | Usage |
|-------|-------|-------|
| `priority/critical` | `#B60205` | Showstopper: security vulnerability, data loss, or crash |
| `priority/high` | `#D93F0B` | Important: significant bug or high-value improvement |
| `priority/medium` | `#FBCA04` | Normal: standard improvement or non-critical bug |
| `priority/low` | `#0E8A16` | Nice to have: minor improvement or cleanup |

### Type Labels

| Label | Color | Usage |
|-------|-------|-------|
| `type/epic` | `#5319E7` | Parent issue grouping related stories/tasks |
| `type/task` | `#0075CA` | Concrete implementation work item |
| `type/refactor` | `#D4C5F9` | Code restructuring without behavior change |

### Category Labels

| Label | Color | Usage |
|-------|-------|-------|
| `bug` | `#D73A4A` | Something isn't working correctly |
| `enhancement` | `#A2EEEF` | New feature or improvement |
| `security` | `#B60205` | Security-related issue |
| `testing` | `#BFD4F2` | Test coverage or infrastructure |
| `documentation` | `#0075CA` | Documentation improvements |
| `performance` | `#FBCA04` | Performance optimization |

### Hierarchy

- **Epics** (`type/epic`) group related issues. Reference the parent epic with `**Parent:** #N` in sub-task descriptions.
- **Tasks** (`type/task`) are concrete work items, usually children of an epic.
- An issue should have exactly one `priority/*` label and one type/category label.
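
The labeling rule above can be checked mechanically. The following is a minimal sketch, assuming the rule means exactly one `priority/*` label plus exactly one label drawn from the type and category tables. `checkLabels` and `typeOrCategory` are hypothetical names used for illustration; they are not part of go-extractor or of any issue tracker API.

```go
package main

import (
	"fmt"
	"strings"
)

// typeOrCategory lists the type and category labels from the tables above.
var typeOrCategory = map[string]bool{
	"type/epic": true, "type/task": true, "type/refactor": true,
	"bug": true, "enhancement": true, "security": true,
	"testing": true, "documentation": true, "performance": true,
}

// checkLabels is a hypothetical helper (not part of go-extractor).
// It returns nil when the label set contains exactly one priority/* label
// and exactly one type or category label, and an error otherwise.
func checkLabels(labels []string) error {
	var priorities, kinds int
	for _, l := range labels {
		switch {
		case strings.HasPrefix(l, "priority/"):
			priorities++
		case typeOrCategory[l]:
			kinds++
		}
	}
	if priorities != 1 {
		return fmt.Errorf("want exactly one priority/* label, got %d", priorities)
	}
	if kinds != 1 {
		return fmt.Errorf("want exactly one type or category label, got %d", kinds)
	}
	return nil
}

func main() {
	// A well-formed label set.
	fmt.Println(checkLabels([]string{"priority/high", "bug"}))
	// Two priority labels violate the rule.
	fmt.Println(checkLabels([]string{"priority/high", "priority/low", "bug"}))
	// A missing priority label violates the rule.
	fmt.Println(checkLabels([]string{"type/task"}))
}
```

If the project instead allows a type label and a category label together (for example `type/task` plus `testing`), the second check would need to be relaxed accordingly.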