docs: add README.md and CLAUDE.md

Add project documentation:
- README.md with installation, usage examples, API reference, and project structure
- CLAUDE.md with developer guide, architecture overview, conventions, and issue label docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 11:10:28 -05:00
parent 05ca15b165
commit 49f294e884
2 changed files with 417 additions and 0 deletions
@@ -0,0 +1,164 @@
+# go-extractor Developer Guide
+
+## Project Overview
+
+**Repository:** gitea.stevedudenhoeffer.com/steve/go-extractor
+**Language:** Go 1.24
+**Primary Dependencies:**
+- `github.com/playwright-community/playwright-go` - Browser automation
+- `github.com/go-shiori/go-readability` - Article content extraction
+- `github.com/urfave/cli/v3` - CLI framework (cmd tools only)
+- `golang.org/x/text` - Currency formatting (megamillions)
+
+go-extractor is a browser-based web scraping library. It wraps Playwright to provide a clean Go API for opening pages, selecting DOM elements, extracting content, and interacting with pages.
+
+## Architecture
+
+### Core Types
+
+| File | Type | Purpose |
+|------|------|---------|
+| `browser.go` | `Browser` interface | Opens pages, manages browser lifecycle |
+| `document.go` | `Document` interface | Represents an open page (URL, content, refresh, wait) |
+| `node.go` | `Node` interface | DOM element operations (select, click, type, text, attr) |
+| `nodes.go` | `Nodes` type | Collection of Nodes with bulk operations |
+| `article.go` | `Article` struct | Readability extraction result |
+| `cookiejar.go` | `CookieJar` interface | Cookie storage abstraction |
+| `interactive.go` | `InteractiveBrowser` interface | Low-level mouse/keyboard/screenshot control |
+
+### Implementation Flow
+
+```
+NewBrowser(ctx, opts)
+  └─> initBrowser(opts)           # browser_init.go
+      ├─> playwright.Run()        # start Playwright
+      ├─> bt.Connect() or Launch  # remote or local browser
+      ├─> browser.NewContext()     # with UA, viewport, cookies
+      └─> return browserInitResult
+
+Browser.Open(ctx, url, opts)
+  └─> openPage()                  # playwright.go - creates page, navigates
+      └─> updateCookies()         # sync cookies back to jar
+          └─> newDocument()       # document.go - wraps page as Document
+```
+
+### Site Extractors
+
+Each site extractor in `sites/` follows the same pattern:
+1. `Config` struct with site-specific options and `validate()` method
+2. `DefaultConfig` package-level variable
+3. Methods on Config that take `(ctx, Browser)` and return parsed data
+4. A `cmd/` subdirectory with a CLI tool using `urfave/cli`
+
+## Development Guidelines
+
+### Error Handling
+
+- Always check and propagate errors with `fmt.Errorf("context: %w", err)`
+- Never discard errors with `_ =` unless explicitly intended (like `DeferClose`)
+- Check for nil before calling methods on `SelectFirst()` results — it returns nil if no element matches
+- Place `defer DeferClose(x)` after the error check, not before, so a close is never deferred on a value returned alongside an error
+
+### Testing
+
+- Core types have unit tests using mock implementations in `mock_test.go` and `nodes_test.go`
+- `mockDocument` and `mockNode` implement the interfaces without Playwright
+- Site extractors currently lack mock-based tests — they need HTML fixtures
+- Run tests: `go test ./...`
+- Tests that need a browser should use build tags or skip when Playwright is unavailable
+
+### Adding a New Site Extractor
+
+1. Create `sites/mysite/mysite.go`:
+   ```go
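+   package mysite
+
+   import (
+       "context"
+       "fmt"
+
+       extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
+   )
+
+   // Config holds site-specific options.
+   type Config struct{}
+
+   // DefaultConfig is used when callers supply no overrides.
+   var DefaultConfig = Config{}
+
+   // Result is a placeholder for the data this extractor returns.
+   type Result struct{}
+
+   // validate returns the Config to use; this stub has nothing to normalize.
+   func (c Config) validate() Config { return c }
+
+   // Extract opens the site and returns the parsed Result.
+   func (c Config) Extract(ctx context.Context, b extractor.Browser) (Result, error) {
+       doc, err := b.Open(ctx, "https://mysite.com", extractor.OpenPageOptions{})
+       if err != nil {
+           return Result{}, fmt.Errorf("failed to open page: %w", err)
+       }
+       defer extractor.DeferClose(doc)
+       // ... extract data using doc.Select(), doc.ForEach(), etc.
+       return Result{}, nil
+   }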
+   ```
+
+2. Create `sites/mysite/cmd/mysite/main.go` with a CLI wrapper
+3. Add tests in `sites/mysite/mysite_test.go`
+
+### Browser Options
+
+When creating browsers, understand the option merging behavior:
+- `mergeOptions()` in `browser_init.go` merges variadic `BrowserOptions`
+- String/pointer fields: only overwritten if non-zero
+- Boolean fields: `RequireServer` and `UseLocalOnly` are one-way (only set to true); `ShowBrowser` always overwrites (known issue #16)
+
+### Playwright Server
+
+The library supports connecting to a remote Playwright server:
+- Environment variables: `PLAYWRIGHT_SERVER_ADDRESS_FIREFOX`, `PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM`, `PLAYWRIGHT_SERVER_ADDRESS_WEBKIT`
+- `BrowserOptions.ServerAddress` overrides the env var
+- `RequireServer: true` prevents fallback to local browser
+- `UseLocalOnly: true` skips server connection entirely
+
+### DOM Interaction
+
+Use the `Node` interface for all DOM operations:
+- `Select(selector)` returns `Nodes` (may be empty, but never causes a nil-pointer panic)
+- `SelectFirst(selector)` returns `Node` or **nil** — always nil-check before use
+- `ForEach(selector, fn)` iterates over matching elements
+- `SetAttribute(name, value)` uses JavaScript evaluation — be aware of escaping limitations (see #12)
+
+## Building
+
+```bash
+go build ./...
+go test ./...
+```
+
+CLI tools:
+```bash
+go build ./cmd/browser
+go build ./sites/duckduckgo/cmd/duckduckgo
+```
+
+## Issue Labels
+
+### Priority Labels
+
+| Label | Color | Usage |
+|-------|-------|-------|
+| `priority/critical` | `#B60205` | Showstopper — security vulnerability, data loss, or crash |
+| `priority/high` | `#D93F0B` | Important — significant bug or high-value improvement |
+| `priority/medium` | `#FBCA04` | Normal — standard improvement or non-critical bug |
+| `priority/low` | `#0E8A16` | Nice to have — minor improvement or cleanup |
+
+### Type Labels
+
+| Label | Color | Usage |
+|-------|-------|-------|
+| `type/epic` | `#5319E7` | Parent issue grouping related stories/tasks |
+| `type/task` | `#0075CA` | Concrete implementation work item |
+| `type/refactor` | `#D4C5F9` | Code restructuring without behavior change |
+
+### Category Labels
+
+| Label | Color | Usage |
+|-------|-------|-------|
+| `bug` | `#D73A4A` | Something isn't working correctly |
+| `enhancement` | `#A2EEEF` | New feature or improvement |
+| `security` | `#B60205` | Security-related issue |
+| `testing` | `#BFD4F2` | Test coverage or infrastructure |
+| `documentation` | `#0075CA` | Documentation improvements |
+| `performance` | `#FBCA04` | Performance optimization |
+
+### Hierarchy
+
+- **Epics** (`type/epic`) group related issues. Reference the parent epic with `**Parent:** #N` in sub-task descriptions.
+- **Tasks** (`type/task`) are concrete work items, usually children of an epic.
+- An issue should have exactly one `priority/*` label and one type/category label.
@@ -0,0 +1,253 @@
+# go-extractor
+
+A Go library for browser-based web scraping and content extraction, powered by [Playwright](https://playwright.dev/).
+
+## Features
+
+- **Browser automation** via Playwright (Chromium, Firefox, WebKit)
+- **Readability extraction** — extract article content from any page using Mozilla's readability algorithm
+- **Interactive browser control** — mouse, keyboard, screenshots for remote browser sessions
+- **Cookie management** — load/save cookies from `cookies.txt` files, read-only cookie jars
+- **Remote browser support** — connect to Playwright server instances or fall back to local browsers
+- **Site-specific extractors** for:
+  - DuckDuckGo search (with pagination)
+  - Google search
+  - Powerball lottery results
+  - Mega Millions lottery results
+  - Wegmans grocery prices
+  - AisleGopher grocery prices
+  - archive.ph archival
+  - useragents.me user-agent lookup
+
+## Installation
+
+```bash
+go get gitea.stevedudenhoeffer.com/steve/go-extractor
+```
+
+Playwright browsers must be installed:
+
+```bash
+go run github.com/playwright-community/playwright-go/cmd/playwright install
+```
+
+## Quick Start
+
+### Extract article content from a URL
+
+```go
+package main
+
+import (
+    "context"
+    "fmt"
+    "log"
+
+    extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
+)
+
+func main() {
+    ctx := context.Background()
+
+    browser, err := extractor.NewBrowser(ctx)
+    if err != nil {
+        log.Fatal(err)
+    }
+    defer browser.Close()
+
+    doc, err := browser.Open(ctx, "https://example.com/article", extractor.OpenPageOptions{})
+    if err != nil {
+        log.Fatal(err)
+    }
+    defer doc.Close()
+
+    article, err := extractor.Readability(ctx, doc)
+    if err != nil {
+        log.Fatal(err)
+    }
+
+    fmt.Println("Title:", article.Title)
+    fmt.Println("Content:", article.TextContent)
+}
+```
+
+### Take a screenshot
+
+```go
+data, err := extractor.Screenshot(ctx, "https://example.com", 30*time.Second)
+if err != nil {
+    log.Fatal(err)
+}
+if err := os.WriteFile("screenshot.png", data, 0644); err != nil {
+    log.Fatal(err)
+}
+```
+
+### Search DuckDuckGo
+
+```go
+import "gitea.stevedudenhoeffer.com/steve/go-extractor/sites/duckduckgo"
+
+results, err := duckduckgo.DefaultConfig.Search(ctx, browser, "golang web scraping")
+if err != nil {
+    log.Fatal(err)
+}
+for _, r := range results {
+    fmt.Printf("%s - %s\n", r.Title, r.URL)
+}
+```
+
+### Use with Playwright server
+
+Set environment variables to connect to a remote Playwright instance:
+
+```bash
+export PLAYWRIGHT_SERVER_ADDRESS_FIREFOX=ws://playwright-server:3000
+export PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM=ws://playwright-server:3001
+```
+
+Or pass the address directly:
+
+```go
+browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
+    ServerAddress: "ws://playwright-server:3000",
+    RequireServer: true,  // fail instead of falling back to local
+})
+```
+
+## Browser Options
+
+```go
+extractor.BrowserOptions{
+    UserAgent:     "custom-agent",           // defaults to a recent Firefox UA
+    Browser:       extractor.BrowserFirefox, // or BrowserChromium, BrowserWebKit
+    Timeout:       &timeout,                 // default 30s, 0 for no timeout
+    CookieJar:     jar,                      // load/save cookies automatically
+    ShowBrowser:   true,                     // show browser window (non-headless)
+    Dimensions:    extractor.Size{1280, 720},
+    DarkMode:      true,
+    ServerAddress: "ws://...",               // remote Playwright server
+    RequireServer: true,                     // don't fall back to local browser
+    UseLocalOnly:  true,                     // don't try remote server
+}
+```
+
+## DOM Interaction
+
+Documents and Nodes expose CSS selector-based DOM manipulation:
+
+```go
+// Select elements
+nodes := doc.Select("div.results a")
+first := doc.SelectFirst("h1") // nil if nothing matches; check before use
+
+// Extract text
+text, err := first.Text()
+content, err := first.Content()
+href, err := first.Attr("href")
+
+// Interact
+err = first.Click()
+err = first.Type("hello world")
+
+// Iterate
+err = doc.ForEach("li.item", func(n extractor.Node) error {
+    text, err := n.Text()
+    if err != nil {
+        return err
+    }
+    fmt.Println(text)
+    return nil
+})
+
+// Modify
+err = first.SetHidden(true)
+err = first.SetAttribute("data-processed", "true")
+```
+
+## Cookie Management
+
+Load cookies from a Netscape `cookies.txt` file:
+
+```go
+jar, err := extractor.LoadCookiesFile("cookies.txt")
+if err != nil {
+    log.Fatal(err)
+}
+browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
+    CookieJar: jar,
+})
+```
+
+Use a read-only cookie jar (cookies are loaded but changes aren't saved back):
+
+```go
+roJar := extractor.ReadOnlyCookieJar{Jar: jar}
+```
+
+## Interactive Browser
+
+For remote browser control with mouse/keyboard:
+
+```go
+ib, err := extractor.NewInteractiveBrowser(ctx)
+if err != nil {
+    log.Fatal(err)
+}
+defer ib.Close()
+
+url, err := ib.Navigate("https://example.com")
+err = ib.MouseClick(100, 200, "left")
+err = ib.KeyboardType("search query")
+err = ib.KeyboardPress("Enter")
+screenshot, err := ib.Screenshot(80)  // JPEG quality 0-100
+```
+
+## Command-Line Tools
+
+The `cmd/` and `sites/*/cmd/` directories contain CLI tools:
+
+```bash
+# Extract article from URL
+go run ./cmd/browser https://example.com/article
+
+# Search DuckDuckGo
+go run ./sites/duckduckgo/cmd/duckduckgo "golang tutorial"
+
+# Search Google
+go run ./sites/google/cmd/google "golang tutorial"
+
+# Get Powerball results
+go run ./sites/powerball/cmd/powerball
+
+# Get Mega Millions results
+go run ./sites/megamillions/cmd/megamillions
+
+# Archive a page
+go run ./sites/archive/cmd/archive https://example.com/page
+
+# Get most common user agent
+go run ./sites/useragents/cmd/useragents
+```
+
+## Project Structure
+
+```
+go-extractor/
+├── article.go           # Article struct (readability output)
+├── browser.go           # Browser interface and Playwright implementation
+├── browser_init.go      # Browser initialization and option merging
+├── close.go             # DeferClose helper
+├── cookiejar.go         # Cookie/CookieJar types and ReadOnlyCookieJar
+├── cookies_txt.go       # cookies.txt file parser and staticCookieJar
+├── document.go          # Document interface (page wrapper)
+├── interactive.go       # InteractiveBrowser for remote control
+├── node.go              # Node interface (DOM element wrapper)
+├── nodes.go             # Nodes collection type
+├── playwright.go        # Playwright browser implementation
+├── readability.go       # Readability article extraction
+├── cmd/
+│   └── browser/         # CLI tool for article extraction
+├── sites/
+│   ├── aislegopher/     # AisleGopher price extraction
+│   ├── archive/         # archive.ph integration
+│   ├── duckduckgo/      # DuckDuckGo search
+│   ├── google/          # Google search
+│   ├── megamillions/    # Mega Millions lottery
+│   ├── powerball/       # Powerball lottery
+│   ├── useragents/      # useragents.me lookup
+│   └── wegmans/         # Wegmans price extraction
+└── *_test.go            # Unit tests
+```
+
+## Requirements
+
+- Go 1.24+
+- Playwright browsers installed (`playwright install`)
+- Optional: Playwright server for remote browser execution