diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..b89371a
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,164 @@
+# go-extractor Developer Guide
+
+## Project Overview
+
+**Repository:** gitea.stevedudenhoeffer.com/steve/go-extractor
+**Language:** Go 1.24
+**Primary Dependencies:**
+- `github.com/playwright-community/playwright-go` - Browser automation
+- `github.com/go-shiori/go-readability` - Article content extraction
+- `github.com/urfave/cli/v3` - CLI framework (cmd tools only)
+- `golang.org/x/text` - Currency formatting (megamillions)
+
+go-extractor is a browser-based web scraping library. It wraps Playwright to provide a clean Go API for opening pages, selecting DOM elements, extracting content, and interacting with pages.
+
+## Architecture
+
+### Core Types
+
+| File | Type | Purpose |
+|------|------|---------|
+| `browser.go` | `Browser` interface | Opens pages, manages browser lifecycle |
+| `document.go` | `Document` interface | Represents an open page (URL, content, refresh, wait) |
+| `node.go` | `Node` interface | DOM element operations (select, click, type, text, attr) |
+| `nodes.go` | `Nodes` type | Collection of Nodes with bulk operations |
+| `article.go` | `Article` struct | Readability extraction result |
+| `cookiejar.go` | `CookieJar` interface | Cookie storage abstraction |
+| `interactive.go` | `InteractiveBrowser` interface | Low-level mouse/keyboard/screenshot control |
+
+### Implementation Flow
+
+```
+NewBrowser(ctx, opts)
+  └─> initBrowser(opts)              # browser_init.go
+        ├─> playwright.Run()         # start Playwright
+        ├─> bt.Connect() or Launch   # remote or local browser
+        ├─> browser.NewContext()     # with UA, viewport, cookies
+        └─> return browserInitResult
+
+Browser.Open(ctx, url, opts)
+  └─> openPage()                     # playwright.go - creates page, navigates
+  └─> updateCookies()                # sync cookies back to jar
+  └─> newDocument()                  # document.go - wraps page as Document
+```
+
+### Site Extractors
+
+Each site extractor in `sites/` follows the same pattern:
+1. `Config` struct with site-specific options and `validate()` method
+2. `DefaultConfig` package-level variable
+3. Methods on Config that take `(ctx, Browser)` and return parsed data
+4. A `cmd/` subdirectory with a CLI tool using `urfave/cli`
+
+## Development Guidelines
+
+### Error Handling
+
+- Always check and propagate errors with `fmt.Errorf("context: %w", err)`
+- Never discard errors with `_ =` unless explicitly intended (like `DeferClose`)
+- Check for nil before calling methods on `SelectFirst()` results — it returns nil if no element matches
+- Move `defer DeferClose(x)` after the error check, not before
+
+### Testing
+
+- Core types have unit tests using mock implementations in `mock_test.go` and `nodes_test.go`
+- `mockDocument` and `mockNode` implement the interfaces without Playwright
+- Site extractors currently lack mock-based tests — they need HTML fixtures
+- Run tests: `go test ./...`
+- Tests that need a browser should use build tags or skip when Playwright is unavailable
+
+### Adding a New Site Extractor
+
+1. Create `sites/mysite/mysite.go`:
+   ```go
+   package mysite
+
+   type Config struct{}
+   var DefaultConfig = Config{}
+
+   func (c Config) validate() Config { return c }
+
+   func (c Config) Extract(ctx context.Context, b extractor.Browser) (Result, error) {
+   	doc, err := b.Open(ctx, "https://mysite.com", extractor.OpenPageOptions{})
+   	if err != nil {
+   		return Result{}, fmt.Errorf("failed to open page: %w", err)
+   	}
+   	defer extractor.DeferClose(doc)
+   	// ... extract data using doc.Select(), doc.ForEach(), etc.
+   }
+   ```
+
+2. Create `sites/mysite/cmd/mysite/main.go` with a CLI wrapper
+3. Add tests in `sites/mysite/mysite_test.go`
+
+### Browser Options
+
+When creating browsers, understand the option merging behavior:
+- `mergeOptions()` in `browser_init.go` merges variadic `BrowserOptions`
+- String/pointer fields: only overwritten if non-zero
+- Boolean fields: `RequireServer` and `UseLocalOnly` are one-way (only set to true); `ShowBrowser` always overwrites (known issue #16)
+
+### Playwright Server
+
+The library supports connecting to a remote Playwright server:
+- Environment variables: `PLAYWRIGHT_SERVER_ADDRESS_FIREFOX`, `PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM`, `PLAYWRIGHT_SERVER_ADDRESS_WEBKIT`
+- `BrowserOptions.ServerAddress` overrides the env var
+- `RequireServer: true` prevents fallback to local browser
+- `UseLocalOnly: true` skips server connection entirely
+
+### DOM Interaction
+
+Use the `Node` interface for all DOM operations:
+- `Select(selector)` returns `Nodes` (may be empty, never nil panic)
+- `SelectFirst(selector)` returns `Node` or **nil** — always nil-check before use
+- `ForEach(selector, fn)` iterates over matching elements
+- `SetAttribute(name, value)` uses JavaScript evaluation — be aware of escaping limitations (see #12)
+
+## Building
+
+```bash
+go build ./...
+go test ./...
+```
+
+CLI tools:
+```bash
+go build ./cmd/browser
+go build ./sites/duckduckgo/cmd/duckduckgo
+```
+
+## Issue Labels
+
+### Priority Labels
+
+| Label | Color | Usage |
+|-------|-------|-------|
+| `priority/critical` | `#B60205` | Showstopper — security vulnerability, data loss, or crash |
+| `priority/high` | `#D93F0B` | Important — significant bug or high-value improvement |
+| `priority/medium` | `#FBCA04` | Normal — standard improvement or non-critical bug |
+| `priority/low` | `#0E8A16` | Nice to have — minor improvement or cleanup |
+
+### Type Labels
+
+| Label | Color | Usage |
+|-------|-------|-------|
+| `type/epic` | `#5319E7` | Parent issue grouping related stories/tasks |
+| `type/task` | `#0075CA` | Concrete implementation work item |
+| `type/refactor` | `#D4C5F9` | Code restructuring without behavior change |
+
+### Category Labels
+
+| Label | Color | Usage |
+|-------|-------|-------|
+| `bug` | `#D73A4A` | Something isn't working correctly |
+| `enhancement` | `#A2EEEF` | New feature or improvement |
+| `security` | `#B60205` | Security-related issue |
+| `testing` | `#BFD4F2` | Test coverage or infrastructure |
+| `documentation` | `#0075CA` | Documentation improvements |
+| `performance` | `#FBCA04` | Performance optimization |
+
+### Hierarchy
+
+- **Epics** (`type/epic`) group related issues. Reference the parent epic with `**Parent:** #N` in sub-task descriptions.
+- **Tasks** (`type/task`) are concrete work items, usually children of an epic.
+- An issue should have exactly one `priority/*` label and one type/category label.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1c4e074
--- /dev/null
+++ b/README.md
@@ -0,0 +1,253 @@
+# go-extractor
+
+A Go library for browser-based web scraping and content extraction, powered by [Playwright](https://playwright.dev/).
+
+## Features
+
+- **Browser automation** via Playwright (Chromium, Firefox, WebKit)
+- **Readability extraction** — extract article content from any page using Mozilla's readability algorithm
+- **Interactive browser control** — mouse, keyboard, screenshots for remote browser sessions
+- **Cookie management** — load/save cookies from `cookies.txt` files, read-only cookie jars
+- **Remote browser support** — connect to Playwright server instances or fall back to local browsers
+- **Site-specific extractors** for:
+  - DuckDuckGo search (with pagination)
+  - Google search
+  - Powerball lottery results
+  - Mega Millions lottery results
+  - Wegmans grocery prices
+  - AisleGopher grocery prices
+  - archive.ph archival
+  - useragents.me user-agent lookup
+
+## Installation
+
+```bash
+go get gitea.stevedudenhoeffer.com/steve/go-extractor
+```
+
+Playwright browsers must be installed:
+
+```bash
+go run github.com/playwright-community/playwright-go/cmd/playwright install
+```
+
+## Quick Start
+
+### Extract article content from a URL
+
+```go
+package main
+
+import (
+	"context"
+	"fmt"
+	"log"
+
+	extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
+)
+
+func main() {
+	ctx := context.Background()
+
+	browser, err := extractor.NewBrowser(ctx)
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer browser.Close()
+
+	doc, err := browser.Open(ctx, "https://example.com/article", extractor.OpenPageOptions{})
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer doc.Close()
+
+	article, err := extractor.Readability(ctx, doc)
+	if err != nil {
+		log.Fatal(err)
+	}
+
+	fmt.Println("Title:", article.Title)
+	fmt.Println("Content:", article.TextContent)
+}
+```
+
+### Take a screenshot
+
+```go
+data, err := extractor.Screenshot(ctx, "https://example.com", 30*time.Second)
+if err != nil {
+	log.Fatal(err)
+}
+os.WriteFile("screenshot.png", data, 0644)
+```
+
+### Search DuckDuckGo
+
+```go
+import "gitea.stevedudenhoeffer.com/steve/go-extractor/sites/duckduckgo"
+
+results, err := duckduckgo.DefaultConfig.Search(ctx, browser, "golang web scraping")
+for _, r := range results {
+	fmt.Printf("%s - %s\n", r.Title, r.URL)
+}
+```
+
+### Use with Playwright server
+
+Set environment variables to connect to a remote Playwright instance:
+
+```bash
+export PLAYWRIGHT_SERVER_ADDRESS_FIREFOX=ws://playwright-server:3000
+export PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM=ws://playwright-server:3001
+```
+
+Or pass the address directly:
+
+```go
+browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
+	ServerAddress: "ws://playwright-server:3000",
+	RequireServer: true, // fail instead of falling back to local
+})
+```
+
+## Browser Options
+
+```go
+extractor.BrowserOptions{
+	UserAgent:     "custom-agent",           // defaults to a recent Firefox UA
+	Browser:       extractor.BrowserFirefox, // or BrowserChromium, BrowserWebKit
+	Timeout:       &timeout,                 // default 30s, 0 for no timeout
+	CookieJar:     jar,                      // load/save cookies automatically
+	ShowBrowser:   true,                     // show browser window (non-headless)
+	Dimensions:    extractor.Size{1280, 720},
+	DarkMode:      true,
+	ServerAddress: "ws://...",               // remote Playwright server
+	RequireServer: true,                     // don't fall back to local browser
+	UseLocalOnly:  true,                     // don't try remote server
+}
+```
+
+## DOM Interaction
+
+Documents and Nodes expose CSS selector-based DOM manipulation:
+
+```go
+// Select elements
+nodes := doc.Select("div.results a")
+first := doc.SelectFirst("h1")
+
+// Extract text
+text, err := first.Text()
+content, err := first.Content()
+href, err := first.Attr("href")
+
+// Interact
+err = first.Click()
+err = first.Type("hello world")
+
+// Iterate
+err = doc.ForEach("li.item", func(n extractor.Node) error {
+	text, _ := n.Text()
+	fmt.Println(text)
+	return nil
+})
+
+// Modify
+err = first.SetHidden(true)
+err = first.SetAttribute("data-processed", "true")
+```
+
+## Cookie Management
+
+Load cookies from a Netscape `cookies.txt` file:
+
+```go
+jar, err := extractor.LoadCookiesFile("cookies.txt")
+browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
+	CookieJar: jar,
+})
+```
+
+Use a read-only cookie jar (cookies are loaded but changes aren't saved back):
+
+```go
+roJar := extractor.ReadOnlyCookieJar{Jar: jar}
+```
+
+## Interactive Browser
+
+For remote browser control with mouse/keyboard:
+
+```go
+ib, err := extractor.NewInteractiveBrowser(ctx)
+defer ib.Close()
+
+url, err := ib.Navigate("https://example.com")
+err = ib.MouseClick(100, 200, "left")
+err = ib.KeyboardType("search query")
+err = ib.KeyboardPress("Enter")
+screenshot, err := ib.Screenshot(80) // JPEG quality 0-100
+```
+
+## Command-Line Tools
+
+The `cmd/` and `sites/*/cmd/` directories contain CLI tools:
+
+```bash
+# Extract article from URL
+go run ./cmd/browser https://example.com/article
+
+# Search DuckDuckGo
+go run ./sites/duckduckgo/cmd/duckduckgo "golang tutorial"
+
+# Search Google
+go run ./sites/google/cmd/google "golang tutorial"
+
+# Get Powerball results
+go run ./sites/powerball/cmd/powerball
+
+# Get Mega Millions results
+go run ./sites/megamillions/cmd/megamillions
+
+# Archive a page
+go run ./sites/archive/cmd/archive https://example.com/page
+
+# Get most common user agent
+go run ./sites/useragents/cmd/useragents
+```
+
+## Project Structure
+
+```
+go-extractor/
+├── article.go          # Article struct (readability output)
+├── browser.go          # Browser interface and Playwright implementation
+├── browser_init.go     # Browser initialization and option merging
+├── close.go            # DeferClose helper
+├── cookiejar.go        # Cookie/CookieJar types and ReadOnlyCookieJar
+├── cookies_txt.go      # cookies.txt file parser and staticCookieJar
+├── document.go         # Document interface (page wrapper)
+├── interactive.go      # InteractiveBrowser for remote control
+├── node.go             # Node interface (DOM element wrapper)
+├── nodes.go            # Nodes collection type
+├── playwright.go       # Playwright browser implementation
+├── readability.go      # Readability article extraction
+├── cmd/
+│   └── browser/        # CLI tool for article extraction
+├── sites/
+│   ├── aislegopher/    # AisleGopher price extraction
+│   ├── archive/        # archive.ph integration
+│   ├── duckduckgo/     # DuckDuckGo search
+│   ├── google/         # Google search
+│   ├── megamillions/   # Mega Millions lottery
+│   ├── powerball/      # Powerball lottery
+│   ├── useragents/     # useragents.me lookup
+│   └── wegmans/        # Wegmans price extraction
+└── *_test.go           # Unit tests
+```
+
+## Requirements
+
+- Go 1.24+
+- Playwright browsers installed (`playwright install`)
+- Optional: Playwright server for remote browser execution
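
Review note on the option-merging rules stated under "Browser Options" in CLAUDE.md (string/pointer fields overwritten only when non-zero, `RequireServer` and `UseLocalOnly` one-way, `ShowBrowser` always overwritten): the following is a minimal sketch of what those rules imply, not the code in `browser_init.go`. The `Options` struct and this `mergeOptions` function are hypothetical stand-ins written for illustration; the real `BrowserOptions` type has more fields and may differ in detail.

```go
package main

import (
	"fmt"
	"time"
)

// Options is a hypothetical stand-in for the library's BrowserOptions.
// Only a few representative fields are modelled here.
type Options struct {
	UserAgent     string
	Timeout       *time.Duration
	RequireServer bool
	UseLocalOnly  bool
	ShowBrowser   bool
}

// mergeOptions folds the given options left to right using the rules
// described in CLAUDE.md. It is an illustrative sketch, not the library's
// implementation.
func mergeOptions(opts ...Options) Options {
	var out Options
	for _, o := range opts {
		// String and pointer fields: overwrite only when non-zero.
		if o.UserAgent != "" {
			out.UserAgent = o.UserAgent
		}
		if o.Timeout != nil {
			out.Timeout = o.Timeout
		}
		// One-way booleans: once true, a later false cannot reset them.
		if o.RequireServer {
			out.RequireServer = true
		}
		if o.UseLocalOnly {
			out.UseLocalOnly = true
		}
		// ShowBrowser always overwrites, so a later zero value wins
		// (the behavior the guide tracks as known issue #16).
		out.ShowBrowser = o.ShowBrowser
	}
	return out
}

func main() {
	merged := mergeOptions(
		Options{UserAgent: "custom-agent", RequireServer: true, ShowBrowser: true},
		Options{}, // zero value: keeps UserAgent and RequireServer, resets ShowBrowser
	)
	fmt.Printf("%q %v %v\n", merged.UserAgent, merged.RequireServer, merged.ShowBrowser)
}
```

Under these assumptions, passing a zero-value `Options` last leaves `UserAgent` and `RequireServer` intact but silently turns `ShowBrowser` back off, which is the asymmetry the guide flags.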