docs: add README.md and CLAUDE.md
Add project documentation:
- README.md with installation, usage examples, API reference, and project structure
- CLAUDE.md with developer guide, architecture overview, conventions, and issue label docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
164
CLAUDE.md
Normal file
@@ -0,0 +1,164 @@
# go-extractor Developer Guide

## Project Overview

**Repository:** gitea.stevedudenhoeffer.com/steve/go-extractor
**Language:** Go 1.24
**Primary Dependencies:**
- `github.com/playwright-community/playwright-go` - Browser automation
- `github.com/go-shiori/go-readability` - Article content extraction
- `github.com/urfave/cli/v3` - CLI framework (cmd tools only)
- `golang.org/x/text` - Currency formatting (megamillions)

go-extractor is a browser-based web scraping library. It wraps Playwright to provide a clean Go API for opening pages, selecting DOM elements, extracting content, and interacting with pages.

## Architecture

### Core Types

| File | Type | Purpose |
|------|------|---------|
| `browser.go` | `Browser` interface | Opens pages, manages browser lifecycle |
| `document.go` | `Document` interface | Represents an open page (URL, content, refresh, wait) |
| `node.go` | `Node` interface | DOM element operations (select, click, type, text, attr) |
| `nodes.go` | `Nodes` type | Collection of Nodes with bulk operations |
| `article.go` | `Article` struct | Readability extraction result |
| `cookiejar.go` | `CookieJar` interface | Cookie storage abstraction |
| `interactive.go` | `InteractiveBrowser` interface | Low-level mouse/keyboard/screenshot control |

### Implementation Flow

```
NewBrowser(ctx, opts)
  └─> initBrowser(opts)                # browser_init.go
        ├─> playwright.Run()           # start Playwright
        ├─> bt.Connect() or Launch     # remote or local browser
        ├─> browser.NewContext()       # with UA, viewport, cookies
        └─> return browserInitResult

Browser.Open(ctx, url, opts)
  └─> openPage()                       # playwright.go - creates page, navigates
  └─> updateCookies()                  # sync cookies back to jar
  └─> newDocument()                    # document.go - wraps page as Document
```

### Site Extractors

Each site extractor in `sites/` follows the same pattern:
1. `Config` struct with site-specific options and `validate()` method
2. `DefaultConfig` package-level variable
3. Methods on Config that take `(ctx, Browser)` and return parsed data
4. A `cmd/` subdirectory with a CLI tool using `urfave/cli`

## Development Guidelines

### Error Handling

- Always check and propagate errors with `fmt.Errorf("context: %w", err)`
- Never discard errors with `_ =` unless explicitly intended (like `DeferClose`)
- Check for nil before calling methods on `SelectFirst()` results, because it returns nil if no element matches
- Place `defer DeferClose(x)` after the error check, not before

### Testing

- Core types have unit tests using mock implementations in `mock_test.go` and `nodes_test.go`
- `mockDocument` and `mockNode` implement the interfaces without Playwright
- Site extractors currently lack mock-based tests; they need HTML fixtures
- Run tests: `go test ./...`
- Tests that need a browser should use build tags or skip when Playwright is unavailable

### Adding a New Site Extractor

1. Create `sites/mysite/mysite.go`:
```go
package mysite

import (
	"context"
	"fmt"

	extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
)

// Result is a placeholder for the site-specific data the extractor returns.
type Result struct{}

type Config struct{}

var DefaultConfig = Config{}

func (c Config) validate() Config { return c }

func (c Config) Extract(ctx context.Context, b extractor.Browser) (Result, error) {
	doc, err := b.Open(ctx, "https://mysite.com", extractor.OpenPageOptions{})
	if err != nil {
		return Result{}, fmt.Errorf("failed to open page: %w", err)
	}
	defer extractor.DeferClose(doc)

	// ... extract data using doc.Select(), doc.ForEach(), etc.
	return Result{}, nil
}
```

2. Create `sites/mysite/cmd/mysite/main.go` with a CLI wrapper
3. Add tests in `sites/mysite/mysite_test.go`

### Browser Options

When creating browsers, understand the option merging behavior:
- `mergeOptions()` in `browser_init.go` merges variadic `BrowserOptions`
- String/pointer fields: only overwritten if non-zero
- Boolean fields: `RequireServer` and `UseLocalOnly` are one-way (only set to true); `ShowBrowser` always overwrites (known issue #16)

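The merging rules above can be illustrated with a small stand-alone sketch. It is not the code in `browser_init.go`: the `options` struct is a hypothetical, trimmed-down stand-in for `BrowserOptions`, and `merge` only models the three behaviors described (non-zero overwrite, one-way booleans, and `ShowBrowser` always overwriting).

```go
package main

import "fmt"

// options is a hypothetical, trimmed-down stand-in for BrowserOptions.
type options struct {
	UserAgent     string
	ServerAddress string
	RequireServer bool
	UseLocalOnly  bool
	ShowBrowser   bool
}

// merge folds later option sets over a base, following the rules listed above.
func merge(base options, overrides ...options) options {
	for _, o := range overrides {
		// String fields: only overwritten when the override is non-zero.
		if o.UserAgent != "" {
			base.UserAgent = o.UserAgent
		}
		if o.ServerAddress != "" {
			base.ServerAddress = o.ServerAddress
		}
		// One-way flags: can be switched on, never back off.
		base.RequireServer = base.RequireServer || o.RequireServer
		base.UseLocalOnly = base.UseLocalOnly || o.UseLocalOnly
		// Always overwritten, so a later zero value resets it.
		base.ShowBrowser = o.ShowBrowser
	}
	return base
}

func main() {
	got := merge(
		options{UserAgent: "default-agent"},
		options{ShowBrowser: true, RequireServer: true},
		options{UserAgent: "custom-agent"},
	)
	fmt.Printf("%+v\n", got)
}
```

In this sketch the third option set leaves `ShowBrowser` at its zero value, which silently turns it back off. That is the kind of surprise the known-issue note warns about.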
### Playwright Server

The library supports connecting to a remote Playwright server:
- Environment variables: `PLAYWRIGHT_SERVER_ADDRESS_FIREFOX`, `PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM`, `PLAYWRIGHT_SERVER_ADDRESS_WEBKIT`
- `BrowserOptions.ServerAddress` overrides the env var
- `RequireServer: true` prevents fallback to local browser
- `UseLocalOnly: true` skips server connection entirely

### DOM Interaction

Use the `Node` interface for all DOM operations:
- `Select(selector)` returns `Nodes` (may be empty, but never causes a nil panic)
- `SelectFirst(selector)` returns `Node` or **nil**, so always nil-check before use
- `ForEach(selector, fn)` iterates over matching elements
- `SetAttribute(name, value)` uses JavaScript evaluation, so be aware of escaping limitations (see #12)

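To make the escaping concern concrete, the sketch below shows one common way to embed arbitrary strings in a JavaScript snippet: quoting them with `encoding/json`, whose output for a Go string is also a valid JavaScript string literal. This is not how the library's `SetAttribute` is implemented; `jsString` and `setAttributeScript` are hypothetical helpers used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsString returns s as a quoted JavaScript string literal.
func jsString(s string) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", fmt.Errorf("quote %q: %w", s, err)
	}
	return string(b), nil
}

// setAttributeScript builds script text that a SetAttribute-style helper
// might evaluate against an element (hypothetical shape).
func setAttributeScript(name, value string) (string, error) {
	n, err := jsString(name)
	if err != nil {
		return "", err
	}
	v, err := jsString(value)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("el => el.setAttribute(%s, %s)", n, v), nil
}

func main() {
	script, err := setAttributeScript("data-note", `it's "quoted"`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(script)
}
```

Naive string concatenation would break on the embedded double quotes in this value; quoting both arguments avoids that class of problem.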
## Building

```bash
go build ./...
go test ./...
```

CLI tools:
```bash
go build ./cmd/browser
go build ./sites/duckduckgo/cmd/duckduckgo
```

## Issue Labels

### Priority Labels

| Label | Color | Usage |
|-------|-------|-------|
| `priority/critical` | `#B60205` | Showstopper: security vulnerability, data loss, or crash |
| `priority/high` | `#D93F0B` | Important: significant bug or high-value improvement |
| `priority/medium` | `#FBCA04` | Normal: standard improvement or non-critical bug |
| `priority/low` | `#0E8A16` | Nice to have: minor improvement or cleanup |

### Type Labels

| Label | Color | Usage |
|-------|-------|-------|
| `type/epic` | `#5319E7` | Parent issue grouping related stories/tasks |
| `type/task` | `#0075CA` | Concrete implementation work item |
| `type/refactor` | `#D4C5F9` | Code restructuring without behavior change |

### Category Labels

| Label | Color | Usage |
|-------|-------|-------|
| `bug` | `#D73A4A` | Something isn't working correctly |
| `enhancement` | `#A2EEEF` | New feature or improvement |
| `security` | `#B60205` | Security-related issue |
| `testing` | `#BFD4F2` | Test coverage or infrastructure |
| `documentation` | `#0075CA` | Documentation improvements |
| `performance` | `#FBCA04` | Performance optimization |

### Hierarchy

- **Epics** (`type/epic`) group related issues. Reference the parent epic with `**Parent:** #N` in sub-task descriptions.
- **Tasks** (`type/task`) are concrete work items, usually children of an epic.
- An issue should have exactly one `priority/*` label and one type/category label.
253
README.md
Normal file
@@ -0,0 +1,253 @@
# go-extractor

A Go library for browser-based web scraping and content extraction, powered by [Playwright](https://playwright.dev/).

## Features

- **Browser automation** via Playwright (Chromium, Firefox, WebKit)
- **Readability extraction**: extract article content from any page using Mozilla's readability algorithm
- **Interactive browser control**: mouse, keyboard, screenshots for remote browser sessions
- **Cookie management**: load/save cookies from `cookies.txt` files, read-only cookie jars
- **Remote browser support**: connect to Playwright server instances or fall back to local browsers
- **Site-specific extractors** for:
  - DuckDuckGo search (with pagination)
  - Google search
  - Powerball lottery results
  - Mega Millions lottery results
  - Wegmans grocery prices
  - AisleGopher grocery prices
  - archive.ph archival
  - useragents.me user-agent lookup

## Installation

```bash
go get gitea.stevedudenhoeffer.com/steve/go-extractor
```

Playwright browsers must be installed:

```bash
go run github.com/playwright-community/playwright-go/cmd/playwright install
```

## Quick Start

### Extract article content from a URL

```go
package main

import (
	"context"
	"fmt"
	"log"

	extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
)

func main() {
	ctx := context.Background()

	browser, err := extractor.NewBrowser(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer browser.Close()

	doc, err := browser.Open(ctx, "https://example.com/article", extractor.OpenPageOptions{})
	if err != nil {
		log.Fatal(err)
	}
	defer doc.Close()

	article, err := extractor.Readability(ctx, doc)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("Title:", article.Title)
	fmt.Println("Content:", article.TextContent)
}
```

### Take a screenshot

```go
data, err := extractor.Screenshot(ctx, "https://example.com", 30*time.Second)
if err != nil {
	log.Fatal(err)
}
// Check the write error too instead of silently dropping it.
if err := os.WriteFile("screenshot.png", data, 0644); err != nil {
	log.Fatal(err)
}
```

### Search DuckDuckGo

```go
import "gitea.stevedudenhoeffer.com/steve/go-extractor/sites/duckduckgo"

results, err := duckduckgo.DefaultConfig.Search(ctx, browser, "golang web scraping")
if err != nil {
	log.Fatal(err)
}
for _, r := range results {
	fmt.Printf("%s - %s\n", r.Title, r.URL)
}
```

### Use with Playwright server

Set environment variables to connect to a remote Playwright instance:

```bash
export PLAYWRIGHT_SERVER_ADDRESS_FIREFOX=ws://playwright-server:3000
export PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM=ws://playwright-server:3001
```

Or pass the address directly:

```go
browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
	ServerAddress: "ws://playwright-server:3000",
	RequireServer: true, // fail instead of falling back to local
})
```

## Browser Options

```go
extractor.BrowserOptions{
	UserAgent:     "custom-agent",           // defaults to a recent Firefox UA
	Browser:       extractor.BrowserFirefox, // or BrowserChromium, BrowserWebKit
	Timeout:       &timeout,                 // default 30s, 0 for no timeout
	CookieJar:     jar,                      // load/save cookies automatically
	ShowBrowser:   true,                     // show browser window (non-headless)
	Dimensions:    extractor.Size{1280, 720},
	DarkMode:      true,
	ServerAddress: "ws://...",               // remote Playwright server
	RequireServer: true,                     // don't fall back to local browser
	UseLocalOnly:  true,                     // don't try remote server
}
```

## DOM Interaction

Documents and Nodes expose CSS selector-based DOM manipulation:

```go
// Select elements
nodes := doc.Select("div.results a")
first := doc.SelectFirst("h1")
if first == nil { // SelectFirst returns nil when nothing matches
	log.Fatal("no <h1> found")
}

// Extract text
text, err := first.Text()
content, err := first.Content()
href, err := first.Attr("href")

// Interact
err = first.Click()
err = first.Type("hello world")

// Iterate
err = doc.ForEach("li.item", func(n extractor.Node) error {
	text, err := n.Text()
	if err != nil {
		return err // propagate instead of discarding
	}
	fmt.Println(text)
	return nil
})

// Modify
err = first.SetHidden(true)
err = first.SetAttribute("data-processed", "true")
```

## Cookie Management

Load cookies from a Netscape `cookies.txt` file:

```go
jar, err := extractor.LoadCookiesFile("cookies.txt")
if err != nil {
	log.Fatal(err)
}
browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
	CookieJar: jar,
})
// check err before using browser, as in the Quick Start example
```

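For readers unfamiliar with the Netscape `cookies.txt` format, each cookie is one line of seven tab-separated fields: domain, include-subdomains flag, path, secure flag, expiry (Unix seconds), name, and value. The sketch below parses a single line under that layout. It is not the library's parser: the `cookie` struct and `parseLine` are hypothetical, and for simplicity every `#` line is treated as a comment (some tools also emit `#HttpOnly_` prefixed cookie lines, which this sketch skips).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// cookie is a hypothetical stand-in; the library's own Cookie type may differ.
type cookie struct {
	Domain  string
	Path    string
	Secure  bool
	Expires time.Time
	Name    string
	Value   string
}

// parseLine parses one cookies.txt line. ok is false for blank lines and comments.
func parseLine(line string) (c cookie, ok bool, err error) {
	line = strings.TrimRight(line, "\r\n")
	if strings.TrimSpace(line) == "" || strings.HasPrefix(line, "#") {
		return cookie{}, false, nil
	}
	f := strings.Split(line, "\t")
	if len(f) != 7 {
		return cookie{}, false, fmt.Errorf("expected 7 tab-separated fields, got %d", len(f))
	}
	secs, err := strconv.ParseInt(f[4], 10, 64)
	if err != nil {
		return cookie{}, false, fmt.Errorf("parse expiry %q: %w", f[4], err)
	}
	return cookie{
		Domain:  f[0],
		Path:    f[2],
		Secure:  f[3] == "TRUE",
		Expires: time.Unix(secs, 0),
		Name:    f[5],
		Value:   f[6],
	}, true, nil
}

func main() {
	line := ".example.com\tTRUE\t/\tFALSE\t1700000000\tsession\tabc123"
	c, ok, err := parseLine(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ok, c.Domain, c.Name, c.Value, c.Secure)
}
```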
Use a read-only cookie jar (cookies are loaded but changes aren't saved back):

```go
roJar := extractor.ReadOnlyCookieJar{Jar: jar}
```

## Interactive Browser

For remote browser control with mouse/keyboard:

```go
ib, err := extractor.NewInteractiveBrowser(ctx)
if err != nil {
	log.Fatal(err)
}
defer ib.Close() // defer only after the error check

url, err := ib.Navigate("https://example.com")
err = ib.MouseClick(100, 200, "left")
err = ib.KeyboardType("search query")
err = ib.KeyboardPress("Enter")
screenshot, err := ib.Screenshot(80) // JPEG quality 0-100
```

## Command-Line Tools

The `cmd/` and `sites/*/cmd/` directories contain CLI tools:

```bash
# Extract article from URL
go run ./cmd/browser https://example.com/article

# Search DuckDuckGo
go run ./sites/duckduckgo/cmd/duckduckgo "golang tutorial"

# Search Google
go run ./sites/google/cmd/google "golang tutorial"

# Get Powerball results
go run ./sites/powerball/cmd/powerball

# Get Mega Millions results
go run ./sites/megamillions/cmd/megamillions

# Archive a page
go run ./sites/archive/cmd/archive https://example.com/page

# Get most common user agent
go run ./sites/useragents/cmd/useragents
```

## Project Structure

```
go-extractor/
├── article.go          # Article struct (readability output)
├── browser.go          # Browser interface and Playwright implementation
├── browser_init.go     # Browser initialization and option merging
├── close.go            # DeferClose helper
├── cookiejar.go        # Cookie/CookieJar types and ReadOnlyCookieJar
├── cookies_txt.go      # cookies.txt file parser and staticCookieJar
├── document.go         # Document interface (page wrapper)
├── interactive.go      # InteractiveBrowser for remote control
├── node.go             # Node interface (DOM element wrapper)
├── nodes.go            # Nodes collection type
├── playwright.go       # Playwright browser implementation
├── readability.go      # Readability article extraction
├── cmd/
│   └── browser/        # CLI tool for article extraction
├── sites/
│   ├── aislegopher/    # AisleGopher price extraction
│   ├── archive/        # archive.ph integration
│   ├── duckduckgo/     # DuckDuckGo search
│   ├── google/         # Google search
│   ├── megamillions/   # Mega Millions lottery
│   ├── powerball/      # Powerball lottery
│   ├── useragents/     # useragents.me lookup
│   └── wegmans/        # Wegmans price extraction
└── *_test.go           # Unit tests
```

## Requirements

- Go 1.24+
- Playwright browsers installed (`playwright install`)
- Optional: Playwright server for remote browser execution