go-extractor/CLAUDE.md
Steve Dudenhoeffer 49f294e884
docs: add README.md and CLAUDE.md
Add project documentation:
- README.md with installation, usage examples, API reference, and project structure
- CLAUDE.md with developer guide, architecture overview, conventions, and issue label docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 11:10:28 -05:00

# go-extractor Developer Guide
## Project Overview
**Repository:** gitea.stevedudenhoeffer.com/steve/go-extractor
**Language:** Go 1.24
**Primary Dependencies:**
- `github.com/playwright-community/playwright-go` - Browser automation
- `github.com/go-shiori/go-readability` - Article content extraction
- `github.com/urfave/cli/v3` - CLI framework (cmd tools only)
- `golang.org/x/text` - Currency formatting (megamillions)
go-extractor is a browser-based web scraping library. It wraps Playwright to provide a clean Go API for opening pages, selecting DOM elements, extracting content, and interacting with pages.
## Architecture
### Core Types
| File | Type | Purpose |
|------|------|---------|
| `browser.go` | `Browser` interface | Opens pages, manages browser lifecycle |
| `document.go` | `Document` interface | Represents an open page (URL, content, refresh, wait) |
| `node.go` | `Node` interface | DOM element operations (select, click, type, text, attr) |
| `nodes.go` | `Nodes` type | Collection of Nodes with bulk operations |
| `article.go` | `Article` struct | Readability extraction result |
| `cookiejar.go` | `CookieJar` interface | Cookie storage abstraction |
| `interactive.go` | `InteractiveBrowser` interface | Low-level mouse/keyboard/screenshot control |
### Implementation Flow
```
NewBrowser(ctx, opts)
  └─> initBrowser(opts)              # browser_init.go
        ├─> playwright.Run()         # start Playwright
        ├─> bt.Connect() or Launch   # remote or local browser
        ├─> browser.NewContext()     # with UA, viewport, cookies
        └─> return browserInitResult

Browser.Open(ctx, url, opts)
  └─> openPage()                     # playwright.go - creates page, navigates
  └─> updateCookies()                # sync cookies back to jar
  └─> newDocument()                  # document.go - wraps page as Document
```
### Site Extractors
Each site extractor in `sites/` follows the same pattern:
1. `Config` struct with site-specific options and `validate()` method
2. `DefaultConfig` package-level variable
3. Methods on Config that take `(ctx, Browser)` and return parsed data
4. A `cmd/` subdirectory with a CLI tool using `urfave/cli`
## Development Guidelines
### Error Handling
- Always check and propagate errors with `fmt.Errorf("context: %w", err)`
- Never discard errors with `_ =` unless explicitly intended (like `DeferClose`)
- Check for nil before calling methods on `SelectFirst()` results — it returns nil if no element matches
- Place `defer DeferClose(x)` after the error check, not before
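The wrapping and defer-ordering rules above can be sketched with the standard library alone. Everything here is a hypothetical stand-in: `resource` plays the role of a `Document`, `openResource` the role of `Browser.Open`, and `deferClose` mimics what the library's `DeferClose` is assumed to do (close and deliberately discard the error).

```go
package main

import (
	"errors"
	"fmt"
)

// resource is a hypothetical stand-in for a Document; only Close matters here.
type resource struct{ closed *bool }

// Close records that it ran by setting the flag the caller supplied.
func (r *resource) Close() error {
	*r.closed = true
	return nil
}

// errUnavailable simulates a failed open.
var errUnavailable = errors.New("unavailable")

// openResource is a hypothetical stand-in for Browser.Open.
func openResource(fail bool, closed *bool) (*resource, error) {
	if fail {
		return nil, errUnavailable
	}
	return &resource{closed: closed}, nil
}

// deferClose is a local stand-in for DeferClose: the one place where
// discarding an error with `_ =` is intentional.
func deferClose(r *resource) {
	_ = r.Close()
}

// load opens a resource, wraps any failure with context, and closes on exit.
func load(fail bool, closed *bool) (string, error) {
	r, err := openResource(fail, closed)
	if err != nil {
		// %w keeps the original error reachable through errors.Is.
		return "", fmt.Errorf("failed to open page: %w", err)
	}
	// Deferred only after the error check: on the failure path r is nil
	// and Close would dereference it.
	defer deferClose(r)
	return "ok", nil
}

// isUnavailable reports whether err wraps errUnavailable.
func isUnavailable(err error) bool {
	return errors.Is(err, errUnavailable)
}
```

Because `load` wraps with `%w`, a caller can still test for the underlying cause with `errors.Is` while the message carries the added context.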
### Testing
- Core types have unit tests using mock implementations in `mock_test.go` and `nodes_test.go`
- `mockDocument` and `mockNode` implement the interfaces without Playwright
- Site extractors currently lack mock-based tests — they need HTML fixtures
- Run tests: `go test ./...`
- Tests that need a browser should use build tags or skip when Playwright is unavailable
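As an illustration of the mock-based approach, here is a minimal sketch of browser-free extraction logic. The `textFinder` interface, `fakeDocument`, and `extractHeadline` are hypothetical names invented for this example; the real `Document` interface and the mocks in `mock_test.go` have their own method sets.

```go
package main

import "fmt"

// textFinder is a hypothetical, narrowed view of a document for this sketch.
type textFinder interface {
	// Text returns the text of the first match and whether one was found.
	Text(selector string) (string, bool)
}

// fakeDocument serves canned selector results, so no browser is needed.
type fakeDocument struct {
	texts map[string]string
}

// Text looks the selector up in the canned results.
func (f fakeDocument) Text(selector string) (string, bool) {
	s, ok := f.texts[selector]
	return s, ok
}

// extractHeadline is the code under test; it depends only on the interface.
func extractHeadline(doc textFinder) (string, error) {
	const sel = "h1.headline"
	s, ok := doc.Text(sel)
	if !ok {
		return "", fmt.Errorf("headline: no element matches %q", sel)
	}
	return s, nil
}
```

A test can then call `extractHeadline(fakeDocument{texts: ...})` for both the matching and the missing case without starting Playwright. HTML fixtures for site extractors could feed a fake of this kind.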
### Adding a New Site Extractor
1. Create `sites/mysite/mysite.go`:
```go
package mysite

import (
	"context"
	"fmt"

	extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
)

// Result is a placeholder for the data this extractor returns.
type Result struct{}
type Config struct{}

var DefaultConfig = Config{}
func (c Config) validate() Config { return c }

func (c Config) Extract(ctx context.Context, b extractor.Browser) (Result, error) {
	doc, err := b.Open(ctx, "https://mysite.com", extractor.OpenPageOptions{})
	if err != nil {
		return Result{}, fmt.Errorf("failed to open page: %w", err)
	}
	defer extractor.DeferClose(doc)
	// ... extract data using doc.Select(), doc.ForEach(), etc.
	return Result{}, nil
}
```
2. Create `sites/mysite/cmd/mysite/main.go` with a CLI wrapper
3. Add tests in `sites/mysite/mysite_test.go`
### Browser Options
When creating browsers, understand the option merging behavior:
- `mergeOptions()` in `browser_init.go` merges variadic `BrowserOptions`
- String/pointer fields: only overwritten if non-zero
- Boolean fields: `RequireServer` and `UseLocalOnly` are one-way (only set to true); `ShowBrowser` always overwrites (known issue #16)
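The three merge rules can be sketched as follows. This is not the actual `mergeOptions()` code: the struct is a trimmed copy, `UserAgent` is a hypothetical field name, and the function signature is an assumption.

```go
package main

// BrowserOptions is a trimmed, hypothetical copy holding only the fields
// needed to show each merge rule.
type BrowserOptions struct {
	ServerAddress string
	UserAgent     string // hypothetical field name
	RequireServer bool
	UseLocalOnly  bool
	ShowBrowser   bool
}

// mergeOptionsSketch applies each later options value on top of base.
func mergeOptionsSketch(base BrowserOptions, opts ...BrowserOptions) BrowserOptions {
	for _, o := range opts {
		// String fields: overwritten only when the new value is non-zero.
		if o.ServerAddress != "" {
			base.ServerAddress = o.ServerAddress
		}
		if o.UserAgent != "" {
			base.UserAgent = o.UserAgent
		}
		// One-way booleans: a later value can set them, never clear them.
		if o.RequireServer {
			base.RequireServer = true
		}
		if o.UseLocalOnly {
			base.UseLocalOnly = true
		}
		// ShowBrowser always overwrites, including with false.
		base.ShowBrowser = o.ShowBrowser
	}
	return base
}
```

Under these rules, merging `{ShowBrowser: true}` with a later `{RequireServer: true}` leaves `ShowBrowser` false, because the later value's zero `ShowBrowser` overwrites the earlier one. That is one plausible reading of why the always-overwrite behavior is tracked as an issue.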
### Playwright Server
The library supports connecting to a remote Playwright server:
- Environment variables: `PLAYWRIGHT_SERVER_ADDRESS_FIREFOX`, `PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM`, `PLAYWRIGHT_SERVER_ADDRESS_WEBKIT`
- `BrowserOptions.ServerAddress` overrides the env var
- `RequireServer: true` prevents fallback to local browser
- `UseLocalOnly: true` skips server connection entirely
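A minimal sketch of how these settings might combine when choosing an address. The function, the trimmed `serverOptions` struct, and the error value are hypothetical. The sketch covers only the case where no address is configured, not a failed connection, and it assumes `UseLocalOnly` wins if both flags are set.

```go
package main

import (
	"errors"
	"strings"
)

// serverOptions is a trimmed, hypothetical subset of BrowserOptions.
type serverOptions struct {
	ServerAddress string
	RequireServer bool
	UseLocalOnly  bool
}

// errServerRequired is a hypothetical error for this sketch.
var errServerRequired = errors.New("playwright server required but no address configured")

// serverAddress returns the address to connect to, or "" to launch locally.
// getenv is injected so the lookup can be exercised without touching the
// real environment; production code would pass os.Getenv.
func serverAddress(browser string, o serverOptions, getenv func(string) string) (string, error) {
	if o.UseLocalOnly {
		// Skip the server entirely (assumed to take precedence).
		return "", nil
	}
	addr := o.ServerAddress
	if addr == "" {
		// Fall back to the per-browser variable, for example
		// PLAYWRIGHT_SERVER_ADDRESS_FIREFOX.
		addr = getenv("PLAYWRIGHT_SERVER_ADDRESS_" + strings.ToUpper(browser))
	}
	if addr == "" && o.RequireServer {
		// No local fallback is allowed.
		return "", errServerRequired
	}
	return addr, nil
}
```

Calling `serverAddress("firefox", opts, os.Getenv)` would then yield the explicit `ServerAddress` when set, the environment value otherwise, and an empty string when a local launch is acceptable.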
### DOM Interaction
Use the `Node` interface for all DOM operations:
- `Select(selector)` returns `Nodes` (may be empty, but never causes a nil panic)
- `SelectFirst(selector)` returns `Node` or **nil** — always nil-check before use
- `ForEach(selector, fn)` iterates over matching elements
- `SetAttribute(name, value)` uses JavaScript evaluation — be aware of escaping limitations (see #12)
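The difference between `Select`, `SelectFirst`, and `ForEach` can be shown with stand-in types. The `node` interface, the `page` type, and all method signatures below are assumptions made for this sketch; the library's real `Node` interface may differ (for example, its text accessor might also return an error).

```go
package main

// node is a hypothetical stand-in for the library's Node interface.
type node interface {
	Text() string
}

// textNode is a canned element whose text is its own value.
type textNode string

func (t textNode) Text() string { return string(t) }

// page maps selectors to canned matches.
type page map[string][]node

// Select returns every match; a miss yields an empty slice that is safe to
// range over.
func (p page) Select(sel string) []node { return p[sel] }

// SelectFirst returns the first match, or nil when nothing matches.
func (p page) SelectFirst(sel string) node {
	if m := p[sel]; len(m) > 0 {
		return m[0]
	}
	return nil
}

// ForEach calls fn for each match and stops at the first error.
func (p page) ForEach(sel string, fn func(node) error) error {
	for _, n := range p[sel] {
		if err := fn(n); err != nil {
			return err
		}
	}
	return nil
}

// title nil-checks the SelectFirst result before calling a method on it.
func title(p page) string {
	n := p.SelectFirst("h1")
	if n == nil {
		return ""
	}
	return n.Text()
}

// linkTexts collects the text of every anchor and propagates callback errors.
func linkTexts(p page) ([]string, error) {
	var out []string
	err := p.ForEach("a", func(n node) error {
		out = append(out, n.Text())
		return nil
	})
	return out, err
}
```

Skipping the nil check in `title` would panic on a page without an `h1`, whereas ranging over an empty `Select` result is harmless.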
## Building
```bash
go build ./...
go test ./...
```
CLI tools:
```bash
go build ./cmd/browser
go build ./sites/duckduckgo/cmd/duckduckgo
```
## Issue Labels
### Priority Labels
| Label | Color | Usage |
|-------|-------|-------|
| `priority/critical` | `#B60205` | Showstopper — security vulnerability, data loss, or crash |
| `priority/high` | `#D93F0B` | Important — significant bug or high-value improvement |
| `priority/medium` | `#FBCA04` | Normal — standard improvement or non-critical bug |
| `priority/low` | `#0E8A16` | Nice to have — minor improvement or cleanup |
### Type Labels
| Label | Color | Usage |
|-------|-------|-------|
| `type/epic` | `#5319E7` | Parent issue grouping related stories/tasks |
| `type/task` | `#0075CA` | Concrete implementation work item |
| `type/refactor` | `#D4C5F9` | Code restructuring without behavior change |
### Category Labels
| Label | Color | Usage |
|-------|-------|-------|
| `bug` | `#D73A4A` | Something isn't working correctly |
| `enhancement` | `#A2EEEF` | New feature or improvement |
| `security` | `#B60205` | Security-related issue |
| `testing` | `#BFD4F2` | Test coverage or infrastructure |
| `documentation` | `#0075CA` | Documentation improvements |
| `performance` | `#FBCA04` | Performance optimization |
### Hierarchy
- **Epics** (`type/epic`) group related issues. Reference the parent epic with `**Parent:** #N` in sub-task descriptions.
- **Tasks** (`type/task`) are concrete work items, usually children of an epic.
- An issue should have exactly one `priority/*` label and one type/category label.
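
The labeling rule could be checked mechanically, for instance in a triage script. The sketch below is hypothetical and not part of the repository. It reads "one type/category label" as "at least one", which is an interpretation, and takes the category names from the table above.

```go
package main

import (
	"fmt"
	"strings"
)

// categoryLabels lists the unprefixed category labels from the table above.
var categoryLabels = map[string]bool{
	"bug": true, "enhancement": true, "security": true,
	"testing": true, "documentation": true, "performance": true,
}

// checkLabels verifies that an issue carries exactly one priority/* label
// and at least one type/* or category label.
func checkLabels(labels []string) error {
	var priorities, kinds int
	for _, l := range labels {
		switch {
		case strings.HasPrefix(l, "priority/"):
			priorities++
		case strings.HasPrefix(l, "type/"), categoryLabels[l]:
			kinds++
		}
	}
	if priorities != 1 {
		return fmt.Errorf("want exactly one priority/* label, got %d", priorities)
	}
	if kinds == 0 {
		return fmt.Errorf("want a type/* or category label, got %d", kinds)
	}
	return nil
}
```

For example, `[]string{"priority/high", "bug"}` passes, while an issue with two priority labels or with no type or category label is rejected.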