go-extractor Developer Guide
Project Overview
Repository: gitea.stevedudenhoeffer.com/steve/go-extractor
Language: Go 1.24
Primary Dependencies:
- `github.com/playwright-community/playwright-go` - Browser automation
- `github.com/go-shiori/go-readability` - Article content extraction
- `github.com/urfave/cli/v3` - CLI framework (cmd tools only)
- `golang.org/x/text` - Currency formatting (megamillions)
go-extractor is a browser-based web scraping library. It wraps Playwright to provide a clean Go API for opening pages, selecting DOM elements, extracting content, and interacting with pages.
Architecture
Core Types
| File | Type | Purpose |
|---|---|---|
| `browser.go` | `Browser` interface | Opens pages, manages browser lifecycle |
| `document.go` | `Document` interface | Represents an open page (URL, content, refresh, wait) |
| `node.go` | `Node` interface | DOM element operations (select, click, type, text, attr) |
| `nodes.go` | `Nodes` type | Collection of Nodes with bulk operations |
| `article.go` | `Article` struct | Readability extraction result |
| `cookiejar.go` | `CookieJar` interface | Cookie storage abstraction |
| `interactive.go` | `InteractiveBrowser` interface | Low-level mouse/keyboard/screenshot control |
Implementation Flow
NewBrowser(ctx, opts)
  └─> initBrowser(opts)              # browser_init.go
       ├─> playwright.Run()          # start Playwright
       ├─> bt.Connect() or Launch    # remote or local browser
       ├─> browser.NewContext()      # with UA, viewport, cookies
       └─> return browserInitResult

Browser.Open(ctx, url, opts)
  └─> openPage()                     # playwright.go - creates page, navigates
  └─> updateCookies()                # sync cookies back to jar
  └─> newDocument()                  # document.go - wraps page as Document
Site Extractors
Each site extractor in sites/ follows the same pattern:
- `Config` struct with site-specific options and `validate()` method
- `DefaultConfig` package-level variable
- Methods on `Config` that take `(ctx, Browser)` and return parsed data
- A `cmd/` subdirectory with a CLI tool using `urfave/cli`
Development Guidelines
Error Handling
- Always check and propagate errors with `fmt.Errorf("context: %w", err)`
- Never discard errors with `_ =` unless explicitly intended (like `DeferClose`)
- Check for nil before calling methods on `SelectFirst()` results; it returns nil if no element matches
- Move `defer DeferClose(x)` after the error check, not before
Testing
- Core types have unit tests using mock implementations in `mock_test.go` and `nodes_test.go`
- `mockDocument` and `mockNode` implement the interfaces without Playwright
- Site extractors currently lack mock-based tests; they need HTML fixtures
- Run tests: `go test ./...`
- Tests that need a browser should use build tags or skip when Playwright is unavailable
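As a rough illustration of the mock-based approach, the sketch below exercises a small helper against an in-memory node. The `Node` interface here is a reduced, hypothetical version with only two methods, and `mockNode` and `linkLabel` are written for this example; the real mocks in `mock_test.go` may look different.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a reduced, hypothetical version of the library's Node interface.
type Node interface {
	Text() (string, error)
	Attr(name string) (string, error)
}

// mockNode satisfies Node from in-memory data, so no browser is started.
type mockNode struct {
	text  string
	attrs map[string]string
}

func (m mockNode) Text() (string, error) { return m.text, nil }

func (m mockNode) Attr(name string) (string, error) {
	v, ok := m.attrs[name]
	if !ok {
		return "", fmt.Errorf("attribute %q not set", name)
	}
	return v, nil
}

// linkLabel is the code under test: it formats a link as "text (href)".
func linkLabel(n Node) (string, error) {
	text, err := n.Text()
	if err != nil {
		return "", fmt.Errorf("read text: %w", err)
	}
	href, err := n.Attr("href")
	if err != nil {
		return "", fmt.Errorf("read href: %w", err)
	}
	return strings.TrimSpace(text) + " (" + href + ")", nil
}

func main() {
	n := mockNode{text: " Docs ", attrs: map[string]string{"href": "/docs"}}
	label, err := linkLabel(n)
	fmt.Println(label, err)
}
```

In a real `_test.go` file the body of `main` would live in a `TestXxx(t *testing.T)` function and compare `label` against the expected string with `t.Errorf`.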
Adding a New Site Extractor
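- Create `sites/mysite/mysite.go`:

      package mysite

      import (
          "context"
          "fmt"

          // Import path assumed to match the repository location.
          extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
      )

      // Config holds site-specific options.
      type Config struct{}

      // DefaultConfig is the package-level default configuration.
      var DefaultConfig = Config{}

      // Result is a placeholder for the data this extractor returns.
      type Result struct{}

      // validate fills in defaults for unset options.
      func (c Config) validate() Config { return c }

      // Extract opens the site and returns the parsed data.
      func (c Config) Extract(ctx context.Context, b extractor.Browser) (Result, error) {
          doc, err := b.Open(ctx, "https://mysite.com", extractor.OpenPageOptions{})
          if err != nil {
              return Result{}, fmt.Errorf("failed to open page: %w", err)
          }
          defer extractor.DeferClose(doc)

          // ... extract data using doc.Select(), doc.ForEach(), etc.
          return Result{}, nil
      }

- Create `sites/mysite/cmd/mysite/main.go` with a CLI wrapper

- Add tests in `sites/mysite/mysite_test.go`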
Browser Options
When creating browsers, understand the option merging behavior:
- `mergeOptions()` in `browser_init.go` merges variadic `BrowserOptions`
- String/pointer fields: only overwritten if non-zero
- Boolean fields: `RequireServer` and `UseLocalOnly` are one-way (only set to true); `ShowBrowser` always overwrites (known issue #16)
Playwright Server
The library supports connecting to a remote Playwright server:
- Environment variables: `PLAYWRIGHT_SERVER_ADDRESS_FIREFOX`, `PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM`, `PLAYWRIGHT_SERVER_ADDRESS_WEBKIT`
- `BrowserOptions.ServerAddress` overrides the env var
- `RequireServer: true` prevents fallback to local browser
- `UseLocalOnly: true` skips server connection entirely
DOM Interaction
Use the Node interface for all DOM operations:
- `Select(selector)` returns `Nodes` (may be empty, but never causes a nil panic)
- `SelectFirst(selector)` returns `Node` or nil; always nil-check before use
- `ForEach(selector, fn)` iterates over matching elements
- `SetAttribute(name, value)` uses JavaScript evaluation; be aware of escaping limitations (see #12)
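To make the escaping concern concrete: when a value is interpolated into a JavaScript snippet, quotes and backslashes in the value can break the script. The sketch below shows one general-purpose mitigation, encoding both arguments as JSON string literals before interpolation. `setAttributeScript` is a hypothetical helper, and how the library actually builds its script is not shown in this guide.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// setAttributeScript builds a JavaScript arrow function that sets one
// attribute on the element it is called with. Both arguments are encoded
// with json.Marshal, so quotes and backslashes arrive as valid JS string
// literals instead of terminating the string early.
func setAttributeScript(name, value string) (string, error) {
	n, err := json.Marshal(name)
	if err != nil {
		return "", fmt.Errorf("encode name: %w", err)
	}
	v, err := json.Marshal(value)
	if err != nil {
		return "", fmt.Errorf("encode value: %w", err)
	}
	return fmt.Sprintf("el => el.setAttribute(%s, %s)", n, v), nil
}

func main() {
	script, err := setAttributeScript("data-note", `it's "quoted"`)
	fmt.Println(script, err)
}
```

Until #12 is resolved, callers that pass untrusted or quote-heavy values to `SetAttribute` may want to test those values explicitly.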
Building
go build ./...
go test ./...
CLI tools:
go build ./cmd/browser
go build ./sites/duckduckgo/cmd/duckduckgo
Issue Labels
Priority Labels
| Label | Color | Usage |
|---|---|---|
| `priority/critical` | `#B60205` | Showstopper: security vulnerability, data loss, or crash |
| `priority/high` | `#D93F0B` | Important: significant bug or high-value improvement |
| `priority/medium` | `#FBCA04` | Normal: standard improvement or non-critical bug |
| `priority/low` | `#0E8A16` | Nice to have: minor improvement or cleanup |
Type Labels
| Label | Color | Usage |
|---|---|---|
| `type/epic` | `#5319E7` | Parent issue grouping related stories/tasks |
| `type/task` | `#0075CA` | Concrete implementation work item |
| `type/refactor` | `#D4C5F9` | Code restructuring without behavior change |
Category Labels
| Label | Color | Usage |
|---|---|---|
| `bug` | `#D73A4A` | Something isn't working correctly |
| `enhancement` | `#A2EEEF` | New feature or improvement |
| `security` | `#B60205` | Security-related issue |
| `testing` | `#BFD4F2` | Test coverage or infrastructure |
| `documentation` | `#0075CA` | Documentation improvements |
| `performance` | `#FBCA04` | Performance optimization |
Hierarchy
- Epics (`type/epic`) group related issues. Reference the parent epic with `**Parent:** #N` in sub-task descriptions.
- Tasks (`type/task`) are concrete work items, usually children of an epic.
- An issue should have exactly one `priority/*` label and one type/category label.
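The labeling rule could be checked mechanically, for example in a triage script. The sketch below is one possible reading: `checkLabels` is a hypothetical helper, and it interprets "one type/category label" as "at least one `type/*` or category label", which is an assumption rather than a stated policy.

```go
package main

import (
	"fmt"
	"strings"
)

// categories lists the category labels from the Category Labels table.
var categories = map[string]bool{
	"bug": true, "enhancement": true, "security": true,
	"testing": true, "documentation": true, "performance": true,
}

// checkLabels reports an error unless labels holds exactly one priority/*
// label and at least one type/* or category label.
func checkLabels(labels []string) error {
	var priorities, kinds int
	for _, l := range labels {
		switch {
		case strings.HasPrefix(l, "priority/"):
			priorities++
		case strings.HasPrefix(l, "type/"), categories[l]:
			kinds++
		}
	}
	if priorities != 1 {
		return fmt.Errorf("want exactly 1 priority label, got %d", priorities)
	}
	if kinds == 0 {
		return fmt.Errorf("want a type or category label, got none")
	}
	return nil
}

func main() {
	fmt.Println(checkLabels([]string{"priority/high", "type/task", "bug"}))
	fmt.Println(checkLabels([]string{"bug"}))
}
```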