go-extractor/CLAUDE.md
Steve Dudenhoeffer 49f294e884
docs: add README.md and CLAUDE.md
Add project documentation:
- README.md with installation, usage examples, API reference, and project structure
- CLAUDE.md with developer guide, architecture overview, conventions, and issue label docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 11:10:28 -05:00

go-extractor Developer Guide

Project Overview

Repository: gitea.stevedudenhoeffer.com/steve/go-extractor
Language: Go 1.24
Primary Dependencies:

  • github.com/playwright-community/playwright-go - Browser automation
  • github.com/go-shiori/go-readability - Article content extraction
  • github.com/urfave/cli/v3 - CLI framework (cmd tools only)
  • golang.org/x/text - Currency formatting (megamillions)

go-extractor is a browser-based web scraping library. It wraps Playwright to provide a clean Go API for opening pages, selecting DOM elements, extracting content, and interacting with pages.

Architecture

Core Types

File            Type                          Purpose
browser.go      Browser interface             Opens pages, manages browser lifecycle
document.go     Document interface            Represents an open page (URL, content, refresh, wait)
node.go         Node interface                DOM element operations (select, click, type, text, attr)
nodes.go        Nodes type                    Collection of Nodes with bulk operations
article.go      Article struct                Readability extraction result
cookiejar.go    CookieJar interface           Cookie storage abstraction
interactive.go  InteractiveBrowser interface  Low-level mouse/keyboard/screenshot control

Implementation Flow

NewBrowser(ctx, opts)
  └─> initBrowser(opts)           # browser_init.go
      ├─> playwright.Run()        # start Playwright
      ├─> bt.Connect() or Launch  # remote or local browser
      ├─> browser.NewContext()     # with UA, viewport, cookies
      └─> return browserInitResult

Browser.Open(ctx, url, opts)
  └─> openPage()                  # playwright.go - creates page, navigates
      └─> updateCookies()         # sync cookies back to jar
          └─> newDocument()       # document.go - wraps page as Document

Site Extractors

Each site extractor in sites/ follows the same pattern:

  1. Config struct with site-specific options and validate() method
  2. DefaultConfig package-level variable
  3. Methods on Config that take (ctx, Browser) and return parsed data
  4. A cmd/ subdirectory with a CLI tool using urfave/cli

Development Guidelines

Error Handling

  • Always check and propagate errors with fmt.Errorf("context: %w", err)
  • Never discard errors with _ = unless explicitly intended (like DeferClose)
  • Check for nil before calling methods on SelectFirst() results — it returns nil if no element matches
  • Place defer DeferClose(x) after the error check, never before it, so the deferred close only runs on a successfully opened value
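
The rules above can be sketched in one small function. This is a minimal illustration, not the library's code: node, document, deferClose, errNoMatch, and the fake types are hypothetical stand-ins for Node, Document, and DeferClose, modeling only the shape needed here.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errNoMatch is a hypothetical sentinel error used only in this sketch.
var errNoMatch = errors.New("no element matched")

// node and document are hypothetical stand-ins for the library's interfaces.
type node interface{ Text() (string, error) }

type document interface {
	io.Closer
	SelectFirst(selector string) node // nil when nothing matches
}

// deferClose stands in for DeferClose: it closes and deliberately drops the error.
func deferClose(c io.Closer) {
	if c != nil {
		_ = c.Close()
	}
}

// title opens a document and returns the text of its first <h1>.
func title(open func() (document, error)) (string, error) {
	doc, err := open()
	if err != nil {
		return "", fmt.Errorf("open page: %w", err)
	}
	defer deferClose(doc) // registered only after err is known to be nil

	h1 := doc.SelectFirst("h1")
	if h1 == nil { // nil-check before calling any method
		return "", fmt.Errorf("select h1: %w", errNoMatch)
	}
	text, err := h1.Text()
	if err != nil {
		return "", fmt.Errorf("read h1 text: %w", err)
	}
	return text, nil
}

// fakeNode and fakeDoc are in-memory fakes so the sketch runs without a browser.
type fakeNode struct{ text string }

func (n fakeNode) Text() (string, error) { return n.text, nil }

type fakeDoc struct{ h1 node }

func (d fakeDoc) Close() error            { return nil }
func (d fakeDoc) SelectFirst(string) node { return d.h1 }

func main() {
	_, err := title(func() (document, error) { return nil, errors.New("connection refused") })
	fmt.Println(err) // open page: connection refused

	_, err = title(func() (document, error) { return fakeDoc{}, nil })
	fmt.Println(errors.Is(err, errNoMatch)) // true
}
```

Because each layer wraps with %w, callers can still match the underlying cause with errors.Is or errors.As.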

Testing

  • Core types have unit tests using mock implementations in mock_test.go and nodes_test.go
  • mockDocument and mockNode implement the interfaces without Playwright
  • Site extractors currently lack mock-based tests — they need HTML fixtures
  • Run tests: go test ./...
  • Tests that need a browser should use build tags or skip when Playwright is unavailable
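
As a rough illustration of the mock-based approach, the sketch below exercises extraction logic against a canned node instead of a live page. The textNode interface and the headline function are hypothetical, and this mockNode is a simplified stand-in rather than the one in mock_test.go.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// textNode is a hypothetical, cut-down version of the Node interface.
type textNode interface {
	Text() (string, error)
}

// mockNode satisfies textNode with canned data, so no browser is needed.
type mockNode struct {
	text string
	err  error
}

func (m mockNode) Text() (string, error) { return m.text, m.err }

// headline is the code under test: it trims the node's text and rejects empty results.
func headline(n textNode) (string, error) {
	raw, err := n.Text()
	if err != nil {
		return "", fmt.Errorf("read headline: %w", err)
	}
	s := strings.TrimSpace(raw)
	if s == "" {
		return "", errors.New("headline is empty")
	}
	return s, nil
}

func main() {
	got, err := headline(mockNode{text: "  Jackpot rises  "})
	fmt.Printf("%q %v\n", got, err) // "Jackpot rises" <nil>

	_, err = headline(mockNode{err: errors.New("element detached")})
	fmt.Println(err) // read headline: element detached
}
```

In a real test file the same calls would sit inside a TestXxx function, with one mockNode per case.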

Adding a New Site Extractor

  1. Create sites/mysite/mysite.go:

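    package mysite
    
    import (
        "context"
        "fmt"
    
        extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
    )
    
    type Config struct{}
    var DefaultConfig = Config{}
    
    // Result is a placeholder for the parsed data; define real fields per site.
    type Result struct{}
    
    func (c Config) validate() Config { return c }
    
    func (c Config) Extract(ctx context.Context, b extractor.Browser) (Result, error) {
        doc, err := b.Open(ctx, "https://mysite.com", extractor.OpenPageOptions{})
        if err != nil {
            return Result{}, fmt.Errorf("failed to open page: %w", err)
        }
        defer extractor.DeferClose(doc) // deferred only after the error check
        // ... extract data using doc.Select(), doc.ForEach(), etc.
        return Result{}, nil
    }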
    
  2. Create sites/mysite/cmd/mysite/main.go with a CLI wrapper

  3. Add tests in sites/mysite/mysite_test.go

Browser Options

When creating browsers, understand the option merging behavior:

  • mergeOptions() in browser_init.go merges variadic BrowserOptions
  • String/pointer fields: only overwritten if non-zero
  • Boolean fields: RequireServer and UseLocalOnly are one-way (only set to true); ShowBrowser always overwrites (known issue #16)
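
The merging rules above can be illustrated with a small stand-alone sketch. The options struct and merge function here are hypothetical; the real BrowserOptions has more fields and mergeOptions() may differ in detail.

```go
package main

import "fmt"

// options is a hypothetical subset of BrowserOptions.
type options struct {
	ServerAddress string
	RequireServer bool
	UseLocalOnly  bool
	ShowBrowser   bool
}

// merge folds later options over earlier ones using the rules listed above.
func merge(all ...options) options {
	var out options
	for _, o := range all {
		if o.ServerAddress != "" { // strings: only a non-zero value overwrites
			out.ServerAddress = o.ServerAddress
		}
		out.RequireServer = out.RequireServer || o.RequireServer // one-way: can only become true
		out.UseLocalOnly = out.UseLocalOnly || o.UseLocalOnly    // one-way: can only become true
		out.ShowBrowser = o.ShowBrowser                          // always overwritten by the latest value
	}
	return out
}

func main() {
	got := merge(
		options{ServerAddress: "ws://a", RequireServer: true, ShowBrowser: true},
		options{}, // zero value: keeps the address and RequireServer, resets ShowBrowser
	)
	fmt.Printf("%+v\n", got)
	// {ServerAddress:ws://a RequireServer:true UseLocalOnly:false ShowBrowser:false}
}
```

The second, empty options value silently turns ShowBrowser back off, which is the surprising behavior the known issue refers to.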

Playwright Server

The library supports connecting to a remote Playwright server:

  • Environment variables: PLAYWRIGHT_SERVER_ADDRESS_FIREFOX, PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM, PLAYWRIGHT_SERVER_ADDRESS_WEBKIT
  • BrowserOptions.ServerAddress overrides the env var
  • RequireServer: true prevents fallback to local browser
  • UseLocalOnly: true skips server connection entirely
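
One way to picture how these settings interact is the sketch below. It only covers choosing an address, not connecting, and it is not the library's implementation: serverOptions and resolveServer are hypothetical names, the ws:// addresses are placeholders, and giving UseLocalOnly precedence over RequireServer is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// serverOptions is a hypothetical subset of BrowserOptions.
type serverOptions struct {
	ServerAddress string
	RequireServer bool
	UseLocalOnly  bool
}

// resolveServer returns the server address to try, or "" for a local launch.
// getenv is injected (os.Getenv in real use) so the logic can be tested.
func resolveServer(browser string, o serverOptions, getenv func(string) string) (string, error) {
	if o.UseLocalOnly {
		return "", nil // skip the server entirely (assumed to take precedence)
	}
	addr := o.ServerAddress // an explicit option overrides the environment
	if addr == "" {
		addr = getenv("PLAYWRIGHT_SERVER_ADDRESS_" + strings.ToUpper(browser))
	}
	if addr == "" && o.RequireServer {
		return "", fmt.Errorf("no Playwright server configured for %s", browser)
	}
	return addr, nil
}

func main() {
	env := map[string]string{"PLAYWRIGHT_SERVER_ADDRESS_FIREFOX": "ws://env:3000"}
	getenv := func(k string) string { return env[k] }

	addr, _ := resolveServer("firefox", serverOptions{}, getenv)
	fmt.Println(addr) // ws://env:3000

	addr, _ = resolveServer("firefox", serverOptions{ServerAddress: "ws://opt:3000"}, getenv)
	fmt.Println(addr) // ws://opt:3000

	_, err := resolveServer("webkit", serverOptions{RequireServer: true}, getenv)
	fmt.Println(err) // no Playwright server configured for webkit
}
```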

DOM Interaction

Use the Node interface for all DOM operations:

  • Select(selector) returns Nodes; the result may be empty, but calling methods on it does not cause a nil panic
  • SelectFirst(selector) returns Node or nil — always nil-check before use
  • ForEach(selector, fn) iterates over matching elements
  • SetAttribute(name, value) uses JavaScript evaluation — be aware of escaping limitations (see #12)
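
To show why escaping matters when values are spliced into evaluated JavaScript, here is one common technique: encoding each string with encoding/json, which yields a quoted literal that is safe to embed. This is a sketch of the general idea, not the library's SetAttribute implementation or the fix for #12; setAttributeJS is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// setAttributeJS builds a JavaScript arrow function that sets one attribute.
// Both strings are JSON-encoded so quotes and backslashes cannot break out
// of the string literal.
func setAttributeJS(name, value string) (string, error) {
	n, err := json.Marshal(name)
	if err != nil {
		return "", fmt.Errorf("encode attribute name: %w", err)
	}
	v, err := json.Marshal(value)
	if err != nil {
		return "", fmt.Errorf("encode attribute value: %w", err)
	}
	return fmt.Sprintf("(el) => el.setAttribute(%s, %s)", n, v), nil
}

func main() {
	js, err := setAttributeJS("title", `He said "hi"`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(js) // (el) => el.setAttribute("title", "He said \"hi\"")
}
```

Where the evaluation API accepts arguments, passing the value as an argument instead of formatting it into the script avoids the problem altogether.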

Building

go build ./...
go test ./...

CLI tools:

go build ./cmd/browser
go build ./sites/duckduckgo/cmd/duckduckgo

Issue Labels

Priority Labels

Label              Color    Usage
priority/critical  #B60205  Showstopper — security vulnerability, data loss, or crash
priority/high      #D93F0B  Important — significant bug or high-value improvement
priority/medium    #FBCA04  Normal — standard improvement or non-critical bug
priority/low       #0E8A16  Nice to have — minor improvement or cleanup

Type Labels

Label          Color    Usage
type/epic      #5319E7  Parent issue grouping related stories/tasks
type/task      #0075CA  Concrete implementation work item
type/refactor  #D4C5F9  Code restructuring without behavior change

Category Labels

Label          Color    Usage
bug            #D73A4A  Something isn't working correctly
enhancement    #A2EEEF  New feature or improvement
security       #B60205  Security-related issue
testing        #BFD4F2  Test coverage or infrastructure
documentation  #0075CA  Documentation improvements
performance    #FBCA04  Performance optimization

Hierarchy

  • Epics (type/epic) group related issues. Reference the parent epic with **Parent:** #N in sub-task descriptions.
  • Tasks (type/task) are concrete work items, usually children of an epic.
  • An issue should have exactly one priority/* label and one type/* or category label.