steve/go-extractor

Steve Dudenhoeffer a9711ce904

fix: surface parsing errors instead of silently discarding them

Return errors for required fields (ID, price) and log warnings for
optional fields (title, description, unit price) across all site
extractors instead of silently discarding them with _ =.

Closes #24
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-15 16:31:56 +00:00

.gitea/workflows

fix: use setup-go@v3 (latest available on gitea.com mirror)

2026-02-09 14:13:46 -05:00

cmd/browser

fix: consolidate user-agent strings into DefaultUserAgent constant

2026-02-15 16:24:44 +00:00

sites

fix: surface parsing errors instead of silently discarding them

2026-02-15 16:31:56 +00:00

article_test.go

refactor: restructure API, deduplicate code, expand test coverage

2026-02-09 13:59:47 -05:00

article.go

initial commit

2024-12-07 03:53:46 -05:00

browser_init.go

fix: ShowBrowser merge behavior and consistent browser defaults

2026-02-15 16:22:49 +00:00

browser.go

added archive, megamillions, and powerball site logic

2024-12-23 03:18:50 -05:00

CLAUDE.md

docs: add README.md and CLAUDE.md

2026-02-14 11:10:28 -05:00

close_test.go

refactor: restructure API, deduplicate code, expand test coverage

2026-02-09 13:59:47 -05:00

close.go

refactor: restructure API, deduplicate code, expand test coverage

2026-02-09 13:59:47 -05:00

cookiejar_test.go

fix: bug fixes, test coverage, and CI workflow

2026-02-09 11:14:19 -05:00

cookiejar.go

fix: bug fixes, test coverage, and CI workflow

2026-02-09 11:14:19 -05:00

cookies_txt_test.go

fix: bug fixes, test coverage, and CI workflow

2026-02-09 11:14:19 -05:00

cookies_txt.go

added archive, megamillions, and powerball site logic

2024-12-23 03:18:50 -05:00

document.go

fix: add nil guards to prevent nil-pointer panics

2026-02-15 16:13:43 +00:00

go.mod

Refactored jackpot handling and updated dependencies

2025-09-16 10:52:49 -04:00

go.sum

fix: add go.sum to repository for CI builds

2026-02-09 14:15:43 -05:00

interactive.go

fix: consolidate user-agent strings into DefaultUserAgent constant

2026-02-15 16:24:44 +00:00

MIGRATION.md

refactor: restructure API, deduplicate code, expand test coverage

2026-02-09 13:59:47 -05:00

mock_test.go

refactor: restructure API, deduplicate code, expand test coverage

2026-02-09 13:59:47 -05:00

node_test.go

fix: eliminate XSS vulnerability in SetAttribute by using Playwright arg passing

2026-02-15 16:12:46 +00:00

node.go

fix: eliminate XSS vulnerability in SetAttribute by using Playwright arg passing

2026-02-15 16:12:46 +00:00

nodes_test.go

fix: bug fixes, test coverage, and CI workflow

2026-02-09 11:14:19 -05:00

nodes.go

fix: bug fixes, test coverage, and CI workflow

2026-02-09 11:14:19 -05:00

playwright.go

fix: consolidate user-agent strings into DefaultUserAgent constant

2026-02-15 16:24:44 +00:00

readability_test.go

refactor: restructure API, deduplicate code, expand test coverage

2026-02-09 13:59:47 -05:00

readability.go

added archive, megamillions, and powerball site logic

2024-12-23 03:18:50 -05:00

README.md

fix: ShowBrowser merge behavior and consistent browser defaults

2026-02-15 16:22:49 +00:00

README.md

go-extractor

A Go library for browser-based web scraping and content extraction, powered by Playwright.

Features

- Browser automation via Playwright (Chromium, Firefox, WebKit)
- Readability extraction: extract article content from any page using Mozilla's Readability algorithm
- Interactive browser control: mouse, keyboard, and screenshots for remote browser sessions
- Cookie management: load/save cookies from cookies.txt files, read-only cookie jars
- Remote browser support: connect to Playwright server instances or fall back to local browsers
- Site-specific extractors for:
  - DuckDuckGo search (with pagination)
  - Google search
  - Powerball lottery results
  - Mega Millions lottery results
  - Wegmans grocery prices
  - AisleGopher grocery prices
  - archive.ph archival
  - useragents.me user-agent lookup

Installation

go get gitea.stevedudenhoeffer.com/steve/go-extractor

Playwright browsers must be installed:

go run github.com/playwright-community/playwright-go/cmd/playwright install

Quick Start

Extract article content from a URL

package main

import (
    "context"
    "fmt"
    "log"

    extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
)

func main() {
    ctx := context.Background()

    browser, err := extractor.NewBrowser(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer browser.Close()

    doc, err := browser.Open(ctx, "https://example.com/article", extractor.OpenPageOptions{})
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    article, err := extractor.Readability(ctx, doc)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println("Title:", article.Title)
    fmt.Println("Content:", article.TextContent)
}

Take a screenshot

// Requires the "log", "os", and "time" imports.
data, err := extractor.Screenshot(ctx, "https://example.com", 30*time.Second)
if err != nil {
    log.Fatal(err)
}
if err := os.WriteFile("screenshot.png", data, 0o644); err != nil {
    log.Fatal(err)
}

Search DuckDuckGo

import "gitea.stevedudenhoeffer.com/steve/go-extractor/sites/duckduckgo"

results, err := duckduckgo.DefaultConfig.Search(ctx, browser, "golang web scraping")
if err != nil {
    log.Fatal(err)
}
for _, r := range results {
    fmt.Printf("%s - %s\n", r.Title, r.URL)
}

Use with Playwright server

Set environment variables to connect to a remote Playwright instance:

export PLAYWRIGHT_SERVER_ADDRESS_FIREFOX=ws://playwright-server:3000
export PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM=ws://playwright-server:3001

Or pass the address directly:

browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
    ServerAddress: "ws://playwright-server:3000",
    RequireServer: true,  // fail instead of falling back to local
})
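
The fallback between a remote server and a local browser can be pictured with the sketch below. It is not the library's code: `resolveServerAddress`, its `getenv` parameter, and the precedence it applies (explicit address first, then the per-browser environment variable, then a local browser) are assumptions made for illustration. Only the two environment variable names come from this README.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// envByBrowser maps a browser name to the environment variable named in this README.
var envByBrowser = map[string]string{
	"firefox":  "PLAYWRIGHT_SERVER_ADDRESS_FIREFOX",
	"chromium": "PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM",
}

// resolveServerAddress is a hypothetical helper, not part of go-extractor.
// Assumed precedence: explicit address, then environment variable, then local.
// An empty result with a nil error means "use a local browser".
func resolveServerAddress(browser, explicit string, requireServer bool, getenv func(string) string) (string, error) {
	if explicit != "" {
		return explicit, nil
	}
	if name, ok := envByBrowser[browser]; ok {
		if addr := getenv(name); addr != "" {
			return addr, nil
		}
	}
	if requireServer {
		// Mirrors the RequireServer idea: fail instead of falling back.
		return "", errors.New("no Playwright server address configured")
	}
	return "", nil
}

func main() {
	addr, err := resolveServerAddress("firefox", "", false, os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if addr == "" {
		fmt.Println("using a local browser")
		return
	}
	fmt.Println("connecting to", addr)
}
```

Passing `getenv` as a parameter keeps the lookup easy to exercise in tests without touching the real environment.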

Browser Options

extractor.BrowserOptions{
    UserAgent:     "custom-agent",           // defaults to a recent Firefox UA
    Browser:       extractor.BrowserFirefox, // or BrowserChromium, BrowserWebKit
    Timeout:       &timeout,                 // default 30s, 0 for no timeout
    CookieJar:     jar,                      // load/save cookies automatically
    ShowBrowser:   extractor.Bool(true),     // show browser window (non-headless)
    Dimensions:    extractor.Size{1280, 720},
    DarkMode:      true,
    ServerAddress: "ws://...",               // remote Playwright server
    RequireServer: true,                     // don't fall back to local browser
    UseLocalOnly:  true,                     // don't try remote server
}
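
`Timeout` takes a pointer and `ShowBrowser` is set through `extractor.Bool`, which suggests that an unset field is distinguished from an explicit zero value (for example, a timeout of 0 meaning "no timeout" rather than "use the 30s default"). The sketch below shows one way such a merge could work. The `options` type, `boolPtr`, and `mergeOptions` are hypothetical stand-ins, not the library's actual definitions.

```go
package main

import (
	"fmt"
	"time"
)

// options is a hypothetical, trimmed-down stand-in for BrowserOptions.
type options struct {
	Timeout     *time.Duration // nil = keep the default, 0 = no timeout
	ShowBrowser *bool          // nil = keep the default
}

// boolPtr returns a pointer to b, the role extractor.Bool appears to play.
func boolPtr(b bool) *bool { return &b }

// mergeOptions overlays every field that is set in override onto base.
func mergeOptions(base, override options) options {
	if override.Timeout != nil {
		base.Timeout = override.Timeout
	}
	if override.ShowBrowser != nil {
		base.ShowBrowser = override.ShowBrowser
	}
	return base
}

func main() {
	defaultTimeout := 30 * time.Second
	defaults := options{Timeout: &defaultTimeout, ShowBrowser: boolPtr(false)}

	// Only Timeout is overridden; ShowBrowser keeps its default.
	noTimeout := time.Duration(0)
	merged := mergeOptions(defaults, options{Timeout: &noTimeout})

	fmt.Println("timeout:", *merged.Timeout, "show browser:", *merged.ShowBrowser)
}
```

With plain `bool` and `time.Duration` fields, an explicit `false` or `0` would be indistinguishable from "not set", so a merge could not tell whether to keep the default.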

DOM Interaction

Documents and Nodes expose CSS selector-based DOM manipulation:

// Select elements
nodes := doc.Select("div.results a")
first := doc.SelectFirst("h1")

// Extract text
text, err := first.Text()
content, err := first.Content()
href, err := first.Attr("href")

// Interact
err = first.Click()
err = first.Type("hello world")

// Iterate
err = doc.ForEach("li.item", func(n extractor.Node) error {
    text, err := n.Text()
    if err != nil {
        return err // surface the error instead of discarding it
    }
    fmt.Println(text)
    return nil
})

// Modify
err = first.SetHidden(true)
err = first.SetAttribute("data-processed", "true")

Cookie Management

Load cookies from a Netscape cookies.txt file:

jar, err := extractor.LoadCookiesFile("cookies.txt")
if err != nil {
    log.Fatal(err)
}
browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
    CookieJar: jar,
})
// Check err again before using browser.
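
For readers unfamiliar with the format, a Netscape cookies.txt file holds one cookie per line as seven tab-separated fields: domain, include-subdomains flag, path, secure flag, expiry (Unix seconds), name, and value. The sketch below parses a single line under that layout. The `cookie` struct and `parseCookieLine` are hypothetical and may differ from what `cookies_txt.go` actually does; in particular, this sketch treats every `#` line as a comment.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// cookie is a hypothetical struct for this sketch; the library's Cookie type may differ.
type cookie struct {
	Domain  string
	Path    string
	Secure  bool
	Expires time.Time
	Name    string
	Value   string
}

// parseCookieLine parses one line of a Netscape cookies.txt file.
// The bool result is false for blank lines and comments.
func parseCookieLine(line string) (cookie, bool, error) {
	// Trim only line endings so an empty trailing value field survives.
	line = strings.TrimRight(line, "\r\n")
	if strings.TrimSpace(line) == "" || strings.HasPrefix(line, "#") {
		return cookie{}, false, nil
	}
	fields := strings.Split(line, "\t")
	if len(fields) != 7 {
		return cookie{}, false, fmt.Errorf("expected 7 tab-separated fields, got %d", len(fields))
	}
	expiry, err := strconv.ParseInt(fields[4], 10, 64)
	if err != nil {
		return cookie{}, false, fmt.Errorf("invalid expiry %q: %w", fields[4], err)
	}
	return cookie{
		Domain:  fields[0],
		Path:    fields[2],
		Secure:  fields[3] == "TRUE",
		Expires: time.Unix(expiry, 0),
		Name:    fields[5],
		Value:   fields[6],
	}, true, nil
}

func main() {
	line := ".example.com\tTRUE\t/\tFALSE\t1767225600\tsession\tabc123"
	c, ok, err := parseCookieLine(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ok, c.Domain, c.Name, c.Value)
}
```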

Use a read-only cookie jar (cookies are loaded but changes aren't saved back):

roJar := extractor.ReadOnlyCookieJar{Jar: jar}
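
Conceptually, a read-only jar is a thin wrapper that forwards reads to the underlying jar and drops writes. The sketch below illustrates that idea with a deliberately tiny, hypothetical `jar` interface; the library's real `CookieJar` interface and `ReadOnlyCookieJar` methods are not shown in this README and will differ.

```go
package main

import "fmt"

// jar is a hypothetical minimal cookie-jar interface for this sketch.
type jar interface {
	Get(name string) (string, bool)
	Set(name, value string)
}

// memoryJar stores cookies in a map.
type memoryJar map[string]string

func (m memoryJar) Get(name string) (string, bool) {
	v, ok := m[name]
	return v, ok
}

func (m memoryJar) Set(name, value string) { m[name] = value }

// readOnlyJar forwards reads to the wrapped jar and silently drops writes.
type readOnlyJar struct{ Jar jar }

func (r readOnlyJar) Get(name string) (string, bool) { return r.Jar.Get(name) }

// Set is intentionally a no-op so the wrapped jar is never modified.
func (r readOnlyJar) Set(name, value string) {}

func main() {
	base := memoryJar{"session": "abc123"}
	ro := readOnlyJar{Jar: base}

	ro.Set("session", "changed") // dropped by the wrapper

	v, _ := ro.Get("session")
	fmt.Println(v)
}
```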

Interactive Browser

For remote browser control with mouse/keyboard:

ib, err := extractor.NewInteractiveBrowser(ctx)
if err != nil {
    log.Fatal(err)
}
defer ib.Close()

url, err := ib.Navigate("https://example.com")
err = ib.MouseClick(100, 200, "left")
err = ib.KeyboardType("search query")
err = ib.KeyboardPress("Enter")
screenshot, err := ib.Screenshot(80)  // JPEG quality 0-100

Command-Line Tools

The cmd/ and sites/*/cmd/ directories contain CLI tools:

# Extract article from URL
go run ./cmd/browser https://example.com/article

# Search DuckDuckGo
go run ./sites/duckduckgo/cmd/duckduckgo "golang tutorial"

# Search Google
go run ./sites/google/cmd/google "golang tutorial"

# Get Powerball results
go run ./sites/powerball/cmd/powerball

# Get Mega Millions results
go run ./sites/megamillions/cmd/megamillions

# Archive a page
go run ./sites/archive/cmd/archive https://example.com/page

# Get most common user agent
go run ./sites/useragents/cmd/useragents

Project Structure

go-extractor/
├── article.go           # Article struct (readability output)
├── browser.go           # Browser interface and Playwright implementation
├── browser_init.go      # Browser initialization and option merging
├── close.go             # DeferClose helper
├── cookiejar.go         # Cookie/CookieJar types and ReadOnlyCookieJar
├── cookies_txt.go       # cookies.txt file parser and staticCookieJar
├── document.go          # Document interface (page wrapper)
├── interactive.go       # InteractiveBrowser for remote control
├── node.go              # Node interface (DOM element wrapper)
├── nodes.go             # Nodes collection type
├── playwright.go        # Playwright browser implementation
├── readability.go       # Readability article extraction
├── cmd/
│   └── browser/         # CLI tool for article extraction
├── sites/
│   ├── aislegopher/     # AisleGopher price extraction
│   ├── archive/         # archive.ph integration
│   ├── duckduckgo/      # DuckDuckGo search
│   ├── google/          # Google search
│   ├── megamillions/    # Mega Millions lottery
│   ├── powerball/       # Powerball lottery
│   ├── useragents/      # useragents.me lookup
│   └── wegmans/         # Wegmans price extraction
└── *_test.go            # Unit tests

Requirements

- Go 1.24+
- Playwright browsers installed (playwright install)
- Optional: Playwright server for remote browser execution
