steve/go-extractor

Steve Dudenhoeffer ce95fb1d89

CI / build (pull_request) Successful in 47s
CI / vet (pull_request) Successful in 46s
CI / test (pull_request) Successful in 49s

fix: enhance stealth mode with additional anti-detection measures

Add 7 new init scripts to cover WebGL fingerprinting, missing Chrome
APIs, permissions behavior, CDP artifacts, and HeadlessChrome UA string.
Enable Chromium's new headless mode (Channel: "chromium") when stealth
is active to use the full UI layer that is harder to detect.

Closes #58

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-17 22:45:12 +00:00

| Path | Last commit | Date |
| --- | --- | --- |
| .gitea/workflows | fix: use setup-go@v3 (latest available on gitea.com mirror) | 2026-02-09 14:13:46 -05:00 |
| cmd/browser | feature: add stealth mode, launch args, and init scripts to BrowserOptions | 2026-02-17 20:10:58 +00:00 |
| extractortest | test: add mock-based site extractor test infrastructure | 2026-02-15 16:37:58 +00:00 |
| sites | fix: update weather extractor selectors to match DuckDuckGo's actual DOM | 2026-02-15 23:00:44 +00:00 |
| article_test.go | refactor: restructure API, deduplicate code, expand test coverage | 2026-02-09 13:59:47 -05:00 |
| article.go | initial commit | 2024-12-07 03:53:46 -05:00 |
| browser_init.go | fix: enhance stealth mode with additional anti-detection measures | 2026-02-17 22:45:12 +00:00 |
| browser.go | added archive, megamillions, and powerball site logic | 2024-12-23 03:18:50 -05:00 |
| CLAUDE.md | docs: add README.md and CLAUDE.md | 2026-02-14 11:10:28 -05:00 |
| close_test.go | refactor: restructure API, deduplicate code, expand test coverage | 2026-02-09 13:59:47 -05:00 |
| close.go | refactor: restructure API, deduplicate code, expand test coverage | 2026-02-09 13:59:47 -05:00 |
| cookiejar_test.go | enhance: thread-safe CookieJar, SameSite cookie attr, dynamic Google countries | 2026-02-15 16:34:54 +00:00 |
| cookiejar.go | enhance: thread-safe CookieJar, SameSite cookie attr, dynamic Google countries | 2026-02-15 16:34:54 +00:00 |
| cookies_txt_test.go | fix: bug fixes, test coverage, and CI workflow | 2026-02-09 11:14:19 -05:00 |
| cookies_txt.go | enhance: thread-safe CookieJar, SameSite cookie attr, dynamic Google countries | 2026-02-15 16:34:54 +00:00 |
| document.go | fix: add nil guards to prevent nil-pointer panics | 2026-02-15 16:13:43 +00:00 |
| go.mod | Refactored jackpot handling and updated dependencies | 2025-09-16 10:52:49 -04:00 |
| go.sum | fix: add go.sum to repository for CI builds | 2026-02-09 14:15:43 -05:00 |
| interactive.go | feature: add stealth mode, launch args, and init scripts to BrowserOptions | 2026-02-17 20:10:58 +00:00 |
| MIGRATION.md | refactor: restructure API, deduplicate code, expand test coverage | 2026-02-09 13:59:47 -05:00 |
| mock_test.go | refactor: restructure API, deduplicate code, expand test coverage | 2026-02-09 13:59:47 -05:00 |
| node_test.go | fix: eliminate XSS vulnerability in SetAttribute by using Playwright arg passing | 2026-02-15 16:12:46 +00:00 |
| node.go | fix: eliminate XSS vulnerability in SetAttribute by using Playwright arg passing | 2026-02-15 16:12:46 +00:00 |
| nodes_test.go | fix: bug fixes, test coverage, and CI workflow | 2026-02-09 11:14:19 -05:00 |
| nodes.go | fix: bug fixes, test coverage, and CI workflow | 2026-02-09 11:14:19 -05:00 |
| playwright.go | feature: add stealth mode, launch args, and init scripts to BrowserOptions | 2026-02-17 20:10:58 +00:00 |
| readability_test.go | refactor: restructure API, deduplicate code, expand test coverage | 2026-02-09 13:59:47 -05:00 |
| readability.go | added archive, megamillions, and powerball site logic | 2024-12-23 03:18:50 -05:00 |
| README.md | fix: ShowBrowser merge behavior and consistent browser defaults | 2026-02-15 16:22:49 +00:00 |
| stealth_test.go | fix: enhance stealth mode with additional anti-detection measures | 2026-02-17 22:45:12 +00:00 |
| stealth.go | fix: enhance stealth mode with additional anti-detection measures | 2026-02-17 22:45:12 +00:00 |

README.md

go-extractor

A Go library for browser-based web scraping and content extraction, powered by Playwright.

Features

- Browser automation via Playwright (Chromium, Firefox, WebKit)
- Readability extraction: extract article content from any page using Mozilla's Readability algorithm
- Interactive browser control: mouse, keyboard, and screenshots for remote browser sessions
- Cookie management: load/save cookies from cookies.txt files, read-only cookie jars
- Remote browser support: connect to Playwright server instances or fall back to local browsers
- Site-specific extractors for:
  - DuckDuckGo search (with pagination)
  - Google search
  - Powerball lottery results
  - Mega Millions lottery results
  - Wegmans grocery prices
  - AisleGopher grocery prices
  - archive.ph archival
  - useragents.me user-agent lookup

Installation

go get gitea.stevedudenhoeffer.com/steve/go-extractor

Playwright browsers must be installed:

go run github.com/playwright-community/playwright-go/cmd/playwright install

Quick Start

Extract article content from a URL

package main

import (
    "context"
    "fmt"
    "log"

    extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
)

func main() {
    ctx := context.Background()

    browser, err := extractor.NewBrowser(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer browser.Close()

    doc, err := browser.Open(ctx, "https://example.com/article", extractor.OpenPageOptions{})
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    article, err := extractor.Readability(ctx, doc)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println("Title:", article.Title)
    fmt.Println("Content:", article.TextContent)
}

Take a screenshot

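data, err := extractor.Screenshot(ctx, "https://example.com", 30*time.Second)
if err != nil {
    log.Fatal(err)
}
// Check the write error too; a failed write would otherwise go unnoticed.
if err := os.WriteFile("screenshot.png", data, 0o644); err != nil {
    log.Fatal(err)
}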

Search DuckDuckGo

import "gitea.stevedudenhoeffer.com/steve/go-extractor/sites/duckduckgo"

results, err := duckduckgo.DefaultConfig.Search(ctx, browser, "golang web scraping")
if err != nil {
    log.Fatal(err)
}
for _, r := range results {
    fmt.Printf("%s - %s\n", r.Title, r.URL)
}

Use with Playwright server

Set environment variables to connect to a remote Playwright instance:

export PLAYWRIGHT_SERVER_ADDRESS_FIREFOX=ws://playwright-server:3000
export PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM=ws://playwright-server:3001

Or pass the address directly:

browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
    ServerAddress: "ws://playwright-server:3000",
    RequireServer: true,  // fail instead of falling back to local
})

Browser Options

extractor.BrowserOptions{
    UserAgent:     "custom-agent",           // defaults to a recent Firefox UA
    Browser:       extractor.BrowserFirefox, // or BrowserChromium, BrowserWebKit
    Timeout:       &timeout,                 // default 30s, 0 for no timeout
    CookieJar:     jar,                      // load/save cookies automatically
    ShowBrowser:   extractor.Bool(true),     // show browser window (non-headless)
    Dimensions:    extractor.Size{1280, 720},
    DarkMode:      true,
    ServerAddress: "ws://...",               // remote Playwright server
    RequireServer: true,                     // don't fall back to local browser
    UseLocalOnly:  true,                     // don't try remote server
}

DOM Interaction

Documents and Nodes expose CSS selector-based DOM manipulation:

// Select elements
nodes := doc.Select("div.results a")
first := doc.SelectFirst("h1")

// Extract text
text, err := first.Text()
content, err := first.Content()
href, err := first.Attr("href")

// Interact
err = first.Click()
err = first.Type("hello world")

// Iterate
err = doc.ForEach("li.item", func(n extractor.Node) error {
    text, _ := n.Text()
    fmt.Println(text)
    return nil
})

// Modify
err = first.SetHidden(true)
err = first.SetAttribute("data-processed", "true")

Cookie Management

Load cookies from a Netscape cookies.txt file:

jar, err := extractor.LoadCookiesFile("cookies.txt")
if err != nil {
    log.Fatal(err)
}
browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
    CookieJar: jar,
})
// Check err before using browser, as in Quick Start.
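
For readers unfamiliar with the Netscape cookies.txt format: each non-comment line holds seven tab-separated fields (domain, include-subdomains flag, path, secure flag, expiry as a Unix timestamp, name, value). The following is a minimal standard-library sketch of parsing that format. It is not the parser in cookies_txt.go; `parseCookiesTxt` is a hypothetical name, and `net/http`'s `Cookie` is used as a stand-in for the library's own cookie type.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// parseCookiesTxt is a hypothetical helper that parses Netscape-format
// cookie data. Blank lines and lines starting with '#' are skipped.
func parseCookiesTxt(data string) ([]*http.Cookie, error) {
	var cookies []*http.Cookie
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		if strings.TrimSpace(line) == "" || strings.HasPrefix(line, "#") {
			continue
		}
		f := strings.Split(line, "\t")
		if len(f) != 7 {
			return nil, fmt.Errorf("expected 7 tab-separated fields, got %d", len(f))
		}
		expires, err := strconv.ParseInt(f[4], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad expiry %q: %w", f[4], err)
		}
		c := &http.Cookie{
			Domain: f[0],
			Path:   f[2],
			Secure: f[3] == "TRUE",
			Name:   f[5],
			Value:  f[6],
		}
		if expires > 0 { // 0 conventionally marks a session cookie
			c.Expires = time.Unix(expires, 0)
		}
		cookies = append(cookies, c)
	}
	return cookies, sc.Err()
}

func main() {
	data := "# Netscape HTTP Cookie File\n" +
		".example.com\tTRUE\t/\tTRUE\t1893456000\tsession\tabc123\n"
	cookies, err := parseCookiesTxt(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range cookies {
		fmt.Printf("%s=%s (domain %s, secure %t)\n", c.Name, c.Value, c.Domain, c.Secure)
	}
}
```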

Use a read-only cookie jar (cookies are loaded but changes aren't saved back):

roJar := extractor.ReadOnlyCookieJar{Jar: jar}
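
The read-only behavior can be pictured as a thin wrapper that forwards reads and discards writes. The sketch below shows that wrapper pattern with invented types: the `Jar` interface, `mapJar`, and `readOnlyJar` are all hypothetical, and silently dropping writes is an assumption about how "changes aren't saved back" is implemented.

```go
package main

import "fmt"

// Jar is a hypothetical minimal cookie-jar interface for this sketch;
// the library's real CookieJar interface may look different.
type Jar interface {
	Get(name string) (string, bool)
	Set(name, value string)
}

// mapJar is a trivial in-memory jar used as the wrapped store.
type mapJar map[string]string

func (m mapJar) Get(name string) (string, bool) { v, ok := m[name]; return v, ok }
func (m mapJar) Set(name, value string)         { m[name] = value }

// readOnlyJar forwards reads to the wrapped jar and ignores writes.
type readOnlyJar struct{ Jar Jar }

func (r readOnlyJar) Get(name string) (string, bool) { return r.Jar.Get(name) }
func (r readOnlyJar) Set(name, value string)         {} // assumed: writes are dropped

func main() {
	base := mapJar{"session": "abc"}
	ro := readOnlyJar{Jar: base}

	ro.Set("session", "changed") // has no effect on base
	v, _ := ro.Get("session")
	fmt.Println(v) // abc
}
```

Because the wrapper satisfies the same interface as the jar it wraps, it can be passed anywhere a jar is expected without the consumer knowing writes are disabled.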

Interactive Browser

For remote browser control with mouse/keyboard:

ib, err := extractor.NewInteractiveBrowser(ctx)
if err != nil {
    log.Fatal(err)
}
defer ib.Close() // only defer Close once ib is known to be valid

url, err := ib.Navigate("https://example.com")
err = ib.MouseClick(100, 200, "left")
err = ib.KeyboardType("search query")
err = ib.KeyboardPress("Enter")
screenshot, err := ib.Screenshot(80)  // JPEG quality 0-100

Command-Line Tools

The cmd/ and sites/*/cmd/ directories contain CLI tools:

# Extract article from URL
go run ./cmd/browser https://example.com/article

# Search DuckDuckGo
go run ./sites/duckduckgo/cmd/duckduckgo "golang tutorial"

# Search Google
go run ./sites/google/cmd/google "golang tutorial"

# Get Powerball results
go run ./sites/powerball/cmd/powerball

# Get Mega Millions results
go run ./sites/megamillions/cmd/megamillions

# Archive a page
go run ./sites/archive/cmd/archive https://example.com/page

# Get most common user agent
go run ./sites/useragents/cmd/useragents

Project Structure

go-extractor/
├── article.go           # Article struct (readability output)
├── browser.go           # Browser interface and Playwright implementation
├── browser_init.go      # Browser initialization and option merging
├── close.go             # DeferClose helper
├── cookiejar.go         # Cookie/CookieJar types and ReadOnlyCookieJar
├── cookies_txt.go       # cookies.txt file parser and staticCookieJar
├── document.go          # Document interface (page wrapper)
├── interactive.go       # InteractiveBrowser for remote control
├── node.go              # Node interface (DOM element wrapper)
├── nodes.go             # Nodes collection type
├── playwright.go        # Playwright browser implementation
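├── readability.go       # Readability article extraction
├── stealth.go           # Stealth-mode anti-detection init scripts
├── cmd/
│   └── browser/         # CLI tool for article extraction
├── extractortest/       # Mock-based test infrastructure for site extractors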
├── sites/
│   ├── aislegopher/     # AisleGopher price extraction
│   ├── archive/         # archive.ph integration
│   ├── duckduckgo/      # DuckDuckGo search
│   ├── google/          # Google search
│   ├── megamillions/    # Mega Millions lottery
│   ├── powerball/       # Powerball lottery
│   ├── useragents/      # useragents.me lookup
│   └── wegmans/         # Wegmans price extraction
└── *_test.go            # Unit tests

Requirements

- Go 1.24+
- Playwright browsers installed (playwright install)
- Optional: Playwright server for remote browser execution
