steve/go-extractor

steve 65cf6b027f

CI / vet (pull_request) Successful in 34s

CI / test (pull_request) Successful in 1m1s

CI / build (pull_request) Successful in 1m5s

feat: add RemoveHidden option to strip display:none elements before extraction

When RemoveHidden is true, JavaScript is evaluated on the live page to
remove all elements with computed display:none before readability
extraction. This defends against anti-scraping honeypots that embed
prompt injections in hidden DOM elements.

The implementation uses an optional pageEvaluator interface so that the
concrete document (backed by Playwright) supports it while the Document
interface remains unchanged.

Closes #62

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-20 14:06:17 +00:00

.gitea/workflows

fix: use setup-go@v3 (latest available on gitea.com mirror)

2026-02-09 14:13:46 -05:00

cmd/browser

feature: add stealth mode, launch args, and init scripts to BrowserOptions

2026-02-17 20:10:58 +00:00

extractortest

test: add mock-based site extractor test infrastructure

2026-02-15 16:37:58 +00:00

sites

fix: update weather extractor selectors to match DuckDuckGo's actual DOM

2026-02-15 23:00:44 +00:00

article_test.go

refactor: restructure API, deduplicate code, expand test coverage

2026-02-09 13:59:47 -05:00

article.go

initial commit

2024-12-07 03:53:46 -05:00

browser_init.go

fix: enhance stealth mode with additional anti-detection measures

2026-02-17 22:45:12 +00:00

browser.go

added archive, megamillions, and powerball site logic

2024-12-23 03:18:50 -05:00

CLAUDE.md

docs: add README.md and CLAUDE.md

2026-02-14 11:10:28 -05:00

close_test.go

refactor: restructure API, deduplicate code, expand test coverage

2026-02-09 13:59:47 -05:00

close.go

refactor: restructure API, deduplicate code, expand test coverage

2026-02-09 13:59:47 -05:00

cookiejar_test.go

enhance: thread-safe CookieJar, SameSite cookie attr, dynamic Google countries

2026-02-15 16:34:54 +00:00

cookiejar.go

enhance: thread-safe CookieJar, SameSite cookie attr, dynamic Google countries

2026-02-15 16:34:54 +00:00

cookies_txt_test.go

fix: bug fixes, test coverage, and CI workflow

2026-02-09 11:14:19 -05:00

cookies_txt.go

enhance: thread-safe CookieJar, SameSite cookie attr, dynamic Google countries

2026-02-15 16:34:54 +00:00

document.go

feat: add RemoveHidden option to strip display:none elements before extraction

2026-02-20 14:06:17 +00:00

go.mod

feat: add ReadabilityWithOptions for DOM cleanup before extraction

2026-02-19 01:09:28 +00:00

go.sum

feat: add ReadabilityWithOptions for DOM cleanup before extraction

2026-02-19 01:09:28 +00:00

interactive.go

feature: add stealth mode, launch args, and init scripts to BrowserOptions

2026-02-17 20:10:58 +00:00

MIGRATION.md

refactor: restructure API, deduplicate code, expand test coverage

2026-02-09 13:59:47 -05:00

mock_test.go

refactor: restructure API, deduplicate code, expand test coverage

2026-02-09 13:59:47 -05:00

node_test.go

fix: eliminate XSS vulnerability in SetAttribute by using Playwright arg passing

2026-02-15 16:12:46 +00:00

node.go

fix: eliminate XSS vulnerability in SetAttribute by using Playwright arg passing

2026-02-15 16:12:46 +00:00

nodes_test.go

fix: bug fixes, test coverage, and CI workflow

2026-02-09 11:14:19 -05:00

nodes.go

fix: bug fixes, test coverage, and CI workflow

2026-02-09 11:14:19 -05:00

playwright.go

feature: add stealth mode, launch args, and init scripts to BrowserOptions

2026-02-17 20:10:58 +00:00

readability_test.go

feat: add RemoveHidden option to strip display:none elements before extraction

2026-02-20 14:06:17 +00:00

readability.go

feat: add RemoveHidden option to strip display:none elements before extraction

2026-02-20 14:06:17 +00:00

README.md

fix: ShowBrowser merge behavior and consistent browser defaults

2026-02-15 16:22:49 +00:00

stealth_test.go

fix: enhance stealth mode with additional anti-detection measures

2026-02-17 22:45:12 +00:00

stealth.go

fix: enhance stealth mode with additional anti-detection measures

2026-02-17 22:45:12 +00:00

README.md

go-extractor

A Go library for browser-based web scraping and content extraction, powered by Playwright.

Features

- Browser automation via Playwright (Chromium, Firefox, WebKit)
- Readability extraction: extract article content from any page using Mozilla's readability algorithm
- Interactive browser control: mouse, keyboard, and screenshots for remote browser sessions
- Cookie management: load/save cookies from cookies.txt files, read-only cookie jars
- Remote browser support: connect to Playwright server instances or fall back to local browsers
- Site-specific extractors for:
  - DuckDuckGo search (with pagination)
  - Google search
  - Powerball lottery results
  - Mega Millions lottery results
  - Wegmans grocery prices
  - AisleGopher grocery prices
  - archive.ph archival
  - useragents.me user-agent lookup

Installation

go get gitea.stevedudenhoeffer.com/steve/go-extractor

Playwright browsers must be installed:

go run github.com/playwright-community/playwright-go/cmd/playwright install

Quick Start

Extract article content from a URL

package main

import (
    "context"
    "fmt"
    "log"

    extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
)

func main() {
    ctx := context.Background()

    browser, err := extractor.NewBrowser(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer browser.Close()

    doc, err := browser.Open(ctx, "https://example.com/article", extractor.OpenPageOptions{})
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    article, err := extractor.Readability(ctx, doc)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println("Title:", article.Title)
    fmt.Println("Content:", article.TextContent)
}

Take a screenshot

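data, err := extractor.Screenshot(ctx, "https://example.com", 30*time.Second)
if err != nil {
    log.Fatal(err)
}
// Write the image bytes to disk and surface any I/O error.
if err := os.WriteFile("screenshot.png", data, 0644); err != nil {
    log.Fatal(err)
}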

Search DuckDuckGo

import "gitea.stevedudenhoeffer.com/steve/go-extractor/sites/duckduckgo"

results, err := duckduckgo.DefaultConfig.Search(ctx, browser, "golang web scraping")
if err != nil {
    log.Fatal(err)
}
for _, r := range results {
    fmt.Printf("%s - %s\n", r.Title, r.URL)
}

Use with Playwright server

Set environment variables to connect to a remote Playwright instance:

export PLAYWRIGHT_SERVER_ADDRESS_FIREFOX=ws://playwright-server:3000
export PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM=ws://playwright-server:3001

Or pass the address directly:

browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
    ServerAddress: "ws://playwright-server:3000",
    RequireServer: true,  // fail instead of falling back to local
})

Browser Options

extractor.BrowserOptions{
    UserAgent:     "custom-agent",           // defaults to a recent Firefox UA
    Browser:       extractor.BrowserFirefox, // or BrowserChromium, BrowserWebKit
    Timeout:       &timeout,                 // default 30s, 0 for no timeout
    CookieJar:     jar,                      // load/save cookies automatically
    ShowBrowser:   extractor.Bool(true),     // show browser window (non-headless)
    Dimensions:    extractor.Size{1280, 720},
    DarkMode:      true,
    ServerAddress: "ws://...",               // remote Playwright server
    RequireServer: true,                     // don't fall back to local browser
    UseLocalOnly:  true,                     // don't try remote server
}

DOM Interaction

Documents and Nodes expose CSS selector-based DOM manipulation:

// Select elements
nodes := doc.Select("div.results a")
first := doc.SelectFirst("h1")

// Extract text
text, err := first.Text()
content, err := first.Content()
href, err := first.Attr("href")

// Interact
err = first.Click()
err = first.Type("hello world")

// Iterate
err = doc.ForEach("li.item", func(n extractor.Node) error {
    text, _ := n.Text()
    fmt.Println(text)
    return nil
})

// Modify
err = first.SetHidden(true)
err = first.SetAttribute("data-processed", "true")

Cookie Management

Load cookies from a Netscape cookies.txt file:

jar, err := extractor.LoadCookiesFile("cookies.txt")
if err != nil {
    log.Fatal(err)
}
browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
    CookieJar: jar,
})
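
For readers unfamiliar with the Netscape cookies.txt format: each data line holds seven tab-separated fields (domain, include-subdomains flag, path, secure flag, expiry as Unix seconds, name, value). The sketch below parses one such line. The cookie struct and parseCookieLine are hypothetical and do not reflect the library's actual parser or Cookie type.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// cookie is an illustrative type, not the library's Cookie.
type cookie struct {
	Domain            string
	IncludeSubdomains bool
	Path              string
	Secure            bool
	Expires           time.Time // an expiry of 0 conventionally marks a session cookie
	Name              string
	Value             string
}

// parseCookieLine parses one cookies.txt line. ok is false for blank lines
// and comments. This sketch treats every line starting with "#" as a
// comment, so it would also skip curl-style "#HttpOnly_" entries.
func parseCookieLine(line string) (c cookie, ok bool, err error) {
	line = strings.TrimRight(line, "\r\n")
	if strings.TrimSpace(line) == "" || strings.HasPrefix(line, "#") {
		return cookie{}, false, nil
	}
	fields := strings.Split(line, "\t")
	if len(fields) != 7 {
		return cookie{}, false, fmt.Errorf("expected 7 tab-separated fields, got %d", len(fields))
	}
	expires, err := strconv.ParseInt(fields[4], 10, 64)
	if err != nil {
		return cookie{}, false, fmt.Errorf("invalid expiry %q: %w", fields[4], err)
	}
	return cookie{
		Domain:            fields[0],
		IncludeSubdomains: fields[1] == "TRUE",
		Path:              fields[2],
		Secure:            fields[3] == "TRUE",
		Expires:           time.Unix(expires, 0),
		Name:              fields[5],
		Value:             fields[6],
	}, true, nil
}

func main() {
	c, ok, err := parseCookieLine(".example.com\tTRUE\t/\tTRUE\t1767225600\tsession\tabc123")
	fmt.Println(c.Name, c.Value, c.Secure, ok, err)
}
```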

Use a read-only cookie jar (cookies are loaded but changes aren't saved back):

roJar := extractor.ReadOnlyCookieJar{Jar: jar}
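
A read-only jar can be understood as a thin wrapper that forwards reads and drops writes. The sketch below shows that idea with a deliberately small, hypothetical CookieJar interface; the real interface, its method names, and whether writes are ignored silently or reported as errors are not documented here and are assumptions.

```go
package main

import "fmt"

// Cookie and CookieJar are simplified, hypothetical stand-ins.
type Cookie struct{ Name, Value string }

type CookieJar interface {
	GetAll() ([]Cookie, error)
	Set(c Cookie) error
}

// memoryJar is a trivial in-memory jar used as the wrapped store.
type memoryJar struct{ cookies []Cookie }

func (m *memoryJar) GetAll() ([]Cookie, error) {
	// Return a copy so callers cannot mutate the jar's slice.
	return append([]Cookie(nil), m.cookies...), nil
}

func (m *memoryJar) Set(c Cookie) error {
	m.cookies = append(m.cookies, c)
	return nil
}

// readOnlyJar forwards reads to Jar and discards writes (assumed behavior).
type readOnlyJar struct{ Jar CookieJar }

func (r readOnlyJar) GetAll() ([]Cookie, error) { return r.Jar.GetAll() }

func (r readOnlyJar) Set(Cookie) error { return nil }

func main() {
	store := &memoryJar{cookies: []Cookie{{Name: "session", Value: "abc123"}}}
	var jar CookieJar = readOnlyJar{Jar: store}

	_ = jar.Set(Cookie{Name: "tracking", Value: "xyz"}) // dropped by the wrapper
	all, _ := jar.GetAll()
	fmt.Println(len(all), all[0].Name)
}
```

Because the wrapper satisfies the same interface as the jar it wraps, it can be passed wherever a jar is expected, as in the CookieJar field of BrowserOptions.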

Interactive Browser

For remote browser control with mouse/keyboard:

ib, err := extractor.NewInteractiveBrowser(ctx)
if err != nil {
    log.Fatal(err)
}
defer ib.Close()

url, err := ib.Navigate("https://example.com")
err = ib.MouseClick(100, 200, "left")
err = ib.KeyboardType("search query")
err = ib.KeyboardPress("Enter")
screenshot, err := ib.Screenshot(80)  // JPEG quality 0-100

Command-Line Tools

The cmd/ and sites/*/cmd/ directories contain CLI tools:

# Extract article from URL
go run ./cmd/browser https://example.com/article

# Search DuckDuckGo
go run ./sites/duckduckgo/cmd/duckduckgo "golang tutorial"

# Search Google
go run ./sites/google/cmd/google "golang tutorial"

# Get Powerball results
go run ./sites/powerball/cmd/powerball

# Get Mega Millions results
go run ./sites/megamillions/cmd/megamillions

# Archive a page
go run ./sites/archive/cmd/archive https://example.com/page

# Get most common user agent
go run ./sites/useragents/cmd/useragents

Project Structure

go-extractor/
├── article.go           # Article struct (readability output)
├── browser.go           # Browser interface and Playwright implementation
├── browser_init.go      # Browser initialization and option merging
├── close.go             # DeferClose helper
├── cookiejar.go         # Cookie/CookieJar types and ReadOnlyCookieJar
├── cookies_txt.go       # cookies.txt file parser and staticCookieJar
├── document.go          # Document interface (page wrapper)
├── interactive.go       # InteractiveBrowser for remote control
├── node.go              # Node interface (DOM element wrapper)
├── nodes.go             # Nodes collection type
├── playwright.go        # Playwright browser implementation
├── readability.go       # Readability article extraction
├── cmd/
│   └── browser/         # CLI tool for article extraction
├── sites/
│   ├── aislegopher/     # AisleGopher price extraction
│   ├── archive/         # archive.ph integration
│   ├── duckduckgo/      # DuckDuckGo search
│   ├── google/          # Google search
│   ├── megamillions/    # Mega Millions lottery
│   ├── powerball/       # Powerball lottery
│   ├── useragents/      # useragents.me lookup
│   └── wegmans/         # Wegmans price extraction
└── *_test.go            # Unit tests

Requirements

- Go 1.24+
- Playwright browsers installed (playwright install)
- Optional: Playwright server for remote browser execution
