69603b7caeaa2dd7315cff3c343b3e0b829b74f7
go-extractor
A Go library for browser-based web scraping and content extraction, powered by Playwright.
Features
- Browser automation via Playwright (Chromium, Firefox, WebKit)
- Readability extraction — extract article content from any page using Mozilla's readability algorithm
- Interactive browser control — mouse, keyboard, screenshots for remote browser sessions
- Cookie management — load/save cookies from
cookies.txtfiles, read-only cookie jars - Remote browser support — connect to Playwright server instances or fall back to local browsers
- Site-specific extractors for:
- DuckDuckGo search (with pagination)
- Google search
- Powerball lottery results
- Mega Millions lottery results
- Wegmans grocery prices
- AisleGopher grocery prices
- archive.ph archival
- useragents.me user-agent lookup
Installation
go get gitea.stevedudenhoeffer.com/steve/go-extractor
Playwright browsers must be installed:
go run github.com/playwright-community/playwright-go/cmd/playwright install
Quick Start
Extract article content from a URL
package main
import (
"context"
"fmt"
"log"
extractor "gitea.stevedudenhoeffer.com/steve/go-extractor"
)
func main() {
ctx := context.Background()
browser, err := extractor.NewBrowser(ctx)
if err != nil {
log.Fatal(err)
}
defer browser.Close()
doc, err := browser.Open(ctx, "https://example.com/article", extractor.OpenPageOptions{})
if err != nil {
log.Fatal(err)
}
defer doc.Close()
article, err := extractor.Readability(ctx, doc)
if err != nil {
log.Fatal(err)
}
fmt.Println("Title:", article.Title)
fmt.Println("Content:", article.TextContent)
}
Take a screenshot
data, err := extractor.Screenshot(ctx, "https://example.com", 30*time.Second)
if err != nil {
log.Fatal(err)
}
os.WriteFile("screenshot.png", data, 0644)
Search DuckDuckGo
import "gitea.stevedudenhoeffer.com/steve/go-extractor/sites/duckduckgo"
results, err := duckduckgo.DefaultConfig.Search(ctx, browser, "golang web scraping")
for _, r := range results {
fmt.Printf("%s - %s\n", r.Title, r.URL)
}
Use with Playwright server
Set environment variables to connect to a remote Playwright instance:
export PLAYWRIGHT_SERVER_ADDRESS_FIREFOX=ws://playwright-server:3000
export PLAYWRIGHT_SERVER_ADDRESS_CHROMIUM=ws://playwright-server:3001
Or pass the address directly:
browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
ServerAddress: "ws://playwright-server:3000",
RequireServer: true, // fail instead of falling back to local
})
Browser Options
extractor.BrowserOptions{
UserAgent: "custom-agent", // defaults to a recent Firefox UA
Browser: extractor.BrowserFirefox, // or BrowserChromium, BrowserWebKit
Timeout: &timeout, // default 30s, 0 for no timeout
CookieJar: jar, // load/save cookies automatically
ShowBrowser: extractor.Bool(true), // show browser window (non-headless)
Dimensions: extractor.Size{1280, 720},
DarkMode: true,
ServerAddress: "ws://...", // remote Playwright server
RequireServer: true, // don't fall back to local browser
UseLocalOnly: true, // don't try remote server
}
DOM Interaction
Documents and Nodes expose CSS selector-based DOM manipulation:
// Select elements
nodes := doc.Select("div.results a")
first := doc.SelectFirst("h1")
// Extract text
text, err := first.Text()
content, err := first.Content()
href, err := first.Attr("href")
// Interact
err = first.Click()
err = first.Type("hello world")
// Iterate
err = doc.ForEach("li.item", func(n extractor.Node) error {
text, _ := n.Text()
fmt.Println(text)
return nil
})
// Modify
err = first.SetHidden(true)
err = first.SetAttribute("data-processed", "true")
Cookie Management
Load cookies from a Netscape cookies.txt file:
jar, err := extractor.LoadCookiesFile("cookies.txt")
browser, err := extractor.NewBrowser(ctx, extractor.BrowserOptions{
CookieJar: jar,
})
Use a read-only cookie jar (cookies are loaded but changes aren't saved back):
roJar := extractor.ReadOnlyCookieJar{Jar: jar}
Interactive Browser
For remote browser control with mouse/keyboard:
ib, err := extractor.NewInteractiveBrowser(ctx)
defer ib.Close()
url, err := ib.Navigate("https://example.com")
err = ib.MouseClick(100, 200, "left")
err = ib.KeyboardType("search query")
err = ib.KeyboardPress("Enter")
screenshot, err := ib.Screenshot(80) // JPEG quality 0-100
Command-Line Tools
The cmd/ and sites/*/cmd/ directories contain CLI tools:
# Extract article from URL
go run ./cmd/browser https://example.com/article
# Search DuckDuckGo
go run ./sites/duckduckgo/cmd/duckduckgo "golang tutorial"
# Search Google
go run ./sites/google/cmd/google "golang tutorial"
# Get Powerball results
go run ./sites/powerball/cmd/powerball
# Get Mega Millions results
go run ./sites/megamillions/cmd/megamillions
# Archive a page
go run ./sites/archive/cmd/archive https://example.com/page
# Get most common user agent
go run ./sites/useragents/cmd/useragents
Project Structure
go-extractor/
├── article.go # Article struct (readability output)
├── browser.go # Browser interface and Playwright implementation
├── browser_init.go # Browser initialization and option merging
├── close.go # DeferClose helper
├── cookiejar.go # Cookie/CookieJar types and ReadOnlyCookieJar
├── cookies_txt.go # cookies.txt file parser and staticCookieJar
├── document.go # Document interface (page wrapper)
├── interactive.go # InteractiveBrowser for remote control
├── node.go # Node interface (DOM element wrapper)
├── nodes.go # Nodes collection type
├── playwright.go # Playwright browser implementation
├── readability.go # Readability article extraction
├── cmd/
│ └── browser/ # CLI tool for article extraction
├── sites/
│ ├── aislegopher/ # AisleGopher price extraction
│ ├── archive/ # archive.ph integration
│ ├── duckduckgo/ # DuckDuckGo search
│ ├── google/ # Google search
│ ├── megamillions/ # Mega Millions lottery
│ ├── powerball/ # Powerball lottery
│ ├── useragents/ # useragents.me lookup
│ └── wegmans/ # Wegmans price extraction
└── *_test.go # Unit tests
Requirements
- Go 1.24+
- Playwright browsers installed (
playwright install) - Optional: Playwright server for remote browser execution
Description
Languages
Go
100%