The Secure field was dropped in both Playwright<->internal cookie
conversion functions, causing cookies with __Secure-/__Host- prefixes
to be rejected by Chromium. Additionally, batch AddCookies meant one
invalid cookie would fail browser creation entirely.
Changes:
- Map Secure field in cookieToPlaywrightOptionalCookie and
  playwrightCookieToCookie
- Add cookies one-by-one with slog.Warn on failure instead of
  failing the entire batch
- Add unit tests for both conversion functions
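The one-by-one insertion described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the Cookie struct, the cookieAdder interface, and strictAdder are hypothetical stand-ins for the real cookie type and the Playwright browser context.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
	"strings"
)

// Cookie is a simplified, hypothetical stand-in for the library's cookie type.
type Cookie struct {
	Name   string
	Value  string
	Secure bool
}

// cookieAdder is a hypothetical abstraction over the browser context.
type cookieAdder interface {
	AddCookie(c Cookie) error
}

// addCookiesIndividually adds each cookie on its own and logs a warning for
// every failure instead of aborting. It returns the number of cookies accepted.
func addCookiesIndividually(ctx cookieAdder, cookies []Cookie) int {
	added := 0
	for _, c := range cookies {
		if err := ctx.AddCookie(c); err != nil {
			slog.Warn("skipping invalid cookie", "name", c.Name, "error", err)
			continue
		}
		added++
	}
	return added
}

// strictAdder mimics the browser rule that __Secure-/__Host- cookies
// must carry the Secure attribute.
type strictAdder struct{ accepted []Cookie }

func (s *strictAdder) AddCookie(c Cookie) error {
	prefixed := strings.HasPrefix(c.Name, "__Secure-") || strings.HasPrefix(c.Name, "__Host-")
	if prefixed && !c.Secure {
		return errors.New("prefixed cookie requires the Secure attribute")
	}
	s.accepted = append(s.accepted, c)
	return nil
}

func main() {
	adder := &strictAdder{}
	n := addCookiesIndividually(adder, []Cookie{
		{Name: "__Secure-session", Value: "abc", Secure: true},
		{Name: "__Host-token", Value: "def"}, // Secure missing: rejected and logged
		{Name: "theme", Value: "dark"},
	})
	fmt.Println("cookies added:", n)
}
```

The second cookie shows what a dropped Secure field leads to: it is rejected and logged, while the remaining cookies are still added.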
Closes #75
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace static stealthChromiumScripts and stealthFirefoxScripts slices
with builder functions that accept hardware profile structs. Each browser
session now randomly selects from a pool of 6 realistic profiles per
engine, and Chromium connection stats receive per-session jitter (±20ms
RTT, ±2 Mbps downlink). This prevents anti-bot systems from correlating
sessions via identical WebGL, connection, mozInnerScreen, and
hardwareConcurrency fingerprints.
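A rough sketch of the profile selection and jitter, under stated assumptions: the chromiumProfile struct, its fields, the placeholder pool entries, and the injected JavaScript are all hypothetical, and the real pool holds 6 profiles per engine rather than the two shown here.

```go
package main

import (
	"fmt"
	"math/rand"
)

// chromiumProfile is a hypothetical hardware profile; field names are assumptions.
type chromiumProfile struct {
	WebGLRenderer string
	Cores         int
	RTTMillis     int
	DownlinkMbps  float64
}

// Placeholder pool. The values are illustrative only.
var chromiumProfiles = []chromiumProfile{
	{WebGLRenderer: "example-renderer-a", Cores: 8, RTTMillis: 50, DownlinkMbps: 10},
	{WebGLRenderer: "example-renderer-b", Cores: 12, RTTMillis: 100, DownlinkMbps: 8},
}

// newRNG returns a seeded random source so sessions (or tests) can be reproduced.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// jitter perturbs the connection stats by up to 20 ms RTT and 2 Mbps downlink
// in either direction, leaving the hardware fields untouched.
func jitter(p chromiumProfile, rng *rand.Rand) chromiumProfile {
	p.RTTMillis += rng.Intn(41) - 20      // integer offset in [-20, +20]
	p.DownlinkMbps += rng.Float64()*4 - 2 // offset in [-2, +2)
	return p
}

// pickProfile selects a random profile from the pool and applies jitter.
func pickProfile(rng *rand.Rand) chromiumProfile {
	return jitter(chromiumProfiles[rng.Intn(len(chromiumProfiles))], rng)
}

// buildConnectionScript renders an init script from a profile (illustrative JS).
func buildConnectionScript(p chromiumProfile) string {
	return fmt.Sprintf(
		"Object.defineProperty(navigator, 'hardwareConcurrency', {get: () => %d});\n"+
			"Object.defineProperty(navigator.connection, 'rtt', {get: () => %d});\n"+
			"Object.defineProperty(navigator.connection, 'downlink', {get: () => %.1f});",
		p.Cores, p.RTTMillis, p.DownlinkMbps)
}

func main() {
	rng := newRNG(rand.Int63())
	fmt.Println(buildConnectionScript(pickProfile(rng)))
}
```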
Closes #71
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NewBrowser previously had no viewport (strong headless signal) and used a
Firefox User-Agent unconditionally, even for Chromium instances (detectable
mismatch).
Add per-engine UA constants (DefaultFirefoxUserAgent, DefaultChromiumUserAgent)
and auto-select the matching UA in initBrowser when the caller hasn't set one
explicitly. Keep DefaultUserAgent as a backward-compatible alias.
Add 1920x1080 default viewport to NewBrowser (most common desktop resolution).
NewInteractiveBrowser keeps its existing 1280x720 default but also gains
engine-aware UA selection.
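The defaulting logic can be sketched as below. The options struct, applyDefaults, and the placeholder UA strings are assumptions for illustration; the real constants are DefaultFirefoxUserAgent and DefaultChromiumUserAgent, whose values are not reproduced here.

```go
package main

import "fmt"

// Placeholder values standing in for the real per-engine UA constants.
const (
	defaultFirefoxUserAgent  = "placeholder-firefox-ua"
	defaultChromiumUserAgent = "placeholder-chromium-ua"
)

// options is a hypothetical subset of the browser options.
type options struct {
	Browser        string // "firefox" or "chromium"
	UserAgent      string // empty means the caller did not set one
	ViewportWidth  int
	ViewportHeight int
}

// applyDefaults fills in an engine-matching UA and a 1920x1080 viewport
// only where the caller left the fields unset.
func applyDefaults(opt options) options {
	if opt.UserAgent == "" {
		switch opt.Browser {
		case "chromium":
			opt.UserAgent = defaultChromiumUserAgent
		default:
			opt.UserAgent = defaultFirefoxUserAgent
		}
	}
	if opt.ViewportWidth == 0 || opt.ViewportHeight == 0 {
		opt.ViewportWidth, opt.ViewportHeight = 1920, 1080
	}
	return opt
}

func main() {
	fmt.Printf("%+v\n", applyDefaults(options{Browser: "chromium"}))
	fmt.Printf("%+v\n", applyDefaults(options{Browser: "firefox", UserAgent: "custom-ua"}))
}
```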
Closes #70
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The stealth system previously injected all 12 init scripts unconditionally
into every browser engine. Chromium-specific scripts (window.chrome stubs,
ANGLE WebGL strings, CDP cleanup, HeadlessChrome UA strip) were no-ops or
actively suspicious on Firefox, while Firefox-specific headless vectors
were unaddressed.
Split stealthInitScripts into three categories:
- stealthCommonScripts (4): webdriver, outerWidth/Height, permissions, Notification
- stealthChromiumScripts (8): existing Chromium-specific scripts
- stealthFirefoxScripts (5): new Firefox-specific stealth:
  - navigator.webdriver getOwnPropertyDescriptor hardening
  - WebGL renderer spoof with Mesa/Intel strings
  - mozInnerScreenX/Y non-zero spoof
  - navigator.hardwareConcurrency normalization
  - PDF.js plugin list override
browser_init.go now selects common + engine-specific scripts based on
opt.Browser. Tests updated with per-category validation and cross-
contamination checks.
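A minimal sketch of the selection step, assuming the three slices hold script source strings. The slice contents are placeholders, and scriptsForEngine and contains are hypothetical helpers, not the code in browser_init.go.

```go
package main

import "fmt"

// Placeholder script bodies; the real slices hold 4, 8, and 5 scripts.
var (
	stealthCommonScripts   = []string{"/* common: webdriver */", "/* common: permissions */"}
	stealthChromiumScripts = []string{"/* chromium: window.chrome stub */"}
	stealthFirefoxScripts  = []string{"/* firefox: mozInnerScreenX/Y */"}
)

// scriptsForEngine returns the common scripts followed by the scripts for the
// given engine. It copies the common slice so callers cannot modify it.
func scriptsForEngine(engine string) []string {
	scripts := append([]string{}, stealthCommonScripts...)
	switch engine {
	case "chromium":
		scripts = append(scripts, stealthChromiumScripts...)
	case "firefox":
		scripts = append(scripts, stealthFirefoxScripts...)
	}
	return scripts
}

// contains reports whether list includes s; handy for cross-contamination checks.
func contains(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

func main() {
	for _, engine := range []string{"chromium", "firefox"} {
		fmt.Println(engine, len(scriptsForEngine(engine)))
	}
}
```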
Closes #69
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a new site extractor for pizzint.watch, which tracks pizza shop
activity near the Pentagon as an OSINT indicator. The extractor fetches
the dashboard API and exposes DOUGHCON levels, restaurant activity, and
spike events.
Includes a CLI tool with an HTTP server mode (--serve) for embedding
the pizza status in dashboards or status displays.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The weather extractor used positional CSS selectors (div:first-child,
div:nth-child(2)) to locate the header and hourly container within the
widget section. When DuckDuckGo inserts advisory banners (e.g. wind
advisory), the extra div shifts positions and breaks extraction of
current temp, hourly data, humidity, and wind.
Replace with structural selectors:
- div:not(:has(ul)) for the header (first div without a list)
- div:has(> ul) for the hourly container (div with direct ul child)
These match elements by their content structure rather than position,
so advisory banners no longer break extraction.
Fixes #64
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When RemoveHidden is true, JavaScript is evaluated on the live page to
remove all elements with computed display:none before readability
extraction. This defends against anti-scraping honeypots that embed
prompt injections in hidden DOM elements.
The implementation uses an optional pageEvaluator interface so that the
concrete document (backed by Playwright) supports it while the Document
interface remains unchanged.
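The optional-interface pattern can be illustrated like this. Only the names Document and pageEvaluator come from the change; the method sets, the JavaScript snippet, and the two test documents are assumptions made for the sketch.

```go
package main

import "fmt"

// Document is a hypothetical, minimal version of the public interface.
type Document interface {
	Text() string
}

// pageEvaluator is the optional capability; its method set is assumed here.
type pageEvaluator interface {
	Evaluate(js string) error
}

// removeHiddenJS deletes every element whose computed display is "none".
const removeHiddenJS = `for (const el of Array.from(document.querySelectorAll('*'))) {
  if (getComputedStyle(el).display === 'none') { el.remove(); }
}`

// removeHiddenElements runs the cleanup only when the document supports
// evaluation, and reports whether it ran.
func removeHiddenElements(doc Document) (bool, error) {
	ev, ok := doc.(pageEvaluator)
	if !ok {
		return false, nil // document has no live page; nothing to do
	}
	if err := ev.Evaluate(removeHiddenJS); err != nil {
		return false, fmt.Errorf("remove hidden elements: %w", err)
	}
	return true, nil
}

// staticDoc implements Document only.
type staticDoc struct{ text string }

func (d staticDoc) Text() string { return d.text }

// liveDoc additionally implements pageEvaluator by recording scripts.
type liveDoc struct {
	staticDoc
	evaluated []string
}

func (d *liveDoc) Evaluate(js string) error {
	d.evaluated = append(d.evaluated, js)
	return nil
}

func main() {
	ran, _ := removeHiddenElements(staticDoc{text: "plain"})
	fmt.Println("static document evaluated:", ran)
	ran, _ = removeHiddenElements(&liveDoc{})
	fmt.Println("live document evaluated:", ran)
}
```

Because the capability is discovered with a type assertion, existing Document implementations (including mocks) keep compiling without a new method.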
Closes #62
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sites with infinite scroll (e.g. The Verge) load additional articles
into the DOM, which get included in readability extraction. Add
ReadabilityOptions.RemoveSelectors to strip elements by CSS selector
before parsing, avoiding the need to reimplement the readability
pipeline downstream.
Closes #60
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 7 new init scripts to cover WebGL fingerprinting, missing Chrome
APIs, permissions behavior, CDP artifacts, and HeadlessChrome UA string.
Enable Chromium's new headless mode (Channel: "chromium") when stealth
is active to use the full UI layer that is harder to detect.
Closes #58
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add anti-bot-detection evasion support to reduce blocking by sites like
archive.ph. Stealth mode is enabled by default for all browsers and applies
common evasions: navigator.webdriver override, plugin/mimeType spoofing,
window.chrome stub, and outerWidth/outerHeight fixes. For Chromium,
--disable-blink-features=AutomationControlled is also added.
New BrowserOptions fields:
- Stealth *bool: toggle stealth presets (default true)
- LaunchArgs []string: custom browser launch arguments
- InitScripts []string: JavaScript injected before page scripts
Closes #56
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DuckDuckGo's weather widget uses randomized CSS module class names that
don't match the BEM-style selectors the extractor was using. Replace all
class-based selectors with structural and attribute-based selectors:
- Identify widget via article:has(img[src*='weatherkit'])
- Use positional selectors (div:first-child, p:first-of-type, etc.)
- Extract icon hints from img[alt] attributes
- Parse precipitation from span > span structure
- Derive CurrentTemp from first hourly entry (no standalone element)
- Derive HighTemp/LowTemp from first daily forecast entry
- Use text-matching for Humidity/Wind labels
Fixes #53
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add HourlyForecast struct and Hourly field to WeatherData for hourly
temperature/condition data. Add Precipitation (int, -1 if unavailable)
and IconHint (from aria-label/title/alt attributes) to both DayForecast
and HourlyForecast. This enables downstream consumers like mort to
replace inline DuckDuckGo scraping with a single GetWeather() call.
Closes #51
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract firmware information from Bambu Lab's firmware download pages
by parsing the __NEXT_DATA__ JSON blob embedded in the page. Supports
all printer models (X1, P1, A1, A1 mini, H2D, H2S, P2S, X1E, H2D Pro).
Provides GetLatestFirmware() and GetAllFirmware() methods that return
version, release date, release notes, download URL, and MD5 checksum.
Closes #45
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add sites/recipe package with ExtractRecipe() that works on any recipe
URL. Parses JSON-LD structured data (@type: Recipe) first, with DOM
fallback. Handles @graph containers, arrays, HowToStep objects, ISO
8601 durations, and various author/yield/image formats.
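As an illustration of the ISO 8601 duration handling, here is one possible parser for the forms recipes commonly use (for example PT1H30M). The function name parseISODuration and the supported subset (days, hours, minutes, seconds with integer values) are assumptions; the package's actual parser may accept more.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"time"
)

// isoDurationRe matches P[nD][T[nH][nM][nS]] with integer components only.
var isoDurationRe = regexp.MustCompile(`^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$`)

// parseISODuration converts a restricted ISO 8601 duration to a time.Duration.
func parseISODuration(s string) (time.Duration, error) {
	m := isoDurationRe.FindStringSubmatch(s)
	if m == nil {
		return 0, fmt.Errorf("unsupported ISO 8601 duration %q", s)
	}
	// Units line up with capture groups 1..4: days, hours, minutes, seconds.
	units := []time.Duration{24 * time.Hour, time.Hour, time.Minute, time.Second}
	var total time.Duration
	found := false
	for i, unit := range units {
		if m[i+1] == "" {
			continue
		}
		n, err := strconv.Atoi(m[i+1])
		if err != nil {
			return 0, fmt.Errorf("invalid component in %q: %w", s, err)
		}
		total += time.Duration(n) * unit
		found = true
	}
	if !found {
		return 0, fmt.Errorf("empty ISO 8601 duration %q", s)
	}
	return total, nil
}

func main() {
	for _, in := range []string{"PT1H30M", "PT45M", "P1DT2H"} {
		d, err := parseISODuration(in)
		fmt.Println(in, d, err)
	}
}
```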
Closes #29
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add sites/steam package with GetGamePrice() and SearchGames() methods.
Handles regular prices, discounted games, and free-to-play titles.
Includes age gate bypass logic and currency detection.
Closes #28
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add sites/coingecko package with GetPrice() method that extracts
structured crypto price data (name, symbol, price, 24h/7d change,
market cap, volume, 24h high/low) from CoinGecko coin pages.
Includes mock-based tests and parseLargeNumber helper for T/B/M suffixes.
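A helper of this kind might look like the sketch below. Only the name parseLargeNumber and the T/B/M suffixes come from the change; the signature, the handling of "$" and thousands separators, and the error behavior are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLargeNumber converts strings such as "$2.5B" or "1,234" to a float64,
// expanding T (trillion), B (billion), and M (million) suffixes.
func parseLargeNumber(s string) (float64, error) {
	s = strings.TrimSpace(s)
	s = strings.TrimPrefix(s, "$")
	s = strings.ReplaceAll(s, ",", "")

	multiplier := 1.0
	if s != "" {
		switch s[len(s)-1] {
		case 'T', 't':
			multiplier = 1e12
		case 'B', 'b':
			multiplier = 1e9
		case 'M', 'm':
			multiplier = 1e6
		}
		if multiplier != 1 {
			s = strings.TrimSpace(s[:len(s)-1]) // drop the suffix
		}
	}

	v, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, fmt.Errorf("parse large number %q: %w", s, err)
	}
	return v * multiplier, nil
}

func main() {
	for _, in := range []string{"$2.5B", "1.5M", "1,234"} {
		v, err := parseLargeNumber(in)
		fmt.Println(in, v, err)
	}
}
```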
Closes #27
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add weather.go with GetWeather() for extracting structured weather data
(location, temp, conditions, forecast) and stock.go with GetStockQuote()
and GetStockChart() for stock data extraction and chart screenshots.
Both include mock-based tests. CSS selectors may need tuning against
the live site since DuckDuckGo's React-rendered widgets use dynamic
class names.
Closes #25, #26
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Create exported extractortest package with MockBrowser, MockDocument,
and MockNode that support selector-based responses for testing site
extractors without a real browser.
Add extraction tests for DuckDuckGo (result parsing, empty results, no
links, full search flow) and Powerball (drawing parsing, next drawing
parsing with billion/million, error cases, full GetCurrent flow).
Closes #21
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Wrap staticCookieJar in struct with sync.RWMutex for thread safety
- Add SameSite field to Cookie struct with Strict/Lax/None constants
- Update Playwright cookie conversion functions for SameSite
- Replace hardcoded 4-country switch with dynamic country code generation
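The mutex wrapping from the first bullet could look roughly like this. The Cookie fields and the Update/GetAll method names are assumptions for the sketch; only staticCookieJar and sync.RWMutex are named in the change.

```go
package main

import (
	"fmt"
	"sync"
)

// Cookie is a simplified, hypothetical cookie type.
type Cookie struct {
	Host  string
	Name  string
	Value string
}

// staticCookieJar guards its slice with an RWMutex so that concurrent
// readers do not block each other while writers get exclusive access.
type staticCookieJar struct {
	mu      sync.RWMutex
	cookies []Cookie
}

// Update replaces a cookie with the same host and name, or appends it.
func (j *staticCookieJar) Update(c Cookie) {
	j.mu.Lock()
	defer j.mu.Unlock()
	for i, existing := range j.cookies {
		if existing.Host == c.Host && existing.Name == c.Name {
			j.cookies[i] = c
			return
		}
	}
	j.cookies = append(j.cookies, c)
}

// GetAll returns a copy so callers cannot mutate the jar's internal slice.
func (j *staticCookieJar) GetAll() []Cookie {
	j.mu.RLock()
	defer j.mu.RUnlock()
	out := make([]Cookie, len(j.cookies))
	copy(out, j.cookies)
	return out
}

func main() {
	jar := &staticCookieJar{}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			jar.Update(Cookie{Host: "example.com", Name: fmt.Sprintf("c%d", n), Value: "v"})
		}(i)
	}
	wg.Wait()
	fmt.Println("cookies stored:", len(jar.GetAll()))
}
```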
Closes #20, #22, #23
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Return errors for required fields (ID, price) and log warnings for
optional fields (title, description, unit price) across all site
extractors instead of silently discarding them with _ =.
Closes #24
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extract identical numericOnly inline functions from powerball and
  megamillions into shared sites/internal/parse.NumericOnly with tests
- Extract duplicated DuckDuckGo result parsing from Search() and
  GetResults() into shared extractResults() helper
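For illustration, a shared helper like NumericOnly might be as small as the following. The behavior shown (keep ASCII digits, drop everything else) is an assumption inferred from the name, not confirmed by the change.

```go
package main

import (
	"fmt"
	"strings"
)

// NumericOnly returns s with every character except ASCII digits removed.
// Assumed behavior; the real helper may treat separators differently.
func NumericOnly(s string) string {
	var b strings.Builder
	for _, r := range s {
		if r >= '0' && r <= '9' {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(NumericOnly("Ball #07"))
	fmt.Println(NumericOnly("$1,500,000"))
}
```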
Closes #13, #14
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Define DefaultUserAgent (Firefox/147.0) in playwright.go and reference
it from NewBrowser, NewInteractiveBrowser, and CLI flags. Previously
three different UA strings existed (two at 142.0, one outdated at 133.0).
Closes #17
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change ShowBrowser from bool to *bool so nil means "don't override"
  in mergeOptions(), fixing the bug where it always overwrote the base
- Add Bool() helper for convenient *bool construction
- Align NewInteractiveBrowser default from Chromium to Firefox to match
  NewBrowser
- Update README example and CLI flags for the *bool change
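The *bool change can be sketched as follows. The reduced BrowserOptions struct and the variadic mergeOptions signature are assumptions; only the ShowBrowser field, Bool(), and the function name mergeOptions come from the change.

```go
package main

import "fmt"

// BrowserOptions is a hypothetical, reduced version of the real struct.
type BrowserOptions struct {
	ShowBrowser *bool  // nil means "don't override"
	UserAgent   string // empty means "don't override"
}

// Bool returns a pointer to v, for building *bool option values inline.
func Bool(v bool) *bool { return &v }

// mergeOptions applies each override on top of base, skipping unset fields.
// With a plain bool, an unset false would be indistinguishable from an
// explicit false and would always overwrite the base.
func mergeOptions(base BrowserOptions, overrides ...BrowserOptions) BrowserOptions {
	for _, o := range overrides {
		if o.ShowBrowser != nil {
			base.ShowBrowser = o.ShowBrowser
		}
		if o.UserAgent != "" {
			base.UserAgent = o.UserAgent
		}
	}
	return base
}

func main() {
	base := BrowserOptions{ShowBrowser: Bool(true), UserAgent: "base-ua"}

	kept := mergeOptions(base, BrowserOptions{UserAgent: "override-ua"})
	fmt.Println(*kept.ShowBrowser, kept.UserAgent)

	hidden := mergeOptions(base, BrowserOptions{ShowBrowser: Bool(false)})
	fmt.Println(*hidden.ShowBrowser, hidden.UserAgent)
}
```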
Closes #15, #16
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- playwright.go: check error from page.Context().Cookies() before
  iterating over results, preventing silent failures
- archive.go: replace time.Sleep(5s) with context-aware select using
  time.After, allowing the operation to be cancelled promptly
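The context-aware wait in archive.go follows a common Go pattern, sketched here. The helper names sleepContext and cancelledSleepReturnsEarly are hypothetical; the actual code may inline the select.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sleepContext waits for d to elapse or for ctx to be cancelled, whichever
// comes first. It returns ctx.Err() when the wait is cut short.
func sleepContext(ctx context.Context, d time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(d):
		return nil
	}
}

// cancelledSleepReturnsEarly demonstrates that a cancelled context ends a
// 5-second wait almost immediately.
func cancelledSleepReturnsEarly() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // cancel before waiting
	start := time.Now()
	err := sleepContext(ctx, 5*time.Second)
	return errors.Is(err, context.Canceled) && time.Since(start) < time.Second
}

func main() {
	fmt.Println("returned early:", cancelledSleepReturnsEarly())
	fmt.Println("short sleep error:", sleepContext(context.Background(), 10*time.Millisecond))
}
```

Unlike time.Sleep, the select returns as soon as the caller's context is cancelled or its deadline passes.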
Closes #7, #18
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>