Commit Graph

80 Commits

Author SHA1 Message Date
c2768e2b05 feature: add IMDB movie/TV extractor
All checks were successful
CI / test (pull_request) Successful in 46s
CI / vet (pull_request) Successful in 47s
CI / build (pull_request) Successful in 1m18s
Add sites/imdb package with GetMovie() and Search() methods. Extracts
title, year, rating, votes, runtime, genres, director, cast, plot,
poster, and box office data. Uses JSON-LD parsing with DOM fallback.
Supports Movie, TVSeries, and TVMiniSeries types.

Closes #30

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:54:30 +00:00
100e53531b Merge pull request 'feature: add recipe extractor with JSON-LD and DOM parsing' (#48) from feature/allrecipes-extractor into main
All checks were successful
CI / build (push) Successful in 1m3s
CI / vet (push) Successful in 1m5s
CI / test (push) Successful in 1m8s
2026-02-15 16:52:47 +00:00
de0a065923 feature: add recipe extractor with JSON-LD and DOM parsing
All checks were successful
CI / build (pull_request) Successful in 57s
CI / vet (pull_request) Successful in 1m2s
CI / test (pull_request) Successful in 1m5s
Add sites/recipe package with ExtractRecipe() that works on any recipe
URL. Parses JSON-LD structured data (@type: Recipe) first, with DOM
fallback. Handles @graph containers, arrays, HowToStep objects, ISO
8601 durations, and various author/yield/image formats.

Closes #29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:52:28 +00:00
ec27c7e2e0 Merge pull request 'feature: add Steam Store game price extractor' (#47) from feature/steam-extractor into main
All checks were successful
CI / build (push) Successful in 42s
CI / vet (push) Successful in 44s
CI / test (push) Successful in 1m23s
2026-02-15 16:50:46 +00:00
b1137f2ebc feature: add Steam Store game price extractor
All checks were successful
CI / vet (pull_request) Successful in 1m24s
CI / build (pull_request) Successful in 1m24s
CI / test (pull_request) Successful in 1m28s
Add sites/steam package with GetGamePrice() and SearchGames() methods.
Handles regular prices, discounted games, and free-to-play titles.
Includes age gate bypass logic and currency detection.

Closes #28

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:50:27 +00:00
69603b7cae Merge pull request 'feature: add CoinGecko cryptocurrency price extractor' (#46) from feature/coingecko-extractor into main
All checks were successful
CI / test (push) Successful in 44s
CI / vet (push) Successful in 1m19s
CI / build (push) Successful in 1m23s
2026-02-15 16:48:08 +00:00
349b1b9c6b feature: add CoinGecko cryptocurrency price extractor
All checks were successful
CI / build (pull_request) Successful in 46s
CI / vet (pull_request) Successful in 1m20s
CI / test (pull_request) Successful in 1m23s
Add sites/coingecko package with GetPrice() method that extracts
structured crypto price data (name, symbol, price, 24h/7d change,
market cap, volume, 24h high/low) from CoinGecko coin pages.

Includes mock-based tests and parseLargeNumber helper for T/B/M suffixes.

Closes #27

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:47:53 +00:00
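
Note on 349b1b9c6b: the commit names a parseLargeNumber helper for T/B/M suffixes but does not show it. The sketch below is one plausible shape for it. The signature, the handling of "$" and thousands separators, and the error behavior are all assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLargeNumber converts strings such as "$1.5B" or "350M" to a float64.
// Hypothetical sketch: the real helper's signature and accepted formats
// are not shown in the commit message.
func parseLargeNumber(s string) (float64, error) {
	s = strings.TrimSpace(s)
	s = strings.TrimPrefix(s, "$")
	s = strings.ReplaceAll(s, ",", "")
	if s == "" {
		return 0, fmt.Errorf("empty number")
	}
	mult := 1.0
	switch s[len(s)-1] {
	case 'T', 't':
		mult = 1e12
	case 'B', 'b':
		mult = 1e9
	case 'M', 'm':
		mult = 1e6
	}
	if mult != 1.0 {
		s = s[:len(s)-1] // drop the suffix before parsing
	}
	v, err := strconv.ParseFloat(strings.TrimSpace(s), 64)
	if err != nil {
		return 0, fmt.Errorf("parse %q: %w", s, err)
	}
	return v * mult, nil
}

func main() {
	for _, in := range []string{"$1.5B", "2T", "350M", "1,234.5", "abc"} {
		v, err := parseLargeNumber(in)
		fmt.Println(in, v, err)
	}
}
```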
d0b3131d98 Merge pull request 'feature: add DuckDuckGo weather and stock widget extractors' (#44) from feature/duckduckgo-widgets into main
All checks were successful
CI / vet (push) Successful in 29s
CI / build (push) Successful in 50s
CI / test (push) Successful in 50s
2026-02-15 16:43:07 +00:00
461b704792 feature: add DuckDuckGo weather and stock widget extractors
All checks were successful
CI / vet (pull_request) Successful in 29s
CI / build (pull_request) Successful in 46s
CI / test (pull_request) Successful in 48s
Add weather.go with GetWeather() for extracting structured weather data
(location, temp, conditions, forecast) and stock.go with GetStockQuote()
and GetStockChart() for stock data extraction and chart screenshots.

Both include mock-based tests. CSS selectors may need tuning against
the live site since DuckDuckGo's React-rendered widgets use dynamic
class names.

Closes #25, #26
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:40:53 +00:00
dcc977c0cc Merge pull request 'Mock-based site extractor test infrastructure' (#43) from test/site-extractor-mocks into main
All checks were successful
CI / test (push) Successful in 1m4s
CI / build (push) Successful in 1m7s
CI / vet (push) Successful in 1m7s
2026-02-15 16:38:15 +00:00
198906946b test: add mock-based site extractor test infrastructure
All checks were successful
CI / vet (pull_request) Successful in 1m5s
CI / build (pull_request) Successful in 1m6s
CI / test (pull_request) Successful in 1m6s
Create exported extractortest package with MockBrowser, MockDocument,
and MockNode that support selector-based responses for testing site
extractors without a real browser.

Add extraction tests for DuckDuckGo (result parsing, empty results, no
links, full search flow) and Powerball (drawing parsing, next drawing
parsing with billion/million, error cases, full GetCurrent flow).

Closes #21
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:37:58 +00:00
ddb701fca0 Merge pull request 'Thread-safe CookieJar, SameSite, Google countries' (#42) from enhance/cookies-and-google into main
All checks were successful
CI / vet (push) Successful in 41s
CI / build (push) Successful in 1m23s
CI / test (push) Successful in 1m24s
2026-02-15 16:35:10 +00:00
963696cd62 enhance: thread-safe CookieJar, SameSite cookie attr, dynamic Google countries
All checks were successful
CI / vet (pull_request) Successful in 40s
CI / build (pull_request) Successful in 1m22s
CI / test (pull_request) Successful in 1m28s
- Wrap staticCookieJar in struct with sync.RWMutex for thread safety
- Add SameSite field to Cookie struct with Strict/Lax/None constants
- Update Playwright cookie conversion functions for SameSite
- Replace hardcoded 4-country switch with dynamic country code generation

Closes #20, #22, #23
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:34:54 +00:00
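
Note on 963696cd62: wrapping a cookie slice in a struct guarded by sync.RWMutex can be sketched as below. The type and method names (safeJar, Set, All) and the Cookie fields are illustrative assumptions. Only the use of sync.RWMutex and the Strict/Lax/None SameSite values come from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// SameSite values as described in the commit; the string type is an assumption.
const (
	SameSiteStrict = "Strict"
	SameSiteLax    = "Lax"
	SameSiteNone   = "None"
)

// Cookie is a reduced, hypothetical version of the real struct.
type Cookie struct {
	Name, Value, SameSite string
}

// safeJar guards its slice with an RWMutex so that concurrent readers
// do not block each other while writers get exclusive access.
type safeJar struct {
	mu      sync.RWMutex
	cookies []Cookie
}

// Set replaces a cookie with the same name or appends a new one.
func (j *safeJar) Set(c Cookie) {
	j.mu.Lock()
	defer j.mu.Unlock()
	for i := range j.cookies {
		if j.cookies[i].Name == c.Name {
			j.cookies[i] = c
			return
		}
	}
	j.cookies = append(j.cookies, c)
}

// All returns a copy so callers cannot mutate the jar without the lock.
func (j *safeJar) All() []Cookie {
	j.mu.RLock()
	defer j.mu.RUnlock()
	out := make([]Cookie, len(j.cookies))
	copy(out, j.cookies)
	return out
}

func main() {
	jar := &safeJar{}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			jar.Set(Cookie{Name: fmt.Sprintf("c%d", n), Value: "v", SameSite: SameSiteLax})
		}(i)
	}
	wg.Wait()
	fmt.Println(len(jar.All()))
}
```

Returning a copy from All is a design choice in this sketch: it keeps the lock's protection from leaking away through a shared slice.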
0ba9cc9b98 Merge pull request 'Fix silently ignored parsing errors (#24)' (#41) from fix/silent-parsing-errors into main
All checks were successful
CI / build (push) Successful in 42s
CI / vet (push) Successful in 44s
CI / test (push) Successful in 1m25s
2026-02-15 16:32:14 +00:00
a9711ce904 fix: surface parsing errors instead of silently discarding them
All checks were successful
CI / vet (pull_request) Successful in 1m10s
CI / build (pull_request) Successful in 1m21s
CI / test (pull_request) Successful in 1m28s
Return errors for required fields (ID, price) and log warnings for
optional fields (title, description, unit price) across all site
extractors instead of silently discarding them with _ =.

Closes #24
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:31:56 +00:00
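
Note on a9711ce904: the policy described in the commit (return an error when a required field fails to parse, log a warning when an optional one does) might look roughly like this. The Item type, parseItem, and its parameters are hypothetical and stand in for whatever the site extractors actually use.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"strconv"
)

// Item is a hypothetical extracted record.
type Item struct {
	ID    string
	Price float64
	Title string
}

// parseItem fails on required fields (ID, price) and only warns on an
// optional one (title), instead of discarding errors with "_ =".
func parseItem(id, price, title string) (Item, error) {
	if id == "" {
		return Item{}, errors.New("missing required field: ID")
	}
	p, err := strconv.ParseFloat(price, 64)
	if err != nil {
		return Item{}, fmt.Errorf("parse required field price %q: %w", price, err)
	}
	if title == "" {
		log.Printf("warning: item %s has no title", id)
	}
	return Item{ID: id, Price: p, Title: title}, nil
}

func main() {
	fmt.Println(parseItem("42", "3.99", "Apples"))
	fmt.Println(parseItem("43", "n/a", "Pears"))
}
```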
7f24e97131 Merge pull request 'Deduplicate helpers (#13, #14)' (#40) from refactor/deduplicate-helpers into main
All checks were successful
CI / test (push) Successful in 31s
CI / vet (push) Successful in 46s
CI / build (push) Successful in 46s
2026-02-15 16:28:55 +00:00
132817144e refactor: deduplicate numericOnly and DuckDuckGo result extraction
All checks were successful
CI / build (pull_request) Successful in 29s
CI / vet (pull_request) Successful in 1m1s
CI / test (pull_request) Successful in 1m4s
- Extract identical numericOnly inline functions from powerball and
  megamillions into shared sites/internal/parse.NumericOnly with tests
- Extract duplicated DuckDuckGo result parsing from Search() and
  GetResults() into shared extractResults() helper

Closes #13, #14

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:26:54 +00:00
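
Note on 132817144e: the commit moves numericOnly into sites/internal/parse.NumericOnly without showing its body. Assuming it simply keeps ASCII digits and drops everything else (the real helper may differ, for example in how it treats decimal points), a sketch would be:

```go
package main

import (
	"fmt"
	"strings"
)

// NumericOnly returns s with every non-digit rune removed.
// The exact behavior of the repository's helper is an assumption.
func NumericOnly(s string) string {
	return strings.Map(func(r rune) rune {
		if r >= '0' && r <= '9' {
			return r
		}
		return -1 // a negative value tells strings.Map to drop the rune
	}, s)
}

func main() {
	fmt.Println(NumericOnly("$1,234 Million"))
}
```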
384566e016 Merge pull request 'Consolidate user-agent strings (#17)' (#39) from fix/user-agent-consistency into main
All checks were successful
CI / build (push) Successful in 1m34s
CI / vet (push) Successful in 1m34s
CI / test (push) Successful in 1m34s
2026-02-15 16:25:03 +00:00
097b2e12c7 fix: consolidate user-agent strings into DefaultUserAgent constant
All checks were successful
CI / build (pull_request) Successful in 44s
CI / test (pull_request) Successful in 46s
CI / vet (pull_request) Successful in 1m28s
Define DefaultUserAgent (Firefox/147.0) in playwright.go and reference
it from NewBrowser, NewInteractiveBrowser, and CLI flags. Previously
three different UA strings existed (two at 142.0, one outdated at 133.0).

Closes #17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:24:44 +00:00
0df639abea Merge pull request 'Fix ShowBrowser merge + consistent browser defaults (#15, #16)' (#38) from fix/merge-options-and-browser-defaults into main
All checks were successful
CI / build (push) Successful in 32s
CI / vet (push) Successful in 1m50s
CI / test (push) Successful in 1m51s
2026-02-15 16:23:07 +00:00
328455de32 fix: ShowBrowser merge behavior and consistent browser defaults
All checks were successful
CI / vet (pull_request) Successful in 1m49s
CI / build (pull_request) Successful in 1m51s
CI / test (pull_request) Successful in 1m52s
- Change ShowBrowser from bool to *bool so nil means "don't override"
  in mergeOptions(), fixing the bug where it always overwrote the base
- Add Bool() helper for convenient *bool construction
- Align NewInteractiveBrowser default from Chromium to Firefox to match
  NewBrowser
- Update README example and CLI flags for the *bool change

Closes #15, #16

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:22:49 +00:00
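
Note on 328455de32: the reason for switching ShowBrowser from bool to *bool is that a plain bool cannot distinguish "false" from "not set", so a merge always overwrites the base. A minimal sketch of the idea follows. The struct is reduced to one field and the mergeOptions signature is an assumption.

```go
package main

import "fmt"

// BrowserOptions is reduced to the one field relevant here.
type BrowserOptions struct {
	ShowBrowser *bool // nil means "don't override"
}

// Bool returns a pointer to v, for convenient *bool construction.
func Bool(v bool) *bool { return &v }

// mergeOptions applies override on top of base; the signature is assumed.
func mergeOptions(base, override BrowserOptions) BrowserOptions {
	out := base
	if override.ShowBrowser != nil {
		out.ShowBrowser = override.ShowBrowser
	}
	return out
}

func main() {
	base := BrowserOptions{ShowBrowser: Bool(true)}

	kept := mergeOptions(base, BrowserOptions{})
	fmt.Println(*kept.ShowBrowser) // base value survives a nil override

	changed := mergeOptions(base, BrowserOptions{ShowBrowser: Bool(false)})
	fmt.Println(*changed.ShowBrowser) // explicit false now overrides
}
```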
85e4632ea9 Merge pull request 'Fix updateCookies error + context-aware sleep (#7, #18)' (#37) from fix/cookies-error-and-context-sleep into main
All checks were successful
CI / test (push) Successful in 1m33s
CI / build (push) Successful in 1m35s
CI / vet (push) Successful in 1m34s
2026-02-15 16:20:06 +00:00
769b870a17 fix: check Cookies() error and use context-aware sleep
All checks were successful
CI / build (pull_request) Successful in 46s
CI / vet (pull_request) Successful in 47s
CI / test (pull_request) Successful in 1m22s
- playwright.go: check error from page.Context().Cookies() before
  iterating over results, preventing silent failures
- archive.go: replace time.Sleep(5s) with context-aware select using
  time.After, allowing the operation to be cancelled promptly

Closes #7, #18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:19:49 +00:00
8b136b9dda Merge pull request 'Fix cmd flags and defer-before-error-check (#8, #19)' (#36) from fix/cmd-flags-and-defer-ordering into main
All checks were successful
CI / vet (push) Successful in 37s
CI / test (push) Successful in 51s
CI / build (push) Successful in 52s
2026-02-15 16:18:54 +00:00
fca50a47c3 Merge pull request 'Fix DuckDuckGo error handling (#5, #6)' (#35) from fix/duckduckgo-error-handling into main
Some checks failed
CI / build (push) Has been cancelled
CI / test (push) Has been cancelled
CI / vet (push) Has been cancelled
2026-02-15 16:18:50 +00:00
991c43d020 Merge pull request 'Fix archive cmd panic on short content (#9)' (#34) from fix/archive-cmd-short-content into main
Some checks failed
CI / test (push) Has been cancelled
CI / build (push) Has been cancelled
CI / vet (push) Has been cancelled
2026-02-15 16:18:46 +00:00
2aa565d3a0 Merge pull request 'Fix nil-pointer panics (#10, #11)' (#33) from fix/nil-pointer-panics into main
Some checks failed
CI / build (push) Has been cancelled
CI / test (push) Has been cancelled
CI / vet (push) Has been cancelled
2026-02-15 16:18:41 +00:00
2af4cbcdce Merge pull request 'Fix XSS vulnerability in SetAttribute (#12)' (#32) from fix/escape-javascript-xss into main
Some checks failed
CI / build (push) Has been cancelled
CI / vet (push) Has been cancelled
CI / test (push) Has been cancelled
2026-02-15 16:18:36 +00:00
e5e0db85e8 fix: use merged flags in archive cmd and move defer after error checks
All checks were successful
CI / vet (pull_request) Successful in 29s
CI / build (pull_request) Successful in 32s
CI / test (pull_request) Successful in 57s
- Fix archive cmd passing only archive-specific Flags instead of the
  merged flags variable that includes browser flags (#8)
- Move defer DeferClose() after error checks in 6 locations to prevent
  calling Close on nil values (#19):
  - sites/duckduckgo/cmd/duckduckgo/main.go
  - sites/duckduckgo/duckduckgo.go
  - sites/google/cmd/google/main.go
  - sites/wegmans/cmd/wegmans/main.go
  - sites/wegmans/wegmans.go
  - sites/aislegopher/aislegopher.go

Closes #8, #19

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:17:38 +00:00
a12c9f7cb6 fix: propagate errors from DuckDuckGo search and GetResults
Some checks failed
CI / test (pull_request) Failing after 6m12s
CI / vet (pull_request) Failing after 6m12s
CI / build (pull_request) Failing after 6m15s
- Change SearchPage.GetResults() to return ([]Result, error) so ForEach
  errors are no longer silently discarded
- Fix Search() to return the ForEach error instead of nil
- Update cmd caller to check GetResults() errors

Closes #5, #6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:16:04 +00:00
b4e462a6b4 fix: prevent panic on short article content in archive cmd
All checks were successful
CI / vet (pull_request) Successful in 1m6s
CI / build (pull_request) Successful in 1m7s
CI / test (pull_request) Successful in 1m8s
Add length check before slicing article.Content[:32], matching the
safe truncation pattern already used in cmd/browser/main.go.

Closes #9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:14:32 +00:00
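
Note on b4e462a6b4: slicing article.Content[:32] panics when the content is shorter than 32 bytes. A length check before slicing avoids that. The helper name truncate is an assumption; the commit may apply the check inline.

```go
package main

import (
	"fmt"
	"strings"
)

// truncate returns at most the first n bytes of s.
// It slices by bytes, like the original expression, so it can split a
// multi-byte UTF-8 character at the boundary.
func truncate(s string, n int) string {
	if len(s) <= n {
		return s
	}
	return s[:n]
}

func main() {
	fmt.Println(truncate("short", 32))
	fmt.Println(len(truncate(strings.Repeat("x", 40), 32)))
}
```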
6c68062e56 fix: add nil guards to prevent nil-pointer panics
All checks were successful
CI / test (pull_request) Successful in 46s
CI / build (pull_request) Successful in 47s
CI / vet (pull_request) Successful in 59s
- document.go: check if resp is nil before calling resp.Status() in
  Refresh(), since Playwright's Reload() can return a nil response
- archive.go: check SelectFirst() results for nil before calling
  Type() and Click(), preventing panics when DOM elements are missing

Closes #10, #11

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:13:43 +00:00
6e94bfe10f fix: eliminate XSS vulnerability in SetAttribute by using Playwright arg passing
All checks were successful
CI / build (pull_request) Successful in 47s
CI / test (pull_request) Successful in 48s
CI / vet (pull_request) Successful in 1m1s
Replace string interpolation in SetAttribute with Playwright's Evaluate
argument passing mechanism. This structurally eliminates the injection
surface — arbitrary name/value strings are safely passed as JavaScript
arguments rather than interpolated into the expression string.

The vulnerable escapeJavaScript helper (which only escaped \ and ') is
removed since it is no longer needed.

Closes #12

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:12:46 +00:00
49f294e884 docs: add README.md and CLAUDE.md
All checks were successful
CI / test (push) Successful in 32s
CI / vet (push) Successful in 45s
CI / build (push) Successful in 46s
Add project documentation:
- README.md with installation, usage examples, API reference, and project structure
- CLAUDE.md with developer guide, architecture overview, conventions, and issue label docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 11:10:28 -05:00
05ca15b165 fix: add go.sum to repository for CI builds
All checks were successful
CI / build (push) Successful in 34s
CI / vet (push) Successful in 57s
CI / test (push) Successful in 1m0s
The go.sum file was not tracked, causing CI to fail with
"missing go.sum entry" errors during build/test/vet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 14:15:43 -05:00
294097c3b6 fix: use setup-go@v3 (latest available on gitea.com mirror)
Some checks failed
CI / vet (push) Failing after 26s
CI / build (push) Failing after 32s
CI / test (push) Failing after 36s
The gitea.com/actions/setup-go mirror only has tags up to v3.
v3 supports go-version-file which is all we need.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 14:13:46 -05:00
022e002f98 ci: use gitea.com action mirrors instead of github.com
Some checks failed
CI / test (push) Failing after 15s
CI / vet (push) Failing after 15s
CI / build (push) Failing after 18s
GitHub is returning 500 errors for actions/checkout and actions/setup-go.
Switch to Gitea's own mirrors at gitea.com/actions/ to avoid the dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 14:09:46 -05:00
51ce639994 ci: re-trigger workflow after transient GitHub 500 error
Some checks failed
CI / test (push) Failing after 10s
CI / build (push) Failing after 2m2s
CI / vet (push) Failing after 2m2s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 14:06:39 -05:00
cb2ed10cfd refactor: restructure API, deduplicate code, expand test coverage
Some checks failed
CI / build (push) Failing after 2m4s
CI / test (push) Failing after 2m6s
CI / vet (push) Failing after 2m19s
- Extract shared DeferClose helper, removing 14 duplicate copies
- Rename PlayWright-prefixed types to cleaner names (BrowserOptions,
  BrowserSelection, NewBrowser, etc.)
- Rename fields: ServerAddress, RequireServer (was DontLaunchOnConnectFailure)
- Extract shared initBrowser/mergeOptions into browser_init.go,
  deduplicating ~120 lines between NewBrowser and NewInteractiveBrowser
- Remove unused locator field from document struct
- Add tests for all previously untested packages (archive, aislegopher,
  wegmans, useragents, powerball) and expand existing test suites
- Add MIGRATION.md documenting all breaking API changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 13:59:47 -05:00
e7b7e78796 fix: bug fixes, test coverage, and CI workflow
Some checks failed
CI / vet (push) Failing after 15s
CI / build (push) Failing after 30s
CI / test (push) Failing after 36s
- Fix Nodes.First() panic on empty slice (return nil)
- Fix ticker leak in archive.go (create once, defer Stop)
- Fix cookie path matching for empty and root paths
- Fix lost query params in google.go (u.Query().Set was discarded)
- Fix type assertion panic in useragents.go
- Fix dropped date parse error in powerball.go
- Remove unreachable dead code in megamillions.go and powerball.go
- Simplify document.go WaitForNetworkIdle, remove unused root field
- Remove debug fmt.Println calls across codebase
- Replace panic(err) with stderr+exit in all cmd/ programs
- Fix duckduckgo cmd: remove useless defer, return error on bad safesearch
- Fix archive cmd: ToConfig returns error instead of panicking
- Add 39+ unit tests across 6 new test files
- Add Gitea Actions CI workflow (build, test, vet in parallel)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 11:14:19 -05:00
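
Note on e7b7e78796: the "lost query params" bug comes from the fact that url.URL.Query() parses RawQuery into a fresh url.Values on every call, so u.Query().Set(...) modifies a temporary copy. The fix is to keep the Values, modify it, and write it back. The function name withParam and the example URL are illustrative assumptions.

```go
package main

import (
	"fmt"
	"net/url"
)

// withParam returns raw with key=value set in its query string.
func withParam(raw, key, value string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}

	// Buggy form: u.Query().Set(key, value) changes a copy and is discarded.
	q := u.Query()
	q.Set(key, value)
	u.RawQuery = q.Encode() // write the modified values back

	return u.String(), nil
}

func main() {
	out, err := withParam("https://example.com/search?q=go", "hl", "en")
	fmt.Println(out, err)
}
```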
steve
e807dbb2ff feat: add KeyboardInsertText to InteractiveBrowser
Exposes Playwright's Keyboard.InsertText() which dispatches only an
input event (no keydown/keyup). This is essential for pasting text
into password fields and custom input components that don't handle
rapid-fire synthetic key events from Type().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 15:16:15 +00:00
52a9cb585d feat: add InteractiveBrowser API for remote browser control
Exposes low-level mouse, keyboard, screenshot, navigation, and cookie
extraction APIs via a new InteractiveBrowser interface. Designed for
interactive browser proxy sessions where direct page control is needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 02:58:00 -05:00
868acfae40 Add context support to Playwright browser initialization
Refactored Playwright initialization to ensure context propagation. Updated `NewPlayWrightBrowser` and related methods to accept `context.Context` for better cancellation and timeout handling. Improved error resilience and concurrency during browser setup.
2025-10-28 00:24:19 -04:00
82fce5a200 Handle unit suffix in price parsing and add logging
Refined price parsing logic to strip trailing periods from units (e.g., "lb." -> "lb") for better handling. Added logging for debugging extracted response data.
2025-10-20 22:36:20 -04:00
5fe7313fa4 Refine status check logic when handling document requests in Playwright
2025-10-12 20:17:04 -04:00
39c2c7d37a Add UseLocalOnly flag to connection options in Playwright logic
2025-10-12 00:17:11 -04:00
e32a6fa791 Add UseLocalOnly option to Playwright connection logic
Introduced the `UseLocalOnly` option to prevent connections to a remote Playwright server and enforce usage of the local server. Updated relevant connection logic to respect this new option.
2025-10-12 00:10:58 -04:00
afa0238758 Restrict Price assignment to unit price with "lb" only
2025-10-11 23:48:09 -04:00
9ae8619f93 Enhance price parsing to handle non-zero unit price
Updated price extraction logic to set `Price` from `UnitPrice` when it is non-zero, ensuring more accurate parsing.
2025-10-11 23:34:41 -04:00
f4caef22b0 Add timeout option to Playwright server connection
Introduced a 30-second timeout for connecting to the Playwright server. Added logging for connection attempts to improve debugging and enhance connection reliability.
2025-10-10 20:25:27 -04:00