Compare commits

...

16 Commits

Author SHA1 Message Date
203b97d957 Update default UserAgent string in PlayWrightBrowser
Changed the UserAgent to represent a macOS system using Firefox 137.0. This ensures the browser identification aligns with updated standards and improves compatibility.
2025-05-27 01:46:06 -04:00
39453288ce Add OpenSearch and SearchPage functionality for DuckDuckGo
Introduced the `OpenSearch` method and `SearchPage` interface to streamline search operations and allow for loading more results dynamically. Updated dependencies and modified the DuckDuckGo CLI to utilize these enhancements.
2025-03-18 02:42:50 -04:00
7c0e44a22f Add viewport dimensions and dark mode support
This commit introduces optional viewport dimensions and dark mode support to the PlayWrightBrowserOptions struct and its usage. It ensures more control over browser display settings and improves flexibility when configuring browser contexts. Additionally, visibility checking logic in SetHidden was refined to avoid redundant operations.
2025-03-15 00:46:02 -04:00
0f9f6c776d Rename SetVisible to SetHidden for clearer semantic meaning
The method and its implementation now align with setting an element's "hidden" property instead of "visible." This change improves code clarity and consistency with expected behavior.
2025-03-03 23:39:37 -05:00
62cb6958fa Add SetVisible and SetAttribute methods to Node interface
This commit introduces two new methods, SetVisible and SetAttribute, to the Node interface. These methods allow toggling element visibility and setting attributes dynamically. Additionally, a helper function, escapeJavaScript, was added to ensure proper escaping of JavaScript strings.
2025-03-03 23:31:51 -05:00
964a98a5a8 Handle commands without automatic reaction responses
Introduce `ErrCommandNoReactions` to allow commands to opt out of success reactions. Adjust bot behavior to respect this error and prevent reactions when applicable, ensuring cleaner and more controlled responses. Add error handling and safeguard workers against panics.
2025-01-22 21:06:07 -05:00
81ea656332 Add unit price and unit parsing for items
This update enhances the `Item` structure to include `UnitPrice` and `Unit` fields. Additional logic is implemented to extract and parse unit pricing details from the HTML, improving data accuracy and granularity.
2025-01-21 19:42:25 -05:00
6de455b1bd Add price extraction and validate URL structure in parsers
Added price field to Item struct in AisleGopher and implemented logic to extract price data. Updated Wegmans parser to validate URL structure by ensuring the second segment is "product". These changes improve data accuracy and error handling.
2025-01-20 13:00:59 -05:00
f37e60dddc Add Wegmans module to fetch item details and prices
Introduce functionality to retrieve item details, including name and price, from Wegmans using a browser-based scraper. This includes a CLI tool to execute searches and robust error handling for URL validation and browser interactions.
2025-01-20 12:28:29 -05:00
654976de82 Add AisleGopher integration for data extraction
Introduced a new package and command for extracting data from aislegopher.com, including URL parsing and item retrieval. Updated dependencies in go.mod to support the new functionality. Additionally, refined import structure in the DuckDuckGo integration.
2025-01-20 02:16:32 -05:00
e8de488d2b Update CSS selector for extracting titles in DuckDuckGo parser
Replaced the overly complex CSS selector with a simplified "h2" selector for extracting titles. This change improves maintainability and ensures accurate title extraction from the updated DOM structure.
2025-01-16 21:37:38 -05:00
67a3552747 Add DuckDuckGo integration for search functionality
Implemented a DuckDuckGo search module with configurable SafeSearch and regional settings. Added a CLI tool to perform searches via DuckDuckGo using browser automation, supporting flags for customization.
2025-01-16 20:45:37 -05:00
eec94ec708 Reorder imports in main.go for better organization.
Moved the local package import to align with standard Go import grouping conventions. This improves code readability and maintains a consistent structure.
2025-01-16 20:45:23 -05:00
691ae400d1 Add Google search integration with CLI support
Introduce a Google search integration, including a Go package for performing searches with configurable parameters (e.g., language, region) and a CLI tool for executing search queries. Refactor archive CLI import ordering for consistency.
2025-01-16 16:56:05 -05:00
2ca2bb0742 close playwright instance on browser close 2025-01-01 22:48:12 -05:00
8ad5a34f2d Added global screenshot shortcut 2024-12-26 22:20:07 -05:00
13 changed files with 1024 additions and 14 deletions

go.mod

@@ -3,19 +3,19 @@ module gitea.stevedudenhoeffer.com/steve/go-extractor
 go 1.23.2

 require (
-	github.com/go-shiori/go-readability v0.0.0-20241012063810-92284fa8a71f
-	github.com/playwright-community/playwright-go v0.4802.0
+	github.com/go-shiori/go-readability v0.0.0-20250217085726-9f5bf5ca7612
+	github.com/playwright-community/playwright-go v0.5001.0
+	github.com/urfave/cli/v3 v3.0.0-beta1
+	golang.org/x/text v0.23.0
 )

 require (
-	github.com/andybalholm/cascadia v1.3.2 // indirect
+	github.com/andybalholm/cascadia v1.3.3 // indirect
 	github.com/araddon/dateparse v0.0.0-20210429162001-6b43995a97de // indirect
-	github.com/deckarep/golang-set/v2 v2.6.0 // indirect
-	github.com/go-jose/go-jose/v3 v3.0.3 // indirect
+	github.com/deckarep/golang-set/v2 v2.8.0 // indirect
+	github.com/go-jose/go-jose/v3 v3.0.4 // indirect
 	github.com/go-shiori/dom v0.0.0-20230515143342-73569d674e1c // indirect
 	github.com/go-stack/stack v1.8.1 // indirect
 	github.com/gogs/chardet v0.0.0-20211120154057-b7413eaefb8f // indirect
-	github.com/urfave/cli/v3 v3.0.0-beta1 // indirect
-	golang.org/x/net v0.32.0 // indirect
-	golang.org/x/text v0.21.0 // indirect
+	golang.org/x/net v0.37.0 // indirect
 )

node.go

@@ -1,6 +1,9 @@
package extractor
import (
"fmt"
"strings"
"github.com/playwright-community/playwright-go"
)
@@ -17,6 +20,9 @@ type Node interface {
SelectFirst(selector string) Node
ForEach(selector string, fn func(Node) error) error
SetHidden(val bool) error
SetAttribute(name, value string) error
}
type node struct {
@@ -79,3 +85,30 @@ func (n node) ForEach(selector string, fn func(Node) error) error {
return nil
}
func (n node) SetHidden(val bool) error {
visible, err := n.locator.IsVisible()
if err != nil {
return fmt.Errorf("error checking visibility: %w", err)
}
if visible == !val {
return nil
}
// Set the hidden property
_, err = n.locator.Evaluate(fmt.Sprintf(`(element) => element.hidden = %t;`, val), nil)
if err != nil {
return fmt.Errorf("error setting hidden property: %w", err)
}
return nil
}
func escapeJavaScript(s string) string {
return strings.Replace(strings.Replace(s, "\\", "\\\\", -1), "'", "\\'", -1)
}
func (n node) SetAttribute(name, value string) error {
_, err := n.locator.Evaluate(fmt.Sprintf(`(element) => element.setAttribute('%s', '%s');`, escapeJavaScript(name), escapeJavaScript(value)), nil)
return err
}
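As a sketch of what the new helper produces, here is a self-contained replica of `escapeJavaScript` and of the snippet `SetAttribute` passes to `locator.Evaluate` (the `buildSetAttributeJS` name and the demo value are invented for illustration). Note the helper escapes only backslashes and single quotes, which suffices for the single-quoted literals used in this hunk but would not cover values containing newlines.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeJavaScript mirrors the helper in the hunk: backslashes are
// escaped first, then single quotes, so a value can be embedded in a
// single-quoted JavaScript string literal.
func escapeJavaScript(s string) string {
	s = strings.ReplaceAll(s, `\`, `\\`)
	return strings.ReplaceAll(s, `'`, `\'`)
}

// buildSetAttributeJS assembles the arrow function that SetAttribute
// evaluates against the located element.
func buildSetAttributeJS(name, value string) string {
	return fmt.Sprintf(`(element) => element.setAttribute('%s', '%s');`,
		escapeJavaScript(name), escapeJavaScript(value))
}

func main() {
	fmt.Println(buildSetAttributeJS("data-note", `it's C:\tmp`))
}
```

Escaping backslashes before quotes matters: doing it in the opposite order would double-escape the backslash that the quote replacement just inserted.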


@@ -4,6 +4,7 @@ import (
"context"
"errors"
"fmt"
"io"
"log/slog"
"time"
@@ -35,6 +36,10 @@ const (
PlayWrightBrowserSelectionWebKit PlayWrightBrowserSelection = "webkit"
)
type Size struct {
Width int
Height int
}
type PlayWrightBrowserOptions struct {
UserAgent string // If empty, defaults to "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0"
Browser PlayWrightBrowserSelection // If unset defaults to Firefox.
@@ -45,6 +50,9 @@ type PlayWrightBrowserOptions struct {
CookieJar
ShowBrowser bool // If false, browser will be headless
Dimensions Size
DarkMode bool
}
func cookieToPlaywrightOptionalCookie(cookie Cookie) playwright.OptionalCookie {
@@ -72,9 +80,10 @@ func playwrightCookieToCookie(cookie playwright.Cookie) Cookie {
func NewPlayWrightBrowser(opts ...PlayWrightBrowserOptions) (Browser, error) {
var thirtySeconds = 30 * time.Second
opt := PlayWrightBrowserOptions{
-		UserAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0",
+		UserAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:137.0) Gecko/20100101 Firefox/137.0",
Browser: PlayWrightBrowserSelectionFirefox,
Timeout: &thirtySeconds,
DarkMode: false,
}
for _, o := range opts {
@@ -90,6 +99,13 @@ func NewPlayWrightBrowser(opts ...PlayWrightBrowserOptions) (Browser, error) {
if o.CookieJar != nil {
opt.CookieJar = o.CookieJar
}
if o.Dimensions.Width > 0 && o.Dimensions.Height > 0 {
opt.Dimensions = o.Dimensions
}
if o.DarkMode {
opt.DarkMode = true
}
opt.ShowBrowser = o.ShowBrowser
}
@@ -132,8 +148,26 @@ func NewPlayWrightBrowser(opts ...PlayWrightBrowserOptions) (Browser, error) {
return nil, err
}
var viewport *playwright.Size
if opt.Dimensions.Width > 0 && opt.Dimensions.Height > 0 {
viewport = &playwright.Size{
Width: opt.Dimensions.Width,
Height: opt.Dimensions.Height,
}
}
var scheme *playwright.ColorScheme
if opt.DarkMode {
scheme = playwright.ColorSchemeDark
} else {
scheme = playwright.ColorSchemeNoPreference
}
c, err := browser.NewContext(playwright.BrowserNewContextOptions{
UserAgent: playwright.String(opt.UserAgent),
Viewport: viewport,
ColorScheme: scheme,
})
if err != nil {
return nil, err
@@ -244,7 +278,33 @@ func (b playWrightBrowser) Open(ctx context.Context, url string, opts OpenPageOp
func (b playWrightBrowser) Close() error {
return errors.Join(
+		b.ctx.Close(),
 		b.browser.Close(),
-		b.ctx.Close(),
b.pw.Stop(),
)
}
func deferClose(cl io.Closer) {
_ = cl.Close()
}
func Screenshot(ctx context.Context, target string, timeout time.Duration) ([]byte, error) {
browser, err := NewPlayWrightBrowser(PlayWrightBrowserOptions{
Timeout: &timeout,
})
if err != nil {
return nil, fmt.Errorf("error creating browser: %w", err)
}
defer deferClose(browser)
doc, err := browser.Open(ctx, target, OpenPageOptions{})
if err != nil {
return nil, fmt.Errorf("error opening page: %w", err)
}
defer deferClose(doc)
return doc.Screenshot()
}
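Both the reordered `Close` and the new `Screenshot` helper lean on the same idea: every teardown stage runs, and `errors.Join` collects whatever failed. A minimal, self-contained sketch of that pattern (the `fakeCloser` type and `closeAll` name are invented stand-ins for the context, browser, and Playwright driver):

```go
package main

import (
	"errors"
	"fmt"
)

// fakeCloser stands in for a resource whose Close may fail.
type fakeCloser struct {
	name string
	fail bool
}

func (f fakeCloser) Close() error {
	if f.fail {
		return fmt.Errorf("%s: close failed", f.name)
	}
	return nil
}

// closeAll mirrors playWrightBrowser.Close: every Close is invoked
// even when an earlier one errors, and errors.Join drops the nils.
func closeAll(cs ...fakeCloser) error {
	errs := make([]error, 0, len(cs))
	for _, c := range cs {
		errs = append(errs, c.Close())
	}
	return errors.Join(errs...)
}

func main() {
	err := closeAll(
		fakeCloser{name: "context"},
		fakeCloser{name: "browser", fail: true},
		fakeCloser{name: "playwright"},
	)
	fmt.Println(err)
}
```

Compared with returning on the first failure, this guarantees the Playwright driver is stopped even when closing the browser context errors out.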


@@ -0,0 +1,81 @@
package aislegopher
import (
"context"
"errors"
"fmt"
"io"
"net/url"
"strconv"
"strings"
"gitea.stevedudenhoeffer.com/steve/go-extractor"
)
type Config struct {
}
var DefaultConfig = Config{}
var (
ErrInvalidURL = errors.New("invalid url")
)
type Item struct {
ID int
Name string
Price float64
}
func deferClose(cl io.Closer) {
if cl != nil {
_ = cl.Close()
}
}
func GetItemFromURL(ctx context.Context, b extractor.Browser, u *url.URL) (Item, error) {
return DefaultConfig.GetItemFromURL(ctx, b, u)
}
func (c Config) GetItemFromURL(ctx context.Context, b extractor.Browser, u *url.URL) (Item, error) {
res := Item{}
// the url will be in the format of aislegopher.com/p/slug/id
// we need to parse the slug and id from the url
a := strings.Split(u.Path, "/")
if len(a) != 4 {
return res, ErrInvalidURL
}
if a[1] != "p" {
return res, ErrInvalidURL
}
if u.Host != "aislegopher.com" && u.Host != "www.aislegopher.com" {
return res, ErrInvalidURL
}
res.ID, _ = strconv.Atoi(a[3])
doc, err := b.Open(ctx, u.String(), extractor.OpenPageOptions{})
defer deferClose(doc)
if err != nil {
return res, fmt.Errorf("failed to open page: %w", err)
}
names := doc.Select("h2.h4")
if len(names) > 0 {
res.Name, _ = names[0].Text()
}
prices := doc.Select("h4.h2")
if len(prices) > 0 {
priceStr, _ := prices[0].Text()
priceStr = strings.ReplaceAll(priceStr, "$", "")
priceStr = strings.TrimSpace(priceStr)
res.Price, _ = strconv.ParseFloat(priceStr, 64)
}
return res, nil
}
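The URL validation in `GetItemFromURL` can be exercised on its own. This standalone sketch reproduces just the host and `/p/<slug>/<id>` path checks (`parseProductURL` is an invented name, and unlike the original it surfaces the `strconv.Atoi` error instead of discarding it):

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

var ErrInvalidURL = errors.New("invalid url")

// parseProductURL mirrors the checks in GetItemFromURL: the URL must be
// on aislegopher.com and its path must look like /p/<slug>/<id>.
func parseProductURL(raw string) (int, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return 0, err
	}
	if u.Host != "aislegopher.com" && u.Host != "www.aislegopher.com" {
		return 0, ErrInvalidURL
	}
	// Splitting "/p/slug/id" yields ["", "p", "slug", "id"].
	a := strings.Split(u.Path, "/")
	if len(a) != 4 || a[1] != "p" {
		return 0, ErrInvalidURL
	}
	return strconv.Atoi(a[3])
}

func main() {
	id, err := parseProductURL("https://aislegopher.com/p/some-item/1234")
	fmt.Println(id, err)
}
```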


@@ -0,0 +1,77 @@
package main
import (
"context"
"fmt"
"io"
"net/url"
"os"
"gitea.stevedudenhoeffer.com/steve/go-extractor/cmd/browser/pkg/browser"
"gitea.stevedudenhoeffer.com/steve/go-extractor/sites/aislegopher"
"github.com/urfave/cli/v3"
)
type AisleGopherFlags []cli.Flag
var Flags = AisleGopherFlags{}
func (f AisleGopherFlags) ToConfig(_ *cli.Command) aislegopher.Config {
res := aislegopher.DefaultConfig
return res
}
func deferClose(cl io.Closer) {
if cl != nil {
_ = cl.Close()
}
}
func main() {
var flags []cli.Flag
flags = append(flags, browser.Flags...)
flags = append(flags, Flags...)
cli := &cli.Command{
Name: "aislegopher",
Usage: "AisleGopher is a tool for extracting data from aislegopher.com",
Flags: flags,
Action: func(ctx context.Context, c *cli.Command) error {
cfg := Flags.ToConfig(c)
b, err := browser.FromCommand(ctx, c)
if err != nil {
return fmt.Errorf("failed to create browser: %w", err)
}
defer deferClose(b)
arg := c.Args().First()
if arg == "" {
return fmt.Errorf("url is required")
}
u, err := url.Parse(arg)
if err != nil {
return fmt.Errorf("failed to parse url: %w", err)
}
data, err := cfg.GetItemFromURL(ctx, b, u)
if err != nil {
return fmt.Errorf("failed to get item from url: %w", err)
}
fmt.Printf("Item: %+v\n", data)
return nil
},
}
err := cli.Run(context.Background(), os.Args)
if err != nil {
panic(err)
}
}


@@ -6,12 +6,11 @@ import (
"os"
"time"
-	"gitea.stevedudenhoeffer.com/steve/go-extractor"
-	"github.com/urfave/cli/v3"
+	"gitea.stevedudenhoeffer.com/steve/go-extractor"
+	"gitea.stevedudenhoeffer.com/steve/go-extractor/cmd/browser/pkg/browser"
+	"gitea.stevedudenhoeffer.com/steve/go-extractor/sites/archive"
+	"github.com/urfave/cli/v3"
)
type ArchiveFlags []cli.Flag


@@ -0,0 +1,113 @@
package main
import (
"context"
"fmt"
"github.com/urfave/cli/v3"
"io"
"os"
"strings"
"time"
"gitea.stevedudenhoeffer.com/steve/go-extractor/cmd/browser/pkg/browser"
"gitea.stevedudenhoeffer.com/steve/go-extractor/sites/duckduckgo"
)
type DuckDuckGoFlags []cli.Flag
var Flags = DuckDuckGoFlags{
&cli.StringFlag{
Name: "region",
Aliases: []string{"r"},
},
&cli.StringFlag{
Name: "safesearch",
Aliases: []string{"s"},
},
}
func (f DuckDuckGoFlags) ToConfig(cmd *cli.Command) duckduckgo.Config {
var res = duckduckgo.DefaultConfig
if r := cmd.String("region"); r != "" {
res.Region = r
}
if s := cmd.String("safesearch"); s != "" {
switch s {
case "on":
res.SafeSearch = duckduckgo.SafeSearchOn
case "moderate":
res.SafeSearch = duckduckgo.SafeSearchModerate
case "off":
res.SafeSearch = duckduckgo.SafeSearchOff
default:
panic("invalid safe search value")
}
}
return res
}
func deferClose(cl io.Closer) {
if cl != nil {
_ = cl.Close()
}
}
func main() {
var flags []cli.Flag
flags = append(flags, browser.Flags...)
flags = append(flags, Flags...)
cli := &cli.Command{
Name: "duckduckgo",
Usage: "Search DuckDuckGo",
Flags: flags,
Action: func(ctx context.Context, command *cli.Command) error {
c := Flags.ToConfig(command)
defer deferClose(nil)
query := strings.TrimSpace(strings.Join(command.Args().Slice(), " "))
if query == "" {
return cli.Exit("usage: duckduckgo <query>", 1)
}
b, err := browser.FromCommand(ctx, command)
defer deferClose(b)
if err != nil {
return fmt.Errorf("failed to create browser: %w", err)
}
search, err := c.OpenSearch(ctx, b, query)
if err != nil {
return fmt.Errorf("failed to open search: %w", err)
}
defer deferClose(search)
res := search.GetResults()
fmt.Println("Results:", res)
err = search.LoadMore()
if err != nil {
return fmt.Errorf("failed to load more: %w", err)
}
time.Sleep(2 * time.Second)
res = search.GetResults()
fmt.Println("Results:", res)
return nil
},
}
err := cli.Run(context.Background(), os.Args)
if err != nil {
panic(err)
}
}


@@ -0,0 +1,141 @@
package duckduckgo
import (
"context"
"fmt"
"io"
"log/slog"
"net/url"
"gitea.stevedudenhoeffer.com/steve/go-extractor"
)
type SafeSearch int
const (
SafeSearchOn SafeSearch = 1
SafeSearchModerate SafeSearch = -1
SafeSearchOff SafeSearch = -2
)
type Config struct {
// SafeSearch is the safe-search level to use. If empty, SafeSearchOff will be used.
SafeSearch SafeSearch
// Region is the region to use for the search engine.
// See: https://duckduckgo.com/duckduckgo-help-pages/settings/params/ for more values
Region string
}
func (c Config) validate() Config {
if c.SafeSearch == 0 {
c.SafeSearch = SafeSearchOff
}
return c
}
func (c Config) ToSearchURL(query string) *url.URL {
c = c.validate()
res, _ := url.Parse("https://duckduckgo.com/")
var vals = res.Query()
switch c.SafeSearch {
case SafeSearchOn:
vals.Set("kp", "1")
case SafeSearchModerate:
vals.Set("kp", "-1")
case SafeSearchOff:
vals.Set("kp", "-2")
}
if c.Region != "" {
vals.Set("kl", c.Region)
}
vals.Set("q", query)
res.RawQuery = vals.Encode()
return res
}
var DefaultConfig = Config{
SafeSearch: SafeSearchOff,
}
type Result struct {
URL string
Title string
Description string
}
func deferClose(cl io.Closer) {
if cl != nil {
_ = cl.Close()
}
}
func (c Config) OpenSearch(ctx context.Context, b extractor.Browser, query string) (SearchPage, error) {
u := c.ToSearchURL(query)
slog.Info("searching", "url", u, "query", query, "config", c, "browser", b)
doc, err := b.Open(ctx, u.String(), extractor.OpenPageOptions{})
if err != nil {
if doc != nil {
_ = doc.Close()
}
return nil, fmt.Errorf("failed to open url: %w", err)
}
return searchPage{doc}, nil
}
func (c Config) Search(ctx context.Context, b extractor.Browser, query string) ([]Result, error) {
u := c.ToSearchURL(query)
slog.Info("searching", "url", u, "query", query, "config", c, "browser", b)
doc, err := b.Open(ctx, u.String(), extractor.OpenPageOptions{})
defer deferClose(doc)
if err != nil {
return nil, fmt.Errorf("failed to open url: %w", err)
}
var res []Result
err = doc.ForEach(`article[id^="r1-"]`, func(n extractor.Node) error {
var r Result
links := n.Select(`a[href][target="_self"]`)
if len(links) == 0 {
return nil
}
r.URL, err = links[0].Attr(`href`)
if err != nil {
return fmt.Errorf("failed to get link: %w", err)
}
titles := n.Select("h2")
if len(titles) != 0 {
r.Title, _ = titles[0].Text()
}
descriptions := n.Select("span > span")
if len(descriptions) != 0 {
r.Description, _ = descriptions[0].Text()
}
res = append(res, r)
return nil
})
return res, nil
}
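`Config.ToSearchURL` boils down to three query parameters: `kp` for the SafeSearch level (1 on, -1 moderate, -2 off), `kl` for the region, and `q` for the query. A self-contained approximation (`buildSearchURL` is an invented stand-in that takes the `kp` value directly rather than the `SafeSearch` type):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildSearchURL mirrors Config.ToSearchURL: set kp, optionally kl,
// and q, then encode the values back into RawQuery.
func buildSearchURL(query, region string, kp int) string {
	u, _ := url.Parse("https://duckduckgo.com/")
	vals := u.Query()
	vals.Set("kp", fmt.Sprintf("%d", kp))
	if region != "" {
		vals.Set("kl", region)
	}
	vals.Set("q", query)
	// Crucial step: Query() returned a copy, so write it back.
	u.RawQuery = vals.Encode()
	return u.String()
}

func main() {
	fmt.Println(buildSearchURL("go playwright", "us-en", -2))
}
```

The `u.RawQuery = vals.Encode()` assignment is what makes the parameters stick; `Values.Encode` also sorts keys, so the output is deterministic.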

sites/duckduckgo/page.go

@@ -0,0 +1,68 @@
package duckduckgo
import (
"fmt"
"gitea.stevedudenhoeffer.com/steve/go-extractor"
"io"
"log/slog"
)
type SearchPage interface {
io.Closer
GetResults() []Result
LoadMore() error
}
type searchPage struct {
doc extractor.Document
}
func (s searchPage) GetResults() []Result {
var res []Result
var err error
err = s.doc.ForEach(`article[id^="r1-"]`, func(n extractor.Node) error {
var r Result
links := n.Select(`a[href][target="_self"]`)
if len(links) == 0 {
return nil
}
r.URL, err = links[0].Attr(`href`)
if err != nil {
return fmt.Errorf("failed to get link: %w", err)
}
titles := n.Select("h2")
if len(titles) != 0 {
r.Title, _ = titles[0].Text()
}
descriptions := n.Select("span > span")
if len(descriptions) != 0 {
r.Description, _ = descriptions[0].Text()
}
res = append(res, r)
return nil
})
return res
}
func (s searchPage) LoadMore() error {
return s.doc.ForEach(`button#more-results`, func(n extractor.Node) error {
slog.Info("clicking load more", "node", n)
return n.Click()
})
}
func (s searchPage) Close() error {
return s.doc.Close()
}


@@ -0,0 +1,95 @@
package main
import (
"context"
"fmt"
"io"
"os"
"strings"
"github.com/urfave/cli/v3"
"gitea.stevedudenhoeffer.com/steve/go-extractor/cmd/browser/pkg/browser"
"gitea.stevedudenhoeffer.com/steve/go-extractor/sites/google"
)
type GoogleFlags []cli.Flag
var Flags = GoogleFlags{
&cli.StringFlag{
Name: "domain",
Aliases: []string{"d"},
Usage: "The base domain to use",
},
&cli.StringFlag{
Name: "language",
Aliases: []string{"l"},
Usage: "The language to use",
},
}
func (f GoogleFlags) ToConfig(_ context.Context, cmd *cli.Command) google.Config {
c := google.DefaultConfig
if d := cmd.String("domain"); d != "" {
c.BaseURL = d
}
if l := cmd.String("language"); l != "" {
c.Language = l
}
return c
}
func deferClose(cl io.Closer) {
if cl != nil {
_ = cl.Close()
}
}
func main() {
var flags []cli.Flag
flags = append(flags, browser.Flags...)
flags = append(flags, Flags...)
cli := &cli.Command{
Name: "google",
Usage: "Search Google",
Flags: flags,
Action: func(ctx context.Context, cli *cli.Command) error {
query := strings.Join(cli.Args().Slice(), " ")
if query == "" {
return fmt.Errorf("usage: google <query>")
}
b, err := browser.FromCommand(ctx, cli)
defer deferClose(b)
if err != nil {
return err
}
cfg := Flags.ToConfig(ctx, cli)
res, err := cfg.Search(ctx, b, query)
if err != nil {
return err
}
fmt.Println(res)
return nil
},
}
err := cli.Run(context.Background(), os.Args)
if err != nil {
panic(err)
}
}

sites/google/google.go

@@ -0,0 +1,144 @@
package google
import (
"context"
"fmt"
"io"
"net/url"
"gitea.stevedudenhoeffer.com/steve/go-extractor"
)
type Config struct {
// BaseURL is the base URL for the search engine, if empty "google.com" is used
BaseURL string
// Language is the language to use for the search engine, if empty "en" is used
Language string
// Country is the country to use for the search engine, if empty "us" is used
Country string
}
var DefaultConfig = Config{
BaseURL: "google.com",
Language: "en",
Country: "us",
}
func (c Config) validate() Config {
if c.BaseURL == "" {
c.BaseURL = "google.com"
}
if c.Language == "" {
c.Language = "en"
}
if c.Country == "" {
c.Country = "us"
}
return c
}
type Result struct {
URL string
Title string
Description string
}
func deferClose(cl io.Closer) {
if cl != nil {
_ = cl.Close()
}
}
func (c Config) Search(ctx context.Context, b extractor.Browser, query string) ([]Result, error) {
c = c.validate()
u, err := url.Parse(fmt.Sprintf("https://%s/search?q=%s", c.BaseURL, query))
if err != nil {
return nil, fmt.Errorf("invalid url: %w", err)
}
if c.Language != "" {
u.Query().Set("hl", c.Language)
}
if c.Country != "" {
country := ""
switch c.Country {
case "us":
country = "countryUS"
case "uk":
country = "countryUK"
case "au":
country = "countryAU"
case "ca":
country = "countryCA"
}
if country != "" {
u.Query().Set("cr", country)
}
}
doc, err := b.Open(ctx, u.String(), extractor.OpenPageOptions{})
if err != nil {
return nil, fmt.Errorf("failed to open url: %w", err)
}
defer deferClose(doc)
var res []Result
err = doc.ForEach("div.g", func(s extractor.Node) error {
var u string
var title string
var desc string
// get the first link in the div
link := s.Select("a")
if len(link) == 0 {
return nil
}
u, err := link[0].Attr("href")
if err != nil {
return fmt.Errorf("failed to get link: %w", err)
}
titles := s.Select("div > div > div a > h3")
if len(titles) != 0 {
title, _ = titles[0].Text()
}
descs := s.Select("div:nth-child(1) > div:nth-child(2) > div:nth-child(1) > span:not([class])")
if len(descs) != 0 {
desc, _ = descs[0].Text()
}
res = append(res, Result{
URL: u,
Title: title,
Description: desc,
})
return nil
})
return res, err
}
func Search(ctx context.Context, b extractor.Browser, query string) ([]Result, error) {
return DefaultConfig.Search(ctx, b, query)
}
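One subtlety worth flagging in this hunk: `url.URL.Query()` returns a freshly parsed copy of `RawQuery`, so the `u.Query().Set("hl", ...)` and `u.Query().Set("cr", ...)` calls above do not modify the URL that is ultimately opened. Persisting a parameter requires encoding the mutated values back into `RawQuery`, as the DuckDuckGo package does. A short demonstration (`setParam` and `demo` are illustrative names):

```go
package main

import (
	"fmt"
	"net/url"
)

// setParam persists a query parameter: Query() returns a copy, so the
// mutated values must be encoded back into RawQuery.
func setParam(u *url.URL, key, value string) {
	vals := u.Query()
	vals.Set(key, value)
	u.RawQuery = vals.Encode()
}

// demo returns the URL before and after the parameter is persisted.
func demo() (string, string) {
	u, _ := url.Parse("https://google.com/search?q=test")

	// No-op: this mutates a copy that is immediately discarded.
	u.Query().Set("hl", "en")
	before := u.String()

	setParam(u, "hl", "en")
	return before, u.String()
}

func main() {
	before, after := demo()
	fmt.Println("without reassign:", before)
	fmt.Println("with reassign:", after)
}
```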


@@ -0,0 +1,81 @@
package main
import (
"context"
"fmt"
"io"
"net/url"
"os"
"gitea.stevedudenhoeffer.com/steve/go-extractor/cmd/browser/pkg/browser"
"github.com/urfave/cli/v3"
"gitea.stevedudenhoeffer.com/steve/go-extractor/sites/wegmans"
)
func deferClose(cl io.Closer) {
if cl != nil {
_ = cl.Close()
}
}
type WegmansFlags []cli.Flag
var Flags = WegmansFlags{}
func (f WegmansFlags) ToConfig(_ *cli.Command) wegmans.Config {
var res = wegmans.DefaultConfig
return res
}
func main() {
var flags []cli.Flag
flags = append(flags, browser.Flags...)
flags = append(flags, Flags...)
app := &cli.Command{
Name: "wegmans",
Usage: "Search Wegmans",
Flags: flags,
Action: func(ctx context.Context, cmd *cli.Command) error {
cfg := Flags.ToConfig(cmd)
b, err := browser.FromCommand(ctx, cmd)
defer deferClose(b)
if err != nil {
return fmt.Errorf("error creating browser: %w", err)
}
arg := cmd.Args().First()
if arg == "" {
return fmt.Errorf("url is required")
}
u, err := url.Parse(arg)
if err != nil {
return fmt.Errorf("failed to parse url: %w", err)
}
item, err := cfg.GetItemPrice(ctx, b, u)
if err != nil {
return fmt.Errorf("failed to get item price: %w", err)
}
fmt.Println(item)
return nil
},
}
err := app.Run(context.Background(), os.Args)
if err != nil {
panic(err)
}
}

sites/wegmans/wegmans.go

@@ -0,0 +1,118 @@
package wegmans
import (
"context"
"errors"
"io"
"net/url"
"strconv"
"strings"
"time"
"gitea.stevedudenhoeffer.com/steve/go-extractor"
)
type Config struct {
}
var DefaultConfig = Config{}
var ErrNilBrowser = errors.New("browser is nil")
var ErrNilURL = errors.New("url is nil")
var ErrInvalidURL = errors.New("invalid url")
type Item struct {
ID int
Name string
Price float64
UnitPrice float64
Unit string
}
func deferClose(c io.Closer) {
if c != nil {
_ = c.Close()
}
}
func (c Config) GetItemPrice(ctx context.Context, b extractor.Browser, u *url.URL) (Item, error) {
if b == nil {
return Item{}, ErrNilBrowser
}
if u == nil {
return Item{}, ErrNilURL
}
// urls in the format of:
// https://shop.wegmans.com/product/24921[/wegmans-frozen-thin-crust-uncured-pepperoni-pizza]
// (the slug is optional)
// get the product ID
a := strings.Split(u.Path, "/")
if len(a) < 3 {
return Item{}, ErrInvalidURL
}
if a[1] != "product" {
return Item{}, ErrInvalidURL
}
id, _ := strconv.Atoi(a[2])
if id == 0 {
return Item{}, ErrInvalidURL
}
doc, err := b.Open(ctx, u.String(), extractor.OpenPageOptions{})
defer deferClose(doc)
if err != nil {
return Item{}, err
}
timeout := 15 * time.Second
_ = doc.WaitForNetworkIdle(&timeout)
res := Item{
ID: id,
}
titles := doc.Select("h1[data-test]")
if len(titles) != 0 {
res.Name, _ = titles[0].Text()
}
prices := doc.Select("span[data-test=\"amount\"] span:nth-child(1)")
if len(prices) != 0 {
priceStr, _ := prices[0].Text()
priceStr = strings.ReplaceAll(priceStr, "$", "")
priceStr = strings.ReplaceAll(priceStr, ",", "")
price, _ := strconv.ParseFloat(priceStr, 64)
res.Price = price
}
unitPrices := doc.Select(`span[data-test="per-unit-price"]`)
if len(unitPrices) != 0 {
unitPriceStr, _ := unitPrices[0].Text()
unitPriceStr = strings.TrimSpace(unitPriceStr)
unitPriceStr = strings.ReplaceAll(unitPriceStr, "(", "")
unitPriceStr = strings.ReplaceAll(unitPriceStr, ")", "")
unitPriceStr = strings.ReplaceAll(unitPriceStr, "$", "")
unitPriceStr = strings.ReplaceAll(unitPriceStr, ",", "")
units := strings.Split(unitPriceStr, "/")
if len(units) > 1 {
res.Unit = strings.TrimSpace(units[1])
res.UnitPrice, _ = strconv.ParseFloat(units[0], 64)
}
}
return res, nil
}
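The per-unit-price handling strips decoration like `($2.50/lb)` down to a number and a unit. This standalone sketch mirrors that sequence (`parseUnitPrice` is an invented name, and it reports failure explicitly where the original silently ignores parse errors):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseUnitPrice mirrors the per-unit-price handling in GetItemPrice:
// trim whitespace, strip "(", ")", "$", and ",", then split on "/"
// into a price and a unit.
func parseUnitPrice(s string) (float64, string, bool) {
	s = strings.TrimSpace(s)
	for _, ch := range []string{"(", ")", "$", ","} {
		s = strings.ReplaceAll(s, ch, "")
	}
	parts := strings.Split(s, "/")
	if len(parts) < 2 {
		return 0, "", false
	}
	price, err := strconv.ParseFloat(strings.TrimSpace(parts[0]), 64)
	if err != nil {
		return 0, "", false
	}
	return price, strings.TrimSpace(parts[1]), true
}

func main() {
	price, unit, ok := parseUnitPrice(" ($2.50/lb) ")
	fmt.Println(price, unit, ok)
}
```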