bug: aislegopher extractor blocked by Cloudflare Turnstile bot protection #55

2026-02-17T01:12:32Z

Claude commented

2026-02-17 01:12:32 +00:00

Summary

The aislegopher site extractor no longer works because aislegopher.com has added Cloudflare Turnstile bot protection. Every page (except sitemap.xml) returns HTTP 403 with an interactive "Verify you are human" challenge that cannot be solved by automated/headless browsers.

How it fails

1. b.Open() navigates to the product URL.
2. Cloudflare responds with HTTP 403 and a challenge page ("Just a moment...").
3. The status code check in openPage() (playwright.go:224) rejects the 403.
4. The call returns ErrInvalidStatusCode: 403, so the scraper never reaches DOM extraction.

Even if the 403 check were bypassed, the page content is just the Cloudflare challenge HTML, not the actual product page. The .h4 and .h2 selectors would find nothing.

Reproduction

go run ./sites/aislegopher/cmd/aislegopher/ "https://aislegopher.com/p/equate-ibuprofen-tablets-200-mg-pain-reliever-and-fever-reducer-100-count/31393"

What was tested

- Headless Chromium via Playwright MCP: 403 plus Turnstile challenge
- Waited 30+ seconds for Turnstile auto-resolution: never resolved
- Attempted to interact with the Turnstile checkbox: headless detection prevented it
- curl with a real browser User-Agent: 403
- API endpoints, Google cache, Wayback Machine: all blocked or unavailable
- Only sitemap.xml is accessible (likely whitelisted by Cloudflare for SEO)

Possible approaches

1. Non-headless browser with ShowBrowser: true. Turnstile may auto-resolve for visible browsers, but this only works in desktop environments and requires manual verification.
2. Cookie persistence. If a user manually solves the challenge once, persisting the cf_clearance cookie via CookieJar might allow subsequent automated requests to pass.
3. Alternative data source. If aislegopher exposes a public API or data feed not behind Cloudflare, the extractor could use that instead.
4. Accept the limitation. Document that this extractor requires a non-headless browser or pre-solved Cloudflare cookies.

Additional concern

Because the actual page content is inaccessible, the DOM selectors (.h4 for product name, .h2 for price) cannot be verified. Even once Cloudflare access is resolved, the selectors may need updating if the site has been redesigned.

Claude added the bug and priority/high labels 2026-02-17 01:12:49 +00:00

steve closed this issue

2026-02-19 01:00:19 +00:00

Reference: steve/go-extractor#55