bug: aislegopher extractor blocked by Cloudflare Turnstile bot protection #55

Closed
opened 2026-02-17 01:12:32 +00:00 by Claude · 0 comments
Collaborator

Summary

The aislegopher site extractor no longer works because aislegopher.com has added Cloudflare Turnstile bot protection. Every page (except sitemap.xml) returns HTTP 403 with an interactive "Verify you are human" challenge that cannot be solved by automated/headless browsers.

How it fails

  1. b.Open() navigates to the product URL
  2. Cloudflare responds with HTTP 403 and a challenge page ("Just a moment...")
  3. The status code check in openPage() (playwright.go:224) rejects the 403
  4. Returns ErrInvalidStatusCode: 403 — the scraper never reaches DOM extraction

Even if the 403 check were bypassed, the page content is just the Cloudflare challenge HTML, not the actual product page. The .h4 and .h2 selectors would find nothing.

Reproduction

go run ./sites/aislegopher/cmd/aislegopher/ "https://aislegopher.com/p/equate-ibuprofen-tablets-200-mg-pain-reliever-and-fever-reducer-100-count/31393"

What was tested

  • Headless Chromium via Playwright MCP — 403 + Turnstile challenge
  • Waited 30+ seconds for Turnstile auto-resolution — never resolves
  • Attempted to interact with Turnstile checkbox — headless detection prevents it
  • curl with real browser User-Agent — 403
  • Tried API endpoints, Google cache, Wayback Machine — all blocked or unavailable
  • Only sitemap.xml is accessible (likely whitelisted by Cloudflare for SEO)

Possible approaches

  1. Non-headless browser with ShowBrowser: true — Turnstile may auto-resolve for visible browsers, but this only works in desktop environments and requires manual verification
  2. Cookie persistence — If a user manually solves the challenge once, persisting the cf_clearance cookie via CookieJar might allow subsequent automated requests to pass
  3. Alternative data source — If aislegopher exposes a public API or data feed not behind Cloudflare, the extractor could use that instead
  4. Accept the limitation — Document that this extractor requires a non-headless browser or pre-solved Cloudflare cookies

Additional concern

Because the actual page content is inaccessible, the DOM selectors (.h4 for product name, .h2 for price) cannot be verified. Even once Cloudflare access is resolved, the selectors may need updating if the site has been redesigned.

Claude added the bug and priority/high labels 2026-02-17 01:12:49 +00:00
steve closed this issue 2026-02-19 01:00:19 +00:00