Stealth mode insufficient for archive.ph bot detection #58

Closed
opened 2026-02-17 22:32:21 +00:00 by Claude · 3 comments
Collaborator

Problem

Despite the stealth mode added in #57 (merged in 917569dd), archive.ph still detects and blocks the headless browser with HTTP 429 responses.

Current Behavior

When a headless browser with stealth mode enabled (Stealth: Bool(true), which is the default) tries to open https://archive.ph, the server returns a 429 Too Many Requests status code. Because this happens on the very first request, it points to bot detection rather than rate limiting.

The same URL loads perfectly fine in:

  • A regular desktop browser
  • An interactive Playwright browser (non-headless Chromium via NewInteractiveBrowser)
  • The captcha proxy's interactive browser session

Current Stealth Measures (from stealth.go)

The current stealth mode applies:

  1. --disable-blink-features=AutomationControlled launch arg
  2. navigator.webdriver → undefined
  3. navigator.plugins populated with PDF viewer entries
  4. navigator.mimeTypes populated
  5. window.chrome.runtime stub
  6. window.outerWidth/outerHeight fix for headless

Likely Detection Vectors

archive.ph appears to use more sophisticated detection than just navigator.webdriver. Possible vectors that aren't currently addressed:

  1. --headless flag detection — Chromium's headless mode can be detected through:

    • navigator.userAgent containing "HeadlessChrome"
    • window.chrome.app being undefined in headless
    • window.chrome.csi and window.chrome.loadTimes being undefined
  2. WebGL fingerprinting — Headless Chromium reports different WebGL renderer strings (e.g., "SwiftShader" or "Google SwiftShader" for UNMASKED_RENDERER_WEBGL) which are dead giveaways for headless mode.

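  3. CDP (Chrome DevTools Protocol) detection: Some sites look for automation artifacts such as window.cdc_adoQpoasnfa76pfcZLmcfl_* globals. Those particular properties are a ChromeDriver (Selenium) artifact rather than something Playwright injects, so this vector may not apply here, but similar runtime leaks from the CDP connection are possible.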

  4. navigator.permissions behavior: In headless Chromium, navigator.permissions.query({name: "notifications"}) resolves to "prompt" while Notification.permission reports "denied". Real browsers keep the two consistent, so the mismatch is a known headless tell.

  5. Missing Notification constructor — Headless Chromium may not have the Notification API.

  6. navigator.connection — May be missing or have different values in headless mode.
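To make the vectors above concrete, here is a minimal Go sketch of how a detector might combine them. The `Fingerprint` struct, its field names, and `HeadlessTells` are hypothetical and exist only for illustration; nothing here is known about what archive.ph actually checks.

```go
package main

import "strings"

// Fingerprint holds values a detection script could read from a page.
// Hypothetical type: the field names are assumptions for illustration.
type Fingerprint struct {
	UserAgent              string // navigator.userAgent
	WebGLRenderer          string // UNMASKED_RENDERER_WEBGL
	HasChromeApp           bool   // whether window.chrome.app is defined
	NotificationPermission string // Notification.permission
	PermissionsQueryState  string // state from navigator.permissions.query
}

// HeadlessTells returns a label for each listed vector the fingerprint trips.
func HeadlessTells(f Fingerprint) []string {
	var tells []string
	if strings.Contains(f.UserAgent, "HeadlessChrome") {
		tells = append(tells, "user-agent")
	}
	if strings.Contains(f.WebGLRenderer, "SwiftShader") {
		tells = append(tells, "webgl")
	}
	if !f.HasChromeApp {
		tells = append(tells, "chrome.app")
	}
	// The two notification APIs disagreeing is the tell, not either value alone.
	if f.NotificationPermission == "denied" && f.PermissionsQueryState == "prompt" {
		tells = append(tells, "permissions")
	}
	return tells
}
```

A check like this is useful on our side as well: feeding it the values collected from a stealth-enabled page gives a quick list of which measures are still missing.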

Possible Solutions

  1. Use --headless=new (Chromium 112+) — The "new headless" mode is much harder to detect as it runs the full browser UI layer.

  2. Additional init scripts to spoof:

    • WebGL renderer/vendor strings
    • window.chrome.app, window.chrome.csi, window.chrome.loadTimes
    • navigator.permissions.query behavior
    • Notification constructor presence
  3. User-Agent override — Strip "HeadlessChrome" from the UA string if present.

  4. CDP artifact cleanup — Remove or rename CDP-injected global properties.
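Solution 3 is the simplest of these to sketch. The helper below is hypothetical (the name `StripHeadlessUA` is not from stealth.go) and assumes the headless UA differs from the regular one only by the "HeadlessChrome" product token.

```go
package main

import "strings"

// StripHeadlessUA rewrites the "HeadlessChrome" product token to "Chrome".
// Hypothetical helper; a UA without the token is returned unchanged.
func StripHeadlessUA(ua string) string {
	return strings.ReplaceAll(ua, "HeadlessChrome", "Chrome")
}
```

The cleaned string would then need to be applied wherever the browser context's user agent is configured, so that the HTTP header and navigator.userAgent agree.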

Impact

This blocks the mort Discord bot's summary system from using archive.ph to read paywalled articles. The current workaround (captcha proxy) doesn't work because archive.ph doesn't show a user-solvable captcha — it simply rejects headless browsers at the HTTP level.

Related

  • PR #57 (stealth mode implementation)
  • mort issue #687 (captcha detection & solving)
Author
Collaborator

Starting work on this. Plan:

  1. Expand stealthChromiumArgs to use --headless=new (Chromium 112+ "new headless" mode that runs the full UI layer and is much harder to detect)
  2. Add init scripts for all the detection vectors listed:
    • WebGL renderer/vendor spoofing (hide SwiftShader)
    • window.chrome.app, window.chrome.csi, window.chrome.loadTimes stubs
    • navigator.permissions.query behavior fix
    • Notification constructor presence
    • navigator.connection spoofing
    • CDP artifact cleanup (remove cdc_ prefixed globals)
  3. Strip "HeadlessChrome" from User-Agent when stealth is enabled
  4. Add tests for the new stealth measures
  5. Modify browser_init.go to apply the --headless=new arg when stealth + headless + Chromium
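Steps 1 and 5 boil down to a small decision about launch arguments. A rough sketch follows, assuming a hypothetical `LaunchConfig` struct and `BuildStealthArgs` function; the real stealthChromiumArgs in this repository may be shaped differently.

```go
package main

// LaunchConfig captures the three conditions named in step 5.
// Hypothetical type, not taken from browser_init.go.
type LaunchConfig struct {
	Stealth  bool
	Headless bool
	Browser  string // "chromium", "firefox", or "webkit"
}

// BuildStealthArgs returns the extra Chromium launch args for a config.
// It returns nil when stealth is off or the browser is not Chromium.
func BuildStealthArgs(cfg LaunchConfig) []string {
	if !cfg.Stealth || cfg.Browser != "chromium" {
		return nil
	}
	args := []string{"--disable-blink-features=AutomationControlled"}
	// New headless only makes sense when the browser is headless at all.
	if cfg.Headless {
		args = append(args, "--headless=new")
	}
	return args
}
```

Keeping the decision in one pure function makes step 4 easier, since each combination of flags can be tested without launching a browser.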
Author
Collaborator

Starting work on enhanced stealth mode. Plan:

  1. Add 7 new init scripts to stealth.go (WebGL spoof, chrome API stubs, permissions query, Notification stub, navigator.connection, CDP cleanup, UA strip)
  2. Enable Chromium's new headless mode via Channel: "chromium" in browser_init.go
  3. Add tests in stealth_test.go for all new scripts

Will push to fix/enhanced-stealth-mode branch and create a PR.
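For reference, one way the init scripts could be organized is as a named list that is applied in order. This is only a sketch: `InitScript`, `ExampleScripts`, `ApplyInitScripts`, and the JavaScript bodies are all hypothetical and show three of the seven planned scripts, not the actual contents of stealth.go.

```go
package main

import (
	"errors"
	"fmt"
)

// InitScript pairs a label with JavaScript meant to run before page scripts.
// Hypothetical type for illustration.
type InitScript struct {
	Name   string
	Source string
}

// ErrEmptyScript is returned when a script has no source.
var ErrEmptyScript = errors.New("empty init script")

// ExampleScripts sketches three of the planned measures (assumed JS bodies).
var ExampleScripts = []InitScript{
	{Name: "chrome-stubs", Source: `window.chrome = window.chrome || {};
window.chrome.app = window.chrome.app || {};`},
	{Name: "cdp-cleanup", Source: `for (const key of Object.keys(window)) {
  if (key.startsWith('cdc_')) { delete window[key]; }
}`},
	{Name: "ua-strip", Source: `(() => {
  const ua = navigator.userAgent.replace('HeadlessChrome', 'Chrome');
  Object.defineProperty(navigator, 'userAgent', { get: () => ua });
})();`},
}

// ApplyInitScripts passes each script to add in order and stops at the
// first failure, wrapping the error with the script's name.
func ApplyInitScripts(scripts []InitScript, add func(js string) error) error {
	for _, s := range scripts {
		if s.Source == "" {
			return fmt.Errorf("%s: %w", s.Name, ErrEmptyScript)
		}
		if err := add(s.Source); err != nil {
			return fmt.Errorf("%s: %w", s.Name, err)
		}
	}
	return nil
}
```

The `add` callback stands in for whatever init-script registration the browser context exposes, which also lets stealth_test.go exercise the list with a stub instead of a real browser. Note that the "ua-strip" script only changes what page JavaScript sees; the User-Agent request header has to be overridden separately.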

Author
Collaborator

Work complete. Merged via PR #59.

Changes:

  • Added 7 new stealth init scripts (WebGL renderer spoof, chrome.app/csi/loadTimes stubs, navigator.permissions.query override, Notification constructor stub, navigator.connection stub, CDP artifact cleanup, HeadlessChrome UA strip)
  • Enabled Chromium's new headless mode (Channel: "chromium") when stealth is active
  • Added 8 new unit tests covering all new scripts

CI: All 3 checks passed (build, test, vet).
