Stealth mode insufficient for archive.ph bot detection #58
Problem
Despite the stealth mode added in #57 (merged in 917569dd), archive.ph still detects and blocks the headless browser with HTTP 429 responses.
Current Behavior
When a headless browser with stealth mode enabled (Stealth: Bool(true), which is the default) tries to open https://archive.ph, the server returns a 429 Too Many Requests status code. This happens immediately on the first request, so it is bot detection rather than rate limiting.
The same URL loads perfectly fine in:
NewInteractiveBrowser)
Current Stealth Measures (from stealth.go)
The current stealth mode applies:
- --disable-blink-features=AutomationControlled launch arg
- navigator.webdriver → undefined
- navigator.plugins populated with PDF viewer entries
- navigator.mimeTypes populated
- window.chrome.runtime stub
- window.outerWidth/outerHeight fix for headless
Likely Detection Vectors
archive.ph appears to use more sophisticated detection than just navigator.webdriver. Possible vectors that aren't currently addressed:
- --headless flag detection: Chromium's headless mode can be detected through:
  - navigator.userAgent containing "HeadlessChrome"
  - window.chrome.app being undefined in headless
  - window.chrome.csi and window.chrome.loadTimes being undefined
- WebGL fingerprinting: Headless Chromium reports different WebGL renderer strings (e.g., "SwiftShader" or "Google SwiftShader" for UNMASKED_RENDERER_WEBGL), which are dead giveaways for headless mode.
- CDP (Chrome DevTools Protocol) detection: Some sites detect automation through window.cdc_adoQpoasnfa76pfcZLmcfl_* or similar runtime properties injected by automation drivers (the cdc_ prefix is best known from ChromeDriver).
- navigator.permissions behavior: In headless Chromium, navigator.permissions.query({name: "notifications"}) can return "prompt" while Notification.permission reports "denied"; regular browsers keep the two consistent.
- Missing Notification constructor: Headless Chromium may not have the Notification API.
- navigator.connection: May be missing or have different values in headless mode.
Possible Solutions
- Use --headless=new (Chromium 112+): The "new headless" mode is much harder to detect as it runs the full browser UI layer.
- Additional init scripts to spoof:
  - window.chrome.app, window.chrome.csi, window.chrome.loadTimes
  - navigator.permissions.query behavior
  - Notification constructor presence
- User-Agent override: Strip "HeadlessChrome" from the UA string if present.
- CDP artifact cleanup: Remove or rename CDP-injected global properties.
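As a rough illustration of two of these ideas (the User-Agent override and the window.chrome stubs), here is a minimal Go sketch. The names stripHeadlessUA, chromeStubsJS and stealthInitScripts are hypothetical and not taken from stealth.go. The stub values (empty objects, isInstalled: false) are assumptions about what a plausible non-headless page exposes. Replacing the token with "Chrome" rather than deleting it is also an assumption, made so the resulting UA string stays well-formed.

```go
package main

import (
	"fmt"
	"strings"
)

// stripHeadlessUA rewrites the "HeadlessChrome" product token to "Chrome"
// so the User-Agent no longer advertises headless mode.
// Hypothetical helper; the real project may handle this differently.
func stripHeadlessUA(ua string) string {
	return strings.ReplaceAll(ua, "HeadlessChrome", "Chrome")
}

// chromeStubsJS is an init script that defines window.chrome.app,
// window.chrome.csi and window.chrome.loadTimes when they are missing.
// The stub return values are placeholders, not real browser data.
const chromeStubsJS = `(() => {
  window.chrome = window.chrome || {};
  for (const name of ["csi", "loadTimes"]) {
    if (typeof window.chrome[name] !== "function") {
      window.chrome[name] = () => ({});
    }
  }
  if (!window.chrome.app) {
    window.chrome.app = { isInstalled: false };
  }
})();`

// stealthInitScripts returns the scripts in the order they should be
// injected before any page script runs.
func stealthInitScripts() []string {
	return []string{chromeStubsJS}
}

func main() {
	ua := "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36"
	fmt.Println(stripHeadlessUA(ua))
	fmt.Println(len(stealthInitScripts()), "init script(s) prepared")
}
```

Each script would then be registered through whatever init-script hook the browser library provides, so that it runs before the page's own detection code.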
Impact
This blocks the mort Discord bot's summary system from using archive.ph to read paywalled articles. The current workaround (captcha proxy) doesn't work because archive.ph doesn't show a user-solvable captcha — it simply rejects headless browsers at the HTTP level.
Related
Starting work on this. Plan:
- stealthChromiumArgs to use --headless=new (Chromium 112+ "new headless" mode that runs the full UI layer and is much harder to detect)
- window.chrome.app, window.chrome.csi, window.chrome.loadTimes stubs
- navigator.permissions.query behavior fix
- Notification constructor presence
- navigator.connection spoofing
- CDP artifact cleanup (cdc_ prefixed globals)
- browser_init.go to apply the --headless=new arg when stealth + headless + Chromium
Starting work on enhanced stealth mode. Plan:
- stealth.go (WebGL spoof, chrome API stubs, permissions query, Notification stub, navigator.connection, CDP cleanup, UA strip)
- Channel: "chromium" in browser_init.go
- stealth_test.go for all new scripts
Will push to fix/enhanced-stealth-mode branch and create a PR.
Work complete. Merged via PR #59.
Changes:
- chrome.app/csi/loadTimes stubs, navigator.permissions.query override, Notification constructor stub, navigator.connection stub, CDP artifact cleanup, HeadlessChrome UA strip
- Channel: "chromium" when stealth is active
CI: All 3 checks passed (build, test, vet).
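For readers who have not looked at the PR, the launch-side change could look roughly like the sketch below. It is not the code from browser_init.go. launchConfig and stealthLaunchConfig are hypothetical names standing in for the real launch options. Restricting the channel switch to headless Chromium is an assumption carried over from the earlier plan ("stealth + headless + Chromium").

```go
package main

import "fmt"

// launchConfig is a hypothetical stand-in for the browser library's
// launch options; only the fields discussed in this issue are modeled.
type launchConfig struct {
	Channel string   // e.g. "chromium" to select the new headless mode
	Args    []string // extra command-line flags for the browser
}

// stealthLaunchConfig decides which launch settings stealth mode adds.
// Non-Chromium browsers and non-stealth sessions are left untouched.
func stealthLaunchConfig(stealth, headless bool, browser string) launchConfig {
	cfg := launchConfig{}
	if !stealth || browser != "chromium" {
		return cfg
	}
	// Flag already applied by the existing stealth mode.
	cfg.Args = append(cfg.Args, "--disable-blink-features=AutomationControlled")
	if headless {
		// Assumption: the channel switch only matters for headless runs.
		cfg.Channel = "chromium"
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", stealthLaunchConfig(true, true, "chromium"))
	fmt.Printf("%+v\n", stealthLaunchConfig(false, true, "chromium"))
}
```

Keeping the decision in one small pure function makes it straightforward to cover in a test file such as stealth_test.go without launching a browser.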