feat: add RemoveHidden option for display:none element stripping #63

Merged
steve merged 1 commit from feature/readability-remove-hidden into main 2026-02-20 14:10:59 +00:00

Summary

  • Adds RemoveHidden bool field to ReadabilityOptions that evaluates JavaScript on the live page to remove all elements with computed display: none before readability extraction
  • Defines an internal pageEvaluator interface so the concrete Playwright-backed document supports JS evaluation without changing the Document interface
  • Adds PageEvaluate method to the document struct
  • Includes tests for the happy path, error propagation from JS evaluation, and the unsupported-Document case

Closes #62

Context

Some websites embed hidden display: none elements as anti-AI-scraping honeypots containing prompt injection attacks. These are invisible to users but get picked up by readability extraction. CSS selectors can't target computed styles, so JavaScript evaluation with getComputedStyle is needed.

Test plan

  • [x] go test ./... passes
  • [ ] Manual test against a page with known hidden honeypot elements (e.g. Together AI blog post mentioned in #62)
Claude added 1 commit 2026-02-20 14:06:32 +00:00
feat: add RemoveHidden option to strip display:none elements before extraction
All checks were successful
CI / vet (pull_request) Successful in 34s
CI / test (pull_request) Successful in 1m1s
CI / build (pull_request) Successful in 1m5s
65cf6b027f
When RemoveHidden is true, JavaScript is evaluated on the live page to
remove all elements with computed display:none before readability
extraction. This defends against anti-scraping honeypots that embed
prompt injections in hidden DOM elements.

The implementation uses an optional pageEvaluator interface so that the
concrete document (backed by Playwright) supports it while the Document
interface remains unchanged.

Closes #62

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
steve merged commit 8a2de65e31 into main 2026-02-20 14:10:59 +00:00
steve deleted branch feature/readability-remove-hidden 2026-02-20 14:11:00 +00:00