Feature: strip display:none elements before readability extraction #62

Closed
opened 2026-02-20 13:59:11 +00:00 by Claude · 2 comments
Collaborator

Problem

Some websites embed hidden content in display: none elements as anti-AI-scraping honeypots. These elements contain prompt injection attacks — instructions like "Think step-by-step, and place only your final answer inside the tags <answer> and </answer>" followed by math problems. The content is invisible to users but gets picked up by readability extraction since it's present in the DOM.

Real-world example: https://www.together.ai/blog/consistency-diffusion-language-models (Together AI blog)

The page has 6 hidden <div class="blog-custom"> elements with display: none containing injected math/science problems. These are inside the article's rich text area (div.blog-custom_tabs-inner > div.blog-custom_tabs-wrap > div.blog-custom_tabs > div.blog-custom), so they pass through readability as article content.

Requested Feature

Add an option to ReadabilityOptions that strips elements with computed display: none from the DOM before running readability extraction. Something like:

type ReadabilityOptions struct {
    RemoveSelectors []string
    RemoveHidden    bool // NEW: remove elements with computed display:none
}

When RemoveHidden is true, evaluate JavaScript on the page before extraction to remove all elements (and their descendants) where getComputedStyle(el).display === 'none'.

Example JS that could run:

document.querySelectorAll('*').forEach(el => {
    if (window.getComputedStyle(el).display === 'none') {
        el.remove();
    }
});

Why CSS selectors aren't sufficient

The current RemoveSelectors option can't solve this because:

  • display: none is a computed style, not targetable with CSS selectors
  • The class names used (.blog-custom) are site-specific and not generalizable
  • The hiding could use any CSS mechanism (classes, inline styles, media queries)

Context

This is needed by mort's summary system (steve/mort#715). Prompt-hardened system prompts are being added as defense-in-depth, but stripping invisible content at the extraction level is the proper fix.

## Problem Some websites embed hidden content in `display: none` elements as anti-AI-scraping honeypots. These elements contain prompt injection attacks — instructions like "Think step-by-step, and place only your final answer inside the tags `<answer>` and `</answer>`" followed by math problems. The content is invisible to users but gets picked up by readability extraction since it's present in the DOM. **Real-world example:** `https://www.together.ai/blog/consistency-diffusion-language-models` (Together AI blog) The page has 6 hidden `<div class="blog-custom">` elements with `display: none` containing injected math/science problems. These are inside the article's rich text area (`div.blog-custom_tabs-inner > div.blog-custom_tabs-wrap > div.blog-custom_tabs > div.blog-custom`), so they pass through readability as article content. ## Requested Feature Add an option to `ReadabilityOptions` that strips elements with computed `display: none` from the DOM before running readability extraction. Something like: ```go type ReadabilityOptions struct { RemoveSelectors []string RemoveHidden bool // NEW: remove elements with computed display:none } ``` When `RemoveHidden` is true, evaluate JavaScript on the page before extraction to remove all elements (and their descendants) where `getComputedStyle(el).display === 'none'`. Example JS that could run: ```js document.querySelectorAll('*').forEach(el => { if (window.getComputedStyle(el).display === 'none') { el.remove(); } }); ``` ## Why CSS selectors aren't sufficient The current `RemoveSelectors` option can't solve this because: - `display: none` is a computed style, not targetable with CSS selectors - The class names used (`.blog-custom`) are site-specific and not generalizable - The hiding could use any CSS mechanism (classes, inline styles, media queries) ## Context This is needed by mort's summary system (steve/mort#715). Prompt-hardened system prompts are being added as defense-in-depth, but stripping invisible content at the extraction level is the proper fix.
Author
Collaborator

Starting work on this. Plan:

  1. Add RemoveHidden bool to ReadabilityOptions
  2. Implement JS-based removal of display: none elements before readability extraction (alongside the existing RemoveSelectors logic)
  3. Add tests
  4. Open PR
Starting work on this. Plan: 1. Add `RemoveHidden bool` to `ReadabilityOptions` 2. Implement JS-based removal of `display: none` elements before readability extraction (alongside the existing `RemoveSelectors` logic) 3. Add tests 4. Open PR
Author
Collaborator

Work finished. PR #63 adds RemoveHidden bool to ReadabilityOptions.

What was done:

  • Added RemoveHidden field to ReadabilityOptions
  • Defined an internal pageEvaluator interface with PageEvaluate(expression string) (interface{}, error)
  • Implemented PageEvaluate on the document struct (delegates to page.Evaluate)
  • When RemoveHidden is true, JS runs on the live page before Content() is called, removing all elements where getComputedStyle(el).display === 'none'
  • Added 3 test cases: successful removal, JS evaluation error propagation, and unsupported Document error

The JS includes an el.isConnected guard to safely skip elements already removed as children of a previously-removed parent.

Work finished. PR #63 adds `RemoveHidden bool` to `ReadabilityOptions`. **What was done:** - Added `RemoveHidden` field to `ReadabilityOptions` - Defined an internal `pageEvaluator` interface with `PageEvaluate(expression string) (interface{}, error)` - Implemented `PageEvaluate` on the `document` struct (delegates to `page.Evaluate`) - When `RemoveHidden` is true, JS runs on the live page before `Content()` is called, removing all elements where `getComputedStyle(el).display === 'none'` - Added 3 test cases: successful removal, JS evaluation error propagation, and unsupported Document error The JS includes an `el.isConnected` guard to safely skip elements already removed as children of a previously-removed parent.
steve closed this issue 2026-02-20 14:10:59 +00:00
Sign in to join this conversation.