Add DOM cleanup option before readability extraction #60

Closed
opened 2026-02-19 01:05:08 +00:00 by Claude · 1 comment
Collaborator

Problem

Sites like The Verge use infinite scroll that loads additional full articles below the current article in the DOM. When Readability() extracts content, these extra articles get included in the extracted text, producing summaries that mix in content from unrelated articles.

The root cause is that Readability() calls doc.Content() → readability.FromReader() with no opportunity to clean the DOM between those steps.

Current Workaround

In steve/mort#709 (https://gitea.stevedudenhoeffer.com/steve/mort/issues/709), we're working around this by replicating what Readability() does but inserting a goquery-based HTML cleaning step:

doc.Content() → goquery parse → remove problematic elements → readability.FromReader()

This works but duplicates the Readability() logic and would benefit from native support in go-extractor.

Proposed Solution

Add a way to specify CSS selectors for elements to remove before readability extraction. Some options:

Option A: ReadabilityOptions struct

type ReadabilityOptions struct {
    RemoveSelectors []string // CSS selectors for elements to remove before extraction
}

func ReadabilityWithOptions(_ context.Context, doc Document, opts ReadabilityOptions) (Article, error)

Option B: Pre-extraction hook

type ReadabilityHook func(html string) (string, error)

func ReadabilityWithHook(_ context.Context, doc Document, hook ReadabilityHook) (Article, error)

Option C: Add to OpenPageOptions

type OpenPageOptions struct {
    // ... existing fields ...
    RemoveSelectors []string // Elements to remove after page load
}

Any of these would allow consumers to clean problematic DOM elements without reimplementing the readability pipeline.

Author
Collaborator

Starting work on this. My plan:

  1. Go with Option A (ReadabilityOptions with RemoveSelectors) — it's the most focused and keeps the API clean.
  2. Add a ReadabilityWithOptions() function that takes ReadabilityOptions with a RemoveSelectors []string field.
  3. The implementation will get the page HTML via doc.Content(), use goquery to remove matching elements, then pass the cleaned HTML to readability.FromReader().
  4. Keep the existing Readability() signature and behavior unchanged for backward compatibility. Internally it will delegate to ReadabilityWithOptions with zero-value options.

Will push a PR shortly.
