feat: add ReadabilityWithOptions for DOM cleanup #61

Merged
Claude merged 1 commit from feature/readability-remove-selectors into main 2026-02-19 01:11:38 +00:00

Summary

  • Adds ReadabilityOptions struct with RemoveSelectors []string field for specifying CSS selectors of elements to remove before readability extraction
  • Adds ReadabilityWithOptions() function that applies DOM cleanup before parsing
  • Existing Readability() delegates to ReadabilityWithOptions with zero-value options (fully backward compatible)
  • Uses goquery (new dependency) for CSS selector-based DOM manipulation
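
The delegation described in the bullets above can be sketched roughly as follows. This is a minimal illustration that runs on the standard library alone, not the code from this PR. `ReadabilityOptions`, `RemoveSelectors`, `Readability` and `ReadabilityWithOptions` are the names from the summary, but their signatures here are assumptions. `removeTags` and `extract` are hypothetical stand-ins: the actual change uses goquery for CSS selector matching, whereas this stub only understands bare, attribute-free tag names.

```go
package main

import (
	"fmt"
	"strings"
)

// ReadabilityOptions uses the struct and field names from the PR summary.
type ReadabilityOptions struct {
	// RemoveSelectors lists the elements to strip before extraction.
	RemoveSelectors []string
}

// removeTags is a hypothetical, naive stand-in for the goquery-based cleanup.
// It treats each entry as a bare tag name and drops non-nested,
// attribute-free <tag>...</tag> spans. Real CSS selectors are not supported.
func removeTags(html string, tags []string) string {
	for _, tag := range tags {
		open, end := "<"+tag+">", "</"+tag+">"
		for {
			start := strings.Index(html, open)
			if start < 0 {
				break
			}
			stop := strings.Index(html[start:], end)
			if stop < 0 {
				break // unterminated element: leave the rest untouched
			}
			html = html[:start] + html[start+stop+len(end):]
		}
	}
	return html
}

// extract is a placeholder for the real readability parser (assumption).
func extract(html string) string {
	return strings.TrimSpace(html)
}

// ReadabilityWithOptions cleans the document first, then extracts.
// The (string, ReadabilityOptions) -> string signature is assumed.
func ReadabilityWithOptions(html string, opts ReadabilityOptions) string {
	if len(opts.RemoveSelectors) > 0 {
		html = removeTags(html, opts.RemoveSelectors)
	}
	return extract(html)
}

// Readability keeps its old behavior by passing zero-value options.
func Readability(html string) string {
	return ReadabilityWithOptions(html, ReadabilityOptions{})
}

func main() {
	page := "<article>Main story</article><aside>Next story</aside>"

	// Zero-value options: nothing is removed.
	fmt.Println(Readability(page))

	// With a selector: the trailing element is dropped before extraction.
	opts := ReadabilityOptions{RemoveSelectors: []string{"aside"}}
	fmt.Println(ReadabilityWithOptions(page, opts))
}
```

Because the zero value of `ReadabilityOptions` has a nil `RemoveSelectors` slice, the cleanup step is skipped entirely for existing callers, which is what makes the delegation backward compatible in this sketch.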

Motivation

Sites like The Verge use infinite scroll that loads additional full articles below the current one. When Readability() extracts content, these extra articles pollute the result. This change lets callers specify selectors to remove before extraction, eliminating the need to reimplement the readability pipeline downstream.

Closes #60

Test plan

  • TestReadabilityWithOptions_RemoveSelectors — verifies removed elements are excluded from extraction
  • TestReadabilityWithOptions_NoSelectors — verifies empty options behave like Readability()
  • TestRemoveSelectors — unit test for the HTML cleaning function
  • TestRemoveSelectors_MultipleSelectors — verifies multiple selectors work together
  • All existing tests continue to pass
Claude added 1 commit 2026-02-19 01:09:41 +00:00
feat: add ReadabilityWithOptions for DOM cleanup before extraction
All checks were successful
CI / build (pull_request) Successful in 46s
CI / test (pull_request) Successful in 48s
CI / vet (pull_request) Successful in 1m50s
c1a5814732
Sites with infinite scroll (e.g. The Verge) load additional articles
into the DOM, which get included in readability extraction. Add
ReadabilityOptions.RemoveSelectors to strip elements by CSS selector
before parsing, avoiding the need to reimplement the readability
pipeline downstream.

Closes #60

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude merged commit c982b61bab into main 2026-02-19 01:11:38 +00:00
Claude deleted branch feature/readability-remove-selectors 2026-02-19 01:11:38 +00:00