Add DOM cleanup option before readability extraction #60
Problem
Sites like The Verge use infinite scroll that loads additional full articles below the current article in the DOM. When Readability() extracts content, these extra articles get included in the extracted text, producing summaries that mix in content from unrelated articles.
The root cause is that Readability() calls doc.Content() → readability.FromReader() with no opportunity to clean the DOM between those steps.
Current Workaround
In steve/mort#709, we're working around this by replicating what Readability() does but inserting a goquery-based HTML cleaning step between the two calls.
This works but duplicates the Readability() logic and would benefit from native support in go-extractor.
Proposed Solution
Add a way to specify CSS selectors for elements to remove before readability extraction. Some options:
Option A: ReadabilityOptions struct
Option B: Pre-extraction hook
Option C: Add to OpenPageOptions
Any of these would allow consumers to clean problematic DOM elements without reimplementing the readability pipeline.
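
To make the three options a bit more concrete, here is a rough sketch of what each shape might look like, with a small runnable pipeline for the hook variant. All names here (PreExtractHook, extractWithHook, truncateAt, and the field placement in OpenPageOptions) are assumptions for illustration, not the existing go-extractor API, and the hook uses plain string matching rather than real DOM removal so the example stays stdlib-only.

```go
package main

import (
	"fmt"
	"strings"
)

// Option A (hypothetical shape): selectors to strip before extraction.
type ReadabilityOptions struct {
	RemoveSelectors []string
}

// Option B (hypothetical shape): a callback that may rewrite the HTML
// between fetching the content and running readability.
type PreExtractHook func(html string) (string, error)

// Option C (hypothetical shape): the same selectors, configured once when
// the page is opened. Any other fields of the real struct are omitted.
type OpenPageOptions struct {
	RemoveSelectors []string
}

// extractWithHook stands in for the readability pipeline: it applies the
// hook, if one is set, and then hands the HTML to extract.
func extractWithHook(html string, hook PreExtractHook, extract func(string) (string, error)) (string, error) {
	if hook != nil {
		cleaned, err := hook(html)
		if err != nil {
			return "", fmt.Errorf("pre-extraction hook: %w", err)
		}
		html = cleaned
	}
	return extract(html)
}

// truncateAt returns a toy hook that drops everything from the first
// occurrence of marker onward. This is string matching, not a DOM operation.
func truncateAt(marker string) PreExtractHook {
	return func(html string) (string, error) {
		if i := strings.Index(html, marker); i >= 0 {
			return html[:i], nil
		}
		return html, nil
	}
}

func main() {
	page := `<article>Main story</article><div class="next-article">Other story</div>`
	// identity replaces the real extraction step for this sketch.
	identity := func(s string) (string, error) { return s, nil }

	out, err := extractWithHook(page, truncateAt(`<div class="next-article"`), identity)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints: <article>Main story</article>
}
```

Options A and C differ mainly in where the selectors live (per call versus per page), while Option B hands full control to the caller at the cost of a wider API surface.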
Starting work on this. My plan:
1. Go with Option A (ReadabilityOptions with RemoveSelectors): it's the most focused and keeps the API clean.
2. Add a ReadabilityWithOptions() function that takes ReadabilityOptions with a RemoveSelectors []string field.
3. After doc.Content(), use goquery to remove matching elements, then pass the cleaned HTML to readability.FromReader().
4. Keep the Readability() function unchanged for backward compatibility: it will delegate to ReadabilityWithOptions with zero-value options.
Will push a PR shortly.
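
For reference, the control flow of that plan could be sketched roughly as below. This is not the actual PR: Document and pipeline are hypothetical stand-ins, and the clean and extract steps are injected as functions so the example runs with the standard library alone. In the planned change, clean would be the goquery removal and extract would be the readability.FromReader() call.

```go
package main

import (
	"errors"
	"fmt"
)

// ReadabilityOptions follows the plan: selectors whose matching elements
// are removed before readability runs.
type ReadabilityOptions struct {
	RemoveSelectors []string
}

// Document is a hypothetical stand-in for the go-extractor document type;
// only the Content() call mentioned in the issue is modeled.
type Document struct{ HTML string }

func (d *Document) Content() (string, error) {
	if d == nil {
		return "", errors.New("nil document")
	}
	return d.HTML, nil
}

// pipeline holds the two steps this sketch does not implement itself.
type pipeline struct {
	clean   func(html string, selectors []string) (string, error)
	extract func(html string) (string, error)
}

// ReadabilityWithOptions fetches the content, cleans it only when selectors
// were given, and then runs extraction.
func (p pipeline) ReadabilityWithOptions(doc *Document, opts ReadabilityOptions) (string, error) {
	html, err := doc.Content()
	if err != nil {
		return "", fmt.Errorf("content: %w", err)
	}
	if len(opts.RemoveSelectors) > 0 {
		html, err = p.clean(html, opts.RemoveSelectors)
		if err != nil {
			return "", fmt.Errorf("clean: %w", err)
		}
	}
	return p.extract(html)
}

// Readability keeps its signature and delegates with zero-value options,
// so existing callers never hit the cleaning step.
func (p pipeline) Readability(doc *Document) (string, error) {
	return p.ReadabilityWithOptions(doc, ReadabilityOptions{})
}

func main() {
	cleanCalls := 0
	p := pipeline{
		clean: func(html string, selectors []string) (string, error) {
			cleanCalls++
			return html, nil // placeholder: no real DOM work happens here
		},
		extract: func(html string) (string, error) { return html, nil },
	}
	doc := &Document{HTML: "<article>Main story</article>"}

	if _, err := p.Readability(doc); err != nil {
		panic(err)
	}
	fmt.Println("clean calls after Readability:", cleanCalls)

	opts := ReadabilityOptions{RemoveSelectors: []string{".next-article"}}
	if _, err := p.ReadabilityWithOptions(doc, opts); err != nil {
		panic(err)
	}
	fmt.Println("clean calls after ReadabilityWithOptions:", cleanCalls)
}
```

Skipping the cleaning step when RemoveSelectors is empty means the zero-value path does no extra parse and serialize round trip, which should keep Readability() output identical to today's for existing callers.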