What algorithm does Readability use for extracting text from URLs?
For a while, I've been trying to find a way of intelligently extracting the "relevant" text from a URL by eliminating the text related to ads and all the other clutter.After several months of researching, I gave it up as a problem that cannot be accurately determined. (I've tried different ways but none were reliable)
A week back, I stumbled across Readability - a plugin that converts any URL into readable text. It looks pretty accurate to me. My guess is that they somehow have an algorithm that's smart enough to extract the relevant text.
Does anyone know how they do it? Or how I could do it reliably?
Readability mainly consists of heuristics that "just somehow work well" in many cases.
I have written some research papers about this topic and I would like to explain the background of why it is easy to come up with a solution that works well and when it gets hard to get close to 100% accuracy.
There seems to be a linguistic law underlying in human language that is also (but not exclusively) manifest in Web page content, which already quite clearly separates two types of text (full-text vs. non-full-text or, roughly, "main content" vs. "boilerplate").
To get the main content from HTML, it is in many cases sufficient to keep only the HTML text elements (i.e. blocks of text that are not interrupted by markup) which have more than about 10 words. It appears that humans choose from two types of text ("short" and "long", measured by the number of words they emit) for two different motivations of writing text. I would call them "navigational" and "informational" motivations.
If an author wants you to quickly get what is written, he/she uses "navigational" text, i.e. few words (like "STOP", "Read this", "Click here"). This is the mostly prominent type of text in navigational elements (menus etc.)
If an author wants you to deeply understand what he/she means, he/she uses many words. This way, ambiguity is removed at the cost of an increase in redundancy. Article-like content usually falls into this class as it has more than only a few words.
While this separation seems to work in a plethora of cases, it is getting tricky with headlines, short sentences, disclaimers, copyright footers etc.
There are more sophisticated strategies, and features, that help separating main content from boilerplate. For example the link density (number of words in a block that are linked versus the overall number of words in the block), the features of the previous/next blocks, the frequency of a particular block text in the "whole" Web, the DOM structure of HTML document, the visual image of the page etc.
You can read my latest article "Boilerplate Detection using Shallow Text Features" to get some insight from a theoretical perspective. You may also watch the video of my paper presentation on VideoLectures.net.
"Readability" uses some of these features. If you carefully watch the SVN changelog, you will see that the number of strategies varied over time, and so did the extraction quality of Readability. For example, the introduction of link density in December 2009 very much helped improving.
In my opinion, it therefore makes no sense in saying "Readability does it like that", without mentioning the exact version number.
I have published an Open Source HTML content extraction library called boilerpipe, which provides several different extraction strategies. Depending on the use case, one or the other extractor works better. You can try these extractors on pages on your choice using the companion boilerpipe-web app on Google AppEngine.
To let numbers speak, see the "Benchmarks" page on the boilerpipe wiki which compares some extraction strategies, including boilerpipe, Readability and Apple Safari.
I should mention that these algorithms assume that the main content is actually full text. There are cases where the "main content" is something else, e.g. an image, a table, a video etc. The algorithms won't work well for such cases.
Readability's workflow and code:
And if you follow the JS and CSS files that the above code pulls in you'll get the whole picture:
http://lab.arc90.com/experiments/readability/js/readability.js (this is pretty well commented, interesting reading)
There's no 100% reliable way to do this, of course. You can have a look at the Readability source code here
Basically, what they're doing is trying to identify positive and negative blocks of text. Positive identifiers (i.e. div IDs) would be something like:
Negative identifiers would be:
And then they have unlikely and maybe candidates. What they would do is determine what is most likely to be the main content of the site, see line 678 in the readability source. This is done by analyzing mostly the length of paragraphs, their identifiers (see above), the DOM tree (i.e. if the paragraph is a last child node), strip out everything unnecessary, remove formatting, etc.
The code has 1792 lines. It does seem like a non trivial problem, so maybe you can get your inspirations from there.
Interesting. I have developed a similar PHP script. It basically scans articles and attaches parts of speech to all text (Brill Tagger). Then, grammatically invalid sentences are instantly eliminated. Then, sudden shifts in pronouns or past tense indicate the article is over, or hasn't started yet. Repeated phrases are searched for and eliminated, like "Yahoo news sports finance" appears ten times in the page. You can also get statistics on the tone with a plethora of word banks relating to various emotions. Sudden changes in tone, from active/negative/financial, to passive/positive/political indicates a boundary. It's endless really, however dig you want to deep.
The major issues are links, embedded anomalies, scripting styles and updates.