Wanted: Extracting summary from HTML text

As part of a project I’m working on I need to extract content from an HTML page, in some sense creating a short 200 character summary of the document. Google does a fantastic job of extracting text and presenting a summary of the document in their search listings, I’m wondering how they do that. Here’s the process I’m using right now:

a) Remove all of the HTML comments from the page (ie: <!– –>) because JavaScript is sometimes inside comments, which sometimes includes > and or < which causes (d) to fail

b) Remove everything above the <body> tag, because there isn’t anything valuable there anyway.

c) Remove all the &lta href… > tags, because text links are usually navigation and are repeated across a site… they’re noise and I don’t want them. However, sometimes links are part of the summary of a document… removing a link in the first paragraph of a document can render the paragraph unreadable, or at least incomplete.

b) Remove all the HTML tags, the line breaks, the tabs, etc.. using a regular expression.

For the most part, the above 4 steps do the job, but in some cases not. I’ll go out on a ledge and say that most HTML documents contain text that is repeated throughout the site again and again (header text like Login Now! or footer text like copyright 2004, etc…). My problem is that I want to somehow locate the snippets that are repeated and not include them in the summaries I create… For example, on google do this search and then check out the second result:

Fenway Park. … Fenway Park opened on April 20, 1912, the same day as Detroit’s Tiger Stadium and before any of the other existing big league parks. …

That text is way about 1/4 of the way down in the document. How do they extract that?

Parameters: a) I don’t know anything about the documents that I’m analyzing, they could be valid XHTML or garbled HTML from 1996, b) it doesn’t have to be extremely fast, c) I’m using Java (if that matters) , d) I’ve tried using the org.apache.lucene.demo.html.HTMLParser class, which has a method getSummary(), but it doesn’t work for me (nothing is ever returned)

Any and all ideas would be appreciated!

5 thoughts on “Wanted: Extracting summary from HTML text”

  1. FWIW, looks like Google pulls snippets from the page that are relevant to your search. If you were to do a different search that returned the same page, you’d get a different summary text.

Leave a Reply

Your email address will not be published. Required fields are marked *