As part of a project I’m working on I need to extract content from an HTML page, in some sense creating a short 200-character summary of the document. Google does a fantastic job of extracting text and presenting a summary of the document in their search listings, and I’m wondering how they do it. Here’s the process I’m using right now:
a) Remove all of the HTML comments from the page (ie: <!-- -->), because JavaScript sometimes lives inside comments, and it sometimes includes > or < characters, which causes step (d) to fail
b) Remove everything above the <body> tag, because there isn’t anything valuable there anyway.
c) Remove all the <a href… > tags, because text links are usually navigation and are repeated across a site… they’re noise and I don’t want them. However, sometimes links are part of the summary of a document… removing a link in the first paragraph of a document can render the paragraph unreadable, or at least incomplete.
d) Remove all the remaining HTML tags, the line breaks, the tabs, etc. using a regular expression.
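For what it’s worth, the four steps above can be sketched with java.util.regex along these lines. This is a rough sketch, not the author’s actual code: the class name and exact patterns are my own guesses, and any regex approach will miss edge cases in truly garbled HTML.

```java
import java.util.regex.Pattern;

public class HtmlStripper {
    // Sketch of steps (a)-(d); names and patterns are illustrative only.
    public static String strip(String html) {
        // (a) remove comments first, so < and > inside them can't confuse (d)
        String s = html.replaceAll("(?s)<!--.*?-->", " ");
        // (b) drop everything up to and including the <body> tag
        s = s.replaceFirst("(?is)^.*?<body[^>]*>", " ");
        // (c) remove <a href...> elements along with their link text
        s = s.replaceAll("(?is)<a\\s[^>]*>.*?</a>", " ");
        // (d) remove all remaining tags, then collapse whitespace
        s = s.replaceAll("(?s)<[^>]+>", " ");
        return s.replaceAll("\\s+", " ").trim();
    }
}
```

As the post notes for step (c), this drops link text entirely, which can leave a paragraph reading as incomplete.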
For the most part, the four steps above do the job, but in some cases they don’t. I’ll go out on a limb and say that most HTML documents contain text that is repeated throughout the site again and again (header text like Login Now! or footer text like copyright 2004, etc…). My problem is that I want to somehow locate the snippets that are repeated and not include them in the summaries I create… For example, on Google, do this search and then check out the second result:
Fenway Park. … Fenway Park opened on April 20, 1912, the same day as Detroit’s Tiger Stadium and before any of the other existing big league parks. …
That text is about 1/4 of the way down the document. How do they extract it?
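One approach to the repeated-snippet problem is cross-page frequency: if you can fetch a few other pages from the same site, split each page’s extracted text into blocks and drop any block that shows up on many of them. A sketch of that idea; all class and method names here are mine, not from any library:

```java
import java.util.*;

public class BoilerplateFilter {
    // Count how many pages each text block appears on; blocks seen on
    // half or more of the pages are treated as site-wide boilerplate
    // (Login Now!, copyright footers, etc.) and filtered out.
    public static List<String> contentBlocks(List<List<String>> pages, int pageIndex) {
        Map<String, Integer> pageCount = new HashMap<>();
        for (List<String> page : pages) {
            for (String block : new HashSet<>(page)) {
                pageCount.merge(block, 1, Integer::sum);
            }
        }
        int threshold = Math.max(2, pages.size() / 2); // tunable cutoff
        List<String> kept = new ArrayList<>();
        for (String block : pages.get(pageIndex)) {
            if (pageCount.get(block) < threshold) kept.add(block);
        }
        return kept;
    }
}
```

The threshold is a judgment call; too low and you lose real content that happens to repeat, too high and navigation text slips through.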
Parameters: a) I don’t know anything about the documents I’m analyzing; they could be valid XHTML or garbled HTML from 1996. b) It doesn’t have to be extremely fast. c) I’m using Java (if that matters). d) I’ve tried using the org.apache.lucene.demo.html.HTMLParser class, which has a getSummary() method, but it doesn’t work for me (nothing is ever returned).
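As an aside on (a) and (d): since the documents may be pre-standards HTML, one fallback worth noting is the JDK’s own Swing HTML parser, which tolerates broken markup. A minimal sketch, assuming you just want the text nodes; the class and method names here are mine:

```java
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.StringReader;

public class TextExtractor {
    // Collect every text node the (lenient) JDK parser reports.
    public static String extractText(String html) {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback handler = new HTMLEditorKit.ParserCallback() {
            public void handleText(char[] data, int pos) {
                sb.append(data).append(' ');
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), handler, true);
        } catch (IOException e) {
            // a StringReader won't actually throw, but parse() declares it
        }
        return sb.toString().trim();
    }
}
```

Unlike the regex route, this survives unclosed tags, though you would still need the comment/link/boilerplate filtering on top of it.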
Any and all ideas would be appreciated!
Depending on how hardcore you want to be you could try something like MEAD.
For more information:
http://www.summarization.com/
and
http://www.summarization.com/mead/
It works best when it has multiple accounts of the same event to generate a summary from, but it is highly configurable and could probably be set up to give greater weight to sentences containing a specific term.
oh… one last thing: it’s written in Perl 🙂 (BOO!)
Have you looked into Zentext CMS? You may be able to strip down the HTML and pass it through Zentext’s servlet, then capture the summary.
http://www.zentext.com/servlet/dycon/zentext/zentext/live/en/zentext/Summarizer
You might have a look at classifier4j:
http://classifier4j.sourceforge.net/
There’s a simple HTML parser, and a summarizer which seemed to make good enough summaries.
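For anyone curious what a summarizer like that typically does under the hood, a common technique is frequency-based sentence scoring: rank sentences by how many of the document’s frequent words they contain, and keep the top ones. A hand-rolled sketch of that general idea, not classifier4j’s actual API; all names are mine:

```java
import java.util.*;

public class NaiveSummarizer {
    // Score each sentence by the corpus frequency of its words
    // (ignoring very short words) and return the highest-scoring one.
    public static String summarize(String text) {
        String[] sentences = text.split("(?<=[.!?])\\s+");
        Map<String, Integer> freq = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.length() > 3) freq.merge(w, 1, Integer::sum);
        }
        String best = sentences[0];
        int bestScore = -1;
        for (String s : sentences) {
            int score = 0;
            for (String w : s.toLowerCase().split("\\W+")) {
                score += freq.getOrDefault(w, 0);
            }
            if (score > bestScore) { bestScore = score; best = s; }
        }
        return best;
    }
}
```

A real summarizer adds stopword removal and stemming on top of this, but the core scoring loop is roughly this shape.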
FWIW, looks like Google pulls snippets from the page that are relevant to your search. If you were to do a different search that returned the same page, you’d get a different summary text.
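That matches what you can observe. A toy version of such a query-biased snippet just finds the query term in the extracted text and cuts a window of roughly the target length around it; a sketch, with names mine:

```java
public class Snippeter {
    // Return ~width characters of context centered on the first
    // occurrence of the query term, with ellipses marking the cuts.
    public static String snippet(String text, String query, int width) {
        int i = text.toLowerCase().indexOf(query.toLowerCase());
        if (i < 0) return text.substring(0, Math.min(width, text.length()));
        int start = Math.max(0, i - width / 2);
        int end = Math.min(text.length(), i + query.length() + width / 2);
        return (start > 0 ? "... " : "") + text.substring(start, end).trim()
             + (end < text.length() ? " ..." : "");
    }
}
```

This would explain why the Fenway Park snippet starts a quarter of the way down the page: that is where the query terms cluster.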