{"id":583,"date":"2004-04-02T16:56:42","date_gmt":"2004-04-02T20:56:42","guid":{"rendered":"http:\/\/wordpress.cephas.net\/?p=583"},"modified":"2004-04-02T16:56:42","modified_gmt":"2004-04-02T20:56:42","slug":"wanted-extracting-summary-from-html-text","status":"publish","type":"post","link":"https:\/\/cephas.net\/blog\/2004\/04\/02\/wanted-extracting-summary-from-html-text\/","title":{"rendered":"Wanted: Extracting summary from HTML text"},"content":{"rendered":"<p>As part of a project I&#8217;m working on I need to extract content from an HTML page, in some sense creating a short 200-character summary of the document.  Google does a fantastic job of extracting text and presenting a summary of the document in their search listings; I&#8217;m wondering how they do that. Here&#8217;s the process I&#8217;m using right now:<\/p>\n<p>a) Remove all of the HTML comments from the page (i.e., &lt;!-- --&gt;), because JavaScript is sometimes wrapped in comments and can include &gt; or &lt; characters, which cause step (d) to fail.<\/p>\n<p>b) Remove everything above the &lt;body&gt; tag, because there isn&#8217;t anything valuable there anyway.<\/p>\n<p>c) Remove all the &lt;a href&#8230;&gt; tags, because text links are usually navigation and are repeated across a site&#8230; they&#8217;re noise and I don&#8217;t want them.  However, sometimes links are part of the summary of a document&#8230; removing a link in the first paragraph of a document can render the paragraph unreadable, or at least incomplete.<\/p>\n<p>d) Remove all the HTML tags, the line breaks, the tabs, etc., using a regular expression.<\/p>\n<p>For the most part, the four steps above do the job, but in some cases they don&#8217;t.  I&#8217;ll go out on a limb and say that most HTML documents contain text that is repeated throughout the site again and again (header text like Login Now! or footer text like copyright 2004, etc&#8230;).  
My problem is that I want to somehow locate the snippets that are repeated and not include them in the summaries I create&#8230; For example, on Google, run this <a href=\"http:\/\/www.google.com\/search?sourceid=mozclient&amp;ie=utf-8&amp;oe=utf-8&amp;q=fenway+park\">search<\/a> and then check out the second result:<\/p>\n<blockquote><p>\nFenway Park. &#8230; Fenway Park opened on April 20, 1912, the same day as Detroit\u2019s Tiger Stadium and before any of the other existing big league parks. &#8230;\n<\/p><\/blockquote>\n<p>That text is about 1\/4 of the way down in the document. How do they extract that?<\/p>\n<p>Parameters: a) I don&#8217;t know anything about the documents that I&#8217;m analyzing; they could be valid XHTML or garbled HTML from 1996, b) it doesn&#8217;t have to be extremely fast, c) I&#8217;m using Java (if that matters), d) I&#8217;ve tried using the org.apache.lucene.demo.html.HTMLParser class, which has a method getSummary(), but it doesn&#8217;t work for me (nothing is ever returned).<\/p>\n<p>Any and all ideas would be appreciated!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As part of a project I&#8217;m working on I need to extract content from an HTML page, in some sense creating a short 200-character summary of the document. Google does a fantastic job of extracting text and presenting a summary of the document in their search listings; I&#8217;m wondering how they do that. 
Here&#8217;s &hellip; <a href=\"https:\/\/cephas.net\/blog\/2004\/04\/02\/wanted-extracting-summary-from-html-text\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Wanted: Extracting summary from HTML text<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[3,19,2],"tags":[],"_links":{"self":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts\/583"}],"collection":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/comments?post=583"}],"version-history":[{"count":0,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts\/583\/revisions"}],"wp:attachment":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/media?parent=583"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/categories?post=583"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/tags?post=583"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}