{"id":436,"date":"2003-06-24T09:14:05","date_gmt":"2003-06-24T13:14:05","guid":{"rendered":"http:\/\/wordpress.cephas.net\/?p=436"},"modified":"2003-06-24T09:14:05","modified_gmt":"2003-06-24T13:14:05","slug":"lucene-html-parser-alternative","status":"publish","type":"post","link":"https:\/\/cephas.net\/blog\/2003\/06\/24\/lucene-html-parser-alternative\/","title":{"rendered":"Lucene HTML Parser alternative"},"content":{"rendered":"<p>If you&#8217;ve ever wanted to use the HTMLParser class from the org.apache.lucene.demo.html package in your own software but would prefer a pure Java solution, check out the <a href=\"http:\/\/htmlparser.sourceforge.net\/\">HTMLParser<\/a> project on SourceForge. It includes classes that handle <a href=\"http:\/\/htmlparser.sourceforge.net\/javadoc_1_3\/org\/htmlparser\/parserapplications\/LinkExtractor.html\">link extraction<\/a>, <a href=\"http:\/\/htmlparser.sourceforge.net\/javadoc_1_3\/org\/htmlparser\/parserapplications\/MailRipper.html\">email address ripping<\/a>, a sample <a href=\"http:\/\/htmlparser.sourceforge.net\/javadoc_1_3\/org\/htmlparser\/parserapplications\/Robot.html\">crawling robot<\/a>, and a class that <a href=\"http:\/\/htmlparser.sourceforge.net\/javadoc_1_3\/org\/htmlparser\/parserapplications\/StringExtractor.html\">extracts the text<\/a> (minus tags) of an HTML page. [via <a href=\"http:\/\/nagoya.apache.org\/eyebrowse\/ReadMsg?listName=lucene-user@jakarta.apache.org&amp;msgNo=4557\">lucene user<\/a>]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you&#8217;ve ever wanted to use the HTMLParser class from the org.apache.lucene.demo.html package in your own software but would prefer a pure Java solution, check out the HTMLParser project on SourceForge.
It includes classes that handle link extraction, email address ripping, a sample crawling robot, and a class that extracts the text (minus tags) &hellip; <a href=\"https:\/\/cephas.net\/blog\/2003\/06\/24\/lucene-html-parser-alternative\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Lucene HTML Parser alternative<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts\/436"}],"collection":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/comments?post=436"}],"version-history":[{"count":0,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts\/436\/revisions"}],"wp:attachment":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/media?parent=436"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/categories?post=436"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/tags?post=436"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}