Lucene HTML Parser alternative

If you’ve ever wanted to use the HTMLParser class from the org.apache.lucene.demo.html package in your own software, but wanted a pure Java solution, check out the HTMLParser project on SourceForge. It includes classes that handle link extraction and email address extraction, a sample crawling robot, and a class that extracts the text (minus tags) of an HTML page. [via lucene-user]


One Response to Lucene HTML Parser alternative

  1. Vui Lo says:

    I’m looking forward to your CFDJ article on CFMX-Lucene integration. It is what I wanted to see in a CFMX application, until I ran into a problem with Chinese word searches. How is it possible to search for Chinese words with Lucene? I’ve tried the Chinese Analyzer by Yiyi Sun (listed under the external Lucene resources) but still could not solve the problem.
