Lucene HTML Parser alternative

If you’ve ever wanted to use the HTMLParser class from the org.apache.lucene.demo.html package in your own software but wanted a pure Java solution, check out the HTMLParser project on SourceForge. It includes classes that handle link extraction and email address ripping, a sample crawling robot, and a class that extracts the text (minus tags) of an HTML page. [via lucene user]
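To give a feel for what "extracting the text (minus tags)" involves, here is a minimal sketch that uses only the HTML parser bundled with the JDK (javax.swing.text.html), not the SourceForge HTMLParser API. The class name TagStripper and the method extractText are made up for this illustration, and the JDK parser is an older, lenient one, so treat this as a rough starting point rather than a substitute for the project above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of Lucene or the HTMLParser project.
public class TagStripper {

    /** Returns the text content of an HTML string with the tags removed. */
    public static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The callback receives each run of character data the parser finds between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // separate adjacent text runs with a single space
                }
                text.append(data);
            }
        };

        // The third argument tells the parser to ignore any charset declared in the markup.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The resulting string is the sort of plain text you would hand to a Lucene analyzer when building a document's indexed content field.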

One thought on “Lucene HTML Parser alternative”

I’m looking forward to your CFDJ article on CFMX-Lucene integration. That is what I would like to use in my CFMX application, once I solve a problem with Chinese word searches. How is it possible to perform searches on Chinese words with Lucene? I’ve tried the Chinese Analyzer by Yiyi Sun (listed under the external Lucene resources), but it still did not solve the problem.

Aaron Johnson

