Category Archives: Lucene

HTTP Spider & Lucene

Spent the majority of my day today refactoring the HTTP spider & Lucene indexing application I’ve been writing on and off for the last couple of months as a learning exercise. One of the first things I did was modify the 3 modules to implement the Runnable interface rather than extending the Thread class. Big thanks to Joe for his detailed thoughts on the subject. Probably the biggest reason for doing so is that a Java class can only extend one superclass, so implementing the Runnable interface leaves the classes (a class that handles retrieving web pages, a class that indexes the web pages using Lucene and a class that saves the resulting web pages to a database) free to extend some type of task/thread class that I might want to implement in the future (again, a Joe suggestion).
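To make the Runnable point concrete, here is a minimal sketch. The names SpiderTask, PageFetcher and RunnableDemo are made up for illustration and are not the classes in the uploaded source; the only thing it tries to show is that a class implementing Runnable can still extend a base class of its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical base class (an assumption, not from the real spider).
abstract class SpiderTask {
    private final String name;

    SpiderTask(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// Implements Runnable instead of extending Thread, so the single
// "extends" slot is still available for SpiderTask.
class PageFetcher extends SpiderTask implements Runnable {
    private final List<String> log;

    PageFetcher(List<String> log) {
        super("fetcher");
        this.log = log;
    }

    @Override
    public void run() {
        // Placeholder for the real work (retrieving a web page).
        log.add(getName() + " ran");
    }
}

class RunnableDemo {
    // Runs one PageFetcher on its own thread and returns what it logged.
    static List<String> runOnce() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        Thread t = new Thread(new PageFetcher(log));
        t.start();
        try {
            t.join(); // wait for the worker to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(runOnce());
    }
}
```

Had PageFetcher extended Thread directly, it could not also extend SpiderTask.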

After completing that, I explored the various ways in which one might interface with the software… the only way (right now) being via the command line with multiple arguments. Since remembering command line arguments can be tedious, I looked at the Properties class, which can load a text file of key/value pairs and then lets you read and change individual properties with getProperty() and setProperty(). Java.sun.com has an introduction to the System and Properties classes.
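A rough sketch of how the Properties approach might look. The keys startUrl, maxPages and indexDir and the SpiderConfig class are hypothetical, and the text is parsed from a string here only to keep the example self-contained; a real run would more likely call load() on a FileInputStream for the properties file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class SpiderConfig {
    // Parses key/value text in the standard .properties format.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical settings standing in for the command line arguments.
        Properties props = parse("startUrl=http://example.com/\nmaxPages=100\n");

        String startUrl = props.getProperty("startUrl");
        // The second argument is a default used when the key is missing.
        int maxPages = Integer.parseInt(props.getProperty("maxPages", "50"));
        String indexDir = props.getProperty("indexDir", "index");

        // Properties can also be changed in memory after loading.
        props.setProperty("indexDir", indexDir);

        System.out.println(startUrl + " " + maxPages + " " + indexDir);
    }
}
```

Note that setProperty() only changes the in-memory copy; writing the values back to the file would need a separate call to store().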

Finally, I rewrote each module (mentioned above) so that, while still running inside of a while(boolean) loop, they sleep for 0.5 seconds on each pass through the loop. Hopefully (and it appears this is true) this means that the loops no longer spin at full speed and the CPU isn’t stressed out too much.
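The loop-with-a-pause idea could be sketched roughly like this. PollingWorker and its methods are hypothetical names, not the ones in the uploaded source, and the counter stands in for whatever work each module really does.

```java
class PollingWorker implements Runnable {
    // volatile so a stop() call from another thread is seen by the loop
    private volatile boolean running = true;
    private volatile int iterations = 0; // written only by the worker thread
    private final long pauseMillis;

    PollingWorker(long pauseMillis) {
        this.pauseMillis = pauseMillis;
    }

    void stop() {
        running = false;
    }

    int getIterations() {
        return iterations;
    }

    @Override
    public void run() {
        while (running) {
            iterations++; // placeholder for fetching, indexing or saving
            try {
                Thread.sleep(pauseMillis); // give the CPU a rest between passes
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return; // treat an interrupt as a request to stop
            }
        }
    }

    // Runs a worker for roughly totalMillis and reports how many passes it made.
    static int runFor(long pauseMillis, long totalMillis) {
        PollingWorker worker = new PollingWorker(pauseMillis);
        Thread t = new Thread(worker);
        t.start();
        try {
            Thread.sleep(totalMillis);
            worker.stop();
            t.join();
        } catch (InterruptedException e) {
            worker.stop();
            Thread.currentThread().interrupt();
        }
        return worker.getIterations();
    }

    public static void main(String[] args) {
        // 500 ms pause, matching the half-second sleep described above.
        System.out.println("passes: " + runFor(500, 1200));
    }
}
```

Without the sleep, the same loop would run as fast as the CPU allows even when there is nothing to do.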

I uploaded the source here (it also requires the commons http client jar, the commons logging jar, and the lucene jar). If you’re a Java programmer, I’d love your feedback on the code, not from a feature standpoint but from a syntax and architectural standpoint (i.e., I care less about whether or not you think you’d actually use this and more about what you think of the code). How would you change it? What did I do wrong? What did I do right?

CFX_LUCENE

I mentioned Lindex (described as a “… high performance, full-featured text search engine that allows developers to create document collections for easy indexing and quick searching”) two days ago. After inspecting it a bit further, it looks like it offers an interface for developers to create and maintain Lucene indexes using ColdFusion (and I’m guessing allows them to search indexes as well), which is a nice feature. I’d love to see it.

Anyway, inspired by Lindex, tonight I hacked together a Java CFX tag that closely mimics the <cfsearch> tag using Lucene as the search engine. You can download the Java source here [ update 11/04/2003: Nick Burch from torchbox.com sent me an updated version that “behaves better under error conditions and … the command line debug now works”, thanks Nick! Clicking on the ‘lucene.java’ link above will download the updated version ].

To compile it, you’ll have to add both the cfx.jar file (usually in \CFusionMX\lib\cfx.jar) and the lucene.jar (get yours here) file to your classpath manually or specify them at compile time. If you’re compiling from the command line, it might look something like this:

$ javac -classpath "c:\cfusionmx\lib\cfx.jar;c:\lucene\lucene.jar" lucene.java

After you compile the class, you’ll need to

a) copy it to a directory that ColdFusion is aware of (i.e., one listed under /cfide/administrator/ -> Java and JVM -> Class Path)

b) add the lucene.jar to the Class Path mentioned in ‘a’

c) register the CFX in the ColdFusion Administrator (/cfide/administrator/ -> Extensions -> CFX Tags). Click on ‘Register Java CFX’. The tag name should be ‘cfx_lucene’, the class name should be ‘lucene’.

d) restart CFMX.

e) and finally, create a .cfm page and add this script:

<cfx_lucene
  query="r_query"
  indexName="C:\hosts\cephas.net\wwwroot\blog\index"
  startIndex="1"
  maxPage="10"
  queryString="java">

The above script presumes that you have a Lucene index already created in the directory ‘C:\hosts\cephas.net\wwwroot\blog\index’, is looking for the keyword ‘java’, and will return a ColdFusion query object to the template with the columns ‘title’, ‘url’, and ‘summary’.

To see the results, you can dump the query with the CFDUMP tag:

<cfdump var="#r_query#">

Caveat: It’s 1:42am EST so I’ve done no testing on it and it has no interface (like I’m sure Lindex does). Use at your own risk. If you do use it, please keep my name/email in the source somewhere and remember to thank Joe for the idea. Enjoy!

Java application that does searching, indexing, crawling and reporting

Everyone has itches, right? Joe has one for an IMAP server. Ray wanted a CFC based blog. I want a Java application that does searching, indexing, crawling and reporting that can be deployed on any servlet container. I’m sure there are people out there that could write one up in a couple of days and I’m sure there are already applications that perform these exact functions (for example: ht//Dig does it, just not in Java). I’d like to attack it because I think it would be a fascinating (and fun!) exercise. So anyway, following are a couple of the features I’d like to implement and then some beginning research… I have questions at the end for anyone who has done a project similar in scope to this.

Features