HTTP Spider & Lucene

Spent the majority of my day today refactoring the HTTP spider & Lucene indexing application I’ve been writing on and off for the last couple months as a learning exercise. One of the first things I did was modify the 3 modules to implement the Runnable interface rather than extending the Thread object. Big thanks for Joe for his detailed thoughts on the subject. Probably the biggest reason for doing so is that implementing the Runnable interface means that the classes (a class that handles retrieving web pages, a class that indexes the web pages using Lucene and a class that saves the resulting web pages to a database) could possibly extend some type of task/thread class that I’d want to implement in the future (again, a Joe suggestion).

After completing that, I explored the various ways in which one might interface with the software… the only way (right now) being via the command line with multiple arguments. Since remembering command line arguments can be tedious, I looked at the Properties class, whose methods give you the ability to load a text file with key/element pairs and then get() and set() properties within the file. has an introduction to the System and Properties class.

Finally, I rewrote each module (mentioned above) so that while still running inside of a while(boolean) loop, they sleep for .5 seconds before iterating through the loop. Hopefully (and it appears this is true) this means that the CPU isn’t stressed out too much.

I uploaded the source here (it also requires the commons http client jar, the commons logging jar, and the lucene jar. If you’re a Java programmer, I’d love your feedback on the code, not from a feature standpoint but from a syntax and architectural standpoint (ie: I care less about whether or not you think you’d actually use this and more about what you think of the code.) How would you change it? What did I do wrong? What did I do right?

Leave a Reply

Your email address will not be published. Required fields are marked *