Spider/Text Indexer/Search Web Application update

I hacked a bit on my Spider/Text Indexer/Search Web Application this weekend. All of the Java I’ve written has lived in a servlet container, so (maybe out of ignorance) haven’t spent much time thinking about concurrency. This spider project on the other hand will be doing a) fetching of a URL, b) extracting links from the fetched URL, c) indexing the content, and d) persisting the content into a database. I’m sure this is something that CS grads do in their first year of school, but it’s new (and fun!) to me. So today I read up on threads in Concurrent Programming in Java, recommended by Joe. The code that I hacked at this weekend, I have 3 different threads going: one for fetching and extracting urls, one for indexing with Lucene and one for persisting to the database, each of these classes extends java.lang.Thread. The flow of the program kinda looks like this:

Spider class
— Fetcher extends Thread (retrieve URL, extract URL, loop until all possible URLs are retrieved)
— Indexer extends Thread (use Lucene to index retrieved URLs)
— Archiver extends Thread (persist content to database so that we can offer ‘cached’ versions like google)

The fetcher thread works off of a Vector (I know that an ArrayList would be faster, but it’s not synchronized)of urls, and then feeds the retrieved documents into a Vector of documents that the Indexer picks up, indexes and then updates a Vector of documents to be persisted to the database. The Spider class controls the stopping and starting of each Thread by modifying a boolean property within each class that extends Thread.

However, in the aforementioned book, I read this:

… the best default strategy is to define a Runnable as a separate class and supply it in a Thread constructor. Isolating code within a distinct class relieves you of worrying about any potential interactions of synchronized methods or blocks used in the Runnable with any that may be used by methods of class Thread. More generally, this separation allows independent control over the nature of the action and the context in which it is run: The same Runnable can be supplied to threads that are otherwise initialized in different ways, as well as to other lightweight executors. Also note that subclassing Thread precludes a class from subclassing any other class.

So does anyone have experience using Threads? How would you design something like this? Do you consider it best practice to have your classes implement the Runnable interface and then pass the instances of those classes to a generic Thread constructor?

Second, in my mind I’m imagining that I have an assembly line: urls get retrieved, extracted, then indexed, then archived. My assembly line right now is dumb: each Thread just keeps trying to grab something from the queue until someone presses the stop button. I’ve seen the terms “Producer” and “Consumer” thrown about.. I’m guessing that it might be better to have the threads notify each other whenever they put something into the queue. Better? Worse? Make a difference at all?

One thought on “Spider/Text Indexer/Search Web Application update”

  1. Hi,
    if you could possibly put ur source code in here or sent to my id….i would have a good look to check for theoptions you’ve asked for.
    I have developed 2 spiders..working under the same criteria.
    SO….whats ur point?
    Am i missing something ???

    ramesh

Leave a Reply

Your email address will not be published. Required fields are marked *