All posts by ajohnson

Spider/Text Indexer/Search Web Application update

I hacked a bit on my Spider/Text Indexer/Search Web Application this weekend. All of the Java I’ve written has lived in a servlet container, so (maybe out of ignorance) haven’t spent much time thinking about concurrency. This spider project on the other hand will be doing a) fetching of a URL, b) extracting links from the fetched URL, c) indexing the content, and d) persisting the content into a database. I’m sure this is something that CS grads do in their first year of school, but it’s new (and fun!) to me. So today I read up on threads in Concurrent Programming in Java, recommended by Joe. The code that I hacked at this weekend, I have 3 different threads going: one for fetching and extracting urls, one for indexing with Lucene and one for persisting to the database, each of these classes extends java.lang.Thread. The flow of the program kinda looks like this:

Spider class
— Fetcher extends Thread (retrieve URL, extract URL, loop until all possible URLs are retrieved)
— Indexer extends Thread (use Lucene to index retrieved URLs)
— Archiver extends Thread (persist content to database so that we can offer ‘cached’ versions like google)

The fetcher thread works off of a Vector (I know that an ArrayList would be faster, but it’s not synchronized)of urls, and then feeds the retrieved documents into a Vector of documents that the Indexer picks up, indexes and then updates a Vector of documents to be persisted to the database. The Spider class controls the stopping and starting of each Thread by modifying a boolean property within each class that extends Thread.

However, in the aforementioned book, I read this:

… the best default strategy is to define a Runnable as a separate class and supply it in a Thread constructor. Isolating code within a distinct class relieves you of worrying about any potential interactions of synchronized methods or blocks used in the Runnable with any that may be used by methods of class Thread. More generally, this separation allows independent control over the nature of the action and the context in which it is run: The same Runnable can be supplied to threads that are otherwise initialized in different ways, as well as to other lightweight executors. Also note that subclassing Thread precludes a class from subclassing any other class.

So does anyone have experience using Threads? How would you design something like this? Do you consider it best practice to have your classes implement the Runnable interface and then pass the instances of those classes to a generic Thread constructor?

Second, in my mind I’m imagining that I have an assembly line: urls get retrieved, extracted, then indexed, then archived. My assembly line right now is dumb: each Thread just keeps trying to grab something from the queue until someone presses the stop button. I’ve seen the terms “Producer” and “Consumer” thrown about.. I’m guessing that it might be better to have the threads notify each other whenever they put something into the queue. Better? Worse? Make a difference at all?

CFMX & HttpServletRequest/HttpServletResponse

I’m doing some research for an article I’m writing for Macromedia Devnet (hopefully to be published in May or June). I won’t go into the details of the article, but part of it deals (tangentially) with CFMX and Java integration. Specifically, I’m using CFMX to call a relatively simple Java API that consists of a couple methods, ie:

doSomething(javax.servlet.http.HttpServletRequest req, javax.servlet.http.HttpServletResponse res, long someObject)

Looks pretty imposing doesn’t it? It’s actually pretty easy to call using CFMX.

<cfscript>
myObject = CreateObject(“java”, “com.thirdpartyApp.OtherClass”);
// retrieve the long value from the third party class
t = CreateObject(“java”, “com.thirdpartyApp.Class”);
longValue = t.SOME_CONSTANT;

// get a javax.servlet.http.HttpServletRequest object from CFMX
req = getPageContext().getRequest();

// get a javax.servlet.http.HttpServletResponse object from CFMX
res = getPageContext().getResponse();

myObject.doSomething(req, res, longValue);
</cfscript>

One interesting to note is that if you do getClass() on the req and res objects like this:

<cfoutput>#req.getClass()#</cfoutput>
<cfoutput>#res.getClass()#</cfoutput>

above you get the following class names:

req = jrun.servlet.ForwardRequest
res = coldfusion.jsp.JspWriterIncludeResponse

and although they don’t look like javax.servlet.http.HttpServletResponse and javax.servlet.http.HttpServletRequest objects, they actually extend the javax.servlet.ServletRequestWrapper and javax.servlet.ServletResponseWrapper classes, which are wrappers (as their names imply) for the javax.servlet.ServletRequest and javax.servlet.ServetResponse interfaces.

Anyway, interesting journey.. great help from Sean Corfield for his post about the getClass() method you can use to find out the type of an object and to Charlie Arehart for his comprehensive Java/CFMX archive.

Going for bird

My main man Mike .NET hooked me up big time today with a set of irons from Cobra, a dozen balls from Titleist and a FootJoy bag. Now I just need to learn how to play golf.

On a related note, today at MINDSEYE we announced our new Content Management product, called Element. Titleist, FootJoy.com, FootJoy.co.uk, FootJoy.com.fr, FootJoy.de, FootJoy.nu (and soon Cobra Golf and Scotty Cameron) are using the ASP version of the product. I know we’ll have more information about the product in the coming weeks, but, gosh darn it, we’re pumped!