One of the comments posted to the blog entry introducing jSearch asked why I thought it needed to be created when a tool like Nutch already exists. Nutch is a massive undertaking; its aim is to create a spider and search engine capable of spidering, indexing and searching billions of web pages while also providing a close shave and making breakfast. Nutch wants to be an open source version of Google. I created jSearch to be a smaller version of Google, indexing single websites or even subsections of a website; more like a departmental or corporate spidering, indexing and searching system. If you download the application, you’ll see that jSearch provides some of the same functionality that Google does: cached copies of web pages, an XML API (using REST instead of SOAP), logging and reporting of searches, and content summarization. Sure, you could use the Google web API to provide the same search on your own site, but then you’re limited to the number of searches that Google allows per day with the API (1000), you’re making calls over your WAN to retrieve search results, and you have less control (i.e., you couldn’t have Google index your intranet unless you purchased their appliance).
The second reason I created jSearch was that it was and is an interesting problem to work on. I now have a unique appreciation for the problems that Google (or any other company that has created a spider and search engine) has faced. Writing a spider is not a trivial task. Creating a 2 or 3 sentence summary of an HTML page (technically called ‘Text Summarization’) is a topic for a master’s thesis. And putting a project like this together becomes a study of the various frameworks for search (Lucene), persistence (Hibernate), and web application development (Struts), which is software engineering.
And really, why not? I enjoyed it. It was interesting and I learned something along the way and I plan on using it.
jSearch looks great! There is plenty of need and space in this world for jSearch, Nutch, SearchBlox, zilverline, etc – so don’t worry about such criticisms.
Please add a pointer to the Lucene wiki “powered by” page to jSearch.
I’m just sad to see Struts and a relational database under the covers 🙂 (but those are just two of my personal distastes)
Sounds good, looking forward to playing with it. How long did it take to put together?
AJ, are you going to open source this? Or at least send me the source, so I can give you lots of unsolicited advice about my favorite subject: threading. 🙂
Hi Aaron, nice application!!!
Thought you (and of course your readers) would be interested in a new book by O’Reilly.
Better, Faster, Lighter Java
In Chapter 9 the author builds a Simple Spider 🙂
Hibernate / Spring etc. just my cup of tea.
Available on safari.oreilly.com too, which is great, but where the hell is the Hibernate Developers Handbook!
I got tired of waiting for the Handbook too so I decided to join the early access program at Manning to read Hibernate in Action.
jSearch does look to be a very useful, interesting and needed tool. With the Google API, which is easy to implement, especially building off of the good work done by Peter Freitag, the 1000 page per day limit is potentially problematic.