{"id":604,"date":"2004-06-10T22:22:16","date_gmt":"2004-06-11T02:22:16","guid":{"rendered":"http:\/\/wordpress.cephas.net\/?p=604"},"modified":"2004-06-10T22:22:16","modified_gmt":"2004-06-11T02:22:16","slug":"why-create-jsearch","status":"publish","type":"post","link":"https:\/\/cephas.net\/blog\/2004\/06\/10\/why-create-jsearch\/","title":{"rendered":"why create jSearch?"},"content":{"rendered":"<p>One of the comments posted to the <a href=\"http:\/\/cephas.net\/blog\/2004\/06\/09\/introducing_jsearch.html\">blog entry introducing jSearch<\/a> asked why I thought it needed to be created when a tool like <a href=\"http:\/\/www.nutch.org\/\">nutch<\/a> already exists. nutch is a massive undertaking: its aim is to create a spider and search engine capable of spidering, indexing and searching billions of web pages while also providing a close shave and making breakfast. nutch wants to be an open source version of google.  I created jSearch to be a smaller version of google, indexing single websites or even subsections of a website; more like a departmental or corporate spider, indexing and searching system.  If you download the application, you&#8217;ll see that jSearch provides some of the same functionality that google does: cached copies of web pages, an XML API (using REST instead of SOAP), logging and reporting of searches and content summarization.  Sure, you could use the <a href=\"http:\/\/www.google.com\/apis\/\">google web api<\/a> to provide the same search on your own site, but then you&#8217;re limited to the number of searches that google allows per day (1000) with the API, you&#8217;re making calls over your WAN to retrieve search results, and you have less control (i.e., you couldn&#8217;t have google index your intranet unless you purchased their <a href=\"http:\/\/www.google.com\/appliance\/\">appliance<\/a>).<\/p>\n<p>The second reason I created jSearch was that it was and is an interesting problem to work on.  
I now have a unique appreciation for the problems that google (or any other company that has created a spider and search engine) has faced.  Writing a spider is not a trivial task. Creating a 2 or 3 sentence summary of an HTML page (technically called &#8216;Text Summarization&#8217;) is a topic for a master&#8217;s thesis. And putting a project like this together becomes a study of the various frameworks for search (Lucene), persistence (Hibernate), and web application development (Struts), which is software engineering.  <\/p>\n<p>And really, why not? I enjoyed it.  It was interesting and I learned something along the way and I plan on using it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the comments posted to the blog entry introducing jSearch asked why I thought it needed to be created when a tool like nutch already exists. nutch is a massive undertaking: its aim is to create a spider and search engine capable of spidering, indexing and searching billions of web pages while also providing &hellip; <a href=\"https:\/\/cephas.net\/blog\/2004\/06\/10\/why-create-jsearch\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">why create jSearch?<\/span> <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[3,19,4,2],"tags":[],"_links":{"self":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts\/604"}],"collection":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/comments?post=604"}],"version-history":[{"count":0,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts\/604\/revisions"}],"wp:attachment":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/media?parent=604"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/categories?post=604"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/tags?post=604"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}