{"id":603,"date":"2004-06-09T11:34:43","date_gmt":"2004-06-09T15:34:43","guid":{"rendered":"http:\/\/wordpress.cephas.net\/?p=603"},"modified":"2004-06-09T11:34:43","modified_gmt":"2004-06-09T15:34:43","slug":"introducing-jsearch","status":"publish","type":"post","link":"https:\/\/cephas.net\/blog\/2004\/06\/09\/introducing-jsearch\/","title":{"rendered":"introducing jSearch"},"content":{"rendered":"<p>I&#8217;ve been scratching an itch off and on for a couple months now on a project I finally called jSearch (naming suggestions are welcomed!).  <a href=\"\/projects\/jsearch\/jsearch.war\">jSearch<\/a> is a tool for spidering (give it a host name and optionally a path and it will attempt to find every HTML page on that site), indexing (the content of the site is stripped of HTML and then indexed using Lucene), archiving (a copy of every page is saved to a database so that you can provide &#8216;cached&#8217; copies of pages just like Google) and searching (either via a secure web-based form, a non-secure XML API or optionally by copying the Lucene indexes)  websites using Java technology. <\/p>\n<p>jSearch is built on a variety of open source software including <a href=\"http:\/\/jakarta.apache.org\/struts\/\">Struts 1.1<\/a>, <a href=\"http:\/\/jakarta.apache.org\/lucene\/docs\/index.html\">Lucene 1.3<\/a>, the <a href=\"http:\/\/jakarta.apache.org\/commons\/\">Jakarta Commons<\/a> project, <a href=\"http:\/\/www.hibernate.org\/\">Hibernate<\/a>, <a href=\"http:\/\/logging.apache.org\/log4j\/docs\/index.html\">log4J<\/a>, <a href=\"http:\/\/www.mysql.com\/\">MySQL<\/a> (although it should work with every database that Hibernate supports), <a href=\"http:\/\/www.dom4j.org\/\">dom4j<\/a>, <a href=\"http:\/\/classifier4j.sourceforge.net\/\">Classifier4J<\/a>, <a href=\"http:\/\/www.jfree.org\/jfreechart\/\">JFreeChart<\/a> and the <a href=\"http:\/\/cewolf.sourceforge.net\/\">cewolf JSP<\/a> graphing tags. 
The web crawling technology was inspired by work done in the <a href=\"http:\/\/larm.sourceforge.net\/\">Jakarta LARM project<\/a>, which is a subproject of Lucene. <\/p>\n<p>If you have a chance, download the war file <a href=\"\/projects\/jsearch\/jsearch.war\">here<\/a>, deploy it to Tomcat (or your favorite servlet container), and then make a couple of modifications:<\/p>\n<p>a) decide which database you&#8217;re going to use; the war file includes the MySQL driver<\/p>\n<p>b) download and add the appropriate database driver jar files to the \/WEB-INF\/lib directory<\/p>\n<p>c) create the database using the mysql_createtables.sql file (or use the SchemaExport hbm2ddl tool included with Hibernate to create an install for your flavor of persistence)<\/p>\n<p>d) modify the hibernate.cfg.xml: update it with the appropriate driver class, the database connection URL, username, password and database dialect<\/p>\n<p>e) manually add your email address and password to the &#8216;jsearch_user&#8217; table:<br \/>\n<code><br \/>\nINSERT INTO jsearch_user (label,fname,lname,emailaddr,password,type,active)<br \/>\nVALUES('Aaron Johnson','Aaron','Johnson','aaron.s.johnson@gmail.com','password','ADMIN',1)<br \/>\n<\/code><\/p>\n<p>f) restart Tomcat&#8230; <\/p>\n<p>After restarting Tomcat you should be able to access jSearch by going to http:\/\/{yourhost}\/jsearch\/.  You&#8217;ll see the login screen; enter the email address and password that you created in step e and then click the &#8216;login&#8217; button. <\/p>\n<p>Your first step will be to create an &#8216;index&#8217;, which is a combination of a &#8216;host&#8217; (e.g. www.yahoo.com), a &#8216;path&#8217; (e.g. \/sports\/), an index path (the place where you want the Lucene indexes maintained), and a checkbox that activates reporting (which means that jSearch will keep a record of every search performed against the system).  
<\/p>\n<p><a href=\"\/projects\/jsearch\/jsearch_add_index.png\"><img src=\"\/projects\/jsearch\/jsearch_add_index_small.png\" border=\"0\" hspace='5'><\/a><\/p>\n<p>After saving the index, you&#8217;ll need to have jSearch start the spidering process. Check the box next to the index (or indexes) you created and then click the &#8216;spider selected indexes&#8217; link in the lower right hand corner.  jSearch will kick off multiple processes within the context of the servlet container (jSearch can also be run from the command line if necessary) and will begin to download, parse, index and archive all the pages within the host\/path combination.<\/p>\n<p><a href=\"\/projects\/jsearch\/jsearch_spider_index.png\"><img src=\"\/projects\/jsearch\/jsearch_spider_index_small.png\" border=\"0\" hspace='5'><\/a><\/p>\n<p>When the spidering process has been completed, you can use the &#8216;search&#8217; tab to search an index using Lucene&#8217;s <a href=\"http:\/\/jakarta.apache.org\/lucene\/docs\/queryparsersyntax.html\">query parser syntax<\/a>.  After completing a search, you should see a link at the bottom of the page for the &#8216;xml\/rest view&#8217;, which is similar in function to the <a href=\"http:\/\/www.google.com\/apis\/\">Google Web APIs<\/a>, except that is uses REST instead of SOAP.  
The jSearch Web APIs can be used programmatically by other web or desktop applications.<\/p>\n<p><a href=\"\/projects\/jsearch\/jsearch_search_index.png\"><img src=\"\/projects\/jsearch\/jsearch_search_index_small.png\" border=\"0\" hspace='5'><\/a><\/p>\n<p>Next, you can use the &#8216;reporting&#8217; tab to view keyword search reports, which show how many searches the system is handling per day, per week and per month, and which keywords are searched most often.<\/p>\n<p>Finally, you can create \/ edit \/ delete users that are allowed to log in to the system using the &#8216;admin&#8217; tab.<\/p>\n<p>I&#8217;d love to get any feedback you have about the application if you use it; comment on this post or send me an <a href=\"mailto:aaron.s.johnson@gmail.com\">email<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve been scratching an itch off and on for a couple months now on a project I finally called jSearch (naming suggestions are welcomed!). jSearch is a tool for spidering (give it a host name and optionally a path and it will attempt to find every HTML page on that site), indexing (the content of &hellip; <a href=\"https:\/\/cephas.net\/blog\/2004\/06\/09\/introducing-jsearch\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">introducing jSearch<\/span> <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts\/603"}],"collection":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/comments?post=603"}],"version-history":[{"count":0,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts\/603\/revisions"}],"wp:attachment":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/media?parent=603"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/categories?post=603"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/tags?post=603"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}