karakoram

Overview
karakoram is a Java tool for spidering, indexing, archiving and searching websites:
  • spidering: give it a host name and optionally a path, and it will attempt to find every HTML page on that site
  • indexing: the content of the site is stripped of HTML and then indexed using Lucene
  • archiving: a copy of every page is saved to a database so that you can provide 'cached' copies of pages, just like Google
  • searching: via a secure web-based form, a non-secure XML API or, optionally, by copying the Lucene indexes
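
The spidering and HTML-stripping steps described above can be illustrated with a minimal sketch. This is not karakoram's actual code: the class and method names (SpiderSketch, sameHostLinks, stripHtml) are hypothetical, it uses only the Java standard library, and it relies on naive regular expressions where a real crawler would use a proper HTML parser. It only shows the idea of collecting links that stay on the starting host and reducing a page to plain text before handing it to an indexer such as Lucene.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the karakoram code base.
final class SpiderSketch {
    // Naive href matcher (assumption: attribute values are quoted).
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);
    // Naive tag matcher; does not handle comments or script bodies.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Returns the absolute links in html that are on the same host as pageUrl. */
    static List<String> sameHostLinks(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        Set<String> links = new LinkedHashSet<>(); // keeps order, drops duplicates
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash); // ignore in-page fragments
            }
            if (href.isEmpty()) {
                continue;
            }
            URI resolved;
            try {
                resolved = base.resolve(href); // turn relative links into absolute ones
            } catch (IllegalArgumentException e) {
                continue; // skip malformed hrefs
            }
            String host = base.getHost();
            if (host != null && host.equalsIgnoreCase(resolved.getHost())) {
                links.add(resolved.toString());
            }
        }
        return new ArrayList<>(links);
    }

    /** Replaces tags with spaces and collapses whitespace, leaving indexable text. */
    static String stripHtml(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Home</h1><a href=\"/docs/index.html\">Docs</a> "
            + "<a href='http://other.example.org/'>Elsewhere</a></body></html>";
        // Only the link on www.example.com is kept.
        System.out.println(sameHostLinks("http://www.example.com/start.html", html));
        // The remaining text is what would be passed on to the indexer.
        System.out.println(stripHtml(html));
    }
}
```

A crawler built on this idea would keep a queue of unvisited same-host links and repeat the two steps for each fetched page until the queue is empty.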

karakoram is built on a variety of open source software, including Struts 1.1, Lucene 1.3, the Jakarta Commons project, Hibernate, log4j, MySQL (although it should work with any database that Hibernate supports), dom4j, Classifier4J, JFreeChart and the cewolf JSP graphing tags. The web crawling technology was inspired by work done in the Jakarta LARM project, which is a subproject of Lucene.

Binaries
version 1.1
  • modified JSP templates so that you don't have to install the application at /karakoram/
  • renamed project from 'jsearch' to 'karakoram'
  • added Hibernate XDoclet tags to automate DDL export and the creation of the Hibernate mapping and config documents
  • updated hibernate.cfg.xml to use a JNDI-configured datasource instead of a datasource configured in hibernate.cfg.xml itself


Installation
Installation instructions are available here.

News
Coming soon!

Tutorials
· Installation
· Indexing
· Searching
· Reporting
· XML-API


License
The karakoram project is licensed under the Apache 2.0 license (http://www.apache.org/licenses/LICENSE-2.0.html).

Frequently Asked Questions
Coming soon!