I was reading this month’s Java Developer’s Journal while exercising today, specifically the article titled “Search-Enable Your Application with Lucene”. A couple of months ago, when I first added Lucene searching to this site, I thought it would be a great feature to be able to index a URL. So, for example, when creating and updating an index of files in a directory on the file system, you’d do something like this:
IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
File file = new File("c:\\htmlToIndex");
String[] files = file.list();
for (int i = 0; i < files.length; i++) { […] }
[…] Verity Spidering. Very nice! So I guess the same indexing I mentioned above could be done from the command line like so:
c:\cfusionmx\lib\_nti40\bin\vspider -common c:\cfusionmx\lib\common -collection c:\new -start http://www.mysite.com/products/? -indinclude *
But one of the advantages that Lucene has over a product like Verity is that you can customize the indexing and searching routines. For instance, one of the examples the author (Craig Walls) gave was adding synonym matching to your indexing routine. Basically, in Lucene, if you want to add synonyms to keywords, you subclass TokenFilter, write a short bit of code (he provided an example in the source code) and you’re done. To the best of my knowledge, you can’t do that with Verity. Correction: you can’t “extend” Verity, but it comes with a feature similar to the synonym matching mentioned above, called “THESAURUS” (“Expands the search to include the word that you enter and its synonyms”). I’ve not spent much time with Verity, but the evidence operators on the CFMX docs page are really intriguing, specifically “THESAURUS”, “SOUNDEX” and “TYPO/N”.
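To make the synonym idea a bit more concrete, here’s a rough sketch of what such a filter does to a stream of tokens. It is plain Java and deliberately does not use any Lucene classes, so it is not the article’s code and not Lucene’s TokenFilter API. The class name SynonymExpander, its expand method and the little synonym table are all made up for illustration. In a real Lucene filter, as I understand it, the same logic would live inside your TokenFilter subclass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, dependency-free sketch of synonym injection at indexing time.
// It only mimics the idea behind a synonym TokenFilter; no Lucene API is used.
class SynonymExpander {
    // Maps a lowercase keyword to the synonyms that should be indexed with it.
    private final Map<String, List<String>> synonyms;

    SynonymExpander(Map<String, List<String>> synonyms) {
        this.synonyms = synonyms;
    }

    // Returns the input tokens, with each token's synonyms emitted right after it.
    List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String token : tokens) {
            out.add(token);
            List<String> extra = synonyms.get(token.toLowerCase());
            if (extra != null) {
                out.addAll(extra);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up synonym table, just for the demo.
        Map<String, List<String>> table = new HashMap<String, List<String>>();
        table.put("fast", Arrays.asList("quick", "rapid"));

        SynonymExpander expander = new SynonymExpander(table);
        // Prints: [a, fast, quick, rapid, spider]
        System.out.println(expander.expand(Arrays.asList("a", "fast", "spider")));
    }
}
```

Because the synonyms go into the index alongside the original word, a search for “quick” would match a document that only ever said “fast”, with no special operator needed at query time.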