Verity Spider tips & tricks

Thanks to Phil for sending me a link to the Verity Spider tips & tricks on daemon.com.au. Daemon is/was a big Spectra shop and probably used the spider to search Spectra sites on a regular basis. So why doesn’t that page show up in a Google search for “verity spider” or “verity spider tips”? Maybe it’s because of the way their content management system works, where each page is identified by a CF UUID appended to the URL. This method probably helps the developers, but in the long run it isn’t so good for getting ranked, or even indexed, by the larger search engines… which led me to today’s research: mod_rewrite.

I got my MCSE from Microsoft a couple of years ago, so my first exposure to web servers was IIS. IIS was then, and for the most part still is, very pointy-clicky (although I’ve heard that .NET IIS will have a text-based configuration file). Anyway, Apache wasn’t something I played with much until the last year, when I brought up a couple of Linux machines and thus Apache.

So today I dove headfirst into mod_rewrite and came up with a solution for making the next version (due out any day now) of karensrecipes.com more search engine friendly. In short, to get to a recipe on the development site right now, you’d type in something like this:

http://www.karensrecipes.com/recipes/detail.jsp?r=18

Again, just like the link I mentioned above, this is not an example of how to impress the search engines. Some kung fu regular expressions and a dab of JkMount knowledge and we now get something like this:

http://www.karensrecipes.com/recipes/18/Steamed_Mussels.jsp

and in your Apache httpd.conf:

RewriteEngine on
RewriteRule ^/recipes/([0-9]+)/.*$ /recipes/detail.jsp?r=$1 [PT]

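which in English says something like “if the request path starts with ‘/recipes/’, is followed by one or more digits, and then by a ‘/’ and any number of other characters, then rewrite the URL to ‘/recipes/detail.jsp?r=’ plus those digits.” The [PT] (pass-through) flag hands the rewritten URL back to Apache’s URL mapping, so the JkMount mapping still applies to it. (wanna know more about regular expressions? get this fabulous book!)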
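If you want to poke at the pattern without restarting Apache, here is a minimal Java sketch that runs the same regular expression against a request path. It only imitates what the RewriteRule does to the path; the class and method names are made up for illustration and nothing here is part of Apache or mod_jk.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: mimics the RewriteRule's effect on a path.
class RewriteRuleSketch {
    // Same pattern as in the httpd.conf rule.
    private static final Pattern RULE = Pattern.compile("^/recipes/([0-9]+)/.*$");

    /** Returns the rewritten path, or the original path when the rule does not match. */
    static String rewrite(String path) {
        Matcher m = RULE.matcher(path);
        if (!m.matches()) {
            return path; // no digits-then-slash after /recipes/, so leave it alone
        }
        // group(1) is what $1 refers to in the RewriteRule substitution.
        return "/recipes/detail.jsp?r=" + m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("/recipes/18/Steamed_Mussels.jsp")); // /recipes/detail.jsp?r=18
        System.out.println(rewrite("/recipes/detail.jsp"));             // unchanged
    }
}
```

Note that everything after the second slash is thrown away, so the recipe name in the URL is purely decorative: only the number decides which recipe comes back.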

Pretty snazzy eh? It gives me warm feelings inside because my JSP/Servlet code doesn’t have any knowledge that funny stuff is being done to the URL in Apache, which means you can do all sorts of chicanery to your URL without having to change a lick of server side code.

Search-Enable Your Application with Lucene

I was reading this month’s Java Developer’s Journal while exercising today, specifically the article titled “Search-Enable Your Application with Lucene”. A couple of months ago, when I first added Lucene searching to this site, I thought it would have been a great feature to be able to index a URL. So, for example, when creating and updating an index of files in a directory on the file system, you’d do something like this:

IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
File file = new File("c:\\htmlToIndex");
String[] files = file.list();
for (int i = 0; i < files.length; i++) { … }

… Verity Spidering. Very nice! So I guess the same thing I mentioned above could be done from the command line like so:

c:\cfusionmx\lib\_nti40\bin\vspider -common c:\cfusionmx\lib\common -collection c:\new -start http://www.mysite.com/products/? -indinclude *

But one of the advantages that Lucene has over a product like Verity is the ability to customize indexing and searching routines. For instance, one of the examples the author (Craig Walls) gave was adding synonym matching to your indexing routine. Basically, in Lucene, if you want to add synonyms to keywords, you subclass TokenFilter by writing a short bit of code (he provided an example in the source code) and you’re done. To the best of my knowledge, you can’t do that with Verity. Correction: you can’t “extend” Verity… but it comes with a feature similar to the above-mentioned synonym feature called “THESAURUS” (“Expands the search to include the word that you enter and its synonyms”). I’ve not spent much time with Verity, but the evidence operators on the CFMX docs page are really intriguing, specifically the “THESAURUS”, “SOUNDEX” and “TYPO/N” evidence operators.
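To make the synonym idea a bit more concrete, here is a rough, library-free Java sketch of what such a filter does to a stream of tokens: every token passes through untouched, and any token with known synonyms is followed by them. This is not the article’s code and it does not use Lucene’s TokenFilter API; the class name, method names and sample words are all my own assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a synonym-injecting filter; NOT Lucene's API.
class SynonymExpansionSketch {
    private final Map<String, List<String>> synonyms = new HashMap<>();

    /** Registers alternatives for a word; lookups ignore case. */
    void addSynonyms(String word, String... alternatives) {
        synonyms.put(word.toLowerCase(Locale.ROOT), Arrays.asList(alternatives));
    }

    /** Emits each token as-is, followed by any synonyms registered for it. */
    List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            List<String> extra = synonyms.get(token.toLowerCase(Locale.ROOT));
            if (extra != null) {
                out.addAll(extra);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SynonymExpansionSketch filter = new SynonymExpansionSketch();
        filter.addSynonyms("mussels", "shellfish", "bivalves");
        // prints [steamed, mussels, shellfish, bivalves]
        System.out.println(filter.expand(Arrays.asList("steamed", "mussels")));
    }
}
```

Doing the expansion at index time means a search for “shellfish” would find the mussels recipe without the query itself needing any special operator, which is roughly the opposite of where Verity’s THESAURUS operator does its work.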

Notes from my .NET/ASP.NET reading

Notes from my .NET/ASP.NET reading back in September…

ildasm — command line tool for viewing manifest file contents

wincv — allows you to quickly look up information about a class or series of classes, based on a search pattern

All aspx pages are compiled to a subdirectory of the .NET framework folder; you can change the path to the compiled files by editing machine.config

compilers:
csc for C#
vbc for Visual Basic .NET
jsc for JScript .NET

pattern recognizers

Scoble posted a great analogy 2 days ago which was ignored amidst the religious context. He said this:

“Our brains are extraordinary pattern recognizers (think about it sometime — why can you look at a tree and instantly recognize it as a tree?). Our brains totally freak out when presented with something that has no pattern. Hey, look at the white noise on your TV sometime. You’ll start seeing patterns. You brain HATES not being able to see patterns.”

10 minutes ago I was trying to figure out why I’m so frustrated with the information architecture of the site I’m developing at work, and the idea of the brain as a pattern recognizer makes perfect sense of that frustration. The client (in this case represented by anywhere from 1 to 8 people of a 1,200-person company) has decided that they want a 3-column layout AND a 2-column layout AND a single-column layout… each with a different navigation scheme, and in no apparent order. This frustrates me to no end, but there’s not much I can do or say to change their minds.