Category Archives: Open Source

Verity Spider tips & tricks

Thanks to Phil for sending me a link to the Verity Spider tips & tricks on daemon.com.au. Daemon is/was a big Spectra shop and probably used the spider to search Spectra sites on a regular basis. So why doesn’t that page show up in a Google search for “verity spider” or “verity spider tips”? Maybe it’s because of the way their content management system works, where each page is denoted by a CF UUID appended to the URL. This method probably helps the developers, but in the long run it isn’t so good for getting ranked or even indexed by the larger search engines… which led me to today’s research: mod_rewrite.

I got my MCSE from Microsoft a couple of years ago, so my first exposure to web servers was IIS. IIS was then, and for the most part still is, very pointy-clicky (although I’ve heard that .NET IIS will have a text-based configuration file). Anyway, Apache wasn’t something I played with much until the last year, when I brought up a couple of Linux machines and thus Apache. So today I dove headfirst into mod_rewrite and came up with a solution for making the next version (due out any day now) of karensrecipes.com more search engine friendly. In short, to get to a recipe on the development site right now, you’d type in something like this:

http://www.karensrecipes.com/recipes/detail.jsp?r=18

Again, just like the link I mentioned above, this is not an example of how to impress the search engines. Some kung fu regular expressions and a dab of JkMount knowledge and we now get something like this:

http://www.karensrecipes.com/recipes/18/Steamed_Mussels.jsp

and in your Apache httpd.conf:

RewriteEngine on
RewriteRule ^/recipes/([0-9]+)/.*$ /recipes/detail.jsp?r=$1 [PT]

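which in English says something like “if the request starts with ‘/recipes/’, is followed by one or more digits, and then by a ‘/’ and any number of other characters, then rewrite the URL to ‘/recipes/detail.jsp?r=’ plus those digits.” The [PT] (pass through) flag hands the rewritten URL on to the rest of Apache’s URL mapping, which is what lets the JkMount mapping pick it up. (Wanna know more about regular expressions? Get this fabulous book!)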
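If you want to poke at the regular expression without restarting Apache, here’s a minimal sketch that runs the same pattern through java.util.regex. The class and method names (RewriteRuleDemo, rewrite) are made up for illustration, and it only mimics the match and substitution, not what Apache does with the [PT] flag:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RewriteRuleDemo {
    // Same pattern as the RewriteRule in httpd.conf.
    private static final Pattern FRIENDLY =
            Pattern.compile("^/recipes/([0-9]+)/.*$");

    /**
     * Returns the internal URL for a friendly path, or the path
     * unchanged when the pattern does not match.
     */
    static String rewrite(String path) {
        Matcher m = FRIENDLY.matcher(path);
        if (!m.matches()) {
            return path;
        }
        // group(1) is the run of digits, i.e. the $1 in the RewriteRule.
        return "/recipes/detail.jsp?r=" + m.group(1);
    }

    public static void main(String[] args) {
        // prints /recipes/detail.jsp?r=18
        System.out.println(rewrite("/recipes/18/Steamed_Mussels.jsp"));
        // prints /recipes/detail.jsp (no digits segment, so no rewrite)
        System.out.println(rewrite("/recipes/detail.jsp"));
    }
}
```

Notice that everything after the second slash is thrown away, so the recipe name in the URL is there purely for the search engines (and humans).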

Pretty snazzy eh? It gives me warm feelings inside because my JSP/Servlet code doesn’t have any knowledge that funny stuff is being done to the URL in Apache, which means you can do all sorts of chicanery to your URL without having to change a lick of server side code.
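The one place the friendly URLs do have to show up is in the links your pages hand out. Here’s a rough sketch of a helper that could build them; the RecipeLinks class, the friendlyPath method and the underscore rule are my own assumptions for illustration, not code from the site:

```java
class RecipeLinks {
    /**
     * Builds a path like /recipes/18/Steamed_Mussels.jsp from a recipe id
     * and title. Runs of characters other than letters and digits collapse
     * into a single underscore (an assumed convention).
     */
    static String friendlyPath(int recipeId, String title) {
        String slug = title.trim()
                .replaceAll("[^A-Za-z0-9]+", "_")  // spaces, punctuation -> "_"
                .replaceAll("^_+|_+$", "");        // trim stray underscores
        return "/recipes/" + recipeId + "/" + slug + ".jsp";
    }

    public static void main(String[] args) {
        // prints /recipes/18/Steamed_Mussels.jsp
        System.out.println(friendlyPath(18, "Steamed Mussels"));
    }
}
```

Since the rewrite rule only cares about the digits, the name part can change (say, if a recipe gets retitled) without breaking old links.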

wget

As part of the site I’m working on, we’re offering a customizable weather SWF that gets syndicated weather from Intellicast. Intellicast posts their weather downloads as a gzipped XML file every 3 hours during business hours, and they recommend that you use wget to retrieve the file. Turns out wget is a pretty cool little piece of software, albeit with spotty directions for Windows users. Here’s how to install and use wget on a Windows machine if you’re curious:

a) Download v1.8.1 from http://space.tin.it/computer/hherold/. Why not 1.8.2? I got errors when trying to use it…

b) Unzip the files to a location on your computer.

c) Create a text file called “config.wgetrc”. Open up the included HTML helper page and cruise to the “Sample Wgetrc” section and copy the sample config to your text file. Save this file.

d) Add a System Variable (right click ‘My Computer’ –> Properties –> Advanced –> Environment Variables –> New). The variable name should be ‘wgetrc’ and the value should be the path AND file name to the file you created in step c (ie: variable value = ‘c:\wget\config.wgetrc’ if you used the file name I suggested).

e) Bring up a command prompt (Start –> Run –> type ‘cmd’). Cruise over to your wget directory (on my computer: c:\wget). Type ‘wget http://cephas.net/’.

f) You’re done! You’ve successfully retrieved my homepage! Notice the file created in the wget directory.

My illustration was very simple; you can do so much more than just retrieve one web page. Its real power shows when you need to retrieve an entire website (for archiving or mirroring purposes) or a large file (e.g. a 10MB XML file), among other things. Here are some other sample commands:

Saving a page to a specific file in a different directory (-O names the output file; use -P to choose only the directory)
wget -O c:\mydirectory\newfile.html http://www.cephas.net/

Retrieve all the gifs from a directory (directory browsing must be on)
wget -r -l1 --no-parent -A.gif http://www.server.com/images/

Mirror your website
wget --mirror http://www.yoursite.com/

For complete syntax and more examples, check the wget.html file that was zipped w/ the source.
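Once wget has pulled down a gzipped XML file like the weather feed I mentioned, something still has to unpack it. Here’s a minimal sketch of that step using java.util.zip. The class name, the sample XML and the UTF-8 encoding are assumptions on my part (I’m not showing the real feed format), and in practice you’d pass a FileInputStream for the file wget saved rather than the in-memory bytes used here:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GunzipSketch {
    /** Decompresses a gzip stream and returns its contents as UTF-8 text. */
    static String gunzipToString(InputStream compressed) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(compressed);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Helper used only to fake a downloaded file for the demo. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up XML standing in for the downloaded feed.
        String xml = "<weather><city>Boston</city></weather>";
        byte[] downloaded = gzip(xml);
        System.out.println(gunzipToString(new ByteArrayInputStream(downloaded)));
    }
}
```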

interesting articles

Spent numerous hours tonight trying to get a JSP/Servlet app running again on Tomcat 4.0.1. I’ve learned (for the second time) that Tomcat is really, really fussy about web.xml. It wants the elements in the order laid down by the Servlet 2.3 web-app DTD (servlet before servlet-mapping, for instance), or else you get an ugly error message that doesn’t say anything about what *actually* went wrong. Uggh.
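To save myself a third round, here’s a rough sketch of a checker that walks the top-level elements of a web.xml and reports the first one that’s out of order. The element order is the one from the Servlet 2.3 web-app DTD; the class and method names are made up, and this is not how Tomcat itself validates the file. The sample strings leave out the DOCTYPE so the parser doesn’t go looking for the DTD:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class WebXmlOrderCheck {
    // Child order of <web-app> in the Servlet 2.3 DTD.
    static final List<String> ORDER = Arrays.asList(
            "icon", "display-name", "description", "distributable",
            "context-param", "filter", "filter-mapping", "listener",
            "servlet", "servlet-mapping", "session-config", "mime-mapping",
            "welcome-file-list", "error-page", "taglib", "resource-env-ref",
            "resource-ref", "security-constraint", "login-config",
            "security-role", "env-entry", "ejb-ref", "ejb-local-ref");

    /**
     * Returns the name of the first top-level element that is unknown or
     * appears after an element that should follow it; null if all is well.
     */
    static String firstOutOfOrder(String webXml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        webXml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        int last = -1;
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and comments
            }
            int index = ORDER.indexOf(n.getNodeName());
            if (index < 0 || index < last) {
                return n.getNodeName();
            }
            last = index;
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // prints servlet (it comes after servlet-mapping here)
        System.out.println(firstOutOfOrder(
                "<web-app><servlet-mapping/><servlet/></web-app>"));
        // prints null
        System.out.println(firstOutOfOrder(
                "<web-app><servlet/><servlet-mapping/></web-app>"));
    }
}
```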

Well thought out reasoning on why I should go back to school: http://www.joelonsoftware.com/articles/fog0000000319.html

How to internationalize a Struts app: http://www.anassina.com/struts/i18n/i18n.html