full text search w/ lucene

Last week at work was slow. I installed Jabber, put the finishing touches on footjoy.com, and spent all day Friday working on my Linux desktop because I had to FTP about 3GB of data over a VPN connection, which completely tied up my Windows desktop machine. So on Friday I installed and began working with Lucene, an open source full text search API written entirely in Java. By sheer coincidence, Lucene was written by the same guy who wrote the V-Twin search capability in the Mac, which I mentioned yesterday (and found out about by reading Interface Culture, weird!). By the end of this coming week I hope to have a functional search for this site using Lucene. But for now, links:

Lucene Tutorial

Javaworld Article on using Lucene

Lucene FAQ Home Page

Lucene Mailing List Archive

Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW (With CD-ROM)
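
For anyone wondering what a full text search library is doing under the hood, here is a toy sketch in Python. It is not Lucene's API and makes no claim about Lucene's internals; it only shows the basic idea of an inverted index, which maps each word to the set of documents that contain it. The function names (`tokenize`, `build_index`, `search`) and the sample posts are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical sample posts, keyed by id.
posts = {
    1: "Full text search with Lucene",
    2: "Installing Jabber on Linux",
    3: "Importing large XML files with SAX",
}
index = build_index(posts)
print(search(index, "lucene search"))
```

A real library adds a lot on top of this (stemming, ranking, phrase queries, an on-disk index), which is exactly why I'd rather use one than write my own.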

infinity imagined

Finished Interface Culture by Steven Johnson today (it’s 93 outside and humid, which means it’s reading weather). I’ll leave it up to the reviewers on Amazon’s site to give you more information about the book…

I like books, fiction or nonfiction, but especially books that make you think: think about things to create, think about things as you’ve never thought about them before. Interface Culture was one of those books. Johnson’s central premise (I think!) is that interface design is an art form, just like a Dickens novel or a Renaissance painting, and because it is an art form, it has social and cultural impacts, some of which we can see with the naked eye, some of which we can discover and some that can only be seen in hindsight.

A second theme I found was the idea that emergent technologies, things like personal agents and Apple’s V-Twin search technology, while brilliant, most often end up being applied in areas never imagined by their creators. For instance, Thomas Edison created the phonograph in 1877. But get this: he thought the phonograph would be used mainly for recording phone conversations. These applications were explained as exaptations, which is my official word of the day. 🙂

Finally, though not an official theme, I found numerous mentions of the idea that some, if not all, radical and sometimes breakthrough inventions are initially rejected by popular and mainstream culture. The Mac, with its icons and graphical user interface, was seen as simple and labeled as cartoonish… it was not seen as a “serious business application”. Soon, the icons, trash bin and menu system took over the entire business world, and every computer we use today uses the same metaphors that the original Mac did in the early 1980s. Just goes to show that maybe the heated debate about technologies like Flash as an interface device or wireless devices might be the tip of an amazing iceberg… who knows?

jabber

I installed Jabber on my Linux server @ work yesterday. Took about 15 minutes to set up the server side, ‘nuther 15 to get a Linux client up and running and 15 to get a Windows client running and connected. Amazingly easy to do, and I think Jabber could be very useful in a small office/department environment, if not an entire enterprise. Internally we use IM and email almost exclusively to communicate, even though we don’t have any cubes and sometimes you’re sitting right next to the person you’re talking w/. Anyway, here’s a fun article on using Jabber and bots.

book crossing

A friend sent me a link to bookcrossing.com… interesting concept. From their email:

“The website encourages people to Read, Register, and then Release their books “into the wild” and then track where they go and the lives they touch. Great concept… share your books and follow their progress forever.”

importing large (45MB) xml files

I mentioned that I had to import a large weather file as part of the FJ project… it *works* using simple VBScript and MSXML, but it turns out that it kills the server, since the DOM loads the entire document into memory. Couple other options I found:

a) Probably the best way to do it would be to use SAX instead of DOM. Unfortunately, MSXML doesn’t support SAX via VBScript, only C++ and Visual Basic. Applicable article here on MSDN re: extracting data from a large document.

b) Import the data directly into SQL Server using the SQL Server Bulk Load functionality, which is the way I’m heading right now. How to? Here.

Great article here on using SAX 2.0 and Java to process large XML documents.
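
To show why SAX helps with a big file, here is a minimal sketch of the streaming idea. It is in Python rather than Java or VBScript, using the standard library's `xml.sax`. The element names (`weather`, `observation`, `station`, `temp`) are invented for the example and are not the real weather file's schema. The handler sees one element at a time and hands each finished row to a callback, so the whole document never has to sit in memory the way it does with DOM.

```python
import xml.sax

class ObservationHandler(xml.sax.ContentHandler):
    """Builds one dict per <observation> element as the parser streams by."""

    def __init__(self, on_row):
        super().__init__()
        self.on_row = on_row  # callback invoked once per finished row
        self.row = None       # the row being built, or None outside a row
        self.text = []        # character data of the current element

    def startElement(self, name, attrs):
        if name == "observation":
            self.row = {}
        self.text = []

    def characters(self, content):
        # May be called several times per element, so accumulate the pieces.
        self.text.append(content)

    def endElement(self, name):
        if name == "observation":
            self.on_row(self.row)
            self.row = None
        elif self.row is not None:
            self.row[name] = "".join(self.text).strip()
        self.text = []

# Tiny stand-in for the real file; the schema here is hypothetical.
SAMPLE = b"""<weather>
  <observation><station>BOS</station><temp>93</temp></observation>
  <observation><station>PVD</station><temp>90</temp></observation>
</weather>"""

rows = []
xml.sax.parseString(SAMPLE, ObservationHandler(rows.append))
print(rows)
```

For the demo the callback just appends to a list. With a 45MB file you would call `xml.sax.parse` on the file path and have the callback write each row to the database instead, so memory use stays flat.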

Translucent Databases

Interesting article on oreillynet.com in response to the recent hacking of Yale student admission information by Princeton. The gist is that sensitive data you never need to read back, but only compare or search against, should be stored in your DB as hashes. Excerpt:

“For example, what if a police department needs to build a database of sexual-assault victims that lets them identify trends but hides personal information? You could use a translucent database where the first column is the hash of the victim’s name, and the second column is a hash of their full address, and the third column is a hash of their block and street. You can now group incidents together by grouping entries with identical block hashes; you can see if the incidents refer to the same person by checking to see if those hashes are different.”
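
Here is a rough sketch of that three-column idea in Python. Everything in it is an assumption for illustration: the helper names `h` and `to_row`, the choice of SHA-256, the normalization step, and the made-up incidents. It stores only hashes, then groups rows by block hash to spot clusters without ever seeing a name or an address.

```python
import hashlib
from collections import defaultdict

def h(value):
    """One-way hash of a normalized string (SHA-256, hex encoded)."""
    # Normalize case and whitespace so the same value always hashes the same.
    normalized = " ".join(value.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def to_row(name, address, block):
    """Build the three hashed columns from the excerpt; no plaintext is kept."""
    return (h(name), h(address), h(block))

# Made-up incidents, purely for illustration.
incidents = [
    ("Jane Doe", "12 Elm St Apt 3", "0-99 Elm St"),
    ("Mary Roe", "40 Elm St", "0-99 Elm St"),
    ("Jane Doe", "12 Elm St Apt 3", "0-99 Elm St"),
    ("Ann Poe", "7 Oak Ave", "0-99 Oak Ave"),
]
table = [to_row(*incident) for incident in incidents]

# Group incidents by block hash to find clusters.
by_block = defaultdict(list)
for name_hash, address_hash, block_hash in table:
    by_block[block_hash].append(name_hash)

for block_hash, names in by_block.items():
    print(block_hash[:12], len(names), "incidents,", len(set(names)), "distinct people")
```

One caveat: plain unsalted hashes of guessable values like names can be reversed with a dictionary attack, so a real system would want something stronger, such as a keyed hash with a secret kept outside the database.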

More information on translucent databases can be found here.

crazy browser tricks

Found this via http://cms-list.org/. My small brain can’t figure out how this would be useful, but nonetheless, try changing your <body> tag to look like this:

<body contenteditable=true>

and then view your page… type away, move images, *resize* images, delete text… wow. Kinda cool. As usual, it’s IE 5.5 (and higher) specific, although some people have written workarounds to get it to work in Mozilla.

MSDN documentation: http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/properties/contenteditable.asp
