- Slides from Spinn3r Architecture Talk at 2008 MySQL Users Conference « Kevin Burton’s NEW FeedBlog
Scaling MySQL / Java with an emphasis on writes.
(categories: java jdbc mysql performance presentations kevinburton )
- IdeaFactory : Weblog : Google Docs and JRoller
Google Docs has the ability to publish documents via the Metaweblog API.
(categories: blogging metaweblogapi app publishing )
- The Green Issue – Climate Change – Environment – Energy Efficiency – Consumption – New York Times
On the seeming insignificance of light bulbs.
(categories: food environment energy )
- ZombieURL – Waxy.org
Quote: "For the last few weeks, I’ve been spending every Friday with a small group of brilliant geeks … for a weekly one-day hackathon." We get an hour on Wednesdays at Jive. It’s our 2% time.
(categories: innovation )
- Unit Structures: Introducing Freedom
Quote: "Freedom is a Mac application that disables your computer’s networking capabilities for a selected time interval." I’ve found the 1.5 hours I spend on the bus everyday to be the most productive time of my day.
(categories: productivity apple mac osx software networking time )
Links: 4-17-2008
- Yahoo! Developer Network: Yahoo! Weather
Free weather data in RSS format for "… your own web site or client application" but then the fine print says personal / non-commercial use only. Bollocks.
(categories: weather yahoo api feed feeds rss data )
- Flickr Code
Flickr was last deployed 2 hours ago, including 19 changes by 6 people.
(categories: development documentation flickr api code community communities )
Links: 4-16-2008
- social Info Page
Mailing list on jabber.org, description: "the mashup of Web and Jabber! OpenID, OAuth, buddy lists beyond IM, XMPP pubsub instead of HTTP polling, etc."
(categories: jabber xmpp openid oauth socialnetworks blogging blogs )
- sioc-project.org | Semantically-Interlinked Online Communities
Description: SIOC provides a Semantic Web ontology for representing rich data from the Social Web in RDF.
(categories: communities community data semantic foaf ontology rdf sioc semanticweb )
- Comments on Life-altering experiences | Ask Metafilter
Man, I’ve had it pretty easy compared to some (most?) people. Just imagine what this thread would look like if people who didn’t have internet access could post.
(categories: askmefi life childhood children experience )
- Yahoo! Search Blog: The Yahoo! Search Open Ecosystem
Yahoo crawler now "… support(s) a number of microformats, including hCard, hCalendar, hReview, hAtom, and XFN. Yahoo! Search will work with the web community to evolve the vocabulary framework for embedding structured data."
(categories: semantic rdf searchengine semanticweb microformats yahoo search )
Links: 4-11-2008
- PottyMouth
Quote: "PottyMouth transforms completely unstructured and untrusted text to valid, nice-looking, completely safe XHTML." Python. Ported to Ruby as well.
(categories: markup python security content comments )
- Where to Find Open Data on the Web – ReadWriteWeb
Great list (and comments) of freely available data sources. When is someone going to make a free weather forecast data source so that I don’t have to pay AccuWeather or weather.com for forecasts / current weather?
(categories: data free database resources open )
Links: 4-10-2008
- findability.org: Brian Goodman: Interview
Quote: "… integrating social data (e.g., ratings, tag frequency) directly into the algorithms for enterprise search. A page that’s been bookmarked, for instance, receives a boost."
(categories: search enterprise socialnetworks social )
Links: 4-6-2008
- reddit.com: ask reddit: what other visualization tools do you know besides processing, graphviz and nodebox?
References to some other cool Java viz tools.
(categories: data graph nodebox opensource statistics visualization )
Links: 4-3-2008
- Datawocky: More data usually beats better algorithms
Quote: "… adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set."
(categories: algorithms bigdata data database databases algorithm netflix datamining )
- Looks Good Works Well
Bill has been reviewing a number of UI anti-patterns on his blog recently; he’s covered some really good stuff.
(categories: ui interface javascript design usability )
- We’re Taking an Open Direction with Web Communities: Are You In? ~ Chris Pirillo
Quote: "I don’t want a social network, I want a socially *RELEVANT* network (both on-site and beyond). I don’t want a community platform, I want a participation platform where members are rewarded and ranked appropriately…"
(categories: communities community idea social socialnetworks chrispirillo )
- Jeff on Cloud Computing: The First Hadoop Summit
Cool review of stuff going on with the Hadoop project from the first annual Hadoop summit.
(categories: pig hbase mahout rapleaf nutch hadoop )
- Move Over Lotus Notes, SharePoint is Filling Yer Shoes
Quote: "… SharePoint is fast becoming the next IBM Lotus Notes — and not in a good way."
(categories: sharepoint notes collaboration )
- EditGrid Confluence Plugin – Online Spreadsheets – EditGrid
Cool confluence plugin that gives you and everyone else the ability to edit spreadsheets online simultaneously.
(categories: spreadsheet editgrid confluence )
- Parsing human-written date and time information « Jon Udell
Good stuff in the comments about parsing human-generated (not computer-generated) dates; I need to check out Chronic.
(categories: dates datetime parsing antlr chronic sandy )
How MoreLikeThis Works in Lucene
We created a ‘related items’ feature way back in Clearspace 1.0 (I mocked some of it out here just to prove that it worked), which shows related content based on the document, thread or blog post that you’re currently viewing. It was built using the MoreLikeThis class, a contribution made to the Lucene project by David Spencer, Mark Harwood and my favorite Canadian coworker Bruce Ritchie.
The really interesting thing about MoreLikeThis (at least to me) is that my first inclination was to think that Lucene goes and spends some cycles looking for content related to the source document: the focus being on the search and the other content. The reality is that the majority of the work is done by analyzing the content of the source document relative to the aggregate of all the other content that exists in the index. Said another way, MoreLikeThis doesn’t work by running some special search or query; it works by comparing the document you’re asking about against the entire index as a whole (and only then running a query).
Anyway, in general I think it works pretty well, but for some odd reason, people keep asking the question “how does it work?” and since the documentation doesn’t go into much detail, I thought I’d try writing it up. So here goes nothing…
The first thing you need to do if you’re going to find related content is to tell Lucene what you want to find related content for. The MoreLikeThis class gives you five options: you can tell it to look at an existing document in the Lucene index, or let it try to parse an external resource from a file, an InputStream, a Reader or a URL. In Clearspace we actually have a way of getting the Lucene document ID given the Clearspace object type and object ID; the code for our ‘related items’ feature looks something like this:
Query q = buildObjectTypeAndIDQuery(String.valueOf(messageID), JiveConstants.MESSAGE);
Hits hits = searcher.search(q);
...
int docNum = hits.id(0);
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[]{IndexField.subject.toString(), IndexField.body.toString(), IndexField.tagID.toString()});
mlt.setMinWordLen(2);
mlt.setBoost(true);
q = mlt.like(docNum);
So in Clearspace we first fetch the Lucene document, construct a MoreLikeThis instance, tell it that we want to match on the subject, body and tag fields, and tell it to only calculate ‘relatedness’ based on words that are two characters or longer. That’s the easy part; now it gets interesting.
Given that you already have a Lucene document, the MoreLikeThis instance loops over the field names we specified (fields are kind of like the Lucene equivalent of database columns), or over all the field names available in the Lucene index if you don’t specify any, and retrieves a term vector for each of those fields in the document we’re analyzing. A term vector is a data structure that holds a list of all the words that were in the field and the number of times each word was used, excluding words that it considers to be ‘stop words’ (you can see a list of common stop words here).
Since I’m sure some of you are tuning out at this point, let’s look at something concrete: assume you’ve created a document that looks like this (by the way, what a thorough wikipedia.org entry!):
Subject: Twinkle Twinkle Little Star
Body: Twinkle Twinkle little star, How I wonder what you are! Up above the world so high, Like a diamond in the sky, Twinkle twinkle little star, How I wonder what you are!
The term vector for the subject would look like this:
terms: twinkle, little, star
termFrequencies: twinkle[2], little[1], star[1]
and for the body:
terms: above, diamond, high, how, like, little, sky, so, star, twinkle, up, what, wonder, world
termFrequencies: above[1], diamond[1], high[1], how[2], like[1], little[2], sky[1], so[1], star[2], twinkle[4], up[1], what[2], wonder[2], world[1]
Meanwhile, back at the ranch… so we get a term vector for each of the fields available in the document we’re finding related items for. After we’ve got those, each of the term vectors is merged into a map: the key being the term and the value being the number of times the word was used in the document. The map is then handed to a method (called createQueue) that calculates a reasonably complex score for each word in the map. Since this is really the meat of the entire class, I’ll step through it in a little more detail and season the meat with data from the index on my blog.
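If you want to see that step in code, here’s a rough sketch of the retrieve-and-merge, assuming the Lucene 2.x IndexReader / TermFreqVector API (the reader and docNum variables come from the earlier snippet; the field names and the isStopWord() check are placeholders of mine, not the actual MoreLikeThis internals):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.TermFreqVector;

// get a term vector per field and merge them into a single term -> frequency map
Map<String, Integer> termFreqMap = new HashMap<String, Integer>();
for (String fieldName : new String[]{"subject", "body"}) {
    TermFreqVector vector = reader.getTermFreqVector(docNum, fieldName);
    if (vector == null) {
        continue; // this field wasn't indexed with term vectors
    }
    String[] terms = vector.getTerms();
    int[] freqs = vector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        if (isStopWord(terms[i])) {
            continue; // placeholder stop word check
        }
        Integer count = termFreqMap.get(terms[i]);
        termFreqMap.put(terms[i], count == null ? freqs[i] : count + freqs[i]);
    }
}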
The createQueue method first retrieves the total number of documents that exist in the index that we’re dealing with.
int numDocs = ir.numDocs();
In the index I maintain for my personal blog, there are 998 documents. The second thing it does is create an instance of a class called FreqQ, which extends Lucene’s PriorityQueue class; this queue holds an object array per term, ordered according to the term’s score, which we’ll cover in a second.
Now the createQueue method iterates over each word in the term frequency map, throwing out words if they don’t occur enough times (see setMinTermFreq) and then testing to find out which field across the entire Lucene index contains the term the most. Next it calculates the Inverse Document Frequency, which, according to the JavaDoc, is:
… a score factor based on a term’s document frequency (the number of documents which contain the term)… Terms that occur in fewer documents are better indicators of topic, so implementations of this method usually return larger values for rare terms, and smaller values for common terms.
and finally, a score, which is a product of the IDF score and the number of times the word existed in the source document.
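Pulling those steps together, a simplified sketch of the createQueue loop might look something like this (the variable names are mine, not the actual Lucene source; termFreqMap is the merged map from above, fieldNames and minTermFreq come from the setters, and queue is the FreqQ instance):

import org.apache.lucene.index.Term;

int numDocs = ir.numDocs();
for (Map.Entry<String, Integer> entry : termFreqMap.entrySet()) {
    String word = entry.getKey();
    int tf = entry.getValue();
    if (tf < minTermFreq) {
        continue; // throw out words that don't occur enough times
    }
    // find the field (and document frequency) where the term shows up the most
    String topField = fieldNames[0];
    int docFreq = 0;
    for (String fieldName : fieldNames) {
        int freq = ir.docFreq(new Term(fieldName, word));
        if (freq > docFreq) {
            docFreq = freq;
            topField = fieldName;
        }
    }
    if (docFreq == 0) {
        continue; // the index doesn't contain this term at all
    }
    // idf, then the final score: term frequency * idf
    float idf = (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    float score = tf * idf;
    queue.insert(new Object[]{word, topField, new Float(score)});
}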
Again, I’m guessing you’re getting really sleepy, so here’s some real data to look at. Like I mentioned above, I have 998 documents in my Lucene index, and if we take a document like this one, you’ll end up with a term vector that looks like this:
a[20], you[20], pre[18], the[17], to[13], of[11], this[10], username[10], column[9], oracle[9], and[8], if[8], null[8], alter[7], appuser[7], into[7], on[7], that[7], href[6], http[6], i[6], sql[6], an[5], com[5], is[5], not[5], password[5], table[5], values[5], varchar[5], www[5], be[4], but[4], case[4], clearspace[4], empty[4], get[4], insert[4], modify[4], which[4], administrator[3], are[3], can[3], database[3], db[3], in[3], jivesoftware[3], mysql[3], nullability[3], server[3], string[3], t[3], thing[3], with[3], about[2], all[2], assume[2], been[2], blogged[2], d[2], different[2], feature[2], for[2], from[2], has[2], have[2], hsqldb[2], jive[2], make[2], modified[2], nvarchar[2], out[2], plan[2], postgres[2], products[2], ran[2], requirements[2], sensitivity[2], so[2], space[2], store[2], strike[2], support[2], where[2], will[2], above[1], again[1], against[1], already[1], andrew[1], any[1], appears[1], application[1], at[1], attempt[1], bennett[1], bit[1], blog[1], borrowed[1], both[1], bottom[1], by[1], cannot[1], changing[1], chars[1], classes[1], cmu[1], code[1], columns[1], consider[1], contrib[1], converted[1], cool[1], couple[1], crazy[1], day[1], decided[1], default[1], detail[1], didn[1], discuss[1], do[1], doing[1], edu[1], error[1], example[1], finally[1], first[1], flushed[1], forums[1], further[1], going[1], good[1], goodness[1], heads[1], helpful[1], his[1], insensitive[1], instead[1], interesting[1], involved[1], issue[1], issues[1], it[1], itself[1], jist[1], job[1], jsp[1], kb[1], kind[1], lately[1], least[1], like[1], line[1], ll[1], look[1], looks[1], lot[1], mcelwee[1], microsoft[1], more[1], my[1], needed[1], nice[1], no[1], notice[1], nugget[1], number[1], only[1], or[1], ora[1], over[1], page[1], past[1], platforms[1], probably[1], product[1], re[1], readers[1], reading[1], reason[1], respect[1], results[1], retrieve[1], ridiculous[1], sample[1], second[1], see[1], select[1], semicolon[1], sensitive[1], servers[1], shadow[1], similar[1], six[1], software[1], something[1], specific[1], specification[1], specify[1], standard[1], statement[1], supporting[1], sure[1], system[1], tails[1], than[1], then[1], there[1], they[1], things[1], those[1], thought[1], thunderguy[1], tolowercase[1], touppercase[1], tried[1], try[1], two[1], txt[1], unless[1], ups[1], used[1], user[1], variety[1], ve[1], very[1], want[1], warned[1], was[1], we[1], week[1], whatever[1], wonderful[1], word[1], work[1], write[1], yes[1], your[1], yourself[1]
Stop words and words that occurred fewer than two times would then get stripped out, so you’d be left with a term frequency map that looks like this:
pre[18], username[10], column[9], oracle[9], alter[7], appuser[7], href[6], http[6], sql[6], com[5], password[5], table[5], values[5], varchar[5], www[5], case[4], clearspace[4], empty[4], get[4], insert[4], modify[4], which[4], administrator[3], can[3], database[3], db[3], jivesoftware[3], mysql[3], nullability[3], server[3], string[3], thing[3], about[2], all[2], assume[2], been[2], blogged[2], d[2], different[2], feature[2], from[2], has[2], have[2], hsqldb[2], jive[2], make[2], modified[2], nvarchar[2], out[2], plan[2], postgres[2], products[2], ran[2], requirements[2], sensitivity[2], so[2], space[2], store[2], strike[2], support[2], where[2],
So for each of these terms, Lucene finds the field that contains the most instances of the given term and then calculates the idf value and the score. The default implementation of the idf value looks like this:
return (float)(Math.log(numDocs / (double)(docFreq + 1)) + 1.0);
So given the term frequency map above and the document frequency noted below, we get the following idf values and scores for the most popular terms in this document:
| Term | Number of Instances of Term in Document | Number of Documents Matching Term | IDF Value | Score |
|---|---|---|---|---|
| pre | 18 | 26 | 4.609916 | 82.978 |
| username | 10 | 23 | 4.7276993 | 47.276 |
| column | 9 | 13 | 5.266696 | 47.400264 |
| oracle | 9 | 8 | 5.7085285 | 51.376 |
| alter | 7 | 1 | 7.212606 | 50.488 |
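You can check the math yourself; for ‘pre’, which occurs 18 times in the document and shows up in 26 of the 998 documents in the index:

double idf = Math.log(998 / (double) (26 + 1)) + 1.0; // 4.609916...
double score = 18 * idf; // 82.978...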
Finally, after all the terms have been added to the priority queue, we create a Lucene query, looping over the first 25 terms in the queue (25 is the default; it can be changed via setMaxQueryTerms):
TermQuery tq = new TermQuery(new Term("body", "oracle"));
and optionally boosting each term according to the score:
tq.setBoost(51.376 / 82.978);
The resulting Lucene query (in string format) looks something like this:
body:pre body:username^.56974 body:column^.57123 body:oracle^.61915 ...
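The loop that assembles that final query from the priority queue is roughly this (again a sketch of mine, not the exact MoreLikeThis source; bestScore is the score of the top term, ‘pre’ in this example, and boost / maxQueryTerms correspond to setBoost and setMaxQueryTerms):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery query = new BooleanQuery();
int added = 0;
while (queue.size() > 0 && added < maxQueryTerms) { // maxQueryTerms defaults to 25
    Object[] best = (Object[]) queue.pop(); // FreqQ pops the highest scoring term first
    String word = (String) best[0];
    String fieldName = (String) best[1];
    float score = ((Float) best[2]).floatValue();
    TermQuery tq = new TermQuery(new Term(fieldName, word));
    if (boost) {
        tq.setBoost(score / bestScore); // normalize so the top term gets a boost of 1.0
    }
    query.add(tq, BooleanClause.Occur.SHOULD);
    added++;
}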
Now I’ll be curious to see what the related item is for this post!
I hope you learned as much as I did. If you’re extremely curious about this whole Lucene thing and you like the in-depth stuff, you should definitely check out Luke; it’ll open a whole new world to you.
Links: 3-26-2008
- Metafilter comments vs. Youtube comments
What a difference $5 makes.
(categories: community culture comments youtube metafilter )
- mail-trends – Google Code
Stoo-pid nice analyzer / visualizer for email (as extracted from an IMAP server). This is what Clearspace reporting needs to turn into.
(categories: google visualization email mail charts statistics chart analytics analysis )
Links: 3-22-2008
- Paul Kedrosky: Dave Eggers’ TED Talk
Cool TED presentation.
(categories: education kids ted video pirates )
- Yahoo!’s Latest Performance Breakthroughs (Yahoo! Developer Network blog)
Twenty new rules for performance.
(categories: performance yahoo optimization tips web )
- Taffy DB : Getting Started with your first bite of Taffy
A database with pseudo-SQL implemented in JavaScript.
(categories: javascript database sql query )
- Google Visualization API – Google Code
Quote: "… the Google Visualization API lets you access multiple sources of structured data that you can display, choosing from a large selection of visualizations."
(categories: chart charting code api analysis visualization graph charts google )