Category Archives: Lucene

How MoreLikeThis Works in Lucene

We created a ‘related items’ feature way back in Clearspace 1.0 (I mocked some of it out here just to prove that it worked) which shows related content based on the document, thread or blog post that you’re currently viewing. It was built using the MoreLikeThis class, a contribution made to the Lucene project by David Spencer, Mark Harwood and my favorite Canadian coworker Bruce Ritchie.

The really interesting thing about MoreLikeThis (at least to me) is that my first inclination was to think that Lucene goes and spends some cycles looking for content related to the source document: the focus being on the search and the other content. The reality is that the majority of the work is done by analyzing the content of the source document relative to the aggregate of all the other content that exists in the index. Said another way, MoreLikeThis doesn’t work by running some special search or query; it works by comparing the document you’re asking about against the entire index as a whole (and only then by running a query).

Anyway, in general I think it works pretty well, but for some odd reason, people keep asking the question “how does it work?” and since the documentation doesn’t go into much detail, I thought I’d try writing it up. So here goes nothing…

The first thing you need to do if you’re going to find related content is to tell Lucene what you want to find related content for. The MoreLikeThis class gives you five options: you can tell it to look at an existing document in the Lucene index, or you can let it try to parse an external resource from a file, an InputStream, a Reader or a URL. In Clearspace we actually have a way of getting the Lucene document ID given the Clearspace object type and object ID, so the code for our ‘related items’ feature looks something like this:

// find the Lucene document that corresponds to the Clearspace object
Query q = buildObjectTypeAndIDQuery(String.valueOf(messageID), JiveConstants.MESSAGE);
Hits hits = searcher.search(q);
...
int docNum = hits.id(0);
// reader is the IndexReader the searcher was opened against
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[]{IndexField.subject.toString(), IndexField.body.toString(),
                    IndexField.tagID.toString()});
mlt.setMinWordLen(2);   // ignore words shorter than two characters
mlt.setBoost(true);     // boost each query term by its relative score
q = mlt.like(docNum);

So in Clearspace we first fetch the Lucene document, construct a MoreLikeThis instance, tell it that we want to match on the subject, body and tag fields, and specify that we only want to calculate ‘relatedness’ based on words that are two characters or longer. That’s the easy part; now it gets interesting.

Given that you already have a Lucene document, the MoreLikeThis instance loops over the field names we specified (fields are roughly the Lucene equivalent of database columns), or over all the field names available in the Lucene index if you don’t specify any, and retrieves a term vector for each of those fields in the document we’re analyzing. A term vector is a data structure that holds a list of all the words that appeared in the field and the number of times each word was used (excluding words that it considers to be ‘stop words’; you can see a list of common stop words here).

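If you want to poke at term vectors yourself, the IndexReader API of this era exposes them directly; here’s a minimal sketch (assuming the field was indexed with term vectors enabled, and reusing reader and docNum from the snippet above):

// fetch the term vector for the body field of a single document; this only
// works if the field was indexed with term vectors enabled
TermFreqVector vector = reader.getTermFreqVector(docNum, "body");
String[] terms = vector.getTerms();         // the distinct words in the field
int[] freqs = vector.getTermFrequencies();  // parallel array of counts
for (int i = 0; i < terms.length; i++) {
    System.out.println(terms[i] + "[" + freqs[i] + "]");
}
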
Since I’m sure some of you are tuning out at this point, let’s look at something concrete: assume you’ve created a document that looks like this (by the way, what a thorough wikipedia.org entry!):

Subject: Twinkle Twinkle Little Star
Body: Twinkle Twinkle little star, How I wonder what you are! Up above the world so high, Like a diamond in the sky, Twinkle twinkle little star, How I wonder what you are!

The term vector for the subject would look like this:
terms: twinkle, little, star
termFrequencies: twinkle[2], little[1], star[1]

and for the body:
terms: above, diamond, high, how, like, little, sky, so, star, twinkle, up, what, wonder, world
termFrequencies: above[1], diamond[1], high[1], how[2], like[1], little[2], sky[1], so[1], star[2], twinkle[4], up[1], what[2], wonder[2], world[1]

Meanwhile, back at the ranch… so we get a term vector for each of the fields we asked for in the document we want related items for. After we’ve got those, each of the term vectors is merged into a single map: the key being the term and the value being the number of times the word was used across the document. The map is then handed to a method (called createQueue) that calculates a reasonably complex score for each word in the map. Since this is really the meat of the entire class, I’ll step through it in a little more detail, and I’ll season the meat with data from the index on my blog.

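The merge step itself is straightforward; here’s a sketch of roughly what MoreLikeThis does internally (the real code uses a small counter class rather than Integer, but the idea is the same):

// merge the per-field term vectors into one term -> frequency map
Map<String, Integer> termFreqMap = new HashMap<String, Integer>();
for (String fieldName : fieldNames) {
    TermFreqVector vector = reader.getTermFreqVector(docNum, fieldName);
    String[] terms = vector.getTerms();
    int[] freqs = vector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        Integer count = termFreqMap.get(terms[i]);
        termFreqMap.put(terms[i], count == null ? freqs[i] : count + freqs[i]);
    }
}
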
The createQueue method first retrieves the total number of documents that exist in the index that we’re dealing with.

int numDocs = ir.numDocs();

In the index I maintain for my personal blog, there are 998 documents. The second thing it does is create an instance of a class called FreqQ, which extends Lucene’s PriorityQueue class. This queue will maintain object arrays ordered according to their score, which we’ll cover in a second.

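For the curious, FreqQ is tiny; it looks roughly like this (a sketch from memory of the contrib source, where each entry is an Object[] holding the word, its best field and its scores):

// orders the Object[] entries by score (element 2) so that pop() returns
// the highest-scoring term first; extends org.apache.lucene.util.PriorityQueue
private static class FreqQ extends PriorityQueue {
    FreqQ(int size) {
        initialize(size);
    }
    protected boolean lessThan(Object a, Object b) {
        float scoreA = ((Float) ((Object[]) a)[2]).floatValue();
        float scoreB = ((Float) ((Object[]) b)[2]).floatValue();
        return scoreA > scoreB; // inverted comparison: "less than" = higher score
    }
}
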
Now the createQueue method iterates over each word in the term frequency map, throwing out words if they don’t occur enough times (see setMinTermFreq) and then testing to find out which field across the entire Lucene index contains the term the most. Next it calculates the Inverse Document Frequency, which, according to the JavaDoc, is:

… a score factor based on a term’s document frequency (the number of documents which contain the term)… Terms that occur in fewer documents are better indicators of topic, so implementations of this method usually return larger values for rare terms, and smaller values for common terms.

and finally, a score, which is the product of the IDF value and the number of times the word appeared in the source document.

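Putting those three steps together, the per-term loop inside createQueue looks roughly like this (a sketch, not the verbatim contrib code; minTermFreq is the setMinTermFreq value, ir and numDocs come from above, and similarity stands in for a DefaultSimilarity instance):

// uses org.apache.lucene.index.Term; ir, numDocs and termFreqMap come from above
for (Map.Entry<String, Integer> entry : termFreqMap.entrySet()) {
    String word = entry.getKey();
    int tf = entry.getValue();              // frequency in the source document
    if (tf < minTermFreq) {
        continue;                           // see setMinTermFreq
    }
    // find the field that contains this term in the most documents
    String topField = fieldNames[0];
    int docFreq = 0;
    for (String fieldName : fieldNames) {
        int freq = ir.docFreq(new Term(fieldName, word));
        if (freq > docFreq) {
            docFreq = freq;
            topField = fieldName;
        }
    }
    float idf = similarity.idf(docFreq, numDocs);
    float score = tf * idf;
    queue.insert(new Object[]{word, topField, new Float(score),
            new Float(idf), new Integer(docFreq), new Integer(tf)});
}
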
Again, I’m guessing you’re getting really sleepy, so here’s some real data to look at. As I mentioned above, I have 998 documents in my Lucene index, and if we take a document like this one, you’ll end up with a term vector that looks like this:

a[20], you[20], pre[18], the[17], to[13], of[11], this[10], username[10], column[9], oracle[9], and[8], if[8], null[8], alter[7], appuser[7], into[7], on[7], that[7], href[6], http[6], i[6], sql[6], an[5], com[5], is[5], not[5], password[5], table[5], values[5], varchar[5], www[5], be[4], but[4], case[4], clearspace[4], empty[4], get[4], insert[4], modify[4], which[4], administrator[3], are[3], can[3], database[3], db[3], in[3], jivesoftware[3], mysql[3], nullability[3], server[3], string[3], t[3], thing[3], with[3], about[2], all[2], assume[2], been[2], blogged[2], d[2], different[2], feature[2], for[2], from[2], has[2], have[2], hsqldb[2], jive[2], make[2], modified[2], nvarchar[2], out[2], plan[2], postgres[2], products[2], ran[2], requirements[2], sensitivity[2], so[2], space[2], store[2], strike[2], support[2], where[2], will[2], above[1], again[1], against[1], already[1], andrew[1], any[1], appears[1], application[1], at[1], attempt[1], bennett[1], bit[1], blog[1], borrowed[1], both[1], bottom[1], by[1], cannot[1], changing[1], chars[1], classes[1], cmu[1], code[1], columns[1], consider[1], contrib[1], converted[1], cool[1], couple[1], crazy[1], day[1], decided[1], default[1], detail[1], didn[1], discuss[1], do[1], doing[1], edu[1], error[1], example[1], finally[1], first[1], flushed[1], forums[1], further[1], going[1], good[1], goodness[1], heads[1], helpful[1], his[1], insensitive[1], instead[1], interesting[1], involved[1], issue[1], issues[1], it[1], itself[1], jist[1], job[1], jsp[1], kb[1], kind[1], lately[1], least[1], like[1], line[1], ll[1], look[1], looks[1], lot[1], mcelwee[1], microsoft[1], more[1], my[1], needed[1], nice[1], no[1], notice[1], nugget[1], number[1], only[1], or[1], ora[1], over[1], page[1], past[1], platforms[1], probably[1], product[1], re[1], readers[1], reading[1], reason[1], respect[1], results[1], retrieve[1], ridiculous[1], sample[1], second[1], see[1], select[1], semicolon[1], sensitive[1], servers[1], shadow[1], similar[1], six[1], software[1], something[1], specific[1], specification[1], specify[1], standard[1], statement[1], supporting[1], sure[1], system[1], tails[1], than[1], then[1], there[1], they[1], things[1], those[1], thought[1], thunderguy[1], tolowercase[1], touppercase[1], tried[1], try[1], two[1], txt[1], unless[1], ups[1], used[1], user[1], variety[1], ve[1], very[1], want[1], warned[1], was[1], we[1], week[1], whatever[1], wonderful[1], word[1], work[1], write[1], yes[1], your[1], yourself[1]

Stop words get stripped out, as do words that occurred fewer than two times, so you’d be left with a term frequency map that looks like this:

pre[18], username[10], column[9], oracle[9], alter[7], appuser[7], href[6], http[6], sql[6], com[5], password[5], table[5], values[5], varchar[5], www[5], case[4], clearspace[4], empty[4], get[4], insert[4], modify[4], which[4], administrator[3], can[3], database[3], db[3], jivesoftware[3], mysql[3], nullability[3], server[3], string[3], thing[3], about[2], all[2], assume[2], been[2], blogged[2], d[2], different[2], feature[2], from[2], has[2], have[2], hsqldb[2], jive[2], make[2], modified[2], nvarchar[2], out[2], plan[2], postgres[2], products[2], ran[2], requirements[2], sensitivity[2], so[2], space[2], store[2], strike[2], support[2], where[2],

So for each of these terms, Lucene finds the field that contains the most instances of the given term and then calculates the idf value and the score. The default implementation of the idf value looks like this:

return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);

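To make that concrete: ‘pre’, the top term in the table below, appears in 26 of my 998 documents, so its idf works out to log(998/27) + 1 ≈ 4.61.
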
So given the term frequency map above and the document frequency noted below, we get the following idf values and scores for the most popular terms in this document:

Term       Instances in Document   Documents Matching Term   IDF value   Score
pre        18                      26                        4.609916    82.978
username   10                      23                        4.7276993   47.276
column     9                       13                        5.266696    47.400264
oracle     9                       8                         5.7085285   51.376
alter      7                       1                         7.212606    50.488

Finally, after all the terms have been added to the priority queue, we create a Lucene query, looping over the first 25 terms in the queue (25 is the default; it can be changed via setMaxQueryTerms)

TermQuery tq = new TermQuery(new Term("body", "oracle"));

and optionally boosting each term according to its score relative to the best-scoring term:

tq.setBoost(51.376 / 82.978); // this term's score divided by the top score in the queue

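Tying it together, the query assembly looks roughly like this (again a sketch, reusing the Object[] entries described above; maxQueryTerms and boost are instance settings):

// uses org.apache.lucene.index.Term and org.apache.lucene.search.{BooleanQuery,
// BooleanClause, TermQuery}
BooleanQuery query = new BooleanQuery();
float bestScore = -1;
int count = 0;
Object cur;
while ((cur = queue.pop()) != null && count++ < maxQueryTerms) {
    Object[] entry = (Object[]) cur;
    TermQuery tq = new TermQuery(new Term((String) entry[1], (String) entry[0]));
    if (boost) {
        if (bestScore < 0) {
            bestScore = ((Float) entry[2]).floatValue(); // first pop is the top score
        }
        tq.setBoost(((Float) entry[2]).floatValue() / bestScore);
    }
    query.add(tq, BooleanClause.Occur.SHOULD);
}
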
The resulting Lucene query (in string format) looks something like this:

body:pre body:username^.56974 body:column^.57123 body:oracle^.61915 ...

Now I’ll be curious to see what the related item is for this post!

I hope you learned as much as I did. If you’re extremely curious about this whole Lucene thing and you like the in-depth stuff, you should definitely check out Luke, it’ll open a whole new world to you.

Using Lucene and MoreLikeThis to show Related Content

If you read this blog, you probably paid a smidgen of attention to the Web 2.0 Conference held last week in San Francisco. Sphere was one of the companies that presented; they launched a product called the “Sphere It Contextual Widget for blogs”, which is a JavaScript widget you can add to your blog or content-focused site that displays contextually similar blogs and blog posts for the reader. I’ve always wanted to try to do something similar (no pun intended) using Lucene, so I spent a couple hours this weekend banging around on it.

The first step was to get my WordPress content (which is stored in MySQL) into Lucene. A couple lines of code later, I had a Lucene index full of all 857 (as of 11/14/2006) posts, including the blog post ID, subject, body, date and permalink. Next, I checked out and compiled the Lucene similarity contrib, whose most important asset is the MoreLikeThis class (written in part by my co-worker Bruce Ritchie). You provide a MoreLikeThis instance with a document to parse, an index to search and the fields in the index you want to compare against the given document, and then execute a Lucene search just like you normally would:

Reader reader = ...;
IndexReader index = IndexReader.open(indexfile);
IndexSearcher searcher = new IndexSearcher(index);
MoreLikeThis mlt = new MoreLikeThis(index);
mlt.setFieldNames(new String[] {"subject", "body"});
Query query = mlt.like(reader);
Hits hits = searcher.search(query);

I’ll skip all the glue and say that I wired all this up into a servlet that spits out JSON:

Map entries = getRelatedEntries(postID, body);
JSONObject json = JSONObject.fromObject( entries );
response.setContentType("text/javascript");
response.getWriter().write("Related = {}; Related.posts = " + json.toString());

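For the curious, here’s a minimal sketch of what getRelatedEntries might look like given the setup above (the field names and the five-post limit are my assumptions, not the original code):

// uses java.io.StringReader and java.util.LinkedHashMap, plus the searcher
// and MoreLikeThis instances from the snippet above; field names are assumed
Map<String, String> getRelatedEntries(int postID, String body) throws IOException {
    Map<String, String> entries = new LinkedHashMap<String, String>();
    Query query = mlt.like(new StringReader(body));
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length() && entries.size() < 5; i++) {
        Document doc = hits.doc(i);
        if (String.valueOf(postID).equals(doc.get("id"))) {
            continue; // skip the post we're finding related content for
        }
        entries.put(doc.get("subject"), doc.get("permalink"));
    }
    return entries;
}
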
and then used client side JavaScript and some PHP to put it all together:

<h5>Related Content</h5>
<script type="text/javascript"
  src="http://cephas.net/blog/related.js?post=<?php the_ID(); ?>">
</script>
<script type="text/javascript">
for (post in Related.posts) {
  document.write('<li><a href="' + Related.posts[post] + '">' + post + '</a></li>');
}
</script>

I’ve been cruising around the blog and so far I think MoreLikeThis works really well. For the most part, the posts that I would expect to be related are related. There are a couple posts that seem to pop to the top of the ‘related content’ feed that I’ll have to fix, and I’d like to boost the terms in the subject of the original document, but other than that, I’m happy with it.

Back to Sphere, and specifically to Brady’s post about it on the Radar blog:

Top-Left Corner: Recent, similar blog posts from other blogs.
Bottom-Left Corner: Recommended blogs that are selected by the site-owner. This is very handy for blog networks.
Top-Right Corner: Similar posts from that blog
Bottom-Right Corner: Ad, currently served by FM Pub.

Given a week, I’m guessing that you could use the Google API to do the top-left corner, hardcode the content in the bottom left, use MoreLikeThis in the top right and the bottom right you’d want to do yourself anyway. So if you were a publisher looking for more page views, why would you even consider the Sphere widget?

Nutch, Yahoo!, and Hadoop

It’s been a while since I mentioned anything about Lucene, my favorite Java-based open source indexing and search library (which I built the karakoram spider / search application around). Doug Cutting, who created Lucene and who has spent the last couple of years working on Nutch, was recently hired by Yahoo!. I just have a couple questions:

a) Why would Yahoo! want to hire a guy writing a Java-based web crawler and indexer?

b) Where does he get all the cool names? Nutch? Hadoop?

c) How cool does Hadoop sound? Hadoop Distributed Filesystem (HDFS) and an implementation of MapReduce. Hmm… where else have I heard those terms bandied about?

Conflicting mindsets of C# vs. Java: Part II

You all read the ‘Conflicting mindsets of C# vs. Java‘ weblog post, right? And you all noticed that the guys running the Lucene.NET project on SourceForge closed up shop, took all their toys and went home, right? I’m gonna go out on a limb and say that the two are related.

The way I see it, *in general* the .NET community conversation is dominated by talk about the latest and greatest that Microsoft is putting out: there’s talk about MapPoint Location Server, SQL Server, Longhorn, ASP.NET 2.0 and Visual Studio, all products of Redmond. The corresponding group of Java developers is talking about JBoss, Hibernate, Struts and Eclipse: none of which came out of Silicon Valley.

Malcolm’s mindset #1 says that .NET developers accept the tools and services provided to them by Microsoft, and I think for the most part this is true. You don’t see .NET developers spending their cycles on persistence layers, web application frameworks or caching solutions, probably because Microsoft has provided Microsoft solutions for all of these problems. But if it’s just about providing tools, then why aren’t JSF, JDO and NetBeans dominating the javablogs conversations? Seriously, take a look at ASP.NET and JSF. They aren’t that different, and yet ASP.NET is widely used in conjunction with Visual Studio while JSF is rarely lauded and more often derided. I think he’s right: it really is a mindset.

Which brings me back to the Lucene.NET guys. Why would they close up shop? Why not continue to donate their time and energy to an excellent cause? Maybe the Microsoft mindset has something to do with it. How about this: a search on Google for ‘lucene’ within the weblogs.asp.net domain yields exactly 17 results. The same search on jroller.com yields 2,570 results. Admittedly, Lucene has been around longer, but maybe one of the reasons the Lucene.NET guys packed it up (and are now trying to sell their work) is that no one paid any attention to them because everyone was too busy working with SQL Server full-text indexing, a tool given to them by Microsoft (but one that costs thousands of dollars per processor). Another reason a project like Lucene or Struts or Tomcat flourishes is that there is a certain amount of prestige in working on a big open source project. If you work on open source projects for the prestige and you’re not getting the attention you think you deserve, you find another motivation. In their case money was the motivation, so they closed up the project on SourceForge and are now selling a personal edition and a business edition. They might make a couple bucks, but I bet in a year there won’t be many people writing about searchblackbox.com.

So what’s my point? That all .NET developers are greedy and don’t care about the community? Not really. I think it’s that the two communities have different bus drivers: .NET developers look to Microsoft to provide the tools they need to do their jobs… and if they look elsewhere or copy something else, Microsoft will eventually come in and make a product of its own that does the job, thereby negating any work the developers did in the meantime. Microsoft drives the bus. Java developers look at the products and specs that Sun puts out and then go and build their own tools or frameworks or applications to do the job. Sun will eventually put out something through the JCP that does the job… but the developers in the Java community will only use it if they want to; witness the continued popularity of Struts and the lack of interest in JSF. In the Java camp, the developers drive the bus.

Extracting Text From MS Word

Someone on the Lucene User list wanted to know if it was possible to search MS Word documents using Lucene. The normal response is to go take a look at the Jakarta POI project (which has a new blog, by the way). Ryan Ackley submitted his website (textmining.org) along with a plug for his TextMining.org Word Text Extractor v0.4 and some sample code:

FileInputStream in = new FileInputStream("test.doc");
WordExtractor extractor = new WordExtractor();
String str = extractor.extractText(in); // extract the document text from the stream

Nice.

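From there it’s a short hop to making the document searchable; here’s a minimal sketch of feeding the extracted text to Lucene (the field names and the writer are my own, not from the original post):

// writer is an org.apache.lucene.index.IndexWriter opened on your index
Document doc = new Document();
doc.add(new Field("filename", "test.doc", Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("body", str, Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);
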
Someone else noted that the Python version of Lucene (called Lupy) has an indexer for MS Word and PDF as well, although it appears to only work on Windows.

jSearch 1.1

Spent some time this last week making some minor updates to the jSearch codebase. Along the way I decided to get creative and rename it: jSearch is now called karakoram, after the mountain range that straddles Pakistan, India and China. The updates include:

  • modified JSP templates to use Struts <html:img /> and <html:image /> tags instead of hardcoding the context into the templates
  • added Hibernate xdoclet tags to automate DDL export and Hibernate mapping and config document creation. Basically I updated the four core objects in karakoram, embedding the Hibernate xdoclet tags so that Ant can automatically create an installation.sql file, the four Hibernate mapping files, and the master Hibernate configuration file (hibernate.cfg.xml).
  • updated hibernate.cfg.xml to run off of a JNDI-configured datasource instead of a datasource configured directly in hibernate.cfg.xml (personal preference: I find that it’s easier to configure the application once in server.xml and then deploy the application without worrying about the datasource configuration)

I’ll post updated installation instructions when I have a chance tomorrow, in the meantime you can download the latest war file by visiting the karakoram homepage.