Using Lucene and MoreLikeThis to show Related Content

If you read this blog, you probably paid a smidgen of attention to the Web 2.0 Conference held last week in San Francisco. Sphere was one of the companies that presented, and they launched a product called the “Sphere It Contextual Widget for blogs”, a JavaScript widget you can add to your blog or content-focused site that displays contextually similar blogs and blog posts for the reader. I’ve always wanted to try something similar (no pun intended) using Lucene, so I spent a couple of hours this weekend banging around on it.

The first step was to get my WordPress content (which is stored in MySQL) into Lucene. A couple of lines of code later I had a Lucene index full of all 857 (as of 11/14/2006) posts, including the blog post ID, subject, body, date and permalink. Next, I checked out and compiled the Lucene similarity contrib, whose most important asset is the MoreLikeThis class (written in part by co-worker Bruce Ritchie). You give an instance of MoreLikeThis a document to parse, an index to search, and the fields in the index you want to compare against the given document, and then execute a Lucene search just like you normally would:

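// The text of the current post, e.g. a StringReader over its body
Reader reader = ...;
IndexReader index = IndexReader.open(indexfile);
IndexSearcher searcher = new IndexSearcher(index);
MoreLikeThis mlt = new MoreLikeThis(index);
// Compare against these indexed fields only
mlt.setFieldNames(new String[] {"subject", "body"});
// Build a query from the document's terms, then search as usual
Query query = mlt.like(reader);
Hits hits = searcher.search(query);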
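
For the curious, here's a rough sketch of the general idea behind this kind of "query by example": count the terms in the source document, score each one by tf * idf against the rest of the collection, and keep the top few as query terms. This is plain Java with no Lucene dependency, and it is not Lucene's actual code. The class name InterestingTerms, the crude tokenizer and the exact idf formula are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified illustration of tf-idf term selection (hypothetical, not Lucene code). */
final class InterestingTerms {

    /** Counts how often each lowercased word occurs in the text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /**
     * Returns up to maxTerms terms from the source text, ranked by tf * idf,
     * with idf = ln(numDocs / (docFreq + 1)) + 1 (an assumed formula).
     */
    static List<String> topTerms(String source, List<String> corpus, int maxTerms) {
        // Document frequency: how many corpus documents contain each term.
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : corpus) {
            for (String term : termFrequencies(doc).keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFrequencies(source).entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            double idf = Math.log((double) corpus.size() / (df + 1)) + 1.0;
            scores.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(scores.keySet());
        // Highest score first; ties are broken alphabetically for stable output.
        terms.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return terms.subList(0, Math.min(maxTerms, terms.size()));
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
            "the lucene index holds the posts",
            "the servlet writes json for the widget",
            "the weather was nice on the weekend");
        System.out.println(topTerms("the lucene index stores the lucene query", corpus, 2));
    }
}
```

Words that show up in every post (like "the") get a low idf and drop out of the top terms, while words that are frequent in the source post but rare elsewhere float to the top.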

I’ll skip all the glue and say that I wired all this up into a servlet that spits out JSON:

Map entries = getRelatedEntries(postID, body);
JSONObject json = JSONObject.fromObject( entries );
response.setContentType("text/javascript");
response.getWriter().write("Related = {}; Related.posts = " + json.toString());

and then used client-side JavaScript and some PHP to put it all together:

<h5>Related Content</h5>
<script type="text/javascript"
  src="http://cephas.net/blog/related.js?post=<?php the_ID(); ?>">
</script>
<script type="text/javascript">
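// Related.posts maps each post title to its permalink
document.write('<ul>');
for (var post in Related.posts) {
  document.write('<li><a href="' + Related.posts[post] + '">' + post + '</a></li>');
}
document.write('</ul>');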
</script>
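
Since the script include depends on the exact text the servlet writes, here is a minimal sketch of producing that same "Related.posts = {...}" payload by hand, without a JSON library. It assumes the map goes from post title to permalink (which is how the loop above reads it). The class name RelatedPayload and its two methods are hypothetical, and the escaping covers only the common cases.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that renders title -> permalink entries as a script payload. */
final class RelatedPayload {

    /** Escapes a string for use inside a double-quoted JavaScript literal. */
    static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                // Escaping '<' keeps a title from closing the script tag early.
                case '<':  out.append("\\u003c"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    /** Builds the script text that defines Related.posts. */
    static String toScript(Map<String, String> entries) {
        StringBuilder js = new StringBuilder("Related = {}; Related.posts = {");
        boolean first = true;
        for (Map.Entry<String, String> e : entries.entrySet()) {
            if (!first) {
                js.append(",");
            }
            js.append(quote(e.getKey())).append(":").append(quote(e.getValue()));
            first = false;
        }
        return js.append("};").toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the entries in relevance order.
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("An example post", "http://example.com/blog/1");
        System.out.println(toScript(entries));
    }
}
```

A real JSON library handles more edge cases, so treat this as a way to see what goes over the wire rather than a replacement for one.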

I’ve been cruising around the blog and so far, I think that MoreLikeThis works really well. For the most part, the posts that I would expect to be related are related. There are a couple of posts that seem to pop to the top of the ‘related content’ feed that I’ll have to fix, and I would like to boost the terms in the subject of the original document, but other than that, I’m happy with it.
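
One way to think about the subject boost is to give every term from the subject a larger weight than a term from the body before the query is built. The sketch below only computes those weights in plain Java. The class name FieldWeightedTerms, the tokenizer and the boost value are assumptions for illustration, and how the weights would be attached to the Lucene query is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: weight subject terms more heavily than body terms. */
final class FieldWeightedTerms {

    /** Returns term -> weight, counting each subject occurrence subjectBoost times. */
    static Map<String, Double> weigh(String subject, String body, double subjectBoost) {
        Map<String, Double> weights = new HashMap<>();
        addTerms(weights, body, 1.0);
        addTerms(weights, subject, subjectBoost);
        return weights;
    }

    /** Adds the given weight once for every occurrence of each word in the text. */
    private static void addTerms(Map<String, Double> weights, String text, double weight) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                weights.merge(token, weight, Double::sum);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(weigh("Lucene tips", "lucene is fast", 3.0));
    }
}
```

With a boost of 3.0, a word that appears once in the subject and once in the body ends up four times as heavy as a word that appears only once in the body.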

Back to Sphere, and specifically to Brady’s post about it on the Radar blog:

Top-Left Corner: Recent, similar blog posts from other blogs.
Bottom-Left Corner: Recommended blogs that are selected by the site-owner. This is very handy for blog networks.
Top-Right Corner: Similar posts from that blog
Bottom-Right Corner: Ad, currently served by FM Pub.

Given a week, I’m guessing that you could use the Google API to do the top-left corner, hardcode the content in the bottom left, use MoreLikeThis in the top right, and handle the bottom right yourself, which you’d want to do anyway. So if you were a publisher looking for more page views, why would you even consider the Sphere widget?

4 thoughts on “Using Lucene and MoreLikeThis to show Related Content”

  1. Aaron – nice post. I’m glad we could inspire some experimentation 🙂

    You’ve correctly homed in on the QBE algorithm as providing a significant part of the value of the Sphere contextual widget. Not all QBE algorithms are equally effective, however, and the Lucene similarity contrib, while an excellent implementation of traditional QBE, doesn’t work convincingly across large content collections with diverse subject matter. Our QBE algorithm was designed specifically with large, dynamic content collections in mind, e.g. the blogosphere or mainstream news sites.

    Another value-creator for bloggers and publishers is drawing on our index of the blogosphere for related content, in addition to the publisher’s own content. If you get around to testing your idea of building a “related posts from others” on top of the Google API, I’d be interested to hear how that goes. I’ll admit to being skeptical that a QBE approach is suited to a meta-search implementation, but the proof of the pudding is in the eating, as the saying goes.

    Thanks for posting your experiment, I’m looking forward to the next installment.

    Best regards,

  2. Aaron – really nice post.

    Will it also work when we have multiple Lucene documents instead of just one?
    Basically, I retrieve 10 documents, and each document’s term vectors are merged into the map that the createQueue method handles.

    [rest is supplied as before …]

    So what I want to achieve is something like a MoreLikeThis based on multiple documents rather than just one.

    Thanks in advance.

    Cheers,
    MK
