How MoreLikeThis Works in Lucene

We created a ‘related items’ feature way back in Clearspace 1.0 (I mocked some of it out here just to prove that it worked) which shows related content based on the document, thread or blog post that you’re currently viewing. It was built using the MoreLikeThis class, a contribution made to the Lucene project by David Spencer, Mark Harwood and my favorite Canadian coworker Bruce Ritchie.

The really interesting thing about MoreLikeThis (at least to me) is that my first inclination was to think that Lucene goes and spends some cycles looking for content related to the source document: the focus being on the search and the other content. The reality is that the majority of the work is done by analyzing the content of the source document relative to the aggregate of all the other content that exists in the index. Said another way, MoreLikeThis doesn’t work by running some special search or query; it works by comparing the document you’re asking about against the index as a whole (and only then by running a query).

Anyway, in general I think it works pretty well, but for some odd reason, people keep asking the question “how does it work?” and since the documentation doesn’t go into much detail, I thought I’d try writing it up. So here goes nothing…

The first thing you need to do if you’re going to find related content is to tell Lucene what you want to find related content for. The MoreLikeThis class gives you five options: you can tell it to look at an existing document in the Lucene index, or let it try to parse an external resource from a file, an InputStream, a Reader or a URL. In Clearspace we actually have a way of getting the Lucene document ID given the Clearspace object type and object ID, so the code for our ‘related items’ feature looks something like this:

// assumes an open Searcher and IndexReader over the Clearspace index;
// buildObjectTypeAndIDQuery is a Clearspace helper that finds the Lucene
// document for a given object type and ID
Query q = buildObjectTypeAndIDQuery(String.valueOf(messageID), JiveConstants.MESSAGE);
Hits hits = searcher.search(q);
...
int docNum = hits.id(0);
MoreLikeThis mlt = new MoreLikeThis(reader);
// only match on the subject, body and tag fields
mlt.setFieldNames(new String[]{IndexField.subject.toString(), IndexField.body.toString(),
                    IndexField.tagID.toString()});
mlt.setMinWordLen(2); // ignore words shorter than two characters
mlt.setBoost(true);   // boost each query term by its relative score
q = mlt.like(docNum);

So in Clearspace we first fetch the Lucene document, construct a MoreLikeThis instance, tell it that we want to match on the subject, body and tag fields, and specify that we only want to calculate ‘relatedness’ based on words that are two characters or longer. That’s the easy part; now it gets interesting.

Given that you already have a Lucene document, the MoreLikeThis instance loops over the field names (fields are kind of like the Lucene equivalent of database columns) that we specified (or all the field names available in the Lucene index if you don’t specify any) and retrieves a term vector for each of those fields in the document we’re analyzing. A term vector is a data structure that holds a list of all the words that were in the field and the number of times each word was used, excluding words that it considers to be ‘stop words’ (you can see a list of common stop words here).

Since I’m sure some of you are tuning out at this point, let’s look at something concrete: assume you’ve created a document that looks like this (by the way, what a thorough wikipedia.org entry!):

Subject: Twinkle Twinkle Little Star
Body: Twinkle Twinkle little star, How I wonder what you are! Up above the world so high, Like a diamond in the sky, Twinkle twinkle little star, How I wonder what you are!

The term vector for the subject would look like this:
terms: twinkle, little, star
termFrequencies: twinkle[2], little[1], star[1]

and for the body:
terms: [above, diamond, high, how, like, little, sky, so, star, twinkle, up, what, wonder, world]
termFrequencies: above[1], diamond[1], high[1], how[2], like[1], little[2], sky[1], so[1], star[2], twinkle[4], up[1], what[2], wonder[2], world[1]

Meanwhile, back at the ranch… so we get a term vector for each of the fields available in the document we’re finding related items for. After we’ve got those, each of the term vectors is merged into a map: the key being the term and the value being the number of times the word was used in the document. The map is then handed to a method (called createQueue) that calculates a reasonably complex score for each word in the map. Since this is really the meat of the entire class, I’ll step through it in a little more detail and season the meat with data from the index on my blog.
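To make the merge step concrete, here’s a rough sketch using the Lucene term vector API. This isn’t the actual MoreLikeThis source and the variable names are mine (the real class also falls back to re-analyzing the field text when no term vector was stored), but the idea is the same:

Map<String, Integer> termFreqMap = new HashMap<String, Integer>();
for (String fieldName : fieldNames) {
    // a term vector is a pair of parallel arrays: the distinct terms
    // in the field and the number of times each one was used
    TermFreqVector vector = reader.getTermFreqVector(docNum, fieldName);
    if (vector == null) {
        continue; // no term vector stored for this field
    }
    String[] terms = vector.getTerms();
    int[] freqs = vector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        Integer count = termFreqMap.get(terms[i]);
        termFreqMap.put(terms[i], count == null ? freqs[i] : count + freqs[i]);
    }
}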

The createQueue method first retrieves the total number of documents that exist in the index that we’re dealing with.

int numDocs = ir.numDocs();

In the index I maintain for my personal blog, there are 998 documents. The second thing it does is create an instance of a class called FreqQ, which extends Lucene’s PriorityQueue class. This object maintains an array whose elements are ordered according to their score, which we’ll cover in a second.
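If you’re curious, a FreqQ-style queue only takes a few lines. Here’s a sketch against Lucene’s PriorityQueue; the entry layout (an Object[] with the score at index 2) is my assumption, not necessarily the exact layout in the source:

class FreqQ extends org.apache.lucene.util.PriorityQueue {
    FreqQ(int size) {
        initialize(size); // sizes the queue's internal array
    }
    // reverse the usual ordering so the highest-scoring entry sits at the
    // head of the queue and is popped first when the query gets built
    protected boolean lessThan(Object a, Object b) {
        Float scoreA = (Float) ((Object[]) a)[2];
        Float scoreB = (Float) ((Object[]) b)[2];
        return scoreA.floatValue() > scoreB.floatValue();
    }
}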

Now the createQueue method iterates over each word in the term frequency map, throwing out words that don’t occur enough times (see setMinTermFreq) and then testing to find out which field across the entire Lucene index contains the term the most. Next it calculates the Inverse Document Frequency, which, according to the JavaDoc, is:

… a score factor based on a term’s document frequency (the number of documents which contain the term)… Terms that occur in fewer documents are better indicators of topic, so implementations of this method usually return larger values for rare terms, and smaller values for common terms.

and finally, a score, which is a product of the IDF score and the number of times the word existed in the source document.
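Putting those three steps together, the body of createQueue works roughly like this (again a sketch, not the actual source; topField and topDocFreq are hypothetical helpers standing in for the per-field bookkeeping, and similarity is a DefaultSimilarity instance):

for (Map.Entry<String, Integer> entry : termFreqMap.entrySet()) {
    String word = entry.getKey();
    int tf = entry.getValue().intValue(); // times the word appears in the source document
    if (minTermFreq > 0 && tf < minTermFreq) {
        continue; // throw out words that don't occur enough times
    }
    String topField = topField(word);     // hypothetical: field containing the term the most
    int docFreq = topDocFreq(word);       // hypothetical: document frequency in that field
    float idf = similarity.idf(docFreq, numDocs); // rarer terms score higher
    float score = tf * idf;
    queue.insert(new Object[]{word, topField, new Float(score)});
}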

Again, I’m guessing you’re getting really sleepy, so here’s some real data to look at. Like I mentioned above, I have 998 documents in my Lucene index, and if we take a document like this one, you’ll end up with a term vector that looks like this:

a[20], you[20], pre[18], the[17], to[13], of[11], this[10], username[10], column[9], oracle[9], and[8], if[8], null[8], alter[7], appuser[7], into[7], on[7], that[7], href[6], http[6], i[6], sql[6], an[5], com[5], is[5], not[5], password[5], table[5], values[5], varchar[5], www[5], be[4], but[4], case[4], clearspace[4], empty[4], get[4], insert[4], modify[4], which[4], administrator[3], are[3], can[3], database[3], db[3], in[3], jivesoftware[3], mysql[3], nullability[3], server[3], string[3], t[3], thing[3], with[3], about[2], all[2], assume[2], been[2], blogged[2], d[2], different[2], feature[2], for[2], from[2], has[2], have[2], hsqldb[2], jive[2], make[2], modified[2], nvarchar[2], out[2], plan[2], postgres[2], products[2], ran[2], requirements[2], sensitivity[2], so[2], space[2], store[2], strike[2], support[2], where[2], will[2], above[1], again[1], against[1], already[1], andrew[1], any[1], appears[1], application[1], at[1], attempt[1], bennett[1], bit[1], blog[1], borrowed[1], both[1], bottom[1], by[1], cannot[1], changing[1], chars[1], classes[1], cmu[1], code[1], columns[1], consider[1], contrib[1], converted[1], cool[1], couple[1], crazy[1], day[1], decided[1], default[1], detail[1], didn[1], discuss[1], do[1], doing[1], edu[1], error[1], example[1], finally[1], first[1], flushed[1], forums[1], further[1], going[1], good[1], goodness[1], heads[1], helpful[1], his[1], insensitive[1], instead[1], interesting[1], involved[1], issue[1], issues[1], it[1], itself[1], jist[1], job[1], jsp[1], kb[1], kind[1], lately[1], least[1], like[1], line[1], ll[1], look[1], looks[1], lot[1], mcelwee[1], microsoft[1], more[1], my[1], needed[1], nice[1], no[1], notice[1], nugget[1], number[1], only[1], or[1], ora[1], over[1], page[1], past[1], platforms[1], probably[1], product[1], re[1], readers[1], reading[1], reason[1], respect[1], results[1], retrieve[1], ridiculous[1], sample[1], second[1], see[1], select[1], semicolon[1], sensitive[1], servers[1], shadow[1], similar[1], six[1], software[1], something[1], specific[1], specification[1], specify[1], standard[1], statement[1], supporting[1], sure[1], system[1], tails[1], than[1], then[1], there[1], they[1], things[1], those[1], thought[1], thunderguy[1], tolowercase[1], touppercase[1], tried[1], try[1], two[1], txt[1], unless[1], ups[1], used[1], user[1], variety[1], ve[1], very[1], want[1], warned[1], was[1], we[1], week[1], whatever[1], wonderful[1], word[1], work[1], write[1], yes[1], your[1], yourself[1]

Stop words get stripped out, as do words that occurred fewer than two times, so you’d be left with a term frequency map that looks like this:

pre[18], username[10], column[9], oracle[9], alter[7], appuser[7], href[6], http[6], sql[6], com[5], password[5], table[5], values[5], varchar[5], www[5], case[4], clearspace[4], empty[4], get[4], insert[4], modify[4], which[4], administrator[3], can[3], database[3], db[3], jivesoftware[3], mysql[3], nullability[3], server[3], string[3], thing[3], about[2], all[2], assume[2], been[2], blogged[2], d[2], different[2], feature[2], from[2], has[2], have[2], hsqldb[2], jive[2], make[2], modified[2], nvarchar[2], out[2], plan[2], postgres[2], products[2], ran[2], requirements[2], sensitivity[2], so[2], space[2], store[2], strike[2], support[2], where[2]

So for each of these terms, Lucene finds the field that contains the most instances of the given term and then calculates the idf value and the score. The default implementation of the idf value looks like this:

return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);

So given the term frequency map above and the document frequency noted below, we get the following idf values and scores for the most popular terms in this document:

Term       Instances in Document   Documents Matching Term   IDF Value   Score
pre        18                      26                        4.609916    82.978
username   10                      23                        4.7276993   47.276
column     9                       13                        5.266696    47.400264
oracle     9                       8                         5.7085285   51.376
alter      7                       1                         7.212606    50.488
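If you want to check the math, take ‘pre’: it appears in 26 of the 998 documents, so docFreq is 26 and idf = log(998 / (26 + 1)) + 1 = log(36.96) + 1 ≈ 3.61 + 1 ≈ 4.61 (note that Math.log is the natural logarithm). The score is then the term’s frequency in the source document times its idf: 18 × 4.609916 ≈ 82.978.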

Finally, after all the terms have been added to the priority queue, we create a Lucene query, looping over the first 25 terms in the queue (this is the default and can be changed via setMaxQueryTerms):

TermQuery tq = new TermQuery(new Term("body", "oracle"));

and optionally boosting each term according to the score:

tq.setBoost(51.376f / 82.978f);
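Each term’s boost is its score divided by the best (highest) score in the queue, so the top term effectively gets a boost of 1. Assembled, the loop looks roughly like this sketch (not the actual source; the terms and scores come from the table above):

BooleanQuery query = new BooleanQuery();
float bestScore = 82.978f; // score of "pre", the first term popped off the queue
String[] terms = {"pre", "username", "column", "oracle"};
float[] scores = {82.978f, 47.276f, 47.400264f, 51.376f};
for (int i = 0; i < terms.length; i++) {
    TermQuery tq = new TermQuery(new Term("body", terms[i]));
    tq.setBoost(scores[i] / bestScore);        // normalize by the best score
    query.add(tq, BooleanClause.Occur.SHOULD); // any of the terms may match
}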

The resulting Lucene query (in string format) looks something like this:

body:pre body:username^.56974 body:column^.57123 body:oracle^.61915 ...

Now I’ll be curious to see what the related item is for this post!

I hope you learned as much as I did. If you’re extremely curious about this whole Lucene thing and you like the in-depth stuff, you should definitely check out Luke, it’ll open a whole new world to you.

33 thoughts on “How MoreLikeThis Works in Lucene”

  1. Nice post… covers the “MoreLikeThis” class in great depth…

    Can you please tell me how efficient this class is, and does it scale well to a large number of documents/terms?

  2. Thank you for this post.

    In reply to Prafulla, MoreLikeThis scales extremely well:
    a query of a 150-word document against an index containing about 10 million documents of similar size returns an answer within 5 seconds (depending on your settings).
    And it works extremely well. The results returned are very related to the query.

  3. One of the problems I have right now with this function is the following:
    when I do a search using MoreLikeThis, the best hit is returned with a score of 1. However, I would like to get a score so that the query would be scored 1 and everything else less than that.
    The only solution I have is to add the query to the index. It will be returned as the best match scored 1. But there must be a better way. Any thoughts?

    Thank you

  4. Hello,

    Very interesting article. I do have one question though. I understand how the IDF is calculated but you do not mention how the Score is.

    Can you please clarify? Thank you for your time.

  5. No need to clarify; figured it out

    You multiply the number of instances of the term in the document by the IDF value.

  6. Hi Aaron,

    That is a very nice article,

    But I am still a little lost. I’ve got a challenge like this:

    I have a Lucene index with 300 docs, text docs of almost 20 pages each. I need to check whether a new doc, also of about 20 pages, already has its content in my Lucene index, which would mean this new doc could be a duplicate.

    That is my problem: how do I build code using MoreLikeThis and Lucene search to check whether this new doc has duplicated content? I don’t want to allow documents with the same content to be inserted into my index.

    How can it be done using what you show here?
    Have you done it before?

    Thanks in advance

    Wilson

  7. Great article!
    When using like(numDoc) you get better matches, as information about fields, etc. can be used – correct?
    Any way of getting the same accuracy when using like(reader) without having to add the target document to the index?

  8. Very good post… I have one problem with the Lucene searcher. If you could, please help me.
    I’m getting an ArrayIndexOutOfBoundsException when I do a multi-field query, e.g.
    title:one subject:two
    The exception arises in the TermScorer.score() method.

  9. Hi,
    I am trying out the method mlt.retrieveInterestingTerms(reader) but I am always getting zero results. Any pointers to good documentation on how to do this?
    Thanks.

  10. Hey,
    Thank your post!
    I have a problem and need some helps:
    – I have a string, for example: “London is a city in England…”
    – And I want to represent that string as a weight vector (using tf-idf); can this be done in Lucene?
    Thanks.

  11. Aaron

    This is a great post. I am working on the More Like This feature and have to customize it for some purposes and this was the best resource to understand how MLT works.

    However, I had a few questions.

    1. Can I use terms from one field (say f1) and generate an MLT query with this term in a different field (say f2)?

    2. I want the MLT query to run on a subset of the index and not the entire index (specifically the more recent documents; there is a date field in the schema). How do I customize MLT to do that?

    3. Instead of stop words, can I have a RegEx elimination of terms?

    Do I need to edit the source code for this, or would writing a plug-in do? I am new to Java and even newer to Solr, so it would be great if you could give a detailed reply.

    Thanks a lot.

  12. Pingback: The Cogworks Ltd › Similarity new uComponent datatype
  13. Hi,

    sorry I’m thick.

    I have tried to follow the workings of calculating the IDF for “pre” with a body of 998 documents and the table as you described it, and I simply can’t follow the numbers.

    Would you please be kind enough to show the workings, step by step, on how to calculate the IDF for the first term (pre), and also the score?
