How MoreLikeThis Works in Lucene

We created a ‘related items’ feature way back in Clearspace 1.0 (I mocked some of it out here just to prove that it worked) which shows related content based on the document, thread or blog post that you’re currently viewing. It was built using the MoreLikeThis class, a contribution made to the Lucene project by David Spencer, Mark Harwood and my favorite Canadian coworker Bruce Ritchie.

The really interesting thing about MoreLikeThis (at least to me) is that my first inclination was to think that Lucene goes and spends some cycles looking for content related to the source document: the focus being on the search and the other content. The reality is that the majority of the work is done by analyzing the content of the source document relative to the aggregate of all the other content that exists in the index. Said another way, MoreLikeThis doesn’t work by running some special search or query; it works by comparing the document you’re asking about against the index as a whole (and only then by running a query).

Anyway, in general I think it works pretty well, but for some odd reason, people keep asking the question “how does it work?” and since the documentation doesn’t go into much detail, I thought I’d try writing it up. So here goes nothing…

The first thing you need to do if you’re going to find related content is to tell Lucene what you want to find related content for. The MoreLikeThis class gives you five options: you can tell it to look at an existing document in the Lucene index, or let it try to parse an external resource from a file, an InputStream, a Reader or a URL. In Clearspace we actually have a way of getting the Lucene document ID given the Clearspace object type and object ID, so the code for our ‘related items’ feature looks something like this:

// assumes an open Searcher and IndexReader over the Clearspace index;
// buildObjectTypeAndIDQuery is a Clearspace helper that finds the Lucene
// document for a given object type and ID
Query q = buildObjectTypeAndIDQuery(String.valueOf(messageID), JiveConstants.MESSAGE);
Hits hits = searcher.search(q);
...
int docNum = hits.id(0);
MoreLikeThis mlt = new MoreLikeThis(reader);
// only match on the subject, body and tag fields
mlt.setFieldNames(new String[]{IndexField.subject.toString(), IndexField.body.toString(),
                    IndexField.tagID.toString()});
mlt.setMinWordLen(2); // ignore words shorter than two characters
mlt.setBoost(true);   // boost each query term by its relative score
q = mlt.like(docNum);

So in Clearspace we first fetch the Lucene document, construct a MoreLikeThis instance, tell it that we want to match on the subject, body and tag fields, and specify that we only want to calculate ‘relatedness’ based on words that are two characters or longer. That’s the easy part; now it gets interesting.

Given that you already have a Lucene document, the MoreLikeThis instance loops over the field names (fields are kind of like the Lucene equivalent of database columns) that we specified (or all the field names available in the Lucene index if you don’t specify any) and retrieves a term vector for each of those fields in the document we’re analyzing. A term vector is a data structure that holds a list of all the words that were in the field and the number of times each word was used, excluding words that it considers to be ‘stop words’ (you can see a list of common stop words here).

Since I’m sure some of you are tuning out at this point, let’s look at something concrete: assume you’ve created a document that looks like this (by the way, what a thorough wikipedia.org entry!):

Subject: Twinkle Twinkle Little Star
Body: Twinkle Twinkle little star, How I wonder what you are! Up above the world so high, Like a diamond in the sky, Twinkle twinkle little star, How I wonder what you are!

The term vector for the subject would look like this:
terms: twinkle, little, star
termFrequencies: twinkle[2], little[1], star[1]

and for the body:
terms: [above, diamond, high, how, like, little, sky, so, star, twinkle, up, what, wonder, world]
termFrequencies: above[1], diamond[1], high[1], how[2], like[1], little[2], sky[1], so[1], star[2], twinkle[4], up[1], what[2], wonder[2], world[1]

Meanwhile, back at the ranch… so we get a term vector for each of the fields available in the document we’re finding related items for. After we’ve got those, each of the term vectors is merged into a map: the key being the term and the value being the number of times the word was used in the document. The map is then handed to a method (called createQueue) that calculates a reasonably complex score for each word in the map. Since this is really the meat of the entire class, I’ll step through it in a little more detail and season the meat with data from the index on my blog.
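To make the merge step concrete, here’s a rough sketch using the Lucene term vector API. This isn’t the actual MoreLikeThis source and the variable names are mine (the real class also falls back to re-analyzing the field text when no term vector was stored), but the idea is the same:

Map<String, Integer> termFreqMap = new HashMap<String, Integer>();
for (String fieldName : fieldNames) {
    // a term vector is a pair of parallel arrays: the distinct terms
    // in the field and the number of times each one was used
    TermFreqVector vector = reader.getTermFreqVector(docNum, fieldName);
    if (vector == null) {
        continue; // no term vector stored for this field
    }
    String[] terms = vector.getTerms();
    int[] freqs = vector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        Integer count = termFreqMap.get(terms[i]);
        termFreqMap.put(terms[i], count == null ? freqs[i] : count + freqs[i]);
    }
}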

The createQueue method first retrieves the total number of documents that exist in the index that we’re dealing with.

int numDocs = ir.numDocs();

In the index I maintain for my personal blog, there are 998 documents. The second thing it does is create an instance of a class called FreqQ, which extends Lucene’s PriorityQueue class. This object maintains an array whose elements are ordered according to their score, which we’ll cover in a second.
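If you’re curious, a FreqQ-style queue only takes a few lines. Here’s a sketch against Lucene’s PriorityQueue; the entry layout (an Object[] with the score at index 2) is my assumption, not necessarily the exact layout in the source:

class FreqQ extends org.apache.lucene.util.PriorityQueue {
    FreqQ(int size) {
        initialize(size); // sizes the queue's internal array
    }
    // reverse the usual ordering so the highest-scoring entry sits at the
    // head of the queue and is popped first when the query gets built
    protected boolean lessThan(Object a, Object b) {
        Float scoreA = (Float) ((Object[]) a)[2];
        Float scoreB = (Float) ((Object[]) b)[2];
        return scoreA.floatValue() > scoreB.floatValue();
    }
}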

Now the createQueue method iterates over each word in the term frequency map, throwing out words that don’t occur enough times (see setMinTermFreq) and then testing to find out which field across the entire Lucene index contains the term the most. Next it calculates the Inverse Document Frequency, which, according to the JavaDoc, is:

… a score factor based on a term’s document frequency (the number of documents which contain the term)… Terms that occur in fewer documents are better indicators of topic, so implementations of this method usually return larger values for rare terms, and smaller values for common terms.

and finally, a score, which is a product of the IDF score and the number of times the word existed in the source document.
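Putting those three steps together, the body of createQueue works roughly like this (again a sketch, not the actual source; topField and topDocFreq are hypothetical helpers standing in for the per-field bookkeeping, and similarity is a DefaultSimilarity instance):

for (Map.Entry<String, Integer> entry : termFreqMap.entrySet()) {
    String word = entry.getKey();
    int tf = entry.getValue().intValue(); // times the word appears in the source document
    if (minTermFreq > 0 && tf < minTermFreq) {
        continue; // throw out words that don't occur enough times
    }
    String topField = topField(word);     // hypothetical: field containing the term the most
    int docFreq = topDocFreq(word);       // hypothetical: document frequency in that field
    float idf = similarity.idf(docFreq, numDocs); // rarer terms score higher
    float score = tf * idf;
    queue.insert(new Object[]{word, topField, new Float(score)});
}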

Again, I’m guessing you’re getting really sleepy, so here’s some real data to look at. Like I mentioned above, I have 998 documents in my Lucene index, and if we take a document like this one, you’ll end up with a term vector that looks like this:

a[20], you[20], pre[18], the[17], to[13], of[11], this[10], username[10], column[9], oracle[9], and[8], if[8], null[8], alter[7], appuser[7], into[7], on[7], that[7], href[6], http[6], i[6], sql[6], an[5], com[5], is[5], not[5], password[5], table[5], values[5], varchar[5], www[5], be[4], but[4], case[4], clearspace[4], empty[4], get[4], insert[4], modify[4], which[4], administrator[3], are[3], can[3], database[3], db[3], in[3], jivesoftware[3], mysql[3], nullability[3], server[3], string[3], t[3], thing[3], with[3], about[2], all[2], assume[2], been[2], blogged[2], d[2], different[2], feature[2], for[2], from[2], has[2], have[2], hsqldb[2], jive[2], make[2], modified[2], nvarchar[2], out[2], plan[2], postgres[2], products[2], ran[2], requirements[2], sensitivity[2], so[2], space[2], store[2], strike[2], support[2], where[2], will[2], above[1], again[1], against[1], already[1], andrew[1], any[1], appears[1], application[1], at[1], attempt[1], bennett[1], bit[1], blog[1], borrowed[1], both[1], bottom[1], by[1], cannot[1], changing[1], chars[1], classes[1], cmu[1], code[1], columns[1], consider[1], contrib[1], converted[1], cool[1], couple[1], crazy[1], day[1], decided[1], default[1], detail[1], didn[1], discuss[1], do[1], doing[1], edu[1], error[1], example[1], finally[1], first[1], flushed[1], forums[1], further[1], going[1], good[1], goodness[1], heads[1], helpful[1], his[1], insensitive[1], instead[1], interesting[1], involved[1], issue[1], issues[1], it[1], itself[1], jist[1], job[1], jsp[1], kb[1], kind[1], lately[1], least[1], like[1], line[1], ll[1], look[1], looks[1], lot[1], mcelwee[1], microsoft[1], more[1], my[1], needed[1], nice[1], no[1], notice[1], nugget[1], number[1], only[1], or[1], ora[1], over[1], page[1], past[1], platforms[1], probably[1], product[1], re[1], readers[1], reading[1], reason[1], respect[1], results[1], retrieve[1], ridiculous[1], sample[1], second[1], see[1], select[1], semicolon[1], sensitive[1], servers[1], shadow[1], similar[1], six[1], software[1], something[1], specific[1], specification[1], specify[1], standard[1], statement[1], supporting[1], sure[1], system[1], tails[1], than[1], then[1], there[1], they[1], things[1], those[1], thought[1], thunderguy[1], tolowercase[1], touppercase[1], tried[1], try[1], two[1], txt[1], unless[1], ups[1], used[1], user[1], variety[1], ve[1], very[1], want[1], warned[1], was[1], we[1], week[1], whatever[1], wonderful[1], word[1], work[1], write[1], yes[1], your[1], yourself[1]

Stop words get stripped out, as do words that occurred fewer than two times, so you’d be left with a term frequency map that looks like this:

pre[18], username[10], column[9], oracle[9], alter[7], appuser[7], href[6], http[6], sql[6], com[5], password[5], table[5], values[5], varchar[5], www[5], case[4], clearspace[4], empty[4], get[4], insert[4], modify[4], which[4], administrator[3], can[3], database[3], db[3], jivesoftware[3], mysql[3], nullability[3], server[3], string[3], thing[3], about[2], all[2], assume[2], been[2], blogged[2], d[2], different[2], feature[2], from[2], has[2], have[2], hsqldb[2], jive[2], make[2], modified[2], nvarchar[2], out[2], plan[2], postgres[2], products[2], ran[2], requirements[2], sensitivity[2], so[2], space[2], store[2], strike[2], support[2], where[2]

So for each of these terms, Lucene finds the field that contains the most instances of the given term and then calculates the idf value and the score. The default implementation of the idf value looks like this:

return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);

So given the term frequency map above and the document frequency noted below, we get the following idf values and scores for the most popular terms in this document:

Term       Instances in Document   Documents Matching Term   IDF Value   Score
pre        18                      26                        4.609916    82.978
username   10                      23                        4.7276993   47.276
column     9                       13                        5.266696    47.400264
oracle     9                       8                         5.7085285   51.376
alter      7                       1                         7.212606    50.488
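If you want to check the math, take ‘pre’: it appears in 26 of the 998 documents, so docFreq is 26 and idf = log(998 / (26 + 1)) + 1 = log(36.96) + 1 ≈ 3.61 + 1 ≈ 4.61 (note that Math.log is the natural logarithm). The score is then the term’s frequency in the source document times its idf: 18 × 4.609916 ≈ 82.978.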

Finally, after all the terms have been added to the priority queue, we create a Lucene query, looping over the first 25 terms in the queue (this is the default and can be changed via setMaxQueryTerms):

TermQuery tq = new TermQuery(new Term("body", "oracle"));

and optionally boosting each term according to the score:

tq.setBoost(51.376f / 82.978f);
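Each term’s boost is its score divided by the best (highest) score in the queue, so the top term effectively gets a boost of 1. Assembled, the loop looks roughly like this sketch (not the actual source; the terms and scores come from the table above):

BooleanQuery query = new BooleanQuery();
float bestScore = 82.978f; // score of "pre", the first term popped off the queue
String[] terms = {"pre", "username", "column", "oracle"};
float[] scores = {82.978f, 47.276f, 47.400264f, 51.376f};
for (int i = 0; i < terms.length; i++) {
    TermQuery tq = new TermQuery(new Term("body", terms[i]));
    tq.setBoost(scores[i] / bestScore);        // normalize by the best score
    query.add(tq, BooleanClause.Occur.SHOULD); // any of the terms may match
}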

The resulting Lucene query (in string format) looks something like this:

body:pre body:username^.56974 body:column^.57123 body:oracle^.61915 ...

Now I’ll be curious to see what the related item is for this post!

I hope you learned as much as I did. If you’re extremely curious about this whole Lucene thing and you like the in-depth stuff, you should definitely check out Luke, it’ll open a whole new world to you.

33 thoughts on “How MoreLikeThis Works in Lucene”

  1. Nice post… covers the “MoreLikeThis” class in great depth…

    Can you please tell me how efficient this class is, and does it scale well to a large number of documents/terms?

  2. Thank you for this post.

    In reply to Prafulla, MoreLikeThis scales extremely well:
    a query of a 150-word document against an index containing about 10 million documents of similar size returns an answer within 5 seconds (depending on your settings).
    And it works extremely well. The results returned are very related to the query.

  3. One of the problems I have right now with this function is the following:
    when I do a search using MoreLikeThis, the best hit is returned with a score of 1. However, I would like to get a score so that the query would be scored 1 and everything else less than that.
    The only solution I have is to add the query to the index. It will be returned as the best match scored 1. But there must be a better way. Any thoughts?

    Thank you

  4. Hello,

    Very interesting article. I do have one question though. I understand how the IDF is calculated but you do not mention how the Score is.

    Can you please clarify? Thank you for your time.

  5. No need to clarify; figured it out

    You multiply the number of instances of the term in the document by the IDF value.

  6. Hi Aaron,

    That is a very nice article,

    But I am still a little lost. I’ve got a challenge like this:

    I have a Lucene index with 300 docs, text docs of almost 20 pages each. I need to check whether a new doc, also of about 20 pages, already has its content in my Lucene index, which would mean this new doc could be a duplicate.

    That is my problem: how do I build code using MoreLikeThis and Lucene search to check whether this new doc has duplicated content? I don’t want to allow documents with the same content to be inserted into my index.

    How can it be done using what you show here?
    Have you done it before?

    Thanks in advance

    Wilson

  7. Great article!
    When using like(numDoc) you get better matches, as information about fields, etc. can be used – correct?
    Any way of getting the same accuracy when using like(reader) without having to add the target document to the index?

  8. Very good post… I have one problem with the Lucene searcher. If you could, please help me.
    I’m getting an ArrayIndexOutOfBoundsException when I do a multi-field query, e.g.
    title:one subject:two
    The exception arises in the TermScorer.score() method.

  9. Hi,
    I am trying out the method mlt.retrieveInterestingTerms(reader) but I am always getting zero results. Any pointers to good documentation on how to do this?
    Thanks.

  10. Hey,
    Thank your post!
    I have a problem and need some helps:
    – I have a string, for example: “London is a city in England…”
    – And I want to represent that string as a weight vector (using tf-idf); can this be done in Lucene?
    Thanks.

  11. Aaron

    This is a great post. I am working on the More Like This feature and have to customize it for some purposes and this was the best resource to understand how MLT works.

    However, I had a few questions.

    1. Can I use terms from one field (say f1) and generate an MLT query with this term in a different field (say f2)?

    2. I want the MLT query to run on a subset of the index and not the entire index (specifically the more recent documents; there is a date field in the schema). How do I customize MLT to do that?

    3. Instead of stop words, can I have a RegEx elimination of terms?

    Do I need to edit the source code for this, or would writing a plug-in do? I am new to Java and even newer to Solr, so it would be great if you could give a detailed reply.

    Thanks a lot.

  12. Pingback: The Cogworks Ltd › Similarity new uComponent datatype
  13. Hi,

    sorry I’m thick.

    I have tried to follow the workings of calculating the IDF for “pre” with a body of 998 documents and the table as you described it, and I simply can’t follow the numbers.

    Would you please be kind enough to show the workings, step by step, on how to calculate the IDF for the first term (pre), and also the score?
