{"id":874,"date":"2006-11-14T01:08:58","date_gmt":"2006-11-14T09:08:58","guid":{"rendered":"http:\/\/cephas.net\/blog\/2006\/11\/14\/using-lucene-and-morelikethis-to-show-related-content\/"},"modified":"2006-11-14T01:08:58","modified_gmt":"2006-11-14T09:08:58","slug":"using-lucene-and-morelikethis-to-show-related-content","status":"publish","type":"post","link":"https:\/\/cephas.net\/blog\/2006\/11\/14\/using-lucene-and-morelikethis-to-show-related-content\/","title":{"rendered":"Using Lucene and MoreLikeThis to show Related Content"},"content":{"rendered":"<p>If you read this blog, you probably paid a smidgen of attention to the <a href=\"http:\/\/www.web2con.com\/\">Web 2.0 Conference<\/a> held last week in San Francisco. <a href=\"http:\/\/sphere.wordpress.com\/\">Sphere<\/a> was one of the companies that presented and they launched a product called the &#8220;<a href=\"http:\/\/sphere.wordpress.com\/2006\/11\/12\/sphere-contextual-widget\/\">Sphere It Contextual Widget for blogs<\/a>&#8221;, which is a JavaScript widget you can add to your blog or content-focused site that displays contextually similar blogs and blog posts for the reader. I&#8217;ve always wanted to try to do something similar (no pun intended) using <a href=\"http:\/\/lucene.apache.org\/\">Lucene<\/a>, so I spent a couple hours this weekend banging around on it. <\/p>\n<p>The first step was to get my <a href=\"http:\/\/wordpress.org\/\">WordPress<\/a> content (which is stored in MySQL) into Lucene.  A couple lines of code later I had a Lucene index full of all 857 (as of 11\/14\/2006) posts including the blog post ID, subject, body, date and permalink.  
Next, I checked out and compiled the <a href=\"http:\/\/svn.apache.org\/repos\/asf\/lucene\/java\/trunk\/contrib\/similarity\/\">Lucene similarity contrib<\/a>, whose most important asset is the <a href=\"http:\/\/lucene.apache.org\/java\/docs\/api\/org\/apache\/lucene\/search\/similar\/MoreLikeThis.html\">MoreLikeThis<\/a> class (written in part by co-worker Bruce Ritchie).  You give an instance of MoreLikeThis a document to parse, an index to search, and the fields in the index you want to compare against that document, and then execute a Lucene search just like you normally would:<\/p>\n<pre>\r\nReader reader = ...;\r\nIndexReader index = IndexReader.open(indexfile);\r\nIndexSearcher searcher = new IndexSearcher(index);\r\nMoreLikeThis mlt = new MoreLikeThis(index);\r\nmlt.setFieldNames(new String[] {\"subject\", \"body\"});\r\nQuery query = mlt.like(reader);\r\nHits hits = searcher.search(query);\r\n<\/pre>\n<p>I&#8217;ll skip all the glue and say that I wired all this up into a servlet that spits out <a href=\"http:\/\/json.org\/\">JSON<\/a>:<\/p>\n<pre>\r\nMap entries = getRelatedEntries(postID, body);\r\nJSONObject json = JSONObject.fromObject( entries );\r\nresponse.setContentType(\"text\/javascript\");\r\nresponse.getWriter().write(\"Related = {}; Related.posts = \" + json.toString());\r\n<\/pre>\n<p>and then used client-side JavaScript and some PHP to put it all together:<\/p>\n<pre>\r\n&lt;h5&gt;Related Content&lt;\/h5&gt;\r\n&lt;script type=\"text\/javascript\"\r\n  src=\"http:\/\/cephas.net\/blog\/related.js?post=&lt;?php the_ID(); ?&gt;\"&gt;\r\n&lt;\/script&gt;\r\n&lt;script type=\"text\/javascript\"&gt;\r\nfor (var post in Related.posts) {\r\ndocument.write('&lt;li&gt;&lt;a href=\"' + Related.posts[post] + '\"&gt;' + post + '&lt;\/a&gt;&lt;\/li&gt;');\r\n}\r\n&lt;\/script&gt;\r\n<\/pre>\n<p>I&#8217;ve been cruising around the blog and so far, I think that MoreLikeThis works really well.  
For the most part, the posts that I would expect to be related are related. There are a couple of posts that seem to pop to the top of the &#8216;related content&#8217; feed that I&#8217;ll have to fix, and I would like to boost the terms in the subject of the original document, but other than that, I&#8217;m happy with it.  <\/p>\n<p>Back to Sphere, and specifically to <a href=\"http:\/\/radar.oreilly.com\/archives\/2006\/11\/spheres_blog_widget.html\">Brady&#8217;s post about it on the Radar blog<\/a>:<\/p>\n<blockquote><p>\nTop-Left Corner: Recent, similar blog posts from other blogs.<br \/>\nBottom-Left Corner: Recommended blogs that are selected by the site-owner. This is very handy for blog networks.<br \/>\nTop-Right Corner: Similar posts from that blog<br \/>\nBottom-Right Corner: Ad, currently served by FM Pub.\n<\/p><\/blockquote>\n<p>Given a week, I&#8217;m guessing that you could use the Google API to do the top-left corner, hardcode the content in the bottom left, use MoreLikeThis in the top right, and the bottom right you&#8217;d want to do yourself anyway.  So if you were a publisher looking for more page views, why would you even consider the Sphere widget?<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you read this blog, you probably paid a smidgen of attention to the Web 2.0 Conference held last week in San Francisco. 
Sphere was one of the companies that presented and they launched a product called the &#8220;Sphere It Contextual Widget for blogs&#8221;, which is a JavaScript widget you can add to your blog or &hellip; <a href=\"https:\/\/cephas.net\/blog\/2006\/11\/14\/using-lucene-and-morelikethis-to-show-related-content\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Using Lucene and MoreLikeThis to show Related Content<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[31,5,19,32],"tags":[],"_links":{"self":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts\/874"}],"collection":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/comments?post=874"}],"version-history":[{"count":0,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts\/874\/revisions"}],"wp:attachment":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/media?parent=874"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/categories?post=874"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/tags?post=874"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}