Indexing Database Content with Lucene & ColdFusion

Terry emailed me a couple of days ago wondering how he could use ColdFusion and Lucene to index and then search a database table. Since we’re completely socked in here in Boston, I had nothing better to do today than hack together a quick snippet that does just that:

<!--- Create the analyzer and open an IndexWriter; the last argument tells Lucene to create a new index --->
<cfset an = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer")>
<cfset an.init()>
<cfset writer = CreateObject("java", "org.apache.lucene.index.IndexWriter")>
<cfset writer.init("C:\mysite\index\", an, "true")>
<!--- Fetch the rows to index --->
<cfquery name="contentIndex" datasource="sample">
SELECT label, description, id
FROM product
</cfquery>
<!--- Build one Lucene document per row and add it to the index --->
<cfloop query="contentIndex">
  <cfset d = CreateObject("java", "org.apache.lucene.document.Document")>
  <cfset fld = CreateObject("java", "org.apache.lucene.document.Field")>
  <cfset content = contentIndex.description>
  <cfset title = contentIndex.label>
  <cfset urlpath = "/products/detail.cfm?id=" & contentIndex.id>
  <cfset d.add(fld.Keyword("url", urlpath))>
  <cfset d.add(fld.Text("title", title))>
  <cfset d.add(fld.UnIndexed("summary", content))>
  <cfset d.add(fld.UnStored("body", content))>
  <cfset writer.addDocument(d)>
</cfloop>
<!--- Close the writer so the index is flushed to disk --->
<cfset writer.close()>

The only real change from the code that I wrote previously to index a document is that instead of looping over the file system looking for documents, I loop over a query and index the text of a database column rather than the text of a document. (I would have written it in CFScript, but you can’t do queries from CFScript yet unless you use a UDF to do the query.)
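
For anyone who would rather stay in CFScript, here is a rough sketch of the UDF approach mentioned above. The function name getProductRows and its dsn argument are made up for illustration; the Lucene calls are the same ones used in the tag-based snippet, and this version has not been run against a real index.

```cfml
<!--- Hypothetical UDF: wraps the product query so CFScript can call it --->
<cffunction name="getProductRows" returntype="query" output="false">
  <cfargument name="dsn" type="string" required="true">
  <cfset var rows = "">
  <cfquery name="rows" datasource="#arguments.dsn#">
    SELECT label, description, id
    FROM product
  </cfquery>
  <cfreturn rows>
</cffunction>

<cfscript>
  // Same analyzer and writer setup as the tag-based version
  an = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer");
  an.init();
  writer = CreateObject("java", "org.apache.lucene.index.IndexWriter");
  writer.init("C:\mysite\index\", an, "true");
  // Field is only used for its static factory methods, so create it once
  fld = CreateObject("java", "org.apache.lucene.document.Field");

  contentIndex = getProductRows("sample");
  for (i = 1; i lte contentIndex.recordCount; i = i + 1) {
    d = CreateObject("java", "org.apache.lucene.document.Document");
    urlpath = "/products/detail.cfm?id=" & contentIndex.id[i];
    d.add(fld.Keyword("url", urlpath));
    d.add(fld.Text("title", contentIndex.label[i]));
    d.add(fld.UnIndexed("summary", contentIndex.description[i]));
    d.add(fld.UnStored("body", contentIndex.description[i]));
    writer.addDocument(d);
  }
  writer.close();
</cfscript>
```

The query still runs in a tag, but it is tucked away in the function, so the indexing loop itself can live in script.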

You can download the source here, if you’re so inclined.
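
The snippet above only covers the indexing half of Terry's question. Searching the resulting index might look roughly like the sketch below. It assumes the Lucene 1.x API (the same generation that provides the Field.Keyword and Field.Text factories used above), and the search term "widget" is just a placeholder. Treat it as a starting point rather than tested code.

```cfml
<!--- Sketch only: assumes the Lucene 1.x API and the index built above --->
<cfset an = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer")>
<cfset an.init()>
<cfset searcher = CreateObject("java", "org.apache.lucene.search.IndexSearcher")>
<cfset searcher.init("C:\mysite\index\")>
<!--- QueryParser.parse is static in Lucene 1.x, so no init() call is needed --->
<cfset parser = CreateObject("java", "org.apache.lucene.queryParser.QueryParser")>
<!--- "widget" is a placeholder term; "body" is the default field to search --->
<cfset q = parser.parse("widget", "body", an)>
<cfset hits = searcher.search(q)>
<cfoutput>
<!--- Hits are zero-indexed; the loop runs zero times when nothing matches --->
<cfloop from="0" to="#hits.length() - 1#" index="i">
  <cfset hit = hits.doc(JavaCast("int", i))>
  <a href="#hit.get('url')#">#hit.get('title')#</a><br>
</cfloop>
</cfoutput>
<cfset searcher.close()>
```

Only the stored fields (url, title and summary) can be read back from a hit. The body field was added with UnStored, so it is searchable but not retrievable.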

9 thoughts on “Indexing Database Content with Lucene & ColdFusion”

  1. AJ, do the Document and Field objects have any kind of “clear” method? If so, you would probably get a huge performance boost by moving the createObject calls outside of the loop. For a big query I bet this code is a bit slow. (Although that isn’t a big deal if you index on a timed basis.)

  2. hey Ray

    > AJ, do the Document and Field objects have any kind of “clear”
    > method? If so, you would probably get a huge performance boost by
    > moving the createObject calls outside of the loop. For a big
    > query I bet this code is a bit slow. (Although that isn’t a big
    > deal if you index on a timed basis.)
    The createObject call for Field can definitely be moved out of the loop because it’s only being used for the static methods that live on it, so this:

    <cfloop query="contentIndex">
      <cfset d = CreateObject("java", "org.apache.lucene.document.Document")>
      <cfset fld = CreateObject("java", "org.apache.lucene.document.Field")>
      <cfset content = contentIndex.description>
    ….
    </cfloop>

    can be changed to this:

    <cfset fld = CreateObject("java", "org.apache.lucene.document.Field")>
    <cfloop query="contentIndex">
      <cfset d = CreateObject("java", "org.apache.lucene.document.Document")>

    </cfloop>

    They don’t have clear() methods so I think you’re stuck with creating the Document object each time, unfortunately.

    Great points! Thanks Ray!

    AJ

  3. I recently wrote a ColdFusion component that will let you do something similar on the ColdFusion server instead. I haven’t tested it extensively, but it’s doing the job for me.

    I invoke the .get_any() method, which returns a two-element array (the schema and the data). Next I invoke the .toString() method of each element and then simply parse through the XML to get the data out, inserting it into my new local queryResult. It’s pretty handy and doesn’t take too much time to run. Feel free to check it out.

  4. Hi Aaron,

    I’m getting ready to use Lucene to search some fields that are going to be stored in a database, but I’m thinking that we can build the index incrementally by adding each group of fields to the index before they are added to the database and removing them from the index when they are removed from the database. That way the search doesn’t require any database access.

    Is there anything wrong with this approach?

  5. How do you search on the DB Index? What will the name of the index be?

    I purchased the “Lindex” utility but would like to search the DB content.

    Thanks again & I picked up a .NET and Java book.

  6. Where and how do I set the path for the Lucene jar file?

    I am getting the following error:
    Class not found: org.apache.lucene.analysis.StopAnalyzer

    The error occurred in C:\cwhb\Myriad\RecordSheets\indexing_database.cfm: line 1

    1 :
    2 :
    3 : <cfset writer = CreateObject("java", "org.apache.lucene.index.IndexWriter

  7. Hello Aaron.

    About a year ago, I read your post about indexing database content with Lucene. I wanted to do more with that, but couldn’t find the time. Recently I did invest a few days effort into making a project called CFLucene — not because I suddenly have lots of spare time, but because I got fed up with not having a CFMX search solution on Mac OS X.

    You might be interested in the way that I put my CFCs together. Feel free to comment on the design.

  8. Hi Aaron,
    I am trying to implement Lucene to search database tables using JSP.

    Can you help me with how to do this?

    shreeya

  9. Thank you very much. I want to use Lucene in an ASP.NET project, but I couldn’t find a method for indexing a DB table until now. I get it now, although I don’t know Java.
