Aaron Johnson Now with 50% less caffeine!

Posted
6 December 2003 @ 10pm

Tagged
ColdFusion, Lucene

Indexing Database Content with Lucene & ColdFusion

Terry emailed me a couple of days ago wondering how he could use ColdFusion and Lucene to index and then search a database table. Since we’re completely socked in here in Boston, I had nothing better to do today than hack together a quick snippet that does just that:

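<!--- Analyzer that lowercases text and drops common English stop words --->
<cfset an = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer")>
<cfset an.init()>
<!--- The third argument ("true") creates a new index, replacing any existing one at that path --->
<cfset writer = CreateObject("java", "org.apache.lucene.index.IndexWriter")>
<cfset writer.init("C:\mysite\index\", an, "true")>
<!--- Pull the rows to index --->
<cfquery name="contentIndex" datasource="sample">
select label, description, id
FROM product
</cfquery>
<cfloop query="contentIndex">
  <!--- One Lucene Document per row --->
  <cfset d = CreateObject("java", "org.apache.lucene.document.Document")>
  <cfset fld = CreateObject("java", "org.apache.lucene.document.Field")>
  <cfset content = contentIndex.description>
  <cfset title = contentIndex.label>
  <cfset urlpath = "/products/detail.cfm?id=" & contentIndex.id>
  <!--- Keyword: stored and indexed as a single untokenized value --->
  <cfset d.add(fld.Keyword("url", urlpath))>
  <!--- Text: tokenized, indexed, and stored --->
  <cfset d.add(fld.Text("title", title))>
  <!--- UnIndexed: stored for display only, not searchable --->
  <cfset d.add(fld.UnIndexed("summary", content))>
  <!--- UnStored: tokenized and indexed, but not stored --->
  <cfset d.add(fld.UnStored("body", content))>
  <!--- Add the document built above (the variable is d) --->
  <cfset writer.addDocument(d)>
</cfloop>
<cfset writer.close()>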

The only real change from the code that I wrote previously to index a document is that instead of looping over the file system looking for documents, I loop over a query and index the text of a database column rather than the text of a document. (I would have written it in CFScript, but you can’t do queries from CFScript yet, unless you use a UDF to do the query.)
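
For the searching half, a minimal sketch along the same lines might look like the following. It assumes the same Lucene 1.x jar and index path as the snippet above; the search term "widget" is a made-up example, and the field names (url, title, body) are the ones written by the indexing code.

```cfml
<!--- Use the same analyzer that was used at index time --->
<cfset an = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer")>
<cfset an.init()>
<!--- Open the index written by the indexing snippet --->
<cfset searcher = CreateObject("java", "org.apache.lucene.search.IndexSearcher")>
<cfset searcher.init("C:\mysite\index\")>
<!--- QueryParser.parse(query, defaultField, analyzer) is a static method in Lucene 1.x --->
<cfset parser = CreateObject("java", "org.apache.lucene.queryParser.QueryParser")>
<!--- "widget" is a hypothetical search term; "body" is the default field to search --->
<cfset q = parser.parse("widget", "body", an)>
<cfset hits = searcher.search(q)>
<!--- Hits is zero-based; the loop runs zero times when nothing matched --->
<cfloop from="0" to="#hits.length() - 1#" index="i">
  <cfset hit = hits.doc(JavaCast("int", i))>
  <cfset hitUrl = hit.get("url")>
  <cfset hitTitle = hit.get("title")>
  <cfoutput><a href="#hitUrl#">#HTMLEditFormat(hitTitle)#</a><br></cfoutput>
</cfloop>
<cfset searcher.close()>
```

Because body was added with UnStored, it can be searched but not read back from a hit, which is why the sketch displays the stored url and title fields instead.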

You can download the source here, if you’re so inclined.


9 Comments

Posted by
Raymond Camden
8 December 2003 @ 9am

AJ, do the Document and Field objects have any kind of “clear” method? If so, you would probably get a huge performance boost by moving the createObject calls outside of the loop. For a big query I bet this code is a bit slow. (Although that isn’t a big deal if you index on a timed basis.)


Posted by
Aaron Johnson
17 December 2003 @ 9pm

hey Ray

> AJ, do the Document and Field objects have any kind of “clear”
> method? If so, you would probably get a huge performance boost by
> moving the createObject calls outside of the loop. For a big
> query I bet this code is a bit slow. (Although that isn’t a big
> deal if you index on a timed basis.)
The createObject call for Field can definitely be moved out of the loop because it’s only being used for the static methods that live on it, so this:

<cfloop query="contentIndex">
  <cfset d = CreateObject("java", "org.apache.lucene.document.Document")>
  <cfset fld = CreateObject("java", "org.apache.lucene.document.Field")>
  <cfset content = contentIndex.description>
….
</cfloop>

can be changed to this:

<cfset fld = CreateObject("java", "org.apache.lucene.document.Field")>
<cfloop query="contentIndex">
  <cfset d = CreateObject("java", "org.apache.lucene.document.Document")>

</cfloop>

They don’t have clear() methods so I think you’re stuck with creating the Document object each time, unfortunately.

Great points! Thanks Ray!

AJ


Posted by
Erik Giberti
18 December 2003 @ 4pm

I recently wrote a ColdFusion component that will let you do something similar on the ColdFusion server instead. I haven’t tested it extensively, but it’s doing the job for me.

I invoke the .get_any() method, which returns a two-element array (the schema and the data). Next I invoke the .toString() method of each element, parse the XML to pull out the data, and insert it into my new local queryResult. It’s pretty handy and doesn’t take too much time to run. Feel free to check it out.


Posted by
Richard Preston
22 September 2004 @ 1pm

Hi Aaron,

I’m getting ready to use Lucene to search some fields that are going to be stored in a database, but I’m thinking that we can build the index incrementally by adding each group of fields to the index before they are added to the database and removing them from the index when they are removed from the database. That way the search doesn’t require any database access.

Is there anything wrong with this approach?


Posted by
Grover Fields
21 December 2004 @ 5am

How do you search on the DB Index? What will the name of the index be?

I purchased the “Lindex” utility but would like to search the DB content.

Thanks again & I picked up a .NET and Java book.


Posted by
vishnu
31 May 2005 @ 1am

Where and how do I set the Lucene path for the jar file?

I am getting the following error:
Class not found: org.apache.lucene.analysis.StopAnalyzer

The error occurred in C:\cwhb\Myriad\RecordSheets\indexing_database.cfm: line 1

1 :
2 :
3 : <cfset writer = CreateObject("java", "org.apache.lucene.index.IndexWriter


Posted by
Joseph Lamoree
13 November 2005 @ 2pm

Hello Aaron.

About a year ago, I read your post about indexing database content with Lucene. I wanted to do more with that, but couldn’t find the time. Recently I did invest a few days’ effort into making a project called CFLucene, not because I suddenly have lots of spare time, but because I got fed up with not having a CFMX search solution on Mac OS X.

You might be interested in the way that I put my CFCs together. Feel free to comment on the design.


Posted by
shreeya
18 December 2005 @ 1am

hi aaron..
I am trying to implement Lucene to search database tables using JSP.

Can you help me with how to do this?

shreeya


Posted by
Aliama
22 December 2005 @ 12pm

Thank you very much. I want to use Lucene in an ASP.NET project, but I couldn’t find a method for working with a DB table these past few days. I get it now, although I don’t know Java.

