Category Archives: Lucene

Why create jSearch?

One of the comments posted to the blog entry introducing jSearch asked why I thought it needed to be created when a tool like Nutch already exists. Nutch is a massive undertaking; its aim is to create a spider and search engine capable of spidering, indexing and searching billions of web pages while also providing a close shave and making breakfast. Nutch wants to be an open source version of Google. I created jSearch to be a smaller version of Google, indexing single websites or even subsections of a website; more like a departmental or corporate spidering, indexing and searching system. If you download the application, you’ll see that jSearch provides some of the same functionality that Google does: cached copies of web pages, an XML API (using REST instead of SOAP), logging and reporting of searches, and content summarization. Sure, you could use the Google web API to provide the same search on your own site, but then you’re limited to the number of searches that Google allows per day with the API (1000), you’re making calls over your WAN to retrieve search results, and you have less control (i.e. you couldn’t have Google index your intranet unless you purchased their appliance).

The second reason I created jSearch was that it was and is an interesting problem to work on. I now have a unique appreciation for the problems that Google (or any other company that has created a spider and search engine) has faced. Writing a spider is not a trivial task. Creating a 2 or 3 sentence summary of an HTML page (technically called ‘Text Summarization’) is a topic for a master’s thesis. And putting a project like this together becomes a study of the various frameworks for search (Lucene), persistence (Hibernate), and web application development (Struts), which is software engineering.

And really, why not? I enjoyed it. It was interesting and I learned something along the way and I plan on using it.

Wanted: Extracting summary from HTML text

As part of a project I’m working on I need to extract content from an HTML page, in some sense creating a short 200 character summary of the document. Google does a fantastic job of extracting text and presenting a summary of the document in their search listings, and I’m wondering how they do that. Here’s the process I’m using right now:

a) Remove all of the HTML comments from the page (i.e. <!-- -->) because JavaScript is sometimes inside comments, and it sometimes includes > and/or <, which causes (d) to fail

b) Remove everything above the <body> tag, because there isn’t anything valuable there anyway.

c) Remove all the <a href… > tags, because text links are usually navigation and are repeated across a site… they’re noise and I don’t want them. However, sometimes links are part of the summary of a document… removing a link in the first paragraph of a document can render the paragraph unreadable, or at least incomplete.

d) Remove all the HTML tags, the line breaks, the tabs, etc. using a regular expression.
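A minimal sketch of the four steps above in Java, assuming regular expressions are good enough for the job. The class name HtmlSummarizer, the method name and the patterns are my own illustration, not the code jSearch actually uses:

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class, method and patterns are illustrative assumptions.
class HtmlSummarizer {
    // Step a: HTML comments, including any script hidden inside them
    private static final Pattern COMMENTS =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Step b: everything up to and including the opening <body> tag
    private static final Pattern HEAD =
            Pattern.compile("^.*?<body[^>]*>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    // Step c: anchor elements together with their link text
    private static final Pattern LINKS =
            Pattern.compile("<a\\s[^>]*>.*?</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    // Final step: any remaining tag, then runs of whitespace (line breaks, tabs)
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String summarize(String html, int maxChars) {
        String text = COMMENTS.matcher(html).replaceAll(" ");
        text = HEAD.matcher(text).replaceFirst(" ");
        text = LINKS.matcher(text).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        text = WHITESPACE.matcher(text).replaceAll(" ").trim();
        // Truncate to the requested summary length
        return text.length() <= maxChars ? text : text.substring(0, maxChars);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>T</title></head><body><!-- x > y -->"
                + "<a href=\"/home\">Home</a><p>Fenway Park opened\n in 1912.</p></body></html>";
        System.out.println(summarize(html, 200));
    }
}
```

The order matters: if the comments were not removed first, the > inside the comment would end the tag pattern early and leave part of the comment in the summary.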

For the most part, the above 4 steps do the job, but in some cases not. I’ll go out on a limb and say that most HTML documents contain text that is repeated throughout the site again and again (header text like Login Now! or footer text like copyright 2004, etc…). My problem is that I want to somehow locate the snippets that are repeated and not include them in the summaries I create… For example, on Google do this search and then check out the second result:

Fenway Park. … Fenway Park opened on April 20, 1912, the same day as Detroit’s Tiger Stadium and before any of the other existing big league parks. …

That text is about 1/4 of the way down in the document. How do they extract that?
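For the repeated header and footer text, one idea (a sketch only, and certainly not how Google does it) is to count how many pages of the site each text fragment appears on and drop the fragments that show up on too many of them. The class name BoilerplateFilter, the way pages are split into fragments and the maxPages threshold are all assumptions of mine, and the second page below is made-up sample data:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; names and the threshold are illustrative assumptions.
class BoilerplateFilter {
    // Counts, for each fragment, how many pages contain it at least once.
    static Map<String, Integer> pageFrequency(List<List<String>> pages) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> page : pages) {
            Set<String> seen = new HashSet<>(page); // count each fragment once per page
            for (String fragment : seen) {
                counts.merge(fragment, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keeps the fragments of one page that occur on at most maxPages pages.
    static List<String> uniqueFragments(List<String> page,
                                        Map<String, Integer> counts,
                                        int maxPages) {
        List<String> kept = new ArrayList<>();
        for (String fragment : page) {
            if (counts.getOrDefault(fragment, 0) <= maxPages) {
                kept.add(fragment);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> site = Arrays.asList(
                Arrays.asList("Login Now!", "Fenway Park opened on April 20, 1912.", "copyright 2004"),
                Arrays.asList("Login Now!", "Tickets go on sale Monday.", "copyright 2004"));
        Map<String, Integer> counts = pageFrequency(site);
        // Only the fragment that is unique to the first page survives
        System.out.println(uniqueFragments(site.get(0), counts, 1));
    }
}
```

This only works once several pages from the same site have been spidered, and it needs exact matches, so a footer that changes slightly from page to page would slip through.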

Parameters: a) I don’t know anything about the documents that I’m analyzing; they could be valid XHTML or garbled HTML from 1996, b) it doesn’t have to be extremely fast, c) I’m using Java (if that matters), d) I’ve tried using the org.apache.lucene.demo.html.HTMLParser class, which has a method getSummary(), but it doesn’t work for me (nothing is ever returned).

Any and all ideas would be appreciated!

Indexing Database Content with Lucene & ColdFusion

Terry emailed me a couple days ago wondering how he could use ColdFusion and Lucene to index and then search a database table. Since we’re completely socked in here in Boston, I had nothing better to do today than hack together a quick snippet that does just that:

<cfset an = CreateObject("java", "org.apache.lucene.analysis.StopAnalyzer")>
<cfset an.init()>
<cfset writer = CreateObject("java", "org.apache.lucene.index.IndexWriter")>
<cfset writer.init("C:\mysite\index\", an, "true")>
<cfquery name="contentIndex" datasource="sample">
select label, description, id
FROM product
</cfquery>
<cfloop query="contentIndex">
  <cfset d = CreateObject("java", "org.apache.lucene.document.Document")>
  <cfset fld = CreateObject("java", "org.apache.lucene.document.Field")>
  <cfset content = contentIndex.description>
  <cfset title = contentIndex.label>
  <cfset urlpath = "/products/detail.cfm?id=" & contentIndex.id>
  <cfset d.add(fld.Keyword("url", urlpath))>
  <cfset d.add(fld.Text("title", title))>
  <cfset d.add(fld.UnIndexed("summary", content))>
  <cfset d.add(fld.UnStored("body", content))>
  <cfset writer.addDocument(d)>
</cfloop>  
<cfset writer.close()>

The only real change from the code that I wrote previously to index a document is that instead of looping over the file system looking for documents, I loop over a query and index the text of a database column rather than the text of a document. (I would have written it in CFScript, but you can’t do queries from CFScript yet, unless you use a UDF to do the query.)

You can download the source here, if you’re so inclined.

QueryParser … in NLucene

Misleading title. I implemented the first of the examples that Erik Hatcher used in his article about the Lucene QueryParser, only I used NLucene. Lucene and NLucene are very similar, so if anything, it’s interesting only because it highlights a couple of the differences between C# and Java.

First, here’s the Java example taken directly from Erik’s article:

public static void search(File indexDir, String q) {
  Directory fsDir = FSDirectory.getDirectory(indexDir, false);
  IndexSearcher is = new IndexSearcher(fsDir);
  Query query = QueryParser.parse(q, "contents", new StandardAnalyzer());
  Hits hits = is.search(query);
  System.out.println("Found " + hits.length() +
    " document(s) that matched query '" + q + "':");
  for (int i = 0; i < hits.length(); i++) {
    // ...
  }
}

The NLucene version looks eerily similar:

public static void Search(DirectoryInfo indexDir, string q) {
  DotnetPark.NLucene.Store.Directory fsDir = FsDirectory.GetDirectory(indexDir, false);
  IndexSearcher searcher = new IndexSearcher(fsDir);
  Query query = QueryParser.Parse(q, "contents", new StandardAnalyzer());
  Hits hits = searcher.Search(query);
  Console.WriteLine("Found " + hits.Length +
    " document(s) that matched query '" + q + "':");
  for (int i = 0; i < hits.Length; i++) {
    // ...
  }
}

The differences are mainly syntax.

First, Erik used the variable name 'is' for his IndexSearcher. In C# 'is' is a keyword, so I switched the variable name to 'searcher'. If you're really geeky, you might want to brush up on all the Java keywords and the C# keywords.

Second, while Java uses the File class to describe directories and files, the .NET Framework uses the DirectoryInfo class.

Third, Java programmers are encouraged to capitalize class names and use camelCase notation for method and variable names, while C# programmers are encouraged to use Pascal notation for methods and camelCase for variables, so I switched the static method name from 'search' to 'Search'.

Next, 'Directory' is a system class, so the reference to the NLucene directory needed to be fully qualified:

DotnetPark.NLucene.Store.Directory fsDir = FsDirectory.GetDirectory(indexDir, false);

rather than this:

Directory fsDir = FsDirectory.GetDirectory(indexDir, false);

Finally, the Hits class contains a couple of differences. Java programmers use the length() method on a variety of classes, so it made sense for the Java version to use a length() method as well. C# has the idea of a property, which is nothing more than syntactic sugar that allows the API developer to encapsulate the implementation of a variable but allows access to it as if it were a public field. The end result is that instead of writing:

for (int i = 0; i < hits.length(); i++)

in Java, you'd use this in C#:

for (int i = 0; i < hits.Length; i++)

The authors of NLucene also decided to use the C# indexer functionality (which I wrote about a couple days ago) so that an instance of the Hits class can be accessed as if it were an array:

Document doc = hits[i].Document;
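For comparison, Java has neither properties nor indexers, so the same loop goes through ordinary method calls. Here's a minimal sketch using a stand-in class of my own (SimpleHits is hypothetical and is not Lucene's Hits class; it only mirrors the length() and doc(i) method shape, and it holds plain strings instead of Document objects):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a search result list; not Lucene's Hits class.
class SimpleHits {
    private final List<String> docs;

    SimpleHits(List<String> docs) {
        this.docs = docs;
    }

    // Java exposes the count through a method rather than a property.
    int length() {
        return docs.size();
    }

    // Java has no indexers, so element access is an ordinary method call.
    String doc(int i) {
        return docs.get(i);
    }

    public static void main(String[] args) {
        SimpleHits hits = new SimpleHits(Arrays.asList("a.txt", "b.txt"));
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i));
        }
    }
}
```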

I put together a complete sample that you can download and compile yourself if you're interested in using NLucene. Download it here.

Lucene’s Query API

Erik Hatcher wrote an excellent article on the specifics of Lucene’s Query API, specifically on how the QueryParser class uses the Query subclasses including TermQuery, PhraseQuery, RangeQuery, WildcardQuery, PrefixQuery, FuzzyQuery and BooleanQuery. Very useful stuff.

Not surprisingly, he’s also writing a book on Lucene titled “Lucene in Action”, to be published by Manning.

Lucene Index Browser

From the lucene-user list today: Lucene Index Browser: Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display their contents in several ways:
· browse by document number, or by term
· view documents / copy to clipboard
· retrieve a ranked list of most frequent terms
· execute a search, and browse the results
· selectively delete documents from the index
and more…