Category Archives: Software Development

Clearspace, Lucene, Software Development

How MoreLikeThis Works in Lucene

March 30, 2008 ajohnson 33 Comments

We created a ‘related items’ feature way back in Clearspace 1.0 (I mocked some of it out here just to prove that it worked) which shows related content based on the document, thread or blog post that you’re currently viewing. It was built using the MoreLikeThis class, a contribution made to the Lucene project by David Spencer, Mark Harwood and my favorite Canadian coworker Bruce Ritchie.

The really interesting thing about MoreLikeThis (at least to me) is that my first inclination was probably to think that Lucene goes and spends some cycles looking for content related to the source document: the focus being on the search and the other content. The reality is that the majority of the work is done by analyzing content the source document relative to the aggregate of all the other content that exists in the index. Said another way, MoreLikeThis doesn’t work by running some special search or query, it works by comparing the document that you’re asking about and the entire index as a whole (and then by running a query).

Anyway, in general I think it works pretty well, but for some odd reason, people keep asking the question “how does it work?” and since the documentation doesn’t go into much detail, I thought I’d try writing it up. So here goes nothing…

The first thing you need to do if you’re going to find related content is to tell Lucene what you want to find related content for. The MoreLikeThis class gives you five options: you can either tell it to look at an existing document in the Lucene index or let it try to parse an external resource from a file, an inputstream, a reader or a URL. In Clearspace we actually have a way of getting the Lucene document ID given the Clearspace object type and object ID, the code for our ‘related items’ feature looks something like this:

Query q = buildObjectTypeAndIDQuery(String.valueOf(messageID), JiveConstants.MESSAGE);
Hits hits = searcher.search(q);
...
int docNum = hits.id(0);
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[]{IndexField.subject.toString(), IndexField.body.toString(),
                    IndexField.tagID.toString()});
mlt.setMinWordLen(2);
mlt.setBoost(true);
q = mlt.like(docNum);

So in Clearspace we first fetch the Lucene document, construct a MoreLikeThis instance, tell it that we want to match on the subject, body and tag fields and we only want to calculate ‘relatedness’ based on words whose length is two characters or more. That’s the easy part, now it gets interesting.

Given that you already have a Lucene document, the MoreLikeThis instance loops over the field names (fields are kind of like the Lucene equivalent of database columns) that we specified (or all the field names available in the Lucene index if you don’t specify any) and retrieves a term vector for each of the fields in the document we’re analyzing. A term vector is a data structure that holds a list of all the words that were in the field and the number of times each word was used (excluding words that it considers to be ‘stop words’, you can see a list of common stop words here).

Since I’m sure some of you are tuning out at this point, let’s look at something concrete: assume you’ve created a document that looks like this (by the way, what a thorough wikipedia.org entry!):

Subject: Twinkle Twinkle Little Star
Body: Twinkle Twinkle little star, How I wonder what you are! Up above the world so high, Like a diamond in the sky, Twinkle twinkle little star, How I wonder what you are!

The term vector for the subject would look like this:
terms: twinkle, little, star
termFrequencies: twinkle[2], little[1], star[1]

and for the body:
terms: [above, diamond, high, how, like, little, sky, so, star, twinkle, up, what, wonder, world]
termFrequencies: above[1], diamond[1], high[1], how[2], like[1], little[2], sky[1], so[1], star[2], [twinkle[4], up[1], what[2], wonder[2], world[1]

Meanwhile, back at the ranch.. so we get a term vector for each of the fields available in the document we’re looking for related items for. After we’ve got those, each of the term vectors is merged into a map: the key being the term and the value being the number of times the word was used in the document. The map is then handed to a method (called createQueue) that calculates a reasonably complex score for each word in the map. Since this is really the meat of the entire class, I’ll step through this in a little more detail and I’ll season the meat with data from the index on my blog.

The createQueue method first retrieves the total number of documents that exist in the index that we’re dealing with.

int numDocs = ir.numDocs();

In the index I maintain for my personal blog, there are 998 documents in the index. The second thing it does is create an instance of a class called FreqQ, which extends the PriorityQueue class. This object will maintain an object array whose elements are ordered according to their score, which we’ll cover in a second.

Now the createQueue method iterates over each word in the term frequency map, throwing out words if they don’t occur enough times (see setMinTermFreq) and then testing to find out which field across the entire the Lucene index contains the term the most. Next it calculates the Inverse Document Frequency, which, according to the JavaDoc, is:

… a score factor based on a term’s document frequency (the number of documents which contain the term)… Terms that occur in fewer documents are better indicators of topic, so implementations of this method usually return larger values for rare terms, and smaller values for common terms.

and finally, a score, which is a product of the IDF score and the number of times the word existed in the source document.

Again, I’m guessing you’re getting really sleepy, here’s some real data to look at. Like I mentioned above, I have 998 documents in my Lucene index and if we take a document like this one, you’ll end up with a term vector that looks like this:

a[20], you[20], pre[18], the[17], to[13], of[11], this[10], username[10], column[9], oracle[9], and[8], if[8], null[8], alter[7], appuser[7], into[7], on[7], that[7], href[6], http[6], i[6], sql[6], an[5], com[5], is[5], not[5], password[5], table[5], values[5], varchar[5], www[5], be[4], but[4], case[4], clearspace[4], empty[4], get[4], insert[4], modify[4], which[4], administrator[3], are[3], can[3], database[3], db[3], in[3], jivesoftware[3], mysql[3], nullability[3], server[3], string[3], t[3], thing[3], with[3], about[2], all[2], assume[2], been[2], blogged[2], d[2], different[2], feature[2], for[2], from[2], has[2], have[2], hsqldb[2], jive[2], make[2], modified[2], nvarchar[2], out[2], plan[2], postgres[2], products[2], ran[2], requirements[2], sensitivity[2], so[2], space[2], store[2], strike[2], support[2], where[2], will[2], above[1], again[1], against[1], already[1], andrew[1], any[1], appears[1], application[1], at[1], attempt[1], bennett[1], bit[1], blog[1], borrowed[1], both[1], bottom[1], by[1], cannot[1], changing[1], chars[1], classes[1], cmu[1], code[1], columns[1], consider[1], contrib[1], converted[1], cool[1], couple[1], crazy[1], day[1], decided[1], default[1], detail[1], didn[1], discuss[1], do[1], doing[1], edu[1], error[1], example[1], finally[1], first[1], flushed[1], forums[1], further[1], going[1], good[1], goodness[1], heads[1], helpful[1], his[1], insensitive[1], instead[1], interesting[1], involved[1], issue[1], issues[1], it[1], itself[1], jist[1], job[1], jsp[1], kb[1], kind[1], lately[1], least[1], like[1], line[1], ll[1], look[1], looks[1], lot[1], mcelwee[1], microsoft[1], more[1], my[1], needed[1], nice[1], no[1], notice[1], nugget[1], number[1], only[1], or[1], ora[1], over[1], page[1], past[1], platforms[1], probably[1], product[1], re[1], readers[1], reading[1], reason[1], respect[1], results[1], retrieve[1], ridiculous[1], sample[1], second[1], see[1], select[1], semicolon[1], sensitive[1], servers[1], shadow[1], similar[1], six[1], software[1], something[1], specific[1], specification[1], specify[1], standard[1], statement[1], supporting[1], sure[1], system[1], tails[1], than[1], then[1], there[1], they[1], things[1], those[1], thought[1], thunderguy[1], tolowercase[1], touppercase[1], tried[1], try[1], two[1], txt[1], unless[1], ups[1], used[1], user[1], variety[1], ve[1], very[1], want[1], warned[1], was[1], we[1], week[1], whatever[1], wonderful[1], word[1], work[1], write[1], yes[1], your[1], yourself[1]

and stop words would get stripped out and words that occurred less than two times would get stripped out so you’d be left with a term frequency map that looked like this:

pre[18], username[10], column[9], oracle[9], alter[7], appuser[7], href[6], http[6], sql[6], com[5], password[5], table[5], values[5], varchar[5], www[5], case[4], clearspace[4], empty[4], get[4], insert[4], modify[4], which[4], administrator[3], can[3], database[3], db[3], jivesoftware[3], mysql[3], nullability[3], server[3], string[3], thing[3], about[2], all[2], assume[2], been[2], blogged[2], d[2], different[2], feature[2], from[2], has[2], have[2], hsqldb[2], jive[2], make[2], modified[2], nvarchar[2], out[2], plan[2], postgres[2], products[2], ran[2], requirements[2], sensitivity[2], so[2], space[2], store[2], strike[2], support[2], where[2],

So for each of these terms, Lucene finds the field that contains the most instances of the given term and then calculates the idf value and the score. The default implementation of the idf value looks like this:

return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0)

So given the term frequency map above and the document frequency noted below, we get the following idf values and scores for the most popular terms in this document:

Term	Number of Instances of Term in Document	Number of Documents Matching Term	IDF value	Score
pre	18	26	4.609916	82.978
username	10	23	4.7276993	47.276
column	9	13	5.266696	47.400264
oracle	9	8	5.7085285	51.376
alter	7	1	7.212606	50.488

Finally, after all the terms have been added to the priority queue, we create a Lucene query, looping over the first 25 terms (this is the default and can be changed via setMaxQueryTerms) in the queue

TermQuery tq = new TermQuery(new Term("body", "oracle"));

and optionally boosting each term according to the score:

tq.setBoost(51.376 / 82.978);

The resulting Lucene query (in string format) looks something like this:

body:pre body:username^.56974 body:column^.57123 body:oracle^.61915 ...

Now I’ll be curious to see what the related item is for this post!

I hope you learned as much as I did. If you’re extremely curious about this whole Lucene thing and you like the in-depth stuff, you should definitely check out Luke, it’ll open a whole new world to you.

Software Development

Increasing your rate of failure

January 6, 2008 ajohnson Leave a comment

Cool article from HBS that was published this past April. In it, the author put a new spin on one of my favorite quotes, supposedly spoken by Thomas Watson:

If you want to increase your success rate, double your failure rate.

The gist of the research was this:

In well-led teams, a climate of openness could make it easier to report and discuss errorsâ€”compared to teams with poor relationships or with punitive leaders. The good teams, according to this interpretation, don’t make more mistakes, they report more.

So the quote I like so much might be reworded to be something like this: If you want to increase your success rate, foster a culture of openness where recognizing and reporting mistakes is encouraged.

Takeaway for software development teams? A decreasing number of bugs may not be indicative of a good product.

Software Development, SQL

SQL: getting the count of the result of a derived table

December 7, 2007 ajohnson Leave a comment

Recording this for posterity: let’s say you’re working on a store and you’ve got a database table that stores orders and a database table that stores customers and that you want to get a count of all the customers who’ve ordered more than 10 times. You’re going to write a query that looks like this:

SELECT customerID FROM orders GROUP BY customerID HAVING COUNT(*) > 10

to get a list of all the customerID’s who’ve ordered more than ten times and a query like this:

SELECT COUNT(*) FROM (SELECT customerID FROM orders GROUP BY customerID HAVING COUNT(*) > 10) as n

to get a count of those customers. Just sayin’.

Clearspace, J2EE, JavaScript, Rich Internet Applications, Software Development, work

Creating a Firefox Sidebar for Clearspace: Part II

December 1, 2007 ajohnson 1 Comment

It looks like it was almost 2 months ago that I wrote a blog post about the Clearspace plugin for Firefox (called Clearfox), promising that I would follow up with the details on the JavaScript side of the project. I guess time flies when you’re having fun.

Getting started on the JavaScript sidebar was easy. The Mozilla folks have a nice document here that shows you how to get a really simple sidebar created, but like a lot of things in software, the last 20% of the features take 80% of the time. It’s also worth noting that Firefox extensions are deployed as an XPI file, themselves nothing more than glorified zip files, so if you’re curious about how an extension (say Firebug, LiveHTTPHeaders or Del.icio.us Bookmarks), you can download the extension, unzip it’s contents and then poke around to your hearts content. It’s just like viewing source on HTML, which is something that other browser extensions don’t offer out of the box. Here are couple things that either weren’t documented in the above document or caused me to repeatedly bang my head against the wall.

Adding Your Icon

When I originally wrote about the plug-in, a number of people asked “why?” I think one of most important things about browser plug-ins is that they bring your web application (in this case Clearspace) front and center in your customers everyday browsing experience. Front and center in Firefox means that your plug-in sits right next to the Back | Forward | Reload | Stop | Home button in the navigation bar. If you want to put a 24×24 picture of yourself in that spot, be my guest. Since the purpose of Clearfox was to get embed Clearspace in the browser, I chose (wisely I think) to use the Clearspace logo. This was one of the simpler things to accomplish. I added the following element to the overlay.xul:

<toolbox id="navigator-toolbox">
    <toolbarpalette id="BrowserToolbarPalette">
      <toolbarbutton id="clearfox-button" class="clearfoxbutton-1 chromeclass-toolbar-additional"
                     observes="viewClearfoxSidebar" />
    </toolbarpalette>
  </toolbox>

and then in a CSS file (that you also reference in overlay.xul), I added this:

#clearfox-button {
  list-style-image: url("chrome://clearfox/skin/clearfox.png");
  -moz-image-region: rect(0px 24px 24px 0px);
}

Finally, you’ll need to have execute some JavaScript when Firefox loads:

var toolbox = document.getElementById("navigator-toolbox");
var toolboxDocument = toolbox.ownerDocument;
var hasClearfoxButton = false;
for (var i = 0; i < toolbox.childNodes.length; ++i) {
  var toolbar = toolbox.childNodes[i];
  if (toolbar.localName == "toolbar" && toolbar.getAttribute("customizable") == "true" ) {
    if (toolbar.currentSet.indexOf("clearfox-button") > -1)
      hasClearfoxButton = true;
  }
}
if (!hasClearfoxButton) {
  for (var i = 0; i < toolbox.childNodes.length; ++i) {
    toolbar = toolbox.childNodes[i];
    if (toolbar.localName == "toolbar" &&  toolbar.getAttribute("customizable") == "true"
      && toolbar.id == "nav-bar") {
      var newSet = "";
      var child = toolbar.firstChild;
      while (child) {
        if (!hasClearfoxButton && (child.id=="clearfox-button" || child.id=="urlbar-container")) {
          newSet += "clearfox-button,";
          hasClearfoxButton = true;
        }
        newSet += child.id + ",";
        child = child.nextSibling;
      }
      newSet = newSet.substring(0, newSet.length-1);
      toolbar.currentSet = newSet;
      toolbar.setAttribute("currentset", newSet);
      toolboxDocument.persist(toolbar.id, "currentset");
      BrowserToolboxCustomizeDone(true)
      break;
    }
  }
}

The key takeaways: the ID of the toolbarbutton element is used in the CSS declaration, the list-style-image property is used to specify the image you want for the button and the observes attribute points to the ID of the broadcaster element, which is used for showing / hiding the sidebar. I'm not all that good with Fireworks / Photoshop so I didn't go the extra mile to create a separate image to show when a user mouses over the button (check out the the excellent del.icio.us bookmarks extension for an example), but adding a mouseover / hover image is as simple as adding another CSS property:

clearfox-button:hover {
  list-style-image: url("chrome://clearfox/skin/clearfox-hover.png");
}

Loading Sidebar Content

I think I mentioned this when I wrote the original post, but the thing that really got me started with the Firefox extension was looking at the source for the twitbin Firefox extension. When I checked out the source for that extension, I was stunned to learn that they were simply loading up an HTML page from their server and then refreshing the page every couple minutes using an AJAX request. I thought it must have been way more complex than that, but HTML + JavaScript + CSS with a little bit of XUL sprinkled in and you're golden. In the broadcaster element (mentioned above), you add an attribute 'oncommand' that tells Firefox what JavaScript method(s) you want invoked when the user clicks on your button; in Clearfox I use that hook to load the HTML content from Clearspace. The JavaScript looks like this:

var sidebar = top.document.getElementById("sidebar");
sidebar.contentWindow.addEventListener("DOMContentLoaded",
  Clearfox.clearfoxContentLoaded, false);
sidebar.loadURI('http://example.com/yourpage.html');

DOMContentLoaded

Once the content has loaded the DOMContentLoaded event is fired and then Clearfox runs the clearfoxContentLoaded method. I initially tried loading the page and then re-loading that same page every couple minutes: I was cheating and trying not have to do the AJAX part. I added a meta refresh tag to the page to have it reload every n minutes, but the DOMContentLoaded event was only fired the first time the page was loaded. Key takeaway: if you rely on DOMContentLoaded, you're only going to get the event once, even if your page reloads.

Opening New Tabs

You can run your own little application over in the sidebar if you want to, never opening new tabs or doing anything in the main window, but the main point of the Clearfox extension was to give users the ability to see new content in Clearspace and then able to view the full thread, document or blog post in their main browser window as a new tab. There's no way you can create new tabs in JavaScript outside of XUL, you have to be inside XUL to create a tab so the DOMContentLoaded event is important because this is where the links are all rewritten to open new tabs rather than work as normal links. There are two parts to the clearfoxContentLoaded method: the first rewrites all the links so that clicking on a link in the sidebar opens a new tab, the second adds listeners to the document so that when new content is added via AJAX, the same first part is repeated. The link conversion looks like this:

var sidebar = top.document.getElementById("sidebar");
var doc = sidebar.contentDocument;
var all_links = new Array();
var links = doc.evaluate("//a", doc, null, XPathResult.ANY_TYPE, null);
var link = links.iterateNext();
while (link) {
  all_links.push(link);
  link = links.iterateNext();
}
for (var i = 0; i < all_links.length; i++) {
  link = all_links[i];
  if (!link.hasAttribute('onclick') &&
    link.hasAttribute('href') &&
    (link.hasAttribute('class') && link.getAttribute('class').indexOf('tabbable') > -1)
  ) {
    var target = link.getAttribute('href');
    link.setAttribute('onclick', 'return false;');
    link.removeAttribute('target');
    link.addEventListener('click', Clearfox.clearfoxCreateOpenFunction(target), true);
  }
}

In English, get all the links that don't have an onclick attribute and add an onclick event, whose callback creates a new tab:

clearfoxCreateOpenFunction: function(url) {
  return function() {
    gBrowser.selectedTab = gBrowser.addTab(url);
  };
}

The second part of the clearfoxContentLoaded method listens for any modifications (anytime something is added to the document via an AJAX call, I remove the last n rows, so in this case I listen for the removal of nodes) to the document and adds handlers for the modification event:

var sidebar = top.document.getElementById("sidebar");
var table = sidebar.contentDocument.getElementById("clearfox-hidden-content");
table.addEventListener("DOMNodeRemoved", Clearfox.clearfoxConvertLinks, false);

Debugging

Two things about debugging your extension: a) none of the tools you've got installed for debugging JavaScript / CSS (cough! Firebug cough!) will work in the sidebar window and b) your only other option is to resort to the Firefox equivalent of Java's System.out.println:

function logClearfoxMsg(message) {
  var consoleService = Components.classes["@mozilla.org/consoleservice;1"].getService(Components.interfaces.nsIConsoleService);
  consoleService.logStringMessage("Clearfox: " + message);
}

and then to log a message:

logClearfoxMsg('your advertisement here');

To see the log message, go to Tools --> Error Console.

Showing The Sidebar When Firefox Opens

I spent way more time than I should have working on this last part: when you first install the extension, you're asked to restart Firefox, which you do and then you see the nice Clearspace button, you click on it and the sidebar opens and you're happy. Then a couple days later when you need to restart Firefox again (for whatever reason), you'll notice that the sidebar is open but that no content shows and you're sad. You probably didn't take it personally, but I did and I wanted to figure out why Firefox wouldn't load up the content if the sidebar was already open. There actually is onload attribute that I tried using in the sidebar.xul and that didn't work for reasons I won't get into here. What ended up doing the trick for me (and I really think it's a trick but I couldn't get anything else to work and this was just about the last option I had) was to check to see if the sidebar was open when Firefox was loading:

var broadcaster = top.document.getElementById('viewClearfoxSidebar');
  if (broadcaster.hasAttribute('checked')) {
    ...

and then, if it is open, *forcing* the sidebar to open again:

toggleSidebar('viewClearfoxSidebar', true);

For whatever reason, this was the only way I could get the rest of my code to work, but work it did. And now I'm happy.

And I hope you are too. If you have any questions about creating a Firefox sidebar, shoot me an email, I'll be glad to help. If you want to see all the code, you can download the Clearfox source over on the Jive Software Community site. All the JavaScript / XUL code I discussed in this post is in the 'xpi' directory off the root.

J2EE, Software Development, Systems Administration, work

Java ZipEntry bug on Windows

November 18, 2007 ajohnson 2 Comments

I rolled out the Clearfox plugin on the Jive Software Community site a couple weeks ago and got some good feedback and some bad feedback. A number of people said they tried to install the Firefox part of the plugin, restarted Firefox and then didn’t see the Clearspace icon like my screenshots / screencast showed. There were no errors in the Clearspace error logs and no errors showed up in the Firefox JavaScript debug console. Through the help of a couple customers, I was able to narrow it down to running Clearspace on Windows: for some reason the zip file (really the XPI file) that the Clearfox plugin creates on the fly was invalid, at least according to Firefox. If you opened the XPI file using any common zip file utility the contents appeared to fine. As always, google came to the rescue and pointed me to this bug filed on bugs.sun.com, which has two parts. The Unicode file name bug didn’t matter to me, but this one did:

Within a ZIP file, pathnames use the forward slash / as separator, as required by the ZIP spec. This requires a conversion from or to the local file.separator on systems like Windows. The API (ZipEntry) does not take care of the transformation, and the need for the programmer to deal with it is not documented.

which wouldn’t hurt so much if it hadn’t been filed back in… get this… 1999. Are you kidding me?

Anyway, long story short: if you’re writing Java, creating a zip file that has paths while on a Windows based machine and deploying said zip file to a place that actually cares about the zip file specification (or violates Postel’s Law), then make sure to do something like this in your Java code:

String zipFilePath = file.getPath();
if (File.separatorChar != '/') {
  zipFilePath = zipFilePath.replace('\\', '/');
}
ZipEntry zipAdd = new ZipEntry(zipFilePath);

noting that even the workaround they give in the aforementioned bug is incorrect because they show

... file.getName();

which doesn’t contain the path separators. Awesome.

Clearspace, J2EE, Open Source, Software Development, WebWork, work

Creating a Firefox Sidebar for Clearspace: Part I

October 2, 2007 ajohnson 1 Comment

It’s been embarassingly quiet on this blog of late, I apologize for all the delicious links, although a case could be made that blogs were originally nothing more than sharing links so maybe I shouldn’t be apologizing, but that’s a different blog post. Today I want to talk about the thing I’ve been working on at night, my non-day job if you will. A couple months ago I was reading all the hype about how JavaScript is going to take over the world and I’d been doing a lot of JavaScript during the day but I needed a project to get me through the night. Right about that time was when Twitter started taking off and I came across twitbin, which is a cool Firefox sidebar that shows you all of your friends tweets in a Firefox sidebar (the same sidebar that livehttpheaders and selenium IDE show up in), updated in real time. Like any good hacker, I wondered “how’d they do that” and started poking around the xpi file that you download to install Firefox extensions. Lo and behold, it’s JavaScript and HTML behind the scenes. Since Clearspace is one of those addictive, constantly updating, can’t get enough of it kind of applications (unlike Twitter you can actually use more than 140 characters! amazing!), I thought it would be both useful and potentially easy to create a sidebar for Clearspace, which brings me to this blog post.

I’m not sure where I started, but I’m pretty sure the first step wasn’t to create a plugin in Clearspace, I messed around for awhile with the technologies that go into creating a Firefox Extension: JavaScript, XUL (pronounced ‘zool’), install manifests, etc. I got a sample Firefox sidebar up and running that and spent way more time than I should have installing, viewing, uninstalling and restarting Firefox than I should have. I spent a lot of time digesting these sites:

I’ll discuss some of the things I ran into on the Firefox / JavaScript side in a second blog post: right now I want to talk about the Clearspace side of the plugin.

Eventually, I got to a place where I was comfortable enough with the Firefox side of things to creating the Clearspace part of the plugin. The content that is displayed in the sidebar is nothing more than plain HTML and CSS. If you install the Clearspace plugin you can actually view the content in any browser: go to http://example.com/clearspace/cf-view.jspa (replacing example.com with the host name of your installation). You’ll see that the content looks surprisingly similiar to the content that shows up on the the homepage of Clearspace, and in fact, it is the same content. The one difference between this page and other pages in Clearspace is that it never does a page reload. All the views (the login page, the settings page, the view page, etc.) are included in the resulting HTML, but hidden using CSS so that you only see one view at a time. When you click a button at the top of the page, two things usually happen: a) the current view is hidden using CSS and the new view is displayed using CSS and b) an AJAX request is sent out to a WebWork action that performs some action: logging you out, getting the updated content since your last time you viewed the page, saving your Clearfox settings, etc. This process might seem a little backwards: why not just work the same way a regular web page browsing session works where you click a link and your browser loads another page? It’s actually a limitation imposed by the Firefox extension model, so I won’t go into it here, but if you’re curious, you can do a search for ‘DOMContentLoaded firefox extension’.

So now that you (hopefully) understand how the client works, I’m going to assume that you won’t have any problems copying the ‘example’ plugin that Clearspace ships with and dive right into the pieces that are distinctive about the Clearspace part of the Clearfox plugin.

First, I needed to define the WebWork actions that Clearfox was going to use, which means I needed to create the xwork-plugin.xml file. The descriptor for the view that I mentioned earlier looks like this:

<action name="cf-view"
   class="com.jivesoftware.clearspace.plugin.clearfoxplugin.ViewAction">
   <result name="success">/plugins/clearfox/resources/view.ftl</result>
   <result name="update">/plugins/clearfox/resources/update.ftl</result>
</action>

That action is pretty standard: there are two possible results: the ‘success’ result shows the list of the 25 most recently updated pieces of content, the update result is used by an AJAX request to update the existing page with the n most recently updated pieces of content since the last refresh (which by default is invoked every 5 minute). The ViewAction class extends com.jivesoftware.community.action.MainAction, which is the WebWork action that handles the display of the homepage of Clearspace, so ViewAction simply invokes

super.execute();

to get the same content you’d get if you viewed the homepage. When the Clearfox plugin refreshes every 5 or so minutes it invokes the

public String doUpdate()

method, passing in the time (in milliseconds) that the last piece of content it has a record of was updated. It gets back a (presumably) shorter list of content that it then updates the view with.

The login, logout and settings actions are all using in combination with AJAX and since none of them need to provide any data, they all return HTTP status code headers:

<action name="cf-login"
   class="com.jivesoftware.clearspace.plugin.clearfoxplugin.LoginAction">
   <result name="success" type="httpheader">
      <param name="status">200</param>
   </result>
   <result name="unauth" type="httpheader">
      <param name="status">401</param>
   </result>
</action>
<action name="cf-logout"
   class="com.jivesoftware.clearspace.plugin.clearfoxplugin.LogoutAction">
   <result name="success" type="httpheader">
   <param name="status">200</param>
   </result>
</action>
<action name="cf-settings"
   class="com.jivesoftware.clearspace.plugin.clearfoxplugin.SettingsAction">
   <result name="success" type="httpheader">
      <param name="status">200</param>
   </result>
</action>

The last action is used by the Firefox Extension framework to determine if the plugin (on the Firefox side) needs to be updated. The action descriptor looks like this:

<action name="cf-updater"
   class="com.jivesoftware.clearspace.plugin.clearfoxplugin.UpdaterAction">
   <result name="success">
      <param name="location">/plugins/clearfox/resources/updater.ftl</param>
      <param name="contentType">text/rdf</param>
   </result>
</action>

The one trick about this one is that the result sets an HTTP contentType header, which isn’t remarkable except that Clearspace uses Sitemesh, which attempts to decorate everything it can parse with a header and footer. The contentType header should be a hint to Sitemesh that you don’t want the result to be decorated / wrapped with a header and footer, but apparently the hint is to subtle for Sitemesh because it attempts to wrap the result of this action anyway, which leads to the second distinctive part of this plugin.

The way we bypass Sitemesh decoration in the core product is by modifying a file called templates.xml, which gives us the ability tell Sitemesh to exclude certain paths from being decorated. In Clearspace 1.7, which will be out in a couple weeks, plugins (which can’t modify templates.xml) will have the ability to tell Sitemesh about paths that they don’t want decorated. Hence the following entry in the plugin descriptor (always located at the root of plugin and named plugin.xml):

<sitemesh>
   <excludes>
      <pattern>/cf-updater.jsp*</pattern>
      </excludes>
</sitemesh>

Third, Another thing you may have noticed if you were following along with the source code (which is available from clearspace.jivesoftware.com) is that two of the action classes (ViewAction and LoginAction if you must know) are marked with the class annotation ‘AlwaysAllowAnonymous’. This annotation tells the Clearspace security interceptor that the anonymous users should be allowed to invoke the action without being logged in. This is probably a good place to remind you of one of the differences between Clearspace and ClearspaceX. Clearspace, by default, is configured to *require* users to login before they can see any content while ClearspaceX uses the opposite default: you only need to login (usually) if you want to post content. So back to the AlwaysAllowAnonymous annotation: it’s important mostly for the Clearspace (not ClearspaceX) implementations because you want to give people the ability to invoke the action and then the action itself handles the display of the login page in it’s own specific way. The RSS related actions in Clearspace work exactly the same way: they are all annotated with the AlwaysAllowAnonymous marker and then handle security via HTTP Basic Auth (Clearspace) or simply allow anonymous usage (ClearspaceX) because feed readers

Fourth, because the Firefox part of the plugin needs to be specific to your installation, all the Firefox plugin related files need to be zipped up to create the XPI file that your users will install into Firefox. The plugin framework gives you the ability to define a plugin class:

<class>com.jivesoftware.clearspace.plugin.clearfoxplugin.ClearFoxPlugin</class>

which implements com.jivesoftware.base.plugin.Plugin. The interface specifies a method:

public void initializePlugin(PluginManager manager, PluginMetaData pluginData);

which means that your plugin will get a chance to initialize itself. The Clearfox plugin uses this initialization hook to zip up the Firefox related files. The plugin is told where it lives:

public void initializePlugin(PluginManager manager, PluginMetaData metaData) {
   File pluginDir = metaData.getPluginDirectory();
   ...

and then goes on to zip up the files located in the xpi directory of the plugin source.

So that’s that… if you got this far you probably don’t care, but I actually did a screencast of the whole thing that lives over here or you can check it out on blip.tv.

J2EE, Software Development, work

Fun with Supporting Multiple Databases

September 24, 2007 ajohnson 2 Comments

In my day job over at Jive Software I get to work on the crazy cool Clearspace product and unless you’ve been reading the system requirements lately, you probably didn’t notice that we support six different database platforms: MySQL, Oracle, Postgres, DB2, SQL Server and HSQLDB. Clearspace borrowed a number of classes from Jive Forums so a lot of the database specific code has already been flushed out, but I ran into a couple interesting things this past week that I thought needed to be blogged.

The first thing I ran into was a varchar case sensitivity issue: DB2, Oracle, Postgres and HSQLDB are all sensitive about case with respect to the values they store. So if you do this:

INSERT INTO appuser(username,password) VALUES('Administrator','password');

and then try to retrieve the user:

SELECT username,password FROM appuser WHERE username = 'administrator'

you’ll get different results than you will with MySQL and SQL Server (which are both case insensitive by default). Bottom line: if you plan supporting an application on a variety of database servers, make sure you toLowerCase() or toUpperCase() the string values you store if you plan on doing look ups against those values. If you’re into this kind of thing, page 87 of the SQL 92 standard appears to discuss the case sensitivity issues, but I can’t make heads of tails of it. Any good specification readers out there?

The second thing involved Oracle, which has this wonderful feature where you can’t insert an empty string into a varchar column. Example:

INSERT INTO appuser(username, password) VALUES('administrator', '');

Bennett McElwee blogged about this feature in more detail on his blog, but the jist of this nugget of goodness is that if you attempt to insert an empty space into an Oracle varchar, your empty space will get converted to

null

which is not helpful in the least bit. Consider yourself warned.

Finally, and again Oracle. Assume you have the sample table I used above with two columns: username and password. Assume further that you want to modify the username column to be 100 chars instead of 50. You write an ALTER statement that looks something like this for SQL Server:

ALTER TABLE appuser ALTER COLUMN username nvarchar(100) NOT NULL

on MySQL you’d have this:

ALTER TABLE appuser MODIFY COLUMN username nvarchar(100) NOT NULL

on DB2:

ALTER TABLE appuser ALTER COLUMN username varchar(100) NOT NULL

All very similar. But Oracle, no, if you tried this:

ALTER TABLE appuser MODIFY(COLUMN username varchar(100) NOT NULL)

on Oracle you’d get this error:

ORA-01451: column to be modified to NULL cannot be modified to NULL

See Oracle, for whatever reason, decided that if you are going to modify the column that you can ONLY specify the nullability of a column IF the nullability itself is changing. Which is ~~ridiculous~~ nice.

And yes, nullability is a word.

Software Development, Syndication, work

Using ROME to get the body / summary of an item

September 13, 2007 ajohnson Leave a comment

I’ve been using ROME for a couple years now and I’m still learning new things. Today I was working on an issue in Clearspace where we give users the ability to show RSS / Atom feeds in a widget, optionally giving them the choice to show the full content of each item in the feed or just a summary of each item in the feed. The existing logic / pseudo-code looked something like this:

for (SyndEntry entry : feed.getEntries()) {
  if (showFullContent) {
    write(entry.getContents()[0].value);
  } else {
    write(entry.getDescription().value);
  }
}

The assumption was that description would return a summary and contents would return the full content. The problem is that Atom and RSS are spec’ed umm.. differently. RSS 2.0 says that ‘description’ is a synopsis of the item but then goes on in an example to show how the description can be much more than just a short plain text description. So then you’re left with descriptions that aren’t really a synopsis, it’s the full content… or it is sometimes and sometimes not. Then Atom came along with well defined atom:summary and atom:content elements which means ROME had to figure out a way to map description and content-encoded elements in RSS to atom:summary and atom:content. Dave Johnson summarized the mappings nicely in a blog post discussing the release of ROME 0.9, in short the mapping looks like this:

RSS <description> <--> SyndEntry.description <--> Atom <summary>
RSS <content:encoded> <--> SyndEntry.contents[0] <--> Atom <content>

Anyway, all this is to say that if you’re doing any work with SyndEntry, you’ll need to check both description and contents. Generally, if you’re looking for the full content, check the value of contents first. If that’s null, check the value of description. If you’re looking for a summary, check the value of description first BUT don’t assume that you’ll actually get a short summary. Use something like StringUtils.abbreviate(…) to make certain that you’ll get a short summary back and not the entire content.

collaboration, Open Source, Software Development

Public Health and Information Technology

July 20, 2007 ajohnson 1 Comment

Jon Udell posted the transcript of his conversation with Dr. Joel Selanikio (who, as the co-founder of DataDyne, is in the business of collecting public health data in developing countries) last Thursday, which I believe was right about the same time I finished a book I picked up at my parents house while on vacation. My dad is the public health officer for Mono County in California and can’t leave a bookstore without buying a couple books. He just so happened to have a book written by Tracy Kidder (author of The Soul Of A New Machine, among other books) titled ‘Mountains Beyond Mountains: Healing the World: The Quest of Dr. Paul Farmer, which is the story of one Dr. Paul Farmer and his quest to

“… make a difference in solving global health problems through a clear-eyed understanding of the interaction of politics, wealth, social systems and disease.”

It’s a tremendous book and Jon’s blog post reminded me that I dog-eared a couple pages from the book that I wanted to note here for posterity and also that I was bothered while reading the book that information technology seemed not to have a role in Dr. Farmer’s plans for improving international public health. It’s fitting that one of the sayings that Dr. Farmer is fond of using (from page 177) is:

On what data exactly do you base that statement?

because Dr. Selanikio seems to have the same opinion:

Well, in clinical medicine, the way that we understand things is â€” if itâ€™s a rash, I look at the rash, I think about it, I look stuff up, but I donâ€™t systematically create a database. For one patient you can juggle the variables in your head. But when you have a population of affected people, you need to collect data and analyze it. Thatâ€™s the basis of epidemiology.

So there ya go… information technology does have a place.

If you’re interested in the data collection that DataDyne does, check out these YouTube video interviews that Dr. Joel Selanikio did this last May:

Rich Internet Applications, Software Development

Not drinking the Flex / Silverlight / Apollo Kool-Aid

May 21, 2007 ajohnson Leave a comment

From billhiggins.us:

If youâ€™re considering or actively building Ajax/RIA applications, you should consider the Uncanny Valley of user interface design and recognize that when you build a “desktop in the web browser”-style application, you’re violating users’ unwritten expectations of how a web application should look and behave. This choice may have significant negative impact on learnability, pleasantness of use, and adoption. The fact that you can create web applications that resemble desktop applications does not imply that you should; it only means that you have one more option and subsequent set of trade-offs to consider when making design decisions.

See also: The Other Road Ahead by Paul Graham. Scroll down to the section entitled ‘Just Good Enough’.

Aaron Johnson