Category Archives: Syndication

instantFeeds: real time notification of RSS updates via Openfire

I wrote a long winded post a couple months ago with a nebulous title called “IM and RSS: Rome is on Fire” where I talked about the feed bot that I wrote for Wildfire (which is now called Openfire). It’s been up and running now for a couple months, giving me a chance to work out some of the bugs and add a couple features and I think it’s ready for you to look at and use. The project page has all the details.

Also, I updated the client-side sign up form to use Prototype / script.aculo.us rather than the YUI stuff (which ended up costing somewhere north of 200k in JavaScript). Since you can run your own Openfire server and you can download and use the instantFeeds plugin, feel free to rip the JavaScript off this site if you want to enable your users to sign up to receive your blog updates via IM.

Using Outlook 2007 as a RSS aggregator: Not so much

I installed the 60 day preview of Office 2007 a couple days ago to try out some of the blogging / RSS features it included. Publishing from Word 2007 to a blog via the MetaWeblog API? Works pretty well (but seriously, Word to create a blog post?). Outlook 2007 to read feeds? I wouldn’t recommend it. First, I exported all 251 of the feeds I subscribe to using Bloglines as an OPML file and then imported those into Outlook. That part worked great, and reading posts worked great. Now, because I’m just testing, I want to delete all these. I start looking for the ‘manage your subscriptions’ button / option. None exists. I try the multi-select using CTRL-CLICK. No luck. So I have to hand delete 251 feeds. But wait, it gets better. Some of the feeds I apparently don’t have permission to delete (view image) even though I created the subscription! I’m sure it’s a bug (since the documentation says you can use CTRL-CLICK to delete), but it sure would be nice to have a full view / window that gives you the ability to manage all your feeds in one place.

IM and RSS: Rome is on Fire

Last August, Marshall Kirkpatrick (another Portland resident) posted an entry to TechCrunch about a company called FeedCrier which:

… makes it easy to receive rapid notification of new items in an RSS feed by IM

I bookmarked the link on del.icio.us, noting offhandedly that it would probably be easy to do something like this using Wildfire and Rome Fetcher… Almost 5 months to the day later, I’m really proud to say that it wasn’t easy, but it’s definitely doable, albeit with a slightly different aim.

If you swing by my website (instead of viewing this post in your favorite feed reader), you’ll see a list of ‘subscription options’ in the right hand navigation bar: RSS, AIM, Yahoo, MSN, Google Talk and Jabber / XMPP (the full set of IM services thanks to the IM Gateway Plugin). Clicking on RSS takes you to the feed so you can subscribe with a feed reader. Clicking on any of the others results in a fancy schmancy dialog box (courtesy of YUI), into which you can plug in your preferred instant messaging username … click ‘subscribe’ and AJAX will send a request to the Wildfire plugin I created (proxied by mod_proxy), which will then send you an IM to confirm that you really want to receive alerts for this feed. Click the link in the IM and you’re off and running. The service then polls the feed you subscribed to at regular intervals, sending you a message if it finds something new. It supports all the feed formats that Rome supports and also supports XML-RPC pings (my blog is configured to ping the service when I post something).
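
For anyone who hasn’t looked at what a ping is on the wire, here’s a minimal sketch of the request body, assuming the conventional weblogUpdates.ping method (blog name plus blog URL). The class name and example values are made up for illustration, and where you would POST it depends on how your server is set up, so the sketch only builds the payload:

```java
final class PingPayload {
    // Escape the five XML special characters so blog names and URLs are safe to embed.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Build the XML-RPC body for a weblogUpdates.ping(blogName, blogUrl) call.
    static String build(String blogName, String blogUrl) {
        return "<?xml version=\"1.0\"?>\n"
             + "<methodCall>\n"
             + "  <methodName>weblogUpdates.ping</methodName>\n"
             + "  <params>\n"
             + "    <param><value><string>" + escapeXml(blogName) + "</string></value></param>\n"
             + "    <param><value><string>" + escapeXml(blogUrl) + "</string></value></param>\n"
             + "  </params>\n"
             + "</methodCall>\n";
    }

    public static void main(String[] args) {
        // Hypothetical blog name and URL, for illustration only.
        System.out.println(build("My Blog", "http://example.com/blog/"));
    }
}
```

The body gets sent as an HTTP POST with a text/xml content type; most blogging software will do this for you if you add the ping URL to its list of update services.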

I’ll be the first to admit that the UI sucks and that the dialog box should show a confirmation, that the YUI stuff is really heavy (380K of JavaScript and CSS to make a dialog box? sheesh!), that it’s not a ‘professional’ service like FeedCrier, and that I haven’t passed the code by the Wildfire team yet (I’m hoping they’ll accept it as a plugin that’ll be included as part of the base Wildfire distribution) but I’m really excited about the idea of RSS to IM in general and this implementation in particular for the following reasons:

  • As far as I know, all of the existing RSS to IM services (immedi.at, Zaptxt, Rasasa and the aforementioned FeedCrier) are hosted services. If I subscribe to your feed via any of the above services, I’ve got a relationship with them, not with you. If you’re a hip publisher, you’re probably sending pings their way, but you don’t know who is subscribed to your feed. You probably don’t have access to the list of subscribers (and as a subscriber maybe you wouldn’t want them to, but I’ll get to that in a second). With this plugin and an instance of Wildfire, you can go one to one with your customers, rather than working through some third party. Said another way, given the ability to run a Wildfire server, what company wouldn’t want to offer a ‘subscribe to this blog via IM’ as part of the ‘subscribe via email’ and ‘subscribe via RSS’ feature set?
  • Because you host it, you might configure the server in such a way as to give it access to feeds on your intranet, feeds that are completely inaccessible to *all* of the above services. What’s that you say? Your internal feeds are protected by Basic Authentication? That’s ok, the plugin can retrieve protected feeds as well. Specify the username and password in the URL (http://username:password@yourserver/feeds/my.xml) and you’re golden. So if you work at a big corporation that’s producing RSS feeds like rabbits produce baby bunnies, don’t fire up your desktop feed reader. Get someone to set up a Wildfire server and then pester them to install the plugin for you.
  • It’s truly instant: one of the things about RSS to instant messages is that you’d hope that you do get notified or alerted *instantly*. The reality is that it takes two to tango: for FeedCrier to alert you instantly when a feed is updated, they have to have the cooperation of the publisher, the publisher has to send them a ping when the feed is updated. Since the plugin supports XML-RPC pings, you as a publisher can configure your blogging software (or whatever else produces your RSS feeds) to send standard XML-RPC pings to the plugin, so while polling is supported, it should be the exception to the rule.
  • Finally, as a subscriber, the thing that’s valuable is that you a) get your content and b) get it instantly. You couldn’t care less about FeedCrier or any of these other services, you want your content now. So (and this is what I said I’d get to earlier) you might be willing to give up the anonymity that RSS normally provides in exchange for immediate access to the information you want (or maybe anonymity isn’t a big deal to you at all). In other words, for truly valuable information, this service puts publishers in a position of power: subscribers get their fix instantly as long as they cough up their instant message information.
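
On the protected feeds point: a URL of the form http://username:password@yourserver/feeds/my.xml carries everything needed to build a Basic Authentication header. The following is a minimal sketch using only the JDK, not the plugin’s actual code; the class and method names are made up for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class FeedCredentials {
    // Returns the value for an "Authorization" header, or null when the URL carries no credentials.
    static String basicAuthHeader(String feedUrl) {
        String userInfo = URI.create(feedUrl).getUserInfo(); // "username:password" or null
        if (userInfo == null) {
            return null;
        }
        String encoded = Base64.getEncoder()
                .encodeToString(userInfo.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    // Returns the same URL with the credentials stripped, which is what you would actually request.
    static String withoutUserInfo(String feedUrl) {
        URI u = URI.create(feedUrl);
        try {
            return new URI(u.getScheme(), null, u.getHost(), u.getPort(),
                    u.getPath(), u.getQuery(), u.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a usable feed URL: " + feedUrl, e);
        }
    }

    public static void main(String[] args) {
        String url = "http://username:password@yourserver/feeds/my.xml";
        System.out.println(withoutUserInfo(url));
        System.out.println("Authorization: " + basicAuthHeader(url));
    }
}
```

Keep in mind that Basic Authentication only base64-encodes the credentials, so this is something for an intranet or an HTTPS connection.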

If you’ve gotten this far, thanks for reading. I’d love to hear your feedback. And don’t forget to subscribe. You know, to get your fix.

Blogs: Not just for breakfast anymore, part II

A couple weeks ago I added a short post to the Jive Software corporate blog entitled ‘Blogs: Not just for breakfast anymore’. In the post, I hoped to squash the notion that blogs are all about opinions and are useless within a corporation, which was the ‘opinion’ of quite a number of people that took part in our user acceptance tests. I’m not sure that my four bullet points did the topic justice, but I found a post a couple days later written by Steve Yegge called “You Should Write Blogs”, which was a whole lot longer and not surprisingly a whole lot better than my post. And then today I read an article in the NY Times by Clive Thompson (whose blog I’m subscribed to) called Open-Source Spying, which I think is one of the most exciting articles I’ve read about blogging (and also wikis) ever. See, it turns out that no less than the CIA, FBI and NSA are all embracing blogs and wikis as fantastic tools for collaboration and information dissemination, which (while admittedly knowing nothing about the spy business) sounds like a no brainer to me. Give everyone a blog, every team a wiki, throw a couple Google Enterprise Search boxes at ’em and see what happens. Even if it does eventually ‘fail’, it’ll sure cost a lot less than the $170 million FBI project that never even launched. But of course, it won’t fail:

… While the C.I.A. and Fingar’s office set up their wiki, Meyerrose’s office was dabbling in the other half of Andrus’s equation. In July, his staff decided to create a test blog to collect intelligence. It would focus on spotting and predicting possible avian-flu outbreaks and function as part of a larger portal on the subject to collect information from hundreds of sources around the world, inside and outside of the intelligence agencies. Avian flu, Meyerrose reasoned, is a national-security problem uniquely suited to an online-community effort, because information about the danger is found all over the world. An agent in Southeast Asia might be the first to hear news of dangerous farming practices; a medical expert in Chicago could write a crucial paper on transmission that was never noticed by analysts.

In August, one of Meyerrose’s assistants sat me down to show me a very brief glimpse of the results. In the months that it has been operational, the portal has amassed 38,000 “active” participants, though not everyone posts information. In one corner was the active-discussion area — the group blog where the participants could post their latest thoughts about avian flu and others could reply and debate. I noticed a posting, written by a university academic, on whether the H5N1 virus could actually be transmitted to humans, which had provoked a dozen comments. “See, these people would never have been talking before, and we certainly wouldn’t have heard about it if they did,” the assistant said. By September, the site had become so loaded with information and discussion that Rear Adm. Arthur Lawrence, a top official in the health department, told Meyerrose it had become the government’s most crucial resource on avian flu (emphasis mine).

Also, I haven’t read the entire paper yet, but the NY Times article mentions an essay entitled ‘The Wiki and the Blog: Toward a Complex Adaptive Intelligence Community’ written by a guy from the CIA. A quick Google search turns it up over on the Social Science Research Network; you can download it or get it emailed to you for free here.

ROME and wfw namespace elements

I created a ROME parser and generator for wfw:comment and wfw:commentRss elements today. You can read all about it here and download the source code here.

Not sure what the wfw:comment or wfw:commentRss elements are for? Imagine you’re reading my blog in a desktop aggregator and you want to post a comment to my blog. The aggregator has no way of knowing where your comments should be HTTP posted to, so you have to open up another tab in Firefox and go to my blog to enter a comment. wfw:comment is an element that provides the HTTP post endpoint in the feed so that aggregators and widget developers can comment inline, just like this. wfw:commentRss is an element that provides you (or your favorite aggregator) with a link to the comments for the post you’re currently viewing.
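
To make that concrete, here’s a small sketch of what the two elements look like inside an RSS item and how an aggregator might pull them out. It uses the plain JDK DOM parser rather than the ROME module described above, and the feed content and example.com URLs are invented for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class WfwElements {
    // Namespace URI of the Well-Formed Web Comment API.
    static final String WFW_NS = "http://wellformedweb.org/CommentAPI/";

    // A made-up feed with one item carrying both wfw elements.
    static final String SAMPLE =
          "<rss version=\"2.0\" xmlns:wfw=\"" + WFW_NS + "\">"
        + "<channel><title>Example blog</title>"
        + "<item>"
        + "<title>Hello</title>"
        + "<wfw:comment>http://example.com/blog/comments/post/42</wfw:comment>"
        + "<wfw:commentRss>http://example.com/blog/comments/feed/42</wfw:commentRss>"
        + "</item>"
        + "</channel></rss>";

    // Returns the text of the first wfw element with the given local name, or null if absent.
    static String firstWfwValue(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-based lookups
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagNameNS(WFW_NS, localName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("POST comments to: " + firstWfwValue(SAMPLE, "comment"));
        System.out.println("Read comments at: " + firstWfwValue(SAMPLE, "commentRss"));
    }
}
```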

ROME and custom Generator elements

The majority of the articles I’ve seen on using ROME to create Atom and RSS feeds don’t show you how to customize the optional ‘generator’ element that both Atom and RSS support. It’s really easy. I’m assuming that you’ve already created and populated a SyndFeed instance:

SyndFeed feed = ..
WireFeedOutput feedOutput = new WireFeedOutput();
WireFeed wireFeed = feed.createWireFeed();
if (wireFeed.getFeedType().startsWith("atom")) {
  // Atom: the generator is a structured element (URL, version, text)
  Feed atomFeed = (Feed)wireFeed;
  Generator gen = new Generator();
  gen.setUrl("http://yoursite.com/");
  gen.setValue("Your Site");
  gen.setVersion("1.0");
  atomFeed.setGenerator(gen);
  feedOutput.output(atomFeed, ...);
} else {
  // RSS: the generator is just a string
  Channel rssFeed = (Channel)wireFeed;
  rssFeed.setGenerator("Your Site 1.0 (http://yoursite.com/)");
  feedOutput.output(rssFeed, ...);
}

Make a decision about which feed you’re going to produce, cast to the appropriate implementation of WireFeed (Channel for RSS and Feed for Atom), and then use the appropriate setters on each of those respective classes.

RSS/Atom feeds, Last Modified and Etags

Sometime last week I read this piece by Sam Ruby, which, summarized, says this:

…don’t send Etag and Last-Modified headers unless you really mean it. But if you can support it, please do. It will save you some bandwidth and your readers some processing.

The product I’ve been working on at work (which I can talk about now) for the last couple months uses feeds (either Atom, RSS 1.0 or RSS 2.0, your choice) extensively but didn’t have Etag or Last-Modified support, so I spent a couple hours working on it this past weekend. We’re using ROME, so the code ended up looking something like this:

HttpServletRequest request = ...
HttpServletResponse response = ....
SyndFeed feed = ...
if (!isModified(request, feed)) {
  response.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
} else {
  long publishDate = feed.getPublishedDate().getTime();
  response.setDateHeader("Last-Modified", publishDate);
  response.setHeader("Etag", getEtag(feed));
}
...
private String getEtag(SyndFeed feed) {
  return "\"" + String.valueOf(feed.getPublishedDate().getTime()) + "\"";
}
...
private boolean isModified(HttpServletRequest request, SyndFeed feed) {
  if (request.getHeader("If-Modified-Since") != null && request.getHeader("If-None-Match") != null) {
    String feedTag = getEtag(feed);
    String eTag = request.getHeader("If-None-Match");
    Calendar ifModifiedSince = Calendar.getInstance();
    ifModifiedSince.setTimeInMillis(request.getDateHeader("If-Modified-Since"));
    Calendar publishDate = Calendar.getInstance();
    publishDate.setTime(feed.getPublishedDate());
    publishDate.set(Calendar.MILLISECOND, 0);
    int diff = ifModifiedSince.compareTo(publishDate);
    return diff != 0 || !eTag.equalsIgnoreCase(feedTag);
  } else {
    return true;
  }
}

There are only two gotchas in the code:

  1. The value of the Etag must be quoted, hence the getEtag(...) method above returning a string wrapped in quotes. Not hard to do, but easy to miss.
  2. The first block of code above uses setDateHeader(String name, long date) to set the ‘Last-Modified’ HTTP header, which conveniently takes care of formatting the given date according to the RFC 822 specification for dates and times. The published date comes from ROME. Here’s where it gets tricky: if the client returns the ‘If-Modified-Since’ header and you retrieve said date from the request using getDateHeader(String name), you’ll get the date back as a long (milliseconds since the epoch, GMT), so to compare it against the feed’s date I load both values into Calendar instances. The Calendar instance will transparently take care of the timezone for you. But there’s still one thing left: the RFC 822 date format only has one-second resolution, so if the long value you hand to setDateHeader(String name, long date) contains a millisecond value and you then compare that same value against the ‘If-Modified-Since’ header, you’ll never get a match. The easy way around that is to zero out the milliseconds on the feed’s published date before comparing, which is what the publishDate.set(Calendar.MILLISECOND, 0) call does.
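
Both gotchas can be shown without a servlet container. The following is a minimal sketch, not the production code: the class and method names are made up, and it takes the header values as plain arguments (the quoted ‘If-None-Match’ string and the long you’d get from getDateHeader, which returns -1 when the header is absent):

```java
import java.util.Date;

final class ConditionalGet {
    // Gotcha 1: the entity tag has to be wrapped in double quotes.
    static String etagFor(Date publishedDate) {
        return "\"" + publishedDate.getTime() + "\"";
    }

    // Gotcha 2: HTTP dates carry whole seconds only, so drop the milliseconds before comparing.
    static long truncateToSeconds(long millis) {
        return (millis / 1000L) * 1000L;
    }

    // ifModifiedSince is the value of getDateHeader("If-Modified-Since"), or -1 if absent.
    static boolean isModified(String ifNoneMatch, long ifModifiedSince, Date publishedDate) {
        if (ifNoneMatch == null || ifModifiedSince < 0) {
            return true; // no validators sent, so send the full feed
        }
        boolean sameDate = ifModifiedSince == truncateToSeconds(publishedDate.getTime());
        boolean sameTag = ifNoneMatch.equals(etagFor(publishedDate));
        return !(sameDate && sameTag);
    }

    public static void main(String[] args) {
        Date published = new Date(1163462400123L); // note the trailing 123 milliseconds
        long headerValue = 1163462400000L;         // what a client would echo back
        System.out.println(isModified(etagFor(published), headerValue, published));
    }
}
```

Comparing the raw longs works because both values are milliseconds since the epoch, so no timezone conversion is involved in this version.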

If you’re interested, there are a number of other blogs / articles about Etags and Last-Modified headers:

Using Lucene and MoreLikeThis to show Related Content

If you read this blog, you probably paid a smidgen of attention to the Web 2.0 Conference held last week in San Francisco. Sphere was one of the companies that presented and they launched a product called the “Sphere It Contextual Widget for blogs”, which is a JavaScript widget you can add to your blog or content focused site that displays contextually similar blogs and blog posts for the reader. I’ve always wanted to try to do something similar (no pun intended) using Lucene, so I spent a couple hours this weekend banging around on it.

The first step was to get my WordPress content (which is stored in MySQL) into Lucene. A couple lines of code later I had a Lucene index full of all 857 (as of 11/14/2006) posts including the blog post ID, subject, body, date and permalink. Next, I checked out and compiled the Lucene similarity contrib, whose most important asset is the MoreLikeThis class (written in part by co-worker Bruce Ritchie). You provide an instance of MoreLikeThis a document to parse, an index to search and the fields in the index you want to compare against the given document and then execute a Lucene search just like you normally would:

Reader reader = ...;
IndexReader index = IndexReader.open(indexfile);
IndexSearcher searcher = new IndexSearcher(index);
MoreLikeThis mlt = new MoreLikeThis(index);
mlt.setFieldNames(new String[] {"subject", "body"});
Query query = mlt.like(reader);
Hits hits = searcher.search(query);

I’ll skip all the glue and say that I wired all this up into a servlet that spits out JSON:

Map entries = getRelatedEntries(postID, body);
JSONObject json = JSONObject.fromObject( entries );
response.setContentType("text/javascript");
response.getWriter().write("Related = {}; Related.posts = " + json.toString());

and then used client side JavaScript and some PHP to put it all together:

<h5>Related Content</h5>
<script type="text/javascript"
  src="http://cephas.net/blog/related.js?post=<?php the_ID(); ?>">
</script>
<script type="text/javascript">
document.write('<ul>');
for (var post in Related.posts) {
  document.write('<li><a href="' + Related.posts[post] + '">' + post + '</a></li>');
}
document.write('</ul>');
</script>

I’ve been cruising around the blog and so far, I think that MoreLikeThis works really well. For the most part, the posts that I would expect to be related, are related. There are a couple posts which seem to pop to the top of the ‘related content’ feed that I’ll have to fix and I would like to boost the terms in the subject of the original document, but other than that, I’m happy with it.
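
If you’re curious about the general idea behind this kind of matching, here’s a rough sketch in plain Java: pick the most distinctive terms of the source post by a tf-idf style weight, then rank the other posts by how many of those terms they share. To be clear, this is a simplified stand-in and not Lucene’s implementation; the class name, the tokenizer and the weighting formula are all assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RelatedPosts {
    // Lowercase, split on anything that isn't a letter or digit, drop very short tokens.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (t.length() > 2) out.add(t);
        }
        return out;
    }

    // posts maps post id to body and must contain sourceId; returns related ids, best match first.
    static List<String> related(String sourceId, Map<String, String> posts, int maxTerms) {
        // Document frequency: in how many posts does each term appear?
        Map<String, Integer> df = new HashMap<>();
        for (String body : posts.values()) {
            for (String t : new HashSet<>(tokenize(body))) df.merge(t, 1, Integer::sum);
        }
        // Term frequency within the source post.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokenize(posts.get(sourceId))) tf.merge(t, 1, Integer::sum);
        // Weight = tf * idf, so rare terms count for more than common ones.
        Map<String, Double> weight = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) posts.size() / df.get(e.getKey())) + 1.0;
            weight.put(e.getKey(), e.getValue() * idf);
        }
        // Keep only the top maxTerms terms as the "query".
        List<String> terms = new ArrayList<>(weight.keySet());
        terms.sort((a, b) -> Double.compare(weight.get(b), weight.get(a)));
        Set<String> query = new HashSet<>(terms.subList(0, Math.min(maxTerms, terms.size())));
        // Score every other post by the summed weight of the query terms it contains.
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, String> e : posts.entrySet()) {
            if (e.getKey().equals(sourceId)) continue;
            double score = 0;
            for (String t : new HashSet<>(tokenize(e.getValue()))) {
                if (query.contains(t)) score += weight.get(t);
            }
            if (score > 0) scores.put(e.getKey(), score);
        }
        List<String> ids = new ArrayList<>(scores.keySet());
        ids.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return ids;
    }
}
```

Boosting the subject, as mentioned above, would amount to multiplying the weight of any term that also appears in the source post’s subject before picking the top terms.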

Back to Sphere, and specifically to Brady’s post about it on the Radar blog:

Top-Left Corner: Recent, similar blog posts from other blogs.
Bottom-Left Corner: Recommended blogs that are selected by the site-owner. This is very handy for blog networks.
Top-Right Corner: Similar posts from that blog
Bottom-Right Corner: Ad, currently served by FM Pub.

Given a week, I’m guessing that you could use the Google API to do the top-left corner, hardcode the content in the bottom left, use MoreLikeThis in the top right and the bottom right you’d want to do yourself anyway. So if you were a publisher looking for more page views, why would you even consider the Sphere widget?

ROME, custom modules, publishdate and RSS

At work, I’ve taken on the work of migrating our RSS feeds currently being produced using JSP to ROME. Since we’ve added a few custom elements to the feeds available in Jive Forums (things like message and thread counts), I’m taking advantage of the feature in ROME that gives you the ability to programmatically define namespaces in your RSS 2.0, Atom 0.3 and Atom 1.0 feeds (examples: the iTunes module and the OpenSearch module). Anyway, the code I wrote to add an item to the list of available items in a feed looked something like this:

...
entry = new SyndEntryImpl();
entry.setTitle(thread.getSubject());
entry.setLink("http://mysite.com/community/threads.jspa?id=" + 
   thread.getID());
entry.setUpdatedDate(thread.getModificationDate());
entry.setPublishedDate(thread.getCreationDate());
...
JiveForumsModule module = new JiveForumsModuleImpl();
module.setReplyCount(thread.getReplyCount());
List modules = new ArrayList();
modules.add(module);
entry.setModules(modules);
...

This code works, but if you view the feed, you don’t get a publish date on the item. I dug into the ROME source code a bit and found that the publish date is stored as part of the Dublin Core module, which I came to find out is a ‘special’ module that always exists on a SyndEntryImpl object. Take a look at the implementation of the getModules() method on the SyndEntryImpl class:

public List getModules() {
  if  (_modules==null) {
    _modules=new ArrayList();
  }
  if (ModuleUtils.getModule(_modules,DCModule.URI)==null) {
    _modules.add(new DCModuleImpl());
  }
  return _modules;
}

See how the method automatically injects a DCModuleImpl into the _modules property if the DCModule doesn’t exist? Long story short, the code I wrote blew away the _modules property on the SyndEntryImpl instance which contained a single DCModule which itself contained the publishedDate date instance. So by the time the feed was produced, the publish date I set on each SyndEntry was long gone. I should have written my code like this:

JiveForumsModule module = new JiveForumsModuleImpl();
module.setReplyCount(thread.getReplyCount());
entry.getModules().add(module);

Better yet, the ROME team could have done two things:

  1. Added documentation to the setModules(List modules) method that pointed out that any information in the existing DCModule instance will be lost if the provided list doesn’t contain the existing DCModule instance.
  2. Added a method to the SyndEntry interface called addModule(Module module).
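
To see the failure mode in isolation, here’s a stripped-down sketch. These are stand-in classes written for illustration, not ROME’s actual classes: Entry mimics the lazy-injecting getModules() shown above, DateModule plays the role of the Dublin Core module, and addModule is the hypothetical helper from suggestion 2:

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

final class ModulesPitfall {
    interface Module { String uri(); }

    // Stand-in for the Dublin Core module: it is where the published date actually lives.
    static final class DateModule implements Module {
        static final String URI = "urn:example:date";
        Date date;
        public String uri() { return URI; }
    }

    // Stand-in for a custom module such as a reply count.
    static final class ReplyCountModule implements Module {
        final int replies;
        ReplyCountModule(int replies) { this.replies = replies; }
        public String uri() { return "urn:example:replies"; }
    }

    static final class Entry {
        private List<Module> modules;

        // Lazily creates the list and injects a DateModule if none is present.
        List<Module> getModules() {
            if (modules == null) modules = new ArrayList<>();
            if (find(DateModule.URI) == null) modules.add(new DateModule());
            return modules;
        }

        // Replaces the whole list, silently discarding any existing DateModule.
        void setModules(List<Module> replacement) { modules = replacement; }

        // The hypothetical helper: appends without touching what is already there.
        void addModule(Module m) { getModules().add(m); }

        void setPublishedDate(Date d) { dateModule().date = d; }
        Date getPublishedDate() { return dateModule().date; }

        private DateModule dateModule() {
            getModules(); // make sure the date module exists
            return (DateModule) find(DateModule.URI);
        }

        private Module find(String uri) {
            for (Module m : modules) {
                if (m.uri().equals(uri)) return m;
            }
            return null;
        }
    }

    public static void main(String[] args) {
        Entry broken = new Entry();
        broken.setPublishedDate(new Date(0L));
        List<Module> fresh = new ArrayList<>();
        fresh.add(new ReplyCountModule(3));
        broken.setModules(fresh);                       // the date module is gone
        System.out.println(broken.getPublishedDate());  // a new, empty one gets injected

        Entry fixed = new Entry();
        fixed.setPublishedDate(new Date(0L));
        fixed.addModule(new ReplyCountModule(3));       // the date module survives
        System.out.println(fixed.getPublishedDate());
    }
}
```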

Open source: I’m lovin it.