Category Archives: Blogs

The Data Life Cycle of a Blog Post

Cool Flash infographic in the latest issue of Wired that shows what happens to your blog post after you click the ‘publish’ button (I’ll save you the hassle of actually viewing it: after you click the ‘publish’ button, exciting things like ping servers, data miners, search engines, text scrapers, aggregators, social bookmarking sites, online media, spam blogs and finally readers get involved). Since it’s Wired and not XML Journal, they stopped at the infographic, but man, it sure would be cool to see all the ways that data gets massaged, reformatted, sliced and diced and transmitted, because there’s a lot that happens in that process. Just for the fun of it, I’m gonna walk through the scenarios I know about.

First, you click the publish button. But that might be a publish button on a desktop blogging client like Windows Live Writer or it might be the publish button in Microsoft Word or it might be a real live HTML button that says ‘publish’. So before you even get to the publish part, we’ve got the possibility of the MetaWeblog API (which is XML-RPC, effectively XML over HTTP), Atom Publishing Protocol (again effectively XML over HTTP) or a plain HTTP (or HTTPS!) POST.
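If you’re curious what the XML-RPC route actually looks like from code, here’s a rough sketch using the Apache XML-RPC client; the endpoint URL, blog ID and credentials are all made up:

// a sketch of publishing via the MetaWeblog API with Apache XML-RPC;
// the endpoint, blog id and credentials below are hypothetical
XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
config.setServerURL(new URL("http://example.com/xmlrpc"));
XmlRpcClient client = new XmlRpcClient();
client.setConfig(config);
Map post = new HashMap();
post.put("title", "My New Post");
post.put("description", "The body of the post goes here.");
// metaWeblog.newPost(blogid, username, password, struct, publish)
Object postId = client.execute("metaWeblog.newPost",
        new Object[] {"1", "username", "password", post, Boolean.TRUE});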

OK, so now your blog post has been published on your blog. What next? Probably unbeknownst to you, your blog post has been automatically submitted to one or more ping servers using XML-RPC (XML over HTTP). Because search engines got into the blogging business, you can even ping Google and Yahoo (curiously not Microsoft, why?). If you don’t want to hassle with a bunch of different sites, you can always use pingomatic.com, which will ping (as of 1/27/2008) twenty-one different ping servers for you.
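The ping itself couldn’t be simpler: one weblogUpdates.ping call carrying your blog’s name and URL. Reusing the same Apache XML-RPC client setup as above, pointed at Ping-o-Matic (the blog name and URL are examples):

// ping Ping-o-Matic when a new post goes up
config.setServerURL(new URL("http://rpc.pingomatic.com/"));
// weblogUpdates.ping(blogName, blogUrl)
Object response = client.execute("weblogUpdates.ping",
        new Object[] {"My Blog", "http://example.com/blog"});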

Oh, I forgot to mention: if you’re using TypePad, LiveJournal or Vox, the information about your blog post isn’t sent to these ping servers using XML-RPC; it’s streamed as XML in real time over HTTP to many of the same parties.

Great, your blog post has now been sent to everyone. You’re good, right? Nope. Now comes the onslaught of spiders and bots, awoken by the ping you sent, who will request your feed (RSS / Atom over HTTP) and your blog post (HTML over HTTP) and your first born child, again and again and again. And now that your blog post is published, and assuming that you’ve published something of value, you’ll see real people stop by and comment on your blog post and maybe bookmark it on a site like del.icio.us or ma.gnolia.com, snipping a quote from your blog post and then publishing that snippet to their own blogs or to their bug tracker. Now your blog post has replicated: it lives in small parts all over the web, each part getting published and spidered and syndicated and ripped, again and again and again. It’s beautiful, isn’t it?

IM and RSS: Rome is on Fire

Last August, Marshall Kirkpatrick (another Portland resident) posted an entry to TechCrunch about a company called FeedCrier, which:

… makes it easy to receive rapid notification of new items in an RSS feed by IM

I bookmarked the link on del.icio.us, noting offhandedly that it would probably be easy to do something like this using Wildfire and Rome Fetcher… Almost 5 months to the day later, I’m really proud to say that it wasn’t easy, but it’s definitely doable, albeit with a slightly different aim.

If you swing by my website (instead of viewing this post in your favorite feed reader), you’ll see a list of ‘subscription options’ in the right hand navigation bar: RSS, AIM, Yahoo, MSN, Google Talk and Jabber / XMPP (the full set of IM services, thanks to the IM Gateway Plugin). Clicking on RSS takes you to the feed so you can subscribe with a feed reader; clicking on any of the others results in a fancy schmancy dialog box (courtesy of YUI), into which you can plug your preferred instant messaging username … click ‘subscribe’ and AJAX will send a request to the Wildfire plugin I created (proxied by mod_proxy), which will then send you an IM to confirm that you really want to receive alerts for this feed. Click the link in the IM and you’re off and running. The service then polls the feed you subscribed to at regular intervals, sending you a message if it finds something new. It supports all the feed formats that Rome supports and also supports XML-RPC pings (my blog is configured to ping the service when I post something).
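If you’re wondering what the polling half looks like, here’s a sketch of the general shape, assuming the Rome Fetcher API and Wildfire’s packet classes; isNew() and subscriberJid are hypothetical stand-ins for the plugin’s bookkeeping:

// fetch the feed and IM anything we haven't seen yet to the subscriber
FeedFetcherCache cache = HashMapFeedInfoCache.getInstance();
FeedFetcher fetcher = new HttpURLFeedFetcher(cache);
SyndFeed feed = fetcher.retrieveFeed(new URL("http://example.com/blog/feed"));
for (Iterator it = feed.getEntries().iterator(); it.hasNext();) {
    SyndEntry entry = (SyndEntry) it.next();
    if (isNew(entry)) { // hypothetical 'have we sent this before?' check
        Message message = new Message();
        message.setTo(subscriberJid); // an org.xmpp.packet.JID
        message.setBody(entry.getTitle() + " - " + entry.getLink());
        XMPPServer.getInstance().getMessageRouter().route(message);
    }
}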

I’ll be the first to admit that the UI sucks and that the dialog box should show a confirmation, that the YUI stuff is really heavy (380K of JavaScript and CSS to make a dialog box? sheesh!), that it’s not a ‘professional’ service like FeedCrier, and that I haven’t passed the code by the Wildfire team yet (I’m hoping they’ll accept it as a plugin that’ll be included as part of the base Wildfire distribution) but I’m really excited about the idea of RSS to IM in general and this implementation in particular for the following reasons:

  • As far as I know, all of the existing RSS to IM services (immedi.at, Zaptxt, Rasasa and the aforementioned FeedCrier) are hosted services. If I subscribe to your feed via any of the above services, I’ve got a relationship with them, not with you. If you’re a hip publisher, you’re probably sending pings their way, but you don’t know who is subscribed to your feed. You probably don’t have access to the list of subscribers (and as a subscriber, maybe you wouldn’t want them to, but I’ll get to that in a second). With this plugin and an instance of Wildfire, you can go one to one with your customers, rather than working through some third party. Said another way: given the ability to run a Wildfire server, what company wouldn’t want to offer ‘subscribe to this blog via IM’ alongside the ‘subscribe via email’ and ‘subscribe via RSS’ feature set?
  • Because you host it, you might configure the server in such a way as to give it access to feeds on your intranet, feeds that are completely inaccessible to *all* of the above services. What’s that you say? Your internal feeds are protected by Basic Authentication? That’s ok, the plugin can retrieve protected feeds as well. Specify the username and password in the URL (http://username:password@yourserver/feeds/my.xml) and you’re golden. So if you work at a big corporation that’s producing RSS feeds like rabbits produce baby bunnies, don’t fire up your desktop feed reader. Get someone to set up a Wildfire server and then pester them to install the plugin for you.
  • It’s truly instant: the whole point of RSS to IM is that you get notified *instantly*. The reality is that it takes two to tango: for FeedCrier to alert you instantly when a feed is updated, it needs the cooperation of the publisher; the publisher has to send a ping when the feed is updated. Since the plugin supports XML-RPC pings, you as a publisher can configure your blogging software (or whatever else produces your RSS feeds) to send standard XML-RPC pings to the plugin, so while polling is supported, it should be the exception rather than the rule.
  • Finally, as a subscriber, what’s valuable is that you a) get your content and b) get it instantly. You couldn’t care less about FeedCrier or any of these other services; you want your content now. So (and this is what I said I’d get to earlier) you might be willing to give up the anonymity that RSS normally provides in exchange for immediate access to the information you want (or maybe anonymity isn’t a big deal to you at all). In other words, for truly valuable information, this service puts publishers in a position of power: subscribers get their fix instantly as long as they cough up their instant message information.

If you’ve gotten this far, thanks for reading. I’d love to hear your feedback. And don’t forget to subscribe. You know, to get your fix.

Blogs: Not just for breakfast anymore, part II

A couple weeks ago I added a short post to the Jive Software corporate blog entitled ‘Blogs: Not just for breakfast anymore’. In the post, I hoped to squash the notion that blogs are all about opinions and are useless within a corporation, which was the ‘opinion’ of quite a number of people who took part in our user acceptance tests. I’m not sure that my four bullet points did the topic justice, but I found a post a couple days later written by Steve Yegge called “You Should Write Blogs”, which was a whole lot longer and, not surprisingly, a whole lot better than my post. And then today I read an article in the NY Times by Clive Thompson (whose blog I’m subscribed to) called Open-Source Spying, which I think is one of the most exciting articles I’ve ever read about blogging (and also wikis). See, it turns out that no less than the CIA, FBI and NSA are all embracing blogs and wikis as fantastic tools for collaboration and information dissemination, which (while I admittedly know nothing about the spy business) sounds like a no-brainer to me. Give everyone a blog, give every team a wiki, throw a couple Google Enterprise Search boxes at ’em and see what happens. Even if it does eventually ‘fail’, it’ll sure cost a lot less than the $170 million FBI project that never even launched. But of course, it won’t fail:

… While the C.I.A. and Fingar’s office set up their wiki, Meyerrose’s office was dabbling in the other half of Andrus’s equation. In July, his staff decided to create a test blog to collect intelligence. It would focus on spotting and predicting possible avian-flu outbreaks and function as part of a larger portal on the subject to collect information from hundreds of sources around the world, inside and outside of the intelligence agencies. Avian flu, Meyerrose reasoned, is a national-security problem uniquely suited to an online-community effort, because information about the danger is found all over the world. An agent in Southeast Asia might be the first to hear news of dangerous farming practices; a medical expert in Chicago could write a crucial paper on transmission that was never noticed by analysts.

In August, one of Meyerrose’s assistants sat me down to show me a very brief glimpse of the results. In the months that it has been operational, the portal has amassed 38,000 “active” participants, though not everyone posts information. In one corner was the active-discussion area — the group blog where the participants could post their latest thoughts about avian flu and others could reply and debate. I noticed a posting, written by a university academic, on whether the H5N1 virus could actually be transmitted to humans, which had provoked a dozen comments. “See, these people would never have been talking before, and we certainly wouldn’t have heard about it if they did,” the assistant said. By September, the site had become so loaded with information and discussion that Rear Adm. Arthur Lawrence, a top official in the health department, told Meyerrose it had become the government’s most crucial resource on avian flu (emphasis mine).

Also, I haven’t read the entire paper yet, but the NY Times article mentions an essay entitled ‘The Wiki and the Blog: Toward a Complex Adaptive Intelligence Community’, written by a guy from the CIA. A quick Google search turns it up over on the Social Science Research Network, where you can download it or get it emailed to you for free.

Using Lucene and MoreLikeThis to show Related Content

If you read this blog, you probably paid a smidgen of attention to the Web 2.0 Conference held last week in San Francisco. Sphere was one of the companies that presented; they launched a product called the “Sphere It Contextual Widget for blogs”, which is a JavaScript widget you can add to your blog or content-focused site that displays contextually similar blogs and blog posts for the reader. I’ve always wanted to try to do something similar (no pun intended) using Lucene, so I spent a couple hours this weekend banging around on it.

The first step was to get my WordPress content (which is stored in MySQL) into Lucene. A couple lines of code later, I had a Lucene index of all 857 (as of 11/14/2006) posts, including the blog post ID, subject, body, date and permalink. Next, I checked out and compiled the Lucene similarity contrib, whose most important asset is the MoreLikeThis class (written in part by my co-worker Bruce Ritchie). You provide an instance of MoreLikeThis with a document to parse, an index to search and the fields in the index you want to compare against the given document, and then execute a Lucene search just like you normally would:

Reader reader = ...;
// open the index and set up the searcher
IndexReader index = IndexReader.open(indexfile);
IndexSearcher searcher = new IndexSearcher(index);
// build a query from the document behind 'reader', comparing these fields
MoreLikeThis mlt = new MoreLikeThis(index);
mlt.setFieldNames(new String[] {"subject", "body"});
Query query = mlt.like(reader);
Hits hits = searcher.search(query);
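Backing up a step, the indexing side really was just a couple lines. A sketch of getting the WordPress rows into Lucene with the IndexWriter API; the index path and the post variables are stand-ins, and the field names match the search code above:

// one Lucene Document per WordPress post (the loop over posts is elided)
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("id", postID, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("subject", subject, Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("body", body, Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("permalink", permalink, Field.Store.YES, Field.Index.UN_TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();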

I’ll skip all the glue and say that I wired all this up into a servlet that spits out JSON:

// fetch the related posts and serialize the map to JSON (using json-lib)
Map entries = getRelatedEntries(postID, body);
JSONObject json = JSONObject.fromObject(entries);
response.setContentType("text/javascript");
response.getWriter().write("Related = {}; Related.posts = " + json.toString());

and then used client side JavaScript and some PHP to put it all together:

<h5>Related Content</h5>
<script type="text/javascript"
  src="http://cephas.net/blog/related.js?post=<?php the_ID(); ?>">
</script>
<script type="text/javascript">
for (post in Related.posts) {
document.write('<li><a href="' + Related.posts[post] + '">' + post + '</a></li>');
}
</script>

I’ve been cruising around the blog and so far, I think that MoreLikeThis works really well. For the most part, the posts that I would expect to be related are related. There are a couple posts which seem to pop to the top of the ‘related content’ feed that I’ll have to fix, and I’d like to boost the terms in the subject of the original document, but other than that, I’m happy with it.
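For tuning, MoreLikeThis exposes a few knobs worth experimenting with (these setters are from the similarity contrib; note that setBoost weights terms by their score rather than by field, so boosting just the subject terms would take more work):

mlt.setBoost(true);     // weight query terms by their "interestingness" score
mlt.setMinTermFreq(2);  // ignore terms appearing only once in the source doc
mlt.setMinDocFreq(5);   // ignore terms appearing in fewer than 5 posts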

Back to Sphere, and specifically to Brady’s post about it on the Radar blog:

Top-Left Corner: Recent, similar blog posts from other blogs.
Bottom-Left Corner: Recommended blogs that are selected by the site-owner. This is very handy for blog networks.
Top-Right Corner: Similar posts from that blog
Bottom-Right Corner: Ad, currently served by FM Pub.

Given a week, I’m guessing that you could use the Google API to do the top-left corner, hardcode the content in the bottom-left, use MoreLikeThis in the top-right, and the bottom-right you’d want to do yourself anyway. So if you were a publisher looking for more page views, why would you even consider the Sphere widget?

XSL / CSS Processing Instructions using ROME

Have you seen the way that the smart guys at FeedBurner display RSS feeds in a browser (here’s a sample if you haven’t)? If you’re like me, the first time you see a feed they manage, you’ll probably think that you’re viewing a page that contains a link to an RSS or ATOM feed, not the actual feed. In fact, what you’re seeing is the feed transformed by your browser using XSL and CSS. Take a peek at the source and you’ll see that the XSL and CSS transformations are produced by what are technically called processing instructions. I won’t go into the work that they’ve done to create that look (if you poke around the source, it’s not trivial), but the inclusion of the processing instruction… now that’s something I can help you out with.

I’ve used ROME on a couple different projects now because it makes it trivial to create RSS, RSS2 or ATOM feeds with only a couple lines of code. There are a number of tutorials up on the ROME site that show you how to create a feed and then write it to a String, a File or a Writer, which you should check out if you don’t have any experience with ROME. Assuming that you do, however, and given that you have a feed (technically a SyndFeed in ROME), you would write it out in a servlet like this:

// write the SyndFeed straight to the servlet response
SyndFeedOutput output = new SyndFeedOutput();
output.output(feed, response.getWriter());
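If you don’t already have a SyndFeed in hand, building one takes only a couple more lines. A minimal sketch; the feed type, titles and URLs here are made up:

// a bare-bones SyndFeed, purely for illustration
SyndFeed feed = new SyndFeedImpl();
feed.setFeedType("rss_2.0");
feed.setTitle("My Blog");
feed.setLink("http://example.com/blog");
feed.setDescription("Posts from my blog");
SyndEntry entry = new SyndEntryImpl();
entry.setTitle("Hello World");
entry.setLink("http://example.com/blog/hello-world");
feed.setEntries(Collections.singletonList(entry));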

ROME abstracts you from having to work directly with an XML document, which is handy most of the time. But if you want to add a processing instruction to style your feed, it takes a little fiddling with the source. If you peek at the source of SyndFeedOutput, you’ll see that it wraps WireFeedOutput. WireFeedOutput uses a JDOM Document and the XMLOutputter class to write the feed. Conveniently, the Document class has methods for adding processing instructions; in fact, JDOM has a ProcessingInstruction class. Putting all these things together, you’d get this if you were creating the RSS feeds for FeedBurner using ROME:

WireFeedOutput feedOutput = new WireFeedOutput();
Document doc = feedOutput.outputJDom(feed.createWireFeed());
// create the XSL processing instruction
Map xsl = new HashMap();
xsl.put("href", "http://feeds.feedburner.com/~d/styles/rss2full.xsl");
xsl.put("type", "text/xsl");
xsl.put("media", "screen");
ProcessingInstruction pXsl = new ProcessingInstruction("xml-stylesheet", xsl);
doc.addContent(0, pXsl);
// create the CSS processing instruction
Map css = new HashMap();
css.put("href", "http://feeds.feedburner.com/~d/styles/itemcontent.css");
css.put("type", "text/css");
css.put("media", "screen");
ProcessingInstruction pCss = new ProcessingInstruction("xml-stylesheet", css);
doc.addContent(1, pCss);
// write the document to the servlet response
XMLOutputter outputter = new XMLOutputter(Format.getPrettyFormat());
outputter.output(doc, response.getWriter());
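The output should look roughly like this (roughly, because the order of the pseudo-attributes depends on the Map, and HashMap doesn’t guarantee iteration order):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/rss2full.xsl" type="text/xsl" media="screen"?>
<?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/itemcontent.css" type="text/css" media="screen"?>
<rss version="2.0">
  ...
</rss>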

So now that I’ve made that easy for you, who’s going to show me the website that has a bunch of easy-to-use CSS and XSL templates for transforming RSS and ATOM feeds? Better yet, when are browsers going to have this kind of logic baked in so that my mom doesn’t have to look at an RSS feed that looks like a bunch of gobbledygook XML?