Category Archives: Syndication

deliciousposter 1.02 released and fun with html entities

I fixed a bug in deliciousposter that’s probably been wreaking havoc on anyone reading this site using an aggregator for a long time. You’re probably a nerd if you’re reading this post anyway, so I’ll bore you with the details. The deliciousposter project uses the delicious Java library to get a list of posts from del.icio.us, creates a blog post using Velocity (which is really nasty now that I’ve been using Freemarker for the last year) and then uses the MetaWeblog API to publish the resulting blog post to a blog. So the data gets pushed to del.icio.us originally:
Aaron’s Stuff » DeliciousPoster
returned from del.icio.us in XML
Aaron’s Stuff » DeliciousPoster
to the del.icio.us java library, which decodes the XML so that you again have this:
Aaron’s Stuff » DeliciousPoster
but then you post that using XML-RPC and you end up with something like this:
Aaron?s Stuff ? DeliciousPoster
Why? Because you need to escape special characters as HTML entities before sending them along via XML-RPC. I used Commons-Lang, which has a utility for escaping HTML entities:
StringEscapeUtils.escapeHtml(yourstring)
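If you don’t have Commons-Lang handy, the same idea can be sketched with nothing but the JDK. This is not what deliciousposter does (it calls StringEscapeUtils.escapeHtml as shown above); the class and method names are made up for illustration, and the sketch emits numeric references instead of named entities:

```java
// Hypothetical helper, not part of deliciousposter or Commons-Lang.
final class EntityEscaper {

    // Replaces the HTML markup characters and every non-ASCII code point
    // with an entity, so the result is plain ASCII.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (cp > 127) {
                        // Numeric character reference, e.g. &#187;
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.append((char) cp);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The curly apostrophe and the guillemet are written as Unicode escapes
        // so the source file itself stays ASCII.
        System.out.println(escape("Aaron\u2019s Stuff \u00bb DeliciousPoster"));
    }
}
```

Because the escaped string is pure ASCII, it no longer matters which character encoding each hop in the chain assumes.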
Think that’s nerdy? Wait until you have to do the same thing with titles in RSS.

The Data Life Cycle of a Blog Post

Cool flash infographic in the latest issue of Wired that shows what happens to your blog post after you click the ‘publish’ button (I’ll save you the hassle of actually viewing it: after you click the ‘publish’ button, exciting things like ping servers, data miners, search engines, text scrapers, aggregators, social bookmarking sites, online media, spam blogs and finally readers get involved). Since it’s Wired and not XML Journal, they stopped at the infographic, but man, it sure would be cool to see all the ways that data is massaged, reformatted, sliced and diced and transmitted, because there’s a lot that happens in that process. Just for the fun of it, I’m gonna walk through the scenarios I know about.

First, you click the publish button. But that might be a publish button on a desktop blogging client like Windows Live Writer or it might be the publish button in Microsoft Word or it might be a real live HTML button that says ‘publish’. So before you even get to the publish part, we’ve got the possibility of the MetaWeblog API (which is XML-RPC, effectively XML over HTTP), Atom Publishing Protocol (again effectively XML over HTTP) or a plain HTTP (or HTTPS!) POST.

OK, so now your blog post has been published on your blog. What next? Probably unbeknownst to you, your blog post has been automatically submitted to one or more ping servers using XML-RPC (XML over HTTP). Because search engines got into the blogging business, you can even ping Google and Yahoo (curiously not Microsoft, why?). If you don’t want to hassle with a bunch of different sites, you can always use pingomatic.com, which will ping (as of 1/27/2008) twenty-one different ping servers for you.
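For the curious, here’s roughly what one of those pings looks like on the wire. This is a sketch that only builds the XML-RPC request body for the common weblogUpdates.ping method (blog name, blog URL) and doesn’t send it anywhere; the class name, blog name and URL are made up, and individual ping servers may accept extra parameters not shown here:

```java
// Sketch only: builds the request body, does not send it.
final class PingPayload {

    // Escapes the three characters that are unsafe in XML text content.
    static String xmlEscape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // weblogUpdates.ping takes two string parameters: blog name and blog URL.
    static String build(String blogName, String blogUrl) {
        return "<?xml version=\"1.0\"?>\n"
             + "<methodCall>\n"
             + "  <methodName>weblogUpdates.ping</methodName>\n"
             + "  <params>\n"
             + "    <param><value><string>" + xmlEscape(blogName) + "</string></value></param>\n"
             + "    <param><value><string>" + xmlEscape(blogUrl) + "</string></value></param>\n"
             + "  </params>\n"
             + "</methodCall>\n";
    }

    public static void main(String[] args) {
        // Placeholder blog name and URL.
        System.out.print(build("Example Blog", "http://blog.example.com/"));
    }
}
```

The body gets sent as an HTTP POST with a Content-Type of text/xml, which is all "XML over HTTP" really means here.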

Oh, I forgot to mention. If you’re using TypePad, LiveJournal or Vox, the information about your blog post isn’t sent to these ping servers using XML-RPC; it’s streamed as XML in real-time over HTTP to many of the same parties.

Great, your blog post has now been sent to everyone, you’re good, right? Nope. Now comes the onslaught of spiders and bots, awoken by the ping you sent, who will request your feed (RSS / Atom over HTTP) and your blog post (HTML over HTTP) and your first born child again and again and again. And now that your blog post is published, and assuming that you’ve published something of value, you’ll see real people stop by and comment on your blog post and maybe bookmark it in a site like del.icio.us or ma.gnolia.com, snipping a quote from your blog post and then publishing that snippet to their own blogs or to their bug tracker. Now your blog post has replicated: it lives in small parts all over the web, each part getting published and spidered and syndicated and ripped again and again and again. It’s beautiful, isn’t it?

Java, Commons HTTP Client and HTTP proxies

If you’re living at a giant corporation during the day and you want to browse the web, you’re probably going through some sort of proxy to the outside world. Most of the time you don’t care, but if you’re writing a Java application that needs to access resources on the other side of said proxy (i.e. the rest of the world), you’ll eventually end up over here. That wonderful document will hook you up with all you need to know about setting the proxy host, port and optionally a username and password for your proxy as long as you’re using URLConnection, HttpURLConnection or anything that deals with the class URL. If you’re really a go-getter you might even browse over here and read all about how to utilize those properties on the command line, in code or when you’re deployed inside of Tomcat.
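In code, the host and port part boils down to a handful of system properties. A minimal sketch, assuming the standard http.proxyHost / http.proxyPort properties (and their https twins); the class name and the proxy host are placeholders, and proxy credentials are left out:

```java
// Hypothetical helper: sets the JVM-wide proxy properties in code.
final class ProxySettings {

    // Sets the standard proxy properties read by java.net.URL connections.
    static void useHttpProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
        // HTTPS traffic has its own pair of properties.
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", Integer.toString(port));
    }

    public static void main(String[] args) {
        // Placeholder proxy; substitute your own host and port.
        useHttpProxy("proxyserver.example.com", 8080);
        System.out.println(System.getProperty("http.proxyHost")
            + ":" + System.getProperty("http.proxyPort"));
    }
}
```

The command line version is the same thing with -D flags, e.g. -Dhttp.proxyHost=proxyserver.example.com -Dhttp.proxyPort=8080.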

Some of you won’t be so lucky: you’ll eventually want to use some advanced tools that abstract you away from having to fetch InputStreams to get your feeds and will instead depend on the Commons HTTP Client, which unfortunately (or fortunately, depending on your point of view) doesn’t care about those nice little system properties that java.net.URL likes and instead goes off and uses sockets directly. In that case you have to do something like this:

import org.apache.commons.httpclient.HttpClient;

// Route every request made through this client via the proxy.
HttpClient client = new HttpClient();
client.getHostConfiguration().setProxy("proxyserver.example.com", 8080);

and if you want to provide a username and password for said proxy:

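import org.apache.commons.httpclient.HttpState;
import org.apache.commons.httpclient.UsernamePasswordCredentials;

// null realm and null host: use these credentials for any proxy.
HttpState state = new HttpState();
state.setProxyCredentials(null, null,
   new UsernamePasswordCredentials("username", "password"));
client.setState(state);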

which is all fine and dandy but sometimes I just wish the world were simpler. Ya know?

Using ROME to get the body / summary of an item

I’ve been using ROME for a couple years now and I’m still learning new things. Today I was working on an issue in Clearspace where we give users the ability to show RSS / Atom feeds in a widget, optionally giving them the choice to show the full content of each item in the feed or just a summary of each item in the feed. The existing logic / pseudo-code looked something like this:

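// Older ROME releases return raw Lists, hence the casts.
for (SyndEntry entry : (List<SyndEntry>) feed.getEntries()) {
  if (showFullContent) {
    // Assumes contents is never empty.
    write(((SyndContent) entry.getContents().get(0)).getValue());
  } else {
    // Assumes description is always a short summary.
    write(entry.getDescription().getValue());
  }
}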

The assumption was that description would return a summary and contents would return the full content. The problem is that Atom and RSS are spec’ed, umm, differently. RSS 2.0 says that ‘description’ is a synopsis of the item but then goes on in an example to show how the description can be much more than just a short plain text description. So you’re left with descriptions that aren’t really a synopsis but the full content… sometimes, and sometimes not. Then Atom came along with well defined atom:summary and atom:content elements, which means ROME had to figure out a way to map the description and content:encoded elements in RSS to atom:summary and atom:content. Dave Johnson summarized the mappings nicely in a blog post discussing the release of ROME 0.9. In short, the mapping looks like this:

RSS <description> <--> SyndEntry.description <--> Atom <summary>
RSS <content:encoded> <--> SyndEntry.contents[0] <--> Atom <content>

Anyway, all this is to say that if you’re doing any work with SyndEntry, you’ll need to check both description and contents. Generally, if you’re looking for the full content, check the value of contents first. If that’s null, check the value of description. If you’re looking for a summary, check the value of description first BUT don’t assume that you’ll actually get a short summary. Use something like StringUtils.abbreviate(…) to make certain that you’ll get a short summary back and not the entire content.
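Here’s a rough sketch of that fallback logic. To keep it runnable without ROME or Commons-Lang on the classpath, FeedItem is a made-up stand-in for the two SyndEntry fields discussed above and abbreviate is a simplified imitation of StringUtils.abbreviate; falling back to contents when description is missing is my assumption, not something ROME does for you:

```java
// Stand-in for the two SyndEntry fields discussed above; not a ROME class.
final class FeedItem {
    final String description; // may be null
    final String content;     // may be null

    FeedItem(String description, String content) {
        this.description = description;
        this.content = content;
    }
}

final class ItemText {

    // Simplified imitation of Commons-Lang's StringUtils.abbreviate:
    // truncates to maxWidth characters, the last three being "...".
    static String abbreviate(String s, int maxWidth) {
        if (maxWidth < 4) {
            throw new IllegalArgumentException("maxWidth must be at least 4");
        }
        if (s == null || s.length() <= maxWidth) {
            return s;
        }
        return s.substring(0, maxWidth - 3) + "...";
    }

    // Full content: prefer contents, fall back to description.
    static String fullContent(FeedItem item) {
        return item.content != null ? item.content : item.description;
    }

    // Summary: prefer description, but never trust it to be short.
    static String summary(FeedItem item, int maxWidth) {
        String text = item.description != null ? item.description : item.content;
        return abbreviate(text, maxWidth);
    }

    public static void main(String[] args) {
        // An RSS-style item whose description is really the whole post.
        FeedItem rssOnly = new FeedItem("A long description that is really the whole post", null);
        System.out.println(fullContent(rssOnly));
        System.out.println(summary(rssOnly, 10));
    }
}
```

With real ROME objects you’d pull the two strings out of SyndEntry first and then apply the same two rules.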

instantFeeds 1.0.4

I rolled out a new version of instantFeeds tonight; you can read all about the new features / bug fixes here. The big feature is that now all notifications will include the title, link and summary of every item in the feed whose publication date is later than the date of the last notification the system sent. I had a number of people write in to tell me they were using it and that they could use that exact feature.

Thanks to jas osborne for the patch that got me started again!

instantFeeds 1.0.3

New version of instantFeeds: version 1.0.3. It includes two new features: you can now turn off your notifications by sending the command ‘off’ (kind of like an out of office feature) and turn them back on by sending the command ‘on’, and the notification you get sent now includes an approximately 255 character summary of the latest item. Additionally, I fixed the package naming issues (Wildfire recently had to change its name to Openfire and all the package names had to be updated as well).

As always, you can check out the release notes, the source repository or just skip to the good parts and download the plugin.

Communication Multiplexers

From an interview with one of the developers on the twitter.com team:

I think the real power of Twitter is its ability to channel over different mediums at the user’s whim. IM, SMS, email, and the web are just transports as far as Twitter is concerned. Generally, you have to go out and get information via whatever medium that information is on. With Twitter, information can come to you via whatever medium you prefer. Or, if you want some space, you can easily turn off the information tap with a simple “off” command. That’s powerful.

I linked to a blog post by Tim O’Reilly a couple days ago that summarized this feature by calling it a ‘communications multiplexer’… There are other companies that do interesting things in this space in different ways: rasasa.net, zaptxt.com, feedcrier.com, etc… It’s also one of the ways I’d like to evolve the instantFeeds plugin I wrote: be able to send an email, an IM, an SMS or maybe even a message into a web page, so you get only the information you want, delivered using the medium of your choice.