All posts by ajohnson

Links: 1-28-2008

deliciousposter 1.02 released and fun with html entities

I fixed a bug in deliciousposter that’s probably been wreaking havoc on anyone reading this site using an aggregator for a long time. You’re probably a nerd if you’re reading this post anyway, so I’ll bore you with the details. The deliciousposter project uses the delicious Java library to get a list of posts from del.icio.us, creates a blog post using Velocity (which is really nasty now that I’ve been using Freemarker for the last year) and then uses the MetaWeblog API to publish the resulting blog post to a blog. So the data gets pushed to del.icio.us originally:
Aaron’s Stuff » DeliciousPoster
returned from del.icio.us in XML
Aaron’s Stuff » DeliciousPoster
to the del.icio.us java library, which decodes the XML so that you again have this:
Aaron’s Stuff » DeliciousPoster
but then you post that using XML-RPC and you end up with something like this:
Aaron?s Stuff ? DeliciousPoster
Why? Because you need to escape any HTML entities before sending them along via XML-RPC. I used Commons-Lang, which has a utility for escaping HTML entities:
StringEscapeUtils.escapeHtml(yourstring)
Think that’s nerdy? Wait until you have to do the same thing with titles in RSS.
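To make the escaping step concrete, here is a minimal sketch that uses only the JDK. It is not the Commons-Lang implementation, and it is not the code in deliciousposter: `EntityEscapeSketch` and `escapeNonAscii` are hypothetical names, and the method emits numeric character references rather than the named entities a library might choose. The idea is the same, though: anything outside plain ASCII is turned into an ASCII-only entity before the string goes into the XML-RPC payload.

```java
public class EntityEscapeSketch {

    // Hypothetical helper: replaces XML-significant characters with named
    // entities and every code point above 7-bit ASCII with a numeric
    // character reference such as &#8217;.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (cp > 0x7F) {
                        // Decimal numeric character reference.
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The title from the post, with the curly apostrophe (U+2019)
        // and the right-pointing guillemet (U+00BB) written as escapes.
        String title = "Aaron\u2019s Stuff \u00BB DeliciousPoster";
        System.out.println(escapeNonAscii(title));
        // prints: Aaron&#8217;s Stuff &#187; DeliciousPoster
    }
}
```

Because the escaped string contains only ASCII, it should survive a transport that mangles other characters, and the blog on the receiving end renders the entities back into the original punctuation.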

The Data Life Cycle of a Blog Post

Cool Flash infographic in the latest issue of Wired that shows what happens to your blog post after you click the ‘publish’ button (I’ll save you the hassle of actually viewing it: after you click the ‘publish’ button, exciting things like ping servers, data miners, search engines, text scrapers, aggregators, social bookmarking sites, online media, spam blogs and finally readers get involved). Since it’s Wired and not XML Journal, they stopped at the infographic, but man, it sure would be cool to see all the ways that data gets massaged, reformatted, sliced and diced and transmitted, because there’s a lot that happens in that process. Just for the fun of it, I’m gonna walk through the scenarios I know about.

First, you click the publish button. But that might be a publish button on a desktop blogging client like Windows Live Writer or it might be the publish button in Microsoft Word or it might be a real live HTML button that says ‘publish’. So before you even get to the publish part, we’ve got the possibility of the MetaWeblog API (which is XML-RPC, effectively XML over HTTP), Atom Publishing Protocol (again effectively XML over HTTP) or a plain HTTP (or HTTPS!) POST.

OK, so now your blog post has been published on your blog. What next? Probably unbeknownst to you, your blog post has been automatically submitted to one or more ping servers using XML-RPC (XML over HTTP). Because search engines got into the blogging business, you can even ping Google and Yahoo (curiously not Microsoft, why?). If you don’t want to hassle with a bunch of different sites, you can always use pingomatic.com, which will ping (as of 1/27/2008) twenty-one different ping servers for you.
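For the curious, here is roughly what one of those pings looks like on the wire. This is a sketch under assumptions, not what any particular blogging tool sends: `weblogUpdates.ping` is the method name ping services conventionally accept (taking the blog name and the blog URL), while `PingPayloadSketch`, `buildPing`, and the example blog name and URL are made up for illustration. The code only builds the XML body; actually delivering it would be an HTTP POST with a `text/xml` content type to the ping server.

```java
public class PingPayloadSketch {

    // Escape the characters that would otherwise break the XML document.
    static String xmlEscape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Hypothetical helper: builds an XML-RPC methodCall body for a
    // weblogUpdates.ping request (blog name, then blog URL).
    static String buildPing(String blogName, String blogUrl) {
        return "<?xml version=\"1.0\"?>\n"
             + "<methodCall>\n"
             + "  <methodName>weblogUpdates.ping</methodName>\n"
             + "  <params>\n"
             + "    <param><value><string>" + xmlEscape(blogName)
             + "</string></value></param>\n"
             + "    <param><value><string>" + xmlEscape(blogUrl)
             + "</string></value></param>\n"
             + "  </params>\n"
             + "</methodCall>\n";
    }

    public static void main(String[] args) {
        // Placeholder values, not a real blog.
        System.out.println(buildPing("Example Blog", "http://example.com/"));
    }
}
```

That is the whole message: a method name and two strings wrapped in XML, which is why “XML over HTTP” is a fair summary of XML-RPC.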

Oh, I forgot to mention. If you’re using TypePad, LiveJournal or Vox, the information about your blog post isn’t sent to these ping servers using XML-RPC; it’s streamed as XML in real time over HTTP to many of the same parties.

Great, your blog post has now been sent to everyone, you’re good, right? Nope. Now comes the onslaught of spiders and bots, awoken by the ping you sent, who will request your feed (RSS / Atom over HTTP) and your blog post (HTML over HTTP) and your first born child again and again and again. And now that your blog post is published, and assuming that you’ve published something of value, you’ll see real people stop by and comment on your blog post and maybe bookmark it in a site like del.icio.us or ma.gnolia.com, snipping a quote from your blog post and then publishing that snippet to their own blogs or to their bug tracker. Now your blog post has replicated: it lives in small parts all over the web, each part getting published and spidered and syndicated and ripped again and again and again. It’s beautiful, isn’t it?

Links: 1-21-2008

Links: 1-20-2008

Links: 1-18-2008