Category Archives: XML

deliciousposter 1.02 released and fun with html entities

I fixed a bug in deliciousposter that’s probably been wreaking havoc on anyone reading this site using an aggregator for a long time. You’re probably a nerd if you’re reading this post anyway, so I’ll bore you with the details. The deliciousposter project uses the delicious java library to get a list of posts from del.icio.us, creates a blog post using Velocity (which is really nasty now that I’ve been using Freemarker for the last year) and then uses the MetaWeblog API to publish the resulting blog post to a blog. So the data gets pushed to del.icio.us originally:
Aaron’s Stuff » DeliciousPoster
returned from del.icio.us in XML
Aaron’s Stuff » DeliciousPoster
to the del.icio.us java library, which decodes the XML so that you again have this:
Aaron’s Stuff » DeliciousPoster
but then you post that using XML-RPC and you end up with something like this:
Aaron?s Stuff ? DeliciousPoster
Why? Because you need to escape any HTML entities before sending them along via XML-RPC. I used Commons-Lang, which has a utility for escaping HTML entities:
StringEscapeUtils.escapeHtml(yourstring)
Think that’s nerdy? Wait until you have to do the same thing with titles in RSS.
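
If you can't pull in Commons-Lang, the same idea can be sketched in plain Java. This is only an illustration, not what Commons-Lang does internally: the class and method names are made up, it writes numeric character references rather than the named entities Commons-Lang can emit, and it deliberately leaves markup characters such as < and & alone.

```java
final class EntityEscaper {
    private EntityEscaper() {}

    /** Replaces every non-ASCII code point with a numeric character reference. */
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (codePoint > 0x7F) {
                // e.g. U+2019 (right single quote) becomes &#8217;
                out.append("&#").append(codePoint).append(';');
            } else {
                out.appendCodePoint(codePoint);
            }
            // advance by one or two chars, depending on the code point
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("Aaron\u2019s Stuff \u00BB DeliciousPoster"));
    }
}
```

Once the string is pure ASCII, it no longer matters which character encoding the XML-RPC layer assumes on the wire.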

The Data Life Cycle of a Blog Post

Cool flash infographic in the latest issue of Wired that shows what happens to your blog post after you click the ‘publish’ button (I’ll save you the hassle of actually viewing it: after you click the ‘publish’ button, exciting things like ping servers, data miners, search engines, text scrapers, aggregators, social bookmarking sites, online media, spam blogs and finally readers get involved). Since it’s Wired and not XML Journal, they stopped at the infographic, but man, it sure would be cool to see all the ways that data is massaged, reformatted, sliced and diced and transmitted, because there’s a lot that happens in that process. Just for the fun of it, I’m gonna walk through the scenarios I know about.

First, you click the publish button. But that might be a publish button on a desktop blogging client like Windows Live Writer or it might be the publish button in Microsoft Word or it might be a real live HTML button that says ‘publish’. So before you even get to the publish part, we’ve got the possibility of the MetaWeblog API (which is XML-RPC, effectively XML over HTTP), Atom Publishing Protocol (again effectively XML over HTTP) or a plain HTTP (or HTTPS!) POST.

OK, so now your blog post has been published on your blog. What next? Probably unbeknownst to you, your blog post has been automatically submitted to one or more ping servers using XML-RPC (XML over HTTP). Because search engines got into the blogging business, you can even ping Google and Yahoo (curiously not Microsoft, why?). If you don’t want to hassle with a bunch of different sites, you can always use pingomatic.com, which will ping (as of 1/27/2008) twenty one different ping servers for you.
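
To make the 'XML over HTTP' part concrete, here is a rough sketch of the request body a blog might POST to a ping server. It assumes the conventional weblogUpdates.ping method, which takes the blog's name and URL as parameters. The class name, helper names and example values are made up for illustration, and the HTTP POST itself is left out.

```java
final class PingRequest {
    private PingRequest() {}

    /** Escapes the characters that would otherwise break the XML payload. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds the XML-RPC methodCall document for a weblogUpdates.ping request. */
    static String build(String blogName, String blogUrl) {
        return "<?xml version=\"1.0\"?>\n"
             + "<methodCall>\n"
             + "  <methodName>weblogUpdates.ping</methodName>\n"
             + "  <params>\n"
             + "    <param><value><string>" + escapeXml(blogName) + "</string></value></param>\n"
             + "    <param><value><string>" + escapeXml(blogUrl) + "</string></value></param>\n"
             + "  </params>\n"
             + "</methodCall>\n";
    }

    public static void main(String[] args) {
        // hypothetical blog name and URL
        System.out.println(build("My Blog", "http://www.example.com/blog/"));
    }
}
```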

Oh, I forgot to mention. If you’re using TypePad, Livejournal or Vox, the information about your blog post isn’t sent to these ping servers using XML-RPC, it’s streamed as XML in real-time over HTTP to many of the same parties.

Great, your blog post has now been sent to everyone, you’re good right? Nope. Now comes the onslaught of spiders and bots, awoken by the ping you sent, who will request your feed (RSS / Atom over HTTP) and your blog post (HTML over HTTP) and your first born child again and again and again. And now that your blog post is published and assuming that you’ve published something of value, you’ll see real people stop by and comment on your blog post and maybe bookmark it in a site like del.icio.us or ma.gnolia.com, snipping a quote from your blog post and then publishing that snippet to their own blogs or to their bug tracker and now your blog post has replicated, it lives in small parts all over the web, each part getting published and spidered and syndicated and ripped again and again and again. It’s beautiful isn’t it?

Debugging SOAP / XFire with ethereal

I’ve spent way more time than I should have the last couple weeks working to help migrate a website built against Jive Forums to run against a Clearspace X instance. As part of the migration, one of the things I did was to move all the data syndication that had been done with RSS and custom namespaces to use the Clearspace SOAP API, which is built on a technology called XFire. The first problem I ran into was that the production website was configured so that requests to http://example.com were redirected to http://www.example.com/, which resulted in errors like this in the logs:

Jul 5, 2007 11:30:11 PM org.apache.commons.httpclient.HttpMethodDirector isRedirectNeeded
INFO: Redirect requested but followRedirects is disabled

That error was pretty easy to fix (swap in http://www.example.com in place of http://example.com), but the next thing I ran into was way less intuitive. When I invoked a certain service, I’d get a stack trace that looked like this:

Exception in thread "main" org.codehaus.xfire.XFireRuntimeException: Could not invoke service.. 
Nested exception is org.codehaus.xfire.fault.XFireFault: Unexpected character '-' (code 45) in prolog; expected '<'
 at [row,col {unknown-source}]: [2,1]
org.codehaus.xfire.fault.XFireFault: Unexpected character '-' (code 45) in prolog; expected '<'
 at [row,col {unknown-source}]: [2,1]
	at org.codehaus.xfire.fault.XFireFault.createFault(XFireFault.java:89)
	at org.codehaus.xfire.client.Client.onReceive(Client.java:386)

which was troubling because the exact same SOAP method invocation worked fine on both my local machine and in the test environment. What was different? Two things: the production system was running on Java 6 and the production system was configured to run behind an Apache HTTP server proxied by mod_caucho versus no Apache HTTP server / proxy in development or on my machine. I needed to see what was going on between the server and the client (one of the things that makes SOAP so hard is that you can't just GET a URL to see what's being returned) so I fired up ethereal at the behest of one of my coworkers. I kicked off a couple of SOAP requests with ethereal running, recorded the packets and then analyzed the capture. Said coworker then pointed out the key to debugging HTTP requests with ethereal: right click on the TCP packet you're interested in and then click 'Follow TCP Stream'. The invocation response looked like this when run against the development environment:

HTTP/1.1 200 OK
Date: Mon, 02 Jul 2007 21:59:30 GMT
Server: Resin/3.0.14
Content-Type: multipart/related; type="application/xop+xml"; start=""; start-info="text/xml"; .boundary="----=_Part_5_25686393.1183413571061"
Connection: close
Transfer-Encoding: chunked

1dce

------=_Part_5_25686393.1183413571061
Content-Type: application/xop+xml; charset=UTF-8; type="text/xml"
Content-Transfer-Encoding: 8bit
Content-ID: 
...

and looked like this when invoked against the production instance:

HTTP/1.1 200 OK
Date: Mon, 02 Jul 2007 21:41:56 GMT
Server: Apache/2.0.52 (Red Hat)
Vary: Accept-Encoding,User-Agent
Cache-Control: max-age=0
Expires: Mon, 02 Jul 2007 21:41:56 GMT
Transfer-Encoding: chunked
Content-Type: text/plain; charset=UTF-8
X-Pad: avoid browser bug

24e

------=_Part_29_31959705.1183412516805
Content-Type: application/xop+xml; charset=UTF-8; type="text/xml"
Content-Transfer-Encoding: 8bit
Content-ID: 
...

Notice the different content type returned by the production server? So then the mystery became not 'what?' but 'who?' I googled around for a bit and found a bug filed against JIRA that had all the same symptoms as the problem I was running into: the solution posted in the comments of the bug said that the problem was with mod_caucho. I worked with the ISP that hosts the production instance of Clearspace, got them to remove mod_caucho and use mod_proxy to isolate that piece of the puzzle and sure enough, the problem went away. Our ISP recommended that we not settle for mod_proxy for the entire site and instead wrote up a nifty solution using mod_rewrite and mod_proxy, which I've pasted below:

 RewriteRule ^/clearspace/rpc/soap(/?(.*))$ to://www.example.com:8080/clearspace/rpc/soap$1
 RewriteRule ^to://([^/]+)/(.*)    http://$1/$2   [E=SERVER:$1,P,L]
 ProxyPassReverse /community/rpc/soap/ http://www.example.com/clearspace/rpc/soap/

Hope that helps someone down the road!
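
If you'd rather catch this kind of thing without firing up a packet sniffer, a small check on the Content-Type header can flag the difference between the two responses above. This is just a sketch with made-up class and method names; it assumes you can get at the raw header value from whatever HTTP client you're using.

```java
import java.util.Locale;

final class SoapContentTypeCheck {
    private SoapContentTypeCheck() {}

    /**
     * Returns true when the header looks like the multipart/related XOP
     * response seen in the development capture, false otherwise.
     */
    static boolean looksLikeXopMultipart(String contentType) {
        if (contentType == null) {
            return false;
        }
        String normalized = contentType.trim().toLowerCase(Locale.ROOT);
        return normalized.startsWith("multipart/related")
            && normalized.contains("application/xop+xml");
    }

    public static void main(String[] args) {
        // the header the production proxy was sending back
        System.out.println(looksLikeXopMultipart("text/plain; charset=UTF-8"));
    }
}
```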

ROME and wfw namespace elements

I created a ROME parser and generator for wfw:comment and wfw:commentRss elements today. You can read all about it here and download the source code here.

Not sure what the wfw:comment or wfw:commentRss elements are for? Imagine you’re reading my blog in a desktop aggregator and you want to post a comment to my blog. The aggregator has no way of knowing where your comments should be HTTP posted to, so you have to open up another tab in Firefox and go to my blog to enter a comment. wfw:comment is an element that provides the HTTP post endpoint in the feed so that aggregators and widget developers can comment inline, just like this. wfw:commentRss is an element that provides you (or your favorite aggregator) with a link to the comments for the post you’re currently viewing.
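
To show what an aggregator would actually be looking for, here is a small sketch that pulls both elements out of an RSS item using only the JDK's DOM parser (not the ROME module mentioned above). The namespace URI is the one the wfw elements are defined under; the class name, the sample item and its URLs are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class WfwReader {
    static final String WFW_NS = "http://wellformedweb.org/CommentAPI/";

    // hypothetical item, trimmed down to the interesting elements
    static final String SAMPLE_ITEM =
          "<item xmlns:wfw=\"" + WFW_NS + "\">"
        + "<title>Example post</title>"
        + "<wfw:comment>http://example.com/blog/1/comments</wfw:comment>"
        + "<wfw:commentRss>http://example.com/blog/1/comments/feed</wfw:commentRss>"
        + "</item>";

    private WfwReader() {}

    /** Returns the text of the first wfw element with the given local name, or null. */
    static String firstWfwValue(String itemXml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the namespace lookup below
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(itemXml)));
            NodeList nodes = doc.getElementsByTagNameNS(WFW_NS, localName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse item", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("post comments to: " + firstWfwValue(SAMPLE_ITEM, "comment"));
        System.out.println("read comments at: " + firstWfwValue(SAMPLE_ITEM, "commentRss"));
    }
}
```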

ROME and custom Generator elements

The majority of the articles I’ve seen on using ROME to create Atom and RSS feeds don’t show you how to customize the optional ‘generator’ element that both Atom and RSS support. It’s really easy. I’m assuming that you’ve already created and populated a SyndFeed instance:

SyndFeed feed = ..
WireFeedOutput feedOutput = new WireFeedOutput();
WireFeed wireFeed = feed.createWireFeed();
if (wireFeed.getFeedType().startsWith("atom")) {
  Feed atomFeed = (Feed)wireFeed;
  Generator gen = new Generator();
  gen.setUrl("http://yoursite.com/");
  gen.setValue("Your Site");
  gen.setVersion("1.0");
  atomFeed.setGenerator(gen);
  feedOutput.output(atomFeed, ...);
} else {
  Channel rssFeed = (Channel)wireFeed;
  rssFeed.setGenerator("Your Site 1.0 (http://yoursite.com/)");
  feedOutput.output(rssFeed, ...);
}

Make a decision about which feed you’re going to produce, cast to the appropriate implementation of WireFeed (Channel for RSS and Feed for Atom), and then use the appropriate setters on each of those respective classes.

RSS, Processing Instructions and Firefox 2.0

A couple weeks ago I posted an article that describes how you can use XML stylesheets with ROME to create RSS / Atom feeds that are ‘user friendly’ (like the ones that FeedBurner produces). Firefox 2.0, released just this past week, has a new feature which renders my article moot. The “Previewing and subscribing to Web feeds” feature is a good one for most non-technical users: click on a link that results in an RSS feed and Firefox will recognize the RSS content type and show you the RSS styled with their own style sheet even if the feed has provided a style sheet. The current workaround, should you want your visitors to see your style sheet instead of the default Firefox 2.0 stylesheet, is to

“…put in a comment ranting about the evils of sniffing web content and overriding the desires of web developers which is long enough to move “<rss” or “<feed” out of the first 512 bytes, since that’s all we sniff.”
(source)

There’s an extremely lively discussion going on in the mozilla.dev.apps.firefox newsgroup and on bugzilla about this behavior.

More here, here, here, here and here.
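
For the curious, the workaround quoted above can be sketched in a few lines of Java: find the root element and, if it starts inside the first 512 bytes, put a comment in front of it. The class and method names are made up, the padding is plain spaces rather than a rant, and the sketch assumes the feed is served as UTF-8 and that the 512 byte window from the quote is accurate.

```java
import java.nio.charset.StandardCharsets;

final class FeedSnifferPadding {
    static final int SNIFF_WINDOW_BYTES = 512;

    private FeedSnifferPadding() {}

    /** Index of the first "<rss" or "<feed" in the document, or -1 if neither is found. */
    static int rootIndex(String feedXml) {
        int rss = feedXml.indexOf("<rss");
        int feed = feedXml.indexOf("<feed");
        if (rss < 0) return feed;
        if (feed < 0) return rss;
        return Math.min(rss, feed);
    }

    /** Inserts a comment before the root element so it starts past the sniff window. */
    static String padPastSniffWindow(String feedXml) {
        int root = rootIndex(feedXml);
        if (root < 0) return feedXml;
        int prefixBytes = feedXml.substring(0, root).getBytes(StandardCharsets.UTF_8).length;
        if (prefixBytes >= SNIFF_WINDOW_BYTES) return feedXml; // already far enough in
        StringBuilder comment = new StringBuilder("<!-- ");
        // slightly over-pads: the comment delimiters add a few more bytes on top
        for (int i = 0; i < SNIFF_WINDOW_BYTES - prefixBytes; i++) {
            comment.append(' ');
        }
        comment.append(" -->\n");
        // the comment goes right before the root, so any XML declaration stays first
        return feedXml.substring(0, root) + comment + feedXml.substring(root);
    }
}
```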

XSL / CSS Processing Instructions using ROME

Have you seen the way that the smart guys at FeedBurner display RSS feeds in a browser (here’s a sample if you haven’t)? If you’re like me, the first time you see a feed they manage, you’ll probably think that you’re viewing a page that contains a link to an RSS or ATOM feed, not the actual feed. In fact what you’re seeing is the feed transformed by your browser using XSL and CSS. Take a peek at the source and you’ll see that the XSL and CSS transformations are produced by what are technically called processing instructions. I won’t go into the work that they’ve done to create that look (but if you poke around the source it’s not trivial), but the inclusion of the processing instruction… now that’s something I can help you out with. I’ve used ROME on a couple different projects now because it is trivial to create RSS, RSS2 or ATOM feeds with only a couple lines of code. There are a number of tutorials up on the ROME site that show you how to create a feed and then write to a String, a File or a Writer which you should check out if you don’t have any experience with ROME. Assuming that you do however and given that you have a feed (technically a SyndFeed in ROME), you would write it out in a servlet like this:

SyndFeedOutput output = new SyndFeedOutput();
output.output(feed, response.getWriter());

ROME abstracts you from having to work directly with an XML document, which is handy most of the time. But if you want to add a processing instruction to style your feed, it takes a little fiddling with the source. If you peek at the source of SyndFeedOutput, you’ll see that it wraps WireFeedOutput. WireFeedOutput uses a JDOM Document and the XMLOutputter class to write the feed. Conveniently, the Document class accepts processing instructions as content; in fact, JDOM has a ProcessingInstruction class. Putting all these things together, you’d get this if you were creating the RSS feeds for FeedBurner using ROME:

WireFeedOutput feedOutput = new WireFeedOutput();
Document doc = feedOutput.outputJDom(feed.createWireFeed());
// create the XSL processing instruction
Map xsl = new HashMap();
xsl.put("href", "http://feeds.feedburner.com/~d/styles/rss2full.xsl");
xsl.put("type", "text/xsl");
xsl.put("media", "screen");
ProcessingInstruction pXsl = new ProcessingInstruction("xml-stylesheet", xsl);
doc.addContent(0, pXsl);
// create the CSS processing instruction
Map css = new HashMap();
css.put("href", "http://feeds.feedburner.com/~d/styles/itemcontent.css");
css.put("type", "text/css");
css.put("media", "screen");
ProcessingInstruction pCss = new ProcessingInstruction("xml-stylesheet", css);
doc.addContent(1, pCss);
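// write the document to the servlet response
// pretty-print the output; Format.getCompactFormat() would also work
Format format = Format.getPrettyFormat();
XMLOutputter outputter = new XMLOutputter(format);
outputter.output(doc, response.getWriter());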

So now that I’ve made that easy for you, who’s going to show me the website that has a bunch of easy to use CSS and XSL templates for transforming RSS and ATOM feeds? Better yet, when are browsers going to have this kind of logic baked in so that my mom doesn’t have to look at an RSS feed that looks like a bunch of gobbledygook XML?
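
If you aren't using ROME or JDOM, the same trick works with the DOM and transformer classes that ship with the JDK. The sketch below is an alternative to the ROME code above, not a replacement for it; the class name and the stylesheet path in the usage example are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

final class StylesheetInstruction {
    private StylesheetInstruction() {}

    /** Parses the feed, adds an xml-stylesheet instruction before the root, and re-serializes it. */
    static String addStylesheet(String feedXml, String href, String type) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(feedXml)));
            ProcessingInstruction pi = doc.createProcessingInstruction(
                    "xml-stylesheet", "href=\"" + href + "\" type=\"" + type + "\"");
            // processing instructions must come before the root element to take effect
            doc.insertBefore(pi, doc.getDocumentElement());
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not add stylesheet instruction", e);
        }
    }

    public static void main(String[] args) {
        String feed = "<rss version=\"2.0\"><channel/></rss>";
        System.out.println(addStylesheet(feed, "/styles/feed.xsl", "text/xsl"));
    }
}
```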

JSON: Making Content Syndication easier

At work we’ve been having some discussions about sharing content between two websites: the natural first option was an XML solution, in this case RSS. Site A would subscribe to the RSS feeds of the site B, periodically retrieving the updated feeds, caching the contents of each feed for a specified period of time all the while displaying the resulting content on various parts of site A.

A couple months ago (December 2005 to be exact), Yahoo started supporting JSON (a lightweight data interchange format which stands for JavaScript Object Notation) as an optional result format for some of its web services. The most common thing said about JSON is that it’s better than XML, usually meaning that it’s easier to parse and not as verbose. Here’s a well written comparison of XML and JSON if you don’t believe me. While the comparisons of simplicity, openness and interoperability are useful, I think JSON really stands out when you’re working in a browser. Going back to the example I used above where site A needs to display content from site B, as I see it, this is a sample runtime / flow that bits travel through in order to make the syndication work:

every_n_seconds() --> retrieve_feed() --> store_feed_entries()

and then, per request to site A:

make_page() --> get_feed_entries() --> parse_entries() --> display_entries()

There are a number of libraries built in Java for creating and parsing RSS, some for fetching RSS, and there’s even a JSP taglib for displaying RSS. But even with all the libraries, there’s still a good amount of code to write and a number of moving parts you’ll need to maintain. If you do the syndication on the client side using JSON, there are no moving parts. To display just the title of each one of my del.icio.us posts as an example, you would end up with something like this:

<script type="text/javascript" src="http://del.icio.us/feeds/json/ajohnson1200"></script>
<script type="text/javascript">
for (var i=0, post; post = Delicious.posts[i]; i++) {
  document.write(post.d + '<br />');
}
</script>

I’m comparing apples to oranges (server side RSS retrieval, storage, parse and display against client side JSON include) but there are a couple of non obvious advantages and disadvantages:

  1. Caching: If used on a number of pages, syndicated JSON content can reduce the number of bits a browser has to download to fully render a page. For example, let’s say (for argument’s sake) that we have an RSS feed that is 17k in size and a corresponding JSON feed of the same size (even though RSS would inevitably be bigger). Using the server side RSS syndication, the browser will have to download the rendered syndicated content (again let’s say it’s 17k) with every page. Using the JSON syndicated feed across a number of page views, the browser would download the 17k JSON feed once and then not again (assuming the server has been configured to send a 304) until the feed has a new item. Winner: JSON / client
  2. Rendering: Of course, having the browser parse and render a 17k JSON feed wouldn’t be trivial. From a pure speed standpoint, the server could do the parse / generate once and then use an HTML rendering of the feed from cache from then on. Winner: RSS / server
  3. Searching: Using JSON on the client, site A (which is syndicating content from site B) wouldn’t have any way of searching the content, outside of retrieving / parsing / storing on the server. Also, spiders wouldn’t see the syndicated content from site B on site A, unlike the server side RSS syndication where the syndicated content would look no different to a spider than the other content on site A. Winner: RSS / server
  4. Ubiquity: JSON ‘only’ works if the browser has JavaScript enabled, which I’m guessing the large majority of users do. But certain environments won’t, and phones, set top boxes and anything else that runs in a browser but not on a PC may not have JavaScript, which means they won’t see the syndicated content. Server side generated content will be available across any platform that understands HTML. Winner: RSS / server
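
The server side half of that comparison (retrieve the feed, keep it for a specified period, serve pages from the stored copy) could be sketched roughly like this. It's a minimal illustration with made-up names: it refreshes lazily on request instead of on a timer, treats the feed as an opaque String, and takes the clock as a parameter so the expiry logic is easy to exercise.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class CachedFeed {
    private final Supplier<String> fetcher; // stands in for retrieve_feed()
    private final long ttlMillis;           // how long a stored copy stays fresh
    private final LongSupplier clock;       // e.g. System::currentTimeMillis
    private String cached;
    private long fetchedAt;

    CachedFeed(Supplier<String> fetcher, long ttlMillis, LongSupplier clock) {
        this.fetcher = fetcher;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the stored feed, fetching a new copy only when the old one has expired. */
    synchronized String get() {
        long now = clock.getAsLong();
        if (cached == null || now - fetchedAt >= ttlMillis) {
            cached = fetcher.get();
            fetchedAt = now;
        }
        return cached;
    }
}
```

In a real site you would wire it up with something like new CachedFeed(() -> download(url), 15 * 60 * 1000L, System::currentTimeMillis), where download is whatever HTTP fetch you already have.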

So wrapping up, when should you use JSON on the client and when should you use RSS on the server? If you need to syndicate a small amount of content to non programmers who can cut and paste (or programmers who are adept at JavaScript), JSON seems like the way to go. It’s trivial to get something up and running, the browser will cache the feed you create and your users will see the new content as soon as it becomes available in your JSON feed.

If you’ve read this far, you should go on and check out the examples on developer.yahoo.com and on del.icio.us. Also, if you’re a Java developer, you should head on over to sourceforge.net to take a look at the JSON-lib, which makes it wicked easy to create JSON from lists, arrays and beans.

FluentInterface

A couple of weeks ago on the DWR users list, in the context of needing to wire up DWR without using an XML file, Joe Walker pointed to a blog posting by Martin Fowler. In it, Martin discusses an interface style called a ‘fluent interface’. It’s a little difficult to describe in words (so check it out in action in the above-mentioned blog post) but I think Piers Cawley described it best when he described the style as “…essentially interfaces that do a good job of removing hoopage.” Update: Geert Bevin uses this style in the RIFE framework and was calling it “chainable builder methods” before Martin came along with the ‘fluent interface’ term.
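
Since the style is easier to show than to describe, here is about the smallest example I can think of. The Query class is entirely made up (it's not from Martin's post, RIFE or DWR); the only point is that every configuring method returns this, so the calls can be chained.

```java
final class Query {
    private String table = "";
    private String condition;
    private int limit = -1;

    // each method returns 'this', which is what makes the chain possible
    Query from(String table) { this.table = table; return this; }
    Query where(String condition) { this.condition = condition; return this; }
    Query limit(int limit) { this.limit = limit; return this; }

    /** Ends the chain and renders the collected settings. */
    String toSql() {
        StringBuilder sql = new StringBuilder("SELECT * FROM ").append(table);
        if (condition != null) sql.append(" WHERE ").append(condition);
        if (limit >= 0) sql.append(" LIMIT ").append(limit);
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(new Query().from("posts").where("tag = 'xml'").limit(10).toSql());
    }
}
```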

Back to DWR.  I spent the last couple days working on a ‘fluent’ way of configuring DWR which obviously then wouldn’t require dwr.xml, the result of which is available here. In short, given an XML configuration file that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dwr PUBLIC "-//GetAhead Limited//DTD Direct Web Remoting
1.0//EN" "http://www.getahead.ltd.uk/dwr/dwr10.dtd">
<dwr>
  <init>
    <converter id="testbean" class="uk.ltd.getahead.testdwr.TestBean2Converter"/>
  </init>
  <allow>
    <create creator="new" javascript="Test" scope="application">
      <param name="class" value="uk.ltd.getahead.testdwr.Test"/>
    </create>
    <create creator="new" javascript="JDate">
      <param name="class" value="java.util.Date"/>
      <exclude method="getHours"/>
      <auth method="getMinutes" role="admin"/>
      <auth method="getMinutes" role="devel"/>
    </create>
    <convert converter="bean" match="$Proxy*"/>
    <convert converter="testbean" match="uk.ltd.getahead.testdwr.TestBean"/>
    <convert converter="bean" match="uk.ltd.getahead.testdwr.ObjB"/>
    <convert converter="object" match="uk.ltd.getahead.testdwr.ObjA">
      <param name="force" value="true"/>
    </convert>
  </allow>
  <signatures>
  <![CDATA[
  import java.util.*;
  import uk.ltd.getahead.testdwr.*;
  Test.testBeanSetParam(Set<TestBean>);
  Test.testBeanListParam(List<TestBean>);
  Test.charTestBeanMapParam(Map<Character, TestBean>);
  Test.stringStringMapParam(Map<String, String>);
  Test.stringStringHashMapParam(HashMap<String, String>);
  Test.stringStringTreeMapParam(TreeMap<String, String>);
  Test.stringCollectionParam(Collection<String>);
  Test.stringListParam(List<String>);
  Test.stringLinkedListParam(LinkedList<String>);
  Test.stringArrayListParam(ArrayList<String>);
  Test.stringSetParam(Set<String>);
  Test.stringHashSetParam(HashSet<String>);
  Test.stringTreeSetParam(TreeSet<String>);
  ]]>
  </signatures>
</dwr>

you can instead configure DWR using the FluentConfiguration class like this:

FluentConfiguration fluentconfig = (FluentConfiguration)configuration;
fluentconfig
  .withConverterType("testbean", "uk.ltd.getahead.testdwr.TestBean2Converter")
  .withCreator("new", "Test")
    .addParam("scope", "application")
    .addParam("class", "uk.ltd.getahead.testdwr.Test")
  .withCreator("new", "JDate")
    .addParam("class", "java.util.Date")
    .exclude("getHours")
    .withAuth("getMinutes", "admin")
    .withAuth("getMinutes", "devel")
  .withConverter("bean", "$Proxy*")
  .withConverter("testbean", "uk.ltd.getahead.testdwr.TestBean")
  .withConverter("bean", "uk.ltd.getahead.testdwr.ObjB")
  .withConverter("object", "uk.ltd.getahead.testdwr.ObjA")
    .addParam("force", "true")
  .withSignature()
    .addLine("import java.util.*;")
    .addLine("import uk.ltd.getahead.testdwr.*;")
    .addLine("Test.testBeanSetParam(Set<TestBean>);")
    .addLine("Test.testBeanListParam(List<TestBean>);")
    .addLine("Test.charTestBeanMapParam(Map<Character, TestBean>);")
    .addLine("Test.stringStringMapParam(Map<String, String>);")
    .addLine("Test.stringStringHashMapParam(HashMap<String, String>);")
    .addLine("Test.stringStringTreeMapParam(TreeMap<String, String>);")
    .addLine("Test.stringCollectionParam(Collection<String>);")
    .addLine("Test.stringListParam(List<String>);")
    .addLine("Test.stringLinkedListParam(LinkedList<String>);")
    .addLine("Test.stringArrayListParam(ArrayList<String>);")
    .addLine("Test.stringSetParam(Set<String>);")
    .addLine("Test.stringHashSetParam(HashSet<String>);")
    .addLine("Test.stringTreeSetParam(TreeSet<String>);")
  .finished();

If you’re interested in using this in your DWR project, you need only to:

  • create a class that extends DWRServlet (example: check out FluentDWRServlet.java in the zip file) and use that class as your DWR servlet
  • add a configuration param in web.xml called uk.ltd.getahead.dwr.Configuration and set the value to net.cephas.dwr.FluentConfiguration
  • add a configuration param in web.xml called skipDefaultConfig and set the value to true

    <servlet>
      <servlet-name>dwr</servlet-name>
      <servlet-class>net.cephas.dwr.FluentDWRServlet</servlet-class>
        <init-param>
          <param-name>uk.ltd.getahead.dwr.Configuration</param-name>
          <param-value>net.cephas.dwr.FluentConfiguration</param-value>
        </init-param>
        <init-param>
          <param-name>skipDefaultConfig</param-name>
          <param-value>true</param-value>
        </init-param>
    </servlet>

  • and then override the configure method in the servlet and use the fluent style of configuration I used above.

Send me an email if you have any questions!

XML characters, smart quotes and Apache XML-RPC

    I’ve been eating my own dogfood with the deliciousposter project (as you can see from my daily links). A couple days ago I posted some links to del.icio.us and expected them to show up automatically the next day… except they didn’t. I traced it down to an errant smart quote that I copied from the Internet Alchemy Talis, Web 2.0 and All That post, which caused the Apache XML-RPC library to throw this error:

    java.io.IOException: Invalid character data corresponding to XML entity &#8217;

    I worked under the assumption that the smart quote was an invalid XML character for quite a while, but it looks like it’s actually valid. According to the XML 1.0 specification, the following characters are allowed in an XML document:

    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

    I then checked the source code for the XmlWriter which has this method for writing character data:

    ...
    if (c < 0x20 || c > 0xff) {
      // Though the XML-RPC spec allows any ASCII
      // characters except '<' and '&', the XML spec
      // does not allow this range of characters,
      // resulting in a parse error from most XML
      // parsers.
      throw new XmlRpcClientException("Invalid character data " +
      "corresponding to XML entity &#" +
      String.valueOf((int) c) + ';', null);
    } else ..

    which turns out to be a tad aggressive. It also turns out that the above code snippet and the version of the Apache XML-RPC library I was using are out of date. The chardata(String text) method has been updated in the latest version of the Apache XML-RPC library to use a new method called isValidXMLChar(char c), which is much more lenient:

    if (c == '\n') return true;
    if (c == '\r') return true;
    if (c == '\t') return true;
    if (c
    and not coincidentally, is compliant with the specification.

    I'll be updating deliciousposter to use the latest version of the Apache XML-RPC library soon. In the meantime, if you're using the Apache XML-RPC library, you should probably download the latest version to take advantage of the new XML character validation method.
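
If you want to check your own strings before handing them to an XML-RPC library, the character ranges listed above translate into Java fairly directly. To be clear, this is my own sketch of the rule from the specification, not the library's isValidXMLChar implementation, and the class and method names are made up. It works on code points so that characters above #xFFFF are handled too.

```java
final class XmlChars {
    private XmlChars() {}

    /** True if the code point falls in one of the ranges allowed in an XML document. */
    static boolean isAllowed(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Returns the index of the first disallowed character, or -1 if the text is clean. */
    static int firstDisallowed(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (!isAllowed(codePoint)) {
                return i;
            }
            i += Character.charCount(codePoint);
        }
        return -1;
    }

    public static void main(String[] args) {
        // the smart quote that started all this is U+2019
        System.out.println(isAllowed(0x2019));
    }
}
```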