
The Referer header, intranets and privacy

I’ve discussed meaningful URLs a number of times on this site: one of the biggest benefits of a good blog URL is that you can infer who posted the article, when it was posted and what the blog post is about. For the most part this is all ‘a good thing’. But when you’re blogging on an intranet and you create a blog post that results in a URL like this:

http://intranet.example.com/blogs/aaron/2007/02/07/our-secret-widget-is-going-to-kill-our-competition

and then in the blog post you put a couple links to your competition and embed a picture of their latest product, you’re potentially letting secrets through the firewall without even knowing it. See, HTTP has a really nice mechanism for specifying both a) what page an image is loading in and b) what page the user was on when they clicked a link to visit the next page. It’s called the HTTP referer and it’s commonly used for good: web statistics packages (like Google Analytics or AWStats) use the referer header to show you click paths through your site and to show you what other websites are linking to you. A typical request in an Apache HTTPD log file might look something like this:

86.105.195.89 - - [06/Feb/2007:01:54:32 -0500] "GET /blogs/aaron/ HTTP/1.1" 200 34659 "http://intranet.example.com/blogs" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1; .NET CLR 2.0.50727) Gecko/20061204 Firefox/2.0.0.1"

but back to the point at hand: if you’re using blogs or wikis or anything else that produces clean, understandable, meaningful URLs, and you or your company are serious about security, you’ll want to make sure that HTTP referers are blocked, because you really don’t want the president of your company breathing down your neck on a Monday morning because your competition just called… and they know. Here’s how:

  • Force anyone / everyone reading your internal site to use a Firefox plugin called RefControl, which allows you to control what gets sent in the referer field per website. Unless you’re the IT guy and you can force people to use this plugin, it’s doubtful this would work.
  • Force all of your outgoing links through what’s called a dereferer (see the sketch after this list). Again, this is unwieldy, can probably be subverted and may not work for images. (You can do the same thing by modifying your Firefox config, but the plugin is easier.)
  • Use HTTPS for all the pages on your intranet because RFC 2616 states that:

    Clients SHOULD NOT include a Referer header field in a (non-secure) HTTP request if the referring page was transferred with a secure protocol.

    which means that even if someone does create a link to your competition’s website on the intranet, your competition won’t find out.
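
A dereferer, for what it’s worth, is just an intermediate page that outgoing links pass through, so the destination site sees the dereferer’s URL instead of your internal one. Here’s a minimal sketch of the idea as a Java servlet; the class name and the url parameter are my own illustration rather than any particular product, and real code would need to validate and escape the URL:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical dereferer: internal pages link to /deref?url=http://...
// instead of linking out directly. A meta refresh is used rather than an
// HTTP redirect because some browsers follow redirects while keeping the
// original page in the Referer header.
public class DerefererServlet extends HttpServlet {
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws IOException {
    String url = request.getParameter("url");
    if (url == null || !url.startsWith("http")) {
      response.sendError(HttpServletResponse.SC_BAD_REQUEST);
      return;
    }
    response.setContentType("text/html");
    PrintWriter out = response.getWriter();
    // In real code, escape 'url' here to avoid markup injection.
    out.write("<html><head><meta http-equiv=\"refresh\" content=\"0;url="
        + url + "\"></head><body></body></html>");
  }
}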

On a semi-related note, here are a couple things I learned from reading this article by Eric Lawrence (creator of the fine Fiddler HTTP debugging tool for Windows):

  • Fiddler has a really cool diff feature where you can select two sessions, right click and select WinDiff from the menu
  • somehow he’s got Firefox hooked up to Fiddler… I gotta learn how.
  • example.com is reserved by RFC 2606 specifically for the purpose of blog posts like this. Try the link. Who knew?

RSS/Atom feeds, Last Modified and Etags

Sometime last week I read this piece by Sam Ruby, which, summarized, says this:

…don’t send Etag and Last-Modified headers unless you really mean it. But if you can support it, please do. It will save you some bandwidth and your readers some processing.

The product I’ve been working on at work (which I can now talk about) for the last couple months uses feeds (either Atom, RSS 1.0 or RSS 2.0, your choice) extensively but didn’t have Etag or Last-Modified support, so I spent a couple hours working on it this past weekend. We’re using ROME, so the code ended up looking something like this:

HttpServletRequest request = ...
HttpServletResponse response = ...
SyndFeed feed = ...
if (!isModified(request, feed)) {
  // Nothing changed since the client last asked: send a 304 and no body.
  response.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
} else {
  long publishDate = feed.getPublishedDate().getTime();
  response.setDateHeader("Last-Modified", publishDate);
  response.setHeader("Etag", getEtag(feed));
}
...
private String getEtag(SyndFeed feed) {
  // Etag values must be quoted (see gotcha #1 below).
  return "\"" + String.valueOf(feed.getPublishedDate().getTime()) + "\"";
}
...
private boolean isModified(HttpServletRequest request, SyndFeed feed) {
  if (request.getHeader("If-Modified-Since") != null && request.getHeader("If-None-Match") != null) {
    String feedTag = getEtag(feed);
    String eTag = request.getHeader("If-None-Match");
    Calendar ifModifiedSince = Calendar.getInstance();
    ifModifiedSince.setTimeInMillis(request.getDateHeader("If-Modified-Since"));
    Calendar publishDate = Calendar.getInstance();
    publishDate.setTime(feed.getPublishedDate());
    // HTTP dates carry no milliseconds (see gotcha #2 below).
    publishDate.set(Calendar.MILLISECOND, 0);
    return ifModifiedSince.compareTo(publishDate) != 0 || !eTag.equalsIgnoreCase(feedTag);
  } else {
    // No conditional headers from the client, so treat the feed as modified.
    return true;
  }
}

There are only two gotchas in the code:

  1. The value of the Etag must be quoted, hence the getEtag(...) method above returning a string wrapped in quotes. Not hard to do, but easy to miss.
  2. The first block of code above uses setDateHeader(String name, long date) to set the ‘Last-Modified’ HTTP header, which conveniently takes care of formatting the given date according to the RFC 822 specification for dates and times. The published date comes from ROME. Here’s where it gets tricky: HTTP dates only have one-second resolution, because the RFC 822 date format doesn’t include milliseconds. So if the long value you hand to setDateHeader(String name, long date) contains a millisecond component and you later compare that same value against what getDateHeader(String name) hands back for the ‘If-Modified-Since’ header, you’ll never get a match. The Calendar dance in isModified(...) above handles this: set the Calendar’s time from the header value (the Calendar transparently takes care of any timezone conversion for you) and zero out the MILLISECOND field on the feed’s publish date before comparing. Or skip the Calendars entirely, as in the sketch below.
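
Here’s that simpler alternative (my sketch, not the production code): truncate the feed’s publish time to whole seconds and compare the raw long values, taking advantage of the fact that getDateHeader() returns -1 when the header is absent:

// HTTP dates (RFC 822 format) carry no milliseconds, so truncate
// to whole seconds before comparing the raw long values.
long publishMillis = feed.getPublishedDate().getTime() / 1000L * 1000L;
long ifModifiedSince = request.getDateHeader("If-Modified-Since"); // -1 if absent
boolean notModified = (ifModifiedSince != -1) && (publishMillis <= ifModifiedSince);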

If you’re interested, there are a number of other blogs and articles out there about Etags and Last-Modified headers.

Using Lucene and MoreLikeThis to show Related Content

If you read this blog, you probably paid a smidgen of attention to the Web 2.0 Conference held last week in San Francisco. Sphere was one of the companies that presented; they launched a product called the “Sphere It Contextual Widget for blogs”, a JavaScript widget you can add to your blog or content-focused site that displays contextually similar blogs and blog posts for the reader. I’ve always wanted to try to do something similar (no pun intended) using Lucene, so I spent a couple hours this weekend banging around on it.

The first step was to get my WordPress content (which is stored in MySQL) into Lucene. A couple lines of code later I had a Lucene index full of all 857 (as of 11/14/2006) posts including the blog post ID, subject, body, date and permalink. Next, I checked out and compiled the Lucene similarity contrib, whose most important asset is the MoreLikeThis class (written in part by co-worker Bruce Ritchie). You give a MoreLikeThis instance a document to parse, an index to search and the fields in the index you want to compare against the given document, and then execute a Lucene search just like you normally would:

Reader reader = ...;
IndexReader index = IndexReader.open(indexfile);
IndexSearcher searcher = new IndexSearcher(index);
// Build a "more like this" query from the given document's text...
MoreLikeThis mlt = new MoreLikeThis(index);
mlt.setFieldNames(new String[] {"subject", "body"});
Query query = mlt.like(reader);
// ...and run it like any other Lucene query.
Hits hits = searcher.search(query);
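
For reference, the indexing step glossed over above really is only a couple lines. Here’s a rough sketch against the Lucene API of the day; the field names match the search code, but the postId, subject, body and permalink variables are stand-ins for whatever comes out of the WordPress database:

// Write one Lucene Document per blog post; 'true' recreates the index.
IndexWriter writer = new IndexWriter(indexfile, new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("id", String.valueOf(postId), Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("subject", subject, Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("body", body, Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("permalink", permalink, Field.Store.YES, Field.Index.UN_TOKENIZED));
// (a date field would be added the same way)
writer.addDocument(doc);
writer.optimize();
writer.close();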

I’ll skip all the glue and say that I wired all this up into a servlet that spits out JSON:

Map entries = getRelatedEntries(postID, body);
JSONObject json = JSONObject.fromObject( entries );
response.setContentType("text/javascript");
response.getWriter().write("Related = {}; Related.posts = " + json.toString());

and then used client side JavaScript and some PHP to put it all together:

<h5>Related Content</h5>
<script type="text/javascript"
  src="http://cephas.net/blog/related.js?post=<?php the_ID(); ?>">
</script>
<script type="text/javascript">
for (post in Related.posts) {
  document.write('<li><a href="' + Related.posts[post] + '">' + post + '</a></li>');
}
</script>

I’ve been cruising around the blog and so far I think that MoreLikeThis works really well. For the most part, the posts that I would expect to be related are related. There are a couple posts that seem to pop to the top of the ‘related content’ feed that I’ll have to fix, and I’d like to boost the terms in the subject of the original document (MoreLikeThis has some tuning knobs for this kind of thing; see the sketch below), but other than that I’m happy with it.
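
MoreLikeThis exposes a handful of tuning knobs, though per-field boosting isn’t one of them out of the box. The values here are illustrative guesses, not what I ended up using:

// Ignore terms appearing fewer than 2 times in the source document or
// in fewer than 5 documents in the index, cap the query at 25 terms,
// and boost each query term by its relevance score.
mlt.setMinTermFreq(2);
mlt.setMinDocFreq(5);
mlt.setMaxQueryTerms(25);
mlt.setBoost(true);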

Back to Sphere, and specifically to Brady’s post about it on the Radar blog:

Top-Left Corner: Recent, similar blog posts from other blogs.
Bottom-Left Corner: Recommended blogs that are selected by the site-owner. This is very handy for blog networks.
Top-Right Corner: Similar posts from that blog
Bottom-Right Corner: Ad, currently served by FM Pub.

Given a week, I’m guessing that you could use the Google API to do the top-left corner, hardcode the content in the bottom left and use MoreLikeThis in the top right; the bottom-right ad you’d want to do yourself anyway. So if you were a publisher looking for more page views, why would you even consider the Sphere widget?

JSON: Making Content Syndication easier

At work we’ve been having some discussions about sharing content between two websites: the natural first option was an XML solution, in this case RSS. Site A would subscribe to the RSS feeds of site B, periodically retrieving the updated feeds and caching the contents of each feed for a specified period of time, all the while displaying the resulting content on various parts of site A.

A couple months ago (December 2005 to be exact), Yahoo started supporting JSON (a lightweight data interchange format which stands for JavaScript Object Notation) as an optional result format for some of its web services. The most common thing said about JSON is that it’s better than XML, usually meaning that it’s easier to parse and not as verbose; there’s a well written comparison of XML and JSON out there if you don’t believe me. While the comparisons of simplicity, openness and interoperability are useful, I think JSON really stands out when you’re working in a browser. Going back to the example I used above where site A needs to display content from site B, as I see it, this is the flow that bits travel through in order to make the syndication work:
every_n_seconds() --> retrieve_feed() --> store_feed_entries()

and then per request to site A:

make_page() --> get_feed_entries() --> parse_entries() --> display_entries()

There are a number of libraries built in Java for creating and parsing RSS, some for fetching RSS, and there’s even a JSP taglib for displaying RSS. But even with all the libraries, there’s still a good amount of code to write and a number of moving parts you’ll need to maintain. If you do the syndication on the client side using JSON, there are no moving parts. To display just the title of each one of my del.icio.us posts, for example, you would end up with something like this:

<script type="text/javascript" src="http://del.icio.us/feeds/json/ajohnson1200"></script>
<script type="text/javascript">
for (var i=0, post; post = Delicious.posts[i]; i++) {
  document.write(post.d + '<br />');
}
</script>

I’m comparing apples to oranges (server side RSS retrieval, storage, parsing and display against a client side JSON include) but there are a couple of non-obvious advantages and disadvantages:

  1. Caching: If used on a number of pages, syndicated JSON content can reduce the number of bits a browser has to download to fully render a page. For example, let’s say (for argument’s sake) that we have an RSS feed that is 17k in size and a corresponding JSON feed of the same size (even though the RSS would inevitably be bigger). Using server side RSS syndication, the browser has to download the rendered syndicated content (again, say 17k) with every page. Using the JSON feed across a number of page views, the browser downloads the 17k feed once and then not again (assuming the server has been configured to send a 304) until the feed has a new item. Winner: JSON / client
  2. Rendering: Of course, having the browser parse and render a 17k JSON feed wouldn’t be trivial. From a pure speed standpoint, the server could do the parse / generate once and then serve an HTML rendering of the feed from cache from then on. Winner: RSS / server
  3. Searching: Using JSON on the client, site A (which is syndicating content from site B) wouldn’t have any way of searching the content, outside of retrieving / parsing / storing it on the server anyway. Also, spiders wouldn’t see the syndicated content from site B on site A, unlike server side RSS syndication where the syndicated content looks no different to a spider than the rest of the content on site A. Winner: RSS / server
  4. Ubiquity: JSON only works if the browser has JavaScript enabled, which the large majority of users do. But certain environments won’t: phones, set top boxes and anything else that runs a browser but isn’t a PC may not have JavaScript, which means they won’t see the syndicated content. Server side generated content will be available across any platform that understands HTML. Winner: RSS / server

So wrapping up: when should you use JSON on the client and when should you use RSS on the server? If you need to syndicate a small amount of content to non-programmers who can cut and paste (or to programmers who are adept at JavaScript), JSON seems like the way to go. It’s trivial to get something up and running, the browser will cache the feed you create and your users will see new content as soon as it becomes available in your JSON feed.

If you’ve read this far, you should go on and check out the examples on developer.yahoo.com and on del.icio.us. Also, if you’re a Java developer, you should head over to sourceforge.net to take a look at JSON-lib, which makes it wicked easy to create JSON from lists, arrays and beans (a quick sketch is below).
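
As a taste of JSON-lib, here’s a minimal sketch (the class and values are invented for illustration) using the same fromObject(...) calls as the related-content servlet earlier:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import net.sf.json.JSONArray;
import net.sf.json.JSONObject;

public class JsonLibExample {
  public static void main(String[] args) {
    // A list becomes a JSON array...
    List titles = Arrays.asList(new String[] {"First post", "Second post"});
    System.out.println(JSONArray.fromObject(titles)); // ["First post","Second post"]

    // ...and a map (or a bean) becomes a JSON object.
    Map post = new HashMap();
    post.put("title", "Hello");
    post.put("permalink", "http://example.com/blog/hello/");
    System.out.println(JSONObject.fromObject(post));
  }
}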

New design

I got really bored with the old design of this site and all the cool kids seem to be using WordPress these days, so last weekend I exported all 900 or so entries from Movable Type and imported them into WordPress, installed the ScribbishWP theme and wrote a servlet filter (sketched below) to map my /blog/year/month/day/entry_name.html Movable Type permalinks to the WordPress style, which is /blog/year/month/day/entry-name/. Oh, and comments are back on.
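
The filter is just a regex and a redirect. Here’s a rough sketch of the idea; the class name and rewrite rule are mine, and the real thing probably handles a few more edge cases:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical filter: 301s /blog/2006/11/14/entry_name.html
// to /blog/2006/11/14/entry-name/.
public class PermalinkFilter implements Filter {
  private static final Pattern OLD_STYLE =
      Pattern.compile("^(/blog/\\d{4}/\\d{2}/\\d{2}/)([^/]+)\\.html$");

  public void init(FilterConfig config) {}
  public void destroy() {}

  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) res;
    Matcher m = OLD_STYLE.matcher(request.getRequestURI());
    if (m.matches()) {
      // Underscores become dashes and .html becomes a trailing slash.
      String newUrl = m.group(1) + m.group(2).replace('_', '-') + "/";
      response.setStatus(HttpServletResponse.SC_MOVED_PERMANENTLY);
      response.setHeader("Location", newUrl);
    } else {
      chain.doFilter(req, res);
    }
  }
}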

Enjoy!

WebWork and meaningful URLs

Personal pet peeve: meaningful URLs (which tonight I found out go by many names: pretty URLs, RESTian URLs, SES URLs, hackable URLs, etc…). At work we use WebWork extensively, but up until this point we haven’t made an effort to create meaningful URLs. As with any well designed framework, it turns out that there are a couple of ways you can create meaningful URLs, with different levels of meaningfulness.

Version 2.2 of WebWork introduced the ActionMapper interface and a class called RestfulActionMapper, which gives you the ability to create URLs that might look something like this:

http://bookstore.com/books/category/java/keyword/webwork

instead of the more common:

http://bookstore.com/books.jspa?category=java&keyword=webwork

The nice thing about the RestfulActionMapper implementation is that you don’t have to write any code to parse the URL: you set up your WebWork actions with the appropriate setters (see the sketch below) and the RestfulActionMapper handles the rest. The downside is that this still isn’t a truly hackable URL. For example, although this URL:

http://bookstore.com/books/category/java/keyword

and this URL:

http://bookstore.com/books/category

would probably work, they don’t really make sense. Why are ‘keyword’ and ‘category’ hanging around at the end? Both words are artifacts required by the implementation that add no value for the user.
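
For illustration, here’s roughly what the action side of the /books/category/java/keyword/webwork URL might look like (a hypothetical class; RestfulActionMapper walks the URL in name/value pairs and calls the matching setters before execute() runs):

import com.opensymphony.xwork.ActionSupport;

// Hypothetical action: RestfulActionMapper calls setCategory("java")
// and setKeyword("webwork") based on the URL segments.
public class BooksAction extends ActionSupport {
  private String category;
  private String keyword;

  public void setCategory(String category) { this.category = category; }
  public void setKeyword(String keyword) { this.keyword = keyword; }

  public String execute() throws Exception {
    // look up books by category and keyword here
    return SUCCESS;
  }
}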

The second way you can create meaningful URLs is by creating your own ActionMapper. You can get a good start by checking out the source code for the DefaultActionMapper and the RestfulActionMapper. To set properties on your action instances, you’ll want to create a HashMap, add the appropriate properties from your URL to the map and then either create and return a new ActionMapping using the action name and map, or call the setParams() method on an existing mapping (a sketch follows the example URL below). The end result is that you should be able to create and use a meaningful URL that looks like this:

http://bookstore.com/books/java/webwork
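
Here’s a rough sketch of such a mapper for the URL above. The two methods are the ones the WebWork 2.2 ActionMapper interface defines, but I’m writing the ActionMapping construction from memory (an action name plus a parameter map), so treat the details as an assumption rather than gospel:

import java.util.HashMap;
import java.util.Map;
import javax.servlet.http.HttpServletRequest;
import com.opensymphony.webwork.dispatcher.mapper.ActionMapper;
import com.opensymphony.webwork.dispatcher.mapper.ActionMapping;

// Hypothetical mapper for /books/java/webwork style URLs.
public class BookstoreActionMapper implements ActionMapper {

  public ActionMapping getMapping(HttpServletRequest request) {
    String uri = request.getServletPath(); // e.g. /books/java/webwork
    String[] parts = uri.substring(1).split("/");
    if (parts.length == 3 && "books".equals(parts[0])) {
      Map params = new HashMap();
      params.put("category", parts[1]);
      params.put("keyword", parts[2]);
      // Map the URL to the 'books' action with the parsed properties.
      return new ActionMapping("books", params);
    }
    return null; // unrecognized URL: let the default mapping handle it
  }

  public String getUriFromActionMapping(ActionMapping mapping) {
    Map params = mapping.getParams();
    return "/books/" + params.get("category") + "/" + params.get("keyword");
  }
}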


Creating RSS using Java

I wanted to create RSS feeds for karensrecipes.com using Java. I did my ‘research’, came across this page (Ben Hammersley.com: Java RSS libraries) and then used the RSS4j library to create a servlet that serves up dynamic RSS feeds of the 10 most recently created recipes per category (samples: Breakfast, Soup, Barbeque…).

The syntax is pretty simple: you get an RssDocument and set which version you want to use (RSS 1.0, .9 or .91):

RssDocument doc = new RssDocument();
doc.setVersion(RssDocument.VERSION_10);

and then create an RssChannel object and add it to the RssDocument:

RssChannel channel = new RssChannel();
channel.setChannelTitle("Karens Recipes | Most Recent");
channel.setChannelLink("http://www.karensrecipes.com/3/Soup/default.jsp");
channel.setChannelDescription("The 10 most recently added recipes in the soup category.");
channel.setChannelUri("http://www.karensrecipes.com/rss/?categoryid=3");
doc.addChannel(channel);

Next, you’ll retrieve the items from a database, the file system, etc… and add each one as an RssChannelItem:

// connect to the datasource
// iterate over something (db? vector?...)
RssChannelItem item = new RssChannelItem();
item.setItemTitle(label);
item.setItemLink(link);
item.setItemDescription(description);
channel.addItem(item);

and then finally, using the RssGenerator class, call the generateRss() method, in this case I’m sending the output to a Servlet PrintWriter:

PrintWriter out = response.getWriter();
RssGenerator.generateRss(doc, out);

You could just as easily write it to a file:

File file = new File("/opt/data/rss.xml");
try {
  RssGenerator.generateRss(doc, file);
  System.out.println("RSS file written.");
} catch (RssGenerationException e) {
  e.printStackTrace();
}

Simple. Easy to use.

Open Source Content Management Conference

I got cleared to go to the Cambridge OSCOM today; I’m going to sign up tomorrow. Since we’re debuting our own content management product soon (we’re having a party in June, by the way, and you’re invited! Email me if you’d like to come!), I’ll be scouring the sessions for features and technologies that we can integrate and/or interoperate with. Specifically, I’ll be interested in the following sessions:

  • Building a CMS Client With Mozilla
  • Collaborative Mapping on the Semantic Web
  • Bebop: Requirements and Design for a Web UI Component Library
  • Versioning Structured Content in a Content Management Application
  • Extending CMS with Web Services
  • From the information pile to social, knowledge exchanging bots