Category Archives: Content Management

Content Management, Software Development, Systems Administration

The Referer header, intranets and privacy

February 6, 2007 ajohnson Leave a comment

I’ve discussed meaningful URLs a number of times on this site: one of the biggest benefits of a good blog URL is that you can infer who posted the article, when it was posted and what the blog post is about. For the most part this is all ‘a good thing’. But when you’re blogging on an intranet and you create a blog post that results in a URL like this:

http://intranet.example.com/blogs/aaron/2007/02/07/our-secret-widget-is-going-to-kill-our-competition

and then in the blog post you put a couple links to your competition and embed a picture of their latest product, you’re potentially letting secrets through the firewall without even knowing it. See, HTTP has this really nice mechanism for specifying both a) what page an image is loading in and b) what page the user was on when they clicked on a link to visit the next page. It’s called the HTTP referer and it’s commonly used for good: web statistics packages (like Google Analytics or AWStats) use the referer header to show you click paths through your site and to show you what other websites are linking to you. A typical request in an Apache HTTPD log file might look something like this:

86.105.195.89 - - [06/Feb/2007:01:54:32 -0500] "GET /blogs/aaron/ HTTP/1.1" 200 34659 "http://intranet.example.com/blogs" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1; .NET CLR 2.0.50727) Gecko/20061204 Firefox/2.0.0.1"

but back to the point at hand: if you’re using blogs or wikis or anything that might produce a clean, understandable, meaningful URL and you or your company are serious about security, you’ll want to make sure that HTTP Referers are blocked because you really don’t want the president of your company breathing down your neck on a Monday morning because your competition just called… and they know. Here’s how:

Force anyone / everyone reading your internal site to use a Firefox plugin called RefControl, which allows you to control what gets sent in the referer field per website (you can do the same thing by modifying your Firefox config, but the plugin is easier). Unless you’re the IT guy and you can force people to use this plugin, it’s doubtful this would work.
Force all of your outgoing links through what’s called a dereferer. Again, this is unwieldy, can probably be subverted and may not work for images.
Use HTTPS for all the pages on your intranet because RFC 2616 states that:

Clients SHOULD NOT include a Referer header field in a (non-secure) HTTP request if the referring page was transferred with a secure protocol.

which means that even if someone does create a link to your competition’s website on the intranet, your competition won’t find out.
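To make the HTTPS rule concrete, here is a minimal Java sketch of the decision a well-behaved client makes. RefererRule and shouldSendReferer are hypothetical names invented for this example; they are not part of any browser or library, and since the RFC only says SHOULD NOT, real clients may behave differently.

```java
import java.net.URI;

/** Hypothetical helper illustrating the RFC 2616 sentence quoted above. */
final class RefererRule {
    private RefererRule() {}

    /**
     * Returns false only when the referring page was fetched over HTTPS
     * and the outgoing request is plain HTTP, the one case the rule covers.
     */
    static boolean shouldSendReferer(URI referringPage, URI target) {
        boolean fromSecure = "https".equalsIgnoreCase(referringPage.getScheme());
        boolean toSecure = "https".equalsIgnoreCase(target.getScheme());
        return !(fromSecure && !toSecure);
    }
}
```

Note that the quoted sentence only covers secure-to-non-secure requests: a link from an HTTPS intranet page to an HTTPS page on another site isn’t covered by it, so it’s worth testing what your browsers actually send.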

On a semi-related note, here are a couple things I learned from reading this article by Eric Lawrence (creator of the fine HTTP Fiddler Tool for Windows):

Fiddler has a really cool diff feature where you can select two sessions, right click and select WinDiff from the menu
somehow he’s got Firefox hooked up to Fiddler… I gotta learn how.
example.com is reserved by RFC2606 specifically for the purpose of blog posts like this. Try the link. Who knew?

Content Management, J2EE, Software Development, Syndication, Systems Administration

RSS/Atom feeds, Last Modified and Etags

December 4, 2006 ajohnson Leave a comment

Sometime last week I read this piece by Sam Ruby, which, summarized, says this:

…don’t send Etag and Last-Modified headers unless you really mean it. But if you can support it, please do. It will save you some bandwidth and your readers some processing.

The product I’ve been working on at work (which I can talk about now) for the last couple months uses feeds (either Atom, RSS 1.0 or RSS 2.0, your choice) extensively but didn’t have Etag or Last-Modified support so I spent a couple hours working on it this past weekend. We’re using ROME, so the code ended up looking something like this:

HttpServletRequest request = ...
HttpServletResponse response = ....
SyndFeed feed = ...
if (!isModified(request, feed)) {
  response.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
} else {
  long publishDate = feed.getPublishedDate().getTime();
  response.setDateHeader("Last-Modified", publishDate);
  response.setHeader("Etag", getEtag(feed));
}
...
private String getEtag(SyndFeed feed) {
  return "\"" + String.valueOf(feed.getPublishedDate().getTime()) + "\"";
}
...
private boolean isModified(HttpServletRequest request, SyndFeed feed) {
  if (request.getHeader("If-Modified-Since") != null && request.getHeader("If-None-Match") != null) {
    String feedTag = getEtag(feed);
    String eTag = request.getHeader("If-None-Match");
    Calendar ifModifiedSince = Calendar.getInstance();
    ifModifiedSince.setTimeInMillis(request.getDateHeader("If-Modified-Since"));
    Calendar publishDate = Calendar.getInstance();
    publishDate.setTime(feed.getPublishedDate());
    // HTTP dates have no milliseconds, so zero them before comparing
    publishDate.set(Calendar.MILLISECOND, 0);
    int diff = ifModifiedSince.compareTo(publishDate);
    return diff != 0 || !eTag.equalsIgnoreCase(feedTag);
  } else {
    // the client sent no validators: treat the feed as modified
    return true;
  }
}

There are only two gotchas in the code:

The value of the Etag must be quoted, hence the getEtag(...) method above returning a string wrapped in quotes. Not hard to do, but easy to miss.
The first block of code above uses setDateHeader(String name, long date) to set the ‘Last-Modified’ HTTP header, which conveniently takes care of formatting the given date according to the RFC 822 specification for dates and times (as updated by RFC 1123). The published date comes from ROME. Here’s where it gets tricky: if the client returns the ‘If-Modified-Since’ header and you retrieve said date from the request using getDateHeader(String name), you’ll get a long holding the number of milliseconds since the epoch (January 1, 1970 GMT). That value is independent of your server’s timezone, so once you load it into a Calendar instance (or a Date) you can compare it directly against the published date. But there’s still one thing left: the date format doesn’t carry milliseconds, so if the long value you hand to setDateHeader(String name, long date) contains a millisecond value and you then try to use the same value to compare against the ‘If-Modified-Since’ header, you’ll never get a match. The easy way around that is to manually set the millisecond field to zero on the published date before comparing, which is what publishDate.set(Calendar.MILLISECOND, 0) does in the code above.
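Both gotchas can be shown without the servlet API. The following is only a sketch: ConditionalGet and its method names are made up for illustration, and it assumes the caller passes in the values it would get from getDateHeader("If-Modified-Since") (which returns -1 when the header is absent) and getHeader("If-None-Match").

```java
import java.util.Date;

/** Illustrative sketch of the comparison described above; names are hypothetical. */
final class ConditionalGet {
    private ConditionalGet() {}

    /** ETag values are quoted strings, so wrap the timestamp in quotes. */
    static String etagFor(Date published) {
        return "\"" + published.getTime() + "\"";
    }

    /** HTTP dates carry whole seconds only, so drop the milliseconds. */
    static long truncateToSeconds(long epochMillis) {
        return (epochMillis / 1000L) * 1000L;
    }

    /**
     * @param ifModifiedSince epoch milliseconds from the request, or -1 if absent
     * @param ifNoneMatch     raw If-None-Match header value, or null if absent
     * @param published       the feed's published date
     */
    static boolean isModified(long ifModifiedSince, String ifNoneMatch, Date published) {
        if (ifModifiedSince == -1L || ifNoneMatch == null) {
            return true; // no validators sent: serve the full feed
        }
        boolean sameDate = ifModifiedSince == truncateToSeconds(published.getTime());
        boolean sameTag = ifNoneMatch.equals(etagFor(published));
        return !(sameDate && sameTag);
    }
}
```

Comparing the raw long values sidesteps Calendar entirely; whether that reads better than the Calendar version is a matter of taste.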

If you’re interested, there are a number of other blogs / articles about Etags and Last-Modified headers:

Blogs, Content Management, Lucene, Syndication

Using Lucene and MoreLikeThis to show Related Content

November 14, 2006 ajohnson 4 Comments

If you read this blog, you probably paid a smidgen of attention to the Web 2.0 Conference held last week in San Francisco. Sphere was one of the companies that presented and they launched a product called the “Sphere It Contextual Widget for blogs”, which is a JavaScript widget you can add to your blog or content focused site that displays contextually similar blogs and blog posts for the reader. I’ve always wanted to try to do something similar (no pun intended) using Lucene, so I spent a couple hours this weekend banging around on it.

The first step was to get my WordPress content (which is stored in MySQL) into Lucene. A couple lines of code later I had a Lucene index full of all 857 (as of 11/14/2006) posts including the blog post ID, subject, body, date and permalink. Next, I checked out and compiled the Lucene similarity contrib, whose most important asset is the MoreLikeThis class (written in part by co-worker Bruce Ritchie). You provide an instance of MoreLikeThis a document to parse, an index to search and the fields in the index you want to compare against the given document and then execute a Lucene search just like you normally would:

Reader reader = ...;
IndexReader index = IndexReader.open(indexfile);
IndexSearcher searcher = new IndexSearcher(index);
MoreLikeThis mlt = new MoreLikeThis(index);
mlt.setFieldNames(new String[] {"subject", "body"});
Query query = mlt.like(reader);
Hits hits = searcher.search(query);
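For a feel of what “related” means here, the toy below scores two pieces of text by the terms they share. To be clear about the assumptions: this is plain Java, it does not use Lucene, and it does not reproduce how MoreLikeThis selects or weights terms; ToySimilarity and its methods are invented for this sketch, which simply computes cosine similarity over raw term counts.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy term-overlap scoring; a stand-in for illustration, not Lucene's algorithm. */
final class ToySimilarity {
    private ToySimilarity() {}

    /** Lowercases the text and counts each run of letters or digits. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity of two term-count vectors; 0 means no shared terms. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double norm = Math.sqrt(sumOfSquares(a)) * Math.sqrt(sumOfSquares(b));
        return norm == 0 ? 0 : dot / norm;
    }

    private static double sumOfSquares(Map<String, Integer> v) {
        double sum = 0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return sum;
    }
}
```

Scoring every post against the current one and sorting by the result gives a crude related-posts list; a real index does this far more efficiently.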

I’ll skip all the glue and say that I wired all this up into a servlet that spits out JSON:

Map entries = getRelatedEntries(postID, body);
JSONObject json = JSONObject.fromObject( entries );
response.setContentType("text/javascript");
response.getWriter().write("Related = {}; Related.posts = " + json.toString());
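To show the shape of what that servlet writes, here is a small sketch that builds the same kind of payload by hand instead of with JSON-lib. RelatedPayload, quote and toJavaScript are hypothetical names for this example, and the map is assumed to go from post title to permalink, which is how the client-side loop reads it.

```java
import java.util.Map;

/** Hand-rolled version of the script payload; a sketch, not the original servlet. */
final class RelatedPayload {
    private RelatedPayload() {}

    /** Wraps a string in double quotes, escaping characters JavaScript would choke on. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // remaining control characters become four-digit hex escapes
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Builds the script body: a Related object whose posts map titles to permalinks. */
    static String toJavaScript(Map<String, String> titleToPermalink) {
        StringBuilder sb = new StringBuilder("Related = {}; Related.posts = {");
        boolean first = true;
        for (Map.Entry<String, String> e : titleToPermalink.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append(quote(e.getKey())).append(':').append(quote(e.getValue()));
            first = false;
        }
        return sb.append('}').toString();
    }
}
```

In practice a JSON library is the safer choice; the point is only that the response is a plain script that assigns an object literal.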

and then used client side JavaScript and some PHP to put it all together:

<h5>Related Content</h5>
<script type="text/javascript"
  src="http://cephas.net/blog/related.js?post=<?php the_ID(); ?>">
</script>
<script type="text/javascript">
for (var post in Related.posts) {
  document.write('<li><a href="' + Related.posts[post] + '">' + post + '</a></li>');
}
</script>

I’ve been cruising around the blog and so far, I think that MoreLikeThis works really well. For the most part, the posts that I would expect to be related, are related. There are a couple posts which seem to pop to the top of the ‘related content’ feed that I’ll have to fix and I would like to boost the terms in the subject of the original document, but other than that, I’m happy with it.

Back to Sphere, and specifically to Brady’s post about it on the Radar blog:

Top-Left Corner: Recent, similar blog posts from other blogs.
Bottom-Left Corner: Recommended blogs that are selected by the site-owner. This is very handy for blog networks.
Top-Right Corner: Similar posts from that blog
Bottom-Right Corner: Ad, currently served by FM Pub.

Given a week, I’m guessing that you could use the Google API to do the top-left corner, hardcode the content in the bottom left, use MoreLikeThis in the top right and the bottom right you’d want to do yourself anyway. So if you were a publisher looking for more page views, why would you even consider the Sphere widget?

Content Management, J2EE, JavaScript, Open Source, Software Development, XML

JSON: Making Content Syndication easier

August 21, 2006 ajohnson 3 Comments

At work we’ve been having some discussions about sharing content between two websites: the natural first option was an XML solution, in this case RSS. Site A would subscribe to the RSS feeds of site B, periodically retrieving the updated feeds and caching the contents of each feed for a specified period of time, all the while displaying the resulting content on various parts of site A.

A couple months ago (December 2005 to be exact), Yahoo started supporting JSON (a lightweight data interchange format which stands for JavaScript Object Notation) as an optional result format for some of its web services. The most common thing said about JSON is that it’s better than XML, usually meaning that it’s easier to parse and not as verbose; here’s a well written comparison of XML and JSON if you don’t believe me. While the comparisons of simplicity, openness and interoperability are useful, I think JSON really stands out when you’re working in a browser. Going back to the example I used above where site A needs to display content from site B, as I see it, this is a sample runtime / flow that bits travel through in order to make the syndication work:

every_n_seconds() --> retrieve_feed() --> store_feed_entries()

and then per request to site A:

make_page() --> get_feed_entries() --> parse_entries() --> display_entries()

There are a number of libraries built in Java for creating and parsing RSS, some for fetching RSS and there’s even a JSP taglib for displaying RSS. But even with all the libraries, there’s still a good amount of code to write and a number of moving parts you’ll need to maintain. If you do the syndication on the client side using JSON, there are no moving parts. To display just the title of each one of my del.icio.us posts as an example, you would end up with something like this:

<script type="text/javascript" src="http://del.icio.us/feeds/json/ajohnson1200"></script>
<script type="text/javascript">
for (var i = 0, post; post = Delicious.posts[i]; i++) {
  document.write(post.d + '<br />');
}
</script>

I’m comparing apples to oranges (server side RSS retrieval, storage, parse and display against client side JSON include) but there are a couple of non-obvious advantages and disadvantages:

Caching: If used on a number of pages, syndicated JSON content can reduce the number of bits a browser has to download to fully render a page. For example, let’s say (for argument’s sake) that we have an RSS feed that is 17k in size and a corresponding JSON feed of the same size (even though RSS would inevitably be bigger). Using the server side RSS syndication, the browser will have to download the rendered syndicated content (again let’s say it’s 17k) with every page. Using the JSON syndicated feed across a number of page views, the browser would download the 17k JSON feed once and then not again (assuming the server has been configured to send a 304) until the feed has a new item. Winner: JSON / client
Rendering: Of course, having the browser parse and render a 17k JSON feed wouldn’t be trivial. From a pure speed standpoint, the server could do the parse / generate once and then use an HTML rendering of the feed from cache from then on. Winner: RSS / server
Searching: Using JSON on the client, site A (which is syndicating content from site B) wouldn’t have any way of searching the content, outside of retrieving / parsing / storing on the server. Also, spiders wouldn’t see the syndicated content from site B on site A, unlike the server side RSS syndication where the syndicated content would look no different to a spider than the other content on site A. Winner: RSS / server
Ubiquity: JSON ‘only’ works if the browser has JavaScript enabled, which I’m guessing the large majority of users do. But certain environments won’t, and phones, set top boxes and anything else that runs a browser but isn’t a PC may not have JavaScript, which means they won’t see the syndicated content. Server side generated content will be available across any platform that understands HTML. Winner: RSS / server
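For comparison, the server side flow from earlier (every_n_seconds() --> retrieve_feed() --> store_feed_entries()) boils down to a cache with an expiry. Here is a minimal sketch, with everything in it assumed for illustration: CachedFeed is a made-up class, the Supplier stands in for whatever actually fetches and renders the feed, and the clock is passed in so the expiry is easy to test.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical time-based cache around a feed fetcher. */
final class CachedFeed {
    private final Supplier<String> fetcher; // stands in for retrieve_feed()
    private final LongSupplier clock;       // current time in milliseconds
    private final long ttlMillis;           // how long a fetched copy stays fresh
    private String cached;
    private long fetchedAt;

    CachedFeed(Supplier<String> fetcher, LongSupplier clock, long ttlMillis) {
        this.fetcher = fetcher;
        this.clock = clock;
        this.ttlMillis = ttlMillis;
    }

    /** Returns the cached copy, refetching only when it is missing or expired. */
    synchronized String get() {
        long now = clock.getAsLong();
        if (cached == null || now - fetchedAt >= ttlMillis) {
            cached = fetcher.get();
            fetchedAt = now;
        }
        return cached;
    }
}
```

In production you would likely pass System::currentTimeMillis as the clock. This fetches lazily on request rather than on a timer, which is one of several reasonable ways to arrange it.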

So wrapping up, when should you use JSON on the client and when should you use RSS on the server? If you need to syndicate a small amount of content to non programmers who can cut and paste (or programmers who are adept at JavaScript), JSON seems like the way to go. It’s trivial to get something up and running, the browser will cache the feed you create and your users will see the new content as soon as it becomes available in your JSON feed.

If you’ve read this far, you should go on and check out the examples on developer.yahoo.com and on del.icio.us. Also, if you’re a Java developer, you should head on over to sourceforge.net to take a look at the JSON-lib, which makes it wicked easy to create JSON from lists, arrays and beans.

Content Management, Personal, Systems Administration

New design

August 7, 2006 ajohnson 2 Comments

I got really bored with the old design of this site and all the cool kids seem to be using WordPress these days so last weekend I exported all 900 or so entries from Movable Type and imported them into WordPress, installed the ScribbishWP Theme and wrote a servlet filter to map my /blog/year/month/day/entry_name.html Movable Type permalinks to the WordPress style which is /blog/year/month/day/entry-name/. Oh, and comments are back on.

Enjoy!
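The permalink mapping mentioned above comes down to a small path rewrite. The sketch below is a guess at its core rather than the actual filter: PermalinkMapper is a hypothetical name, and it assumes four-digit years, two-digit months and days, and that underscores in the Movable Type entry name become hyphens in the WordPress slug.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical rewrite of Movable Type permalinks to WordPress-style paths. */
final class PermalinkMapper {
    // /blog/year/month/day/entry_name.html
    private static final Pattern MOVABLE_TYPE =
            Pattern.compile("^/blog/(\\d{4})/(\\d{2})/(\\d{2})/([^/]+)\\.html$");

    private PermalinkMapper() {}

    /** Returns the WordPress-style path, or null if the path is not an old permalink. */
    static String toWordPress(String path) {
        Matcher m = MOVABLE_TYPE.matcher(path);
        if (!m.matches()) {
            return null;
        }
        String slug = m.group(4).replace('_', '-');
        return "/blog/" + m.group(1) + "/" + m.group(2) + "/" + m.group(3) + "/" + slug + "/";
    }
}
```

A servlet filter could call this with the request path and issue a permanent redirect whenever the result is not null, passing every other request down the chain untouched.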

Content Management, Interface Design, Open Source, WebWork

WebWork and meaningful URLs

August 1, 2006 ajohnson 2 Comments

Personal pet peeve: meaningful URLs (which tonight I found out go by many names: pretty URLs, RESTian URLs, SES URLs, hackable URLs, etc…). At work, we use WebWork extensively but up until this point we haven’t made an effort to create meaningful URLs. As with any well designed framework, it turns out that there are a couple of ways you can create meaningful URLs, with different levels of meaningfulness.

Version 2.2 of WebWork introduced the ActionMapper interface and a class called RestfulActionMapper, which gives you the ability to create URLs that might look something like this:
http://bookstore.com/books/category/java/keyword/webwork
instead of the more common:
http://bookstore.com/books.jspa?category=java&keyword=webwork
The nice thing about the RestfulActionMapper implementation is that you don’t have to write any code to parse the URL: you set up your WebWork actions with the appropriate setters and the RestfulActionMapper handles the rest. The downside is that this still isn’t really a truly hackable URL. For example, although this URL:
http://bookstore.com/books/category/java/keyword
and this URL:
http://bookstore.com/books/category
would probably work, they don’t really make sense. Why are ‘keyword’ and ‘category’ hanging around at the end? Both of the words are extra information required by the implementation that don’t add any value to the user.

The second way you can create meaningful URLs is by creating your own ActionMapper. You can get a good start by checking out the source code for the DefaultActionMapper and the RestfulActionMapper. To set properties on your action instances, you’ll want to create a HashMap, add the appropriate properties from your URL to the map and then either create and return a new ActionMapping using the action name and map or call the setParams() method on an existing mapping. The end result is that you should be able to create and use a meaningful URL that looks like this:
http://bookstore.com/books/java/webwork
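The URL-to-properties step is the part you write yourself, so here is a rough sketch of it in plain Java. It deliberately leaves out the WebWork types: BookUrlParser is a hypothetical class, and the resulting map is the kind of thing you would hand to an ActionMapping as its params. It assumes the path has the fixed shape /books/{category}/{keyword} and does no URL decoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for paths shaped like /books/{category}/{keyword}. */
final class BookUrlParser {
    private BookUrlParser() {}

    /** Maps positional path segments onto named parameters; trailing ones are optional. */
    static Map<String, String> parse(String path) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] parts = path.split("/");
        // parts[0] is empty because the path starts with "/"
        if (parts.length < 2 || !"books".equals(parts[1])) {
            return params;
        }
        if (parts.length > 2 && !parts[2].isEmpty()) {
            params.put("category", parts[2]);
        }
        if (parts.length > 3 && !parts[3].isEmpty()) {
            params.put("keyword", parts[3]);
        }
        return params;
    }
}
```

Because the names come from position rather than from the URL itself, /books/java and /books both still make sense to a reader, which is the hackability the RestfulActionMapper URLs lack.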
Also of note:

User-Centered URL Design
URL as UI
What is interesting about LibraryLink
Representational State Transfer
Cool URIs don’t change
Updated 8/1/2006: Great quote by David Gelernter via Bill Dehora: “If you have three pet dogs, give them names. If you have 10000 head of cattle, don’t bother.”

Content Management, J2EE, Open Source, XML

Creating RSS using Java

September 7, 2003 ajohnson 6 Comments

I wanted to create RSS feeds for karensrecipes.com using Java. I did my ‘research’, came to this page: Ben Hammersley.com: Java RSS libraries and then used the RSS4j library to create a servlet that serves up dynamic RSS feeds of the 10 most recently created recipes per category (samples: Breakfast, Soup, Barbeque..).

The syntax is pretty simple: you get an RssDocument and set which version you want to use (RSS 1.0, 0.9 or 0.91):

RssDocument doc = new RssDocument();
doc.setVersion(RssDocument.VERSION_10);

and then create a RssChannel object and add that to the RssDocument:

RssChannel channel = new RssChannel();
channel.setChannelTitle("Karens Recipes | Most Recent");
channel.setChannelLink("http://www.karensrecipes.com/3/Soup/default.jsp");
channel.setChannelDescription("The 10 most recently added recipes in the soup category.");
channel.setChannelUri("http://www.karensrecipes.com/rss/?categoryid=3");
doc.addChannel(channel);

Next, you’ll retrieve the items using a database, the file system, etc… and add each item as a RssChannelItem:

// connect to the datasource
// iterate over something (db? vector?...)
RssChannelItem item = new RssChannelItem();
item.setItemTitle(label);
item.setItemLink(link);
item.setItemDescription(description);
channel.addItem(item);

and then finally, using the RssGenerator class, call the generateRss() method. In this case I’m sending the output to a Servlet PrintWriter:

PrintWriter out = response.getWriter();
RssGenerator.generateRss(doc, out);

You could just as easily write it to a file:

File file = new File("/opt/data/rss.xml");
try {
  RssGenerator.generateRss(doc, file);
  System.out.println("RSS file written.");
} catch (RssGenerationException e) {
  e.printStackTrace();
}


Simple.  Easy to use.




	
	
				
Content Management

Open Source Content Management Conference

May 12, 2003 ajohnson Leave a comment

I got cleared to go to the Cambridge OSCOM today. I’m going to sign up tomorrow. Since we’re debuting our own content management product soon (we’re having a party in June btw, you’re invited! email me if you’d like to come!), I’ll be scouring feature lists and technologies that we can integrate and/or inter-operate with. Specifically, I’ll be interested in the following sessions:

Building a CMS Client With Mozilla
Collaborative Mapping on the Semantic Web
Bebop: Requirements and Design for a Web UI Component Library
Versioning Structured Content in a Content Management Application
Extending CMS with Web Services
From the information pile to social, knowledge exchanging bots

Content Management

Open Source Content Management conference blog

April 18, 2003 ajohnson Leave a comment

The Open Source Content Management conference has a blog specifically for the Cambridge event, giving those attending (or just interested) the ability to discuss each program. Great idea!

Content Management

Open Source Content Management Conference registration

April 15, 2003 ajohnson Leave a comment

Registration is now open for the Open Source Content Management Conference happening May 28-30 in Cambridge. It’s only $200 for 3 days…
	

	
	

		
	
	
		
		What’s Going On Here?
			My name is Aaron Johnson and I created this blog both for me (mostly) and sometimes you. I've been saving my delicious pinboard.in links here and blogging since 2002. During the week (and at night and some weekends and well.. most of the time), I work in engineering product management look after engineering at a software company in Portland, Oregon. When I'm not working, I'm hanging out with my amazing wife, our dinosaur Star Wars loving son three boys,   and five chickens, and giant dog in the burbs outside of Portland, Oregon.
		