- inessential.com: Weblog: NetNewsWire and other desktop apps: your RSS router
Quote: "… news comes in to NetNewsWire, and then you route stuff to wherever it should go."
(categories: rssrouting aggregation aggregators feeds netnewswire )
- Regular Expression Matching Can Be Simple And Fast
A dive into the deep end of regular expressions.
(categories: regex algorithms regexp performance )
- testing html encoding description » testing » teestingtest
testing html encoding extended test—ttest
(categories: testing )
Monthly Archives: January 2008
deliciousposter 1.02 released and fun with html entities
I fixed a bug in deliciousposter that’s probably been wreaking havoc for a long time on anyone reading this site through an aggregator. You’re probably a nerd if you’re reading this post anyway, so I’ll bore you with the details. The deliciousposter project uses the delicious java library to get a list of posts from del.icio.us, creates a blog post using Velocity (which feels really nasty now that I’ve been using FreeMarker for the last year) and then uses the MetaWeblog API to publish the resulting blog post to a blog. So the data gets pushed to del.icio.us originally:
Aaron’s Stuff » DeliciousPoster
returned from del.icio.us in XML
Aaron’s Stuff » DeliciousPoster
to the del.icio.us java library, which decodes the XML so that you again have this:
Aaron’s Stuff » DeliciousPoster
but then you post that using XML-RPC and you end up with something like this:
Aaron?s Stuff ? DeliciousPoster
Why? Because you need to escape any special characters as HTML entities before sending them along via XML-RPC. I used Commons-Lang, which has a utility for escaping HTML entities:
StringEscapeUtils.escapeHtml(yourstring)
Think that’s nerdy? Wait until you have to do the same thing with titles in RSS.
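For anyone who wants to see the idea without pulling in Commons-Lang, here is a minimal sketch of the same kind of escaping. It is not the deliciousposter code and not the library's implementation: `EntityEscaper` and `escapeNonAscii` are hypothetical names, and the sketch emits numeric character references for everything it escapes, which may differ from the entities the library chooses.

```java
// Hypothetical helper for illustration; not part of deliciousposter or Commons-Lang.
final class EntityEscaper {

    // Replaces the markup characters & < > " and every code point above
    // 7-bit ASCII with a numeric character reference such as &#187;,
    // so the text contains only plain ASCII when it goes over the wire.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '&' || cp == '<' || cp == '>' || cp == '"' || cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The title from the post, with the curly apostrophe and the » character.
        System.out.println(escapeNonAscii("Aaron\u2019s Stuff \u00BB DeliciousPoster"));
    }
}
```

Run on the title above, the sketch turns the curly apostrophe into `&#8217;` and the » into `&#187;`, so nothing outside ASCII is left for the transport to mangle.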
The Data Life Cycle of a Blog Post
Cool Flash infographic in the latest issue of Wired that shows what happens to your blog post after you click the ‘publish’ button (I’ll save you the hassle of actually viewing it: after you click the ‘publish’ button, exciting things like ping servers, data miners, search engines, text scrapers, aggregators, social bookmarking sites, online media, spam blogs and finally readers get involved). Since it’s Wired and not XML Journal, they stopped at the infographic, but man, it sure would be cool to see all the ways that data gets massaged, reformatted, sliced and diced and transmitted, because there’s a lot that happens in that process. Just for the fun of it, I’m gonna walk through the scenarios I know about.
First, you click the publish button. But that might be a publish button on a desktop blogging client like Windows Live Writer or it might be the publish button in Microsoft Word or it might be a real live HTML button that says ‘publish’. So before you even get to the publish part, we’ve got the possibility of the MetaWeblog API (which is XML-RPC, effectively XML over HTTP), Atom Publishing Protocol (again effectively XML over HTTP) or a plain HTTP (or HTTPS!) POST.
OK, so now your blog post has been published on your blog. What next? Probably unbeknownst to you, your blog post has been automatically submitted to one or more ping servers using XML-RPC (XML over HTTP). Because search engines got into the blogging business, you can even ping Google and Yahoo (curiously not Microsoft, why?). If you don’t want to hassle with a bunch of different sites, you can always use pingomatic.com, which will ping (as of 1/27/2008) twenty-one different ping servers for you.
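To make "XML over HTTP" a little more concrete, here is a rough sketch of what one of those pings looks like on the wire. It assumes the common `weblogUpdates.ping` method, which takes the blog's name and URL as two string parameters. `PingRequest`, `build` and `xmlText` are hypothetical names, the blog name and URL are placeholders, and the sketch only builds the request body; actually POSTing it to a ping server is left out.

```java
// Hypothetical helper for illustration; it builds the request body only.
final class PingRequest {

    // Escapes the characters that are unsafe inside XML element text.
    static String xmlText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds an XML-RPC methodCall for weblogUpdates.ping with two
    // string parameters: the blog's name and the blog's URL.
    static String build(String blogName, String blogUrl) {
        return "<?xml version=\"1.0\"?>\n"
             + "<methodCall>\n"
             + "  <methodName>weblogUpdates.ping</methodName>\n"
             + "  <params>\n"
             + "    <param><value><string>" + xmlText(blogName) + "</string></value></param>\n"
             + "    <param><value><string>" + xmlText(blogUrl) + "</string></value></param>\n"
             + "  </params>\n"
             + "</methodCall>\n";
    }

    public static void main(String[] args) {
        // Placeholder values; a real client would send this body in an HTTP POST.
        System.out.println(build("Example Blog", "http://blog.example.com/"));
    }
}
```

A blogging tool that pings several servers would send roughly this same body to each one, which is the chore a service like pingomatic.com takes off your hands.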
Oh, I forgot to mention. If you’re using TypePad, LiveJournal or Vox, the information about your blog post isn’t sent to these ping servers using XML-RPC; it’s streamed as XML in real time over HTTP to many of the same parties.
Great, your blog post has now been sent to everyone, you’re good, right? Nope. Now comes the onslaught of spiders and bots, awoken by the ping you sent, who will request your feed (RSS / Atom over HTTP) and your blog post (HTML over HTTP) and your first born child again and again and again. And now that your blog post is published, and assuming that you’ve published something of value, you’ll see real people stop by and comment on your blog post and maybe bookmark it in a site like del.icio.us or ma.gnolia.com, snipping a quote from your blog post and then publishing that snippet to their own blogs or to their bug tracker. Now your blog post has replicated: it lives in small parts all over the web, each part getting published and spidered and syndicated and ripped again and again and again. It’s beautiful, isn’t it?
Links: 1-25-2008
- Apache Mahout – Overview
Mahout’s goal is to build scalable .. machine learning libraries. Initially, we are interested in building out the ten machine learning libraries detailed in http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf using Hadoop.
(categories: datamining apache hadoop java mapreduce opensource mahout machinelearning algorithms )
Links: 1-21-2008
- Persai Research
Feed corpus: 124,460 unique RSS/Atom feeds
(categories: persai feeds rss research data bigdata )
- The Rest of the Rest of Us
Quote: “Poverty is the most restricting force in American life. It’s become somewhat unfashionable to point this out, but I don’t see how it could be otherwise. Given the choice between being born poor and being born female, which would you choose?”
(categories: poverty consumerism oreily socialjustice education digital-divide )
- What We Have vs. What We Want
Quote: “Some luck lies in not getting what you thought you wanted but getting what you have, which you may be smart enough to see is what you would have wanted if you didn’t have it.” (from Garrison Keillor)
(categories: satisfaction happiness consumerism materialism )
Links: 1-20-2008
- Michael Gartenberg – The Real Importance of iPhone Update 1.1.3
Quote: “… the 1.1.3 update .. wasn’t important for features per se, it’s important because it shows Apple can give the iPhone extended life by delivering upgrades through the iTunes utility that change and enhance the iPhone and the way it works.”
(categories: iphone upgrades saas customization )
- LingPipe Home
LingPipe is a suite of Java libraries for the linguistic analysis of human language.
(categories: datamining analysis java language )
- Web Data Mining, book by Bing Liu
Exploring Hyperlinks, Contents and Usage Data
(categories: data datamining research book books )
Links: 1-19-2008
- How Does FeedDemon Calculate Attention?
Nick Bradbury posts the algorithm that FeedDemon uses for determining a feed’s attention rank.
(categories: feedemon attention attentionstream algo algorithms feeds rss syndication )
Links: 1-18-2008
- (theinfo)
“… a site for large data sets and the people who love them”
(categories: bigdata data megadata datasets visualization analytics analysis )
- Themes – Google Code
Description: The iGoogle Themes API allows you to personalize iGoogle by modifying the page’s design. Your theme can modify the header and footer images, text colors, link colors, gadget frames, and more
(categories: api themes google customization )
Links: 1-15-2008
- Tristan O’Tierney » FlickrBooth
Cool software for PhotoBooth that enables you to automatically post your photobooth pictures to Flickr.
(categories: photobooth flickrbooth flickr camera free mac )
Links: 1-11-2008
- Manageability – How To Build Damn Good Software
Graham Glass (Mind Electric fame) and John Wiegand and Erich Gamma (both of Eclipse fame) share a few morsels of wisdom on how to build good software.
(categories: management software development java process methodology )
- SVNKit :: Subversion for Java
SVNKit is a pure Java Subversion client library.
(categories: subversion java svn )
- Re: Best Practices for Distributing Lucene Indexing and Searching
Be interesting to see if you could combine this with Hadoop to do distributed indexing.
(categories: lucene hadoop indexing search technorati distributed java clustering )
- What’s That Noise?! [Ian Kallen’s Weblog]
Some interesting stuff about MySQL, the ReplicationConnection class and scaling MySQL query loads.
(categories: mysql replication scalability performance )
- A Journey In Social Media: Clearspace Alternatives
Quote: “… let me make this simple. To the best of our knowledge, there is no viable alternative in the marketplace to Clearspace from Jive Software.”
(categories: clearspace clearspacerocks chuckhollis emc sharepoint )
- one small voice » Blog Archive » XMPP in TiVo
TiVo now uses XMPP to send messages from the central office to each machine instead of polling.
(categories: xmpp polling messaging )